TCSVT2025

Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China; School of Electronic Engineering, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China; School of Artificial Intelligence, Key Laboratory of Intelligent Perception and Image Understanding, Ministry of Education, Xidian University, Xi’an, China; Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA

Abstract:
Existing 3D mask learning methods encounter performance bottlenecks under limited data, and our objective is to overcome this limitation. In this paper, we introduce a triple point masking scheme, named TPM, which serves as a scalable plug-and-play framework for MAE pre-training to achieve multi-mask learning for 3D point clouds. Specifically, we augment the baseline methods with two additional mask choices (i.e., medium mask and low mask) as our core insight is that the recovery process of an object can manifest in diverse ways. Previous high-masking schemes focus on capturing the global representation information but lack fine-grained recovery capabilities, so that the generated pre-training weights tend to play a limited role in the fine-tuning process. With the support of the proposed TPM, current methods can exhibit more flexible and accurate completion capabilities, enabling the potential autoencoder in the pre-training stage to consider multiple representations of a single 3D point cloud object. In addition, during the fine-tuning stage, an SVM-guided weight selection module is proposed to fill the encoder parameters for downstream networks with the optimal weight, maximizing linear accuracy and facilitating the acquisition of intricate representations for new objects. Extensive experimental results and theoretical analysis show that five baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks. Our code and models are available at https://github.com/liujia99/TPM.
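The multi-mask idea above can be illustrated with a minimal sketch that draws three independent random masks over the patches of one point cloud. The ratio values, the function names, and the patch count are assumptions made for illustration; the abstract names the three mask levels but does not state their ratios, and this is not the authors' implementation.

```python
import random

# Hypothetical ratios: the abstract names three mask levels but gives no values.
MASK_RATIOS = {"high": 0.6, "medium": 0.4, "low": 0.2}

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def triple_masks(num_patches, seed=0):
    """Draw one independent mask per level for the same point cloud."""
    rng = random.Random(seed)
    return {name: random_patch_mask(num_patches, ratio, rng)
            for name, ratio in MASK_RATIOS.items()}

masks = triple_masks(num_patches=64)
counts = {name: sum(mask) for name, mask in masks.items()}
```

Each mask would feed a separate reconstruction pass over the same object, so the autoencoder sees both heavily and lightly occluded versions of it.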

Abstract:
Self-knowledge distillation has emerged as a powerful method, notably boosting the prediction accuracy of deep neural networks while being resource-efficient, setting it apart from traditional teacher-student knowledge distillation approaches. However, in safety-critical applications, high accuracy alone is not adequate; conveying uncertainty effectively holds equal importance. Regrettably, existing self-knowledge distillation methods have not met the need to improve both prediction accuracy and uncertainty quantification simultaneously. In response to this gap, we present an uncertainty-aware self-knowledge distillation method named UASKD. UASKD introduces an uncertainty-aware contrastive loss and a prediction synthesis technique within the self-knowledge distillation process, aiming to fully harness the potential of self-knowledge distillation for improving both prediction accuracy and uncertainty quantification. Extensive assessments illustrate that UASKD consistently surpasses other self-knowledge distillation techniques and numerous uncertainty calibration methods in both prediction accuracy and uncertainty quantification metrics across various classification and object detection tasks, highlighting its efficacy and adaptability.

Abstract:
Domain Generalization (DG) is a growing field in machine learning that aims to train the model across multiple source domains, thereby enabling effective generalization to new, unseen target domains. Recent studies suggest that data augmentation, which enhances the diversity of the source domain, might be a promising solution to address this task. Current data augmentation methods use random fusion coefficients or local regional fusion, which cannot adaptively design the weights based on data, or preserve the integrity of original semantics. Inspired by the pre-trained model CLIP, which contains extensive multimodal knowledge, we propose ClipMix to address these limitations. Firstly, we use the CLIP model as the external knowledge to adaptively evaluate the alignment between images and their labels, using this alignment to assess the complexity of learning each image and guide adaptive augmentation. Secondly, we implement a label shift mechanism to dynamically assign soft labels to fused images, helping the model focus on hard-to-learn patterns and also gather domain-agnostic representation. Furthermore, we enhance the diversity of fused images at both the pixel and feature levels. Experimental results across sixteen domains from four databases verify the effectiveness of our method.
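As a rough illustration of pixel-level fusion with soft labels, the sketch below mixes two flattened images with a coefficient `lam` and builds the matching soft label. It is generic mixup-style code, not ClipMix itself: the abstract does not say how the CLIP alignment score is turned into a fusion weight, so `lam` is simply passed in, and all names here are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def mix_pair(image_a, image_b, label_a, label_b, lam, num_classes):
    """Convex combination of two images plus the corresponding soft label."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    target_a = one_hot(label_a, num_classes)
    target_b = one_hot(label_b, num_classes)
    soft = [lam * p + (1.0 - lam) * q for p, q in zip(target_a, target_b)]
    return mixed, soft

# Toy 4-pixel "images"; lam would come from an adaptive rule in practice.
mixed, soft = mix_pair([0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 0.5],
                       label_a=0, label_b=2, lam=0.75, num_classes=3)
```

In an adaptive scheme, a data-dependent rule would replace the fixed `lam`, which is the part that the method attributes to CLIP guidance.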

Abstract:
Multi-view clustering (MVC) aims to extract consensus information from multi-source data and has developed rapidly. Although generative model-based methods perform well by leveraging predefined priors, they often overlook inter-instance relationships, which are essential for high-quality clustering. To address this issue, we propose Graph Variational Multi-view Clustering (GVMVC), which integrates graph information into the generative process. Specifically, we treat the original multi-view features and the graph information from each view as observed data to guide the learning of latent representations. The key principles of our approach are: 1) enhancing discriminative feature learning through graph integration; and 2) ensuring consistent multi-view learning via graph-based constraints. Extensive experiments show that GVMVC outperforms state-of-the-art methods across various datasets and metrics. Code is available at https://github.com/WenB777/GVMVC.git

Affiliations: School of Applied Science, Beijing Information Science and Technology University, Beijing, China; School of Emergency Management, Institute of Disaster Prevention, Langfang, China; Command Center of Natural Resources Comprehensive Survey, China Geological Survey, Beijing, China; Department of Earth and Space Sciences, Southern University of Science and Technology, Shenzhen, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, China; Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, China

Abstract:
The gradient-based meta-learning algorithm learns meta-parameters from a pool of tasks. Starting from the learned meta-parameters, it can achieve better results through fast fine-tuning with only a few gradient descent updates. The two-layer meta-learning approach that shares initialization parameters has achieved good results in few-shot learning. However, when multiple similar tasks are trained in the inner layer, the difficulty and benefits of the tasks have been consistently overlooked, resulting in conflicts between tasks and ultimately pushing the model to a compromised, unexpected position. Therefore, this paper proposes a task-adaptive selection meta-learning algorithm called TSML. Specifically, we construct a task selection trainer to assess the difficulty of tasks and calculate their future benefits. Based on difficulty and benefit, we design better training strategies for each task, which alters the current compromise in multi-task settings and balances the impact of tasks on the meta-learning parameters. Additionally, the outer meta-parameter update of traditional meta-learning is adjusted, enabling the meta-parameters to attain a better position. By doing so, we can rapidly improve the generalization and convergence of the meta-learning parameters on unknown tasks. Experimental results indicate a 2.1% improvement over the base model in the 4-conv setting, with a more pronounced effect as the network becomes more complex, reaching a 4.1% improvement with ResNet-12.
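The two-layer structure described above (an inner per-task update and an outer meta-parameter update) can be sketched on toy one-dimensional tasks. This is a first-order approximation with uniform task weights, written only to show the loop structure; it does not include the task selection trainer or the adjusted outer update of TSML, and the quadratic tasks, learning rates, and names are all assumptions.

```python
def task_grad(theta, center):
    """Gradient of the toy task loss (theta - center) ** 2."""
    return 2.0 * (theta - center)

def meta_train(centers, inner_lr=0.1, outer_lr=0.1, steps=200, theta=0.0):
    """First-order meta-learning over toy quadratic tasks."""
    for _ in range(steps):
        outer_grad = 0.0
        for center in centers:
            # Inner layer: one gradient step adapts theta to this task.
            adapted = theta - inner_lr * task_grad(theta, center)
            # Outer layer (first-order): use the gradient at the adapted point.
            outer_grad += task_grad(adapted, center)
        # Uniform averaging; a task-adaptive method would reweight tasks here.
        theta -= outer_lr * outer_grad / len(centers)
    return theta

theta = meta_train([-1.0, 0.5, 2.0])
```

With uniform weights the shared initialization settles at the mean of the task optima (0.5 here), which is exactly the kind of compromise between tasks that a task-adaptive weighting aims to change.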

Abstract:
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via a box-to-polygon strategy or decompose a contour into pieces that are fitted gradually, so they suffer from either coarse contours or complex pipelines. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, we observe that the two long edges of a text instance can be regarded as smooth curves. This allows us to build contours from continuous, smooth edges that cover text regions tightly instead of fitting them piecewise, which helps avoid the two limitations of current models. Inspired by this observation, EdgeText formulates text representation as an edge approximation problem solved via parameterized curve fitting functions. In the inference stage, our model first locates text centers and then creates curve functions that approximate the text edges based on these points. Meanwhile, truncation points are determined based on the location features. Finally, curve segments are extracted from the curve functions using the pixel coordinates of the truncation points to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed to encourage our model to attend to edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) that forces the model to focus on the curve distribution and avoid being disturbed by text scales. Ablation experiments demonstrate that EdgeText can fit scene texts compactly and naturally. Comparisons show that EdgeText is superior to existing methods on multiple public datasets. Code is available at https://github.com/omtcyang/EdgeTD.

Abstract:
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.

Abstract:
Multi-view clustering based on deep auto-encoder networks has garnered increasing attention and made significant progress in recent years. However, we argue that most existing methods inadequately explore the discriminability while learning clustering assignments, resulting in models struggling to accurately cluster data, particularly those with ambiguous semantics. To address this problem, we propose a novel framework termed deep discriminative multi-view clustering (DDMvC). This framework is designed to further increase the inter-cluster distances by learning a discriminative projection dictionary with global prior information. To begin with, we enhance the reliability of the dictionary atoms by initializing them with class-specific prototypes derived from concatenated global features across multiple views. Subsequently, we iteratively refine the atoms to guarantee their independence from any specific cluster. Simultaneously, we incorporate contrastive learning for the cluster assignments projected by these atoms, striving for inter-view consistent clustering results. Experimental results on benchmark multi-view datasets demonstrate that our framework achieves the state-of-the-art clustering performance.

Abstract:
Image quality assessment (IQA) and its computational models play a vital role in modern computer vision applications. Research has traditionally focused on signal distortions arising during image compression and transmission, and their impact on perceived image quality. However, little attention has been paid to image manipulation that alters an image using various filters. With the prevalence of image manipulation in real-life scenarios, it is critical to understand how humans perceive filter-altered images and to develop reliable IQA models capable of automatically assessing the quality of filtered images. In this paper, we build a new IQA database for filter-altered images, comprising 360 images manipulated by various filters. To ensure the subjective IQA faithfully reflects human visual perception, we conduct a fully-controlled psychovisual experiment. Building upon the ground truth, we propose an innovative deep learning-based no-reference IQA (NR-IQA) model named IMQA that can accurately predict the perceived quality of filter-altered images. The model constructs an image filtering-aware module to learn discriminative features for filter-altered images and fuses these features with the representations generated by an image quality-aware module. Experimental results demonstrate the superior performance of the proposed IMQA model.

Abstract:
Neuromorphic imaging is an emerging technique that imitates the human retina to sense variations in dynamic scenes. It responds to pixel-level brightness changes with asynchronous event streams and offers microsecond temporal precision over a high dynamic range, yielding blur-free recordings under extreme illumination. Nevertheless, this modality falls short in spatial resolution, which leads to a low level of visual richness and clarity. Pursuing hardware upgrades is expensive and may compromise performance because of the heavier computational burden. Another option is to harness offline, plug-and-play super-resolution solutions. However, existing ones, which demand substantial sample volumes for lengthy training on massive computing resources, are largely restricted by real data availability owing to the current imperfect high-resolution devices, as well as the randomness and variability of motion. To tackle these challenges, we introduce the first self-supervised neuromorphic super-resolution prototype. It adapts itself to each input source from any low-resolution camera to estimate an optimal high-resolution counterpart at any scale, without the need for side knowledge or prior training. Evaluated on downstream tasks, this simple yet effective method obtains competitive results against state-of-the-art methods, significantly improving flexibility without sacrificing accuracy. It also delivers enhancements for inferior natural images and optical micrographs acquired under non-ideal imaging conditions, breaking through limitations that are challenging to overcome with frame-based techniques. In the current landscape where the use of high-resolution cameras for event-based sensing remains an open debate, our solution is a cost-efficient and practical alternative, paving the way for more intelligent imaging systems.

Abstract:
Pre-trained large-scale vision-language models (VLMs) have acquired a profound understanding of general visual concepts. Recent advancements in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs in limited-data scenarios, introducing only a few parameters to harness task-specific insights from VLMs. Despite significant progress, current leading ETL methods tend to overfit the narrow distributions of base classes seen during training and encounter two primary challenges: (i) utilizing only uni-modal information to model task-specific knowledge; and (ii) using costly and time-consuming methods to supplement knowledge. To address these issues, we propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of the base examples and augment limited data in an effective way. Specifically, we alleviate overfitting on base classes from two aspects. First, each input image acquires knowledge from both textual and visual prototypes and then generates sample-conditional text tokens. Second, we extract utilizable knowledge from unlabeled data to further refine the prototypes. These two strategies mitigate biases that stem from base classes, yielding a more effective classifier. Extensive experiments on 11 benchmark datasets show that our CPR achieves state-of-the-art performance on few-shot classification, base-to-new generalization, and cross-dataset generalization tasks. Our code is available at https://github.com/chenhaoxing/CPR

Abstract:
The dispersed structure distributions (discreteness) and variant scattering characteristics (variability) of SAR airplane targets pose special challenges for object detection and recognition. Current deep learning-based detectors have difficulty distinguishing fine-grained SAR airplanes against complex backgrounds. To address this, we propose a novel physics-guided detector (PGD) learning paradigm for SAR airplanes that comprehensively investigates their discreteness and variability to improve detection performance. It is a general learning paradigm that can be extended to different existing deep learning-based detectors with “backbone-neck-head” architectures. The main contributions of PGD include physics-guided self-supervised learning, feature enhancement, and instance perception, denoted as PGSSL, PGFE, and PGIP, respectively. PGSSL aims to construct a self-supervised learning task based on a wide range of SAR airplane targets that encodes the prior knowledge of various discrete structure distributions into the embedded space. Then, PGFE enhances the multi-scale feature representation of a detector, guided by the physics-aware information learned from PGSSL. PGIP is constructed at the detection head to learn the refined and dominant scattering point of each SAR airplane instance, thus alleviating the interference from the complex background. We propose two implementations, denoted as PGD and PGD-Lite, and apply them to various existing detectors with different backbones and detection heads. The experiments demonstrate the flexibility and effectiveness of the proposed PGD, which improves existing detectors on the SAR airplane detection with fine-grained classification task (by at most 3.1% mAP) and achieves state-of-the-art performance (90.7% mAP) on the SAR-AIRcraft-1.0 dataset. The project is open-source at https://github.com/XAI4SAR/PGD

Abstract:
Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from teacher networks without access to the real dataset. However, generator-based DFKD methods often suffer from insufficient diversity or low confidence in synthetic images, negatively impacting student network performance. This paper introduces DFMC, a generative feature-driven framework to mitigate the inherent limitations of DFKD. We propose exploiting semantic description between generative feature domains to guide augmentation strategies, avoiding random abstract inputs caused by inconsistent semantic quality. Then, by applying noise to the generative features, we produce contrastive learning pairs indirectly, limiting the sampling range of the feature domain to encourage the student network to learn domain-invariant features. Finally, we guide the student network to deeply mimic the teacher’s layer-wise implicit classification behavior for the augmented synthetic images. Extensive experiments across various datasets and downstream tasks demonstrate the effectiveness of DFMC, achieving significant improvements while preventing student networks from overfitting to semantically ambiguous images.

Abstract:
One challenge in stereo-LiDAR fusion arises from the sparsity and non-uniform distribution of LiDAR data. Existing methods expand sparse LiDAR data to produce semi-dense hints as guidance for fusion. However, the absence of depth cues beyond the expanded areas may still limit performance. To address this challenge, we propose a novel sparse-to-dense hint guided stereo-LiDAR fusion method. The key idea is to use a dense hint map generated by a lightweight network as guidance, with sparse LiDAR points and a monocular image as inputs. The dense hints are then employed to construct and explicitly regularize a multi-modal cost volume via integrating the geometric cues from the hints and the visual information from the images to produce better stereo prediction. The construction and aggregation of cost volume follow a well-designed coarse-to-fine strategy along with a pixel-wise search range adjustment module, facilitating fast computation while preserving fine details. Finally, a confidence-based fusion module is performed to adaptively produce the ultimate prediction based on the monocular and stereo estimations. The experimental results show that our method significantly outperforms existing methods with high inference efficiency across multiple benchmark datasets. To contribute to the community, we will release the code at: https://github.com/LiAngLA66/DG-Fusion

Abstract:
In recent years, due to the rich parameters contained in deep neural network (DNN) models, researchers have proposed DNN model steganography, using DNN models as carriers. In this paper, we propose a constructive DNN model steganography method based on parameter initialization. Unlike existing DNN model steganography methods that embed secret information into a pre-trained DNN model, we generate parameters containing secret information through parameter initialization for a DNN model structure, and the embedded secret information can still be extracted after model training. Specifically, we first generate the model parameters needed for the DNN model structure through a secret information-driven encoder, and then we jointly train the encoder and decoder to ensure the correct extraction of secret information. Additionally, we introduce a noise layer to simulate the model training process to guarantee the robustness of our method. Experimental results demonstrate that our method not only achieves high hiding capacity but also exhibits satisfactory stealthiness and robustness. Furthermore, our method is generalizable, which can be applied to various network structures, such as multilayer perceptron (MLP), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers.

Abstract:
Multimodal semantic segmentation is developing rapidly, but the modality of RGB-Polarization remains underexplored. To delve into this problem, we construct a UPLight RGB-P segmentation benchmark with 12 typical underwater semantic classes. In this work, we design the ShareCMP, an RGB-P semantic segmentation framework with a shared dual-branch architecture (ShareCMP Encoder), which reduces the parameters and memory space by about 33.8% compared to previous dual-branch models. It encompasses a Polarization Generate Attention (PGA) module designed to generate polarization modal images with richer polarization properties for the encoder. In addition, we introduce the Class Polarization-Aware Loss (CPALoss) with Class Polarization-Aware Auxiliary Head (CPAAHead) to improve the learning and understanding of the encoder for polarization modal information and to optimize the PGA module. With extensive experiments on a total of three RGB-P benchmarks, our ShareCMP achieves the best performance in mIoU with fewer parameters on the UPLight (92.45 (+0.32)%), ZJU (92.7 (+0.1)%), and MCubeS (50.99 (+1.51)%) datasets. And our ShareCMP (w/o PGA) achieves competitive or even higher performance on other RGB-X datasets compared to the corresponding state-of-the-art RGB-X methods. The code and datasets are available at https://github.com/LEFTeyex/ShareCMP

Abstract:
Video captioning through deep learning presents a multifaceted challenge that encompasses the extraction of complex spatio-temporal visual features and the synthesis of meaningful natural language descriptions. Most of the existing deep learning models can be broadly grouped as either convolution-based or transformer-based encoder-decoder networks, with video captions generated from features encoded at the pixel level for the former, and from features encoded at grid, frame, or video levels depending on encoder complexity for the latter. This paper advocates frame-level features as a more balanced and compact representation for fast caption generation, and introduces the Tracking-guided Information Augmentation for Captioning (Track4Cap) model, which integrates tracking-guided information augmentation to enhance frame-level features without relying on complex architectures or additional data modalities. Specifically, Track4Cap employs the Frame-by-Frame Multi-object Tracking module (FMoT) to identify the most relevant objects in the input video and the Object Relation Encoder (ORE) to model inter-object relationships as supplementary high-level cues for caption generation. By avoiding time-consuming end-to-end training and leveraging compact representations, Track4Cap achieves computational efficiency while improving captioning performance. Extensive experiments on two commonly used benchmark datasets demonstrate that Track4Cap not only achieves faster inference times but also outperforms state-of-the-art convolution-based and transformer-based video captioning models. The implementation of our method is publicly available at https://github.com/ccc000-png/Tracker4Cap.

Abstract:
Event cameras, which are highly sensitive to light intensity changes, often generate substantial noise during imaging. Existing denoising methods either lack the speed for real-time processing or struggle with dynamic scenes, mistakenly discarding valid events. To address these issues, we propose a novel dual-stage polarity-focused denoising (PFD) method that leverages the consistency of polarity and its changes within local pixel areas. Whether due to camera motion or dynamic scene changes, the polarity and its changes in triggered events are highly correlated with these movements, allowing for effective noise handling. We introduce two versions: PFD-A, which excels at reducing background activity (BA) noise, and PFD-B, which is designed to address both BA and flicker noise. Both versions are lightweight and computationally efficient. The experimental results show that PFD outperforms benchmark methods in terms of the SNR and ESR metrics, achieving state-of-the-art performance across various datasets. Additionally, we propose an FPGA implementation of PFD processes that handles each event in just 7 clock cycles, ensuring real-time performance. The codes are available at https://github.com/shicy17/PFD.

Abstract:
Few-shot object detection (FSOD) focuses on detecting objects of novel classes with only a small number of annotated samples. Due to the limited number of new class samples and the presence of intra-class variance, current FSOD methods struggle to acquire sufficient discriminative information to represent the corresponding class, which restricts their performance. To address this issue, we propose a Structure-Guided Few-shot object detection (SGFNet) method that utilizes the structural information of targets to provide richer discriminative information. Specifically, we first design a Multi-Frequency Structural Feature (MFSF) module, where the highly discriminative structural information of objects in images is extracted and used to make the target features more discriminative. Based on the MFSF, we then propose a Saliency Information Enhancement (SIE) module that utilizes saliency information to enhance the object-related structural features while suppressing background interference. In addition, we present a novel Soft Cosine Classifier (SCC) based on soft cosine similarity to extract consistent discriminative information between the support and query features for distinguishing targets. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our method significantly outperforms a strong baseline (by up to 13.8%) and previous state-of-the-art methods (by 4.8% on average).
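For readers unfamiliar with soft cosine similarity, the measure underlying the SCC, a minimal sketch of the standard definition follows: the usual dot products are replaced by bilinear forms under a feature-similarity matrix. How SGFNet builds that matrix or applies the score to support and query features is not stated in the abstract, so the matrices and vectors below are purely illustrative.

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine similarity of a and b under feature-similarity matrix sim.

    sim[i][j] scores how related feature i is to feature j; it is assumed
    symmetric and positive semi-definite so the square roots are defined.
    """
    def bilinear(u, v):
        return sum(sim[i][j] * u[i] * v[j]
                   for i in range(len(u)) for j in range(len(v)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    if denom == 0.0:
        raise ValueError("soft cosine is undefined for zero-norm inputs")
    return bilinear(a, b) / denom

identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to ordinary cosine similarity
related = [[1.0, 0.5], [0.5, 1.0]]    # the two features are partly related

plain = soft_cosine([1.0, 0.0], [0.0, 1.0], identity)
soft = soft_cosine([1.0, 0.0], [0.0, 1.0], related)
```

Two vectors that are orthogonal under plain cosine similarity still receive a non-zero score once the matrix says their active features are related, which is the property a classifier can exploit when support and query features do not align dimension by dimension.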

Abstract:
Multi-modal object tracking has received increasing attention, given the limited representation ability of a single RGB modality in certain challenging scenarios. Recent prompt tuning techniques enable multimodal tracking to effectively inherit knowledge from foundation models trained with a large amount of RGB tracking data and achieve parameter-efficient training. However, few works focus on efficient inference for multimodal tracking that handles multiple RGB-X (RGB-Thermal, RGB-Depth, RGB-Event, etc.) tracking tasks simultaneously, especially on resource-limited devices such as CPUs. In this work, we propose an efficient multimodal tracker named EMTrack. EMTrack follows a concise and unified multimodal tracking framework with simple knowledge distillation. The RGB modality and the auxiliary modality are added after the patch-embedding layer for fusion, reducing the computational complexity of multimodal tracking compared with that of single modality. Before the fusion operation, we introduce a modal-specific spatial modulation module to realize adaptive spatial adjustment of different modality features. Multiple modal-specific experts are adopted to capture specific information for different RGB-X tracking tasks, which assists in handling such tasks in a unified model with joint training. EMTrack achieves competitive performance on various RGB-X tracking benchmarks while reaching a good balance of performance and speed on different platforms. In particular, on an Intel Core i9-10850K CPU, EMTrack achieves a real-time speed of 29.1 fps with only 2.0G MACs of computation.

Abstract:
Multi-view clustering (MvC) utilizes information from multiple views to uncover the underlying structures of data. Despite significant advancements in MvC, mitigating the impact of missing samples in specific views on the integration of knowledge from different views remains a critical challenge. This paper proposes a novel Mask-informed Deep Contrastive Incomplete Multi-view Clustering (Mask-IMvC) method, which elegantly identifies a view-common representation for clustering. Specifically, we introduce a mask-informed fusion network that aggregates incomplete multi-view information while considering the observation status of samples across various views as a mask, thereby reducing the adverse effects of missing values. Additionally, we design a prior knowledge-assisted contrastive learning loss that boosts the representation capability of the aggregated view-common representation by injecting neighborhood information of samples from different views. Finally, extensive experiments are conducted to demonstrate the superiority of the proposed Mask-IMvC method over state-of-the-art approaches across multiple MvC datasets, both in complete and incomplete scenarios. The demo code for our work will be publicly available at https://github.com/guanyuezhen/Mask-IMvC

Abstract:
Post-training quantization (PTQ) is an effective solution for deploying deep neural networks on edge devices with limited resources. PTQ is especially attractive because it does not require access to the entire original training dataset, on the premise that a much smaller calibration dataset can be used instead. However, many existing PTQ methods still require a sufficiently large calibration dataset (e.g., more than 1000 images) to achieve satisfactory model accuracy. In this paper, we present a novel post-training quantization method that estimates quantization parameters using a Bayesian Maximum A Posteriori (MAP) estimator. By modeling the uncertainty of quantization operations, we formulate neural network quantization as a Bayesian inference problem. In our method, we first employ probabilistic programming techniques to optimize quantization parameters by maximizing the posterior of quantization step sizes. In addition, we introduce a Minimum Description Length (MDL) prior that favors low quantization bit widths, together with a validation procedure, which enhances PTQ performance when learning from small calibration datasets. Comprehensive evaluations demonstrate that the proposed method can improve PTQ performance using a minimal calibration dataset of just 64 images, and achieve nearly state-of-the-art PTQ performance. Furthermore, the proposed method shows strong generalization ability when calibrated on different data sources and tested across diverse data.
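To make the notion of a quantization step size concrete, the sketch below implements a symmetric uniform quantizer and picks a step size from a small calibration set. The selection rule is a plain grid search on mean squared error, a much simpler stand-in for the MAP estimation described in the abstract; all names (`quantize`, `pick_step`) and the calibration values are hypothetical.

```python
import math
from typing import List, Sequence


def quantize(values: Sequence[float], step: float, bits: int) -> List[float]:
    """Symmetric uniform quantization: snap to multiples of `step`, then clip."""
    q_max = 2 ** (bits - 1) - 1
    q_min = -(2 ** (bits - 1))
    out = []
    for v in values:
        level = math.floor(v / step + 0.5)      # round half up
        level = max(q_min, min(q_max, level))   # clip to the signed integer range
        out.append(level * step)
    return out


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pick_step(values: Sequence[float], candidates: Sequence[float], bits: int) -> float:
    """Return the candidate step with the lowest reconstruction error.

    Assumption: grid search on MSE, not the posterior maximization of the paper.
    """
    return min(candidates, key=lambda s: mse(values, quantize(values, s, bits)))


# Hypothetical calibration values standing in for a small calibration set.
calibration = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_step = pick_step(calibration, [0.1, 0.25, 0.5, 1.0], bits=3)
```

A step that is too small clips large values, while a step that is too large rounds small values away; the step size is the parameter that trades these two errors against each other.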

Abstract:
Multi-View Stereo (MVS) reconstructs detailed 3D structures from multi-view images by establishing spatial correspondences. While learning-based methods have significantly advanced the MVS task, challenges such as ambiguous matching caused by textureless surfaces and lighting variations persist. To address these issues, we propose GAP-MVSNet, a framework that leverages surface normals from a monocular normal foundation model as priors to enhance the geometric awareness of reconstruction targets. In this work, surface normal priors are seamlessly integrated into the MVS pipeline to improve depth prediction robustness and accuracy. Specifically, we introduce a structure-aware feature pyramid network that incorporates surface normal information and utilizes uncertainty-aware feature resampling to extract robust image features. Additionally, we present the spatial geometry enhanced regularization that combines sampled depth hypotheses with surface normals to generate a spatial geometric prior, guiding the cost regularization process and enforcing strong spatial coherence, particularly in textureless regions. Furthermore, we design a local consistency depth refinement module that utilizes surface normals to establish depth relationships as a local geometric prior, thereby refining classification-based depth predictions and aligning them with ground truth depth. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that our method achieves state-of-the-art performance.

Abstract:
Center-based 3D object detection has underperformed recently compared to advanced techniques. We experimentally find that the root cause lies in two weaknesses of the basic sample mechanism: 1) Unreasonable assignment: close-range and high-frequency objects dominate the network optimization since samples are equally assigned to each object. 2) Ambiguous encoding: samples exhibit suboptimal object discrimination ability, as the encoding process is restricted to a limited receptive field. To realize a reasonable assignment, Dynamic Multi-Quality Assignment (DMQA) is proposed, which dynamically assigns and supervises samples through fine-grained control. Concretely, initial samples are defined based on prior attributes (category and distance) per object, and dynamically adjusted according to the learning effect (classification and localization confidence). In addition, multi-scale auxiliary losses are introduced, ensuring precise sample learning. As for ambiguous encoding, Interactive Enhancement (IE) is introduced to improve sample representation through cross-task and cross-sample interaction. Cross-task interaction first aggregates neighborhood context from another task map. Parallel attention further performs cross-sample interaction on both local and global levels. Based on DMQA and IE, we propose a novel 3D detector named Sample Enhanced 3D Object Detection (SampleDet3D). Comprehensive experiments demonstrate that SampleDet3D effectively enhances center-based detection and achieves state-of-the-art performance on both Waymo and ONCE datasets.

Abstract:
Removing haze from real-world images is challenging due to unpredictable weather conditions, resulting in the misalignment of hazy and clear image pairs. In this paper, we propose an innovative dehazing framework that operates under non-aligned supervision. This framework is grounded in the atmospheric scattering model, and consists of three interconnected networks: dehazing, airlight, and transmission networks. In particular, we explore a non-alignment scenario in which a clear reference image, unaligned with the input hazy image, is utilized to supervise the dehazing network. To implement this, we present a multi-scale reference loss that compares the feature representations between the reference image and the dehazed output. Our scenario makes it easier to collect hazy/clear image pairs in real-world environments, even under conditions of misalignment and shifted views. To showcase the effectiveness of our scenario, we have collected a new hazy dataset, called “Phone-Hazy”, including 415 image pairs captured by mobile phone in both rural and urban areas. Furthermore, we introduce a self-attention network based on mean and variance for modeling real infinite airlight, using the dark channel prior as positional guidance. Experimental results demonstrate the superior performance of our framework over existing state-of-the-art techniques in the real-world image dehazing task. Phone-Hazy and code will be available at https://fanjunkai1.github.io/projectpage/NSDNet/index.html
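The atmospheric scattering model on which the framework is grounded is commonly written as I(x) = J(x) t(x) + A (1 - t(x)), where I is the hazy image, J the clear scene, t the transmission, and A the airlight. The sketch below applies and inverts this equation on a few pixel intensities. It is only an illustration of the model, not of the three networks: the function names, the scalar global airlight, and the clamp `t_min` are assumptions.

```python
from typing import List, Sequence


def add_haze(clear: Sequence[float], transmission: Sequence[float], airlight: float) -> List[float]:
    """Forward model: I = J * t + A * (1 - t), applied per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def remove_haze(hazy: Sequence[float], transmission: Sequence[float],
                airlight: float, t_min: float = 0.1) -> List[float]:
    """Inverse model: J = (I - A) / max(t, t_min) + A.

    t_min (an assumed safeguard) avoids dividing by a near-zero transmission.
    """
    return [(i - airlight) / max(t, t_min) + airlight for i, t in zip(hazy, transmission)]


# Hypothetical pixel intensities in [0, 1] with a scalar global airlight.
clear_pixels = [0.2, 0.5, 0.9]
transmission_map = [0.8, 0.5, 0.3]
hazy_pixels = add_haze(clear_pixels, transmission_map, airlight=1.0)
recovered = remove_haze(hazy_pixels, transmission_map, airlight=1.0)
```

With the true transmission and airlight, the inversion recovers the clear pixels up to floating-point error; in practice both quantities must be estimated, which is the role the abstract assigns to the transmission and airlight networks.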

Abstract:
Most existing multi-object tracking methods typically learn visual tracking features by maximizing dissimilarities between different instances and maximizing similarities within the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging, especially in cases of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby aiding in improving the robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both scene- and instance-level language information into high-dimensional embeddings, which are utilized to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions, leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. Source code and pre-trained models are available at https://github.com/WesLee88524/LG-MOT

Abstract:
Label assignment, which aims to classify region proposals as positive or negative samples depending on the correlations between their classification and localization predictions and the corresponding ground truth, is recognized as an essential ingredient in object detection and strongly affects the detection performance. Recently, some dynamic label assignment methods have been proposed to overcome the limitations of the static methods and achieve promising performance improvement. Despite eliminating the restrictions of the human prior sampling knowledge in static methods, existing dynamic principles usually suffer from two weaknesses. First, most of them deploy mixture models or an implicit branch in the prediction head to coarsely estimate the spatial distribution of the positive samples for objects, and give little attention to the effect of the appearance information of the objects. Second, these methods still cannot perceive the quality distribution of the positive samples, and low-quality samples adversely affect the detection performance. To address these issues, this paper presents a novel automatic label assignment (ALA) method for object detection. Specifically, our method first introduces an instance property branch into the object detection pipeline to distinguish the foreground from the background. Then, an objectness prediction module composed of confidence and weight mechanisms is developed to generate the positive and negative weight maps for the objects. The instance property branch and objectness prediction module provide a coarse-to-fine optimization framework that makes our method aware of the appearance of the objects. Finally, a positive sample selection strategy is proposed to explore the statistical quality distribution of the positive samples, which are trained with differently designed label targets.
We evaluate our method on the MS COCO dataset and achieve 48.4%, 47.9%, 48.0% and 49.3% on ResNet-101, ResNeXt-101, DCN-ResNet-101 and DCN-ResNeXt-101 in terms of AP_0.5:0.95, respectively. We evaluate the time complexity of ALA by measuring the inference speed: the frames per second (FPS) for these four backbones are 11.9, 10.4, 9.9 and 8.0, respectively. The experimental results demonstrate that our method obtains clear improvements over competing methods and performs favorably against state-of-the-art approaches.

Abstract:
In recent years, the rapid development of deep learning has brought new opportunities for steganography. However, the current advanced white-box model steganography methods are not suitable for large language models. Since the parameter scale and complexity of large language models are far beyond that of ordinary models, retraining them to hide secret data is extremely challenging. Moreover, the cover parameters or structures of the embedded data are vulnerable to detection by attackers. To enhance practicality, we propose a black-box steganographic scheme for large language models, which embeds secret data into the third-party pre-trained large language models using backdoor techniques without knowing the internal complex structure and parameters of the large language models. Specifically, the sender first encodes the secret data into trigger labels and then uses a certain proportion of trigger samples and clean samples to fine-tune the third-party large language model to embed the secret data without significantly reducing the model performance. The receiver uses trigger samples to extract the secret data by interacting with the large language model, thereby achieving covert communication of the secret data. Experiments demonstrate the effectiveness of the proposed scheme in terms of embedding capacity, robustness, and security.

Abstract:
Recently, methods that utilize prompt tuning to rapidly transfer pretrained vision-language models (VLMs) to downstream tasks have been proposed. Although these models have produced reasonable results, they typically learn a single prompt, which limits their ability to capture more diverse information. This ability is crucial for addressing fine-grained classification challenges and intraclass visual variability (e.g., color, pose, and size variations within the same category). However, learning multiple prompts provides a larger optimization space, which further exacerbates the overfitting phenomenon. This makes it more challenging to balance the performance achieved on base and new categories. To address these issues, we propose a progressive multi-prompt (PMP) learning method. Specifically, we introduce multiple prompts in a step-by-step manner to focus on various information. To reduce overfitting, we utilize a late attaching mechanism to defer the interactions of prompts and features to a deeper encoding layer. Furthermore, we balance the prompts for different layers with learnable weights to guide the optimization procedure. We compare our method with several state-of-the-art approaches in base-to-new task settings and demonstrate superior base-new tradeoff performance. Additionally, we conduct cross-dataset transfer, domain generalization, and few-shot experiments to further validate the effectiveness of our method. Our code is available at https://github.com/JunL-Geek/PMP.git.

Abstract:
Attributed graph clustering, which aims to discover the underlying graph structure and partition the graph nodes into several disjoint categories, is a basic task in graph data analysis. Although recent efforts on graph contrastive clustering have achieved decent performance, most of them construct the positive neighbor set from the generated pseudo clustering information, ignoring the ready-made neighbor nodes and the underlying semantics of edges in a graph. How to handle the graph neighbor-specific information effectively to improve graph contrastive clustering remains a challenging problem. Therefore, in this work, we propose a novel Homophily Induced Contrastive Attributed Graph Clustering (HomoCAGC) method, in which the power of homophily is exploited to improve contrastive attributed graph clustering, while the pseudo homophily in a graph is also explored and distinguished. Specifically, the node features are used as reliable guidance to compute the underlying feature-oriented pair-wise node similarity, based on which the positive node pairs in the contrastive regularizer are adjusted for better node representation. Based on the refined node representations, a triplet self-supervised clustering objective is designed to ensure that the output embedding is cluster-oriented and suitable for the downstream clustering task. Extensive experiments on seven benchmarks are conducted to demonstrate the effectiveness of HomoCAGC.

Abstract:
The features provided by RGB and Thermal Infrared (TIR) images have their own characteristics. Therefore, how to adaptively fuse multi-modal features according to different tracking scenarios is crucial for RGB-T tracking. However, current mainstream RGB-T tracking algorithms often use fixed fusion operations for modal interaction in different scenarios. Consequently, their tracking performance deteriorates because they are unable to dynamically adjust the fused multi-modal features based on the current scene. To address this issue, we propose a novel RGB-T tracking algorithm called AETrack, which can dynamically extract effective modal features in different scenarios for adaptive fusion. Firstly, we design an adaptive expert decision mechanism that employs multiple experts to process the input features. Each expert focuses on and learns different relevant features. Based on this mechanism, we then propose a feature-guided method that leverages the correlations between modalities to provide cross-modal information. This guidance enables the adaptive expert mechanism to select the most suitable expert to output effective features based on different scenarios, ensuring that our proposed AETrack prioritizes effective features and thus alleviates interference from irrelevant information. Finally, we design a Progressive Cross-modal Fusion operation to achieve multi-level adaptive fusion of effective features across different modalities. Benefiting from this adaptive fusion process, we can effectively achieve multi-modal interaction in different scenarios to guide robust tracking. Extensive experiments on three popular benchmarks (i.e., LasHeR, RGBT210, RGBT234) show that our proposed AETrack can significantly improve tracking performance.

Abstract:
Natural language tracking aims to locate the position of a target specified by a natural language description. Existing methods are trained on vision-language datasets with a small number of language descriptions, which may lead to limited semantic generalization. Moreover, they extract visual and language features separately, which limits visual-semantic capabilities. To overcome these limitations, we propose a novel semantic-aware tracking framework, SATrack, which integrates a semantic-aware attention module and a cross-modal aggregation module. The proposed SATrack enjoys several merits. First, the semantic-aware attention module utilizes language semantics as a bridge to build associations between visual features, enabling stronger visual-semantic capabilities. Second, the cross-modal aggregation module transfers the semantic knowledge of CLIP into the tracking framework for semantic generalization. Extensive experimental results demonstrate that SATrack outperforms previous state-of-the-art trackers on four natural language tracking benchmarks.

Abstract:
Vision-language object tracking integrates advanced linguistic information, enhancing its robustness and accuracy in complex scenarios. Nevertheless, current methods are constrained by a lack of sufficient vision-language data, making it challenging for the model to learn generalized knowledge. To alleviate this issue, we propose a new prompt-based framework for vision-language tracking, named ProVLT. This framework casts language information as a prompt for pretrained vision-based tracking models, thereby leveraging the knowledge from extensive tracking data. Experiments demonstrate that ProVLT achieves competitive performance while training only a fraction of parameters (approximately 29% of model parameters). For instance, ProVLT attains an AUC of 59.8% on the TNL2K benchmark. Furthermore, we augment five mainstream vision-only tracking benchmarks with language annotations, and find that the inclusion of linguistic information consistently improves tracking performance. On these benchmarks, the linguistic information improves the performance by an average of 2.9% compared with the vision-based tracker. We will release the code, models, and benchmarks for the community.

Abstract:
Banding is a visually annoying artifact that frequently occurs along the chain of video acquisition, production, distribution, and display, showing a significant need for improvement in many fields. Thus far, efforts on banding removal are mainly knowledge-driven or merely learning on RGB space, which is either limited by domain knowledge or lacks the consideration for banding in chrominance channels. In this work, we propose a unified deep neural network that explicitly disentangles the luminance and chrominance channels, and simultaneously recovers intensity gradients and color discontinuity from detection-free measurement in an end-to-end manner. Our debanding model is comprised of a luminance restoration network (LR-Net) and a chrominance restoration network (CR-Net). Each of them follows an encoder-decoder architecture, where a cascade of residual blocks is employed to exploit hierarchical non-local features in spatial dimensions for more powerful feature representation. Moreover, we investigate the characteristics of banding artifacts and apply specific loss functions to guide the debanding in different channels, thus boosting the restoration performance. Both qualitative and quantitative experiments show that our model significantly surpasses the existing method in terms of all 7 metrics. Ultimately, our network trained on simulated data exhibits good adaptiveness under various compression scenarios, which further demonstrates the effectiveness of the proposed model.

Abstract:
JPEG is used daily for compressing natural images, but the compressed images often contain visually annoying artifacts, especially at low rates. To reduce the compression artifacts, it has been proposed to preprocess an image before JPEG compression with the help of deep learning, which maintains standard compliance. However, the existing methods were not fully justified from the rate-distortion optimization perspective. We address this limitation and propose a truly rate-distortion-optimized deep preprocessing method for JPEG compression. We decompose a rate-distortion cost into three parts: rate, distortion, and Lagrangian multiplier. First, we design a rate estimation network and propose to train the network to estimate the JPEG compression rate. Second, we propose to estimate the actual end-to-end distortion (between original and reconstructed images) with a differentiable JPEG simulator, where we specifically design an adaptive discrete cosine transform (DCT) domain masking algorithm. Third, we propose to estimate the actual content-dependent Lagrangian multipliers to combine rate and distortion into a joint loss function that drives the training of the preprocessing network. Our method makes no change to the JPEG encoder and decoder and supports any differentiable distortion measure (e.g., MSE, MS-SSIM, LPIPS). On the Kodak dataset, our method achieves on average 7.59% BD-rate reduction compared to the JPEG baseline when using MSE. With per-image optimization for LPIPS, our method achieves as high as 38.65% BD-rate reduction, and produces high-quality reconstructed images with far fewer artifacts.
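The three parts named above combine in the standard Lagrangian form J = D + λR, where D is the distortion, R the rate, and λ the multiplier that sets the trade-off. The sketch below evaluates this cost for a few operating points and selects the cheapest one. The operating points and their (distortion, rate) values are made-up toy numbers, not results from the paper, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def rd_cost(distortion: float, rate: float, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate


def best_candidate(candidates: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Return the name of the operating point with the lowest RD cost."""
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Toy operating points as (distortion in MSE, rate in bits per pixel).
# These values are illustrative assumptions only.
operating_points = {
    "q90": (2.0, 1.20),
    "q50": (6.0, 0.60),
    "q10": (30.0, 0.20),
}

choice_low_lambda = best_candidate(operating_points, lam=1.0)
choice_mid_lambda = best_candidate(operating_points, lam=10.0)
choice_high_lambda = best_candidate(operating_points, lam=100.0)
```

A small λ makes rate cheap and favors the high-quality point, while a large λ penalizes rate and favors the low-rate point, which is why a content-dependent multiplier changes which trade-off the preprocessing network is driven toward.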

Abstract:
The current training strategies based on knowledge distillation for image captioning assume that each learning model possesses complete learning value, and lack review and guidance mechanisms in the interactive process between models. To address this problem, we propose a novel Captioner with Deep Reciprocal Learning (CaDReL) for image captioning, inspired by social learning theory, which realizes interactive learning between models controlled by salient semantic evaluation. In CaDReL, we analyze the semantic saliency of each learning network to better control the parameter transfer in knowledge distillation by cyclically and alternately freezing and unfreezing two learning networks with identical review mechanisms. We also propose a novel cascade bridging diffusion module, which fuses feature information from different levels of visual information and attention ranges in the encoder by a cascade diffusion mechanism to capture rich image details and contextual information. Meanwhile, an attention-guided knowledge augmentation module is proposed to guide knowledge transfer using the attention maps from the respective encoders of two peer joint learning modules, improving the robustness of the whole training strategy. Experimental results illustrate that the proposed CaDReL achieves excellent performance on the MSCOCO dataset, and outperforms most state-of-the-art methods. Codes are available at https://github.com/ZJ-VIP-Lab/Deep-Reciprocal-Learning-for-Image-Captioning.

Abstract:
Neural Radiance Fields (NeRFs) have shown great potential in modeling 3D scenes. Dynamic NeRFs extend this model by capturing time-varying elements, typically using deformation fields. The existing dynamic NeRFs employ a similar Eulerian representation for both light radiance and deformation fields. This leads to a close coupling of appearance and motion and lacks a physical interpretation. In this work, we propose Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF), which introduces a particle-based representation to model the motions of visual elements in a dynamic 3D scene. DAP-NeRF consists of the superposition of a static field and a dynamic field. The dynamic field is quantized as a collection of appearance particles, each of which carries the visual information of a small dynamic element in the scene and is equipped with a motion model. All components, including the static field, the visual features and the motion models of particles, are learned from monocular videos without any prior geometric knowledge of the scene. We develop an efficient computational framework for the particle-based model. We also construct a new dataset to evaluate motion modeling. Experimental results show that DAP-NeRF is an effective technique to capture not only the appearance but also the physically meaningful motions in a 3D dynamic scene. Code is available at: https://github.com/Cenbylin/DAP-NeRF.

Abstract:
Continual learning, focusing on sequential knowledge acquisition and retention, necessitates efficient memory management. This paper introduces a holistic approach, diverging from traditional methods that separately optimize neural network and replay buffer memory. We aim to enhance overall memory efficiency, addressing neural network parameters and replay buffer concurrently within strict memory constraints. This is achieved by harnessing neural network parameter redundancies and employing compression techniques like pruning and quantization, allowing data replay storage without extra memory overhead. Balancing memory use across components is challenging due to the complex search space of combined tasks. We tackle this by conceptualizing it as a bi-level optimization problem, integrating all tasks under a single objective, thus optimizing memory use and managing the interplay between different components. We employ a synergy of optimization techniques to solve this challenging bi-level optimization problem. Our experimental findings affirm the superior performance of our proposed method, outperforming existing techniques such as prompt-based, feature-replay, exemplar-replay, and regularization-based methods under stringent memory constraints, consistently across various datasets and neural network architectures.

Abstract:
Parameter-Efficient Fine-Tuning methods based on vision-language models (such as CLIP) for few-shot learning have recently received considerable attention. However, previous works fine-tune only the image branch or only the text branch, breaking the alignment of the original two branches. Meanwhile, fine-tuning both branches of CLIP would inevitably introduce more trainable parameters and likely cause more severe over-fitting due to the limited training data. In this study, we propose a novel Dual-branch Adapter-Tuning framework (DAT), which collaboratively trains the visual adapter and textual adapter added to the two branches of the original CLIP with multiple consistency constraints. By effectively utilizing the semantically detailed class-specific prompts and outputs of the original CLIP to guide the fine-tuning of both branches, our method gains exceptional adaptation ability to the downstream few-shot learning tasks and alleviates the over-fitting issue, while maximally preserving the generalization ability of the original CLIP model. Our proposed framework achieves superior performance on diverse datasets under various few-shot learning settings compared to the existing approaches. The source code is available at https://github.com/SandyXi/DAT.

Affiliations: School of Artificial Intelligence, Optics and Electronics (iOPEN), and the Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; School of Computer Science, the School of Artificial Intelligence, Optics and Electronics (iOPEN) and the Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China

Abstract:
Ensemble clustering, which combines the information from multiple base clusterings to obtain a better partition result, has received extensive attention due to its effectiveness and robustness. Although many algorithms have been developed in recent years that have achieved impressive results in practical applications, two challenging issues in ensemble clustering remain. First, most algorithms assume that all base clusterings have the same impact on the clustering results, assigning them the same weight. This makes the clustering performance susceptible to the influence of redundant, low-quality base clusterings. Second, co-association matrix-based algorithms often rely on additional methods, such as hierarchical agglomerative clustering, to obtain the final clustering result after constructing the weighted co-association matrix. This not only complicates the optimization process but also leads to the loss of some sample-similarity information during clustering. To address these problems, we propose a novel Toward Balance Adaptive Weighted Ensemble Clustering (TBAWEC) algorithm. This method transforms the ensemble clustering problem into an optimization problem, producing the final result without requiring additional clustering algorithms. Moreover, we introduce balancing techniques into ensemble clustering for the first time, significantly improving the balance of clustering results. Extensive experiments on real datasets demonstrate that the proposed algorithm outperforms both the most advanced ensemble clustering algorithms and the most advanced balanced clustering algorithms.
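For readers unfamiliar with the weighted co-association matrix mentioned above, the sketch below builds one from a few base clusterings: entry (i, j) is the weighted fraction of base clusterings that place samples i and j in the same cluster. This illustrates only the conventional construction that the abstract refers to, not the TBAWEC optimization; the function name, the fixed weights, and the toy labels are assumptions.

```python
from typing import List, Sequence


def weighted_co_association(base_clusterings: Sequence[Sequence[int]],
                            weights: Sequence[float]) -> List[List[float]]:
    """Build a weighted co-association matrix.

    base_clusterings: one label list per base clustering, all of length n.
    weights:          one non-negative weight per base clustering.
    """
    if len(base_clusterings) != len(weights):
        raise ValueError("need exactly one weight per base clustering")
    n = len(base_clusterings[0])
    total = float(sum(weights))
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(base_clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    # Samples i and j co-occur in this base clustering.
                    matrix[i][j] += w / total
    return matrix


# Three hypothetical base clusterings of four samples, with unequal weights.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
co_matrix = weighted_co_association(clusterings, weights=[5, 3, 2])
```

With equal weights every base clustering contributes the same amount, so a low-quality clustering such as the third one here distorts the similarities as much as a good one; unequal weights reduce that influence.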

Abstract:
Multi-view clustering leverages the complementary and compatible information among various views to achieve superior clustering outcomes. The approach of multi-view clustering through non-negative matrix factorization (NMF) has garnered extensive interest, attributed to its remarkable interpretability and clustering efficacy. Nonetheless, existing NMF-based multi-view subspace clustering methods fall short in thoroughly harnessing the complementary information across different views, potentially impairing clustering performance. To mitigate this issue, we introduce an orthogonal semi-nonnegative matrix tri-factorization model. This model excels in clustering interpretability, enabling the direct derivation of cluster labels from the clustering indicator matrix, thereby eliminating the need for post-processing. Our model employs tensor Schatten p-norm as a constraint, adeptly capturing both the complementary information and spatial structure information across views. Extensive experimental evaluations on a variety of benchmark datasets affirm the superior clustering performance of our proposed method.

Abstract:
Current multi-illumination color constancy methods typically estimate illumination for each pixel directly. However, according to the multi-illumination imaging equation, the color of each pixel is determined by various components, including the innate color of the scene content, the colors of multiple illuminations, and the weightings of these illuminations. Failing to distinguish between these components results in color coupling. On the one hand, there is color coupling between illumination and scene content, where estimations are easily misled by the colors of the content, and the distribution of the estimated illuminations is relatively scattered. On the other hand, there is color coupling between illuminations, where estimations are susceptible to interference from high-frequency and heterogeneous illumination colors, and the local contrast is low. To address color coupling, we propose a Color Decoupling Network (CDNet) that includes a Content Color Awareness Module (CCAM) and a Contrast HArmonization Module (CHAM). CCAM learns scene content color priors, decoupling the colors of content and illuminations by providing the model with the color features of the content, thereby reducing out-of-gamut estimations and enhancing consistency. CHAM constrains feature representation, decoupling illuminants by mutual calibration between adjacent features. CHAM utilizes spatial correlation to make the model more sensitive to the relationships between neighboring features and utilizes illumination disparity degree to guide feature classification. By enhancing the uniqueness of homogeneous illumination features and the distinctiveness of heterogeneous illumination features, CHAM improves local edge contrast. Additionally, CHAM allocates fine-grained margin coefficients to emphasize the soft distinctiveness of similar illumination features, further enhancing local contrast.
Extensive experiments on single- and multi-illumination benchmark datasets demonstrate that the proposed method achieves superior performance.

Abstract:
To achieve high-quality and high-resolution image processing, this work presents a novel vision processor that facilitates deep learning-enhanced image processing pipelines. At the system level, by identifying that a divide-and-conquer approach is essential to synergize both classical image processing and image enhancement networks, we develop a tightly coupled system with strip-tile conversion dataflow to enable fine-grained low-latency data interactions between image signal processors (ISPs) and the deep learning accelerator (DLA). At the architecture level, we design a comprehensive set of 21 efficient image processing modules to construct classical ISP pipelines, a tile-based strip layer fusion DLA specifically optimized for networks, and a programmable pixel pool that seamlessly supports the data access patterns of the ISP and the DLA. At the software and hardware co-design level, we propose a comprehensive optimization framework to address the implementation overhead of networks while maintaining the image quality. Finally, evaluations of the AI-ISP vision processor demonstrate 53.95% external memory access reduction and 35.51% latency reduction, delivering superior image quality with minimal on-chip memory overhead. A throughput of up to 168.5 frames per second facilitates efficient processing of ultra-high definition (UHD) resolution images.

Abstract:
In this paper, a highly efficient 3D convolution feature compression method is proposed. The method is mainly used to compress the three-dimensional convolution deep features extracted by video analysis, so that only the compressed features, rather than all video features, are gathered at the cloud server. This addresses the problem that data aggregation for video big data analysis requires a large amount of network bandwidth. Quantization is a common method for feature compression, but existing quantization-based methods often carry out model training and quantization in separate stages, which makes the quantization results less robust. To solve this problem, the proposed method applies the feature quantization operation directly within the network and trains it together with the analysis task, using a parameter iterative optimization method to handle the non-differentiability of the quantization operation. Unlike deep features extracted from images or single objects, the 3D convolution features extracted from video clips have high time-domain redundancy. By serializing the three-dimensional convolution features, a time-domain prediction coding method is used to remove this redundancy and improve the feature compression ratio. The experimental results show that this method needs only 1 bit to represent each element of the three-dimensional convolution deep feature. When the analysis accuracy loss is no more than 1%, the feature compression ratio can reach 4500 times compared with the original feature data, and the data transmission can be reduced by 96%.
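
The two ideas in this abstract (1-bit quantization and time-domain prediction coding) can be illustrated with a minimal sketch. This is not the authors' implementation: the sign threshold, the XOR residual as the "prediction" step, and all function names are assumptions chosen for illustration. The point is only that temporally redundant binary features produce mostly-zero residuals, which an entropy coder could then compress well.

```python
from typing import List

def binarize(features: List[List[float]]) -> List[List[int]]:
    """Quantize each element to 1 bit: 1 if positive, else 0 (assumed threshold)."""
    return [[1 if v > 0 else 0 for v in step] for step in features]

def temporal_residual(bits: List[List[int]]) -> List[List[int]]:
    """Keep the first time step; code each later step as XOR with its predecessor."""
    out = [bits[0][:]]
    for prev, cur in zip(bits, bits[1:]):
        out.append([p ^ c for p, c in zip(prev, cur)])
    return out

def reconstruct(residual: List[List[int]]) -> List[List[int]]:
    """Invert temporal_residual by accumulating XORs over time."""
    out = [residual[0][:]]
    for r in residual[1:]:
        out.append([p ^ x for p, x in zip(out[-1], r)])
    return out

# Hypothetical serialized features: 3 time steps x 4 channels.
feats = [[0.9, -0.2, 0.4, -0.7],
         [0.8, -0.1, 0.5, -0.6],
         [0.7, 0.3, 0.6, -0.5]]
bits = binarize(feats)
res = temporal_residual(bits)
print(res)  # later rows are mostly zeros because adjacent steps are similar
```

The residual is lossless with respect to the binarized features, so `reconstruct(res)` returns `bits` exactly; any accuracy loss in such a scheme would come from the quantization step, not from the prediction coding.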

Abstract:
Prompt tuning has been successfully used in leveraging the knowledge of Large-scale Vision-Language Pre-trained (VLP) models on downstream tasks. Most existing prompt tuning approaches learn prompts by maximizing the pairwise similarity. Although samples from different modalities may be reasonably aligned pairwise, such alignment does not fully utilize the information between samples and can be less consistent at the modality level. In this paper, we propose a novel prompt tuning strategy that matches different modalities at the distribution level. Specifically, we minimize the distribution-wise distance between the image and text modalities with optimal transport (OT) theory. Simultaneously, we add a constraint on the learned transport plan during the modality matching to enhance the learning of vision and text prompts. As a plug-and-play method, our approach can be applied to existing uni-modal and multi-modal prompt learning methods to generate modality-consistent representations. Experiments on eleven public datasets demonstrate that our proposed method has excellent performance, achieving substantial improvements on both uni-modal and multi-modal prompt tuning methods.
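
As a rough illustration of a distribution-wise distance between image and text features, the sketch below computes an entropy-regularized optimal transport cost with Sinkhorn iterations. It is an assumption-laden example, not the paper's method: the cosine cost, uniform marginals, the regularization strength `eps`, and the function names are all hypothetical, and the additional constraint on the transport plan mentioned in the abstract is not modeled.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

def cosine_cost(xs: Matrix, ys: Matrix) -> Matrix:
    """Cost matrix with entries 1 - cosine similarity."""
    def norm(v: List[float]) -> float:
        return math.sqrt(sum(a * a for a in v))
    return [[1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn(cost: Matrix, eps: float = 0.1, iters: int = 200) -> Tuple[Matrix, float]:
    """Entropy-regularized OT between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m          # uniform marginals (assumed)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, dist

# Hypothetical 2-D embeddings for two images and two text prompts.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.1], [0.1, 1.0]]
plan, ot_distance = sinkhorn(cosine_cost(image_feats, text_feats))
print(plan, ot_distance)
```

In a prompt tuning setting, a quantity like `ot_distance` would serve as a loss term that is minimized with respect to the prompt parameters producing the features; here the features are fixed numbers purely to keep the example self-contained.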

Abstract:
Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University, Jurong West, Singapore; Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hung Hom, SAR, Hong Kong; Continental-NTU Corporate Laboratory, Nanyang Technological University, Jurong West, Singapore; College of Computer Science and Technology, Harbin Engineering University, Harbin, China; Continental Automotive Singapore Pte. Ltd., Boon Keng Rd, Singapore

Abstract:
Exploring new knowledge is a fundamental human ability that can be mirrored in the development of deep neural networks, especially in the field of object detection. Open world object detection (OWOD) is an emerging area of research that adapts this principle to explore new knowledge. It focuses on recognizing and learning from objects absent from initial training sets, thereby incrementally expanding its knowledge base when new class labels are introduced. This survey paper offers a thorough review of the OWOD domain, covering essential aspects, including problem definitions, benchmark datasets, source code, evaluation metrics, and a comparative study of existing methods. Additionally, we investigate related areas like open set recognition (OSR) and incremental learning (IL), underlining their relevance to OWOD. Finally, the paper concludes by addressing the limitations and challenges faced by current OWOD algorithms and proposes directions for future research. To our knowledge, this is the first comprehensive survey of the emerging OWOD field with over one hundred references, marking a significant step forward for object detection technology. The source code and benchmarks are archived at https://github.com/ArminLee/OWOD_Review.

Abstract:
The absorption and scattering of light in the water medium naturally impair the quality of underwater images, leading to multiple degradation effects including color casts, reduced visibility, and blurriness. Underwater Image Enhancement (UIE) techniques strive to mitigate these issues, yet the efficacy of different UIE algorithms remains highly variable. This variability underscores the necessity for an objective quality metric capable of precisely assessing the visual quality of underwater images. Traditional quality metrics, which primarily rely on a single score to depict the overall quality level, are insufficiently comprehensive to describe the complex degradation characteristics intrinsic to underwater environments and the multi-dimensional nature of underwater image quality. To address this issue, we construct the first UIE quality evaluation dataset with multi-dimensional quality annotations, broadening the subjective labels from a single overall quality score to multiple specific degradation-related scores. The dataset is an enhanced version of our previous Subjectively Annotated UIE Benchmark Dataset (SAUD) and is called SAUD2.0 hereinafter. Based on the SAUD2.0 dataset, we also introduce a Multi-stream COllaborative LEarning network (MCOLE) tailored for quality evaluation of enhanced underwater images. MCOLE capitalizes on the multi-dimensional quality annotations within SAUD2.0, facilitating the training of three specialized networks focused on extracting distinct sets of features: color, visibility, and semantic. These extracted features then interact and are cohesively merged for quality prediction. Comprehensive experiments conducted on two benchmark datasets reveal that the proposed MCOLE outperforms current underwater image quality metrics. These results clearly validate the efficacy of exploring the multi-dimensional nature of underwater image quality and integrating such multi-dimensional quality annotations into underwater image quality evaluation. Our dataset and code are available at https://github.com/0117Tzx/MCOLE.

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Computer Science Center (National Supercomputer Center in Jinan), and Shandong Fundamental Research Center for Computer Science, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Information Science & Technology, Dalian Maritime University, Dalian, China; School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

Abstract:
In this paper, we propose a robust image steganography method utilizing color conversion, leveraging de-colorization and colorization models to achieve covert transmission of secret information. The motivation is to use color conversion of the stego image to conceal steganographic behavior. For the sender, secret information is embedded into the color cover image using a robust embedding algorithm based on quaternion exponent moments. The stego images are then de-colorized to obtain grayscale images, which can be transmitted over public channels. For the receiver, a corresponding colorization network is designed to reconstruct the stego image and extract the secret information. Additionally, an attack module using Gaussian noise is implemented to enhance the robustness of the proposed steganography. Given a color image, its grayscale version can be chosen from various options, making it difficult for attackers to detect steganographic activity as long as the generated grayscale image appears normal and meaningful. Extensive simulation results demonstrate the feasibility and scalability of the proposed steganography method.

Abstract:
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and VAD methods can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset are chosen through on-site factory research and discussions with engineers. This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively investigate the periodic information in a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, which are initially trained using synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and drive the process of video understanding tasks as well as smart factory deployment. Project page: https://ljf1113.github.io/IPAD_VAD.

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Chinese Academy of Sciences, Aerospace Information Research Institute, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract:
Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency “noise” information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks. Code is available at https://github.com/clearxu/BGA.
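
The amplitude/phase decoupling described for SDA can be pictured with a small Fourier-domain sketch. The abstract does not say how the amplitude is modified, so the example below makes an explicit assumption: it blends the amplitude spectrum of a feature with that of a second "style" signal while keeping the original phase. The 1-D signals, the blending weight `alpha`, and the function names are all hypothetical.

```python
import cmath
import math
from typing import List

def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (fine for tiny illustrative inputs)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X: List[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def diversify_style(feature: List[float], style: List[float],
                    alpha: float = 0.5) -> List[float]:
    """Blend amplitude spectra (assumed operation) while keeping the feature's phase."""
    mixed = []
    for f, s in zip(dft(feature), dft(style)):
        amplitude = (1.0 - alpha) * abs(f) + alpha * abs(s)
        mixed.append(cmath.rect(amplitude, cmath.phase(f)))  # original phase kept
    # For real inputs the result is real up to rounding, so drop the imaginary part.
    return [v.real for v in idft(mixed)]

feature = [1.0, 2.0, 0.5, -1.0]   # hypothetical 1-D feature
style = [0.0, 1.0, 0.0, 3.0]      # hypothetical signal supplying a different "style"
augmented = diversify_style(feature, style, alpha=0.5)
print(augmented)
```

With `alpha=0.0` the function returns the input unchanged, and with `alpha=1.0` the output carries the style signal's amplitude spectrum on the original phase, which is the sense in which only the amplitude is being operated on.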

Abstract:
Video conferencing has become indispensable in human communication. Researchers are exploring immersive capabilities to enhance video conferencing experiences by delivering realistic interactions. However, existing methods require extra hardware beyond a typical video conferencing setup, including multiple depth cameras, large screens, and headsets, which hinders widespread adoption due to high costs and complex setups. Thus, there is an urgent demand for lightweight systems that use only on-hand devices, namely a single RGB camera and a standard screen, without additional hardware. We propose DVCO, a novel 3D video conferencing system via on-hand devices. With DVCO, users can experience lifelike virtual conferencing that includes natural contact and interactive features. To achieve this, DVCO has two main components: Virtual Camera Transformation (VCT) and New View Generator (NVG). VCT computes a downscaled sender image from tracking to determine viewpoint and gaze vector, enhancing virtual presence on standard screens. NVG takes an input frame and desired view angle to produce an output reflecting the new view from a single RGB camera. Together, these provide an affordable, easy-to-integrate enhancement for current video conferencing systems without expensive upgrades. A user study demonstrates that DVCO offers a higher level of immersion than traditional systems. Experiments further show that VCT and NVG outperform baseline methods.

Abstract:
Accurate disparity estimation in diverse and complex scenes remains a significant challenge in stereo matching, requiring precise geometric perception and robust generalization. Traditional methods often struggle to capture fine-grained details and maintain structural consistency under varying conditions, which substantially degrades disparity estimation performance. To address these limitations, we propose Reg-Stereo, a novel framework based on region-aware distribution optimization. It leverages region-aware awakening to extract structural cues and explicitly optimizes the spatial distribution of feature responses. By strengthening relative structural attributes within local regions and expanding them to the global context, our approach enables a more precise and context-aware representation of geometric structures, effectively capturing fine details while preserving global consistency. This innovative approach enables the framework to adapt effectively to diverse and challenging environments, improving both robustness and generalization. Extensive experiments on multiple datasets validate the effectiveness of Reg-Stereo, surpassing existing state-of-the-art methods in disparity estimation with enhanced adaptability across complex and heterogeneous scenarios.

Abstract:
Screen-shooting watermarking is an effective means of protecting screen content from unauthorized capture and illegal dissemination. However, existing methods are primarily designed for full-image capture, making them ineffective for partial screen-shooting prevalent in real-world scenarios. To address this limitation, we propose FPSMark, a flexible watermarking method tailored for partial screen-shooting that embeds consistent watermarks in multiple uniformly distributed cover blocks. Specifically, considering that robustness requirements vary according to the layout of each image, we model the mathematical relationship between the watermark block count and robustness, proving the flexibility of FPSMark in ensuring partial screen-shooting robustness. Moreover, partial screen-shooting disrupts watermark synchronization, posing challenges for precise watermark localization. To overcome this, we design an intrinsic signal localization network optimized with a hybrid loss. The localization network exploits the inherent distinctions between the watermark and non-watermark features, while the hybrid loss constrains the network at three dimensions: pixel-level, region-level, and sample-level. Experimental results demonstrate the superiority of FPSMark, showing robust performance across partial capture percentages. Its extraction accuracy exceeds 98% even with only half of the image captured, and it achieves 82% accuracy at a 40% capture ratio, whereas existing methods achieve only around 50% under the same conditions.

Abstract:
Recently, Transformer-based methods have demonstrated satisfactory results on lightweight Image Super-Resolution. However, most of them limit the computational range of Transformer within a local neighbourhood, thus missing much global information. In addition, exploring Transformer on only one scale seems less powerful. To address these problems, we propose a concise and powerful Pyramid Clustering Transformer Network (PCTN) for lightweight image super-resolution. PCTN is constructed by multiple stacked Pyramid Clustering Transformer Blocks (PCTBs). Each PCTB is composed of two parts: Information Recurrent Distillation Block (IRDB) and Pyramid Clustering Transformer Attention (PCTA). Specifically, we first employ an IRDB to extract local structural information effectively, which can generate a larger receptive field without introducing additional learnable parameters. On the heels of that, we design a PCTA covering the most informative and relevant locations globally at different scales with less GPU memory and computational cost. Extensive experiments show that the proposed PCTN outperforms state-of-the-art lightweight SR algorithms in terms of visual quality and computational complexity.

Abstract:
Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class. However, its practical application is limited by its inability to retrieve classes absent from the training set. To address this limitation, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR), where model performance is evaluated on unseen categories. Traditional SBIR primarily focuses on narrowing the domain gap between photo and sketch modalities. However, in the zero-shot setting, the model not only needs to address this cross-modal discrepancy but also requires a strong generalization capability to transfer knowledge to unseen categories. To this end, we propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps. By incorporating two negative samples from different modalities, the approach prevents positive features from becoming disproportionately distant from one modality while remaining close to another, thus enhancing inter-class separability. We also propose a Relation-Aware Meta-Learning Network (RAMLN) to obtain the margin, a hyper-parameter of cross-modal quadruplet loss, to improve the generalization ability of the model. RAMLN leverages external memory to store feature information, which it utilizes to assign optimal margin values. Experimental results obtained on the extended Sketchy and TU-Berlin datasets show a sharp improvement over existing state-of-the-art methods in ZS-SBIR.
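
To show what a quadruplet loss with two cross-modal negatives can look like, here is a minimal sketch. The exact form of the paper's relation-aware loss is not given in the abstract, so this formulation (two hinge terms sharing one positive distance), the Euclidean distance, and the fixed `margin` are assumptions; in the described framework the margin would instead be produced by RAMLN.

```python
import math
from typing import List

def euclidean(u: List[float], v: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def quadruplet_loss(anchor_sketch: List[float], positive_photo: List[float],
                    negative_photo: List[float], negative_sketch: List[float],
                    margin: float = 0.2) -> float:
    """Hinge loss pushing the positive closer than a negative from each modality."""
    d_pos = euclidean(anchor_sketch, positive_photo)
    photo_term = max(0.0, d_pos - euclidean(anchor_sketch, negative_photo) + margin)
    sketch_term = max(0.0, d_pos - euclidean(anchor_sketch, negative_sketch) + margin)
    return photo_term + sketch_term

# Hypothetical 2-D embeddings.
loss = quadruplet_loss(
    anchor_sketch=[0.0, 0.0],
    positive_photo=[0.1, 0.0],
    negative_photo=[1.0, 0.0],    # already far away: contributes nothing
    negative_sketch=[0.2, 0.0],   # too close: violates the margin
)
print(loss)
```

In this toy case only the sketch-modality negative is penalized, which illustrates why using a negative from each modality matters: a triplet loss with the photo negative alone would report zero loss even though a same-modality negative sits almost on top of the anchor.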

Abstract:
Image matting is a fundamental task in computer vision that focuses on the precise separation of foreground objects from their backgrounds in images. This process is essential for numerous applications, such as image editing, film production, and augmented reality. Traditional methods often rely on a trimap, a predefined region that helps to distinguish the foreground from the background. However, generating an accurate trimap requires the provision of raw alpha matte, which is labor-intensive and prone to drawing errors, limiting the applicability of the relevant methods in practical applications. In this paper, a novel inpaint and outpaint synergy matting approach (IOSM) is proposed to generate masks for image matting tasks without supervision by iterating through the inpaint and outpaint processes, avoiding the dependence on trimap. Specifically, the inpaint process is able to eliminate the false-positive regions present in the initial mask, while the outpaint process reduces the false-negative regions by expanding the pixels to the outer regions. The above process reduces inaccurate regions in the initial mask by means of adversarial updating, providing accurate target information for the subsequent matting stage. By iteratively combining these two processes, a more accurate mask is generated, which is then fed into the mask-to-matte (MTM) module along with the original image to obtain the final alpha matte. This approach allows for the seamless integration of the mask with the original image, improving the matting task and resulting in higher-quality matte outcomes. Experimental results demonstrate that IOSM outperforms other mainstream methods on the AIM-500, Distinctions-646 and PPM-100 datasets. Our project page is available at: https://github.com/xuecheng990531/IOSM

Abstract:
We propose the Style-Preserving Generator (SPG) to generate synthetic license plate data to train License Plate Recognition (LPR) models, and compare the performance with the same models trained on real-world data. The proposed SPG can edit the characters on real-world license plates while maintaining their original styles, allowing synthetic license plate data to be generated with user-specified characters. We can therefore synthesize license plates with desired characters to effectively alleviate the data attribute imbalance and privacy issues associated with real-world license plates. To the best of our knowledge, this work is the first study to present the making of synthetic LP data by proposing a novel text-editing approach tailor-made for LP data, that is the proposed SPG. The SPG consists of a transformer, a source encoder, a source style encoder, a character mask decoder, a target generator, and a target discriminator. Given a source license plate image and a specified text as input, these components collaborate to compute the self- and cross-attention embeddings, predict character masks, and generate a synthetic license plate in the source style but with source characters replaced by the specified characters. We adopt a two-phase training scheme. Phase 1 training uses synthetic data only, but Phase 2 training uses synthetic and real-life data. To showcase the effectiveness of the SPG, we introduce a new benchmark dataset, the LP-2025 (License Plate 2025), which alleviates the limitations of existing datasets and presents new challenges for license plate recognition and generative models. We validate SPG performance on the LP-2025 dataset and other benchmark datasets and compare it against state-of-the-art text-editing approaches.

Abstract:
Hyperspectral videos contain a larger number of spectral bands, providing extensive spectral information and material identification capabilities. This advantage enables hyperspectral trackers to achieve superior performance in challenging tracking scenarios. However, the limited availability of hyperspectral training data and the inability of existing algorithms to fully exploit hyperspectral information restrict the tracking performance. To address this issue, a novel framework, Spectral Prompt-based Hyperspectral Object Tracking (SP-HST), is proposed. SP-HST leverages an RGB tracking network as the main branch for feature extraction and tracking, which accounts for more than 98% of the total parameters and remains frozen during the training procedure. Additionally, the Spectral Prompt Learning (SPL) branch, comprising multiple lightweight prompt blocks, is introduced to generate complementary spectral representations as the prompt. The prompts contain abundant spectral information from hyperspectral data, enhancing the discriminative ability of features within the main branch. Furthermore, Complementary Weight Learning (CWL) is employed to calculate the importance of spectral information from different prompts, enabling the features for hyperspectral object tracking to contain more spectral information that is absent in the features of the main branch. By utilizing the spectral information as the prompt, the number of trainable parameters is less than 2% of that in the tracking network, and convergence is reached in 12 training epochs. Extensive experiments demonstrate the superiority of SP-HST, achieving new state-of-the-art tracking performance, with an AUC score of 71.3% on the HOTC dataset and a DP@20P score of 96.7% on the IMEC25 dataset. The code will be released at https://github.com/lgao001/SP-HST

Abstract:
Recently, advancements in text-to-image synthesis and image customization have drawn significant attention. Among these technologies, foreground-driven image synthesis models aim to create diverse scenes for specific foregrounds, showing broad application prospects. However, existing foreground-driven diffusion models struggle to accurately generate scenes with layouts that align with user intentions. To address these challenges, we propose CompCraft, a training-free framework that enhances layout control and improves overall generation quality in current models. First, CompCraft identifies that the failure of existing methods to achieve effective control arises from the excessive influence of fully denoised foreground information on the generated scene. To address this, we propose a foreground regularization strategy that modifies the foreground-related attention maps, reducing their impact and ensuring better integration of the foreground with the generated scene. Then, we propose a series of inference-time layout guidance strategies to guide the image generation process with the user’s finely customized layouts. These strategies equip current foreground-driven diffusion models with accurate layout control. Finally, we introduce a comprehensive benchmark to evaluate CompCraft. Both quantitative and qualitative results demonstrate that CompCraft can effectively generate high-quality images with precise customized layouts, showcasing its strong capabilities in practical image synthesis applications.

Abstract:
With the rapid advances in wireless communication and IoT platforms, it is increasingly difficult to analyze relevant multi-modal data distributed across geographically diverse and heterogeneous platforms. One promising approach is to rely on federated learning to build compact cross-modal hash codes. However, existing federated learning methods easily exhibit degenerative performance in the global model due to the distributed data being derived from diverse domains. In addition, directly forcing each client to adopt the same global parameters as local parameters, without effective local training, significantly reduces the performance of each client. To overcome these challenges, we propose a novel federated adversarial cross-modal hashing, called Dual Prototypes-based personalized Federated Adversarial (DP-FeAd), which provides iterated training of shared dual prototypes. Specifically, aiming to expand local hashing models beyond their knowledge realms, DP-FeAd enables participating clients to engage in cooperative learning through two constructions: cluster prototypes and unbiased prototypes, instead of the traditional global prototypes, ensuring both generalization and stability. Specifically, the cluster prototypes are derived from local class-level prototypes and adversarially trained with local approximate hash codes to align their distributions. The unbiased prototypes are averaged from cluster prototypes and integrated into the training of local hashing models to maintain consistency across different local class-level prototypes further. The experiments conducted on two benchmark datasets demonstrate that our proposed method significantly enhances the performance of deep cross-modal hashing models in both IID (Independent and Identically Distributed) and non-IID scenarios.

Abstract:
Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Visual-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at https://github.com/xiaolaohuuu/GAH

Abstract:
Denoising is a critical task in computer vision, especially in challenging environments such as extreme low-light conditions. However, research on denoising raw video in extreme low-light environments is notably scarce, as are datasets specifically designed for this purpose. The primary challenge in denoising lies in balancing noise removal with detail preservation. Excessive denoising can cause the loss of fine details, while insufficient denoising fails to adequately suppress noise, leading to degraded performance in downstream tasks such as object detection and recognition. To address these limitations, we present a novel raw video dataset consisting of noisy-clean paired sequences captured under extreme low-light conditions, featuring diverse scenes and extended frames. In addition, we propose an efficient denoising framework tailored for this challenging scenario. Our approach combines shallow denoising, deformable convolution-based temporal alignment, and spatiotemporal attention to effectively reduce noise while preserving texture and temporal consistency. A texture-preserving loss is also proposed to prevent over-smoothing and retain fine details. Our proposed method outperforms state-of-the-art denoising models in terms of PSNR and SSIM on both synthetic and real-world extreme low-light videos, while exhibiting minimal side effects and preserving sharp details, as demonstrated by the quantitative results.

Abstract:
Recent advances in computer vision have introduced generalized image segmentation models applicable across various domains. However, camouflaged object detection (COD) remains a particularly challenging task that requires dedicated approaches, owing to the minimal visual distinction between objects and their backgrounds. The degree of camouflage is influenced by three critical aspects (color, texture, and edge), requiring methodologies that address these simultaneously. Efforts to detect camouflaged objects have continually focused on these aspects. In this study, we propose the Tri-Aspects Network (TANet) for COD, designed to overcome the limitations of existing approaches that primarily focus on a single aspect. TANet emphasizes differences in color, texture, and edge to detect camouflaged objects. It consists of an ensemble of two independent networks that learn from the color differences extracted through color conversion and the textural features extracted using Bayar convolution filters. Each independent network enhances high-level features extracted from the input image through the Context Enhancement Block (CEB) and maximizes the difference between the background and camouflaged objects during reconstruction with the prediction mask using the Multi-scale Edge Refinement Block (MERB). The results from these two networks are then ensembled. Additionally, by using an erosion kernel to ensure that the prediction mask’s edge closely matches the ground truth edge, more fine-grained predictions can be achieved. TANet shows outstanding results compared to existing models in three key evaluation metrics (S-measure, E-measure, and weighted F-score), demonstrating its contribution to the field.

Abstract:
Open-world person re-identification aims to train a model on source domains and generalize well on unseen domains. Existing domain generalizable person re-identification methods primarily employ the equality training paradigm to train the model on multi-source domains. However, in open-world scenarios, domain imbalance often causes a domain bias issue that leads to sub-optimal generalization ability, which is seriously overlooked. In this paper, we propose a Multi-model Synergy Perception (MSP) framework equipped with an Asynchronous Training Paradigm (ATP) on biased domains to maintain the domain balance for exploring the domain-invariant features. With the philosophy of divide and conquer, we divide the biased source domains into multiple debiased sub-source domains and employ a multi-network architecture to learn these sub-source domains in parallel. Additionally, to better generalize knowledge across these sub-source domains, we propose a Structure Synergy Perception (SSP) module that constructs the feature relationship distribution for each sub-domain and aligns them to map the unique knowledge to each other. Furthermore, considering the consistency of sub-source domains, we propose a Synergy Distillation Perception (SDP) module to improve both the semantic and domain generalization ability of the model. The main idea of SDP is to use the center-guided soft label and the part-based triplet graph to distill each submodel, which helps the network explore domain-invariant representations of images. Extensive experiments demonstrate that our method outperforms state-of-the-art methods for open-domain person ReID.

Abstract:
Model compression methods such as pruning and quantization have been proposed to facilitate the deployment of convolutional neural networks (CNNs) on resource-constrained devices. Existing methods aim to combine the two for simultaneous improvement in compression ratio and runtime efficiency. However, most of the joint methods adopt linear tandem structures. Due to the lack of a unified framework, different optimization directions result in suboptimal solutions, especially when the compression ratio is extremely high. In this paper, we propose a novel adaptive curvature-based compression (ACC) method, which achieves a dual-depth unified joint optimization of pruning and quantization. In the first depth, we unify the pruning and quantization criteria using mean curvature, which leverages the discrete nature of image data and the continuum theory of differential geometry. In the second depth, we replace the traditional training process in the joint pruning-quantization method with curvature-aware knowledge distillation (CKD), unifying the two-stage approach into a simple but powerful parallel step. Our method is effective and interpretable by utilizing inherent properties to promote the understanding of information distribution and the importance of feature maps. Extensive experiments on multiple advanced benchmarks and diverse downstream task datasets have validated the superiority and generalizability of our ACC. Notably, we can achieve a 1.05% Top-1 accuracy improvement over the baseline under an extreme compression ratio of 454.55×, outperforming existing state-of-the-art (SOTA) methods.

Abstract:
Visible and infrared cross-modal re-identification tasks often encounter significant modal discrepancies, which undermine the effectiveness of feature extraction and compromise the reliability of similarity metrics. These discrepancies pose a substantial challenge for accurately matching data across different modalities. To address these issues, we propose a novel approach centered on the maximum mean metric discrepancy (MMMD). We leverage kernel-based statistical techniques to effectively capture and quantify the disparities in cross-modal metrics, providing a robust framework for aligning metrics from different modalities. Building upon the foundation of MMMD, we develop the metric discrepancy harmonization (MDH) method. This method integrates a temperature-controlled optimization technique designed to enhance metric alignment across various modal configurations, ensuring more consistent and reliable performance. By focusing on metric alignment, our approach enhances the accuracy of cross-modal re-identification tasks. Comprehensive evaluations on the LLCM, RGBN300, and SYSU-MM01 datasets demonstrate that our approach achieves state-of-the-art performance.
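
As a point of reference for the kernel-based statistics mentioned above, the following minimal sketch computes the textbook maximum mean discrepancy (MMD) between two sets of feature vectors with a Gaussian kernel. It is not the MMMD proposed in the paper, whose exact definition the abstract does not give. The names `rbf_kernel`, `mmd2`, and `sigma`, as well as the toy feature vectors, are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float = 1.0) -> float:
    """Biased estimate of the squared MMD between the samples xs and ys."""

    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        total = sum(rbf_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for features from two modalities (assumed values).
visible = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
infrared = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

gap = mmd2(visible, infrared)      # large when the two sets differ
no_gap = mmd2(visible, visible)    # zero for identical sets
```

A value near zero indicates that the two samples are hard to tell apart under the chosen kernel, which is the sense in which such a statistic can quantify a modality gap.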

Abstract:
Recent work indicates that video recognition models are vulnerable to adversarial examples, posing a serious security risk to downstream applications. However, current research has primarily focused on adversarial attacks, with limited work exploring defense mechanisms. Furthermore, due to the spatial-temporal complexity of videos, existing video defense methods face issues of high cost, overfitting, and limited defense performance. Recently, diffusion-based adversarial purification methods have achieved robust defense performance in the image domain. However, due to the additional temporal dimension in videos, directly applying these diffusion-based adversarial purification methods to the video domain suffers from performance and efficiency degradation. To achieve an efficient and effective video adversarial defense method, we propose the first diffusion-based video purification framework to improve video recognition models’ adversarial robustness: VideoPure. Given an adversarial example, we first employ temporal DDIM inversion to transform the input distribution into a temporally consistent and trajectory-defined distribution, covering adversarial noise while preserving more video structure. Then, during DDIM denoising, we leverage intermediate results at each denoising step and conduct guided spatial-temporal optimization, removing adversarial noise while maintaining temporal consistency. Finally, we input the list of optimized intermediate results into the video recognition model for multi-step voting to obtain the predicted class. We investigate the defense performance of our method against state-of-the-art black-box, gray-box, and adaptive attacks on benchmark datasets and models. Compared with other adversarial purification methods, our method overall demonstrates better defense performance against different attacks. Moreover, our method can be applied as a flexible defense plugin for video recognition models. Our code is available at https://github.com/deep-kaixun/VideoPure
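
The multi-step voting in the final stage can be pictured with a small sketch. The abstract does not say how the votes are combined, so this example assumes a plain majority vote over the class predicted at each denoising step. The function name `majority_vote` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(step_predictions: Sequence[int]) -> int:
    """Return the most frequent class label among the per-step predictions."""
    if not step_predictions:
        raise ValueError("need at least one prediction")
    return Counter(step_predictions).most_common(1)[0][0]


# Assumed classes predicted from five optimized intermediate results.
per_step_classes = [2, 2, 5, 2, 5]
predicted = majority_vote(per_step_classes)
```

A soft vote that averages class probabilities would be an equally plausible reading of the abstract.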

Abstract:
Neural image compression (NIC) usually adopts a predefined family of probabilistic distributions as the prior of the latent variables, and meanwhile relies on entropy models to estimate the parameters for the probabilistic family. More complex probabilistic distributions may fit the latent variables more accurately, but also incur higher complexity of the entropy models, limiting their practical value. To address this dilemma, we propose a solution to decouple the entropy model complexity from the prior distributions. We use a finite set of trainable priors that correspond to samples of the parametric probabilistic distributions. We train the entropy model to predict the index of the appropriate prior within the set, rather than the specific parameters. Switching between the trained priors further enables us to embrace a skip mode into the prior set, which simply omits a latent variable during the entropy coding. To demonstrate the practical value of our solution, we present a lightweight NIC model, namely FastNIC, together with the learning of switchable priors. FastNIC outperforms BPG with encoding and decoding complexities below 12 and 10 KMACs/pixel, respectively. We also implanted the switchable priors into state-of-the-art NIC models and observed improved compression efficiency with a significant reduction of entropy coding complexity.
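
The idea of selecting a prior by index, with a skip mode, can be sketched as follows. This is an illustration under stated assumptions rather than the FastNIC implementation: the prior set `PRIORS`, the reserved index `SKIP`, and the sample latents are all hypothetical, and the per-latent indices are given by hand instead of being predicted by an entropy model.

```python
import math
from typing import Dict, List

# Hypothetical convention: index 0 means the latent is not coded at all.
SKIP = 0

# Hypothetical prior set: each entry is a probability mass function over
# quantized latent values.
PRIORS: Dict[int, Dict[int, float]] = {
    1: {-1: 0.05, 0: 0.90, 1: 0.05},                    # sharply peaked
    2: {-2: 0.10, -1: 0.20, 0: 0.40, 1: 0.20, 2: 0.10}, # wider
}


def code_length_bits(symbol: int, prior_index: int) -> float:
    """Ideal code length -log2 p(symbol) under the chosen prior, 0 if skipped."""
    if prior_index == SKIP:
        return 0.0
    return -math.log2(PRIORS[prior_index][symbol])


def total_bits(latents: List[int], indices: List[int]) -> float:
    """Sum of ideal code lengths when latent i is coded with prior indices[i]."""
    return sum(code_length_bits(s, i) for s, i in zip(latents, indices))


latents = [0, 0, 2, -1]
indices = [SKIP, 1, 2, 2]   # stand-in for the entropy model's predictions
bits = total_bits(latents, indices)
```

Because each latent only needs an index into a fixed table, the per-latent work reduces to a lookup, which is one way to read the claimed reduction in entropy coding complexity. How a skipped latent is reconstructed at the decoder is not stated in the abstract.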

Abstract:
Large-scale image-text pre-trained models have shown promising transferability to various downstream tasks. Video-text retrieval benefits from this by transferring pre-trained CLIP to the video-text domain. Although these pre-trained models have shown impressive performance, full fine-tuning becomes prohibitively expensive as the size of these pre-trained models grows rapidly. To solve this, parameter-efficient tuning methods have been proposed, and prompt tuning is one of the most promising directions. However, existing prompt tuning methods fall short in performance due to the lack of cross-modal interaction and prompt reliability assurance. To address these issues, we propose an effective and efficient Agent-based Control Prompt Tuning method (AbC-PT) for parameter-efficient video-text retrieval. The proposed AbC-PT enjoys several merits. Firstly, we design a parameter-efficient agent decoder with a carefully designed consistent attention mechanism to effectively capture video temporal information, mine contextual texts and perform cross-modal interaction between them. Secondly, we introduce two different sets of prompts, i.e., the vanilla prompt prepended to the input tokens and the concept prompt as the agent of the agent decoder. In addition, to ensure cross-modal semantic consistency of the concept prompt, we design a semantic consistency constraint loss. Thirdly, we devise a parameter-free prompt controller for adaptively calibrating each vanilla prompt based on its semantics in a data-driven way. Extensive experiments on five challenging benchmarks demonstrate that our method not only outperforms state-of-the-art parameter-efficient tuning methods, but even surpasses full fine-tuning with 0.46% parameter overhead.

Abstract:
Due to the extremely low latency, events have recently been utilized to complement lost information in motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which usually conflicts with the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named network of event-based motion deblurring with stereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs that contain both blurry images and the concurrent event stream. Specifically, the coarse spatial alignment of the blurry image and the event stream is first implemented with a cross-modal stereo-matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with stereo event and intensity cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods. The code and dataset are available at https://mingyuan-lin.github.io/St-ED_web/

Abstract:
Accurately predicting driver attention is crucial for enhancing advanced driving assistance systems and autonomous vehicles, attracting increasing research interest. Most existing approaches, rooted in general, task-free saliency detection, adopt data-driven paradigms to correlate bottom-up environmental situations with attention distributions. However, they often overlook the complex top-down task-driven aspects of driver attention that are fundamental for the safe navigation of driving tasks, leading to limitations in handling real-world scenarios. In this paper, we take an initial step to explore and introduce BKnet, a Behavior-aware Knowledge-embedded model that innovatively integrates driving behaviors and empirical knowledge. Specifically, inspired by the human long-term cognitive process, we introduce a novel knowledge memory mechanism. It dynamically associates varied traffic scenarios with consistent driving behaviors, fostering the generation of robust behavior-aware empirical knowledge representations. To this end, BKnet facilitates a nuanced and comprehensive simulation of drivers’ attention mechanisms, driven synergistically by both top-down and bottom-up processes. Additionally, we further contribute to the field by collecting a novel Behavior-Aware Driver Attention (BADA) dataset. To the best of our knowledge, BADA is the first attention dataset explicitly incorporated into real-world driving behavior tasks from multiple drivers. Lastly, comprehensive experiments underscore BKnet’s superiority over existing state-of-the-art approaches and validate the effectiveness and necessity of integrating behavior-aware knowledge into driver attention prediction.

Abstract:
Evolutionary computation (EC)-based neural architecture search (NAS) has achieved remarkable performance in the automatic design of neural architectures. However, the high computational cost associated with evaluating searched architectures poses a challenge for these methods, and a fixed form of learning rate (LR) schedule means greater information loss on diverse searched architectures. This paper introduces an efficient EC-based NAS method to solve these problems via an innovative meta-learning framework. Specifically, a meta-learning-rate (Meta-LR) scheme is used through pretraining to obtain a suitable LR schedule, which guides the training process with lower information loss when evaluating each individual. An adaptive surrogate model is designed through an adaptive threshold to select the potential architectures in a few epochs and then evaluate the potential architectures with complete epochs. Additionally, a periodic mutation operator is proposed to increase the diversity of the population, which enhances the generalizability and robustness. Experiments on the CIFAR-10, CIFAR-100, and ImageNet1K datasets demonstrate that the proposed method achieves high performance comparable to that of many state-of-the-art peer methods, with lower computational cost and greater robustness. The source code of our method is released at https://github.com/Cipher2k29/MetaNAS.

Abstract:
In recent years, Siamese network-based visual tracking methods have gained popularity and success in terms of efficiency and accuracy. However, typical Siamese trackers utilize two independent weight-sharing streams to describe the exemplar and search region without any interaction between the two streams. As a result, such trackers employ only shallow cross-correlation or correlation filters to obtain the final information association, which neglects the deep interaction between the exemplar and search region and may reduce the discriminative power of the trackers. To address this issue, we propose a novel multi-parallel interactive transformer-based (MPIT) tracking framework to introduce sufficient interaction so that the two streams can guide the prediction heads to focus on the target more easily. Unlike recent one-stream transformer-based trackers that directly concatenate template and search tokens to perform joint feature learning, our multi-parallel interactive framework introduces a transmission band module to deliver global information for both the exemplar and the search region with low computational cost. Moreover, to integrate dynamic information, we incorporate temporal level extraction into the tracking framework to increase the variety of the templates. The experimental results show that the proposed MPIT method achieves a remarkable tracking speed of 136 frames per second (FPS) while attaining performance better than or comparable to that of state-of-the-art trackers.

Abstract:
The rolling shutter (RS) effect is commonly encountered when capturing images with CMOS sensors and frequently results in distortions that degrade image quality and hinder subsequent processing tasks. Traditional rolling shutter correction (RSC) techniques often struggle with complex motion and occlusions. Although deep learning-based methods have advanced the field, they have yet to fully overcome the challenges posed by occlusions. To bridge this gap, we present DM-RSC, a generative framework that utilizes diffusion models (DMs) for multi-frame RSC. Unlike previous works, DM-RSC begins by simultaneously generating bidirectional flow motions between RS frames and their corresponding global shutter (GS) frames. This process not only facilitates the correction of RS frames but also identifies occluded regions. Subsequently, DM-RSC capitalizes on the multi-frame information within occlusion regions and the generative process of DMs to manage occlusion issues, generating the desired GS image. DM-RSC achieves state-of-the-art performance on the leading benchmarks Carla-RS, Fastec-RS, and BS-RSC, marking a leap forward in the domain of RSC. Our code is available at: https://github.com/lhaippp/DM-RSC.

Abstract:
We propose a spatio-temporal adaptive deep video compression scheme, which is capable of intelligently adjusting the spatial resolution and temporal frame rate for content adaptive compression, with the aim of pursuing enhanced rate-distortion performance. In particular, a neural network-based spatio-temporal adaptation network is integrated into the deep video coding paradigm, enabling the adaptive determination of the optimal rescaling ratios for compression, leading to the further reduction of spatial and temporal redundancies. Moreover, learning-based modules for rescaling parameter determination are incorporated into the spatio-temporal adaptation network. The proposed scheme can be easily plugged into, and seamlessly collaborate with the existing deep video coding frameworks. Experimental results demonstrate that, compared to the original neural video codecs, the proposed method achieves significant bitrate savings in terms of both PSNR and MS-SSIM.

Abstract:
Owing to its efficiency in storage and computation, deep cross-modal hashing has gained much attention in large-scale multimedia retrieval. Current deep hashing employs the probability outputs of a likelihood function, i.e., Sigmoid or Cauchy, to quantify the semantic similarity between samples in a common Hamming space. However, the inherent weakness of the Sigmoid and Cauchy likelihood functions in gradient optimization leaves hashing models unable to exactly describe the Hamming ball, which indicates the absolute semantic boundary among classes, thereby causing high neighborhood ambiguity. In this paper, by analyzing the likelihood function from the perspective of similarity metric learning, we propose the novel Deep Discriminative Boundary Hashing framework (DDBH) to learn a discriminative embedding space that separates neighbors and non-neighbors well. Specifically, by introducing the remapping strategy and the base-point adaptive selection, the boundary-preserving loss based on the adjustable likelihood function is proposed to project data points with small gradients to regions with large gradients and give larger gradients for hard samples, facilitating better separation among classes. Meanwhile, to learn class-dependent binary codes, the class-wise quantization loss is designed to heuristically transfer class-wise prior knowledge to the binary quantization, significantly improving the discriminative capability of compact discrete codes. Comprehensive experiments on three benchmark datasets show that our proposed DDBH framework outperforms other representative deep cross-modal hashing methods. The corresponding code is available at https://github.com/QinLab-WFU/DDBH

Abstract:
Event-based motion deblurring aims at reconstructing a sharp image from a single blurry image and its corresponding events triggered during the exposure time. Existing methods learn the spatial distribution of blur from blurred images, then treat events as temporal residuals and learn blurred temporal features from them, and finally restore clear images through spatio-temporal interaction of the two features. However, due to the high coupling of detailed features such as the texture and contour of the scene with blur features, it is difficult to directly learn effective blur spatial distribution from the original blurred image. In this paper, we provide a novel perspective, i.e., employing the blur indication provided by events, to instruct the network in spatially differentiated image reconstruction. Due to the consistency between event spatial distribution and image blur, event spatial indication can learn blur spatial features more simply and directly, and serve as a complement to temporal residual guidance to improve deblurring performance. Based on the above insight, we propose an event-based motion deblurring network consisting of a Multi-Scale Event-based Double Integral (MS-EDI) module designed from temporal residual guidance, and a Blur-Aware Filter Prediction (BAFP) module to conduct filter processing directed by spatial blur indication. The network, after incorporating spatial residual guidance, has significantly enhanced its generalization ability, surpassing the best-performing image-based and event-based methods on synthetic, semi-synthetic, and real-world datasets. In addition, our method can be extended to blurry image super-resolution and achieves impressive performance. Our code is available at: https://github.com/ChenYichen9527/MBNet

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from limited samples while preventing catastrophic forgetting. With the increasing distribution of learning data across different clients and privacy concerns, FSCIL faces a more realistic scenario where few learning samples are distributed across different clients, thereby necessitating a Federated Few-Shot Class-Incremental Learning (FedFSCIL) scenario. However, this integration faces challenges from the non-IID problem, which affects model generalization and training efficiency. The communication overhead in federated settings also presents a significant challenge. To address these issues, we propose Class-Aware Prompting for Federated Few-Shot Class-Incremental Learning (FedCAP). Our framework leverages pre-trained models enhanced by a class-wise prompt pool, where shared class-wise keys enable clients to utilize global class information during training. This unifies the understanding of base class features across clients and enhances model consistency. We further incorporate a class-level information fusion module to improve class representation and model generalization. Our approach requires very little parameter transmission during model aggregation, ensuring communication efficiency. To our knowledge, this is the first study to explore the scenario of FedFSCIL. Consequently, we designed comprehensive experimental setups and made the code publicly available.

Abstract:
Spatiotemporal action detection localizes the action instances along both spatial and temporal dimensions, by identifying action start time and end time, action class, and object (e.g., actor) bounding boxes. It faces two primary challenges: 1) varying durations of actions and inconsistent tempo of action instances within the same class, and 2) modeling complex object interactions, which are not well handled by previous methods. For the former, we develop the coarse-to-fine attention module, which employs an efficient dynamic time warping to make a coarse estimation of action frames by eliminating context-agnostic features, and further adopts the attention mechanism to capture the first-order object relations within those action frames. This results in a finer granularity of action estimation. For the latter, we design the ternary high-order hypergraph neural networks, which model the spatial relation, the motion dynamics, and the high-order relations of different objects across frames. This encourages the positive relation of the objects within the same actions, while suppressing the negative relation of those in different actions. Therefore, we present a Coarse-to-Fine Hypergraph Network, abbreviated as CFHN, for spatiotemporal action detection, by considering the object local context, the first-order object relations, and the high-order object relations together. It combines the spatiotemporal first-order and high-order features along the channel dimension to obtain satisfactory detection results. Extensive experiments on several benchmarks including AVA, JHMDB-21, and UCF101-24 demonstrate the superiority of the proposed approach.
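
Dynamic time warping is what lets two instances of the same action be compared despite different tempos. The sketch below is the classic quadratic-time algorithm on one-dimensional sequences, not the efficient variant the abstract refers to. The function name `dtw_distance` and the toy sequences are assumptions for illustration.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two sequences using absolute difference."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same "action" performed at half speed and at full speed.
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
fast = [0.0, 1.0, 2.0]
same_action_cost = dtw_distance(slow, fast)
```

The two sequences differ in length but trace the same values in the same order, so the warped alignment cost is zero, whereas a frame-by-frame comparison would not even be defined.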

Abstract:
Perceptual image encryption serves as a pivotal mechanism for delegating processing while ensuring the visual security of image data. The robustness of such encryption schemes is traditionally evaluated through cryptanalysis techniques, yet these approaches heavily rely on manual labor and prerequisite knowledge of the encryption algorithms. Recently, some works have attempted to reveal visual content from perceptually encrypted images based on CNN architectures. However, it is still difficult for these increasingly complex methods to reveal informative visual details. In this study, we focus on the extraction and utilization of inherent hierarchical features within the input image itself to significantly advance the field. To achieve this, we present a novel Progressive Fusion Attack Network (PFAN) to fully explore the hierarchical features. PFAN incorporates multiple subbranches, forming a progressive fusion structure that facilitates informative hierarchical feature representations and offers robust model fault tolerance. To enhance the reconstruction of encryption-induced distortions, we incorporate a Multiscale Feature Extraction Module (MFEM) that captures robust hierarchical features across various scales. Meanwhile, a Hierarchical Feature Fusion Module (HFFM) is designed to adaptively integrate and highlight the optimal feature representations, further optimizing the visual content reconstruction process. Extensive experimental evaluation demonstrates that PFAN exhibits remarkable agnosticism towards different perceptual encryption schemes and encryption strengths, achieving superior performance. Furthermore, PFAN outperforms state-of-the-art CNN-based image restoration methods in terms of effectiveness and generalizability.

Abstract:
Video Question Answering (VideoQA) is a challenging task in the vision-language field. Due to the time-consuming and labor-intensive labeling process of the question-answer pairs, fully supervised methods are no longer suitable for the current increasing demand for data. This has led to the rise of zero-shot VideoQA, and some works propose to adapt large language models (LLMs) to assist zero-shot learning. Despite recent progress, the inadequacy of LLMs in comprehending temporal information in videos and the neglect of temporal differences, e.g., the different dynamic changes between scenes or objects, remain insufficiently addressed by existing attempts in zero-shot VideoQA. In light of these challenges, a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot video question answering is proposed in this paper. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information. Then a temporal-guided mixture-of-experts module is proposed to further learn the temporal differences presented in different videos. It enables the model to effectively improve the capacity of generalization. Our proposed method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, notably improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA while remaining competitive with other methods in the fully supervised setting. The codes and models developed in this study will be made publicly available at https://github.com/qyx1121/T-MoENet.

Abstract:
Currently, most few-shot object detection (FSOD) methods apply the two-stage training strategy, which first requires training on abundant base classes and then transfers the learned prior knowledge to the novel stage. However, due to the inherent imbalance between the base and novel classes, the trained model tends to have a bias toward recognizing novel classes as base ones when they are similar. To address this problem, we propose an adversarial feature training (AFT) strategy aimed at effectively calibrating the decision boundary between novel and base classes to alleviate classification confusion in FSOD. Specifically, we introduce the Classification Level Fast Gradient Sign Method (CL-FGSM), which leverages gradient information from the classifier module to generate adversarial samples with extra feature attention. By attacking the high-level features, we create adversarial feature samples. These samples, generated by CL-FGSM, are then combined with clean high-level features in a suitable range of proportions to train the few-shot detector. In this way, the novel model is forced to learn extra class-specific features that improve the robustness of the classifier to establish a correct decision boundary, which avoids confusion between base and novel classes in FSOD. Extensive experiments demonstrate that our proposed AFT strategy effectively calibrates the classification decision boundary to avoid classification confusion between base and novel classes and significantly improves the performance of FSOD. Our code is available at https://github.com/wutianxu/AFT.
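
The feature-level attack and the mixing step can be illustrated with the standard fast gradient sign method applied to a feature vector. This sketch is not CL-FGSM, because the "extra feature attention" is not described in the abstract and is omitted here. It assumes a toy linear softmax classifier, and every name in it (`fgsm_feature`, `mix`, `W`, `eps`, `ratio`) is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W: Matrix, f: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, f)) for row in W]


def ce_loss(W: Matrix, f: List[float], label: int) -> float:
    """Cross-entropy of a linear softmax classifier on feature f."""
    return -math.log(softmax(logits(W, f))[label])


def feature_grad(W: Matrix, f: List[float], label: int) -> List[float]:
    """Gradient of the cross-entropy w.r.t. the feature: W^T (p - onehot)."""
    p = softmax(logits(W, f))
    p[label] -= 1.0
    return [sum(W[k][j] * p[k] for k in range(len(W))) for j in range(len(f))]


def fgsm_feature(W: Matrix, f: List[float], label: int, eps: float) -> List[float]:
    """Standard FGSM step on the feature: f + eps * sign(gradient)."""
    g = feature_grad(W, f, label)
    return [x + eps * ((gi > 0) - (gi < 0)) for x, gi in zip(f, g)]


def mix(clean: List[float], adv: List[float], ratio: float) -> List[float]:
    """Blend clean and adversarial features; ratio is the adversarial share."""
    return [(1.0 - ratio) * c + ratio * a for c, a in zip(clean, adv)]


W = [[1.0, -1.0], [-1.0, 1.0]]   # assumed two-class classifier weights
clean = [0.5, -0.5]              # assumed high-level feature of class 0
adv = fgsm_feature(W, clean, label=0, eps=0.1)
train_feature = mix(clean, adv, ratio=0.5)
```

The perturbed feature moves toward the decision boundary, so its loss is higher than that of the clean feature, and the blended feature sits in between.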

Abstract:
Event cameras are bio-inspired sensors with diverse advantages, including high temporal resolution and minimal power consumption. Therefore, event cameras enjoy a wide range of applications in computer vision, among which event keypoint detection plays a vital role. However, repeatable event keypoint detection remains challenging because the lack of temporal inter-frame interaction leads to descriptors with limited temporal consistency, which restricts the ability to perceive keypoint motion. Besides, detectors learned from single-scale features are not suitable for event keypoints with significant motion speed differences in high-speed scenarios. To deal with these problems, we propose a novel Spatio-Temporal Pyramid Keypoint Detection Network (STPNet) for event cameras via a temporally consistent descriptor learning (TCL) module and a spatially diverse detector learning (SDL) module. The proposed STPNet enjoys several merits. First, the TCL module generates temporally consistent descriptors for specific keypoint motion patterns. Second, the SDL module produces spatially diverse detectors for applications in high-speed motion scenarios. Extensive experimental results on three challenging benchmarks show that our method notably outperforms state-of-the-art event keypoint detection methods. Specifically, our STPNet can outperform the best event keypoint detection method by 0.21px in reprojection error on Event-Camera, 4% in IoU on N-Caltech101, 0.13px in reprojection error on HVGA ATIS Corner and 5.94% in matching accuracy on DSEC.

Affiliations: State Key Laboratory of Electromechanical Integrated Manufacturing of High-Performance Electronic Equipment and the Center for Complex Systems, School of Mechano-Electronic Engineering, Xidian University, Xi’an, Shaanxi, China; School of Information Science and Technology, Northwest University, Xi’an, Shaanxi, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China

Abstract:
In recent years, unsupervised linear regression has attracted attention for its ability to directly capture the mapping relationship between samples and targets. However, existing algorithms can only utilize limited information from a single view, which often leads to unsatisfactory results. To address this problem, we propose a regression clustering model based on multi-view information fusion, called Scalable Multi-view Regression Clustering. This model consists of two parts: intra-view information fusion and inter-view information fusion. In the first part, to capture the local correlations among samples, we propose constructing view-specific bipartite graphs. Unlike traditional single-view and multi-view clustering algorithms, we treat the weights of the bipartite graph as additional features of the samples, thereby directly incorporating the local manifold structure of the samples at the feature level. Furthermore, since the original features of the samples also contain valuable information, we perform unsupervised linear regression separately on the samples represented by the original features and those represented by the bipartite graph weights in each view. The results are then integrated in a weighted manner. In the second part, we propose adaptively weighting the clustering results from each view to capture complementary information across views, thereby enhancing clustering performance. This strategy not only avoids the bipartite graph alignment issue in multi-view clustering but also enables clustering with linear time complexity, making it effective for handling large-scale data. An iterative optimization algorithm is developed to update all variables alternately. Experiments conducted on benchmark datasets demonstrate the superiority of our proposed model.

Abstract:
Existing CNN-based and Transformer-based methods have demonstrated remarkable performance in low-level visual tasks, including image deblurring. These methods generally capture spatial features only in a single way, such as by stacking blocks of CNNs and Transformers, resulting in inadequate utilization of spatial context. To address this issue, we propose a new feature aggregation scheme for image deblurring, named Omni-Deblurring. The core of our omni-deblurring is the omni-range context block, which enables explicitly aggregating the local-range, regional-range, and global-range features in a compact manner. With this design, it can bring a wider receptive field for modeling the contextual features. Extensive experiments on synthetic and real-world blurry datasets demonstrate the effectiveness of our proposed method in both quantitative and qualitative evaluations. Furthermore, the quality of our deblurring model is evaluated in the task of object detection, and the mean Average Precision (mAP) metric increases by 10% across all classes compared with other deblurring models. Code is available at https://github.com/yaowli468/Omni-Deblurring.

Abstract:
The success of modern deep learning algorithms requires large amounts of training data, which leads to high computational and storage costs. Dataset Distillation (DD) is a rising research field that resolves this issue by synthesizing a compact training dataset from a large one. Recent gradient matching DD methods have achieved remarkable results. However, these methods typically utilize weak models for DD performance improvement, while well-trained models are often considered inferior choices due to their lower performance. Conversely, our study provides new insights into the role of well-trained models in DD, particularly under high-storage budget scenarios. We identify a previously overlooked design principle—a positive correlation between model capability and storage budget. Based on this principle, we propose Drop2Sparse, an approach that randomly sparsifies well-trained models to create efficient models for various storage budget scenarios. Drop2Sparse concurrently infuses significant model diversity and regularization effects into DD, outperforming previous state-of-the-art methods by up to 3.8% on CIFAR and 3.6% on ImageNet-subset. Moreover, our method exhibits remarkable cross-architecture generalization and achieves promising results even under challenging scenarios, such as using an extremely reduced model pool or highly accelerated training.

Abstract:
Remote photoplethysmography (rPPG) uses RGB facial videos to measure cardiac signals. It holds promise for future applications in telemedicine, affective computing, liveness-based face anti-spoofing, driver monitoring, etc. Supervised deep learning methods have been leading in performance but are severely limited by data availability, as recording face videos with ground truth physiological signals is expensive. Recent self-supervised methods aim to solve the data issue but struggle to learn robust features from data in challenging scenarios. These scenarios are characterized by overwhelming environmental noise caused by head movements, illumination variations, and recording device changes. We propose RS+rPPG, a novel contrastive method that effectively leverages a large set of eleven rPPG priors, enabling strong self-supervision even with challenging data. RS+rPPG comprehensively exploits intra-data and inter-data information present in videos via diverse augmentations and learning constraints. We extensively experimented on seven rPPG datasets and demonstrated that RS+rPPG can outperform state-of-the-art supervised methods without using any labels. Additionally, we demonstrate the high generalization capability, demographic fairness, and mixed-data stability of our method.

Abstract:
The pretrain-finetune paradigm has led to the release of numerous model weights. Against this background, model merging is becoming increasingly popular, as it enables a model to handle multiple tasks by fusing model weights from these tasks, without the need for labeled data, additional training, or high training costs. Despite its great potential, model merging suffers from severe performance degradation due to the interference among model weights. Moreover, existing model merging methods (i.e., static merging) commonly provide a single set of merging coefficients for all the input samples and do not distinguish layers based on the severity of weight interference, which may not be the optimal solution. In this paper, we propose MoW-Merging, a dynamic model merging method based on Mixture of Weights. First, we apply a gating network to adaptively generate merging coefficients depending on the input samples, realizing sample-wise dynamic merging and automated classifier selection. The gating network is lightweight and is trained with only a small amount of unlabeled data. Further, we utilize a weight similarity metric to judge the severity of weight interference of each layer and apply suitable merging methods to different layers. The proposed MoW-Merging shows plug-and-play capabilities and can be seamlessly combined with various model merging methods to greatly boost their performance. The effectiveness of MoW-Merging is validated by comprehensive experiments on various classical and newly-established benchmarks under multiple settings. The code is available at https://github.com/harveyhuang18/Mixture_of_Weights.
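
The contrast between static and sample-dependent merging can be sketched as follows. This is only an illustrative toy, not the MoW-Merging implementation: the linear gate, the softmax normalization of coefficients, the flat weight vectors, and all function names are assumptions made for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_coefficients(sample, gate_weights):
    # Hypothetical lightweight gate: one linear score per task model,
    # normalized so the merging coefficients sum to 1.
    scores = [sum(w * x for w, x in zip(row, sample)) for row in gate_weights]
    return softmax(scores)

def merge_weights(task_weights, coefficients):
    # Weighted sum of the per-task weight vectors, parameter by parameter.
    n_params = len(task_weights[0])
    return [sum(c * w[j] for c, w in zip(coefficients, task_weights))
            for j in range(n_params)]

# Two task-specific weight vectors and a gate with one row per task.
task_weights = [[1.0, 0.0], [0.0, 1.0]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

coeffs_a = gate_coefficients([5.0, 0.0], gate_weights)  # leans toward task 0
coeffs_b = gate_coefficients([0.0, 5.0], gate_weights)  # leans toward task 1
merged_a = merge_weights(task_weights, coeffs_a)
merged_b = merge_weights(task_weights, coeffs_b)
```

A static method would use one fixed coefficient vector for every input; here the two samples yield different merged weights.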

Abstract:
Action Quality Assessment (AQA) is a task aimed at automatically and fairly evaluating the level of movement execution, which holds significant importance for action understanding. Previous methods, while adept at extracting video features, often neglect human regions. This leads to a limited capability to discern subtle action differences and results in a lack of interpretative depth. In this work, we propose a Pose-Guided Transformer framework, termed PGT, for assessing action quality more accurately. Essentially, this framework incorporates pose information to augment human region features during video feature extraction. The PGT framework incorporates two critical modules: a pose-guided attention layer and a global-local feature extractor. The former is designed to isolate body-specific features, effectively minimizing background noise, while the latter further delineates fine-grained features by utilizing decomposed information from various human body parts. The proposed PGT achieves significant results on various challenging AQA benchmarks. Notably, on the MTL-AQA dataset, it achieves a Spearman’s rank correlation of 0.9630. Additionally, on the AQA-7 dataset, our approach achieves an average Spearman’s rank correlation of 0.8673, further validating the effectiveness of our method. These findings demonstrate that our framework excels in the task of action quality assessment, providing a viable solution for accurate and fair evaluation of movement execution.
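
For readers unfamiliar with the reported metric, Spearman's rank correlation is the Pearson correlation computed on ranks, so it rewards predicted scores that order the performances the same way the judges do. The sketch below is a generic stdlib implementation with average ranks for ties; the function names are illustrative and it is not taken from the paper's evaluation code.

```python
import math

def average_ranks(values):
    # 1-based ranks; tied values share the mean of the ranks they occupy.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(predicted, ground_truth):
    return pearson(average_ranks(predicted), average_ranks(ground_truth))
```

Any monotonically increasing relation between predicted and ground-truth scores gives a correlation of 1, even if the relation is nonlinear.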

Abstract:
Interactive segmentation has gained significant attention due to its applications in human-computer interaction and data annotation. To address the challenge of target scale variations in interactive segmentation, we propose a novel multi-scale token fusion algorithm. This algorithm selectively fuses only the most important tokens, enabling the model to better capture multi-scale characteristics in important regions. To further enhance the robustness of multi-scale token selection, we introduce a token learning algorithm based on contrastive loss. This algorithm fully utilizes the discriminative information between target and background multi-scale tokens, effectively improving the quality of selected tokens. Extensive benchmark testing demonstrates the effectiveness of our approach in addressing multi-scale issues. The code and data have been made publicly available at https://github.com/hahamyt/mst.

Affiliations: Guangzhou Institute of Technology, Xidian University, Guangzhou, China; School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; National Center for Applied Mathematics, Chongqing Normal University, Chongqing, China; Faculty of Engineering and Information Technology, Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, Australia; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

Abstract:
Most existing underwater object tracking (UOT) methods utilize generic tracking algorithms, which lack applicability to underwater tracking scenarios. Moreover, these algorithms primarily emphasize minimizing interference from various challenging factors to prevent target drift, but pay less attention to strategies for mitigating target drift once it occurs. To alleviate the above problems, we propose a simple, effective, and UOT-focused adaptive trajectory correction framework, named ATCTrack. From the perspective of tracking failure, this methodology aims to promptly identify and rectify unreasonable target drift through accurate trajectory coordinate correction and trajectory template updates. Additionally, to mitigate the adverse effects of potential erroneous corrections, we implement an adaptive strategy that corrects only significant target drift, allowing for self-correction within a certain margin. Finally, we introduce an adaptive underwater image enhancement technique to improve the underwater image quality and maintain the trajectory’s stability and clarity. Our tracker achieves state-of-the-art performance on the currently prevalent UOT benchmarks compared to other trackers.

Abstract:
Object detection generally involves two main components: classification and regression. Despite the impressive performance achieved by recent refinement localization works, there is still room for improvement due to the limitations of current multistep regression strategies and task misalignment. To overcome these challenges, we propose a novel offset-aware progressive regression detector (OAPR) comprising an offset-aware head and a progressive regression predictor. Initially, we develop a head network incorporating our innovative plug-and-play offset-aware module. By utilizing the offset from one task to guide feature learning in another task, we intuitively achieve task alignment to address feature misalignment. We subsequently employ a progressive regression predictor to locate objects. In first-step regression, the aim is to identify a region within the object rather than the object itself. This is followed by second-step regression to locate the object precisely. Extensive experiments conducted on MS COCO datasets demonstrate the superior performance of our OAPR compared with recent state-of-the-art detectors with various backbones, including ATSS (~3.0 AP), GFL (~2.0 AP), BorderDet (~2.0 AP), and VFNet (~1.0 AP). Our code will be released.

Abstract:
The dynamic hand gesture is an emerging biometric trait that has attracted the attention of researchers due to its rich physiological and behavioral characteristics. The previous studies primarily focused on extracting and utilizing the physiological characteristics, while ignoring the rich behavioral characteristics contained in hand gesture movements. The dynamic hand gesture authentication performance will be improved if behavioral characteristics can be effectively extracted and fused with physiological characteristics for authentication. In addition, existing methods still suffer from insufficient feature extraction capabilities and low efficiency in extracting behavioral characteristics from complex dynamic hand gestures. To address these issues, this paper first proposes multiscale dynamic hand gesture (MDHG) super-images to represent the behavioral characteristics of hand gestures, containing sufficient local and global motion cues. Furthermore, for the super-images, this paper proposes a two-stream network consisting of a spatiotemporal feature extraction backbone and an identity-aggregation module to fully extract and fuse the physiological and behavioral characteristics of hand gestures, which significantly improves the accuracy of dynamic hand gesture authentication. Extensive experiments on two benchmark datasets, SCUT-DHGA and HandLogin, show that our method achieves superior performance with fewer parameters and FLOPs than other networks, validating the effectiveness, generalizability, and security of our proposed method.

Abstract:
This study focuses on the Embodied Complex-Question Answering task, in which an embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models, without sufficient understanding of the environment. Considering multi-step planning, a framework for formulating plans in a sequential manner is proposed in this paper. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for the current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. In addition, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.

Abstract:
The rapid growth of generative models has led to a new direction in steganography called generative steganography (GS). It allows message-to-image generation without the need for a carrier image. Recently, generative steganography methods have been proposed using generative adversarial networks (GANs) and Flow models. On the one hand, methods that use GANs to generate stego images struggle to fully recover the hidden message because the networks are not reversible. On the other hand, methods based on Flow encounter a problem where the images they create might not look real, mainly because the network has limitations in being reversible. Diffusion models fulfill network reversibility while generating high-quality images. However, although the framework of existing diffusion models is reversible, hidden message recovery is not perfectly reversible, so the recovered message is similar to but not exactly the same as the hidden one. Existing diffusion models are typically trained for one-directional image generation tasks, so they face some problems when dealing with bi-directional steganography tasks. If pre-trained diffusion models are directly used to generate stego images, exact secret data extraction through the diffusion process cannot be achieved. In this paper, we present an improved generative steganography scheme based on the diffusion model (GSD), which conceals secret data in the frequency domain of random noise to enhance the security and accuracy of steganography, and re-trains the denoising diffusion implicit model (DDIM) for steganography, called StegoDiffusion. During the training of StegoDiffusion, random noise is injected into the clean natural images and then trained through the forward diffusion process to obtain the re-trained StegoDiffusion. Our proposed GSD scheme achieves a 100% extraction accuracy for hidden secret data with a payload of 1 bit-per-pixel (bpp) in a single channel, and generates high-quality stego images in PNG format.

Abstract:
Dunhuang mural inpainting aims to fill in the missing regions of damaged murals with realistic content. The denoising diffusion probabilistic model (DDPM) has made great strides in semantic generation and shown promising results in image inpainting. However, three potential challenges prevent existing diffusion-based methods from restoring the Dunhuang murals: 1) effective visual information cannot be accurately extracted due to historical reasons, with most of the pixels being faded; 2) there is a semantic discrepancy between damaged and visible regions in the inpainting results; and 3) the original structure and style of the damaged regions cannot be adequately restored. To this end, we propose a novel adversarial diffusion model for mural inpainting, which consists of: 1) a mural enhancement module named pixel-enhanced fire-controlled pulse-coupled neural network (PEFCPCNN), designed to enhance faded pixels to accurately extract the visual features of the mural; 2) a novel adversarial diffusion framework that optimizes the sampling prediction of the mural over time steps; and 3) line drawing and different loss functions to constrain the reconstructed content to approximate the structure and style of the original mural. The variational transform layer (VTL) and multi-scale contextual feature aggregation (MCFA) module are proposed to reconstruct content that is structurally coherent and texturally reasonable. Experiments on the Dunhuang mural dataset demonstrate that the proposed method outperforms state-of-the-art methods in terms of both the semantic reasonableness and global semantic consistency of inpainting content.

Abstract:
Knowledge distillation is a mainstream model compression algorithm that transfers knowledge from a larger model (teacher) to a smaller model (student) to improve the performance of the student. Despite many efforts, existing methods mainly investigate the consistency between instance-level feature representations or predictions, which neglects the category-level information and the difficulty of each sample, leading to undesirable performance. To address these issues, we propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD). It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers in a contrastive learning fashion, which can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories, contributing to discriminative category centers and better classification results. Besides, we introduce a novel preview strategy to dynamically determine how much the student should learn from each sample according to its difficulty. Different from existing methods that treat all samples equally and curriculum learning that simply filters out hard samples, our method assigns a small weight to hard instances as a preview to better guide the student training. Extensive experiments on several challenging datasets, including CIFAR-100, ImageNet and Pascal VOC, demonstrate the superiority of our method over state-of-the-art methods.
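
As background for the teacher-student setup, the sketch below shows the classic temperature-scaled distillation loss (KL divergence between softened teacher and student predictions) together with a per-sample weighted average, which is where a difficulty-dependent weight could enter. It is not the PCKD objective: the contrastive terms are omitted, the weights are supplied by the caller rather than derived from any rule in the paper, and all names and the temperature value are assumptions.

```python
import math

def softmax_t(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the classic formulation.
    p_t = softmax_t(teacher_logits, temperature)
    p_s = softmax_t(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return temperature ** 2 * kl

def weighted_batch_loss(per_sample_losses, weights):
    # Weighted mean; a small weight lets a hard sample contribute only a little.
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / sum(weights)

# Toy batch: the second sample has a larger loss and is treated as "hard".
losses = [1.0, 3.0]
equal = weighted_batch_loss(losses, [1.0, 1.0])
preview = weighted_batch_loss(losses, [1.0, 0.5])
```

Giving the hard sample a small but nonzero weight keeps it in the objective while reducing its pull, in contrast to filtering it out entirely.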

Abstract:
Point cloud registration is a fundamental task in the fields of computer vision and robotics. Recent advancements in transformer-based methods have demonstrated enhanced performance in this domain. However, the standard attention mechanisms employed in these approaches tend to incorporate numerous points of low relevance, and therefore struggle to focus their attention weights on sparse yet meaningful points. This inefficiency leads to limited local structure modeling capabilities and quadratic computational complexity. To overcome these limitations, we propose the Point Tree Transformer (PTT), a novel transformer-based approach for point cloud registration that efficiently extracts comprehensive local and global features while maintaining linear computational complexity. The PTT constructs hierarchical feature trees from point clouds in a coarse-to-dense manner, and introduces a novel Point Tree Attention (PTA) mechanism. This mechanism adheres to the tree structure to facilitate the progressive convergence of attended regions toward salient points. Specifically, each tree layer selectively identifies a subset of relevant points with the highest attention scores, and subsequent layers focus attention on areas of significant relevance, derived from the child points of the selected point set. The feature extraction process additionally incorporates coarse point features that capture high-level semantic information, thus facilitating local structure modeling and the progressive integration of multiscale information. Consequently, the PTA enables the model to focus on essential local structures and extract intricate local information while maintaining linear computational complexity. Extensive experiments conducted on the 3DMatch, ModelNet40, and KITTI datasets demonstrate that our method outperforms state-of-the-art methods in terms of performance. The code for our method is publicly available at https://github.com/CGuangyan-BIT/PTT.
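
The coarse-to-dense selection described above can be illustrated with a two-level toy: score the coarse nodes, keep the top-k, and attend only over the children of the kept nodes. This is a minimal sketch under assumptions, not the PTA mechanism itself: the dot-product scoring, the two-level tree, the plain-list data layout, and every function name are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def tree_attention(query, coarse_keys, children, fine_keys, fine_values, k):
    # Level 1: score coarse nodes and keep the k most relevant ones.
    selected = top_k_indices([dot(query, key) for key in coarse_keys], k)
    # Level 2: attend only over the child points of the selected nodes.
    candidates = [c for node in selected for c in children[node]]
    weights = softmax([dot(query, fine_keys[c]) for c in candidates])
    dim = len(fine_values[0])
    output = [sum(w * fine_values[c][j] for w, c in zip(weights, candidates))
              for j in range(dim)]
    return candidates, weights, output

coarse_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
children = [[0, 1], [2, 3], [4, 5]]  # child point indices of each coarse node
fine_keys = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0],
             [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
fine_values = fine_keys  # reuse keys as values to keep the toy small

cands, attn, out = tree_attention([1.0, 0.0], coarse_keys, children,
                                  fine_keys, fine_values, k=1)
```

In this toy, the query scores three coarse nodes and two child points instead of all six fine points; with a fixed k and a bounded number of children per node, the per-query cost no longer grows with the total number of points.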

Abstract:
In recent years, significant advancements in deep learning have expanded its application in a variety of computer vision tasks. However, the performance of these models heavily depends on the quality of the training data. While existing crowd counting methods yield satisfactory results on labeled datasets, they often face serious domain adaptation issues when applied to unlabeled data, the latter being more common in real-world scenarios. To mitigate this issue, we present a novel Scene-adaptive Unsupervised Crowd Counting (SUCC) framework aimed at enhancing the domain adaptability of counting models. This framework integrates a bi-branch attention network (BBA-Net) that leverages human prior knowledge to generate highly accurate density and anchor maps, which are essential for producing intermediate domain data as pseudo labels. Our SUCC framework eliminates the need for laborious manual annotation within the new data domain. Instead, it continually performs adaptive intermediate domain generation and model fine-tuning, establishing a beneficial feedback loop. Comprehensive experiments on multiple video crowd counting datasets show that our SUCC framework significantly improves domain generalizability. Furthermore, it exhibits satisfactory model stability and algorithm interpretability, attributes that are vital for the practical deployment of counting applications. The open-source code and model weights can be found on GitHub.

Affiliations: National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, and the Marshall Laboratory of Biomedical Engineering, Shenzhen University, Shenzhen, China; School of Mechanical and Control Engineering, Baicheng Normal University, Baicheng, China; Department of Ultrasound Medicine, South China Hospital, Shenzhen University, Shenzhen, China; College of Management, Shenzhen University, Shenzhen, China; School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China

Abstract:
Colorectal polyp segmentation in endoscopic images is very important for the prevention and treatment of colorectal cancer. Because of the high similarity between polyps and their surrounding tissues, most deep neural network (DNN) based methods often struggle with blurry boundaries and result in inaccurate segmentation. In this paper, we propose a Boundary-guided Feature-aligned Network (BFNet) for polyp segmentation by taking a boundary prediction task as an auxiliary. Firstly, BFNet aggregates multi-layer features extracted from the backbone to mine boundary cues. Secondly, a flexible feature aggregation (FFA) module is used at each layer to adaptively fuse cross-layer features for coarse polyp localization. In the FFA module, considering the spatial misalignment between features at different layers, the feature of the high layer is aligned to and fused with that of the current layer using the deformable convolution and flexible merge block. After that, a boundary-guided feature enhancement (BFE) module is applied to refine the localization at boundary areas. In the BFE module, the boundary information is extracted and highlighted in both channel and spatial dimensions using the attention mechanisms with the assistance of boundary cues. By applying deep supervision to the BFE modules, BFNet can produce accurate polyp segmentation. Experimental results show that our BFNet outperforms 14 state-of-the-art DNN-based polyp segmentation methods on both in-domain and out-of-domain tests.

Abstract:
Accurate segmentation of medical ultrasound images is crucial for guiding treatment decisions and assessing intervention effectiveness. The challenge of segmenting lesions in ultrasound images arises from factors such as low contrast, high speckle noise, artifacts, and blurred boundaries. Furthermore, this complexity varies significantly among lesions in different cases. While methods based on Convolutional Neural Networks (CNNs) and Transformers have shown promising results in this field, each approach possesses distinct advantages and limitations. To address these challenges, we propose a novel Frequency-aware Interaction Network (FINet). At the core of our FINet lies the proposed Multi-scale Frequency-aware Self-attention (MFS) module, which effectively captures multi-scale feature information within the self-attention layer. This enables our network to model both local and global features, capitalizing on the strengths of both CNNs and Transformers. Additionally, a frequency-aware network is introduced to learn the interactions between spatial locations in the frequency domain to enhance detailed feature representation such as edges. Furthermore, we present a collaborative interactive decoder network, in which a Selective Feature Interaction (SFI) module is proposed to facilitate the semantic and boundary feature interaction, resulting in more precise segmentation outcomes. Experimental results on four medical ultrasound image datasets show the superiority of our FINet over other state-of-the-art segmentation methods. More importantly, our model achieves an excellent trade-off between performance and computational efficiency.

Abstract:
In this paper, we propose the Prompt-based Variational Adapter (PVA), a novel approach designed to fine-tune pre-trained Vision-Language Models (VLMs) in data-imbalanced scenarios. Unlike existing methods that focus primarily on pairwise alignment of visual-text relationships during fine-tuning, PVA relaxes explicit pairwise constraints and emphasizes the harmonization of visual and text modality distributions, enhancing generalization and cross-modal understanding. To realize this harmonization, we develop two variational adapters, which are appended separately to the visual and text encoders. These adapters transform the feature embeddings into latent spaces that implicitly align with the corresponding modality distributions. We then adopt a divide-and-conquer strategy, dividing classes into data-abundant and data-limited sets to reduce prediction bias. Within each set, we independently fine-tune the models by incorporating both the model’s original general knowledge and specialized knowledge gained from training samples. Extensive experiments across two data-imbalanced scenarios validate the superiority of our approach, establishing a new state-of-the-art on popular benchmarks.

Abstract:
Zero-shot learning (ZSL) aims to transfer the knowledge learned in the seen classes to the unseen classes through semantic knowledge. However, to ensure the model’s versatility on different datasets, existing methods divide the image into blocks of the same size, resulting in the loss of information between attributes. More importantly, existing methods ignore that not every image contains all attributes corresponding to that class. In this paper, we propose a progressive feature reconstruction network, called PFRN. PFRN consists of an attribute relation sub-net and an attention-based feature reconstruction sub-net. Specifically, the attribute relation sub-net first adopts the attribute-related region module to obtain the attribute-related regions in the visual features, which are input to the attribute relation discovery module to find the relationships between attributes. The attention-based feature reconstruction sub-net obtains the fine-grained features based on attributes by the attribute attention module and uses the feature reconstruction module to randomly drop some attributes to reconstruct the new visual features of the missing attributes. The new visual features are fed back into the network for training. Finally, the attribute information learned by the attribute relation sub-net is fused to the visual embedding learned by the attention-based feature reconstruction sub-net, and the ideal visual semantic interaction is performed with the semantic vector classified by ZSL. Extensive experiments on three ZSL benchmark datasets demonstrate the significant generalization performance of our proposed method over the state-of-the-art methods.

Abstract:
Video-based person re-identification (Re-ID) aims at associating the video sequences of the same person across multiple cameras. The ubiquitous appearance misalignment poses a major obstacle for video person Re-ID. Existing alignment-based methods generally rely on off-the-shelf semantic parsing models to locate visible human parts, which ignore identifiable personal belongings and cannot handle various interferences (e.g., pedestrian detection errors and occlusions) in video clips. In this work, we propose a novel framework termed Context-Aided Semantic-Aware Self-Alignment (CSSA) for video-based person Re-ID. First, we propose to jointly learn pixel-level part-aligned representations and semantic-aligned global-level representations in an end-to-end manner. Unlike most existing approaches that depend on prior pose information for part estimation, CSSA can locate different body parts and achieve pixel-level semantic alignment without extra human topology semantics. Second, a Context-Aided Region Enhancement (CARE) module is proposed to efficiently highlight macro-visual patterns associated with the target pedestrian and suppress noise caused by factors like background clutter and occlusions. Third, we propose a Semantic-Aware Global Feature Alignment (SGFA) method for generating pair-wise semantic-aligned global representations, which play an essential role in both the training and inference phases. Extensive experimental results on multiple challenging benchmarks indicate the superiority and effectiveness of the proposed CSSA.

Affiliations: China Academy of Electronics and Information Technology, Beijing, China; National Satellite Meteorological Center and the Innovation Center for FengYun Meteorological Satellite, China Meteorological Administration, Beijing, China; Department of Hydraulic Engineering, State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China; Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates

Abstract:
Object detection is a fundamental task in computer vision that involves accurately locating and classifying objects within images or video frames. In remote sensing, this task is particularly challenging due to the high resolution, multi-scale features, and diverse ground object characteristics inherent in satellite and UAV imagery. These challenges necessitate more advanced approaches for effective object detection in such environments. While deep learning methods have achieved remarkable success in remote sensing object detection, they typically rely on large amounts of labeled data. Acquiring sufficient labeled data, particularly for novel or rare objects, is both challenging and time-consuming in remote sensing scenarios, limiting the generalization capabilities of existing models. To address these challenges, few-shot learning (FSL) has emerged as a promising approach, aiming to enable models to learn new classes from limited labeled examples. Building on this concept, few-shot object detection (FSOD) specifically targets object detection challenges in data-limited conditions. However, the generalization capability of FSOD models, particularly in remote sensing, is often constrained by the complex and diverse characteristics of the objects present in such environments. In this paper, we propose the Generalization-Enhanced Few-Shot Object Detection (GE-FSOD) model to improve the generalization capability in remote sensing FSOD tasks. Our model introduces three key innovations: the Cross-Level Fusion Pyramid Attention Network (CFPAN) for enhanced multi-scale feature representation, the Multi-Stage Refinement Region Proposal Network (MRRPN) for more accurate region proposals, and the Generalized Classification Loss (GCL) for improved classification performance in few-shot scenarios. GE-FSOD demonstrates superior robustness and accuracy in remote sensing FSOD tasks through these enhancements. 
Extensive experiments on the DIOR and NWPU VHR-10 datasets show that our model achieves state-of-the-art performance, significantly advancing the field of few-shot object detection in remote sensing. The source code is available at https://github.com/leenamx/GE-FSOD.

Abstract:
As an emerging application in autonomous driving, multi-agent collaborative perception has recently received significant attention. Despite promising advances from previous efforts, several unavoidable challenges that cause performance bottlenecks remain, including the single-frame detection dilemma, communication redundancy, and a defective collaboration process. To this end, we propose SCOPE++, a versatile collaborative perception framework aggregating spatio-temporal information across on-road agents to tackle these issues. We introduce four components in SCOPE++ for robust collaboration by seeking a reasonable trade-off between perception performance and communication bandwidth. First, we devise a context-aware information aggregation to capture valuable semantic cues in the temporal context and enhance the current local representation of the ego agent. Second, an exclusivity-aware sparse communication is introduced to filter perceptually unnecessary information from collaborators and transmit complementary features relative to the ego agent. Third, we present an importance-aware cross-agent collaboration to flexibly incorporate semantic representations of spatially critical locations across agents. Finally, a contribution-aware adaptive fusion is designed to integrate multi-source representations based on dynamic contributions. Our framework is evaluated on multiple LiDAR-based collaborative detection datasets in real-world and simulated scenarios, and comprehensive experiments show that SCOPE++ outperforms state-of-the-art methods on all datasets.

Abstract:
Motion in-betweening that aims to generate motion transitions between known keyframes plays a significant role in the 3D character animation industry. However, generating long-term transitions is highly challenging due to the non-stationary nature and considerable spatio-temporal uncertainty of motions. Leading transformer-based methods operate at a single temporal scale while they overlook the spatial interactions among joints and temporally hierarchical structure of motions, leading to the generation of over-smoothed and weak transitions. In this paper, we propose a novel spatio-temporal framework for the motion in-betweening task. First, a spatial transformer is introduced to capture the per-frame spatial dependencies among joints, enhancing the capability of the model to generalize across diverse action types. Furthermore, to alleviate over-smoothing, a multi-scale temporal transformer is designed to generate dynamic and realistic transitions by capturing the hierarchical structure of motions, which includes both global motion trends and local subtle variations. Extensive experiments on the LAFAN1 dataset demonstrate that our method achieves state-of-the-art performance compared to existing methods. In addition, the corresponding ablation studies and sensitivity analyses verify the effectiveness of the proposed spatio-temporal framework.

Abstract:
Animating quadruped 3D objects, such as chairs and tables, typically involves three steps in the traditional computer graphics pipeline: Rigging, Skinning, and Retargeting. Commonly, prevailing methods for each specific step are conceived in isolation. For rigging and skinning steps, optimization-based methods are typically used, but these approaches tend to be slow and susceptible to variations in 3D mesh surfaces. For the retargeting step, the obtained results often fall short of expectations, especially when dealing with dissimilar source and target skeletons, leading to issues like joint twisting. The devised procedure is also time-intensive, resulting in a complex final pipeline. To this end, we present a unified framework, termed Mesh2Animation, providing an end-to-end solution to these challenges. In Mesh2Animation, a learning-based method is proposed for quadruped 3D skeleton estimation. We introduce both skeleton-level and mesh-level losses, allowing the rigging, skinning, and retargeting steps to be optimized simultaneously. Specifically, a general predicted estimation from the rigging step initializes the skeleton, making the skinning step faster and more accurate, which in turn leads to better results in the retargeting step. Finally, the rigging, skinning, and retargeting processes are optimized simultaneously under static and temporal constraints. Additionally, we can construct a novel animating dataset termed ShapeNet2Animation (SN2Animation) based on the proposed method, which shows potential application for pose transfer. Qualitative and quantitative results on SN2Animation, ShapeNet, Object3D and ModelNet10 datasets for animation demonstrate that our method achieves competitive performance and shows promising generalization ability on quadruped 3D objects. Our project is available at https://sites.google.com/view/mesh2animation.

Abstract:
Detecting small, oriented objects in remote sensing images remains a bottleneck for prevailing detection paradigms. The discriminative cues essential for detecting small instances are often inaccessible owing to the restrained spatial extent and poor visual responses, which further compromises the model and necessitates reliance on low-level patterns for identification and localization, exacerbating vulnerability to structural distortions and intra-class confusion, especially in complex scenarios. To address these challenges, we devise a Semantic Differentiation (SemDiff) framework for oriented small object detection in remote sensing images. Starting with randomly initialized category-specific units, we deliver a differentiation pipeline where distinctive features steer the evolution of these embeddings via a tailored differentiation loss. Afterwards, these class-aligned vectors function as dynamic kernels, infusing hierarchical representations with semantic understanding. Moreover, an improved centerness metric that is more accommodating to size-constrained instances is introduced. Building upon this, we design an instance-level recalibration mechanism to regulate the training process, thereby ensuring adequate optimization even for exceptionally small instances. By integrating semantics in an explicit fashion, our SemDiff efficiently improves the discriminative capabilities of hierarchical features, thereby revitalizing foreground responses and alleviating semantic-level ambiguity. On the challenging small object detection benchmarks SODA-A and Tiny-DOTA, our approach outstrips prevailing single-stage paradigms by a substantial margin, and achieves performance competitive with its two-stage counterparts, but with a speed advantage. Codes will be available at https://github.com/shaunyuan22/SemDiff.

Abstract:
Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Hence, simply characterizing sample embeddings while neglecting to mine the relation among multiple cues leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework to mine the relation among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationship among multiple cues from four facial features (i.e., 3D landmarks, head poses, gazes, FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by regarding the inter-cue and intra-cue embeddings as positive pairs while considering embeddings from other samples as negatives. In this way, the proposed MCCL framework leverages the relationships among the facial features and the variations among samples to enhance the process of multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on the public datasets DAIC-WOZ, CMDC, and E-DAIC demonstrate not only that our model outperforms advanced depression estimation methods but also that the discriminative representations of facial behaviors provide potential insights about depression. Our code is available at: https://github.com/xkwangcn/MCCL.git
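
The positive/negative pairing described above can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions: the abstract does not give the exact loss, so the function name `info_nce`, the temperature value, and the toy embeddings are hypothetical, and the official repository should be consulted for the actual formulation. Here the inter-cue embedding of sample i is the anchor, the intra-cue embedding of the same sample is its positive, and the intra-cue embeddings of the other samples act as negatives.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] form the positive pair (assumed here to be
    the inter-cue and intra-cue embeddings of sample i); positives[j] with
    j != i serve as negatives. The temperature is an assumed hyperparameter.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    losses = []
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two samples with 2-D embeddings (made-up numbers).
inter_cue = [[1.0, 0.0], [0.0, 1.0]]
intra_cue = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(inter_cue, intra_cue)
loss_mismatched = info_nce(inter_cue, intra_cue[::-1])
```

With correctly paired embeddings the loss is small; pairing each anchor with another sample's embedding makes it larger, which is the signal contrastive training exploits.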

Abstract:
We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder’s integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing.
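
The remark that concatenated embeddings manifest as a residual on the original attention output follows from a general property of softmax attention, which can be checked numerically. The sketch below is not the CatVersion implementation: it uses a single query, omits the usual scaling by the square root of the dimension, and all function names and numbers are hypothetical. It shows that attending over the concatenation of the original and the extra Keys/Values equals the original output plus a gated residual.

```python
import math


def attention(q, keys, values):
    """Single-query softmax attention; returns (output, softmax normaliser)."""
    weights = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, z


def concat_attention_as_residual(q, keys, values, new_keys, new_values):
    """Rewrite attention over concatenated Keys/Values as base output + residual."""
    base, z_base = attention(q, keys, values)
    extra, z_extra = attention(q, new_keys, new_values)
    # Share of the total attention mass that falls on the added embeddings.
    lam = z_extra / (z_base + z_extra)
    residual = [lam * (e - b) for e, b in zip(extra, base)]
    return [b + r for b, r in zip(base, residual)], residual


# Toy example with made-up numbers.
q = [0.5, -0.2]
K, V = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
K_new, V_new = [[0.3, 0.3]], [[-1.0, 0.5]]

direct, _ = attention(q, K + K_new, V + V_new)
via_residual, residual = concat_attention_as_residual(q, K, V, K_new, V_new)
```

Both routes give the same output, so the extra embeddings only shift the original attention result by the residual term.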

Abstract:
Condition-based video generation aims to create video content based on given information that describes specific subjects. However, most existing works can only utilize a single condition to guide the denoising process, thereby limiting their applicability to specific scenarios. Although some works attempt to accommodate multiple conditions within one framework, they often require multiple encoders, leading to inefficiencies in integrating multi-condition features. In this work, we present a framework that, with the support of the proposed Unified Adapter (UniAdapter), enables simultaneous multi-condition control of video generation within a single model. To effectively merge these conditions, we propose a novel Probabilistic Multi-condition Concatenator (PMC) module, which employs a unified encoder to accommodate multiple conditions and concatenate condition features at the pixel level to achieve fine-grained control. Following the PMC module, we employ 2D down-sampling blocks to refine features for injection into the Video Diffusion Model (VDM). Moreover, our UniAdapter is designed to be model-agnostic and compatible with any U-Net-based VDM, offering a versatile solution for improving video generation quality. Experimental results on public benchmarks UCF-101 and MSR-VTT show that our method achieves superior results in both quantitative and qualitative evaluations.

Abstract:
Supervised cross-modal hashing has gained significant attention due to its efficiency in reducing storage and computation costs while maintaining rich semantic information. Despite substantial progress in generating compact binary codes, two key challenges remain: (1) insufficient utilization of labels to mine and fuse multi-grained semantic information, and (2) unreliable cross-modal interaction, which does not fully leverage multi-grained semantics or accurately capture sample relationships. To address these limitations, we propose a novel method called Bi-direction Label-Guided Semantic Enhancement for cross-modal Hashing (BiLGSEH). To tackle the first challenge, we introduce a label-guided semantic fusion strategy that extracts and integrates multi-grained semantic features guided by multi-labels. For the second challenge, we propose a semantic-enhanced relation aggregation strategy that constructs and aggregates multi-modal relational information through bi-directional similarity. Additionally, we incorporate CLIP features to improve the alignment between multi-modal content and complex semantics. In summary, BiLGSEH generates discriminative hash codes by effectively aligning semantic distribution and relational structure across modalities. Extensive performance evaluations against 18 competitive methods demonstrate the superiority of our approach. The source code for our method is publicly available at: https://github.com/yileicc/BiLGSEH.

Abstract:
In the field of autonomous driving, a variety of sensor data types exist, each representing different modalities of the same scene. Therefore, it is feasible to utilize data from other sensors to facilitate image compression. However, few techniques have explored the potential benefits of utilizing inter-modality correlations to enhance the image compression performance. In this paper, motivated by the recent success of learned image compression, we propose a new framework that uses sparse point clouds to assist in learned image compression in the autonomous driving scenario. We first project the 3D sparse point cloud onto a 2D plane, resulting in a sparse depth map. Utilizing this depth map, we proceed to predict camera images. Subsequently, we use these predicted images to extract multi-scale structural features. These features are then incorporated into the learned image compression pipeline as additional information to improve the compression performance. Our proposed framework is compatible with various mainstream learned image compression models, and we validate our approach using different existing image compression methods. The experimental results show that incorporating point cloud assistance into the compression pipeline consistently enhances the performance.
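
The first step, projecting a sparse point cloud onto the image plane to obtain a sparse depth map, can be sketched with a standard pinhole camera model. This is an illustrative assumption rather than the paper's code: the points are assumed to be already expressed in the camera frame, and the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters.

```python
def project_to_sparse_depth(points, fx, fy, cx, cy, width, height):
    """Project camera-frame 3D points (x, y, z) to a sparse depth map.

    Pixels that receive no point keep the value 0.0. A pinhole model with
    focal lengths (fx, fy) and principal point (cx, cy) is assumed.
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:  # point lies behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest point when several land on the same pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Toy example: a 5x5 image and four made-up points.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, 5.0), (0.0, 0.0, -1.0)]
depth_map = project_to_sparse_depth(points, fx=2.0, fy=2.0, cx=2.0, cy=2.0,
                                    width=5, height=5)
```

Only two pixels end up with a depth value in this toy case, which reflects why such maps are called sparse and why a later stage is needed to predict dense image content from them.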

Abstract:
The local deployment of federated large language models (FLLM) has further advanced the development of edge intelligence. However, the resource constraints of end devices, device heterogeneity, and the non-independent and identically distributed (Non-IID) nature of data pose significant challenges to the application of FLLM. To address these issues, we propose an Adaptive Asynchronous Accelerated FLLM (Tri-AFLLM) algorithm to achieve the efficient utilization of limited resources and improve model accuracy in edge computing (EC) scenarios. Specifically, Tri-AFLLM first ships an off-the-shelf LLM, i.e., CLIP, to each end device, keeping the backbone parameters frozen and updating only the parameters of the adapter containing two linear transformation layers by using momentum gradient descent (MGD). Next, a toy example is provided to illustrate the necessity of using different numbers of local iterations for heterogeneous devices in resource-constrained environments. Subsequently, the convergence bound of Tri-AFLLM under a given resource budget is discussed. Then, we formulate the bound into a resource consumption minimization problem with the number of local iterations as the optimization variable under a given model accuracy to mitigate the contribution disparity of local models to the global aggregation. Finally, extensive experiments are conducted to validate the superiority of Tri-AFLLM in terms of resource consumption, model accuracy, and addressing the Non-IID problem.
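
The local update described above, momentum gradient descent applied only to the adapter parameters for a device-specific number of local iterations, can be sketched as follows. This is a generic heavy-ball update under stated assumptions, not the Tri-AFLLM code: the adapter is flattened to a list of floats, the frozen backbone is simply left out, and `grad_fn`, the learning rate, and the momentum coefficient are hypothetical.

```python
def mgd_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One momentum gradient descent update; returns (new_params, new_velocity)."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def train_adapter(adapter, grad_fn, local_iters, lr=0.1, momentum=0.9):
    """Run `local_iters` MGD steps on the adapter parameters only.

    The frozen backbone never appears here, so it is never updated.
    `local_iters` may differ per device to reflect heterogeneous resources.
    """
    velocity = [0.0] * len(adapter)
    for _ in range(local_iters):
        adapter, velocity = mgd_step(adapter, grad_fn(adapter), velocity, lr, momentum)
    return adapter


# Toy objective: sum over parameters of (w - 3)^2, whose gradient is 2 * (w - 3).
def toy_grad(weights):
    return [2.0 * (w - 3.0) for w in weights]


fast_device = train_adapter([0.0, 10.0], toy_grad, local_iters=500)
slow_device = train_adapter([0.0, 10.0], toy_grad, local_iters=5)
```

The device that can afford more local iterations ends much closer to the optimum, which hints at why the number of local iterations is treated as an optimization variable.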

Abstract:
While a majority of single-modal ship detectors rely solely on RGB images, a novel multi-modal real-time transformer-based ship detection and classification method, called MM-ShipNet, is proposed in this paper that integrates the data acquired from three modalities: RGB camera, radar, and automatic identification system (AIS). First, a bounding box is generated based on the position information from radar and the ship’s actual size information from AIS. This physical information is fused and projected onto the camera-acquired RGB image frame. Each bounding box is then possibly weighted depending on the ship size presented on the image. The generated weighted ship masks (WSMs) are exploited to facilitate the ship classification task. In the second stage of MM-ShipNet, the multi-modal detection transformer (MM-DETR) introduces a multi-modal cross-scale encoder (MCE) for improving ship detection and classification performance. Our MCE exploits a dual-flow structure to fuse the features extracted from the WSMs and the RGB images under different scales. Since our method is the first work entailing the three aforementioned modalities, no dataset with all of these modalities is publicly available. Thus, we construct a multi-modal ship dataset, termed MMShips, as another contribution. Our MMShips dataset comprises 9,513 camera-acquired real-life maritime RGB images and their aligned ship masks generated from radar and AIS. Experimental results clearly demonstrate that our MM-ShipNet significantly outperforms multiple state-of-the-art single-modal and multi-modal ship detectors.

Abstract:
Vision-Language Tracking (VLT) aims to predict the target state in video sequences using two types of heterogeneous information: 1) the static text description detailing the main characteristics of the tracked object, and 2) the dynamic image patches containing the target and its surroundings. However, as the tracking proceeds, inconsistencies may arise between the linguistic information embedded in the text description and the visual representations stored in the search images. In such cases, the direct fusion of vision and language could result in conflicts. To tackle this issue, we propose MugTracker, which integrates image-to-text generation into the VLT framework and adopts a generative updating scheme to mitigate the effects of inconsistencies. Specifically, we design two branch tasks: multi-modal understanding for reasoning and multi-modal generation for updating. We develop a dynamic text generator based on the hybrid architecture of the pre-trained foundation model BLIP and adaptively update the text reference as the context varies for more accurate target modeling. The semantically consistent visual and linguistic representations are then aligned and associated by the reasoning branch built on the BLIP dual-encoder to infer the target state. To better transfer the foundation model to build a strong tracker, we introduce the proposed TE-Adapter in the visual components for target enhancement and Text-Adapter in the linguistic components to strengthen the learning of discriminative semantics. Our MugTracker has been extensively evaluated on three datasets, and its superior performance compared to state-of-the-art methods demonstrates its effectiveness.

Abstract:
In the rapidly evolving field of unmanned aerial vehicles (UAVs), real-time object detection is crucial for enhancing UAV intelligence. However, existing research often prioritizes complex networks to boost performance, neglecting the inherent computational resource constraints of UAVs. This paper presents FLDet, a family of faster and lighter detectors specifically designed for UAVs. By revisiting the architecture of modern lightweight detectors from a top-down perspective, FLDet offers a novel and comprehensive redesign of the head, neck, and backbone components. Firstly, we propose a Scale Sparse Head (SSH) that utilizes only two heads to detect objects of varying sizes, leveraging scale sparse feature pyramids to balance performance and efficiency. This design provides heuristic guidance and a new paradigm for detector architecture development. Secondly, a Partial Interaction Neck (PIN) is introduced to facilitate partial interaction between different feature scales, thereby reducing computational costs while effectively integrating multi-scale information. Thirdly, inspired by the primate visual pathway, a Stage-Wise Heterogeneous Network (SHN) is presented, employing heterogeneous blocks to capture both local details and contextual information. Finally, we develop a training strategy called Decay Data Augmentation (DDA) to enhance the detector’s generalization capability, leveraging diverse representations generated by strong data augmentation techniques. Experimental results on two challenging UAV-view detection benchmarks, VisDrone2019 and UAVDT, demonstrate that FLDet achieves a state-of-the-art balance among accuracy, latency, and parameter efficiency. In real-scenario tests, the fastest variant, FLDet-N, achieves real-time performance exceeding 52 FPS on an NVIDIA Jetson Xavier NX with only 1.2M parameters. The source code is available at https://github.com/wsy-yjys/FLDet.

Abstract:
The diversity of contextual information is of great importance for accurate semantic segmentation. However, most methods focus on single spatial contextual information, which results in an overlap of the semantic content of categories and a loss of contour information of objects. In this article, we propose a novel contour knowledge-aware perception learning network (CKPL-Net) to capture diverse contextual information through a space-category aggregation module (SCAM) and a contour-aware calibration module (CACM). First, SCAM is introduced to enhance intraclass consistency and interclass differentiation of features. By integrating space-aware and category-aware attention, SCAM reduces the redundancy of features from a categorical perspective while maintaining spatial correlation of pixels, substantially avoiding the overlap of the semantic content in categories. Second, CACM is designed to maintain the integrity of objects by perceiving contour contextual information. It develops novel contour-aware knowledge and adaptively transforms the grid structure of convolutions for boundary pixels, which effectively calibrates the representation of features near boundaries. Finally, quantitative and qualitative analyses on three public datasets (ISPRS Potsdam, ISPRS Vaihingen, and WHDLD) demonstrate that the proposed CKPL-Net achieves superior performance compared with prevalent methods, which indicates that diverse contextual information is beneficial for accurate segmentation.

Affiliations: State Key Laboratory of Electromechanical Dynamic Control, School of Mechatronical Engineering, Beijing Institute of Technology, Beijing, China; New Laboratory of Pattern Recognition, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; School of Electrical and Data Engineering, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia

Abstract:
In recent years, growing needs for advanced security and traffic management have significantly heightened the prominence of the visible-infrared person re-identification community (VI-ReID), garnering considerable attention. A critical challenge in VI-ReID is the performance degradation attributable to label noise, an issue that becomes even more pronounced in cross-modal scenarios due to an increased likelihood of data confusion. While previous methods have achieved notable successes, they often overlook the complexities of instance-dependent and real-world noise, creating a disconnect from the practical applications of person re-identification. To bridge this gap, our research analyzes the primary sources of label noise in real-world settings, which include a) instantiated identities, b) blurry infrared images, and c) annotators’ errors. In response to these challenges, we develop a Robust Hybrid Loss function (RHL) that enables targeted recognition and retrieval optimization through a more fine-grained division of the noisy dataset. The proposed method categorises data into three sets: clean, obviously noisy, and indistinguishably noisy, with bespoke loss calculations for each category. The identification loss is structured to address the varied nature of these sets specifically. For the retrieval sub-task, we utilize an enhanced triplet loss, adept at handling noisy correspondences. Furthermore, to empirically validate our method, we have re-annotated a real-world dataset, SYSU-Real. Our experiments on SYSU-MM01 and RegDB, conducted under various noise ratios of random and instance-dependent label noise, demonstrate the generalized robustness and effectiveness of our proposed approach.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) faces a huge stability-plasticity challenge because it must continuously learn knowledge from new classes with a small number of training samples without forgetting the knowledge of previously seen old classes. To alleviate this challenge, we propose a novel method called Prompt-based Concept Learning (PCL) for FSCIL, which generalizes conceptual knowledge learned from old classes to new classes by simulating human learning capabilities. In our PCL, in the base session, we simultaneously learn common basic concepts from the training data and the class-concept weight of each class in a prompt learning manner, and in each incremental session, class-concept weights between new classes and previously learned basic concepts are learned to achieve incremental learning. Furthermore, in order to avoid catastrophic forgetting, we propose a distribution estimation module to retain feature distributions of previously seen classes and a data replay module to randomly sample features of previously seen classes in incremental sessions. We verify the effectiveness of our PCL on widely used benchmarks, such as miniImageNet, CIFAR-100, and CUB-200. Experimental results show that our PCL achieves competitive results compared with other state-of-the-art methods. In particular, it achieves an average accuracy of 94.02% across all sessions on the miniImageNet benchmark.

Abstract:
Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work introduces a straightforward and efficient strategy that utilizes pre-trained vision-language models (VLMs), like CLIP, to identify potential novel classes through zero-shot classification. Previous methods use a class-agnostic region proposal network to detect object proposals and consider the proposals that do not match the ground truth as background. Unlike these methods, our method selects a subset of background proposals and treats them as novel classes during training. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training. Compared to previous pseudo-labeling methods, our approach does not require re-training or offline labeling, which makes it more efficient and effective in one-shot training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance without incurring additional parameters or computational costs during inference. In addition, we apply our method to various baselines. In particular, compared with the previous method, F-VLM, our method achieves a 2.5% improvement on the LVIS dataset. Combined with the recent method CLIPSelf, our method also achieves 46.7 novel class AP on COCO without introducing extra data for pre-training. We also achieve over 6.5% improvement over the F-VLM baseline on the recent challenging V3Det dataset. We release our code and models at https://github.com/xushilin1/dst-det.
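
The idea of promoting some background proposals to novel-class pseudo labels through zero-shot classification can be sketched as below. This is a simplified illustration, not the released dst-det code: the selection rule (a softmax confidence threshold), the temperature, and the toy embeddings are assumptions, and in practice the proposal and text embeddings would come from a VLM such as CLIP rather than hand-written vectors.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pseudo_label_background(proposal_feats, text_feats, is_background,
                            threshold=0.8, temperature=0.01):
    """Return {proposal index: class index} for confidently classified background proposals.

    Proposals matched to ground truth (is_background[i] is False) are skipped.
    The threshold and temperature are assumed values, not taken from the paper.
    """
    labels = {}
    for i, feat in enumerate(proposal_feats):
        if not is_background[i]:
            continue
        probs = softmax([cosine(feat, t) / temperature for t in text_feats])
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labels[i] = best
    return labels


# Toy example: two candidate class-name embeddings and three proposals.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
proposal_feats = [[1.0, 0.05], [1.0, 1.0], [0.02, 1.0]]
is_background = [True, True, False]
pseudo = pseudo_label_background(proposal_feats, text_feats, is_background)
```

Only the first proposal is promoted: the second is ambiguous between the two classes, and the third already matches a ground-truth box, so neither receives a pseudo label.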

Abstract:
Occluded person re-identification (ReID) is a challenging task because some essential features are obscured by obstacles or other pedestrians. Multi-granularity local feature extraction and recognition can effectively improve the accuracy of ReID under occlusion. However, manual segmentation methods for local features can lead to feature misalignment. Feature alignment based on pose estimation often ignores non-body details (e.g., handbags and backpacks) while increasing the complexity of the model. To address the above challenges, we propose a novel Adaptive Occlusion-Aware Network (AOANet), which mainly consists of two modules, the Adaptive Position Extractor (APE) and the Occlusion Awareness Module (OAM). To adaptively extract distinguishing features of body parts, APE optimizes the representation of multi-granularity features under the guidance of an attention mechanism and keypoint features. To further perceive the occluded region, the OAM adaptively calculates occlusion weights for body parts. These weights highlight the non-occluded parts and suppress the occluded parts, which in turn improves accuracy under occlusion. Extensive experimental results confirm the advantages of our method on the MSMT17, DukeMTMC-reID, Market-1501, Occluded-Duke and Occluded-ReID datasets. The comparative results demonstrate that our method outperforms comparable methods. In particular, on the Occluded-Duke dataset, our method achieves 70.6% mAP and 81.2% Rank-1 accuracy.

Abstract:
Current video object segmentation methods rely heavily on pixel-level mask annotations for training, which are expensive and time-consuming to acquire. To address this problem, some approaches train with sparse scribble annotations and take sparse target scribbles as initial information for inference. However, due to the sparsity of scribble annotations, the performance is often limited, and a dedicated loss function needs to be designed. Inspired by the powerful ability of the Segment Anything Model (SAM) to leverage prompts for segmentation, we argue that this problem can be alleviated by improving the quality of scribbles. Therefore, we propose SEVOS, a framework for scribble-supervised video object segmentation, which contains a scribble enhancement algorithm and a semi-supervised video object segmentation network. Specifically, the scribble enhancement algorithm first samples corresponding positive sample points and negative sample points from target scribbles, and then feeds them into the SAM in turn, achieving high-quality scribble enhancement without human intervention. This algorithm augments the scribble-annotated video dataset, which is used for additional training of the model. Furthermore, we design a post-processing enhancement algorithm to further improve the prediction results. The obtained model outperforms state-of-the-art methods by a considerable margin, indicating the generalization and effectiveness of the proposed model.

Abstract:
The discussion of compositional generalization in action recognition, i.e., Compositional Action Recognition (CAR), has recently received increasing attention. CAR challenges models to recognize unseen combinations of actions and objects, with the primary challenge being the distribution shift from training to testing. Most previous approaches for CAR incorporate supplementary object annotations (e.g., bounding boxes and object categories) to learn an instance-centric dynamic representation. However, these methods inevitably introduce stronger visual inductive bias, including object appearance and background bias, that impacts generalization performance, particularly in out-of-distribution scenarios. To this end, this work attempts to construct an appearance-agnostic de-biased representation by leveraging the powerful segmentation capability of the Segment Anything Model (SAM), which is the first exploration of SAM in the field of compositional action recognition. Specifically, we propose a novel SAM-driven Appearance-Agnostic Representation Learning (A2RL) framework for CAR, which contains two effective sub-modules: Fore-Back Mask (FBM) and Dynamic Relation Modeling (DRM). In FBM, we design a fine-grained instance-invisible and background-removed masking strategy to effectively weaken the strong connection between visual cues and action labels, as well as minimize the impact of irrelevant factors. In DRM, we explore the potential association between subjects and objects involved in one action and then build appearance-agnostic relational descriptors for dynamic modeling. Extensive experiments demonstrate the generalization ability of this work. Notably, FBM achieves significant improvements in all three compositional settings without adding any model parameters. The proposed framework also achieves state-of-the-art performance in comparison with the most recent methods in CAR.

Abstract:
The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in segmentation masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of linguistic and visual information. Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering visual features toward class embeddings. Moreover, to achieve a more compact visual space, we introduce route attention into the transformer decoder to find visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5% on the PASCAL VOC 2012 and 3.6% on the COCO-Stuff 164K for unseen classes compared with the state-of-the-art methods.

Abstract:
The grounding accuracy of existing video captioners still falls short of expectations. The majority of existing methods perform grounded video captioning on sparse entity annotations. However, grounded captioning models rely on deliberate grounding annotations as supervision, which are relatively hard to obtain. Moreover, the captioning accuracy often suffers from degraded object appearances in the annotated area, such as motion blur and video defocus, and these models seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning by using inexpensive pseudo annotations while avoiding the need to collect large amounts of manual annotations. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames. The action grounding dynamically associates the verbs with related subjects and the corresponding context, which keeps fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by a soft grounding supervision. More importantly, the grounding objective is supervised by pseudo annotations automatically produced by a grounding annotation generation module, so our model can be easily applied to challenging datasets without any grounding annotation provided. We conduct extensive experiments on three benchmark datasets and demonstrate significant performance improvements of +2.4 CIDEr on MSR-VTT, +4.7 CIDEr on MSVD, and +5.1 CIDEr on ActivityNet-Entities compared to state-of-the-art methods.

Abstract:
Existing diffusion models for low-light image enhancement typically incrementally remove noise introduced during the forward diffusion process using a denoising loss, with the process being conditioned on input low-light images. While these models demonstrate remarkable abilities in generating realistic high-frequency details, they often struggle to restore fine details that are faithful to the input. To address this, we present a novel detail-preserving diffusion model for realistic and faithful low-light image enhancement. Our approach integrates a size-agnostic diffusion process with a reverse process reconstruction loss, significantly enhancing the fidelity of enhanced images to their low-light counterparts and enabling more accurate recovery of fine details. To ensure the preservation of region- and content-aware details, we employ an efficient noise estimation network with a simplified channel-spatial attention mechanism. Additionally, we propose a multiscale ensemble scheme to maintain detail fidelity across diverse illumination regions. Comprehensive experiments on eight benchmark datasets demonstrate that our method achieves state-of-the-art results compared to over twenty existing methods in terms of both perceptual quality (LPIPS) and distortion metrics (PSNR and SSIM). The code is available at: https://github.com/CSYanH/DePDiff.

Abstract:
The fusion of low-resolution hyperspectral images (LR HSI) and high-resolution multispectral images (HR MSI) is a crucial approach for generating hyperspectral images (HSI). However, existing hyperspectral image fusion methods often rely on a single feature mapping process, which makes it difficult to accommodate the significant differences in features between the source images and the real images. Consequently, the generated images frequently exhibit varying degrees of information loss across different spectral bands and limit the overall performance of the fusion. To address this issue, we propose a novel hyperspectral image fusion network. Specifically, an asymptotic spectral mapping module is designed to enhance the detail information fitting capabilities of the fusion network. This module transforms the fitting process of missing information into multiple sets of fitting processes with varying degrees, which can map features with different spectral fidelity to various scales and gradually fit spectral information, thereby reducing spectral distortion. Additionally, we introduce an adaptive defect optimization loss that guides the network to focus on reconstructing regions with substantial spectral differences between LR HSI and HSI, optimizing the network’s constraints regarding the similarity between predicted and real images. Experimental results demonstrate that the proposed fusion network outperforms existing state-of-the-art methods across diverse datasets.

Abstract:
Recent studies on Video Coding for Machine (VCM) have achieved remarkable results. However, in practical visual-analytic applications, in addition to transmitting visual features for machine analysis, video textures are also mandatory for human monitoring and decision-making. To this end, this paper proposes a human-machine friendly video compression scheme (HMFVC) that satisfies both human viewing and machine analysis. First, we propose a learned semantic representation (LSR) method to extract semantic information between temporally neighboring frames. LSR can be utilized in signal reconstruction for human viewing and in visual analysis for machine understanding. Second, given the proposed LSR, we design an end-to-end optimized video compression framework to jointly optimize the visual quality for human perception, the analysis accuracy for machines, and the compression efficiency. Finally, an HMFVC codec is developed, which achieves higher action recognition accuracy and better reconstruction quality than traditional codecs and learned video compression approaches. Specifically, HMFVC saves 77% bitrate compared to x265 while achieving the same analysis performance as the original videos. To our knowledge, HMFVC is the first end-to-end optimized video compression scheme to serve both humans and machines. It is a promising framework for human-machine friendly video compression approaches.

Abstract:
Unlike Conventional Zero-Shot Learning (CZSL) which only focuses on the recognition of unseen classes by using a classifier trained on seen classes and semantic embeddings, Generalized Zero-Shot Learning (GZSL) requires a classifier trained on seen classes to recognize objects from both seen and unseen classes. To tackle this problem, feature generative-based models have been proposed to synthesize visual features for unseen classes conditioned on their semantic descriptors. However, they treat these semantic descriptors as independent individuals without exploring their structural relations among categories. We propose a novel approach, dubbed Relation Extrapolation based feature generation for GZSL (RE-GZSL), which generates features of unseen classes by borrowing some features that are extrapolated from seen classes based on semantic relations. In RE-GZSL, a visual-semantic relations alignment loss and an instance-prototype contrastive loss are presented to align visual relations with semantic relations. To maintain the information of the visual features before and after the alignment, a discrimination preservation loss is further introduced. Besides, a feature mixing module is built to synthesize features for unseen classes, which are more realistic and tightly related to seen classes. Experimental results demonstrate that RE-GZSL outperforms competitors on four benchmark datasets. Comprehensive ablation studies and analyses are provided to dissect what factors led to this success. Code is available at: https://github.com/Barcaaaa/RE-GZSL.

Abstract:
Network pruning is widely used in model compression due to its simplicity and efficiency. Existing methods typically introduce sparse loss regularization to learn masks. However, this sparse regularization approach lacks a clear criterion for evaluating channel importance and relies on manually defined rules, leading to a decline in model performance. In this article, a Self-Supervised Mask Learning (SSML) method for global channel pruning is proposed, casting mask learning as a self-supervised binary classification task to automatically identify less important channels. Specifically, a dedicated pretext task is designed for the channelwise masks, which leverages the original network to generate pseudo-labels from the mask itself to guide mask learning. Then, a polarization mask loss function is proposed, transforming the discrete mask learning problem into a differentiable binary classification problem. The proposed loss function distinguishes the similarity between pseudo-labels and masks, clustering similar masks together in the feature space and separating dissimilar masks, ultimately allowing channels with masks of 0 to be safely removed without damaging the performance of the pruned model. In addition, SSML can train from scratch to yield a compact model. Extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets demonstrate that SSML outperforms state-of-the-art methods. For instance, SSML prunes 52.7% of the FLOPs of ResNet34 on the ImageNet dataset with only a 0.01% drop in Top-1 accuracy. Moreover, the generalization of SSML is verified on downstream tasks.
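The final step, removing channels whose mask is 0, can be shown with a small sketch. It covers only the removal itself, not the self-supervised mask learning or the polarization loss; the helper names `binarize` and `prune_channels` and the 0.5 cutoff are assumptions introduced for illustration.

```python
def binarize(soft_mask, cutoff=0.5):
    """Turn a learned soft mask into 0/1 values.

    The 0.5 cutoff is an illustrative assumption; a well-polarized mask
    would already sit near 0 or 1, so the exact cutoff matters little.
    """
    return [1 if value >= cutoff else 0 for value in soft_mask]

def prune_channels(weights, mask):
    """Drop output channels whose mask entry is 0.

    weights: one filter (here a flat list) per output channel.
    mask:    one 0/1 value per output channel.
    """
    if len(weights) != len(mask):
        raise ValueError("mask must have one entry per channel")
    return [w for w, m in zip(weights, mask) if m == 1]

# Toy layer with four output channels.
layer = [[0.2, -0.1], [0.0, 0.03], [1.1, 0.7], [-0.4, 0.9]]
soft_mask = [0.97, 0.02, 0.88, 0.10]

hard_mask = binarize(soft_mask)
pruned = prune_channels(layer, hard_mask)
kept_ratio = len(pruned) / len(layer)
```

In a real network, the input channels of the following layer would have to be pruned with the same mask so that tensor shapes stay consistent.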

Abstract:
Frequent occurrences of marine extreme climate and weather events pose significant threats to human life and property, underscoring the practical significance of meteorological data forecasting methods. Notably, significant advancements in meteorological forecasting have been achieved by data-driven deep learning techniques, which leverage observed meteorological datasets and employ deep networks to capture complex patterns. However, challenges remain in accurately extracting local details and capturing spatial-temporal correlations when dealing with multiple meteorological forecasting tasks that exhibit diverse temporal and spatial scales. Hence, in this paper, we propose a Multi-Scale Spatial Temporal Transformer (MS-STT) framework to achieve efficient and accurate meteorological data forecasting. Specifically, to achieve a more detailed and multi-scale representation of meteorological data, we design a regionally coherent encoding strategy and multi-scale feature aggregation for visual representation. To enhance the multi-scale ability in terms of learning spatial-temporal correlations, we propose a multi-scale spatial-temporal transformer network, which integrates a multi-scale spatial transformer to learn the spatial association between local patches and multi-scale regions and a temporal transformer to learn the temporal dynamic evolution properties. Extensive quantitative and qualitative experiments on three popular spatial-temporal forecasting tasks validate the effectiveness of the proposed method. In particular, compared to the representative data-driven deep learning ENSO forecasting method Earthformer, our approach achieves a 3.7% performance improvement with only one-third of the parameters.

Abstract:
With the rapid development of CNNs and Transformers, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based tracking. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates’ history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT, on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.

Abstract:
Ensemble clustering based on co-association matrices integrates multiple connective matrices from base clusterings to achieve superior results. However, these methods primarily focus on inter-sample relationships, neglecting variations across different base clusterings, potentially introducing noise. Additionally, they overlook interactions between samples and base clusterings, which are crucial for extracting common information and avoiding post-processing steps that may cause information loss and instability in clustering results. To address these issues, we propose the Tensorized Graph Learning for Spectral Ensemble Clustering (TGLSEC) model. TGLSEC stacks all connective matrices into a third-order tensor, employs Fast Fourier Transform (FFT) for encoding, and elucidates inter-relations in the frequency domain. By minimizing the tensor Schatten p-norm, TGLSEC extracts common information in the low-rank space, eliminating noise and improving the quality of the common shared graph. Incorporating Laplacian rank constraints, TGLSEC learns a common shared graph with c-connected components, directly representing the clustering structure and avoiding post-processing steps, leading to more stable clustering results. To enhance computational efficiency for large-scale datasets, TGLSEC has been expanded into a bipartite-graph-based model, TGLSEC-BG, reducing complexity and computational time. Extensive experiments on real-world datasets demonstrate that TGLSEC and TGLSEC-BG exhibit superior clustering performance and robustness to noise.
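For readers unfamiliar with the starting point, the sketch below builds one connective matrix per base clustering and averages them into a classical co-association matrix. It covers only this background step, not the FFT encoding, the Schatten p-norm, or the Laplacian rank constraint; the function names are hypothetical.

```python
def connective_matrix(labels):
    """Entry (i, j) is 1 when samples i and j share a cluster, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def co_association(base_clusterings):
    """Average the connective matrices of all base clusterings."""
    matrices = [connective_matrix(labels) for labels in base_clusterings]
    n, m = len(matrices[0]), len(matrices)
    return [[sum(mat[i][j] for mat in matrices) / m for j in range(n)]
            for i in range(n)]

# Three base clusterings of four samples. Cluster ids are arbitrary, so the
# first two clusterings describe the same partition with swapped labels.
base_clusterings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

# The list of matrices is the stack that a tensor-based model treats as a
# third-order tensor (samples x samples x base clusterings).
stack = [connective_matrix(labels) for labels in base_clusterings]
co = co_association(base_clusterings)
```

Averaging discards how individual base clusterings differ, which is the kind of variation the abstract argues should be modeled rather than collapsed.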

Abstract:
With the success of self-supervised learning, multimodal foundation models have rapidly adapted to a wide range of downstream tasks driven by vision and language (VL) pre-training. State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets. However, bridging the semantic gap between the two modalities remains a non-negligible challenge for VL tasks. In this work, we propose an efficient computation framework for multimodal alignment by introducing a novel visual semantic module to further improve the performance of VL tasks. Specifically, we propose a flexible model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which combines the complementary advantages of Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs) to enrich visual semantic representations. In particular, a visual concrete encoder and a semantic abstract encoder are constructed to learn continuous and discrete latent variables to enhance the flexibility of semantic encoding. Considering the spatiotemporal properties of SNN modeling, we introduce a contrastive learning method to optimize the inputs of similar samples. This can improve the computational efficiency of the hierarchical network, while the augmentation of hard samples is beneficial to the learning of visual representations. Furthermore, the Spiking to Text Uni-Alignment Learning (STUA) pre-training method is proposed, which only relies on text features to enhance the encoding ability of abstract semantics. We validate the performance on multiple well-established downstream VL tasks. Experiments show that the proposed ASH-Nets achieve competitive results. Our code is available on GitHub (https://github.com/ZSYTJ/ASH-Nets).

Abstract:
Modeling discriminative spectral-spatial features is a key to improving hyperspectral image classification performance. However, existing methods cannot fully characterize the spatial specificity of hyperspectral images, thus making them unable to fully explore the useful information within the image and further improve the discriminative power of features. To address this issue, this work proposes a dual heterogeneous network (DHNet) for hyperspectral image classification. Specifically, the network consists of spatial-specific and spectral-specific branches and captures spectral-spatial features with complementarity by combining convolution and spectral-spatial involution. To better characterize spatial specificity, the spectral-spatial involution modifies the weight parameters based on the center spectral information and neighborhood spatial information of various spatial locations. Besides, two feature calibration modules are proposed. Spatial-specific and spectral-specific weights are generated from the respective branches to calibrate the features captured by the other branches to improve the information interaction between the two branches. The center spectral mapping integrates the spectral features of the target pixel into the feature to suppress the influence of the neighboring disturbing pixels. Experimental results on four datasets indicate that DHNet achieves an accuracy improvement of 1.23%, 2.03%, 2.52%, and 1.77% over the state-of-the-art peers, respectively.

Abstract:
Corruption-invariant Person Re-identification (CI-ReID) aims to build robust identity correspondence across non-overlapping cameras even when severe image corruptions occur. It is challenging because those corruptions contaminate intrinsic pedestrian characteristics and cause semantic misalignment in feature space. To address this issue, this paper proposes a coarse-to-fine semantic alignment framework that learns corruption-invariant pedestrian features for re-identification from the perspective of multi-modal feature alignment. In this framework, a Coarse-to-Fine Feature Alignment Transformer (CFAT) is introduced to extract and align features of pedestrian images with different corruptions. Specifically, in the coarse alignment stage, the CFAT aligns features of corrupted samples to those of the corresponding clean samples in a knowledge distillation manner, i.e., a teacher network distils identity-related semantics from clean samples and supervises the student network in learning semantically consistent features from corrupted samples. To avoid the information loss caused by strict alignment, we propose to integrate a Bridge Feature Generation (BFG) module into CFAT to construct meaningful latent structures among modalities in the fine alignment stage. This enables seamless alignment of the same identity between corrupted and clean modalities, leading to better re-identification performance. To evaluate the effectiveness of the proposed method, extensive experiments are conducted on three public benchmark datasets, i.e., Market-1501, CUHK-03, and MSMT-17. The experimental results demonstrate that our CFAT outperforms state-of-the-art methods by a large margin in various corrupted scenes.

Affiliations: Chinese Academy of Sciences, Institute of Intelligent Machines, Hefei Institutes of Physical Science, Hefei, China; School of Informatics, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; High Magnetic Field Laboratory, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China; Department of Automation, University of Science and Technology of China, Hefei, China

Abstract:
Pan-sharpening aims to generate high-detail multi-spectral images (HRMS) through the fusion of panchromatic (PAN) and multi-spectral (MS) images. However, existing pan-sharpening methods often suffer from significant performance degradation when dealing with out-of-distribution data, as they assume the training and test datasets are independent and identically distributed. To overcome this challenge, we propose a novel frequency domain-irrelevant feature learning framework that exhibits exceptional generalization capabilities. Our approach involves parallel extraction and processing of domain-irrelevant information from the amplitude and phase components of the input images. Specifically, we design a frequency information separation module to extract the amplitude and phase components of the paired images. A learnable high-pass filter is then employed to eliminate domain-specific information from the amplitude spectra. After that, we devise two specialized sub-networks (AFL-Net and PFL-Net) to perform targeted learning of the frequency domain-irrelevant information. This allows our method to effectively capture the complementary domain-irrelevant information contained in the amplitude and phase spectra of the images. Finally, the information fusion and restoration module dynamically adjusts the feature channel weights, enabling the network to output high-quality HRMS images. Through this frequency domain-irrelevant feature learning framework, our method balances generalization capability with network performance on the training distribution. Extensive experiments conducted on various satellite datasets demonstrate the effectiveness of our method for generalized pan-sharpening. Our proposed network outperforms state-of-the-art methods in terms of both quantitative metrics and visual quality, showcasing its superior ability to handle diverse, out-of-distribution data.
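The amplitude and phase separation can be sketched on a tiny image. The sketch uses a naive 2D discrete Fourier transform so that it needs only the standard library, and a fixed filter that zeroes the DC term stands in for the learnable high-pass filter; both choices, and all function names, are assumptions made for illustration.

```python
import cmath

def dft2(image, inverse=False):
    """Naive 2D discrete Fourier transform (O(N^4), fine for tiny inputs)."""
    h, w = len(image), len(image[0])
    sign = 1 if inverse else -1
    scale = 1 / (h * w) if inverse else 1
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = 2 * cmath.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(sign * 1j * angle)
            row.append(total * scale)
        out.append(row)
    return out

def split_spectrum(spectrum):
    """Separate a complex spectrum into amplitude and phase."""
    amplitude = [[abs(z) for z in row] for row in spectrum]
    phase = [[cmath.phase(z) for z in row] for row in spectrum]
    return amplitude, phase

def merge_spectrum(amplitude, phase):
    """Recombine amplitude and phase into a complex spectrum."""
    return [[cmath.rect(a, p) for a, p in zip(a_row, p_row)]
            for a_row, p_row in zip(amplitude, phase)]

def remove_dc(amplitude):
    """Fixed stand-in for a high-pass filter: zero the lowest frequency."""
    filtered = [row[:] for row in amplitude]
    filtered[0][0] = 0.0
    return filtered

def to_real(spectrum):
    return [[z.real for z in row] for row in dft2(spectrum, inverse=True)]

image = [[1.0, 2.0], [3.0, 6.0]]
amplitude, phase = split_spectrum(dft2(image))
restored = to_real(merge_spectrum(amplitude, phase))
zero_mean = to_real(merge_spectrum(remove_dc(amplitude), phase))
```

Recombining the untouched amplitude and phase recovers the image, while zeroing the DC amplitude removes its mean brightness, a simple example of filtering one component independently of the other.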

Abstract:
Improving the efficiency of Neural Architecture Search (NAS) is a challenging but significant task that has received much attention. Previous studies mainly adopt the Differentiable Architecture Search (DARTS) and improve its search strategies or modules to enhance search efficiency. Recently, some methods have started considering data reduction for speedup, but they are not tightly coupled with the architecture search process and cannot capture the training dynamics of DARTS well, resulting in sub-optimal performance. To this end, this work pioneers an exploration into the critical role of dataset characteristics in the bi-level optimization of DARTS, and then proposes a novel Bi-level Data Pruning (BDP) paradigm that targets the weights and architecture levels of DARTS to enhance efficiency from a data perspective. Specifically, we introduce a progressive bi-level data pruning strategy that utilizes supernet prediction dynamics as the metric to gradually prune unsuitable samples for DARTS during the search. An effective automatic class balance constraint is also integrated into BDP to suppress potential class imbalances resulting from data-efficient algorithms. Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that BDP reduces search costs by over 50% while achieving superior performance when applied to the baseline DARTS. Besides, we demonstrate that BDP can harmoniously integrate with advanced DARTS variants, like P-DARTS, PC-DARTS, EG-NAS, and β-DARTS, offering an approximately 2× speedup with minimal performance compromise.

Abstract:
Depth information provides valuable insights into the 3D structure, especially the outline of objects, which can be utilized to enhance semantic segmentation. However, a naive fusion of depth information can disrupt features and compromise accuracy due to the gap between depth and RGB modalities. In this work, we introduce a depth-guided texture diffusion approach that effectively tackles the outlined challenge. Our method extracts low-level features from edges and textures to create a texture image. This image is then selectively diffused across the depth map, enhancing structural information vital for precisely extracting object outlines. By integrating this enhanced depth map with the original RGB image into a joint feature embedding, our method effectively bridges the disparity between depth and RGB modalities, enabling more accurate semantic segmentation. We conduct comprehensive experiments on diverse, widely-used datasets covering various semantic segmentation tasks, including Camouflaged Object Detection (COD), Salient Object Detection (SOD), and indoor semantic segmentation. With source-free estimated depth or depth captured by depth cameras, our method consistently outperforms existing baselines and achieves new state-of-the-art results, demonstrating the effectiveness of our depth-guided texture diffusion for image semantic segmentation. The source code and datasets are publicly available at https://github.com/Wistzz/Texture-Diffusion.git.

Abstract:
The current state-of-the-art text-to-image (T2I) models have found numerous applications, driven by their ability to produce photorealistic images. Concept learning, as one notable application, aims to enable T2I models to generate personalized content and better enable users to create images according to their interests. Nevertheless, the process of concept learning often involves model fine-tuning, which in turn brings the potential risk of overfitting. Such overfitting reduces the output diversity of the T2I model and results in poor editability. To mitigate the overfitting problem, we introduce two simple yet effective designs, namely masked textual inversion (MaskTI) and text regularization (TextReg). MaskTI is a variant of vanilla textual inversion that forces the learnable identifier to only attend to the class descriptor. This modification can effectively reduce overfitting to irrelevant backgrounds. TextReg regulates the fine-tuning of cross-attention modules with simple text prompts without identifiers, which avoids the use of real images as the regularization prior. Our extensive experiments demonstrate that our approach not only effectively protects prior knowledge but also gives the personalized model high editability.

Abstract:
Addressing degraded weather conditions plays a vital role in practical applications. Many existing restoration approaches are limited to specific weather types, which restricts their applicability to different weather scenarios. Advanced technologies, including Transformers and diffusion models, have been harnessed to confront this challenge. However, these methods often heighten network complexity and prolong inference duration. To this end, we present MW-ConvNet, a U-shaped convolution-based network for multi-weather restoration. Specifically, the MW-Enc block and MW-Dec block are introduced to achieve simple yet strong feature extraction, and both rely entirely on traditional 2D convolution. To improve adaptability to multiple weather conditions, a prompt generation module is designed to generate a representative weather prompt at the encoder’s terminus. Drawing inspiration from style transfer, the weather prompt is used to guide the decoder learning through a progressive restoration procedure. For further high-fidelity restoration, we introduce frequency separation through wavelet pooling blocks in the encoder phase and corresponding up-sampling blocks in the decoder phase. The segregated treatment of low-frequency and high-frequency features curbs the loss of textural information during network computation. It also further improves the quality and accuracy of the generated weather prompt. Extensive experiments demonstrate that the proposed MW-ConvNet obtains superior performance compared to state-of-the-art methods across both weather-specific and real-world restoration tasks. Significantly, our method achieves an impressive inference speed of 0.12 seconds per 256×256 image, outpacing transformer-based and diffusion-based models.
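
The frequency separation mentioned above can be illustrated with a one-level 2D Haar transform, which splits an image into one low-frequency and three high-frequency sub-bands at half resolution. This is only a minimal sketch: the abstract does not say which wavelet MW-ConvNet uses, so the Haar filter and the function name `haar_dwt2` are assumptions made for illustration.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution sub-bands (ll, lh, hl, hh).
    Sub-band naming conventions vary between libraries.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            ll_row.append((a + b + d + e) / 2.0)  # scaled local sum (low frequency)
            lh_row.append((a - b + d - e) / 2.0)  # left minus right column
            hl_row.append((a + b - d - e) / 2.0)  # top minus bottom row
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# A flat patch has no high-frequency content; a textured one does.
flat_bands = haar_dwt2([[3, 3], [3, 3]])
textured_bands = haar_dwt2([[1, 2], [3, 4]])
```

Because the transform is orthonormal, the four sub-bands together carry the same energy as the input, which is one reason wavelet pooling can downsample without discarding texture.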

Abstract:
Recently, diffusion models, as a popular paradigm, have shown considerable superiority in image restoration in an unsupervised manner. However, they require iterative refinement from isotropic Gaussian noise through thousands of steps to produce a sample with exceptional quality. Most existing methods are devoted to designing fast solvers for the reverse stochastic differential equation (SDE) to accelerate sampling, while neglecting the potential of the forward SDE. To better exploit this potential, we propose the Torch-Advent-Civilization-Evolution (TACE), a novel diffusion model-based zero-shot framework for image restoration. Specifically, we propose the “Torch”, a latent vector that explicitly contains content information from the measurement image. By utilizing the Torch instead of isotropic Gaussian noise as initialization, our TACE significantly accelerates image restoration with better consistency and realness. To acquire the Torch from the forward process, we propose Prometheus SDEs, a cluster of equivalent SDEs. Furthermore, we construct a Conditional Guidance Projection (CGP) for the reverse SDE to strengthen the consistency of restored images. Finally, we design a Civilization Shuttle Strategy (CSS) for the generation process to enhance the realness of restored images. Extensive experiments validate that our TACE achieves state-of-the-art performance with fewer sampling steps in various typical tasks, such as super-resolution, deblurring, and colorization.

Abstract:
Image manipulation has sparked widespread concern due to its potential security threats on the Internet. The boundary between the authentic and manipulated region exhibits artifacts in image manipulation localization (IML). These artifacts are more pronounced in heterogeneous image splicing and homogeneous image copy-move manipulation, while they are more subtle in removal and inpainting manipulated images. However, existing methods for image manipulation detection tend to capture boundary artifacts via explicit edge features and have limitations in effectively addressing subtle artifacts. Besides, feature redundancy caused by the powerful feature extraction capability of large models may prevent accurate identification of manipulated artifacts, exhibiting a high false-positive rate. To solve these problems, we propose a novel edge-aware network (EAN) to capture boundary artifacts effectively. This network treats the image manipulation localization problem as a segmentation problem inside and outside the boundary. In EAN, we develop an edge-aware mechanism to refine implicit and explicit edge features by the interaction of adjacent features. This approach directs the encoder to prioritize the desired edge information. Also, we design a multi-feature fusion strategy combined with an improved attention mechanism to enhance key feature representation significantly for mitigating the effects of feature redundancy. We perform thorough experiments on diverse datasets, and the outcomes confirm the efficacy of the suggested approach, surpassing leading manipulation localization techniques in the majority of scenarios.

Abstract:
Standard domain adaptation methods require access to both source and target data. However, sharing source data is often impractical in real-world scenarios due to data privacy and memory limitation issues. In this work, we focus on source-free domain adaptation (SFDA). Existing SFDA methods mainly learn independent information within individual samples and lack the utilization of topological information between samples. For this reason, we explicitly constrain the sample relational knowledge in the mean teacher framework for solving SFDA. Specifically, three relational graphs are first constructed based on the similarity between sample feature pairs: teacher, student, and teacher-student. Then, model adaptation is achieved via two consistency regularizations: 1) Inter-graph consistency constrains the consistency between graph structures. 2) Intra-graph consistency enhances the compactness of the samples within classes. In addition, to mitigate the effect of noisy pseudo labels, local prototypes during the iterations are continuously utilized to calibrate the global prototype to generate high-quality pseudo labels for the target samples. Further, the classification loss is reweighted according to the uncertainty of the pseudo labels, which allows the model to not only highlight the role of high-reliability samples but also to fully exploit the entire target domain. Extensive experimental results on the Office-31, Office-Home, VisDA-2017, DomainNet, and Digit datasets demonstrate the effectiveness of our method.
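
The three relational graphs described above can be sketched as pairwise similarity matrices over a batch of sample features. The snippet below is a hedged illustration only: cosine similarity and a mean-squared-difference consistency term are assumptions, since the abstract does not specify the similarity measure or the loss, and the function names and feature values are hypothetical.

```python
import math


def cosine_similarity_graph(features_a, features_b):
    """Pairwise cosine similarities between two lists of feature vectors."""
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0  # guard zero vectors
        return [x / norm for x in vector]

    unit_a = [normalize(v) for v in features_a]
    unit_b = [normalize(v) for v in features_b]
    return [[sum(x * y for x, y in zip(u, v)) for v in unit_b] for u in unit_a]


def graph_consistency(graph_p, graph_q):
    """Mean squared difference between two equally sized graphs (assumed loss)."""
    count = len(graph_p) * len(graph_p[0])
    total = sum(
        (p - q) ** 2
        for row_p, row_q in zip(graph_p, graph_q)
        for p, q in zip(row_p, row_q)
    )
    return total / count


# Hypothetical teacher and student features for the same three target samples.
teacher_feats = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
student_feats = [[2.0, 0.0], [0.0, 5.0], [6.0, 8.0]]

teacher_graph = cosine_similarity_graph(teacher_feats, teacher_feats)
student_graph = cosine_similarity_graph(student_feats, student_feats)
cross_graph = cosine_similarity_graph(teacher_feats, student_feats)
inter_graph_loss = graph_consistency(teacher_graph, student_graph)
```

Here the student features are rescaled copies of the teacher features, so the two graphs agree and the consistency term is close to zero; features that rearrange the neighborhood structure would raise it.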

Abstract:
Visual object tracking is a challenging task that aims to accurately estimate the scale and position of a designated target. Recently, segmentation networks have proven effective in visual tracking, producing outstanding results for target scale estimation. However, segmentation-based trackers still lack robustness due to the presence of similar distractors. To mitigate this issue, we propose an Attention-based Gating Network (AGNet) that produces gating weights to diminish the impact of feature maps linked to similar distractors. Subsequently, we incorporate the AGNet into the segmentation-based tracking paradigm to achieve accurate and robust tracking. Specifically, the AGNet utilizes three cascading Multi-Head Cross-Attention (MHCA) modules to generate gating weights that govern the generation of feature maps in the baseline tracker. The proficiency of the MHCA in modeling global semantic information effectively suppresses feature maps associated with similar distractors. Additionally, we introduce a distractor-aware training strategy that leverages distractor masks to train our model. To alleviate the issue of partial occlusion, we introduce a box refinement module to enhance the accuracy of the predicted target box. Comprehensive experiments conducted on 11 challenging tracking benchmarks show that our approach significantly surpasses the baseline tracker across all metrics and achieves excellent results on multiple tracking benchmarks.

Abstract:
Ranking-based skill assessment is an essential component of video understanding. In this task lacking precise procedure annotations, existing methods place greater emphasis on evaluating the procedure quality via manually normalizing the execution duration. However, this normalization alters the inherent duration-related procedural patterns. Experimentally, we discover that distinct duration biases are prevalent in duration-sensitive skills, such as those in medical and everyday-life settings. Hence, duration information is crucial for ranking-based skill assessment when dealing with varying durations. Additionally, similar execution processes tend to have closer execution durations. Thus, another critical factor lies in extracting duration-related procedural information alongside similar durations. It is defined as mining rhythm patterns, which are inspired by music rhythms including various durations and duration-related procedures. In our work, a rhythm-aware transformer is proposed to mine the rhythm patterns adaptively. Given pairwise inputs, a co-attention module is designed to mutually highlight duration-related procedure information when comparing pairwise input videos with similar durations, and adaptively attenuate the efficacy when confronted with pairwise inputs featuring significantly different durations. A rhythm-encoding module further embeds duration information into the concatenation of raw features and co-attention features. Following these features, the transformer decoder is designed to learn duration-related queries supervised by a novel duration grouping loss among various duration groups. The experimental results demonstrate that the rhythm-aware transformer is effective for ranking-based skill assessment.

Abstract:
Multispectral object detection has attracted increasing attention recently due to its superior detection capacity under various illumination conditions. The key challenge lies in the effective aggregation of multi-spectral features to derive highly discriminative representations. To address this challenge, we propose a novel Multidimensional Fusion Network (MMFN) to explore multi-modal information from local, global, and channel perspectives. Specifically, at the local level, local features of different modalities and their inter-relationships are captured by a window-shifted fusion. As a complement to the local information, we design a global interaction module that facilitates the fusion of holistic, high-level semantic information spanning the entire image. We distill the channel dependencies and complementarities between different modalities through cross-channel learning and generate the final fused representation. Comprehensive experiments conducted on three publicly available datasets validate the superiority of the proposed method. The results exhibit notable performance gains over state-of-the-art multispectral object detectors. Our code will be released.

Abstract:
Existing video inpainting approaches tend to adopt vision transformers with few customized designs, which poses two limitations. Firstly, the conventional self-attention mechanism treats tokens from invalid and valid regions equally and mingles them, which may incur blurriness. Secondly, these approaches merely employ forward frames as references, while ignoring the past inpainted frames, which are also valuable in enhancing temporal consistency and offering more available information. In this paper, we propose a new video inpainting network, called Bidirectional Error-Aware Fusion Network (BEAF-Net). Concretely, on one hand, we propose a tailored Error-Aware Transformer (EAT) that discerns different tokens by assigning dynamic weights to restrain the use of erroneous tokens. Meanwhile, each EAT is equipped with a Spatial Feature Enhancement (SFE) layer to synthesize multi-scale features. On the other hand, we apply a pair of EATs to utilize forward reference frames and past inpainted frames simultaneously, and a proposed Bidirectional Fusion (BiF) layer is applied to blend the aggregation results adaptively. By coupling these novel designs, our proposed BEAF-Net fully leverages the location priors, multi-scale perception, and past predictions to produce more faithful and consistent inpainting results. We evaluate our BEAF-Net on two commonly used video inpainting datasets, DAVIS and YouTube-VOS, where the experimental results demonstrate that BEAF-Net compares favorably with state-of-the-art solutions. Video examples can be found at https://github.com/JCATCV/BEAF-Net.

Abstract:
Few-shot classification aims to adapt to new tasks with limited labeled examples. Recent methods have explored various techniques for measuring the similarity between query and support images, along with meta-training and pre-training strategies, to leverage visual features more effectively. However, the potential of multi-modality information remains unexplored, presenting a promising avenue for improvement in few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification without image-level captions. LPN leverages the complementarity of vision and language modalities through two parallel branches with pre-fusion and post-fusion. Firstly, we introduce the language modality by utilizing a pre-trained text encoder to extract class-level text features from class names. In the visual branch, we process images using a conventional image encoder and leverage the class-level features to align the visual features, effectively capturing more class-relevant visual information. In the text branch, we combine the class-level text features with the visual features using a language-guided decoder. This decoder generates image-specific text features for the pre-fusion step. Additionally, we utilize these class-level text features to refine the prototypical head, creating robust prototypes for subsequent measurements. Finally, to enhance overall performance, we aggregate the visual and text logits, adjusting for discrepancies between the modalities during the post-fusion process. Extensive experiments demonstrate that LPN outperforms several state-of-the-art methods on benchmark datasets, showcasing its effectiveness and robustness.
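
The prototype and post-fusion steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: prototypes are plain class means, logits are negative squared Euclidean distances, and the post-fusion is a fixed weighted sum controlled by a hypothetical `text_weight` parameter. The abstract does not give LPN's actual formulas, and the text logits below are made-up values.

```python
def class_prototypes(support_features, support_labels):
    """Average the support features of each class into one prototype."""
    sums, counts = {}, {}
    for feature, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def distance_logits(query, prototypes):
    """Negative squared Euclidean distance from a query to every prototype."""
    return {
        label: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for label, proto in prototypes.items()
    }


def fuse_logits(visual_logits, text_logits, text_weight=0.5):
    """Weighted sum of per-class logits from the two branches (assumed fusion)."""
    return {
        label: (1.0 - text_weight) * visual_logits[label] + text_weight * text_logits[label]
        for label in visual_logits
    }


prototypes = class_prototypes([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]], ["cat", "cat", "dog"])
visual_logits = distance_logits([1.0, 1.0], prototypes)
fused = fuse_logits(visual_logits, {"cat": -3.0, "dog": -1.0})
prediction = max(fused, key=fused.get)
```

In this toy episode the text branch mildly prefers "dog", but the visual evidence dominates after fusion, which shows why the relative weighting of the two modalities matters.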

Abstract:
Graph matching aims to establish node correspondences between graphs, which is a classic combinatorial optimization problem. In recent years, (deep) learning-based methods have emerged as a superior alternative to traditional graph matching solvers. However, these methods typically rely on node-level correspondence labels, which can be prohibitively expensive or unrealistic to obtain. Inspired by contrastive learning, a prevalent paradigm for self-supervised representation learning, we develop a Contrastive Learning Network for Unsupervised Graph Matching (CUGM), which is an end-to-end differentiable pipeline to learn node permutations. Specifically, we propose a three-level augmentation, including raw image augmentation, graph augmentation, and model augmentation, to generate sufficiently diverse contrastive views that enrich training instances. Then a contrastive learning network is constructed to capture the higher-order structural information in graphs and learn the final node representations, which yield the affinity matrix used to directly solve a linear assignment problem. More importantly, we propose a node-level contrastive loss with false negative cancellation for optimizing the whole network to extract tailored node feature representations that improve graph matching accuracy. Experimental results on standard graph matching benchmarks demonstrate that our end-to-end unsupervised method achieves competitive performance compared with state-of-the-art supervised and unsupervised graph matching methods.
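
The last step described above, turning node representations into an affinity matrix and solving a linear assignment problem, can be sketched as follows. This is an illustrative sketch rather than CUGM's pipeline: the inner-product affinity is an assumption, and an exhaustive search over permutations stands in for a proper solver such as the Hungarian algorithm, so it only suits very small graphs.

```python
from itertools import permutations


def node_affinity(features_a, features_b):
    """Inner-product affinity between every node of graph A and graph B."""
    return [[sum(x * y for x, y in zip(u, v)) for v in features_b] for u in features_a]


def solve_linear_assignment(affinity):
    """Return the permutation maximizing total affinity (brute force, O(n!)).

    Expects a square matrix; matching[i] is the node of graph B
    assigned to node i of graph A.
    """
    size = len(affinity)
    best_matching, best_score = None, float("-inf")
    for candidate in permutations(range(size)):
        score = sum(affinity[i][candidate[i]] for i in range(size))
        if score > best_score:
            best_matching, best_score = candidate, score
    return list(best_matching), best_score


# Hypothetical node embeddings: graph B lists the same nodes in swapped order.
affinity = node_affinity([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
matching, score = solve_linear_assignment(affinity)
```

The recovered matching undoes the swap, which is the node permutation a graph matching method is trained to predict.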

Abstract:
Most cameras on portable devices adopt a rolling shutter (RS) mechanism, encoding sufficient temporal dynamic information through sequential readouts. This advantage can be exploited to recover a temporal sequence of latent global shutter (GS) images. Existing methods rely on fully supervised learning, necessitating specialized optical devices to collect paired RS-GS images as ground truth, which is too costly to scale. In this paper, we propose, for the first time, a self-supervised learning framework to produce a high frame rate GS video from two consecutive RS images, unleashing the potential of RS cameras. Specifically, we first develop the unified warping model of RS2GS and GS2RS, enabling the complementary conversions of RS2GS and GS2RS to be incorporated into a uniform network model. Then, based on the cycle consistency constraint, given a triplet of consecutive RS frames, we minimize the discrepancy between the input middle RS frame and its cycle reconstruction, generated by interpolating back from the predicted two intermediate GS frames. Experiments on various benchmarks show that our approach achieves comparable or better performance than state-of-the-art supervised methods while enjoying stronger generalization capabilities. Moreover, our approach makes it possible to recover smooth and distortion-free videos from two adjacent RS frames in the real-world BS-RSC dataset, surpassing prior limitations.

Abstract:
Despite the significant success of deep learning in computer vision tasks, cross-domain tasks still present a challenge in which the model’s performance degrades when the training set and the test set follow different distributions. Most existing methods employ adversarial learning or instance normalization to achieve data augmentation for this task. In contrast, considering that the batch normalization (BN) layer may not be robust for unseen domains and that there exist differences between local patches of an image, we propose a novel method called patch-aware batch normalization (PBN). To be specific, we first split feature maps of a batch into non-overlapping patches along the spatial dimension, and then independently normalize each patch to jointly optimize the shared BN parameter at each iteration. By exploiting the differences between local patches of an image, our proposed PBN can effectively enhance the robustness of the model’s parameters. Besides, considering that the statistics from each patch may be inaccurate due to their smaller size compared to the global feature maps, we incorporate the globally accumulated statistics with the statistics from each batch to obtain the final statistics for normalizing each patch. Since the proposed PBN can replace the typical BN, it can be integrated into most existing state-of-the-art methods. Extensive experiments and analysis demonstrate the effectiveness of our PBN in multiple computer vision tasks, including classification, object detection, instance retrieval, and semantic segmentation.
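
The patch splitting and per-patch normalization described above can be sketched on a single-channel map. This is a deliberately reduced illustration, not the PBN layer itself: it handles one 2D map instead of a batch of multi-channel features, and it omits the shared affine parameters and the blending with globally accumulated statistics, whose exact form the abstract does not give. The function name and `patch_size` argument are hypothetical.

```python
import math


def patchwise_normalize(feature_map, patch_size, eps=1e-5):
    """Normalize each non-overlapping patch_size x patch_size patch independently."""
    rows, cols = len(feature_map), len(feature_map[0])
    if rows % patch_size or cols % patch_size:
        raise ValueError("feature map size must be divisible by patch_size")
    normalized = [[0.0] * cols for _ in range(rows)]
    for row_start in range(0, rows, patch_size):
        for col_start in range(0, cols, patch_size):
            cells = [
                (r, c)
                for r in range(row_start, row_start + patch_size)
                for c in range(col_start, col_start + patch_size)
            ]
            values = [feature_map[r][c] for r, c in cells]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            scale = 1.0 / math.sqrt(variance + eps)
            for r, c in cells:
                # Each patch uses only its own mean and variance.
                normalized[r][c] = (feature_map[r][c] - mean) * scale
    return normalized


# Two patches with very different value ranges end up on the same scale.
normalized_map = patchwise_normalize([[1.0, 3.0, 10.0, 30.0], [1.0, 3.0, 10.0, 30.0]], 2)
```

Normalizing with whole-map statistics would instead let the high-valued right half dominate, which is the local difference that patch-wise statistics are meant to expose.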

Abstract:
High-precision image matching and localization technology in a 3D environment map is essential for many tasks, such as marine engineering detection, robotics, and autonomous navigation. However, current visual localization and reconstruction methods overly depend on point features, which lack robustness in low-texture environments. To address this limitation, we propose a novel framework for point and line localization and 3D reconstruction with semantic constraints, which integrates multiple innovative components to achieve superior performance. Firstly, we design a point-localization optimization strategy with uniform point sampling and point-based instance segmentation constraints, significantly improving image matching and camera localization accuracy. Secondly, we optimize the selection of 2D-3D lines and line matching using instance segment constraints, leveraging the structural and semantic richness of line features to complement point features. Thirdly, we perform a joint point and line feature 3D reconstruction, enabling the creation of accurate 3D environment maps even in challenging low-texture marine scenes. Our approach has been extensively tested on popular datasets and compared with state-of-the-art methods. This work significantly advances current visual localization and 3D reconstruction techniques by addressing their limitations in low-texture environments, while also providing a robust foundation for future research and applications in marine engineering, robotics, and autonomous navigation.

Abstract:
Text-to-image models based on diffusion models are capable of generating highly realistic images from text descriptions. Nevertheless, in practical applications, the generated images frequently fail to fully satisfy user requirements regarding position and structure due to the absence of detailed location information and complex structural demands in the text descriptions. In order to improve the accuracy of the generated image in position and structure, the introduction of additional control conditions such as keypoint annotations or semantic segmentation has become an important research direction. This paper proposes a novel method based on a lightweight pre-trained diffusion model called CTIGEN-CDM. The model reduces computational costs by pruning the denoising network of the diffusion model and integrates control conditions into the denoising process through a gating mechanism to guide image generation. These control conditions encompass Canny edge detection, HED edge detection, depth maps, keypoints, and semantic segmentation. Experimental results reveal that CTIGEN-CDM possesses excellent generation quality and broad application potential. This method can generate high-quality images with precise positioning and structure while significantly saving computational resources, and it offers a promising new solution for text-to-image generation tasks.

Abstract:
Superpixel segmentation aims to automatically group visually similar pixels within an image into compact regions. This approach provides an efficient low-level representation of image data, effectively reducing the complexity of image primitives for subsequent vision tasks. Recent deep convolutional networks have shown their advantages in the superpixel segmentation task. However, many existing deep learning methods still struggle to preserve object edges and accurately perceive similar pixels. This limitation can be attributed to their inadequate ability to model edge information and capture effective context within the image. To address these issues, we propose an Edge guided Local-Global Attention Network (ELGANet) for superpixel segmentation. Specifically, we first devise an Edge Enhancement Module (EeEM), which integrates multiple edge features into the superpixel-friendly features. Then, we develop a Local-Global Attention Module (LGAM) to analyze the relationship between pixels and local or global region patches, aiming to obtain effective context information for grouping similar pixels. The edge features and deep global semantic features are subsequently fused to generate the superpixel-friendly features. The final superpixel-friendly features are then mapped into final superpixels. Extensive experiments on four benchmark datasets demonstrate the effectiveness and superiority of our ELGANet compared with ten state-of-the-art models.

Abstract:
Infrared and visible image fusion aims to generate fused images with rich textures and clear target representations. Existing methods generally assume high-quality input images, thus overlooking issues such as reduced contrast and loss of details in visible images under low-light conditions. The naive enhance-then-fuse strategy cannot perform fuse-oriented image enhancement, which always reaches a sub-optimal result. To address this challenge, we propose a perceptual transform fusion of infrared and visible images, which simultaneously optimizes low-light enhancement and image fusion. Specifically, to improve computational efficiency and optimize key feature representations while suppressing noise interactions caused by lighting variations, we introduce a lightweight adaptive sparse Transformer block (ASTBlock). This model adaptively integrates sparse and dense attention mechanisms to enhance feature representations and employs a feed-forward network to eliminate redundant information, thereby ensuring the quality of image fusion. Subsequently, to retain significant details while reducing the impact of noise introduced by low-light enhancement, we incorporate discrete wavelet transform (DWT) for feature decomposition and fusion, further enhancing the representation capability and feature preservation of fused images. Meanwhile, to tackle the issues of insufficient contrast and hidden details in low-light conditions, we design an illumination perception module and an illumination consistency loss to improve the contrast and clarity of fused images. Experimental results on multiple public benchmark datasets for quality assessment and downstream tasks, e.g., pedestrian detection, demonstrate that our method significantly outperforms the state-of-the-art (SOTA) methods. The code is available at https://github.com/hinmouc/PIVFusion.

Abstract:
Low-light image enhancement (LLIE) aims to restore low-light images to their normal-light counterparts with optimal global illumination distribution and clear local details. With the advancement of deep learning, deep learning-based methods have become the mainstream in the LLIE community. However, most deep learning-based methods cannot yet fully exploit the global and local contextual information in the low-light image. In this paper, we introduce a dual-branch module to simultaneously restore global and local features from the spatial and frequency domains. To fuse these multi-level features, we propose a perception module to perform feature interaction between global and local features via cross attention and self-gating. By integrating the two developed modules into a U-Net backbone, we present a global-local interaction network for LLIE. Furthermore, recent studies have shown that contrastive learning can be an effective paradigm for the LLIE task. However, previous works typically use semantically inconsistent under-/over-exposed images as negative samples. These images are very dissimilar to the ground truth and cannot provide sufficient regularization in contrastive learning. To address this limitation, we explore a practical multi-exposure progressive contrastive regularization framework for LLIE. With a customized sample generation, sample selection, and progressive learning strategy, our proposed framework progressively narrows down the solution space around the optimum, and helps to improve the performance of LLIE methods without additional inference overhead. Combining the proposed network and contrastive regularization, our proposed method achieves favorable results compared to state-of-the-art LLIE methods on benchmark datasets. Extensive experiments further demonstrate the generalization ability of our proposed method.

Abstract:
This paper conducts a comprehensive cryptanalysis of a novel image cryptosystem, namely NIC, from three distinct perspectives, uncovering several inherent vulnerabilities within the scheme. Unlike most existing cryptanalysis methods that rely on a single approach, this study presents three innovative attack strategies tailored to the specific flaws of NIC. The first attack exploits the equivalence between multi-level and single-level diffusion processes in NIC, revealing the inefficiency of its diffusion mechanism. The second strategy leverages the synthesis and decomposition properties of base images, taking advantage of the scheme’s same-type diffusion operations. The third and most significant contribution is the introduction of the plaintext-to-ciphertext sensitivity (S-PTC) attack, a novel approach initially proposed in this work. Departing from the conventional focus on ciphertext-to-plaintext sensitivity (S-CTP), this study emphasizes S-PTC, an often-neglected aspect of cryptosystem security. By highlighting the fatal flaw of low S-PTC in NIC, the study exposes a critical weakness overlooked in previous research. Furthermore, an in-depth analysis demonstrates that the S-PTC attack is not confined to NIC but can be extended to other general cryptosystems with similar algorithmic structures. Rigorous theoretical derivations and experimental validations have been conducted to confirm the effectiveness and feasibility of all three proposed attacks, providing significant insights into the security evaluation of image encryption schemes.

Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, School of Artificial Intelligence, Anhui University, Hefei, China; Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei, China

Abstract:
Existing single-modal and multi-modal salient object detection (SOD) methods focus on designing specific architectures tailored for their respective tasks. However, developing completely different models for different tasks leads to labor and time consumption, as well as high computational and practical deployment costs. In this paper, we attempt to address both single-modal and multi-modal SOD in a unified framework called UniSOD, which fully exploits the overlapping prior knowledge between different tasks. Nevertheless, assigning appropriate strategies to modality-variable inputs is challenging. To this end, UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning, which are seamlessly plugged into the proposed pre-trained baseline SOD model to handle corresponding tasks, while requiring only a few learnable parameters compared to training the entire model from scratch. In particular, each modality-aware prompt is solely generated from a homogeneous switchable prompt generation (SPG) block, which adaptively performs structural switching based on single-modal and multi-modal inputs without manual intervention, ensuring that the framework can effectively handle diverse input cases (e.g., RGB-only, RGB-D, RGB-T) with a unified approach. Through end-to-end joint training, UniSOD achieves overall competitive performance on 14 benchmark datasets, demonstrating its ability to efficiently unify single-modal and multi-modal SOD tasks. Code is available at https://github.com/Angknpng/UniSOD

Abstract:
Light propagation in underwater scenes is significantly hindered by wavelength- and distance-dependent attenuation and scattering, leading to low contrast and severe color distortion in underwater images. Recent advancements in diffusion models have shown impressive performance in image restoration by learning data distribution prior knowledge (diffusion prior) from large amounts of paired data. However, due to the difficulties in collecting paired underwater images, the available data for underwater image enhancement is limited in both quality and quantity. This scarcity leads to a biased diffusion prior and suboptimal performance of diffusion models. To address this issue, we propose a novel method, termed SeaDiff, to learn underwater diffusion prior with wavelength- and distance-dependent degradation awareness. Specifically, we introduce a Prior Knowledge Mining Model (PKMM), which includes two key components: (1) the Physical Prior Embedding Module (PPEM) that simulates the underwater imaging process through a distance-dependent physical model and embeds physical prior by incorporating generalizable distance-aware cues from a large vision foundation model; and (2) the Color Prior Embedding Module (CPEM) that extracts wavelength-dependent color distribution prior from a log-chroma color space. Additionally, we propose a Degradation-Aware Diffusion Model (DADM) that seamlessly integrates degradation prior with diffusion prior and enhances the underwater images with high visual quality. Extensive experiments on popular UIE benchmarks and downstream tasks demonstrate that the proposed SeaDiff achieves state-of-the-art performance in terms of both visual quality and quantitative metrics. The code will be released at https://github.com/Henry-Bi/SeaDiff.

Abstract:
Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack explicit handling of how an object’s size changes with its depth. Especially in a monocular video, the size of the same object changes continuously, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, which incorporates the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark, and our method achieves the best results with an improvement of up to 11.6% in terms of SqRel reduction. Ablation studies also validate the effectiveness of each proposed module.

Abstract:
Existing multi-view 3D human pose estimation methods heavily rely on precise extrinsic calibration, which significantly restricts their practical deployment in uncontrolled environments. To address this limitation, we propose an Extrinsic Parameter-free Multi-view 3D Human Skeleton Estimation (EFMK) framework with three technical contributions. First, a Local-Global Pose Embedding scheme is proposed to simultaneously capture the fine-grained joint dependencies while establishing cross-view correspondences. Second, a Spatial-View Joint Transformer architecture is developed with three dedicated components: 1) Feature Transformation Modulation generates adaptive modulation vectors for distinct tokens to model heterogeneous relationship patterns; 2) Prior Knowledge Enhancement systematically integrates human kinematic constraints and multi-view geometric priors into attention computation through structural topology encoding; 3) Spatial-View Joint Attention implements decoupled spatial-view attention computation followed by joint distribution modeling to capture hierarchical spatial-view dependencies. Third, a Bone-wise Reprojection-based Multi-view Aggregation mechanism is introduced to consolidate multiple 3D outputs into a single, higher-quality 3D pose for practical applications. Extensive experiments on three benchmarks demonstrate that our method achieves state-of-the-art performance while maintaining a compact model size. Code and results are available at https://github.com/Z-Z-J/EFMK

Abstract:
Visual question answering (VQA) tasks have witnessed significant advancements in recent years. However, enhancing the robustness of models on diverse datasets and improving their performance in 3D environments remain challenging research directions. In this paper, we propose a high-performance framework called Bias3D-VQA for 3D-VQA based on generative adversarial networks via bias learning, addressing the inherent biases that arise from the model’s dependency on dataset-specific patterns or tendencies during training. Such biases often lead the model to focus on more frequently occurring but incorrect answers. Our framework comprises a target model, a bias model, and a generative adversarial component. In each training iteration, we employ an alternating training approach for the target and bias models. When training the bias model, fake point cloud data is generated from random noise, and then we accumulate biases present in the language modality and various modules through adversarial training. When training the target model, both the question and the 3D point cloud are fed into the bias model simultaneously, and the output of the bias model is used to correct the loss of the target model. Our approach (Bias3D-VQA) is the first to focus on enhancing model robustness by addressing diverse biases in the 3D-VQA domain. Our target model demonstrates superior performance compared to state-of-the-art models, showing significant improvements in classification accuracy and text generation quality. Notably, on metrics such as EM@1 and CIDEr, our model even surpasses some models pre-trained with large additional datasets. The source code is available at https://github.com/coderr727/bias_3DQA

Abstract:
Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Also, the resulting transforms typically lack fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced 8×8 SBGFTs to the general case of N×N grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit rate saving of 6.23%, with only a marginal increase in average complexity. A MATLAB implementation of the proposed algorithm is available online at https://github.com/AlessandroGnutti/Variable-SBGFTs.
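
As background, a graph Fourier transform (GFT) takes the eigenvectors of a graph Laplacian as its basis. The sketch below builds that basis for a plain 4-connected N×N grid and applies it to a vectorized block. It is an illustrative baseline under stated assumptions: the additional symmetric connections that define SBGFTs are not reproduced here, and the helper names grid_laplacian and gft_basis are hypothetical.

```python
import numpy as np

def grid_laplacian(n):
    """Combinatorial Laplacian L = D - A of a 4-connected n x n grid graph."""
    size = n * n
    adj = np.zeros((size, size))
    for row in range(n):
        for col in range(n):
            node = row * n + col
            if col + 1 < n:  # edge to the right-hand neighbour
                adj[node, node + 1] = adj[node + 1, node] = 1.0
            if row + 1 < n:  # edge to the neighbour below
                adj[node, node + n] = adj[node + n, node] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def gft_basis(laplacian):
    """Return Laplacian eigenvectors as columns, sorted by increasing eigenvalue."""
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs

n = 4
basis = gft_basis(grid_laplacian(n))
block = np.arange(n * n, dtype=float).reshape(n, n)

coeffs = basis.T @ block.ravel()        # forward GFT of the vectorized block
recon = (basis @ coeffs).reshape(n, n)  # inverse GFT
```

The basis is orthonormal, so the transform is exactly invertible. A non-separable transform of this kind costs a dense (N²)×(N²) matrix product in general, which is the cost that structured designs such as symmetry-based ones aim to reduce.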

Abstract:
Video-based person re-identification (Re-ID) aims to match the target pedestrian from video sequences. Recent methods perform frame-level feature extraction followed by temporal aggregation to obtain video representations. However, they pay insufficient attention to the quality of frame-level features, which suffer from issues including multi-frame misalignment, partial occlusion and appearance confusion. Because people live in 3D space, 3D pedestrian representations can provide rich geometric information and shape cues that offer promising solutions to these challenges in video-based Re-ID. To mitigate these issues, this paper proposes a 3D-Aid Pedestrian Representation Learning (3DAPRL) network, which introduces the 3D modality to video-based Re-ID. Specifically, two novel modules are designed, i.e., the Cross-Modal Fusion (CMF) module and the Shape-aware Spatial-Temporal Interaction (SSTI) module, to enhance pedestrian representation learning. The CMF module generates discriminative fusion representations by utilizing 3D pedestrian data, while the SSTI module learns spatial-temporal 3D shape representations that are distinguishable for finding the target pedestrian in video scenarios. The features generated by both the CMF and SSTI modules contribute to the final video representation. Extensive experiments on four challenging video-based Re-ID datasets demonstrate that our 3DAPRL network reaches better performance than state-of-the-art methods.

Abstract:
Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve high accuracy and efficiency simultaneously. In this paper, we explore sub-8-bit integer training from its essence of gradient descent optimization. Our integer training framework includes two components: ShiftQuant to realize accurate gradient estimation, and L1 normalization to smooth the loss landscape. ShiftQuant attains performance that approaches the theoretical upper bound of group quantization. Furthermore, it liberates group quantization from inefficient memory rearrangement. The L1 normalization facilitates the implementation of fully quantized normalization layers with impressive convergence accuracy. Our method frees sub-8-bit integer training from pre-processing and supports general devices. This framework achieves negligible accuracy loss across various neural networks and tasks (0.92% on 4-bit ResNets, 0.61% on 6-bit Transformers). The prototypical implementation of ShiftQuant achieves more than 1.85×/15.3% performance improvement on CPU/GPU compared to its FP16 counterparts, and a 33.9% reduction in resource consumption on FPGA compared to the FP16 counterparts. The proposed fully-quantized L1 normalization layers achieve more than 35.54% improvement in throughput on CPU compared to traditional L2 normalization layers. Moreover, theoretical analysis verifies the advancement of our method.
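
To illustrate the contrast between L1 and L2 normalization, the sketch below normalizes each feature by its mean absolute deviation rather than its standard deviation, which avoids squares and square roots. This is an assumption about what an L1-style normalization layer could look like, not the paper's definition: the function names and the eps value are hypothetical, and no quantization is modeled.

```python
import numpy as np

def l1_normalize(x, eps=1e-5):
    """Normalize each feature by its mean absolute deviation over the batch.

    Assumed L1-style variant: only subtraction, absolute value and mean are used.
    """
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    scale = np.abs(centered).mean(axis=0, keepdims=True)
    return centered / (scale + eps)

def l2_normalize(x, eps=1e-5):
    """Conventional variant: normalize each feature by its standard deviation."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    scale = np.sqrt((centered ** 2).mean(axis=0, keepdims=True))
    return centered / (scale + eps)

batch = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [6.0, 60.0]])
out_l1 = l1_normalize(batch)
out_l2 = l2_normalize(batch)
```

Both variants produce zero-mean features; they differ only in the scale statistic, so the L1 output has unit mean absolute deviation while the L2 output has unit variance.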

Affiliations: College of Computer and Information Technology, the Three Gorges Digital Intelligence Institute, and Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang, China; School of Computer Science and the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, China

Abstract:
Recently, deep neural networks have been extensively explored in remote sensing image haze removal and achieved remarkable performance. However, existing methods fail to effectively fuse the features extracted from Convolutional Neural Networks (CNNs) and Transformer networks, leading to performance degradation. Moreover, most dehazing methods lack further exploration of the distinct properties of high- and low-frequency features, which are crucial for texture restoration and haze removal. To address these issues, we propose a Bidirectional-Modulation Frequency-Heterogeneous Network (BMFH-Net). Specifically, we propose a Differential-Expert Guided Bidirectional Modulation (DGBM) module that incorporates Differential experts and physical inversion models to exploit the complementarity of CNN-Transformer features and extract their latent haze-related physical characteristics, thereby enabling more effective bidirectional alignment. Furthermore, a Wavelet Frequency Heterogeneous Enhancement (WFHE) Module is designed to capture the most representative high-frequency features to refine image texture details, while enhancing the global perception of haze and reconstructing structural information during low-frequency processing. Experiments on challenging remote sensing image datasets demonstrate that our BMFH-Net outperforms several state-of-the-art haze removal methods. The code is released publicly at https://github.com/zqf2024/BMFH-Net

Abstract:
Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from classes unseen during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. The recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of SNNs in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves HM and ZSL accuracy by 26.2% and 39.9%.
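
The role of the firing threshold can be seen in a textbook discrete-time Leaky Integrate-and-Fire neuron. The sketch below is a generic illustration, not the MDST++ neuron: the function name, the decay value and the hard reset are assumptions, and the thresholds are simply passed in per time step because the abstract does not specify how they are derived from motion and context.

```python
def lif_neuron(inputs, thresholds, decay=0.5):
    """Simulate one leaky integrate-and-fire neuron with a per-step threshold.

    Returns a list of 0/1 spikes, one per input step.
    """
    membrane = 0.0
    spikes = []
    for current, threshold in zip(inputs, thresholds):
        membrane = decay * membrane + current  # leak, then integrate the input
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0                     # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

inputs = [0.6, 0.6, 0.6, 0.6]
fixed = lif_neuron(inputs, [1.0] * 4)    # -> [0, 0, 1, 0]
lowered = lif_neuron(inputs, [0.8] * 4)  # -> [0, 1, 0, 1]
```

With the same input, lowering the threshold doubles the spike count in this toy case, which shows how a threshold adjusted on the fly can make a neuron more or less sensitive to weak cues.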

Abstract:
The fundamental challenge in SAR target detection lies in developing discriminative, efficient, and robust representations of target characteristics within intricate non-cooperative environments. However, accurate target detection is impeded by factors including the sparse distribution and discrete features of the targets, as well as complex background interference. In this study, we propose a Gamma Diffusion Model Network with MambaSAR module (MaDiNet) for SAR target detection. Specifically, MaDiNet leverages the Gamma distribution to model the statistical characteristics of SAR images, and conceptualizes SAR target detection as the task of generating target bounding boxes in the image space. Furthermore, we design a MambaSAR module to capture intricate spatial structural information of targets and enhance the capability of the model to differentiate between targets and complex backgrounds. Experiments show that MaDiNet achieves state-of-the-art results on all evaluated multi-class target detection datasets, with a particularly notable improvement of 6.7% in mAP50 on the ODSOG-1.0 dataset, demonstrating the effectiveness of the proposed network. Code is available at https://github.com/JoyeZLearning/MaDiNet

Abstract:
Most semi-supervised domain adaptive object detection (SDAOD) methods are based on the mean-teacher framework. This framework primarily utilizes object-level features provided by pseudo-labels. However, the pseudo-labels generated by the Teacher model often contain notable noise, which limits the detector’s performance. Unlike pseudo-labels, domain labels are more precise and can offer accurate domain-level features. Motivated by this, we incorporate domain-level features into contrastive learning by designing different label assignment strategies and thus propose Contrastive-Domain Mean Teacher (CDMT) for SDAOD. Specifically, domain-level features include both inter-domain and intra-domain features. For inter-domain features, our strategy regards samples with the same domain label as positive pairs, enabling contrastive learning to extract global feature representations. In contrast, intra-domain features from the same image are treated as positive pairs, which helps contrastive learning to extract fine-grained feature representations. Thorough experiments demonstrate that CDMT achieves state-of-the-art performance on Foggy Cityscapes and Clipart when combined with recent Mean Teacher framework methods. Notably, for the more challenging foggiest images (’0.02’ split) based on the Probabilistic Teacher (PT) baseline, CDMT outperforms the previously best CMT by 4.1% on mAP, which shows its superiority on cross-domain detection tasks.
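
Treating samples that share a domain label as positive pairs can be sketched with a standard supervised-contrastive (InfoNCE-style) loss. The code below is a generic illustration of that idea only: the function name domain_contrastive_loss, the temperature value and the toy features are assumptions, and CDMT's actual label assignment strategies and feature extraction are not reproduced.

```python
import numpy as np

def domain_contrastive_loss(features, domain_labels, temperature=0.1):
    """Supervised-contrastive loss where same-domain samples are positives."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(domain_labels)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    not_self = ~np.eye(len(labels), dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples (self-similarity is masked out).
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - row_max - log_norm

    # Average log-probability of the positives for each anchor that has any.
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())

# Two tight clusters of toy features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = domain_contrastive_loss(feats, [0, 0, 1, 1])
loss_mismatched = domain_contrastive_loss(feats, [0, 1, 0, 1])
```

The loss is small when same-domain features already cluster together and large when the labels cut across the clusters, so minimizing it pulls same-label features toward each other.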

Abstract:
Implicit neural representation (INR) has emerged as a powerful representation for data (e.g., multispectral images and videos). Previously, most INR methods directly represent data in the original space. However, since different frequency components are mixed in the original space, it is difficult to capture these frequency components simultaneously and accurately. To alleviate this limitation, we suggest a new frequency-aware implicit neural representation (FA-INR) working in a physically interpretable and learnable frequency space by cleverly introducing an extra frequency dimension, which allows us to readily decouple and modulate different frequency components, leading to a more accurate characterization of different frequency components in a divide-and-conquer manner. Specifically, the FA-INR consists of two important modules, i.e., the frequency module and the integration module. In the frequency module, we propose a new low-rank tensor frequency function to compactly and continuously represent the latent frequency space. In the integration module, different frequency components are adaptively integrated back to the original space. Extensive experiments on various multi-dimensional data, including multispectral images, color videos, and light field data, demonstrate that the proposed FA-INR significantly outperforms the state-of-the-art INR methods, especially for characterizing high-frequency components (e.g., textures and edges).

Abstract:
As an emerging direction of machine learning, multi-target domain adaptation (MTDA) aims to address the challenges of adapting models to multiple target domains. However, existing studies often focus on single-target domain adaptation or fail to delve into the complexities associated with multiple target domains, so comprehensive research on MTDA is notably lacking. We therefore propose a cross-attention method with conditional matching for MTDA that aims to overcome the challenges posed by domain discrepancy, multi-target domain heterogeneity, and scalability. First, we design a novel multi-target conditional matching strategy that aligns the sample distributions by leveraging the nearest-neighbor principle. This strategy takes into account the unique characteristics of each target domain, facilitating adaptation across multiple domains. Furthermore, we use a transformer module and design a cross-attention mechanism to facilitate the alignment of distributions across the source and target domains, as well as among the target domains, thus mitigating discrepancies among multiple domains. By integrating the cross-attention mechanism into the training phase and attaining effective alignment of cross-domain distributions, we improve the adaptability and performance of the method. Experimental results demonstrate the effectiveness and superiority of our approach.

Affiliations: School of Computing and Artificial Intelligence, Institute of Artificial Intelligence, the National Engineering Laboratory of Integrated Transportation Big Data Application Technology, the Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, and the Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, Southwest Jiaotong University, Chengdu, China; College of Computer and Information Science, Chongqing Normal University, Chongqing, China; Department of Computer Science and Information Engineering, Asia University, Wufeng, Taichung, Taiwan

Abstract:
Multiview data possess different discriminability in different views, which is challenging to capture but crucial for a feature selection model. Multiscale information, which represents vertical exploration in each view, is vital for further mining traits implied in multiview data. However, most existing studies neglect these beneficial multi-granulation characteristics. This study is the first to embed multiscale information into the sparse learning framework for multiview feature selection. A class-specific discriminability and multiscale information-based multiview feature selection (CDMIMFS) method is proposed. It uses fuzzy rough set theory to explore the fuzzy and uncertain class-specific discriminability, which inherently differs across views. It relaxes the overly strict requirement for complete consistency in multiscale information systems to make a trade-off, which further enhances discriminative feature selection. An effective iterative algorithm is proposed to solve the optimization problem. Both a theoretical proof and an experimental demonstration of convergence are provided. Comprehensive experiments compare CDMIMFS with state-of-the-art algorithms. Results on different evaluation metrics exhibit the advantages of the proposed method.

Abstract:
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. Such ability stems from their capacity to identify common features shared between new and previously seen images while disregarding distractions such as background variations. However, for artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge. In this paper, we propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches and encodes them using a pre-trained Vision Transformer (ViT) architecture. Specifically, we swap the class (CLS) token and patch tokens between the support and query sets to obtain mutual attention, which enables each set to focus on the most useful information. This facilitates the strengthening of intra-class representations and promotes closer proximity between instances of the same class. For implementation, we adopt the ViT-based network architecture and utilize pre-trained model parameters obtained through self-supervision. By leveraging Masked Image Modeling as a self-supervised training task for pre-training, the pre-trained model yields semantically meaningful representations while successfully avoiding supervision collapse. We then employ a meta-learning method to fine-tune the last several layers and CLS token modules. Our strategy significantly reduces the number of parameters that require fine-tuning while effectively utilizing the capability of the pre-trained model. Extensive experiments show that our framework is simple, effective and computationally efficient, achieving superior performance compared to state-of-the-art baselines on five popular few-shot classification benchmarks under the 5-shot and 1-shot scenarios.

Abstract:
The proliferation of video applications has exacerbated digital piracy issues, notably evidenced by unauthorized digital video editing and camcording. While existing research has introduced robust watermarking methods to safeguard video copyrights, these methods often address specific attack scenarios, limiting their overall efficacy. To address this problem, we propose a robust video watermarking scheme based on Frequency-Spherical Cavity Transformation (FSCT), offering a comprehensive solution for both digital editing and camcording processes. Our approach treats the spatial and temporal aspects of the video as a 3D cube, utilizing FSCT to ensure temporal translational invariance and resilience against spatial attacks. To mitigate artifacts induced by motion characteristics, we analyze the properties of FSCT and introduce a visual quality optimization strategy, enhancing imperceptibility while ensuring robustness. Moreover, during the extraction process, the watermark can be successfully retrieved from a camcorded video using only a specified time interval, eliminating the need for temporal synchronization. Extensive experiments show that the proposed method exhibits superior robustness against digital editing and camcording compared to existing methods.

Abstract:
Anticipating future actions in daily life videos is crucial for seamless human-machine collaboration. However, accurately predicting these actions is challenging due to the inherent uncertainty and non-determinism of future events. To address this, we propose the uncertainty-aware mixture-of-experts framework for action anticipation (AntMoE), which employs multiple anticipation experts to model diverse video evolution patterns through learnable expert embeddings. These anticipation experts generate diverse predictions by integrating the top-k semantically similar observed video frames related to the current predicted feature representation, along with their corresponding expert embeddings. An anticipation router then aggregates these predictions based on the relationship between the current feature representation and all expert embeddings. To enhance the effectiveness of AntMoE, we introduce an expert regularization loss with three components: orthogonal loss promotes orthogonality among expert embeddings; expert balance loss ensures equal activation of all experts during training; and stability loss encourages the generation of numerically stable aggregation weights. Additionally, we incorporate an anticipation ranking loss function that aligns the model’s confidence across varying anticipation time durations with the ground-truth ranking order, where a shorter anticipation time length corresponds to a higher confidence level. Experimental results across multiple benchmarks demonstrate that our method achieves remarkable anticipation performance.
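
Two of the regularization terms described above have common textbook forms, sketched below under stated assumptions. The orthogonality term penalizes the off-diagonal entries of the Gram matrix of normalized expert embeddings, and the balance term penalizes deviation of average routing weights from a uniform distribution. These formulas, the function names, and the toy inputs are illustrative guesses rather than AntMoE's actual losses, and the stability term is omitted.

```python
import numpy as np

def orthogonal_loss(expert_embeddings):
    """Squared Frobenius norm of (E E^T - I) for row-normalized embeddings E."""
    e = np.asarray(expert_embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    gram = e @ e.T
    return float(((gram - np.eye(len(e))) ** 2).sum())

def balance_loss(router_weights):
    """Squared deviation of mean expert usage from the uniform distribution.

    `router_weights` has shape (num_samples, num_experts); rows sum to 1.
    """
    w = np.asarray(router_weights, dtype=np.float64)
    usage = w.mean(axis=0)
    uniform = 1.0 / w.shape[1]
    return float(((usage - uniform) ** 2).sum())

orthogonal_experts = np.eye(3)
duplicate_experts = np.array([[1.0, 0.0], [1.0, 0.0]])
ortho_zero = orthogonal_loss(orthogonal_experts)
ortho_dup = orthogonal_loss(duplicate_experts)

even_routing = np.array([[0.5, 0.5], [0.5, 0.5]])
collapsed_routing = np.array([[1.0, 0.0], [1.0, 0.0]])
balance_even = balance_loss(even_routing)
balance_collapsed = balance_loss(collapsed_routing)
```

Both terms are zero in the ideal case (mutually orthogonal experts, evenly used) and grow when experts duplicate each other or when the router collapses onto a single expert.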

Abstract:
Depth reconstruction for transparent objects is a challenging problem, where surface feature matching methods are hindered by complex refraction and reflection. Existing learning-based reconstruction methods that regress or complete depth maps for entire scenes are data-costly and generalize poorly to different environments. To solve this problem, we propose a novel transparent object reconstruction pipeline with a guided object-centric 3D diffusion model. Specifically, we train an unconditional 3D diffusion model with only 3D point cloud data. To control the output of the diffusion model, we design a silhouette-based guidance function and a completion framework with outline points for each step of the diffusion process. At each step, we design a re-projection pipeline to estimate a silhouette with uncertainty and constrain the partially-noised point cloud to align with it. We further apply stereo matching to compute the outline points in the stereo silhouettes and use a completion framework to fuse them with the partially-denoised point cloud. Finally, we transform the transparent objects to the world frame by applying the transformation from pose estimation. Experimental results show that our method can achieve state-of-the-art performance for transparent object depth reconstruction compared to existing depth regression and completion methods.

Abstract:
Incremental Learning (IL) aims to learn deep models on sequential tasks continually, where each new task includes a batch of new classes and deep models have no access to task ID information at inference time. Recent large pre-trained models (PTMs) have achieved outstanding performance with prompt techniques in practical IL without old samples (rehearsal-free) and under a memory constraint (memory-constrained), through two families of methods: prompt-extending and prompt-fixed. However, prompt-extending methods need a large memory buffer to maintain an ever-expanding prompt pool and face an additional, challenging prompt selection problem. Prompt-fixed methods only learn a single set of prompts on one of the incremental tasks and cannot handle all the incremental tasks effectively. To achieve a good balance between the memory cost and the performance on all the tasks, we propose a Parameter-Efficient Cross-Task Prompt (PECTP) framework with a Prompt Retention Module (PRM) and a classifier Head Retention Module (HRM). To make the final learned prompts effective on all incremental tasks, PRM constrains the evolution of cross-task prompts’ parameters from Outer Prompt Granularity and Inner Prompt Granularity. In addition, we employ HRM to inherit old knowledge in the previously learned classifier heads to facilitate the cross-task prompts’ generalization ability. Extensive experiments show the effectiveness of our method. The source code is available at https://github.com/RAIAN08/PECTP

Abstract:
Timestamp-supervised action segmentation aims to segment and classify actions in untrimmed videos with a random frame annotated per action. Precisely localizing action boundaries from timestamp annotations is crucial for this setting, as it enables generating framewise pseudo-labels and applying the well-explored fully-supervised training. However, prevailing methods struggle with intrinsic uncertainty in boundary localization due to less discriminative features in action-transiting regions. This imprecise boundary estimation significantly reduces the stability and reliability of the generated pseudo-labels in ambiguous action-transiting regions, consequently resulting in performance deterioration of the trained segmentation models. In our paper, we introduce the boundary voting network that mitigates feature ambiguity by hierarchically propagating video-level global prior knowledge into local action-transiting regions. By generating key action representations as votes throughout the video and targeting action-transiting regions, all votes collaboratively contribute to action-transiting feature enhancement and boundary localization refinement. Extensive experiments demonstrate the effectiveness of our method on GTEA, 50Salads, and Breakfast datasets.

Abstract:
Referring multi-object tracking (RMOT) aims to identify specific targets based on sentence descriptions. To enhance multi-modal learning, previous works typically rely on a simple fusion module at early or late stages. However, those methods frequently underutilize textual semantics and struggle to model the relationships between region-level features and word-level features. To address these limitations, we propose CGATracker, a correlation-aware graph alignment method for RMOT, which facilitates precise relationship modeling through relational scoring. Specifically, we design a Language-driven Relational Alignment (LRA) module, which establishes two connection graphs to generate positive and negative samples for the visual-textual alignment. Additionally, to effectively leverage referring information, we introduce a Semantic Clarify Booster (SCBooster) module based on a semantic infusion mechanism and a bias-aware verification mechanism for interactions with different modalities. Moreover, by designing a Multi-level Cross-modal Fusion (MCF) module, our method aggregates contextual features at multiple depths to enable the creation of the enriched correlation-aware graph. Extensive experiments conducted on the Refer-KITTI and Refer-KITTI-V2 datasets demonstrate the effectiveness of CGATracker.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn novel classes from a few samples without forgetting previous knowledge. Adapting directly to limited novel data typically results in significant forgetting of base class knowledge. Consequently, prevailing FSCIL methods are devoted to training a strong initial model that can be frozen in incremental sessions. However, these works suffer from poor generalization: they benefit mainly from base class performance, yet underperform on novel classes. To alleviate this issue, we design a two-stage training framework to simultaneously enhance generalization for novel classes and maintain base class discrimination. In the first stage, an asymmetric supervised contrastive learning (AsyCon) algorithm is proposed. AsyCon introduces a predicted feature to achieve an asymmetric alignment of positive pairs. It alleviates over-similarity within positive features, allowing the model to better transfer to new classes in incremental sessions. In the second stage, the model is fine-tuned to improve its performance on base classes. To maintain the generalization obtained in the first stage, we employ an L2 normalized regularization (LR) to keep the features consistent with those of the first-stage model. The fine-tuned model, termed AsyCLR, effectively balances generalization and discrimination, significantly outperforming existing FSCIL works, especially in novel class accuracy. Experiments on CUB200, CIFAR100, and mini-ImageNet verify the effectiveness of our method. Additionally, our method performs well in the standard few-shot recognition scenario due to its strong generalization ability. Our code is available at https://github.com/APORduo/AsyCLR

Abstract:
Existing underwater image enhancement (UIE) methods typically prioritize improving image quality at the expense of algorithmic efficiency. In this paper, we propose a fusion-based, channel-wise isotropic convergent UIE method designed for real-time performance. The proposed approach comprises three key modules: 1) a non-linear transformation module that corrects color casts and aligns the pixel distribution with the gray-world assumption (GWA); 2) a channel-wise isotropic convergence scheme that reduces intensity distribution disparities across channels, promoting balanced convergence; and 3) a patch-based enhancement strategy that divides the image into smaller patches to better capture local features and improve adaptability to non-uniform degradation. Moreover, certain critical steps in our method are optimized to achieve O(1) time complexity, allowing it to meet real-time requirements. Extensive experiments validate the effectiveness of each module in the proposed method, showcasing its superiority when compared to the existing state-of-the-art (SOTA) approaches. Code has been released at https://github.com/JohnChenS/FCICE_UnderwaterImageEnhancement
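To make the gray-world assumption (GWA) mentioned above concrete, here is a minimal sketch of classic gray-world white balancing in plain Python. It is the textbook linear version, not the paper's non-linear transformation module; the function name `gray_world_balance` and the pixel representation are illustrative assumptions.

```python
def gray_world_balance(pixels):
    """Scale each channel so its mean matches the mean over all channels.

    `pixels` is a list of (r, g, b) tuples with values in [0, 255].
    """
    n = len(pixels)
    if n == 0:
        return []
    # Per-channel means over the whole image.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Gray-world target: the average of the three channel means.
    gray = sum(means) / 3.0
    # One gain per channel; leave an all-zero channel untouched.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    # Apply the gains and clip to the valid range.
    return [
        tuple(min(255.0, p[c] * gains[c]) for c in range(3))
        for p in pixels
    ]
```

For an image with a weak red channel, as is typical underwater, the red gain is above 1 and the blue gain below 1, so the three channel means coincide after correction unless clipping occurs.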

Abstract:
The Mamba architecture achieves the same performance as attention mechanisms with linear complexity, leading to significant progress in remote sensing land cover classification. However, existing Mamba methods rarely leverage the representational complementarity and consistency between different modalities, resulting in challenges such as incomplete fusion. To address these issues, we propose Semi-Mamba, a novel semi-supervised framework specifically designed for high-dimensional multi-modal data fusion. We introduce the Mamba Cross-Modality Fusion Module, which enables cross-modal learning of temporal features through state-space model interactions and smooth integration of input matrices, enhancing the fusion of richer feature representations. Additionally, to tackle the inherent difficulty of acquiring pixel-level annotations in remote sensing datasets, we introduce a multi-modal semi-supervised mechanism. This mechanism utilizes cross-modal supervision between different modalities to maximize data utilization and improve learning efficiency. It effectively enables joint training on both labeled and unlabeled data without relying on pseudo-labels. We integrate these innovations into a unified end-to-end framework. Compared to state-of-the-art CNN and Transformer-based architectures, our framework shows a significant improvement of over 3.12%, setting a new benchmark for semi-supervised multi-modal data fusion. The code has been open-sourced at https://github.com/LDXDU/Semi_Mamba_RS.

Abstract:
The dim shooting environment and light scattering and absorption frequently result in degraded underwater images. The images are characterized by uneven brightness, low contrast, color deterioration, and blurred details. Existing underwater image enhancement methods excel in full-reference and non-reference metrics, yet may fail to align with human visual tendencies. To make the restored images more consistent with natural visual effects, an underwater image enhancement framework named BRIUIE is proposed. BRIUIE draws inspiration from the morphology and functions of various cell layers in the vertebrate retina. Following the visual transmission mechanisms of retinal signals, image brightness is balanced by simulating the feedback and dynamic regulation processes of horizontal cells in response to illumination variation. Meanwhile, simulating the center-surround receptive fields of bipolar and ganglion cells and implementing the color opponent mechanism effectively mitigate color distortion and low contrast. The designed multi-scale feature fusion module facilitates the complementary advantage of the ON and OFF visual pathways of ganglion cells, employing a contrastive learning strategy to prevent overfitting because of simple consistency loss. Comprehensive full-/non-reference experiments demonstrate the proposed BRIUIE outperforms other SOTA methods in quantitative evaluations, while also delivering qualitative results that closely align with human visual assessment standards.

Abstract:
Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object’s bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture-represented line. Extensive experiments on the YouRefIt dataset demonstrate the efficacy of our gesture information understanding method in significantly improving task performance. Our model achieves 76.3% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.

Abstract:
Accurate data association plays a crucial role in Multi-Object Tracking (MOT) as it helps reduce confusion such as identity switches and assignment errors. However, many existing advanced methods often overlook the diversity among trajectories and the ambiguity and conflicts present in various types of cues. Consequently, when performing simple global data association, confusion arises between detections, trajectories, and associations. To address this problem, we propose a simple, versatile, and highly interpretable Deconfused Data Association Framework (DDAF). DDAF decomposes the traditional association problem into multiple sub-problems using a series of non-learnable modules, and selectively resolves confusion in each sub-problem by strategically utilizing new cues. Building upon DDAF, we design a powerful multi-object tracker named DfTrack, which specifically targets confusion in MOT. Furthermore, we discuss different specific implementations of DDAF to tackle challenging environments characterized by low frame rate, camera motion, and cross-domain scenarios. Correspondingly, we also develop several variants of DfTrack, demonstrating the remarkable scalability and adaptability of DDAF. Extensive experiments conducted on the MOT17, MOT20, and DanceTrack datasets demonstrate that DDAF significantly outperforms simple global association methods, and its variants can adapt to various challenging environments. Furthermore, DfTrack achieves state-of-the-art performance on multiple datasets, with HOTA of 65.2%, 63.9%, and 64.4% on MOT17, MOT20, and DanceTrack, respectively. The DfTrack-Hybrid variant further improves the performance on this basis. These results validate that our DDAF can effectively decompose and resolve various confusion in global association without any learning cost.

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to combine known attributes and objects as primitives for recognizing previously unseen attribute-object pairs. The essence of CZSL lies in separately modeling attributes and objects for transfer to unseen compositions. Since primitives exhibit weak entanglement in language features, Vision-Language Models (VLMs) based CZSL methods focus on prompt learning to assist primitive disentanglement in visual spaces. However, these approaches overlook the polysemy of primitives across different compositional contexts, leading the models to overly rely on seen compositions, thereby limiting their generalization capability. To tackle this issue, we propose a novel primitive disambiguation model named DPR, aiming to enhance CZSL performance by acquiring diversified primitive representations through joint-prompt learning. Specifically, DPR introduces Knowledge-Aware Hard Prompting (KAHP), which generates diverse descriptive sentences as supplementary prompts for each composition from the rich knowledge of Large Language Models (LLMs). Simultaneously, we design Sample-Adaptive Soft Prompting (SASP), which employs a lightweight neural network to generate input-guided tokens for each image. SASP ensures enhanced generalization capability in DPR’s primitive representations. We validate the effectiveness of our model through extensive experiments and achieve state-of-the-art performance on three CZSL benchmark datasets in both closed- and open-world settings.

Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi’an, China; National-Local Joint Engineering Research Center of Biodiagnosis and Biotherapy, The Second Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; College of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an, China

Abstract:
Due to the lack of appropriate priors, generating the content of dark regions remains a challenge in low-light image enhancement tasks. Currently, diffusion models employ robust image generation capabilities for enhancing low-light images. However, diffusion models require multiple iterations at the image feature level to generate details and content, which limits the speed. Moreover, diffusion-based methods tend to generate unexpected artifacts in the degraded regions. To address these issues, we propose a Frequency Priors-guided Image Enhancement (FPIE) network, including a frequency prior generation network and an image restoration network. FPIE significantly accelerates inference by learning abstract priors with frequency-domain constraints. Concretely, to learn compact priors in the frequency domain, we introduce a joint training approach for the prior generation and restoration models to constrain the distribution of priors. Furthermore, to better utilize frequency-domain features for enhancing the network’s generation capabilities, a wavelet-based transformer block is introduced to produce intricate details and avoid artifacts in the output. Extensive experimental results on commonly used benchmarks demonstrate that our approach achieves state-of-the-art performance and generalizes well to real-world images.

Abstract:
We introduce a new task, Open-set Mixed Domain Adaptation (OSMDA), which considers the potential mixture of multiple distributions in the target domains, thereby better simulating real-world scenarios. To tackle the semantic ambiguity arising from multiple domains, our key idea is that the linguistic representation can serve as a universal descriptor for samples of the same category across various domains. We thus propose a more practical framework for cross-domain recognition via visual-linguistic guidance. On the other hand, the presence of multiple domains also poses a new challenge in classifying both known and unknown categories. To combat this issue, we further introduce a visual-linguistic focal evolving approach to gradually enhance the classification ability of a known/unknown binary classifier from two aspects. Specifically, we start by identifying highly confident focal samples to expand the pool of known samples by incorporating those from different domains. Then, we amplify the feature discrepancy between known and unknown samples through dynamic entropy evolving via an adaptive entropy min/max game, enabling us to accurately identify possible unknown samples in a gradual manner. Extensive experiments demonstrate our method’s superiority over state-of-the-art methods in both open-set and open-set mixed domain adaptation.

Abstract:
Event data can asynchronously capture variations in light intensity, thereby implicitly providing valuable complementary cues for RGB-Event tracking. Existing methods typically employ a direct interaction mechanism to fuse RGB and event data. However, due to differences in imaging mechanisms, the representational disparity between these two data types is not fixed, which can lead to tracking failures in certain challenging scenarios. To address this issue, we propose a novel prior knowledge-driven hybrid prompter learning (HPL) framework for RGB-Event tracking. Specifically, we develop a frame-event hybrid prompter that leverages prior tracking knowledge from the foundation model as intermediate modal support to mitigate the heterogeneity between RGB and event data. By leveraging its rich prior tracking knowledge, the intermediate modality reduces the gap between the dense RGB and sparse event data interactions, effectively guiding complementary learning between modalities. Meanwhile, to mitigate the internal learning disparities between the lightweight hybrid prompter and the deep transformer model, we introduce a pseudo-prompt learning strategy that lies between full fine-tuning and partial fine-tuning. This strategy adopts a divide-and-conquer approach to assign different learning rates to modules with distinct functions, effectively reducing the dominant influence of RGB information in complex scenarios. Extensive experiments conducted on two public RGB-Event tracking datasets show that the proposed HPL outperforms state-of-the-art tracking methods.

Abstract:
Hyperspectral image (HSI) classification involves assigning unique labels to each pixel to identify various land cover categories. While deep classifiers have achieved high predictive accuracy in this field, they lack the ability to rigorously quantify confidence in their predictions. This limitation restricts their application in critical contexts where the cost of prediction errors is significant, as quantifying the uncertainty of model predictions is crucial for the safe deployment of predictive models. To address this limitation, a rigorous theoretical proof is presented first, which demonstrates the validity of Conformal Prediction, an emerging uncertainty quantification technique, in the context of HSI classification. Building on this foundation, a conformal procedure is designed to equip any pre-trained HSI classifier with trustworthy prediction sets, ensuring that the true labels are included with a user-defined probability (e.g., 95%). Furthermore, a novel framework of Conformal Prediction specifically designed for HSI data, called Spatial-Aware Conformal Prediction (SACP), is proposed. This framework integrates essential spatial information of HSI by aggregating the non-conformity scores of pixels with high spatial correlation, effectively improving the statistical efficiency of prediction sets. Both theoretical and empirical results validate the effectiveness of the proposed approaches. The source code is available at https://github.com/J4ckLiu/SACP
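As a rough illustration of how a conformal procedure wraps a pre-trained classifier, the sketch below implements standard split conformal prediction in plain Python, using one minus the softmax probability of a class as the non-conformity score. This score choice and the names `conformal_threshold` and `prediction_set` are assumptions for illustration. The spatial aggregation step that distinguishes SACP is not shown, since the abstract does not specify its form.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the score threshold calibrated for 1 - alpha coverage.

    `cal_probs` holds one probability vector per calibration pixel and
    `cal_labels` the matching true class indices.
    """
    # Non-conformity score of the true class for every calibration pixel.
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every class.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, q_hat):
    """Return all classes whose non-conformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]
```

With `alpha = 0.05`, the returned sets contain the true label with probability at least 95% under the usual exchangeability assumption between calibration and test pixels. Smaller sets at the same coverage level are what the abstract refers to as improved statistical efficiency.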

Abstract:
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pre-trained language-image models are not confident enough when generalizing to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Building on the well-established CLIP model, we introduce view selection on the vision side that minimizes entropy to identify the most informative views of a 3D shape. On the textual side, hierarchical prompts composed of hand-crafted and GPT-generated prompts are proposed to refine predictions. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Extensive experiments demonstrate the effectiveness of the proposed modules for zero-shot 3D shape recognition. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.
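The entropy-minimizing view selection can be sketched as follows, assuming each rendered view has already been turned into a class-probability vector (for example by a softmax over image-text similarities). The function names and the top-k selection rule are hypothetical; the abstract states only that views are chosen by minimizing entropy.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_views(view_probs, k):
    """Return the indices of the k views with the lowest prediction entropy.

    `view_probs` holds one class-probability vector per rendered view.
    """
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return ranked[:k]
```

A low-entropy view is one on which the model's prediction is sharply peaked, which is the sense in which such a view is treated as informative here.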

Abstract:
Image super-resolution (SR) has significantly advanced through the adoption of Transformer architectures. However, conventional techniques aimed at enlarging the self-attention window to capture broader contexts come with inherent drawbacks, especially the significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive field (ERF) and the intermediate feature diversity. We demonstrate that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce the Multi-Range Attention Transformer (MAT) for SR tasks. MAT leverages the computational advantages inherent in the dilation operation, in conjunction with the self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Combined with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model’s ability for multi-range representation learning. Comprehensive experiments show that our MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (~3.3× faster than SRFormer-light). The code is available at https://github.com/stella-von/MAT.

Abstract:
In underwater environments, the absorption and scattering of light often result in various types of degradation in captured images, including color cast, low contrast, low brightness, and blurriness. These undesirable effects pose significant challenges for both underwater photography and downstream tasks such as object detection, recognition, and navigation. To address these challenges, we propose a novel end-to-end underwater image enhancement (UIE) network via the multistage and mixed attention mechanism and a residual-based feature refinement module, called ERD. Specifically, our network includes an encoder stage for extracting features from input underwater images with channel, spatial, and patch attention modules to emphasize degraded channels and regions for restoration; a residual stage for further purification of informative features through sufficient feature learning; and a decoder stage for effective image reconstruction. Inspired by visual perception mechanism, we design the frequency domain loss and edge details loss to retain more high-frequency information and object details while ensuring that the enhanced image approximates the reference image in terms of color tone while preserving content and structure. To comprehensively evaluate our proposed UIE model, we also curated three additional underwater image datasets through online collection and generation using Cycle-GAN. Rigorous experiments conducted on a total of eight underwater image datasets demonstrate that the proposed ERD model outperforms state-of-the-art methods in enhancing both real-world and generated underwater images. Our code and datasets are available at https://github.com/fansuregrin/ERD.

Abstract:
Multi-contrast MRI super-resolution (SR) aims to restore high-resolution target image from low-resolution one, where reference image from another contrast is used to promote this task. To better meet clinical needs, current studies mainly focus on developing arbitrary-scale MRI SR solutions rather than fixed-scale ones. However, existing arbitrary-scale SR methods still suffer from the following two issues: 1) They typically rely on fixed convolutions to learn multi-contrast features, struggling to handle the feature transformations under varying scales and input image pairs, thus limiting their representation ability. 2) They simply combine the multi-contrast features as prior information, failing to fully exploit the complementary information in the texture-rich reference images. To address these issues, we propose a Dynamic Implicit Network (DINet) for multi-contrast MRI arbitrary-scale SR. DINet offers several key advantages. First, the scale-adaptive dynamic convolution facilitates dynamic feature learning based on scale factors and input image pairs, significantly enhancing the representation ability of multi-contrast features. Second, the dual-branch implicit attention enables arbitrary-scale upsampling of MR images through implicit neural representation. Following this, we propose the modulation-then-fusion block to adaptively align and fuse multi-contrast features, effectively incorporating complementary details from reference images into the target images. By jointly combining the above-mentioned modules, our proposed DINet achieves superior MRI SR performance at arbitrary scales. Extensive experiments on three datasets demonstrate that DINet significantly outperforms state-of-the-art methods, highlighting its potential for clinical applications. The code is available at https://github.com/weijinbao1998/DINet.

Abstract:
Animal pose estimation is often constrained by the scarcity of annotations and the diversity of scenarios and species. The pseudo-label generation based unsupervised domain adaptation paradigm, which discriminates the predicted keypoints of unlabeled data based on the skeleton position consistency, has demonstrated effectiveness for such problems. However, existing methods generate pseudo-labels with massive false positives, because they cannot effectively distinguish sample pairs with the same errors. In this study, we propose a cross-domain animal pose estimation model from a novel perspective of skeleton anomaly learning. We construct a graph contrastive learning mechanism to acquire skeleton anomaly-aware knowledge, which enables the generation of accurate pseudo-labels for the target domain and imposes a graph constraint on unlabeled data. In addition, a skeleton anomaly-feedback based domain adaptation framework is designed to facilitate implicit alignment of object-specific features and cross-domain joint training. Besides, we propose a novel rat pose dataset named UDARP-9.4K to address the lack of small-sized animal pose datasets encompassing diverse experimental scenarios. The related datasets are reviewed and evaluated in detail. Extensive experiments are conducted on UDARP-9.4K and two public datasets to demonstrate the superiority of the proposed model in cross-scenario and cross-species animal pose estimation tasks. Further analysis reveals the effectiveness of the proposed model for skeleton structure feature learning. The UDARP-9.4K dataset is available at https://github.com/CSDLLab/UDARP-9.4K-Dataset.

Abstract:
Semi-supervised video anomaly detection (SS-VAD) is essential for intelligent monitoring. However, collecting large-scale surveillance videos from various organizations raises significant privacy concerns regarding sensitive information. Federated learning offers a promising solution by enabling distributed learning among multiple participants while safeguarding privacy. Despite its potential, applying federated learning to SS-VAD remains unexplored due to the inherent challenges of this task. In this paper, we address this task by proposing DLPP, a novel distributed learning framework for privacy-preserving SS-VAD. It addresses the issue of statistical heterogeneity among data from different participants in real-world federated SS-VAD applications, particularly focusing on non-independent and identically distributed (non-IID) data and imbalanced data volumes. Specifically, it addresses these challenges through two key innovations: 1) For the non-IID data challenge, it dynamically updates the client model based on the overall gradient at the client in the previous training round and the degree of divergence between the server model and the client model. In this way, it can better adapt the server model to each client and promote convergence. 2) For the imbalanced data volume challenge, it adaptively allocates client aggregation weights by comprehensively considering the data volumes, model quality, and learning efficiency of clients. As a result, a more robust server model is obtained and model bias is reduced. We conduct extensive experiments to evaluate the performance of DLPP on benchmark datasets by partitioning data to simulate various degrees of non-IID environments. The results show that DLPP significantly outperforms both Baseline and SOTA methods, achieving up to a 3.89% improvement, and its communication efficiency is 3x better than FedAvg.
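For orientation, the sketch below shows the generic server-side step that any such weighting scheme plugs into: a weighted average of client parameter vectors. With the raw weights set to client data volumes it reduces to FedAvg-style aggregation. How DLPP combines data volume, model quality, and learning efficiency into a single weight is not stated in the abstract, so the weights are left as an input, and the function name `aggregate` is a hypothetical choice.

```python
def aggregate(client_params, raw_weights):
    """Return the weighted average of client parameter vectors.

    `client_params` is a list of equally long lists of floats, one per client.
    `raw_weights` holds one non-negative weight per client. The weights are
    normalized internally, so they need not sum to one.
    """
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("aggregation weights must have a positive sum")
    dim = len(client_params[0])
    return [
        sum(w * params[j] for w, params in zip(raw_weights, client_params)) / total
        for j in range(dim)
    ]
```

For example, `aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])` weights the second client three times as heavily as the first, which mimics a client holding three times as much data.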

Abstract:
Utilizing the high-level semantic information of language to compensate for the limitations of vision information is a highly regarded approach in single-object tracking. However, most existing vision-language (VL) trackers employ full-parameter fine-tuning, which can easily lead to catastrophic forgetting. Therefore, they fail to fully exploit the prior knowledge of pre-trained models from upstream tasks, resulting in unsatisfactory tracking performance. To alleviate this problem, we propose a simple yet effective Vision-Language Tracking pipeline based on a Mamba Adapter, named MAVLT, which adopts the idea of parameter-efficient fine-tuning (PEFT) to realize the interaction between vision and language modalities. This approach offers the following advantages: 1) The knowledge of the upstream pre-trained model is efficiently inherited by freezing its parameters. This ensures that the VL tracking framework only learns the modules for vision and language interaction, with a focus on the fusion between modalities. 2) The modal interaction between language and vision encoders is flexibly bridged in each encoder layer via the proposed Mamba adapter, enabling efficient interaction of visual and language information at multiple levels. Extensive experiments on five popular vision-language tracking benchmarks validate the effectiveness of the proposed MAVLT. In particular, MAVLT achieves a 73.4% AUC score on the LaSOT benchmark with only 0.18% (0.32M) of the total parameters updated. Code and models are available at https://github.com/GXNU-ZhongLab/MAVLT.

Abstract:
Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.

Abstract:
Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we find that this paradigm suffers from an inherent intra-modal semantic gap in the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, are confined to global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to extend the robust cross-modal bridge of CLIP to a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model achieves new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. The source code is released at https://github.com/gjc0824/FSGR.

Abstract:
Empirical evidence has demonstrated that learning-based image compression can outperform classical compression frameworks. This has led to the ongoing standardization of learning-based image codecs, namely Joint Photographic Experts Group (JPEG) AI. The objective of JPEG AI is to enhance compression efficiency and provide a software- and hardware-friendly solution. Based on our research, JPEG AI represents the first standardization that can facilitate the implementation of a learned image codec on a mobile device. This article presents an overview of the variable rate coding functionality in JPEG AI, which includes three variable rate adaptations: a three-dimensional quality map, a fast bit rate matching algorithm, and a training strategy. The variable rate adaptations offer a continuous rate function up to 2.0 bpp, exhibiting a high level of performance, a flexible bit allocation between different color components, and a region of interest function for the specified use case. The evaluation of performance encompasses both objective and subjective results. With regard to the objective bit rate matching, the main profile with low complexity yielded a 13.1% BD-rate gain over VVC intra, while the high profile with high complexity achieved a 19.2% BD-rate gain over VVC intra. The BD-rate result is calculated as the mean of the seven perceptual metrics defined in the JPEG AI common test conditions. With respect to subjective results, the example of improving the quality of the region of interest is illustrated.

Abstract:
Video summarization aims to extract the most important information from a source video while still retaining its primary content. In practical applications, unsupervised video summarizers are valued for their flexibility, since they do not require annotated data. However, they still search for fixed rules that determine how essential each frame must be to be selected for the summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer termed Hybrid Siamese Masked Autoencoders (H-SMAE) from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and a Shot Diversity Enhancer (SDE). MV-SMAE recovers the masked shots from the original frame features and three unmasked shot subsets with elaborate Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture to model prior references to guide the reconstruction of masked shots. In addition, SDE improves the diversity of the generated summary by minimizing the repelling loss among selected shots. Afterward, these two modules are fused, followed by a 0-1 knapsack algorithm, to produce a video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly-supervised methods, and even generates comparable results with several excellent supervised methods. The source code of H-SMAE is available at https://github.com/wzq0214/H-SMAE.
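
The final 0-1 knapsack step mentioned above can be illustrated with a minimal sketch. It assumes each shot already has an importance score and an integer length in frames, and that the summary has a fixed frame budget; the function name `select_shots` and the example numbers are hypothetical and are not taken from the H-SMAE code.

```python
from typing import List, Sequence


def select_shots(scores: Sequence[float], lengths: Sequence[int], budget: int) -> List[int]:
    """Pick shot indices that maximize total score with total length <= budget."""
    n = len(scores)
    # best[i][c] = best total score using only the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # option 2: take shot i-1
    # Walk back through the table to recover which shots were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    # Four shots with made-up scores and lengths, and a 90-frame budget.
    print(select_shots([0.9, 0.4, 0.7, 0.2], [40, 30, 50, 10], 90))
```

The dynamic program runs in O(n x budget) time, which is practical when shot lengths are measured in frames and the budget is a fraction of the video length.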

Abstract:
Image-text matching is a significant technology for vision-language tasks, as it bridges the semantic gap between visual and text modalities. Although existing methods have achieved remarkable progress, high-dimensional embeddings or ensemble methods are often used to achieve sufficiently good recall or accuracy, which significantly increases the computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment; however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) the distillation from teacher model to student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results; and 2) distilling knowledge from each modality independently cannot ensure the preservation of the cross-modal alignment established in the original embeddings, which can lead to the compressed ones failing to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with a shared backbone to compress the different granularity information into a low-dimensional embedding, facilitating the integration of cross-modal matching and knowledge distillation in one single stage. Meanwhile, a novel metric transfer criterion is introduced to diversely align the metric relations across embedding spaces of different dimensions and modalities, ensuring a good cross-modal alignment after distillation. In this way, our MAMET transfers strong representation and generalization capability from the high-dimensional ensemble models to a basic network, which not only yields a substantial performance boost but also introduces no extra overhead during online inference. 
Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, consistently achieving an average of 2%-20% performance improvement over state-of-the-art methods across various backbones and domains.

Abstract:
3D Gaussian splatting (3DGS) suggests the use of explicit point-based 3D representations for high-fidelity novel view synthesis, with training and rendering speeds that are better than prior neural radiance fields. However, 3DGS relies heavily on synthetic point clouds generated by structure from motion (SfM) or multi-view stereo (MVS) techniques, and lacks a well-defined method to initially condition them. In this work, we first propose a mesh-aligned method that attaches the 3D Gaussian to the extracted surface meshes for reliable initialization during training, and then directly manipulates the learnable 3DGS in the local coordinate using the triangular meshes. In addition, to constrain the stereo Gaussian on a planar mesh for better rendering, both normal and depth losses are designed to optimize the orientation and significance of the Gaussian. In practice, we further apply this method to multi-resolution scene rendering while resolving the aliasing effect. Unlike directly changing the size and number of Gaussians, which may interfere with the rendering quality at different resolutions, we argue that the properties of the modelled mesh are naturally resistant to aliasing effects. By utilising triangular meshes for Gaussian binding and adaptive learning, the proposed method can maintain high-fidelity rendering after splatting on multi-resolution concrete images. Extensive experiments demonstrate the effectiveness of our approach and its advantages over the single full-resolution baseline.

Abstract:
The introduction of natural language for vision-language (VL) tracking has been proven to improve performance. However, natural language remains under-explored in existing aerial trackers. Moreover, existing VL trackers ignore the misalignment of language with dynamic target states, which is prominent in complex UAV scenarios. In this work, we present AVLTrack, a flexible framework for aerial vision-language tracking. It consists of three key components, a dynamic sparse learning (DSL) module, an efficient Transformer backbone, and a multi-level language perception (MLP) strategy. First, DSL sparsely connects language and images via dynamic sparse attention, providing accurate multi-modal prompts. To adapt to target state variations, the sparsity in DSL is dynamically adjusted based on semantic information, flexibly highlighting target-specific tokens. Next, the Transformer backbone follows highly parallelized one-stream architectures, allowing efficient multi-modal feature extraction and interaction. Finally, MLP enables the iterative interaction of language and visual information, aiming to utilize language priori to guide the generation of discriminative visual features. Moreover, we construct the DTB70-NLP dataset to facilitate UAV vision-language tracking. Extensive experiments on WebUAV-3M and DTB70-NLP demonstrate the leading performance of AVLTrack compared to existing outstanding trackers while maintaining a high running speed of 80.5 FPS. The dataset and codes are available at https://github.com/xyl-507/AVLTrack.

Abstract:
Supervised hashing models for image-text retrieval are fundamental and versatile in social media analysis and cross-lingual web search. Among them, supervised bilinear drift hashing is one of the most popular approaches. However, it still faces several challenges. For instance, how to leverage the power of bilinear drift hashing to distinguish similar and dissimilar data samples effectively; how to strengthen the semantic relationship between similar data and supervision. To solve these problems, we propose Robust Hashing with Bilinear Drift (RHBD) to improve the accuracy and robustness of the supervised model. The key idea of this work is to generate effective hash codes between image-text feature representations by combining robust data distributions and multiple supervision information. The benefits of bilinear drift with robust hashing, which enhance the discrimination of hash binary, are manifested mainly in two ways: (1) RHBD employs a semantic autoencoder with a linear drift to get a discriminative common feature representation between image and text modalities; (2) RHBD explores iteration quantization with a linear drift to well generate similarity-preserving hash codes. Moreover, we introduce multiple supervision learning to promote the consistency between data information and supervision knowledge for semantic complementarity. Results on three public datasets show that RHBD is effective in image-text retrieval, consistently outperforming other state-of-the-art models with comparable training efficiency to competitive baselines.

Abstract:
3D scene graph has emerged as a powerful high-level representation of the environment, and is considered a prerequisite for long-term autonomous robotic operations. However, building rich representations from RGB-D sequences remains a challenging problem. Existing methods ignore the semantic gap between linguistic and geometric feature spaces or neglect the importance of historical context in incrementally captured data. This limits the learning of visual-textual correspondence and the capability of relationship prediction. To address these problems, we propose a history-enhanced 3D scene graph reasoning framework that incrementally builds a consistent 3D semantic scene graph from an RGB-D image sequence. Specifically, we first introduce a cross-domain unified feature representation module to describe the object instances and their relationships distinctly. Next, we build a one-hot candidate matrix-enabled recurrent mechanism to reason the 3D scene graph, combining the perceived global and local history information. Finally, we design history-aware supervised semantics contrastive learning to optimize the scene-specific global history features. Extensive experiments on the 3DSSG dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches. Our code will be available at https://github.com/cbyan1003/HE-3DSGR.

Abstract:
Individuals can easily generate highly realistic images using artificial intelligence-generated content technologies, which complicates the verification of images’ ownership rights. This raises potential issues such as spreading misinformation, fraud, and copyright infringement. Digital watermarking is a promising solution to protect the copyright of a digital image by embedding watermarks within it. However, many existing deep learning-based watermarking approaches struggle to simultaneously resist multiple attacks effectively and maintain the quality of watermark images. In this paper, we propose a bidirectional-interactive and context-aware (BICA) deep network designed to enhance the robustness of the watermark while maintaining the quality of the encoded image. We propose a new attention module in the encoder to improve the invisibility and robustness of the watermarked images by implementing an adaptive two-way interaction between local and global features. Additionally, we employ fine-grained downsampling to enhance the attention module’s ability to capture comprehensive feature information. Extensive experimental results demonstrate that the BICA network can embed watermark information into an image without compromising image quality. For instance, BICA has an accuracy exceeding 95% against various moderate noise attacks, with average PSNR and SSIM values of 40.4021 dB and 0.9943, respectively.
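
The two kinds of figures quoted above (image quality in PSNR and watermark recovery accuracy) can be computed as in the following minimal sketch. It assumes 8-bit grayscale images stored as nested lists and assumes that "accuracy" means the fraction of watermark bits recovered correctly; the function names `psnr` and `bit_accuracy` are hypothetical and are not part of BICA.

```python
import math
from typing import List, Sequence


def psnr(cover: List[List[float]], encoded: List[List[float]], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    a = [p for row in cover for p in row]
    b = [p for row in encoded for p in row]
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def bit_accuracy(sent: Sequence[int], recovered: Sequence[int]) -> float:
    """Fraction of watermark bits that survive embedding, attack, and decoding."""
    return sum(s == r for s, r in zip(sent, recovered)) / len(sent)


if __name__ == "__main__":
    cover = [[0, 0], [0, 0]]
    encoded = [[1, 1], [1, 1]]  # every pixel changed by one grey level
    print(round(psnr(cover, encoded), 2))
    print(bit_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

A higher PSNR means the watermarked image is closer to the original, so robustness (bit accuracy under attack) and invisibility (PSNR, SSIM) are normally reported together.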

Abstract:
Images are generally uploaded to the cloud in plaintext and can be retrieved in the cloud, but privacy may be exposed. To solve this problem, the Privacy Preserving Content Based Image Retrieval (PPCBIR) system was proposed. Early schemes in this system used noise-like image encryption algorithms, and Thumbnail Preserving Encryption (TPE) technology was proposed later to balance image privacy and visual usability. However, the existing TPE schemes supporting retrieval have shortcomings in mining the visual usability of TPE images, which limits the retrieval accuracy. Based on this, we propose a VF-PPCBIR scheme combining TPE and image visual features to improve retrieval efficiency and accuracy while ensuring image privacy. Specifically, we redesign a new TPE algorithm for lossless encryption and decryption of arbitrary size images. The design concept of the encryption algorithm is novel, and the encryption effect is more stable. The retrieval process generates thumbnails of the retrieved image and extracts local features in the spatial domain, which are matched with the features extracted from TPE thumbnails in the cloud, and the user can directly select the desired image. In addition, the retrieval scheme uses an adjustable feature algorithm to achieve approximate similarity between the ciphertext and the plaintext thumbnail, and thus accurate feature matching. The experimental results show that the time cost and mean average precision (mAP) reach 9.121 s and 64.343%, respectively.

Abstract:
Moiré patterns usually depend on the style of display grids and the position of shooting camera, appearing in the form of stripes, meshes or ripples, with various and irregular colors. Compared with low-resolution moiré images, high-definition (HD) and ultra-high-definition (UHD) moiré images exhibit more complex moiré patterns, e.g., wider distribution of moiré frequencies and higher coupling degree of moirés of different scales, which poses a greater challenge to the modeling capabilities of the model. To address these challenges, we propose a novel Pyramid Learnable Bandpass Filtering Network (PBNet) for demoiréing UHD images. Specifically, we propose a pyramid learnable bandpass filter (P-LBF) to perform multi-scale filtering in the same semantic context to obtain richer frequency domain information. The P-LBF contains three stages: aligning, filtering and fusing. First, we introduce a pyramid alignment (DA) to align neighbor pixels for eliminating the deviations raised by different styles of display grids and relative position of the shooting camera. Then, a pyramid filtering (PF) is conducted to model the complex and variable moiré patterns with aligned neighbor pixels. Finally, the frequency domain responses of these different scales are fused with a multi-dimensional feature fusion (MFF). The PBNet is constructed based on the P-LBF, incorporating a cross-layer feature fusion (CLF) module to facilitate more effective information interaction between features at different depths. Extensive experiments on four public datasets show that our model achieves state-of-the-art performance for both high- and low-resolution moiré images. The code is publicly available at: https://github.com/liuzhongqi1/PBNet.

Abstract:
Real-world datasets usually suffer from class imbalance and label noise. To solve the joint challenge of long-tailed distribution and label noise, most previous works usually aim to design a noise detector to distinguish the noisy from clean samples. While effective, they may be limited in handling the joint issue in a unified way. In this work, we bridge this gap by effectively extracting a clean training subset from the noisy and long-tailed dataset, where we develop a novel re-labeling method using class prototypes from the perspective of distribution matching that can be solved with optimal transport. By using the learned transport plan to re-label training samples and setting a class-specific probability measure, our method can simultaneously reduce the side-effects of label noise and data imbalance during label refinement. Then we introduce a simple yet effective filter by combining the observed and refined labels to obtain a clean subset for robust model training. Comprehensive experiments show that our method can effectively extract clean subsets and bring significant performance gains in noisy long-tailed classification. Code is available at https://github.com/BIRlz/NLT_prototype_clean_subset_extraction
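
The re-labeling idea above can be sketched with entropic optimal transport (Sinkhorn iterations) between samples and class prototypes. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cost is assumed to be the squared Euclidean distance to each prototype, the refined label is taken as the class receiving the most transported mass, and the filter is assumed to keep a sample only when its observed and refined labels agree. All function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sinkhorn_plan(cost: List[List[float]], row_mass: Sequence[float],
                  col_mass: Sequence[float], eps: float = 0.1,
                  iters: int = 200) -> List[List[float]]:
    """Entropic OT plan whose row/column sums approximate the given masses."""
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(k)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def relabel_and_filter(features: List[List[float]], prototypes: List[List[float]],
                       observed: Sequence[int], class_prior: Sequence[float],
                       eps: float = 0.1) -> Tuple[List[int], List[int]]:
    """Return (refined labels, indices of samples kept as clean)."""
    # Assumed cost: squared Euclidean distance from each sample to each prototype.
    cost = [[sum((f - p) ** 2 for f, p in zip(x, proto)) for proto in prototypes]
            for x in features]
    n = len(features)
    # class_prior plays the role of the class-specific probability measure.
    plan = sinkhorn_plan(cost, [1.0 / n] * n, class_prior, eps)
    refined = [max(range(len(prototypes)), key=row.__getitem__) for row in plan]
    # Assumed filter: keep samples whose observed and refined labels agree.
    clean = [i for i in range(n) if refined[i] == observed[i]]
    return refined, clean


if __name__ == "__main__":
    feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
    protos = [[0.0, 0.0], [1.0, 1.0]]
    noisy_labels = [0, 1, 1, 1]  # the second sample carries a wrong label
    print(relabel_and_filter(feats, protos, noisy_labels, [0.5, 0.5]))
```

Changing `class_prior` shifts how much mass each class may receive, which is how a class-specific measure can counteract imbalance during label refinement.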

Abstract:
In robotic grasp detection, challenges such as uncertainty in object type, size, and placement within the scene diminish grasping accuracy. However, the inability to effectively locate the graspable area and incomplete feature extraction for grasp detection are two key factors that hinder grasp detection accuracy and are not considered in current methods. This paper presents a novel retentive attention-based multiscale perception grasp detection network (RAMPGrasp) to address this constraint. First, we introduce retentive attention in the feature extraction module, which significantly improves the efficiency of attention score computation for long sequences in visual tasks. Second, we propose a multiscale spatial pyramid attention module, which can effectively adjust the importance of multiscale feature sequences and feature channels, while enhancing the correlation of multiscale features. Third, we design the prediction module as a coarse-to-fine framework, improving feature representation for grasp detection by considering the distribution trend of grasp poses. As a result, RAMPGrasp achieves state-of-the-art grasp detection accuracy, with 98.4% and 95.6% on the Cornell and Jacquard datasets, respectively.

Abstract:
We introduce TGAvatar, a novel framework for 3D head animation and reconstruction that revolutionizes the use of 3D Gaussian Splatting (3DGS). TGAvatar significantly advances rendering quality by leveraging the intricate properties of 3DGS to achieve detailed and realistic representations of human head geometries and textures. We use an innovative application of linear blending techniques to imitate 3D Morphable Model (3DMM) coefficients within 3DGS, thereby enabling precise and dynamic facial feature and expression modeling. Further enhancing TGAvatar’s capabilities, a transformer-based tri-plane module is incorporated to accurately infer spherical harmonics and alpha parameters. This integration is pivotal for the method, as it allows us to efficiently and precisely represent the visual characteristics of Gaussians, tailored specifically to the intricate details of the head’s components. Our exhaustive evaluations show that TGAvatar not only elevates the fidelity and realism of 3D head reconstructions but also sets a new standard by surpassing existing methods in rendering quality and computational efficiency. Please see our project page at https://hrg0417.github.io/TGAvatar/

Abstract:
Cross-covariate gait recognition aims to analyze a pedestrian’s gait to extract an identity representation that is invariant across varying covariates. However, prevailing methods that have achieved good results on controlled in-the-lab datasets often perform poorly on realistic datasets. In this work, we find a significant cause is that the widely used pairwise metric learning paradigm cannot correctly handle the relationship between samples from different covariate conditions. Even worse, it may yield harmful signals that inadvertently mislead models to focus on covariate-related features, particularly when covariate distributions vary across subjects. To address this issue, we propose a Cross-Covariate Causal Intervention (GaitC3I) framework, a unified causality-inspired approach aimed at enhancing the robustness of gait recognition across diverse conditions. Specifically, our method consists of two parts: 1) an effective causal intervention metric learning paradigm based on backdoor adjustment, which strategically mitigates spurious correlations induced by covariates, thus ensuring a more invariant gait representation; and 2) an annotation-free selection strategy that progressively matches each positive sample with negative samples from similar covariate conditions at various granularities. We demonstrate the effectiveness of our GaitC3I through extensive evaluation on six popular gait datasets (Gait3D, GREW, OUMVLP, CASIA-B, CCPG, and CCGR), achieving substantial improvements. Our method not only outperforms existing state-of-the-art models but also provides a systematic solution to remove the spurious correlations in gait recognition.

Abstract:
Underwater object detection (UOD) plays an important role in the exploitation of marine ecological resources. Different from terrestrial images, the complex underwater environment leads to significant degradation in underwater images, which brings great difficulty in accurate object detection. In recent years, many specially designed UOD methods have been proposed to improve the detection precision in two aspects based on underwater image characteristics: 1) Some UOD methods utilize underwater image enhancement (UIE) to alleviate degradation with the expectation of clean features. However, neither preprocessing nor cascade approaches are fully effective for detection-oriented enhancement, while the additional UIE network increases inference time. 2) Other UOD methods consider low visibility of objects, blurriness of small objects, and occlusion problems. However, the semantic complementarity between objects of the same category but different qualities and the background patterns of specific objects are ignored. Based on these two observations, we propose a novel framework for the UOD task, which performs feature enhancement in two ways. First, a group contrastive-based feature enhancement module (GCFEM) is proposed to bridge UIE and UOD. Specifically, multiple enhanced versions by UIEs are evaluated by the object detection precision evaluation pipeline. Then, group-based contrastive learning is introduced, which utilizes multiple groups of enhanced versions to guide the backbone in extracting detection-friendly features. Second, a prior-guided dual-reference feature enhancement module (PDFEM) is proposed to enhance the representation of objects further. Specifically, the explicit object-object relationship allows low-quality object regions to refer to high-quality ones, guided by a transmission map. At the same time, the implicit object-background relationship provides cues about the surroundings for the representation of the objects. 
Experimental results demonstrate that the proposed algorithm outperforms many state-of-the-art UOD methods on RUOD and URPC2020 datasets.

Abstract:
Despite significant progress, the shortage of labeled data and expert knowledge remains a challenge for Fine-grained Visual Classification (FGVC). Some multi-source approaches that incorporate additional modalities, such as sound or bounding boxes, show promise for data enrichment but introduce added complexity to data collection. In this paper, we pose the question: can multi-source capabilities be achieved solely with existing images? The answer, confirmed by a pilot study, is affirmative. By analyzing the probability distribution of the model output for images of different resolutions, we find that complementary information beneficial to FGVC exists among images of different resolutions. Although the classification accuracy of low-resolution images is lower than that of high-resolution images, they can provide additional information for high-resolution input images. We design a naive baseline that uses mixed training of multi-resolution images. Through the experimental results of the baseline, we find that i) not all low-resolution images are beneficial, and ii) adaptively selecting low-resolution images is what we need. Therefore, we propose a meta-learning-based adaptive “resolution” pooling layer. Through the pooling operation, the features of low-resolution images are obtained from high-resolution images, and the most appropriate complementary features are selected for the features of high-resolution images through the gating mechanism, which enables the model to fully and autonomously exploit the complementary information. Experimental results on three FGVC datasets validate the effectiveness of our proposed method. Our code is available at https://github.com/PRIS-CV/Adaptive-Multi-Resolution-Feature-Fusion.

Abstract:
Artificial Intelligence Generated Content (AIGC) has created a fertile ground for image steganography. Existing Coverless Image Steganography (CIS) methods rely on image semantics to encode secrets, transmitting stego images without embedding, inherently resisting steganalysis. However, constructing CIS Datasets (CISDs) for these methods demands excessive resources, making them impractical for communication. Moreover, achieving low cost and high security is unattainable under these conditions. Therefore, we propose a CIS method based on semantic-controlled text-to-image generation. Our method disguises users as typical AIGC community members utilizing mainstream black-box text-to-image generation with Stable Diffusion (SD). During pre-processing, plain prompts, derived from dialogues with a large language model, are divided into coded and uncoded prompts through our encryption process, where a secret key determines coded prompts. In communication, confusion prompts are selected from uncoded and coded prompts, excluding those determined by secrets. Subsequently, our stego shuffling process combines topic, secret, and confusion prompts to produce stego prompt sets. Diverse stego images maintaining visual topic consistency are generated from these sets using SD with generation seeds indicating transmission order. By introducing confusion prompts, our method is secure from recognition when revealing stego prompts. Experimental results demonstrate our method achieves low communication costs and enhances communication security.

Abstract:
The rapid advancement of large language models (LLMs) has raised concerns regarding potential misuse and underscores the importance of verifying text authenticity. Text watermarking, which embeds covert identifiers into generated content, offers a viable means for such verification. Such watermarking can be implemented either by modifying the generation process of an LLM or via post-processing techniques like lexical substitution, with the latter being particularly valuable when access to model parameters is restricted. However, existing lexical substitution-based methods often face a trade-off between maintaining text quality and ensuring robust watermarking. Addressing this limitation, our work focuses on enhancing both the robustness and imperceptibility of text watermarks within the lexical substitution paradigm. We propose a localization-based watermarking method that enhances robustness while maintaining text naturalness. First, a precise localization module identifies optimal substitution targets. Then, we leverage LLMs to generate contextually appropriate synonyms, and the watermark is embedded through binary-encoded substitutions. To address different usage scenarios, we focus on the trade-off between watermark robustness and text quality. Compared to existing methods, our approach significantly enhances watermark robustness while maintaining comparable text quality and achieves similar robustness levels while improving text quality. Even under severe semantic distortions, including word deletion, synonym substitution, polishing, and re-translation, the watermark remains detectable.
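
The binary-encoded substitution step described above can be illustrated with a minimal sketch. It assumes a fixed, hypothetical table of synonym pairs in place of the localization module and the LLM-generated synonyms, and it encodes one bit per substitutable word by choosing which member of the pair appears in the text; `PAIRS`, `embed`, and `extract` are illustrative names only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical synonym pairs; index 0 encodes bit 0 and index 1 encodes bit 1.
PAIRS: List[Tuple[str, str]] = [("big", "large"), ("quick", "rapid"), ("begin", "start")]
LOOKUP: Dict[str, Tuple[str, str]] = {w: pair for pair in PAIRS for w in pair}


def embed(tokens: Sequence[str], bits: Sequence[int]) -> List[str]:
    """Replace each substitutable token with the synonym that encodes the next bit."""
    out, queue = [], list(bits)
    for tok in tokens:
        pair = LOOKUP.get(tok)
        if pair is not None and queue:
            out.append(pair[queue.pop(0)])
        else:
            out.append(tok)  # not a target, or no bits left to embed
    return out


def extract(tokens: Sequence[str], n_bits: int) -> List[int]:
    """Read the first n_bits back from the synonym choices found in the text."""
    found = [LOOKUP[tok].index(tok) for tok in tokens if tok in LOOKUP]
    return found[:n_bits]


if __name__ == "__main__":
    text = "the quick fox will begin a big journey".split()
    marked = embed(text, [1, 0, 1])
    print(" ".join(marked))
    print(extract(marked, 3))
```

In this toy version a deleted or re-substituted target word shifts or flips the recovered bits, which is the kind of fragility that careful target localization is meant to reduce.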

Abstract:
Recent multi-person pose estimation methods design end-to-end pipelines under the DETR framework. However, these methods involve complex keypoint decoding processes because the DETR framework cannot be directly used for pose estimation, which results in constrained performance and ineffective information interaction between human instances. To tackle this issue, we propose a hybrid representation learning method for end-to-end multi-person pose estimation. Our method represents instance-level and keypoint-level information as hybrid queries based on point set prediction and can facilitate parallel interaction between instance-level and keypoint-level representations in a unified decoder. We also employ the instance segmentation task for auxiliary training to enrich the spatial context of hybrid representations. Furthermore, we introduce a pose-unified query selection (PUQS) strategy and an instance-gated module (IGM) to improve the keypoint decoding process. PUQS predicts local pose proposals to produce scale-aware instance initializations and can avoid the scale assignment mistake of one-to-one matching. IGM refines instance contents and filters out invalid information with the message of cross-instance interaction and can enhance the decoder’s capability to handle queries of instances. Compared with current end-to-end multi-person pose estimation methods, our method can detect human instances and body keypoints simultaneously through a concise decoding process. Extensive experiments on COCO Keypoint and CrowdPose benchmarks demonstrate that our method outperforms some state-of-the-art methods.

Abstract:
By now, many works have been done on shadow removal for image manipulation. As a result, detecting shadow removal has become a critical part to reveal the traces of image manipulation. However, there are only a few works conducted on shadow removal detection, and these works cannot accurately localize the image regions where the shadows have been removed. In this paper, we present a novel model called Multi-level Feature Fusion Network (MFF-Net) for shadow removal detection. MFF-Net consists of two parts: a dual-branch feature extraction encoder and a dense prediction decoder. The encoder anchors the approximate position of the manipulated regions, while the decoder progressively fills in the details of the estimated shadow masks by integrating multi-level information. In the encoder part, a global modeling branch is constructed to capture long-range dependencies, while a local feature extraction branch is designed to extract local structural information. The features extracted by these two branches are integrated using a feature fusion module. In the decoder part, a multi-scale feature upsampling module is proposed to upsample the input features and integrate them with the low-level features obtained from the encoder part. Meanwhile, the cross attention mechanism is introduced to guide the multi-level feature fusion process. Finally, the features of different resolutions are employed to estimate the shadow masks in a coarse-to-fine manner. Extensive experiments on shadow removal detection demonstrate the superiority of MFF-Net over the state-of-the-art methods. The source code of MFF-Net is publicly available at https://github.com/HITFuxiwen/MFF-Net.

Abstract:
Addressing actual image super-resolution remains a formidable challenge due to the intricate degradations present in images. Two primary methodologies have emerged: degradation-estimation-based and blind-based methods. The former often struggle to accurately estimate degradation, limiting their effectiveness on real low-resolution images. Conversely, blind-based methods rely on a single perceptual perspective, constraining their adaptability to diverse perceptual characteristics. In response to these challenges, we present MPF-Net, a novel super-resolution approach aimed at enhancing real-world image super-resolution tasks by enabling the model to learn multiple perceptual features from input images. Our method features a Multi-Perception Feature Extraction module (MPFE) designed to extract diverse perceptual details, complemented by Cross-Perception Blocks (CPB) facilitating the fusion of this information for efficient super-resolution reconstruction. Additionally, we introduce a contrastive regularization term (CR) to enhance the model’s learning by leveraging newly generated HR and LR images as positive and negative samples. Experimental results on challenging real-world SR datasets demonstrate the superiority of our approach over existing state-of-the-art methods, both qualitatively and quantitatively.

Abstract:
In recent years, correlation filter based trackers have shown great potential in visual tracking because of their high computational efficiency and low memory consumption. However, their increasing tracking performance typically comes at the cost of computational speed and memory usage. Furthermore, training high-dimensional correlation filters with a large number of parameters usually introduces the risk of over-fitting. In this paper, we propose Multi-Task Target-Specific Correlation Filters (MTSCF) to tackle the above issues. First, we construct a novel regression formulation for multi-task filter learning to promote both competition and collaboration among correlation filters to select discriminative features for robust tracking. This significantly reduces redundancies among features at both the spatial level and the channel level, which produces sparse correlation filters. Then, we develop an effective filter importance evaluation criterion according to the expansion of the designed regression formulation to choose a set of target-specific features for efficient tracking. This significantly reduces the number of filter parameters, which further results in compact correlation filters. Moreover, we propose to efficiently optimize the proposed MTSCF via an Alternating Direction Method of Multipliers (ADMM) algorithm. Evaluation results on six challenging benchmark datasets (i.e., OTB2013, OTB2015, VOT2016, VOT2018, UAV20L and LaSOT) show that the proposed method performs favorably against existing state-of-the-art DCF based trackers, and it retains a high speed of 40 FPS on a CPU when evaluated with only hand-crafted features.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussians through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.

Abstract:
With the development of deep 3D tracking models and their broad prospects for safety-critical applications, adversarial robustness, i.e., the ability of deep models to resist malicious adversarial attacks, has become an important research topic. Previous works generate adversarial examples by tampering with points of the input point cloud indiscriminately. Consequently, they suffer from high computing costs and limited attack performance caused by the trade-off between imperceptibility and adversarial strength. In this paper, we propose a novel adversarial attack against 3D object tracking, which is guided by an occlusion-based explainability method to target points crucial for the predictions in the search area and results in a significant deviation between the predictions and the ground truth. Specifically, an attribution map is generated to reveal the importance of points to the model decision, which is achieved by measuring the variations of tracking performance under subsets generated by the downsampling strategy. To facilitate the generation of attribution maps, the downsampling strategy considers prior knowledge of 3D trackers, which assigns higher sampling probabilities to points with potentially higher contributions enclosed by bounding boxes. Multi-scale fusion is also leveraged to integrate the sensitivity of the model to local regions of varying sizes. Considering the requirement of imperceptibility on adversarial attacks, a hard geometric constraint is imposed on the targeted critical points, which produces perturbations with the property of surface invariance. Furthermore, in contrast to existing works devoted to spatial information manipulation only, multiple loss functions are developed to guide the perturbation generation, where the predicted motions of the tracking target representing the spatial-temporal information unique to the tracking task are distorted to deceive 3D trackers. Extensive experiments conducted on public benchmarks and 3D trackers demonstrate that our method can generate effective and imperceptible adversarial examples with tiny perturbations.

Affiliations: School of Electronic and Information Engineering & Guangdong Provincial Key Laboratory of Industrial Intelligent Inspection Technology, Foshan University, Foshan, China; School of Software & Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China; School of Computing, National University of Singapore (NUS), Queenstown, Singapore; College of Computing and Data Science, Nanyang Technological University (NTU), Jurong West, Singapore

Abstract:
Federated domain adaptation (FDA) aims to transfer knowledge collaboratively from multiple source domains to related but different unlabelled target domains. The data of each domain are locally maintained, and various domain gaps exist among them, resulting in extreme challenges in simultaneously mitigating diverse distribution shifts and preserving discriminative knowledge without accessing the source data. Many existing works have failed to fully explore different source models to measure domain shifts and leverage semantic knowledge, resulting in skewed alignment and partial preservation of discriminative information. In this paper, we propose a novel approach named FDAC to address Federated Domain Adaptation by thoroughly investigating source models via dual Contrastive mechanisms. FDAC contrastively increases the data diversity to align features across domains in a fine-grained manner by manipulating the latent deep architecture and compensating for knowledge from each source domain; simultaneously, it contrastively utilizes the comprehensive semantic knowledge of different source domains to guide the adaptation process. Extensive experiments on several real datasets demonstrate that FDAC outperforms all comparative methods under most conditions. Furthermore, FDAC only needs approximately half of the communication rounds compared with the state-of-the-art methods, indicating that FDAC can significantly improve communication efficiency, which is another key factor in the federated setting. The source code is publicly available at https://github.com/ycarobot/FDAC.

Abstract:
Siamese network based trackers have developed rapidly in the field of visual object tracking in recent years. The majority of Siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence making it challenging to focus on the target region. Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training. Therefore, there is a misalignment between the classification and regression branches, which results in less accurate tracking results. In this paper, a Target Highlight Module is proposed to help the generated similarity response maps focus more on the target region. To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model. The two branches of the model are jointly tuned with the corrective loss to produce more reliable prediction results. Experiments on 5 challenging benchmark datasets reveal that the method outperforms current models in terms of performance, and runs at 38 fps, proving its effectiveness and efficiency.

Abstract:
Visual object tracking is susceptible to adversarial attacks, posing significant security concerns for numerous application systems. Previous attack methods focused on white-box and untargeted attacks against the response map. However, obtaining the tracking model in real-world scenarios is challenging, and the resulting adversarial trajectories are often unrealistic, making the attacks easily detectable. This paper proposes a Feature-aware Transferable Adversarial Patch (FTAP) that induces any black-box tracker to follow controllable and smooth trajectories. A Tracker Following Assurance module is designed to manipulate bounding boxes so that they are valid and tightly aligned with the fake target. The movement of the tracker can be precisely controlled, yielding adversarial trajectories that are stable and closely resemble natural trajectories, thereby reducing the risk of detection. The adversarial perturbation is generated solely from the initial template and applied to each frame. Consequently, the well-optimized generator can output a universal adversarial patch capable of attacking any video without requiring additional computations. The intermediate layer features are corrupted to make the characteristics of the fake target closer to those of the ground truth. Experimental results demonstrate that the proposed FTAP achieves state-of-the-art black-box attack performance and transferability across various tracker architectures.

Abstract:
Large-scale hyperspectral image (HSI) clustering has become an important research task owing to its promising applications in various fields. Recently, benefiting from the correlation modeling capability of graphs, graph contrastive learning methods have received increasing attention in the clustering task. However, these methods usually have limited ability to explore the high-order correlation as well as beneficial clustering information of large-scale HSI, thus limiting the clustering performance on large-scale HSI. To this end, a novel hypergraph contrastive learning network (HCL-Net) for large-scale HSI clustering is proposed in this paper. Specifically, a diffusion hypergraph-based contrastive clustering mechanism is presented, in which a diffusion hypergraph is constructed to model the high-order correlation in large-scale HSI, thus guiding contrastive learning for obtaining more discriminative representations. Besides, by mining the confident clustering information, a confidence-guided positive-negative updating strategy is designed to dynamically update positives and negatives for contrastive learning, thereby obtaining a more compact clustering structure. The proposed method is evaluated on three public large-scale HSI datasets. The experimental results demonstrate the superior performance of the proposed HCL-Net over state-of-the-art methods.

Abstract:
Due to its excellent query and storage efficiency, which facilitates large-scale multimedia retrieval, multi-modal hashing (MMH) has garnered a lot of attention from researchers. Nevertheless, existing MMH methods still suffer from several challenges: 1) Existing MMH methods often rely on graphs to represent complex correlation, but are constrained by the quality of graph construction and the storage overhead. 2) Existing MMH methods only deal with complete multi-modal data where all modalities of each instance are available, but cannot work with incomplete multi-modal data, which suffer from missing modalities. 3) Existing MMH methods often ignore the inevitable weak-supervision issue. To address these challenges, this paper proposes an Incomplete Multi-modal wEakly-supervised Hashing with Consensus Bipartite Graph (IMEH-CBG) method, which learns a consensus bipartite graph for incomplete multi-modal fusion and corrects weak labels for discriminant hash learning. As far as we know, this is the first MMH method to work with incomplete and weakly-supervised multi-modal data in a unified framework. IMEH-CBG selects a unified anchor set and builds a consensus bipartite graph jointly for incomplete multi-modal fusion to tackle the first and the second challenges. Then, the semantic labels are predicted and utilized to learn hash codes in an asymmetric way to tackle the third challenge. Extensive experiments demonstrate the superiority of IMEH-CBG.
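
The query and storage efficiency mentioned above comes from comparing compact binary codes with bitwise operations. The following is a minimal, generic sketch of Hamming-distance ranking, not the IMEH-CBG method itself; the function names and the 4-bit codes are hypothetical and chosen only for illustration.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary hash codes stored as ints."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query: int, database: list) -> list:
    """Return database indices sorted from nearest to farthest in Hamming space."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))


# Hypothetical 4-bit codes; real systems typically use 16 to 128 bits per item.
database_codes = [0b0101, 0b1010, 0b1000]
ranking = rank_by_hamming(0b1010, database_codes)
```

Each item costs only a few bytes of storage, and a comparison is a single XOR followed by a bit count, which is why hashing scales to large retrieval sets.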

Abstract:
As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance for the rectangular windows by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms methods in existing literature.
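
The variability of distortion across latitudes can be made concrete with standard ERP geometry: a row at latitude φ is stretched horizontally by a factor of 1/cos(φ) relative to the sphere. The sketch below computes this factor per image row. It illustrates the geometry only and is not the distortion guidance generator of GDGT-OSR; the function name and the half-pixel row-centre convention are assumptions.

```python
import math


def erp_horizontal_stretch(height: int) -> list:
    """Horizontal stretch factor 1/cos(latitude) for each row of an ERP image."""
    factors = []
    for row in range(height):
        # Latitude of the row centre, running from near +pi/2 (top) to near -pi/2 (bottom).
        latitude = math.pi / 2 - (row + 0.5) * math.pi / height
        factors.append(1.0 / math.cos(latitude))
    return factors


# Rows near the poles are stretched far more than rows near the equator.
stretch = erp_horizontal_stretch(8)
```

Because the stretch is close to 1 near the equator and grows rapidly toward the poles, a fixed square window covers very different amounts of spherical content depending on the row, which motivates latitude-dependent window shapes.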

Abstract:
Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to an inflexible scene representation strategy that does not leverage any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real-time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from a single-frame depth image are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.

Abstract:
Most visual cryptography schemes (VCSs) are condition-oriented, which implies their designs focus on satisfying the contrast and security conditions in VCS. In this paper, we explore a new architecture of VCS: contrast-oriented region-based progressive probabilistic VCS (CRP2-VCS). The term contrast-oriented indicates that the optimality of multi-contrast is taken into consideration when producing shadows. First of all, new requirements for CRP2-VCS, described by probabilities, are introduced. As a non-interference requirement is proposed, the secret interference problem in existing region-based progressive VCS can be avoided. Then, a construction of CRP2-VCS based on a multi-contrast-maximizing model is provided. The multi-contrast-maximizing problem is essentially a probabilistic VCS model that fuses region-based sharing, multi-contrast optimization, and general access structure (GAS) together. Finally, a Max-Min based technique is adopted to solve the multi-objective optimization problem. Moreover, to further boost the visual quality, the proposed method is extended to allow employing the XOR operation for image recovery. Experimental results and comparisons demonstrate the effectiveness of the proposed technique and its advantages, such as optimal visual quality, non-expansible shadows, and a GAS sharing policy.
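
To illustrate what XOR-based image recovery means in its simplest form, the sketch below implements a textbook (2, 2) XOR sharing of a binary image row. It is not the CRP2-VCS construction: it has no regions, no progressive recovery, and no general access structure. The function names are hypothetical.

```python
import secrets


def make_xor_shadows(secret_bits: list) -> tuple:
    """Split a binary secret into two non-expanded shadows (1 = black, 0 = white)."""
    # The first shadow is uniformly random, so on its own it reveals nothing.
    shadow_one = [secrets.randbelow(2) for _ in secret_bits]
    # The second shadow is chosen so that XOR-ing both shadows yields the secret.
    shadow_two = [s ^ r for s, r in zip(secret_bits, shadow_one)]
    return shadow_one, shadow_two


def recover_by_xor(shadow_one: list, shadow_two: list) -> list:
    """Recover the secret by pixel-wise XOR of the two shadows."""
    return [a ^ b for a, b in zip(shadow_one, shadow_two)]


secret_row = [1, 0, 1, 1, 0, 0, 1, 0]
first, second = make_xor_shadows(secret_row)
recovered_row = recover_by_xor(first, second)
```

In this basic scheme XOR recovery is lossless and each shadow has the same size as the secret.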

Abstract:
Successful video deblurring relies on effectively using sharp pixels from other frames to recover the blurry pixels of the current frame. However, mainstream methods only use estimated optical flows to align and fuse features from adjacent frames without considering the pixel-wise blur levels, leading to the introduction of blurry pixels from adjacent frames. Furthermore, these methods fail to effectively exploit information from the entire input video. To address these limitations, we propose STDANet++, which redesigns the state-of-the-art method STDANet by introducing a patch-based spatio-temporal deformable attention (PSTDA) module and a long-term frame fusion (LTFF) module to the BiRNN-based structure. By effectively utilizing sharp information across the entire video, the proposed method outperforms state-of-the-art methods on the GoPro, DVD and BSD datasets, according to our experimental results. The source code is available at https://github.com/huicongzhang/STDANetPP.

Abstract:
Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To tackle these challenges, we propose VmambaIR, one of the first works to introduce State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. Specifically, we utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions to better exploit surrounding restoration information. Furthermore, we conducted a comprehensive evaluation of our VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that our proposed VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to the transformer and CNN architectures in serving as foundational frameworks for next-generation low-level visual tasks.

Abstract:
In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information. However, traditional sampling methods of preprocessing often ignore semantic features, leading to detail loss and ground point interference in 3D object detection. To address this, we propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-branch Sampling (SMS) module and multi-view consistency constraints. The SMS module includes random sampling, Density Equalization Sampling (DES) for enhancing distant objects, and Ground Abandonment Sampling (GAS) to focus on non-ground points. The sampled multi-view points are processed through a Consistent KeyPoint Selection (CKPS) module to generate consistent keypoint masks for efficient proposal sampling. The first-stage detector uses multi-branch parallel learning with multi-view consistency loss for feature aggregation, while the second-stage detector fuses multi-view data through a Multi-View Fusion Pooling (MVFP) module to precisely predict 3D objects. The experimental results on the KITTI dataset and Waymo Open Dataset show that our method achieves excellent detection performance improvement for a variety of backbones, especially for low-performance backbones with simple network structures. The code will be publicly available at https://github.com/HaoJing-SX/SMS.

Abstract:
Multi-view stereo aims to recover the 3D model of a scene from a set of images. However, low-textured areas in the scene have always been a challenge in 3D reconstruction. In this work, we propose a segmentation-guided multi-scale anchor deformation patch multi-view stereo. Specifically, we use the Segment Anything Model to distinguish different instances in the scene, and propose an anchor deformation strategy guided by segmentation to adaptively generate a multi-scale anchor patch, so that the depth can be refined from coarse to fine scales. To propagate a better hypothesis for low-textured areas where photometric consistency is unreliable, we propose a non-local adaptive propagation scheme by using the segmented mask as a propagation domain. In order to reduce the interference of illumination on reconstruction completeness, we propose an outlier depth cost refinement guided by reliable points that improves the performance of depth estimation in unevenly illuminated areas. As a result, our method achieves state-of-the-art performance among traditional methods and exhibits better generalization capabilities on the ETH3D, Tanks and Temples, and DTU datasets.

Abstract:
Panoramic distortion poses a significant challenge in 360° depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, resulting in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar reprojection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions. Our code will be made publicly available at https://github.com/iuiuJaon/SGFormer.

Abstract:
Occluded person re-identification (Re-ID) is a challenging problem due to the absence of notable discriminative features resulting from incomplete body part images and interference from occluded regions. Recently, some transformer-based methods have demonstrated excellent capabilities in resolving this problem; however, these methods are not able to precisely focus on the non-occluded body parts and cannot capture fine-grained local features. To achieve these goals, we propose a Mask-Aware Hierarchical Aggregation TrAnsforMer (MAHATMA) method to enhance occluded person Re-ID. Specifically, we propose a Mask Information Embedding (MIE) module, which directs the model to focus on non-occluded body parts by incorporating the mask semantic information of a human body. Furthermore, to effectively capture fine-grained local features, we propose a Hierarchical Feature Aggregation (HFA) module that mines more exploitable high-quality detail information by aggregating hierarchical image patch representations. To further alleviate the feature loss problem, we propose a Diverse Feature Completion (DFC) module, which is able to complete global features through multi-path feature integration. Extensive experimental evaluations demonstrate that our method exhibits superior performance on both occluded and holistic person datasets.

Abstract:
Co-salient object detection (CoSOD) is to find the salient and recurring objects from a series of relevant images, where modeling inter-image relationships plays a crucial role. Different from the commonly used direct learning structure that inputs all the intra-image features into some well-designed modules to represent the inter-image relationship, we resort to adopting a recursive structure for inter-image modeling, and propose a two-tier recursion network (TRNet) to achieve CoSOD in this paper. The two-tier recursive structure of the proposed TRNet is embodied in two stages of inter-image extraction and distribution. On the one hand, considering the task adaptability and inter-image correlation, we design an inter-image exploration with recursive reinforcement module to learn the local and global inter-image correspondences, guaranteeing the validity and discriminativeness of the information in the step-by-step propagation. On the other hand, we design a dynamic recursion distribution module to fully exploit the role of inter-image correspondences in a recursive structure, adaptively assigning common attributes to each individual image through an improved semi-dynamic convolution. Experimental results on five prevailing CoSOD benchmarks demonstrate that our TRNet outperforms other competitors in terms of various evaluation metrics. The code and results of our method are available at https://github.com/rmcong/TRNet_TCSVT2025.

Abstract:
Graphics Interchange Format (GIF) encoding is the art of reproducing an image with limited colors. Existing GIF encoding schemes often introduce unpleasant visual artifacts such as banding artifacts, dotted-pattern noise and color shift, especially when the palette size is small. To address the issues above, we propose VivID, a Visually Improved GIF Encoding Network Design, which is compatible with existing GIF decoders. VivID consists of three modules, two of which provide the functionality within the GIF encoding pipeline. Firstly, in order to reduce the color shift introduced by color quantization, we design the multi-palette extractor to create a GIF image with minimal distortion by extracting a near-optimal palette. This module can significantly improve the image fidelity and gains adaptability to multiple palette sizes after only one-time training. Furthermore, to reduce the banding artifacts and the dotted-pattern noise caused by the dithering process, we propose the banding remover, which can randomize the quantization error to the neighbourhood by utilizing a learnable dithering pattern. Moreover, to further eliminate the banding artifacts, we design the banding scorer module, which is a novel metric for evaluating banding artifacts that correlates well with subjective perception. We adopt it as a customized loss for training the dithering module. Extensive experiments across various aspects demonstrate that VivID produces visually pleasing results even when the palette size is extremely small, outperforming both traditional and existing learning based GIF encoding methods.
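
For context on the quantization and dithering steps that the abstract refers to, here is a minimal sketch of nearest-palette quantization with classic Floyd-Steinberg error diffusion on a grayscale image. It uses the fixed textbook kernel, not the learnable dithering pattern of VivID, and the function names and the two-level palette are illustrative assumptions.

```python
def nearest_palette_value(value: float, palette: list) -> int:
    """Pick the palette entry closest to the given gray value."""
    return min(palette, key=lambda entry: abs(entry - value))


def floyd_steinberg_dither(image: list, palette: list) -> list:
    """Quantize a 2D grayscale image and diffuse each pixel's error to its neighbours."""
    height, width = len(image), len(image[0])
    work = [[float(v) for v in row] for row in image]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = nearest_palette_value(old, palette)
            out[y][x] = new
            error = old - new
            # Fixed Floyd-Steinberg weights: 7/16 right, 3/16 down-left, 5/16 down, 1/16 down-right.
            if x + 1 < width:
                work[y][x + 1] += error * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += error * 3 / 16
                work[y + 1][x] += error * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += error * 1 / 16
    return out


flat_gray = [[128] * 4 for _ in range(4)]
two_level_palette = [0, 255]
dithered = floyd_steinberg_dither(flat_gray, two_level_palette)
```

Without dithering, every pixel of the flat mid-gray patch maps to the same palette entry, which is what produces banding on smooth gradients. With error diffusion the output mixes both palette entries so that the local average stays closer to the original tone, at the cost of the dotted-pattern noise the abstract mentions.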

Abstract:
Image inpainting aims to restore a realistic image from a damaged or incomplete version. Although Transformer-based methods have achieved impressive results by modeling long-range dependencies, the inherent quadratic complexity of canonical self-attention has typically led to these approaches adopting uni-dimensional modeling, which limits the model’s ability to capture complex relationships from both spatial and channel dimensions. To this end, this paper exploits a novel attention paradigm termed Dynamic Omni-Attention Mechanism (DOAM) for simultaneously modeling pixel-interaction from both spatial and channel dimensions, and implements the information interaction across the omni-axis (i.e., spatial and channel) with linear computational complexity. In addition, to handle large-scale degradation, this paper proposes a Multi-band Feature Enhancement (MFE) module to enhance feature representation in downsampling, thus unlocking the potential of subsequent attentional interactions. Moreover, motivated by recent advances in image restoration, this paper incorporates a domain-related prior representation from a CNN-based network to modulate the features in the proposed attention mechanism and feed-forward networks. Integrating the above designs into an encoder-decoder architecture, the proposed Omni Contextual Aggregation Networks (OCANet) achieve superior performance with fewer parameters and lower time costs than the competitive baselines. Extensive experiments on CelebA-HQ, Paris Street View, FFHQ and Dunhuang datasets validate the efficacy of the proposed method.

Abstract:
Referring expression comprehension (REC) aims at locating the target object described by an expression. We observe that most graph-based REC methods only focus on establishing relations between all objects in an image and the given expression during graph construction, while ignoring the relationships between objects in the same category. As a result, these methods are sub-optimal in locating the target object described by the expression, particularly when the target object is surrounded by objects of similar categories. Meanwhile, during reasoning, numerous irrelevant objects are considered for the expression, which introduces significant harmful noise. To address these issues, this paper proposes a new graph-based group division network (GBGDN). Different from existing works, our work partitions the constructed graphs into several sub-graphs based on the categories of objects and expressions. In each sub-graph, the common visual features of objects are strengthened through a feature enhancement strategy. Subsequently, the enhanced sub-graphs and expressions undergo joint processing via a filtering-based reasoning module designed to reduce the influence of unrelated nodes in each sub-graph, facilitating more accurate reasoning and matching. Experimental results across various datasets, including RefCOCO/+/g, Flickr30K Entities, RefClef, and Ref-reasoning, showcase the superiority of our proposed method over existing approaches. Most importantly, our method does not need pre-training.

Abstract:
Automatic and precise medical image segmentation (MIS) is of vital importance for clinical diagnosis and analysis. Current MIS methods mainly rely on the convolutional neural network (CNN) or self-attention mechanism (Transformer) for feature modeling. However, CNN-based methods suffer from inaccurate localization owing to their limited global dependency, while Transformer-based methods always present coarse boundaries for lack of local emphasis. Although some CNN-Transformer hybrid methods are designed to synthesize the complementary local and global information for better performance, the combination of CNN and Transformer introduces numerous parameters and increases the computation cost. To this end, this paper proposes a CNN-Transformer rectified collaborative learning (CTRCL) framework to learn stronger CNN-based and Transformer-based models for MIS tasks via the bi-directional knowledge transfer between them. Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy which introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels for accurate knowledge transfer in the logit space. We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space by granting their intermediate features a similar capability of category perception. Extensive experiments on three popular MIS benchmarks demonstrate that our CTRCL outperforms most state-of-the-art collaborative learning methods under different evaluation metrics. The source code will be publicly available at https://github.com/LanhooNg/CTRCL.

Abstract:
In recent years, vision-language tracking has drawn emerging attention in the tracking field. The critical challenge for the task is to fuse semantic representations of language information and visual representations of vision information. For this purpose, several vision-language tracking methods perform early or late fusion to fuse visual and semantic features. However, these methods cannot take full advantage of the transformer architecture to excavate useful cross-modal context at various levels. To this end, we propose a new progressive joint vision-language transformer (PJVLT) to progressively align and refine visual embedding with semantic embedding for vision-language tracking. Specifically, to align visual signals with semantic signals, we propose to insert a semantic-aware instance encoder layer (SAIEL) into each intermediate layer of transformer encoder to perform progressive alignment of visual and semantic features. Furthermore, to highlight the multi-modal feature channels and patches corresponding to target objects, we propose a unified channel communication patch interaction layer (CCPIL), which is plugged into each intermediate layer of transformer encoder to progressively activate target-aware channels and patches of aligned multi-modal features for fine-grained tracking. In general, by progressively aligning and refining visual features with semantic features in the transformer encoder, our PJVLT can adaptively excavate well-aligned vision-language context at coarse-to-fine levels, therefore highlighting target objects at various levels for more discriminative tracking. Experiments on several tracking datasets show that the proposed PJVLT can achieve favorable performance in comparison with both conventional trackers and other vision-language trackers.

Abstract:
The purpose of texture measurement is to describe and quantify the texture features of pixels in an image. The accuracy of texture measurement plays a crucial role in determining the effectiveness of texture filtering. However, current texture measurement methods struggle to produce accurate results, particularly for multi-scale texture measurements. This limitation often leads to unsatisfactory texture filtering results, especially for image details and high-contrast textures. We find that moving the texture measurement regions for pixels near texture edges further away from the edge, while keeping the regions for pixels far from texture edges unchanged, improves the accuracy of texture measurement. Based on this observation, we propose a novel texture measurement approach that employs a circular neighborhood with a variable radius as the texture measurement region for each pixel. Furthermore, we propose an image terrain map model based on a one-pixel texture edge to obtain optimal parameters for texture measurement regions. This model significantly enhances the accuracy of texture measurement at any scale in an image. The experimental results show that the texture filtering method based on our image terrain map model is significantly better than existing methods in terms of edge preservation, small-structure preservation, and high-contrast texture filtering. Additionally, we present some applications of the image terrain map model in other areas of image processing to demonstrate its versatility.

Abstract:
Video copy Segment Localization (VSL) requires the identification of the temporal segments within a pair of videos that contain copied content. Current methods primarily focus on global temporal modeling, overlooking the complementarity of global semantic and local fine-grained features, which limits their effectiveness. Some related methods attempt to incorporate local spatial information but often disrupt spatial semantic structures, resulting in less accurate matching. To address these issues, we propose the Instance-Enhanced Spatial-Temporal Alignment Framework (iESTA), based on a proper representation granularity that integrates instance-level local features and semantic global features. Specifically, the Instance-relation Graph (IRG) is constructed to capture instance-level features and fine-grained interactions, preserving local information integrity and better representing the video feature space in a proper granularity. An instance-GNN structure is designed to refine these graph representations. For global features, we enhance the representation of semantic information, capturing temporal relationships within videos using a Transformer framework. Additionally, we design a Complementarity-perception Alignment Module (CAM) to effectively process and integrate complementary spatial-temporal information, producing accurate frame-to-frame alignment maps. Our approach also incorporates a differentiable Dynamic Time Warping (DTW) method to utilize latent temporal alignments as weak supervisory signals, improving the accuracy of the matching process. Experimental results indicate that our proposed iESTA outperforms state-of-the-art methods on both the small-scale dataset VCDB and the large-scale dataset VCSL.
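
The differentiable DTW mentioned above can be illustrated with a generic soft-DTW recursion, in which the hard minimum of classic DTW is replaced by a smooth minimum. This is a minimal sketch and not the iESTA implementation: the function names, the squared-distance cost, and the `gamma` smoothing parameter are assumptions made for illustration.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum; falls back to the hard minimum when gamma == 0."""
    values = np.asarray(values, dtype=float)
    if gamma == 0.0:
        return float(values.min())
    z = -values / gamma
    z_max = z.max()
    # Numerically stable log-sum-exp; infinite costs contribute exp(-inf) = 0.
    return float(-gamma * (z_max + np.log(np.exp(z - z_max).sum())))


def pairwise_cost(a, b):
    """Squared Euclidean distance between every frame of a and every frame of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def soft_dtw(cost, gamma=0.0):
    """Accumulated alignment cost; gamma == 0 gives classic DTW (hypothetical helper)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + soft_min(prev, gamma)
    return float(acc[n, m])
```

With `gamma > 0` the result is a smooth function of the cost matrix, which is what allows an alignment cost of this kind to be used as a training signal.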

Abstract:
Co-salient object detection (CoSOD) aims to segment the co-occurring salient objects in a given group of relevant images. Existing methods typically rely on extensive group training data to enhance the model’s CoSOD capabilities. However, fitting prior knowledge of the extensive groups results in a significant performance gap between the seen and out-of-sample image groups. Relaxing such a fitting with fewer prior groups may improve the generalization ability of CoSOD while alleviating the annotation burden. Hence, it is essential to explore the use of fewer groups during the training phase, such as using only a single group, to pursue a highly generalized CoSOD model. We term this new setting Sg-CoSOD, which aims to train a model using only a single group and effectively apply it to any unseen RGB and RGB-D CoSOD test groups. Towards Sg-CoSOD, it is important to ensure detection performance with limited data and to relieve class dependency with only a single group. Thus, we present a method, i.e., cross-excitation between saliency and ‘Co’, which decouples the CoSOD task into two parallel branches: ‘Co’ To Saliency (CTS) and Saliency To ‘Co’ (STC). The CTS branch focuses on mining group consensus to guide image co-saliency predictions, while the STC branch is dedicated to using saliency priors to motivate group consensus mining. Furthermore, we propose a Class-Agnostic Triplet (CAT) loss to constrain intra-group consensus while preventing the model from acquiring class prior knowledge. Extensive experiments on RGB and RGB-D CoSOD tasks with multiple unknown groups show that our model has higher generalization capabilities (e.g., for the large-scale datasets CoSOD3k and CoSal1k with multiple generalized groups, we obtain a gain of over 15% in F_m). Further experimental analyses also reveal that the proposed Sg-CoSOD paradigm has significant potential and promising prospects.

Abstract:
The use of a single image restoration framework to achieve multi-task image restoration has garnered significant attention from researchers. However, several practical challenges remain, including meeting the specific and simultaneous demands of different tasks, balancing relationships between tasks, and effectively utilizing task correlations in model design. To address these challenges, this paper explores a multi-expert adaptive selection mechanism. We begin by designing a feature representation method that accounts for both the pixel channel level and the global level, encompassing low-frequency and high-frequency components of the image. Based on this method, we construct a multi-expert selection and ensemble scheme. This scheme adaptively selects the most suitable expert from the expert library according to the content of the input image and the prompts of the current task. It not only meets the individualized needs of different tasks but also achieves balance and optimization across tasks. By sharing experts, our design promotes interconnections between different tasks, thereby enhancing overall performance and resource utilization. Additionally, the multi-expert mechanism effectively eliminates irrelevant experts, reducing interference from them and further improving the effectiveness and accuracy of image restoration. Experimental results demonstrate that our proposed method is both effective and superior to existing approaches, highlighting its potential for practical applications in multi-task image restoration. The source code of the proposed method is available at https://github.com/zhoushen1/MEASNet.

Abstract:
Semantically coherent out-of-distribution detection (SCOOD) is a recently proposed realistic OOD detection setting: given labeled in-distribution (ID) data and mixed in-distribution and out-of-distribution unlabeled data as the training data, SCOOD aims to enable the trained model to accurately identify OOD samples in the testing data. Current SCOOD methods mainly adopt various clustering-based in-distribution sample filtering (IDF) strategies to select clean ID samples from unlabeled data, and take the remaining samples as auxiliary OOD data, which inevitably introduces a large number of noisy samples in training. To address the above issue, we propose a concise SCOOD framework based on predictive sample assignment (PSA). PSA includes a dual-threshold ternary sample assignment strategy based on the predictive energy score that can significantly improve the purity of the selected ID and OOD sample sets by assigning unconfident unlabeled data to an additional discard sample set, and a concept contrastive representation learning loss to further expand the distance between ID and OOD samples in the representation space to assist ID/OOD discrimination. In addition, we also introduce a retraining strategy to help the model fully fit the selected auxiliary ID/OOD samples. Experiments on two standard SCOOD benchmarks demonstrate that our approach outperforms the state-of-the-art methods by a significant margin. The code is available at: https://github.com/ZhimaoPeng/PSA.
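
The dual-threshold ternary assignment described above can be sketched as follows. This is an illustrative sketch rather than the PSA code: it assumes the commonly used energy score E(x) = -T log Σ_k exp(f_k(x)/T), where lower energy suggests an in-distribution sample, and the thresholds `tau_id` and `tau_ood` as well as the function names are hypothetical.

```python
import numpy as np


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * logsumexp(f(x) / T), computed row-wise and stably."""
    z = logits / temperature
    z_max = z.max(axis=1, keepdims=True)
    lse = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    return -temperature * lse


def ternary_assign(logits, tau_id, tau_ood, temperature=1.0):
    """Label each unlabeled sample as 'id', 'ood', or 'discard' (assumed rule)."""
    if tau_id >= tau_ood:
        raise ValueError("tau_id must be smaller than tau_ood")
    energy = energy_score(logits, temperature)
    labels = np.full(energy.shape, "discard", dtype=object)
    labels[energy <= tau_id] = "id"    # confident, low-energy samples
    labels[energy >= tau_ood] = "ood"  # confident, high-energy samples
    return labels
```

Samples whose energy falls between the two thresholds land in the discard set, which is how a ternary split can keep the selected ID and OOD sets purer than a single-threshold split.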

Abstract:
Few-shot anomaly detection aims to detect defects with only a limited number of normal samples for training. Recent few-shot methods typically focus on object-level features rather than subtle defects within objects, as pretrained models are generally trained on classification or image-text matching datasets. However, object-level features are often insufficient to detect defects, which are characterized by fine-grained texture variations. To address this, we propose FocusCLIP, which consists of a vision-guided branch and a language-guided branch. FocusCLIP leverages the complementary relationship between visual and text modalities to jointly emphasize discrepancies in fine-grained textures of defect regions. Specifically, we design three modules to mine these discrepancies. In the vision-guided branch, we propose the Bidirectional Self-knowledge Distillation (BSD) structure, which identifies anomaly regions through inconsistent representations and accumulates these discrepancies. Within this structure, the Anomaly Capture Module (ACM) is designed to refine features and detect more comprehensive anomalies by leveraging semantic cues from multi-head self-attention. In the language-guided branch, Multi-level Adversarial Class Activation Mapping (MACAM) utilizes foreground-invariant responses to adversarial text prompts, reducing interference from object regions and further focusing on defect regions. Our approach outperforms the state-of-the-art methods in few-shot anomaly detection. Additionally, the language-guided branch within FocusCLIP also demonstrates competitive performance in zero-shot anomaly detection, further validating the effectiveness of our proposed method.

Abstract:
Few-shot generative model adaption is a challenging task that aims to adapt a generative model pre-trained on a large-scale source domain dataset to the target domain with limited training samples. Current methods do their best to transfer source domain knowledge to the target generator in different ways. However, the over-fitting problem has always been a thorny problem in model adaption and is hard to solve due to the limited training data. To overcome such issues, we revisit the training process of model adaption and devise hypothetical experiments. Our research has found that current adaption methods fail to fully use the learned knowledge in the source generator. Meanwhile, in the all-parameter training mode, parameters independent of the source domain are also fine-tuned when fitting the few-shot target samples. In order to circumvent such issues, we propose the optimal kernel modulation method for effective few-shot generative model adaption. The idea of optimal transport theory is leveraged to measure the importance of model parameters for knowledge preservation and transfer. Meanwhile, to realize the control of parameter optimization, we adopt the parameter-efficient kernel modulation method according to its importance. Extensive quantitative and qualitative experiments prove the effectiveness and superiority of our method.

Affiliations: School of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China; Key Laboratory of Intelligent Computing and Signal Processing, Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, Hefei, China; School of Control Science and Engineering, Shandong University, Jinan, China; School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC, Australia; TECNALIA, Basque Research and Technology Alliance (BRTA), Mendaro, Spain; Department of Computer and Data Science and the Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA

Abstract:
Domain generalization aims to learn common knowledge from multiple observed source domains and transfer it to unseen target domains, e.g., object recognition in a variety of visual environments. Traditional domain generalization methods aim to learn a feature representation of the raw data whose distribution is invariant across domains. This relies on the assumption that the two posterior distributions (the distributions of the label given the feature and given the raw data) are stable across domains. However, this assumption does not hold in many practical situations. In this paper, we relax the above assumption by permitting the posterior distribution of the label given the raw data to change across domains, and thus focus on a more realistic learning problem that infers the conditional domain-invariant feature representation. Specifically, a multi-domain adversarial variational Bayesian inference approach is proposed to minimize the inter-domain discrepancy of the conditional distributions of the feature given the label. In addition, constraints from adversarial learning and a feedback mechanism are imposed to enhance the condition-invariant feature representation. Extensive experiments on two datasets demonstrate the effectiveness of our approach, as well as its state-of-the-art performance compared with thirteen methods.

Abstract:
A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide a systematic analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition → step recognition → action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features by absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+, which integrates spatial and temporal adapters to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and the publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.

Abstract:
Generalisable face forgery detectors strive to detect forgeries generated by unseen manipulations. Recent advanced detection methods have managed to capture subtle blending traces, but their neglect of the diversity of blending traces in different regions leads to limited generalization. To address this, a transformer with global receptive fields and a dynamic weight mechanism is a promising solution, but the vanilla transformer is weak at capturing subtle blending traces. In this paper, we propose a novel Detail-Aware Transformer (DAT) able to focus on both diverse and subtle blending traces caused by inconsistencies in the low-level image details. The intrinsic multi-head self-attention mechanism of the transformer allows our DAT to adaptively capture diverse blending traces in different regions. Furthermore, we improve the transformer’s capability of capturing subtle blending traces by two measures that add no inference overhead, i.e., self-supervised pre-training based on patch augmentation and region-level contrastive learning. Specifically, the self-supervised pre-training encourages the model to focus on the inconsistencies in low-level image details through a patch number prediction task. The region-level contrastive learning employs a contrastive loss on representations of regions with different low-level details to further improve the transformer’s ability to handle subtle blending traces. Extensive experiments show that our method substantially improves the generalization performance and outperforms the state-of-the-art methods on the CDF, DFDC, DFDCP, FFIW, and WildDeepfake datasets.

Abstract:
In object detection, particularly within remote sensing images, the quality of selected samples is crucial for the accuracy and robustness of detection models. However, current sampling strategies demonstrate inherent limitations. They empirically define positive sample sets using fixed thresholds or preset areas, ignoring the actual shapes of the objects and failing to distinguish the intrinsic value of each sample point. To address these critical issues, this article proposes a novel centric probability-based sample selection approach that includes centering probability mapping (CPM), Expectation-Maximization-based boundary optimization (EBO), and probabilistic random sampling (PRS) technologies. Specifically, the CPM is constructed to assign various confidence levels for all sample points based on their proximity to the center of bounding box, effectively discerning the value of individual samples. Then, the EBO is utilized to dynamically optimize the boundaries for positive and negative samples based on the EM algorithm, thus avoiding the sample imbalance problem associated with empirical thresholds. Finally, the PRS strategy is proposed to select training samples from the sample space constructed by CPM and EBO in a manner of random probability sampling, which could improve the diversity of samples while guaranteeing their quality. Experimental validation on three remote sensing image datasets, including DOTA-v1.0, DOTA-v2.0, and DIOR-R, demonstrates that our method achieves robust performance improvements over baseline and significantly surpasses the advanced sample selection methods. The source code will be available at https://github.com/yanqingyao1994/CPSS.
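
The idea of weighting sample points by their proximity to the box center and then drawing training samples at random in proportion to those weights can be sketched as below. This is a loose illustration and not the authors' CPM or PRS: the Gaussian form, the `sigma` parameter, the axis-aligned box (remote sensing detectors often use oriented boxes), and all function names are assumptions, and the EM-based boundary optimization is not reproduced.

```python
import numpy as np


def center_probability(points, box, sigma=0.5):
    """Gaussian-shaped confidence that peaks at the box center (assumed form)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = (x1 - x0) / 2.0, (y1 - y0) / 2.0
    # Normalize offsets by the box half-extents so the map follows the box shape.
    dx = (points[:, 0] - cx) / half_w
    dy = (points[:, 1] - cy) / half_h
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


def sample_points(points, probs, k, seed=0):
    """Draw k distinct points with probability proportional to probs."""
    rng = np.random.default_rng(seed)
    p = probs / probs.sum()
    idx = rng.choice(len(points), size=k, replace=False, p=p)
    return points[idx]
```

Sampling in proportion to the confidence, instead of keeping only the top-scoring points, is one way to retain some diversity among the selected samples while still favoring high-quality ones.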

Abstract:
Video moment retrieval (VMR) involves localizing video segments semantically aligned with given queries within videos. Despite the development of numerous methods for VMR in recent years, there remains a need to better incorporate fine-grained modality relation-aware information at both the intra-modality and cross-modality levels. To address these challenges, we propose a Fine-grained Modality Relation-Aware Network (FMRN) tailored for the video moment retrieval task. FMRN effectively explores fine-grained modality relation-aware information within text queries, videos, and proposals. Our approach begins with a semantic graph encoder to capture deep intra-modality semantic relations. In addition, we introduce a novel fine-grained cross-modality interaction module comprising a cross-similarity weighting module, an intra-modality weighting module, and an adaptive fusion module. These components comprehensively exploit fine-grained relation information within intra-modality and cross-modality contexts. Specifically, the cross-similarity weighting module leverages similarities between text queries and video snippets, as well as between videos and query words. The intra-modality weighting module determines the importance of words and snippets, while the adaptive fusion module combines cross-similarity weighting and intra-modality weighting. Additionally, we design a proposal relation module to enhance retrieval by capturing fine-grained proposal-relation information in videos. Extensive experiments demonstrate that the proposed method can outperform all state-of-the-art methods on the TACoS dataset and obtain comparable results on the Charades-STA and ActivityNet-Captions datasets. Compared with MCMN (TCSVT2024) and DPHANet (TMM2024), FMRN can achieve average improvements of 3.61% and 5.44% on the TACoS dataset, respectively.

Abstract:
Video captioning, a challenging task that entails generating natural language descriptions of visual content, often fails to effectively grasp the essence of action semantics. To harness the power of action detection to facilitate a deeper understanding of the video content, we propose an action-driven method, named Hierarchical Semantic Representation and Aggregation (HSRA) network. This method explicitly exploits action clues with a hierarchical semantic representation module, which models visual semantics in a three-level structure: “object-action-event”. By employing learnable action queries, our approach injects extensive action semantics into the model, thereby enabling more accurate and context-rich captions. To further enhance semantic alignment and understanding, we introduce a semantic aggregation composed of a semantic interaction module and a semantic refinement module. This component facilitates the alignment of semantics across different levels and emphasizes key information, ultimately leading to significant improvements in semantic consistency between the video and generated captions. We performed extensive evaluations on two well-established public datasets, MSVD and MSR-VTT, and the findings consistently demonstrate that our proposed HSRA network outperforms contemporary state-of-the-art methods.

Abstract:
Underwater operations frequently encounter turbid environments, where light absorption and scattering by suspended particles degrade image quality by causing color distortion, uneven brightness, and blurred details. Clear imaging in such conditions is essential for enhancing the efficiency and effectiveness of underwater tasks, including exploration, marine ecological monitoring, and the preservation of underwater cultural heritage. However, existing underwater image enhancement methods struggle to perform well in turbid waters, especially in highly turbid conditions. In this study, we present an advanced method designed to significantly improve the clarity of images captured in turbid water. We begin by introducing an adaptive color correction algorithm that uses the dominant color channel’s pixel values to adjust and restore the colors of other channels, mitigating color distortion in turbid conditions. Subsequently, we apply adaptive threshold segmentation and turbidity assessment to automatically calibrate histogram equalization, which enhances local contrast and suppresses noise. Finally, we develop a dark channel prior based on turbidity background light estimation, which further improves color restoration and detail recovery. Our proposed method outperforms existing state-of-the-art techniques in color restoration, turbidity removal, and detail enhancement. Experimental results demonstrate that our approach effectively enhances imaging performance in turbid waters, thereby significantly improving the operational efficiency of various underwater applications.
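
For readers unfamiliar with the dark channel used in the final step, the standard dark channel of an image (the per-pixel minimum over color channels followed by a minimum over a local patch) can be computed as in the sketch below. This shows only the generic computation: the turbidity-based background light estimation proposed in the paper is not reproduced, and the function name and `patch` parameter are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def dark_channel(image, patch=3):
    """Dark channel of an H x W x 3 image with values in [0, 1]."""
    if patch % 2 == 0:
        raise ValueError("patch must be odd")
    per_pixel_min = image.min(axis=2)  # minimum over color channels
    pad = patch // 2
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(per_pixel_min, pad, mode="edge")
    windows = sliding_window_view(padded, (patch, patch))
    return windows.min(axis=(2, 3))    # minimum over the local patch
```

In haze-style image formation models, bright values in the dark channel are commonly taken as a cue for where scattering dominates, which is why the map is useful for estimating background light.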

Abstract:
The demand for resilient watermarking technology in the screen-shooting scenario is steadily on the rise. The principal objective of this technique is to embed messages into the cover image, with the ability to effectively recover the message from the screen-captured image at the extraction end. However, current watermarking methods produce watermarked images of low visual quality and are insufficiently robust in screen-shooting scenarios. This is mainly because they only utilize spatial domain information during embedding, and they do not consider the impact of the noise introduced during screen capturing. This paper introduces an innovative network framework, including a wavelet domain concatenation and recovery mechanism, to overcome the dual challenges encountered in robust watermarking, namely visual fidelity and robustness. For fidelity, we present a cascade network operating in the wavelet domain. This network excels at detecting watermark information in the wavelet domain, which makes it more sensitive to high- and low-frequency details. The discrete wavelet transform lets the CNN focus on different frequency characteristics, and the use of the inverse discrete wavelet transform in upsampling preserves the information with high fidelity. As a result, the network can more accurately identify and preserve critical visual details in this frequency domain, leading to an overall enhancement in visual quality. For robustness, a recovery network is specifically designed to mitigate the influence of noise introduced during screen-shooting on watermark information extraction. Experimental validation of our proposed method substantiates its effectiveness in significantly enhancing the visual quality and the accuracy of the watermarked images.

Abstract:
Over the past few years, self-supervised monocular depth estimation has received widespread attention. Most efforts focus on designing different types of network architectures and loss functions or handling edge cases, for example, occlusion and dynamic objects. In this work, we take another path and propose a novel conditional diffusion-based generative framework for self-supervised monocular depth estimation, dubbed MonoDiffusion. Because the depth ground truth is unavailable in a self-supervised setting, we develop a new pseudo ground-truth diffusion process to assist the diffusion during training. Instead of diffusing at a fixed high resolution, we perform diffusion in a coarse-to-fine manner, which allows faster inference with no loss of accuracy, or even with better accuracy. Furthermore, we develop a simple yet effective contrastive depth reconstruction mechanism to enhance the denoising ability of the model. It is worth noting that the proposed MonoDiffusion naturally provides depth uncertainty, which is essential for deployment in safety-critical cases. Extensive experiments on the KITTI, Make3D and DIML datasets indicate that our MonoDiffusion outperforms prior state-of-the-art self-supervised competitors. The source code will be publicly available upon acceptance.

Abstract:
Digital images in real-world applications typically undergo a wide variety of quality degradations before compression or re-compression. Existing learning-based codecs are typically data-driven, relying on a predefined compression pipeline with pristine or high-quality images as the input. However, images in the wild may exhibit substantially different characteristics compared to high-quality images, posing major challenges to learning-based image coding. In this paper, we propose a robust noisy image compression framework that is blind to the specific noise type and level. The specifically designed encoder decomposes the representation of visual content into two types of features: the Features that represent the Intrinsic Content (FIC) and the Features that account for Additive Degradation (FAD). As such, beyond the philosophy of faithfully reconstructing the given image with high fidelity, only FIC needs to be compactly represented and conveyed. The principled disentanglement strategy facilitates the removal of redundancy from multiple perspectives (e.g., spatial, channel and content), ensuring the handling of a wide variety of noisy images in the wild. Extensive experimental results show that our model can achieve superior performance in terms of the ultimate quality and exhibits strong generalizability across images degraded by a variety of means. The proposed scheme also points out a new research avenue on learning-based compression for images in the wild, which is technically challenging but desirable in practice.

Abstract:
Learned image compression (LIC) has reached a coding gain comparable to that of traditional hand-crafted methods such as VVC intra. However, the large network complexity prohibits the usage of LIC on resource-limited embedded systems. Network quantization is an efficient way to reduce the network burden. This paper presents a quantized LIC (QLIC) based on channel splitting. First, we observe that the influence of quantization error on the reconstruction error differs across channels. Second, we split the channels whose quantization has a larger influence on the reconstruction error. After the splitting, the dynamic range of the channels is reduced, so that the quantization error can be reduced. Finally, we prune several channels to keep the overall number of channels the same as in the original network. With the proposed method, in the case of 8-bit quantization for the weights and activations of both the main and hyper paths, we can reduce the BD-rate by 0.61%-4.74% compared with the previous QLIC. In addition, we can reach a better coding gain than the state-of-the-art network quantization method when quantizing MS-SSIM models. Moreover, our proposal can be combined with other network quantization methods to further improve the coding gain. The moderate coding loss caused by the quantization validates the feasibility of a future hardware implementation of QLIC.
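
The effect of channel splitting on dynamic range can be illustrated on a single linear layer: one input channel is duplicated and its weights are halved, so the layer output is unchanged while the largest weight magnitude, and hence the uniform quantization step, shrinks. This is a generic sketch under stated assumptions, not the paper's procedure: how channels are selected, whether weights or activations are split, and the pruning step are not reproduced, and all function names are hypothetical.

```python
import numpy as np


def split_channel(weight, x, channel):
    """Duplicate one input channel and halve its weights; W @ x is preserved."""
    half = weight[:, channel:channel + 1] / 2.0
    new_weight = np.concatenate(
        [weight[:, :channel], half, half, weight[:, channel + 1:]], axis=1
    )
    new_x = np.concatenate(
        [x[:channel], x[channel:channel + 1], x[channel:channel + 1], x[channel + 1:]]
    )
    return new_weight, new_x


def quant_step(weight, bits=8):
    """Step size of a symmetric per-tensor uniform quantizer (assumed scheme)."""
    return np.abs(weight).max() / (2 ** (bits - 1) - 1)
```

If the split channel is the one holding the largest-magnitude weight, the quantizer can use a finer step for every weight in the tensor, at the cost of one extra channel.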

Abstract:
Although the recent learning-based image and video coding techniques achieve rapid development, the signal fidelity-driven target in these methods leads to the divergence to a highly effective and efficient coding framework for both human and machine. In this paper, we aim to address the issue by making use of the power of generative models to bridge the gap between full fidelity (for human vision) and high discrimination (for machine vision). Therefore, relying on existing pretrained generative adversarial networks (GAN), we build a GAN inversion framework that projects the image into a low-dimensional natural image manifold. In this manifold, the feature is highly discriminative and also encodes the appearance information of the image, named as latent code. Taking a variational bit-rate constraint with a hyperprior model to model/suppress the entropy of image manifold code, our method is capable of fulfilling the needs of both machine and human visions at very low bit-rates. To improve the visual quality of image reconstruction, we further propose multiple latent codes and scalable inversion. The former gets several latent codes in the inversion, while the latter additionally compresses and transmits a shallow compact feature to support visual reconstruction. Experimental results demonstrate the superiority of our method in both human vision tasks, i.e. image reconstruction, and machine vision tasks, including semantic parsing and attribute prediction.

Affiliations: School of Computer Science and Technology, Dalian University of Technology, Dalian, China; School of Electronic and Computer Engineering, Peking University, Shenzhen, China; National Astronomical Observatories, University of Chinese Academy of Sciences, Beijing, China; School of Information Science and Engineering, Northeastern University, Shenyang, China; School of Control Science and Engineering, Dalian University of Technology, Dalian, China; Department of Automation, Tsinghua University, Beijing, China

Abstract:
Visual SLAM (Simultaneous Localization and Mapping) systems based on planar features have been widely applied in fields such as environmental structure perception and augmented reality (AR). However, current research still faces challenges in accurate localization and map construction in planar ambiguous scenes, primarily due to the insufficient accuracy of the planar features and data association methods employed. In this paper, we propose a visual SLAM system based on planar features designed for ambiguous planar scenes, including planar analysis and processing, data association, and multi-constraint factor graph optimization. Initially, we introduce a planar analysis and processing strategy that integrates semantic information to analyze the structure of planes and further refine the selection of planes, providing accurate planar information for subsequent association and optimization processes. Then, we integrate various planar data to propose a multimodal fusion data association strategy, achieving accurate and robust planar data association in ambiguous planar scenes. Finally, based on accurate and rich planar information along with related constraints, we design a set of multi-constraint factor graphs for camera pose optimization. Public datasets and real-world experiments demonstrate that, compared to state-of-the-art related research, our proposed system shows significant competitive advantages in terms of accuracy and robustness for both map construction and camera localization. Regarding quantifiable localization accuracy, our system achieves an average improvement in Absolute Trajectory Error (ATE) of approximately 57% in planar ambiguous scenes and about 25% in non-planar ambiguous scenes. Additionally, the system exhibits great application potential in fields such as augmented reality.

Abstract:
Full-view finger vein (FV) biometric systems capture multiple FV images of the presented finger, ensuring that the entire surface of the finger is covered. Existing full-view FV systems suffer from three common problems: large device size, the high cost of multi-camera setups, and sub-optimal illumination in the recorded FV images. To address the problem of device size, we propose a novel Mirror-based Full-view FV (MFFV) capture device, which achieves a compact size by using a mirror-reflection approach. We reduce the cost of the device by using low-cost components, in particular consumer-grade cameras. To address the lower-quality images captured by such cameras and to obtain optimally illuminated FV images, we propose a two-step approach. The first step is a Multi-illumination Intensities FV (MIFV) capture strategy, which captures a set of FV images with varying illumination intensities. In the second step, an FV illumination adaptation (FVIA) algorithm selects the optimally illuminated FV image from the MIFV image set. Using the proposed MFFV device, we collect a comprehensive dataset, namely the MFFV dataset, along with reproducible baseline FV authentication results for both single-view and full-view FV. Our experimental results demonstrate that the MIFV capture strategy and the FVIA algorithm effectively improve authentication performance, and that full-view FV authentication is significantly superior to single-view FV authentication. The source code and dataset for reproducing our experimental results are publicly available. The code and the license for the MFFV-N dataset can be accessed at: https://github.com/SCUT-BIP-Lab/MFFV.

Abstract:
Pedestrian trajectory prediction is a crucial component in computer vision and robotics, but remains challenging due to the domain shift problem. Previous studies have tried to tackle this problem by leveraging a portion of trajectory data from the target domain to fine-tune the model. However, such domain adaptation methods are impractical in real-world scenarios, as it is infeasible to collect trajectory data from all potential target domains. In this paper, we study a new task named generalized pedestrian trajectory prediction, with the aim of generalizing the model to unseen domains without accessing their trajectories. To tackle this task, we further introduce a Recurrent Aligned Network (RAN) to minimize the domain gap through domain alignment. Specifically, we devise a recurrent alignment module to effectively align the trajectory feature spaces at both time-state and time-sequence levels by the recurrent alignment strategy. Furthermore, we introduce a pre-aligned representation module to combine social interactions with the recurrent alignment strategy, which aims to consider social interactions during the alignment process instead of just target trajectories. We extensively evaluate our method and compare it with state-of-the-art methods on three widely used benchmarks. The experimental results demonstrate the superior generalization capability of our method. Our work not only fills the gap in the generalization setting for practical pedestrian trajectory prediction, but also sets strong baselines in this field.

Abstract:
Few-shot learning (FSL) requires vision models to quickly adapt to brand-new classification tasks with changing task distributions in the presence of limited annotated samples. However, the learned model is susceptible to overfitting and may fail to identify effective classification boundaries due to the biased distribution resulting from a limited number of training samples. Moreover, if the support samples from different classes in the new task are in close proximity, this may lead to fuzzy or even biased class decision boundaries. To address the issues, we propose a generation-based Feature Transductive Distribution Optimization (FTDO) in our research. Specifically, we calibrate the distribution of novel classes by utilizing high-confidence unlabeled query samples from these novel classes, together with the statistics of similar base classes, to generate a sufficient number of virtual training samples. In addition, we introduce a task commonality removal and discriminability enhancement module, which eliminates commonality from all features in the task along the task-commonality direction, and reinforces the retained discriminative features through a channel transformation function. Our method can be implemented using off-the-shelf pre-trained feature extractors and classification models, without requiring additional parameters. Experiments conducted on four few-shot classification datasets substantiate the superiority of our proposed method.
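The calibration step described above can be illustrated with a minimal sketch. It is not the FTDO implementation: the choice of Euclidean nearest base classes, the simple averaging of means and covariances, and the names `calibrate_and_sample`, `base_means`, and `base_covs` are all assumptions made for illustration. The `support` array is assumed to hold the labelled support features of one novel class, optionally extended with high-confidence query features.

```python
import numpy as np

def calibrate_and_sample(support, base_means, base_covs, k=2, n_samples=100, rng=None):
    """Draw virtual features for one novel class from a calibrated Gaussian.

    support:    (n, d) features of the novel class (hypothetical input).
    base_means: (B, d) per-class feature means of the base classes.
    base_covs:  (B, d, d) per-class feature covariances of the base classes.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mu = support.mean(axis=0)
    # Pick the k base classes whose means are closest to the novel-class mean.
    nearest = np.argsort(np.linalg.norm(base_means - mu, axis=1))[:k]
    # Assumed calibration rule: average the novel mean with the borrowed base means.
    mean = (base_means[nearest].sum(axis=0) + mu) / (k + 1)
    # Assumed calibration rule: borrow the average covariance of those base classes.
    cov = base_covs[nearest].mean(axis=0)
    return rng.multivariate_normal(mean, cov, size=n_samples)
```

The sampled features would then be added to the support set before fitting a classifier; how the abstract's method weights base statistics against query samples is not stated, so the equal weighting here is only a placeholder.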

Abstract:
Underwater cameras and sonar are naturally complementary in the underwater environment. Combining the information from the two modalities promotes better observation of underwater targets. However, this problem has received little attention in previous research. Therefore, this paper introduces a new and challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of the RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87,000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses significant challenges to the currently popular SOT trackers. Second, we propose two RGB-S trackers, called SCANet and SCANet-Refine. They include a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer, an attention refinement module, and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities, and that SCANet and SCANet-Refine achieve state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.

Abstract:
Explicit stripes in digital images have long posed challenges for computer vision tasks: they significantly disturb visual perception and are undesirable for subsequent applications. Existing destriping methods, often constrained by image prior assumptions and lacking flexibility, demonstrate limited practical applicability. In response to these challenges, this paper proposes a flexible Destriping framework with arbitrary bounded Image Denoisers, called DID. The proposed framework decouples the destriping task into conditional expectation calculation and stripe estimation, and alternates between these two parts to obtain the maximum likelihood estimate of the image. The former calculates the conditional expectation of the clean image given the estimated stripe, while the latter estimates the mean of each column in the residual image. To calculate the conditional expectation, this paper analyzes the equivalence between general image denoising and conditional expectation calculation based on Bayesian statistics. It is proven that the proposed DID framework flexibly incorporates existing denoisers to calculate the conditional expectation, without the need to explicitly define image prior assumptions. Furthermore, the fixed-point convergence of the DID framework is guaranteed provided that the applied denoiser is bounded in the F-norm sense. Experimental results on both synthesized and real data validate the effectiveness and generalization of the proposed method, both quantitatively and qualitatively.
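The alternation between denoising and column-wise stripe estimation can be sketched as follows. This is only an illustration of the two-step loop described in the abstract, not the DID implementation: the horizontal moving-average `box_denoise` is a stand-in for "any bounded denoiser", and the function names, the iteration count, and the assumption of purely additive column stripes are hypothetical.

```python
import numpy as np

def box_denoise(img, k=5):
    """Stand-in denoiser (assumption): moving average of odd width k along each row."""
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="edge")
    kernel = np.ones(k) / k
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

def destripe(observed, denoiser=box_denoise, iters=20):
    """Alternate between estimating the clean image and the per-column stripe."""
    stripe = np.zeros(observed.shape[1])
    for _ in range(iters):
        # Step 1: estimate the clean image given the current stripe estimate.
        clean = denoiser(observed - stripe)
        # Step 2: estimate the stripe as the column means of the residual image.
        stripe = (observed - clean).mean(axis=0)
    return clean, stripe
```

Because the denoiser is passed in as a plain function, a learned denoiser could be substituted without changing the loop, which is the flexibility the abstract emphasizes.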

Abstract:
How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual video segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to a more balanced audio-visual representation learning. Extensive experiments show that our method has state-of-the-art performance on several AVS public benchmarks.

Abstract:
Current few-shot learning techniques predominantly leverage amortization techniques based on meta-learning frameworks, which effectively adapt to unknown tasks with limited examples. However, these approaches face significant challenges in cross-domain scenarios, where the data distributions between the source domain (training data) and the target domain (testing data) differ substantially. This domain shift can lead to models that overfit the global discriminative model while underfitting their local amortization on the adaptable few-shot structure. To mitigate this problem, we upgrade Conditional Neural Adaptive Processes by reformulating its conditioning mechanism to better handle cross-domain adaptation. This results in calibrated amortization of task-specific feature extractors and the construction of a robust non-parametric classifier. In our implementation, we first apply generative modeling or deterministic self-attention to all labeled context features, establishing a strong task-level alignment that adapts the extractor across domains. Additionally, we introduce a novel channel-wise normalization to further enhance the adaptation process. Our experiments on the Meta-dataset benchmark demonstrate an average improvement of 6.9% to 9% in out-of-distribution tasks, underscoring the effectiveness of exploiting calibrated adaptation in few-shot cross-domain classification.

Abstract:
Action Quality Assessment (AQA) is a challenging task involving analyzing fine-grained technical subactions, aligning high-level visual-semantic representations, and exploring internal temporal structures that capture the overall meaning of given action sequences. To address these challenges, we propose a Visual-semantic Alignment Temporal Parsing Network (VATP-Net) to understand the high-level visual semantics of subaction sequences and internal temporal structures without explicit supervision for action quality assessment. The proposed approach designs a self-supervised temporal parsing module to generate subaction sequences from the given video by aligning the visual and semantic action features. It captures high-level semantics and the internal temporal dynamics of subaction sequences. Furthermore, a multimodal interaction module is proposed to capture the interaction between different modalities of action features, enabling a comprehensive assessment of fine-grained and scene-invariant action details. The proposed module captures the intricate relationships and encourages interactions between different modalities within an action sequence, enhancing the overall understanding of action assessment. We exhaustively evaluate our proposed approach on the MTL-AQA, Rhythmic Gymnastics (RG), FineFS, and Fis-V datasets. Extensive experimental results demonstrate the effectiveness and feasibility of our proposed approach, which outperforms state-of-the-art methods by a significant margin.

Abstract:
The majority of existing counting models are designed to operate on a singular object category, such as crowds or vehicles. The emergence of multi-modal foundational models, e.g., Contrastive Language-Image Pre-training (CLIP), has paved the way for class-agnostic counting. This approach facilitates the counting of objects across diverse classes within a single image based on textual indications. However, class-agnostic counting models based on CLIP confront two primary challenges. Firstly, the CLIP model exhibits limited sensitivity towards location information, which prioritizes global content over the precise localization of objects. Therefore, directly employing the CLIP model is regarded as suboptimal. Secondly, these models commonly employ frozen pre-trained vision and language encoders while disregarding potential misalignment within the constructed hypothesis space. In this paper, we propose a unified framework, named the Vision-Language Prior Guidance (VLPG) Network, to tackle these two challenges. The VLPG consists of three key components, namely the Grounding DINO module, Spatial Prior Calibration (SPC) module, and Object-Centric Alignment (OCA) module. The Grounding DINO module utilizes the spatial-awareness capability of extensive pre-trained object grounding models to incorporate the spatial position as an additional prior for a particular query class. This adaptation enables the network to concentrate more precisely on the exact location of the objects. Meanwhile, the SPC module is built to extract the long-range dependencies and local regions of the spatial position. Additionally, to align the feature space across different modalities, we design an OCA module that condenses textual information into an object query which serves as an instruction for cross-modality matching. Through the collaborative efforts of these three modules, multimodal representations are aligned while maintaining their discriminative nature. Comprehensive experiments conducted on various benchmarks validate the effectiveness of the proposed model.

Abstract:
Deep cross-modal retrieval, with its effective and efficient search capabilities, has gained widespread adoption in today’s media-sharing practices yet raises concerns regarding potential threats to user data privacy. The cutting-edge data-centric countermeasures usually adopt adversarial learning, i.e., laboriously crafting the proper perturbation for each image, resulting in the noticeable noise in adversarial examples that greatly undermines the aesthetic appeal of image sharing. To address this issue, we propose a novel Model-centric Cross-modal Privacy-preserving framework (MCP), wherein the pre-defined invisible backdoor is seamlessly integrated into the global retrieval model via backdoor learning, thereby effectively preventing shared images containing such triggers from being retrieved. Specifically, we introduce a simple yet effective cross-modal backdoor learning algorithm that alternately optimizes two losses: 1) a privacy-preserving loss for perturbing retrieval with a user-injected trigger and 2) the standard utility loss for maintaining normal retrieval performance. Compared to state-of-the-art methods, MCP excels in providing excellent stealthiness, manifesting in a notable improvement of approximately 100% in SSIM metrics. Furthermore, it achieves an outstanding privacy-preserving (backdoor) success rate, as evidenced by a substantial mAP reduction of 22.3% (for FashionVC), 11.5% (for NUS-WIDE), and 21.8% (for MIRFlickr-25K) in poisoned retrieval, while maintaining similar normal retrieval performance. Additionally, MCP exhibits robust resistance against potential black-box defenses (e.g., trigger filtering) and white-box defenses (e.g., fine-tuning and model pruning). The code and data are available at https://github.com/lqsunshine/MCP.

Abstract:
Deep Neural Networks (DNNs) are susceptible to adversarial examples. Conventional attacks generate controlled noise-like perturbations that fail to reflect real-world scenarios and are hard to interpret. In contrast, recent unconstrained attacks mimic natural image transformations occurring in the real world to produce perceptible but inconspicuous attacks, yet they compromise realism because they neglect image post-processing and leave the attack direction uncontrolled. In this paper, we propose RetouchUAA, an unconstrained attack that exploits a real-life perturbation, image retouching styles, highlighting its potential threat to DNNs. Compared to existing attacks, RetouchUAA offers several notable advantages. Firstly, RetouchUAA excels in generating interpretable and realistic perturbations through two key designs: the image retouching attack framework and the retouching style guidance module. The former is a custom-designed, human-interpretable retouching framework for adversarial attacks that linearizes images while modelling the local processing and retouching decision-making in human retouching behaviour, providing an explicit and reasonable pipeline for understanding the robustness of DNNs against retouching. The latter guides the adversarial image towards standard retouching styles, thereby ensuring its realism. Secondly, owing to the design of the retouching decision regularization and the persistent attack strategy, RetouchUAA also exhibits outstanding attack capability and defense robustness, posing a heavy threat to DNNs. Experiments on ImageNet, Place365 and CUB200 reveal that RetouchUAA achieves nearly 100% white-box attack success against three DNNs, while achieving a better trade-off between image naturalness, transferability and defense robustness than baseline attacks.

Affiliations: Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, School of Computer and Big Data, Minjiang University, Fuzhou, China; School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou, China; School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; Fujian Key Laboratory for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, Fuzhou, China; School of Data Science, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China

Abstract:
Semi-supervised learning suffers from the imbalance of labeled and unlabeled training data in the video surveillance scenario. In this paper, we propose a new semi-supervised learning method called SIAVC for industrial accident video classification. Specifically, we design a video augmentation module called the Super Augmentation Block (SAB). SAB adds Gaussian noise and randomly masks video frames according to historical loss on the unlabeled data for model optimization. Then, we propose a Video Cross-set Augmentation Module (VCAM) to generate diverse pseudo-label samples from the high-confidence unlabeled samples, which alleviates the mismatch of sampling experience and provides high-quality training data. Additionally, we construct a new industrial accident surveillance video dataset with frame-level annotation, namely ECA9, to evaluate our proposed method. Compared with state-of-the-art semi-supervised learning methods, SIAVC demonstrates outstanding video classification performance, achieving 88.76% and 89.13% accuracy on the ECA9 and Fire Detection datasets, respectively. The source code and the constructed dataset ECA9 will be released at https://github.com/AlchemyEmperor/SIAVC.
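The two operations attributed to SAB (Gaussian noise and random frame masking) can be sketched as below. The abstract does not say how the historical loss controls the augmentation, so that mapping is left out: the `strength` argument is a hypothetical knob in [0, 1] that a caller would derive from the loss history, and the noise scale of 0.1 and masking rate of 0.5 are arbitrary illustrative constants.

```python
import numpy as np

def augment_clip(clip, strength, rng):
    """Add Gaussian noise and randomly mask whole frames of a video clip.

    clip:     (T, H, W, C) float array with values in [0, 1].
    strength: hypothetical augmentation strength in [0, 1].
    rng:      a numpy.random.Generator.
    Returns the augmented clip and a boolean vector marking the kept frames.
    """
    # Gaussian noise whose standard deviation grows with the strength.
    noisy = clip + rng.normal(0.0, 0.1 * strength, size=clip.shape)
    # Each frame is masked (zeroed) with probability 0.5 * strength.
    keep = rng.random(clip.shape[0]) >= 0.5 * strength
    out = np.clip(noisy, 0.0, 1.0) * keep[:, None, None, None]
    return out, keep
```

With `strength=0.0` the clip is returned unchanged, which gives a convenient baseline when comparing augmentation levels.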

Abstract:
In recent years, user-generated content (UGC) videos have become the mainstream of internet videos, and they are characterized by rich content, complicated temporal changes and multiple distortions. However, existing rate control (RC) methods do not consider these unique characteristics, leading to severe bit-rate errors and coding performance degradation. To address these issues, we propose a content-adaptive RC method for UGC videos, where accurate RC coding parameters are derived by our proposed rate-distortion (RD) model derivations for different types of pictures and a novel bit allocation refinement module. Specifically, the RD models of intra pictures are derived by established SVR-based predictors using features designed for diverse content, such as texture complexity and regularity. Considering the complex temporal variation, single-reference inter pictures are first classified into three categories (i.e., low, regular and high correlation) by an SVM-based classifier using correlation-based features. The training data of the classifier are labeled by introducing a series of classification metrics. Then, the RD model is derived by the established predictors for each type of inter picture. In addition, the RD model of multiple-reference inter pictures is derived by using an updated RD model selection based on content similarity. Based on the derived RD models, the allocated bits are refined to reduce bit waste. Experimental results show that, compared with the default RC method in versatile video coding (VVC), our method can effectively save BD-Rate and reduce bit-rate errors for UGC videos. In particular, 1.99% BD-Rate saving and 0.18% bit-rate error reduction can be achieved under the random access (RA) configuration, and 0.45% BD-Rate saving under the low-delay B (LDB) configuration.

Abstract:
In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer a variable bitrate image compression scheme consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained with losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks reveal the robustness and generalization capability of the proposed framework.

Abstract:
In this paper, we focus on extending Video Moment Retrieval (VMR) models to more diverse scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, which largely ignores the cross-instance relationship behind the representation learning. Thus, they can easily learn an inaccurate cross-instance contrastive relationship during training: 1) Existing approaches can hardly identify similar semantic content across different scenes. They incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn the features from query-irrelevant scenes. 2) Existing methods perform unsatisfactorily in locating queries with subtle differences. They neglect to mine the hard negative samples that belong to similar scenes but have different semantic content. We therefore propose a novel robust video moment retrieval method that prevents the model from overfitting the query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out the redundant scene contents and infers the video semantics under the guidance of query information. Then, the faulty and hard negative samples are mined from the negative ones and calibrated for their contribution to the overall loss in contrastive learning. We validate our contributions through extensive experiments on cross-scene video moment retrieval settings, where the training and test data are from different scenes. Experimental results show that the proposed robust video moment retrieval model can effectively retrieve target videos by capturing the real cross-modal and cross-instance relationships.

Abstract:
Time-to-Collision (TTC) is a measure of the time until an object collides with the observation plane, and it is a critical input indicator for obstacle avoidance and other downstream modules. Previous works have utilized deep neural networks to estimate TTC with monocular cameras in an end-to-end manner, obtaining state-of-the-art (SOTA) accuracy. However, these models usually have deep layers and numerous parameters, resulting in long inference time and high computational overhead. Moreover, existing methods use two frames, at the current and future moments, as input to calculate the TTC, resulting in a delay during the calculation process. To solve these issues, we propose a novel fast TTC prediction model: FP-TTC. We first use an attention-based scale encoder to model the scale-matching process between images, which significantly reduces the computational overhead and improves the model's accuracy. Meanwhile, a simple but powerful trick is introduced to the model: we build a time-series decoder and predict the current TTC from past RGB images, avoiding the computational delay caused by the system time step interval and further improving the TTC prediction speed. Our model achieves a parameter reduction of 89.1%, a 5.5-fold increase in inference speed, and a 19.3% improvement in accuracy. We also provide a lightweight version of FP-TTC, which further improves the inference speed and parameter count by 15%. Our code is available at https://github.com/LChanglin/FP-TTC.
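The link between image scale change and TTC can be made concrete with the standard geometric relation below. It is not the FP-TTC network: it assumes a pinhole camera and a constant closing speed, and the function name `ttc_from_scale` and its parameters are illustrative. If an object's image size grows by a ratio s between two frames taken dt seconds apart, its depth shrinks by the same ratio, so the time remaining at the later frame is dt / (s - 1).

```python
def ttc_from_scale(scale_ratio, dt):
    """Time-to-collision at the later frame, in the same unit as dt.

    scale_ratio: image size in the later frame divided by size in the earlier frame.
    dt:          time between the two frames.
    Assumes a pinhole camera and constant closing speed (illustrative only).
    """
    if scale_ratio <= 1.0:
        # The object is not growing in the image, so no collision is predicted.
        return float("inf")
    return dt / (scale_ratio - 1.0)

# An object moving from 10 m to 9 m depth in 0.1 s has scale ratio 10/9
# and closes at 10 m/s, so about 0.9 s remain at the later frame.
print(ttc_from_scale(10.0 / 9.0, 0.1))
```

This relation needs both frames before it can produce a value, which is the source of the delay that the abstract attributes to two-frame methods.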

Abstract:
Open-vocabulary video instance segmentation strives to segment and track instances belonging to an open set of categories in a video. The vision-language model Contrastive Language-Image Pre-training (CLIP) has shown robust zero-shot classification ability in image-level open-vocabulary tasks. In this paper, we propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation. Our CLIP-VIS adopts frozen CLIP and introduces three modules: class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification. Given a set of initial queries, class-agnostic mask generation introduces a pixel decoder and a transformer decoder on top of the CLIP pre-trained image encoder to predict query masks together with the corresponding object scores and mask IoU scores. Then, temporal topK-enhanced matching performs query matching across frames using the K best-matched frames. Finally, weighted open-vocabulary classification first employs mask pooling to generate query visual features from the CLIP pre-trained image encoder, and then performs weighted classification using the object scores and mask IoU scores. Our CLIP-VIS does not require annotations of instance categories and identities. Experiments on various video instance segmentation datasets demonstrate the effectiveness of our proposed method, especially for novel categories. When using ConvNeXt-B as the backbone, our CLIP-VIS achieves AP and APn scores of 32.2% and 40.2% on the validation set of the LV-VIS dataset, outperforming OV2Seg by 11.1% and 23.9%, respectively. We will release the source code and models at https://github.com/zwq456/CLIP-VIS.git.

Abstract:
Generalized zero-shot learning (GZSL) requires that models be able to recognize both the classes they were trained on and new classes they have not seen before. Feature-generation approaches are popular due to their effectiveness in mitigating overfitting to the training classes. However, existing generative approaches usually adopt simple discriminators for distribution or classification supervision, which limits their ability to generate visual features that are discriminative of and transferable to novel categories. To overcome this limitation and improve the quality of generated features, we propose a dual prototype contrastive augmented discriminator for the generative adversarial network. Specifically, we design a Dual Prototype Contrastive Network (DPCN), which leverages complementary information between visual space and semantic space through multi-task prototype contrastive learning. Contrastive learning of the visual prototypes enhances the ability of the generated features to distinguish between classes, while the contrastive learning of the semantic prototypes improves their transferability. Furthermore, we introduce margins into the contrastive learning process to ensure both intra-class compactness and inter-class separation. To demonstrate the effectiveness of the proposed approach, we conduct experiments on three widely-used zero-shot learning benchmark datasets, where DPCN achieves state-of-the-art performance for GZSL.

Abstract:
Multi-drone multi-object tracking (MDMOT) aims to localize and identify targets from videos captured simultaneously by multiple drones. To accomplish this task, existing methods typically follow the strategy of associating localized targets to obtain identities. However, their localization and identification stages rely heavily on single-frame information, which makes localization very sensitive to visual information decay and makes it hard to capture discriminative representations for target identification. Consequently, they usually exhibit unreliable performance in challenging scenarios, such as occlusion and high similarity among targets. To this end, we introduce a novel MDMOT framework in which temporal and spatial features interact, exploring the guidance of tracklet information across time and space. Specifically, we introduce temporal-spatial feedback loops to enrich cues in our tracker. Meanwhile, a novel temporal-oriented target localization is proposed to enhance the response to difficult samples in feature space by utilizing prior knowledge from existing tracklets beyond the current frame. Moreover, a spatial-oriented target identification is designed to synergize cross-drone information of tracklets, thereby providing discriminative representations for target identification. It combines target and background information to extract identity representations and lets features from multiple drones interact. To the best of our knowledge, this work reports the first MDMOT system that synergizes features across multiple drones to track targets. By incorporating these two networks, we develop a robust tracker (named TSMMT). Extensive experiments on the MDMT public dataset demonstrate the superiority of our proposed model. Specifically, TSMMT outperforms state-of-the-art methods by 2.76% to 4.66% on MOTA and 2.06% to 3.33% on IDF1.
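For readers unfamiliar with the two reported metrics, the sketch below states their standard definitions from the multi-object tracking literature. The function and argument names are illustrative, and the counts are assumed to have already been accumulated over all frames by an evaluation toolkit; the example numbers are made up and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """ID F1 score: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Made-up counts for illustration only.
print(mota(10, 5, 5, 100))   # 20 errors over 100 ground-truth boxes
print(idf1(80, 10, 30))      # 160 / 200
```

MOTA mainly reflects detection errors, while IDF1 rewards keeping the same identity over time, which is why trackers are commonly compared on both.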

Abstract:
Few-shot image classification is a challenging task that aims to learn, from a limited number of labelled training images, a classification model that can be generalised to unseen classes. Two strategies are usually taken to improve the classification performance of few-shot image classifiers: either applying data augmentation to enlarge the sample size of the training set and reduce overfitting, or involving attention mechanisms to highlight discriminative spatial regions or channels. However, naively applying them to few-shot classifiers directly and separately may lead to undesirable results; for example, some augmented images may focus mainly on the background rather than the object, which brings additional noise to the training process. In this paper, we propose a unified framework, the selectively augmented attention (SAA) network, that carefully integrates the best of the two approaches in an end-to-end fashion via a selective best match module that selects the most representative images from the augmented training set. The selected images tend to concentrate on the objects with less irrelevant background, which can assist the subsequent calculation of attention by alleviating the interference from the background. Moreover, we design a joint attention module to jointly learn both spatial and channel-wise attention. Experimental results on four benchmark datasets showcase the superior classification performance of the proposed SAA network compared with state-of-the-art methods.

Abstract:
3D semantic occupancy has garnered considerable attention due to its abundant structural information encompassing the entire autonomous driving scene. However, existing 3D occupancy prediction methods are typically tailored for single-frame inputs, resulting in unsatisfactory performance and temporal inconsistencies in real-world continuous scenarios. In this paper, we introduce LinkOcc, a sparse-queries approach incorporating an efficient temporal association mechanism for 3D semantic occupancy prediction. LinkOcc is conceptually built on the prevalent DETR-like framework for 2D segmentation, and we further construct the temporal association mechanism on this basis. Specifically, we propose a near-online training strategy that jointly trains with two adjacent frames, which successfully combines the benefits of both online and offline methods. Moreover, we introduce a temporal association strategy with contrastive learning to discriminate features for cross-frame semantic-level association. Comprehensive experiments demonstrate that LinkOcc not only surpasses the state-of-the-art methods in 3D occupancy prediction, but also guarantees a promising performance on foreground classes.

Abstract:
Low-light images often suffer from varying degrees of visual degradation. Current methods for recovering image texture details fail to rely on the self-adaptive correlation texture direction of the image itself, which leads the network to be unable to address the local texture characteristics of different images. To address this challenge, we propose a semantic-aware detail adaptive network (SDANet) that fully considers the image detail information. The network divides low-light images into high-frequency and low-frequency parts. Learning different forms of noise through a novel total variation regularization module with adaptive weights ensures that the final high-frequency part adequately integrates the texture information of the image. Simultaneously, a detail-adaptive module is incorporated to restore finer details in the resulting image. SDANet not only effectively suppresses noise in real low-light images while considering texture details but also effectively addresses the degradation of visible information, and it performs better than other state-of-the-art methods. The code is available at https://github.com/cheer79/SDANet.

Abstract:
With billions of users worldwide, accurately predicting social media popularity is crucial for assessing user behavior, forecasting trends, and enhancing social interactions and business strategies. However, this task presents significant challenges. Firstly, the extraction of valuable insights is complicated by the presence of tri-modal data (visual, text, structured) and pervasive noise. Secondly, the applicability of knowledge acquired during the pre-training phase is often limited due to discrepancies with downstream prediction tasks during the fine-tuning phase. Existing methods for Social Media Popularity Prediction (SMPP), including traditional models and Visual-and-language Models (VLMs), struggle to overcome these challenges, thereby failing to achieve satisfactory accuracy. To tackle these challenges, we propose a novel approach named Tri-Modal Transformers with Mixture-of-Modality-Experts (TTME) for SMPP. TTME integrates Artificial Intelligence Generated Content to mitigate data noise and incorporates a mix of Modality Experts in pre-training phases to effectively utilize tri-modal data. Moreover, to address training disparity, we explore strategies for downstream task adaptation including the integration of diverse pre-training experts and the implementation of DistillSoftmax. Through empirical evaluation, we demonstrate that TTME significantly improves the accuracy of social media popularity predictions, effectively utilizes noisy tri-modal data, and enhances knowledge transfer from pre-training to downstream tasks.

Abstract:
In the domain of single object tracking, the Ground Truth bounding box is intentionally sized larger than the minimum dimensions required to enclose the target in the initial video frame, inadvertently including extraneous elements and interferences in the template image. Moreover, significant appearance changes of the target during movement present substantial challenges for maintaining robust tracking. To address these issues, this study introduces a novel one-stream tracking framework named CVT-Track. CVT-Track comprises two main components: the Target Valid Token Collection (TaVTC) and the Temporal Valid Token Collection (TeVTC) modules. The TaVTC module effectively mitigates background noise and interference from similar targets, thereby sharpening the focus on the target’s unique features and enhancing tracking accuracy. Conversely, the TeVTC module skillfully extracts target information from historical frames, capturing the target’s dynamic appearance changes throughout the tracking process and thereby improving tracking robustness. The synergistic operation of these modules markedly enhances both the accuracy and robustness of tracking. Empirical evaluations demonstrate that CVT-Track achieves state-of-the-art performance across multiple datasets and maintains superior inference speeds.

Abstract:
It is widely believed that pre-trained vision-language foundation models (e.g., CLIP) would substantially facilitate vision-language tasks. Nevertheless, there has been less evidence supporting this idea for describing novel objects in images. In this paper, we propose the Novel Object Transformer with CLIP (NOTC), a Transformer-based model that innovatively exploits the powerful vision-language representation ability of CLIP to enhance the training and sentence decoding processes of a novel object captioning model. Technically, given the primary bag-of-objects extracted by Faster R-CNN, NOTC first capitalizes on an object distiller module to emphasize the most salient objects and infer the missing novel ones. The refined object words are additionally fed into the object-centric word predictor to generate the sentence word-by-word. During training, we design a CLIP-based self-critical sequence training paradigm to select visually-grounded sampled sentences with higher CLIP score rewards, which enables a joint training process of the captioning model over out-domain training images with novel objects. Moreover, at inference, a new CLIP beam search algorithm is devised to enforce the existence of novel objects and encourage the partial word sequences with higher CLIP scores, thereby decoding both visually-grounded and comprehensive sentences. Extensive experiments are conducted on held-out COCO and nocaps datasets, and competitive performances are reported when compared to state-of-the-art approaches.

Abstract:
Current tracking methods often adopt a compact template to emphasize target-specific features, alongside an expansive search region to encapsulate surrounding environmental information. However, the employment of a small template size may result in the loss of critical contextual information, which can be particularly harmful in challenging scenarios. Moreover, current tracking methods predominantly focus on spatial or channel operations, neglecting the potential of the frequency domain. To resolve those issues, we propose a novel Mask-Guided Siamese Tracking (MGTrack) framework to enhance tracking efficacy from two perspectives. Firstly, we propose an innovative Template Mask Encoder (TME) that employs a large template to produce a learnable mask embedding, thus preserving more surrounding contextual cues while focusing on target-oriented discriminative features. Secondly, we propose a frequency-spatial hybrid network, which is composed of a Frequency-Spatial Fusion (FSF) module and a Frequency-Spatial Attention (FSA) module. Particularly, the FSF module integrates frequency blocks with local and global fusion blocks, effectively aggregating deep semantic features from the backbone network with shallow texture features. Additionally, the FSA module enables bidirectional information exchange between spatial and frequency attention during the feature interaction process. Experiments across short-term and long-term tracking benchmarks demonstrate that our MGTrack can achieve better tracking performance with fewer parameters and FLOPs than some state-of-the-art tracking frameworks. The code of our MGTrack is available at https://github.com/jiabingxiing/MGTrack.

Abstract:
Stereo matching aims to estimate 3D geometry by computing disparity from a rectified image pair. Most deep-learning-based stereo matching methods aggregate multi-scale cost volumes computed by downsampling and achieve good performance. However, their effectiveness in fine-grained areas is limited by significant detail loss during downsampling and the use of fixed weights in upsampling. In this paper, we propose an inter-scale similarity-guided cost aggregation method that dynamically upsamples the cost volumes according to the content of images for stereo matching. The method consists of two modules: inter-scale similarity measurement and stereo-content-aware cost aggregation. Specifically, we use inter-scale similarity measurement to generate similarity guidance from feature maps in adjacent scales. The guidance, generated from both reference and target images, is then used to aggregate the cost volumes from low-resolution to high-resolution via stereo-content-aware cost aggregation. We further split the 3D aggregation into 1D disparity and 2D spatial aggregation to reduce the computational cost. Experimental results on various benchmarks (e.g., SceneFlow, KITTI, Middlebury and ETH3D-two-view) show that our method achieves consistent performance gain on multiple models (e.g., PSM-Net, HSM-Net, CF-Net, FastAcv, and FastAcvPlus). The code can be found at https://github.com/Pengxiang-Li/issga-stereo.
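
The cost saving from splitting a 3D aggregation into a 1D disparity pass and a 2D spatial pass can be illustrated with a simple count. The sketch below is not taken from the paper: it assumes dense kernels of side `k` and a single channel, and the function names are hypothetical. It only compares how many weights are touched per output voxel in each case.

```python
def taps_full_3d(k: int) -> int:
    """Weights touched per output voxel by a dense k x k x k kernel."""
    return k ** 3


def taps_split(k: int) -> int:
    """Weights touched per output voxel when a 1D pass of length k
    (along disparity) is followed by a 2D pass of size k x k (spatial)."""
    return k + k * k


# Compare both counts for a few assumed kernel sizes.
for k in (3, 5, 7):
    full, split = taps_full_3d(k), taps_split(k)
    print(f"k={k}: full 3D = {full}, split 1D+2D = {split}")
```

Under these assumptions, the gap between the two counts widens as `k` grows, since the full kernel scales cubically while the split version scales quadratically.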

Abstract:
Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively, and a feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.

Abstract:
Vision Transformers (ViTs) have emerged as the new fundamental architecture for most computer vision fields. However, the considerable memory and computation costs also hinder their application on resource-limited devices. Currently, binarization has demonstrated remarkable potential as a model compression technique in traditional Convolutional Neural Networks (CNNs), albeit with some accuracy loss. In this paper, we focus on binarization of ViTs, which is still under-studied and suffers from a significant performance drop. We start with constructing a strong baseline of binary ViTs, integrating some of the best practices from binary CNNs, which forms the foundation of our exploration. Subsequently, we identify that the severe performance degradation of the baseline is mainly caused by the weight oscillation around the quantization boundary and the information distortion in the activation of ViTs. To address these challenges, we introduce BinaryViT, a precise full binarization framework tailored for ViTs, effectively pushing the binarization of ViTs to its limit. Specifically, we propose a novel gradient regularization scheme (GRS), which mitigates oscillations by smoothly moving latent weights away from the quantization boundary during the training process. Additionally, we have devised an Activation Shift Module (ASM) that dynamically adjusts the activation distribution prior to the sign function, thereby minimizing the information distortion stemming from the significant inter-channel variations. Extensive experiments on the ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of 16.2× and 17.7× in model size and OPs compared to the full-precision DeiT-S.
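
To illustrate why shifting activations before the sign function matters, here is a minimal sketch. It is not the ASM described above: the shift is simply the channel mean (an assumption made for illustration), and the activation values and the `binarize` helper are hypothetical.

```python
def binarize(values, shift=0.0):
    """Apply the sign function after subtracting a shift.
    Values at or above the shift map to +1, the rest to -1."""
    return [1 if v - shift >= 0 else -1 for v in values]


# Hypothetical activations from one channel whose values all sit well above zero.
channel = [2.1, 2.4, 1.9, 2.6, 2.0, 2.3]

# Without a shift, every entry has the same sign, so the binary code carries no information.
plain = binarize(channel)

# Assumption: use the channel mean as the shift so the values straddle the threshold.
mean = sum(channel) / len(channel)
shifted = binarize(channel, shift=mean)

print(plain)
print(shifted)
```

In this toy case the unshifted codes are all identical, whereas the shifted codes separate the larger activations from the smaller ones, which is the kind of information loss a shift before binarization is meant to reduce.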

Abstract:
Camouflaged object segmentation (COS) is a recently emerging task due to its broad application prospect. The coloration and texture similarities between the objects and their surroundings make it a challenging task. Motivated by this, we propose a consistency-oriented network (CoNet) to address these challenges by looking into the visual consistencies between object and background. Specifically, we design a primary detection module (PDM) to first locate the object by fusing the backbone features. A filter is introduced to better focus on the object’s foreground feature based on its primary location. To obtain the visual consistency between the object and background, the foreground feature is then fed into the consistency evaluation module (CEM) to interact with the global feature. Both features are simultaneously processed by a shared discriminator and then fused together to attain the consistency attention map. The final feature refinement is conducted in the detail refinement module (DRM) by merging the consistency attention map with the global features via hierarchical feature fusion. Extensive experiments on benchmark COS datasets show that the proposed CoNet outperforms the state-of-the-art (SOTA) models in most cases. Ablation experiments verify the effectiveness of different backbones, designed modules and upsampling methods. Furthermore, extra studies on the labelling techniques and interdisciplinary applications demonstrate the great potential of the proposed CoNet.

Abstract:
Deep neural networks for skeleton-based human action recognition (HAR) often utilize traditional averaging or maximum temporal pooling to aggregate features by treating all joints and frames equally. However, this approach can excessively aggregate less discriminative or even indiscriminative features into the final feature vectors for recognition. To address this issue, a novel method called asynchronous joint adaptive temporal pooling (AJTP) is introduced in this paper. The method aims to enhance action recognition by identifying a set of informative joints across the temporal dimension and applying a joint-based and asynchronous motion-preservative pooling rather than conventional frame-based pooling. The effectiveness of the proposed AJTP has been empirically validated by integrating it with popular Graph Convolutional Network (GCN) models on three benchmark datasets: NTU RGB+D 120, PKUMMD, and Kinetic400. The results have shown that a GCN model with AJTP substantially improves performance compared to its counterpart GCN model with conventional temporal pooling techniques. The source code is available at https://github.com/ShanakaRG/AJTP.

Abstract:
Transparent and reflective objects, which are common in our everyday lives, present a significant challenge to 3D imaging techniques due to their unique visual and optical properties. Faced with these types of objects, RGB-D cameras fail to capture the real depth value with their accurate spatial information. To address this issue, we propose DITR, a diffusion-based Depth Inpainting framework specifically designed for Transparent and Reflective objects. This network consists of two stages, including a Region Proposal stage and a Depth Inpainting stage. DITR dynamically analyzes the optical and geometric depth loss and inpaints them automatically. Furthermore, comprehensive experimental results demonstrate that DITR is highly effective in depth inpainting tasks of transparent and reflective objects with robust adaptability.

Abstract:
Recently, unsupervised domain adaptation (UDA) techniques have been introduced for cross-scene hyperspectral image (HSI) classification tasks. These techniques aim to transfer knowledge from labeled source scenes to unlabeled target scenes, addressing the issue of limited supervisory information. However, most UDA methods fail to analyze the variability of domain shifts from different source samples to target ones, thus limiting the domain adaptation effect. To this end, this paper develops a consistency-aware customized learning (CACL) approach for cross-scene HSI classification. Overall, domain-level and class-level distribution alignment are designed separately. The former is implemented by adversarial training between the feature extractor and the domain discriminator. For the latter, the spectral-spatial prototypes of the source and target domains are first dynamically extracted, respectively. Then the prototype-based labels are assigned to the target domain samples, according to the cosine similarity-based cross-domain category prototype matching strategy. Considering that the consistency of the prototype-based labels with the predicted pseudo-labels reflects the degree of domain shifts of the target samples, a customized learning strategy is developed via inter-/intra-domain contrastive learning. With the joint domain-level and fine-grained class-level distribution alignment, the supervised information from the source domain is better migrated to the target domain, improving classification performance. Comprehensive experiments on two single-modal and one multi-modal cross-scene datasets demonstrate the effectiveness of the proposed algorithm.
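
The prototype-based labelling and the consistency check against pseudo-labels can be sketched as follows. This is a simplified illustration, not the CACL implementation: it assumes one fixed prototype per class and assigns each target feature to the most cosine-similar prototype. The 2-D features, prototypes, and predictions are hypothetical values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prototype_label(feature, prototypes):
    """Return the class whose prototype is most cosine-similar to the feature."""
    return max(prototypes, key=lambda c: cosine(feature, prototypes[c]))


# Hypothetical class prototypes and target-domain features.
prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0]}
target_features = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
pseudo_labels = [0, 1, 1]  # hypothetical classifier predictions

proto_labels = [prototype_label(f, prototypes) for f in target_features]

# A sample is "consistent" when both labelling routes agree.
consistent = [p == q for p, q in zip(proto_labels, pseudo_labels)]
print(proto_labels, consistent)
```

In this sketch, samples where the two labels disagree would be the ones treated as more strongly shifted; how they are then weighted or contrasted is specific to the method and is not modelled here.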

Abstract:
Weakly-supervised fine-grained human parsing, which decomposes the human body into several parts and various fashion items only with some easier labels, poses a more challenging visual task and cannot be well solved by general weakly-supervised approaches. In this case, we first explore the feasibility of utilizing point-level labels to address this task. Toward this, we propose prior-structure driven weakly-supervised learning for fine-grained human parsing. Following previous practices, we design a pseudo label initialization mechanism to produce high-quality pixel-level pseudo labels by utilizing the powerful image segmentation model Segment Anything Model (SAM). Then we propose the Feature Propagation based on Prior-Structure (FPPS) module, which formalizes prior-structure knowledge as an adjacency matrix constructed from superpixels and employs a learnable Graph Neural Network (GNN) as the feature propagator. FPPS can optimize the features of unlabeled pixels to enhance the weakly-supervised learning. The framework further designs the Refinement Pseudo Label (RPL) strategy to generate denser supervision from past sub-optimal models. To the best of our knowledge, this work is the first attempt to perform fine-grained human parsing in a weakly-supervised manner. We conduct extensive experiments on two challenging fine-grained datasets, ATR and LIP. Experimental results show that the proposed weakly-supervised method yields a comparable result to strongly-supervised methods and even outperforms other state-of-the-art approaches in semi-supervised human parsing tasks.

Abstract:
This paper proposes the first pure Transformer structure inversion network called SwinStyleformer, which can compensate for the shortcomings of the CNNs inversion framework by handling long-range dependencies and learning the global structure of objects. Experiments found that the inversion network with the Transformer backbone could not successfully invert the image. The above phenomena arise from the differences between CNNs and Transformers, such as the self-attention weights favoring image structure ignoring image details compared to convolution, the lack of multi-scale properties of Transformer, and the distribution differences between the latent code extracted by the Transformer and the StyleGAN style vector. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of the SwinStyleformer to enhance the local detail of the inversion image. Meanwhile, we design a Transformer block based on learnable queries. Compared to the self-attention transformer block, the Transformer block based on learnable queries provides greater adaptability and flexibility, enabling the model to update the attention weights according to specific tasks. Thus, the inversion focus is not limited to the image structure. To further introduce multi-scale properties, we design multi-scale connections in the extraction of feature maps. Multi-scale connections allow the model to gain a comprehensive understanding of the image to avoid loss of detail due to global modeling. Moreover, we propose an inversion discriminator and distribution alignment loss to minimize the distribution differences. Based on the above designs, our SwinStyleformer successfully solves the Transformer’s inversion failure issue and demonstrates SOTA performance in image inversion and several related vision tasks.

Abstract:
The attenuation and scattering of different colors of light underwater are wavelength- and distance-dependent, leading to various degradation problems in underwater images. When enhancing underwater images, many deep learning-based methods rely solely on convolutional neural networks to learn a mapping from degraded images to clear images to achieve enhanced effects. However, such methods have limitations in capturing long-term dependencies, preventing them from accurately capturing the global information of images. Although Transformers can solve this problem, there is a lack of inductive bias in training due to the limited number of training datasets with certain degradation phenomena. To address this issue, a novel Swin Transformer based on physical perception is proposed for the first time. Swin Transformer is used to solve the long- and short-distance dependency problem. Additionally, the underwater image degradation process is considered in network design to solve the problem of poor inductive bias. Combining the advantages of physical imaging, convolutional neural networks and Transformer can effectively improve the visual quality of underwater images. Rich qualitative and quantitative experimental results show that our Transformer achieves competitive performance on 5 benchmark datasets.

Abstract:
Recently, researchers have proposed many deep generative models, including generative adversarial networks (GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG). Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the CFG discriminator’s training objective is equivalent to finding an optimal $D(\mathrm{x})$. The optimal $D(\mathrm{x})$’s gradient differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal $G(\mathrm{x})$ that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the CFG discriminator’s weight. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the annealed CFG method’s limitation, as the method is not readily applicable to the state-of-the-art (SOTA) GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the synthesized samples’ quality and diversity. This improvement is clear when comparing the CFG method and the SOTA GAN models.

Abstract:
Zero-shot referring image segmentation (RIS) aims to segment a referent mask via a natural language expression, without any training. Although existing research has made some progress, the lack of a training process in zero-shot learning results in insufficient information, leading to poor zero-shot segmentation performance. We propose a Bidirectional Mask Selection (BMS) framework, which is the first work to incorporate negative masks into zero-shot RIS. Our idea is based on leveraging the negative masks’ semantic context information around the target semantics to enhance the understanding of cross-modal fine-grained correlation. Further, we propose a novel mask adaptive fusion strategy to combine the complementary information from positive and negative masks without additional training. In the experiments, BMS has demonstrated outstanding performance on three prominent RIS datasets, and it has surpassed even the most advanced weakly supervised methods on the RefCOCOg dataset. Code will be available at https://github.com/pcc-99/BMS.

Abstract:
Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels as supervision. A critical challenge of WSTAL is the large gap between video-level supervision and unavailable snippet-level supervision. Prevailing methods typically assign pseudo labels to snippets, but these methods suffer from significant noise caused by the pseudo snippet-level labels. In this work, we address WSTAL from a novel category exclusion perspective, which gradually enhances the snippet-level supervision to bridge the gap. Our proposed Progressive Complementary Learning (ProCL) is inspired by the fact that video-level labels precisely indicate the categories that all snippets surely do not belong to, which is ignored by previous works. Accordingly, we first exclude these surely non-existent categories by deterministic complementary learning. We then introduce entropy-based pseudo complementary learning, which is able to exclude more categories for snippets of less ambiguity. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on THUMOS14, ActivityNet1.3, and MultiTHUMOS benchmarks.
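
The deterministic exclusion step can be sketched as a complementary loss that penalises a snippet for assigning probability to any category absent from the video-level labels. This is an illustrative form only, not the loss defined in the paper: it assumes independent per-class probabilities, and the function name and `eps` parameter are hypothetical.

```python
import math


def complementary_loss(snippet_probs, video_labels, eps=1e-8):
    """Average -log(1 - p_c) over classes absent from the video-level label set.
    snippet_probs: per-class probabilities for one snippet (assumed independent).
    video_labels: set of class indices present somewhere in the video."""
    excluded = [c for c in range(len(snippet_probs)) if c not in video_labels]
    if not excluded:
        return 0.0  # nothing can be ruled out for this video
    total = sum(math.log(1.0 - snippet_probs[c] + eps) for c in excluded)
    return -total / len(excluded)


# Hypothetical video with 4 candidate classes, of which classes 0 and 2 occur.
video_labels = {0, 2}

# Snippet A puts little mass on the excluded classes 1 and 3, snippet B puts a lot.
loss_a = complementary_loss([0.9, 0.05, 0.8, 0.05], video_labels)
loss_b = complementary_loss([0.9, 0.60, 0.8, 0.70], video_labels)
print(loss_a, loss_b)
```

Because the excluded set comes straight from the video-level labels, this supervision is noise-free, unlike positive pseudo labels assigned to individual snippets.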

Abstract:
Camera arrays have unique advantages in various computer vision tasks, such as 3D scene reconstruction and depth estimation. For these tasks, precise calibration of sub-cameras is crucial. Since the baselines of sub-cameras are usually small, it is challenging to calibrate the camera array through a single recording of the scene. Consequently, the majority of existing calibration methods address this issue by recording a scene at different spatial locations. However, this approach neglects the prior that the relative pose of the sub-cameras remains unchanged across different locations, which leads to an increase in cumulative reprojection errors. In this letter, we propose to incorporate this fixed relative pose prior to precisely calibrate the camera array. Specifically, we first capture dual-array frames by recording a scene at two spatial locations. Then, we incorporate the fixed relative pose prior to the camera array calibration process by integrating the linear constraint into the organization of sub-aperture images (SAIs). Our method maintains the minimum necessary degrees of freedom for the calibration model, and reduces cumulative reprojection error. Moreover, we develop a real-world light field dataset for comprehensive performance evaluation. Experimental results demonstrate that our method can achieve higher calibration accuracy as compared to existing methods. Our code and dataset are available at https://github.com/Zhangyaning-NUDT/Fixed-relative-pose-prior-for-camera-array-self-calibration.

Abstract:
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of known attributes and objects without requiring additional training data. Recent CZSL methods based on vision-language models (e.g., CLIP) suffer from relying solely on text prompts and neglecting the crucial primitive features within compositions, which limits generalization to unseen compositions. To overcome these limitations, we propose a Multi-modal Prompt and Primitives Enhancement method, termed MPPE, which incorporates two key aspects. First, MPPE introduces both text and visual prompts. The text prompts consist of the composition and its corresponding attribute and object prompts, while the visual prompts leverage image masks generated by the segment anything model (SAM). These masks are integrated via an additional Alpha branch to strengthen the CLIP visual encoder to focus on regions of interest within the image. Second, we design a primitives enhancement (PE) module based on cross-attention, which refines attribute and object features obtained from the CLIP text encoder, thereby enriching the representation of novel composition features. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art performance on three widely-used CZSL benchmarks in both closed-world and open-world CZSL scenarios. Code is available at https://github.com/YtJin-git/MPPE.

Abstract:
Recently, unsupervised face super-resolution (FSR) has attracted significant attention due to its remarkable generalization performance. However, existing methods neglect the incorporation of facial priors, which can effectively guide the restoration of face images. The root cause of this issue lies in the significant challenges associated with incorporating facial priors into unsupervised frameworks. First, unsupervised methods often face the challenge of real-world low-quality (LQ) images that are severely corrupted, making it unrealistic to extract reliable prior information from them. Second, the estimation of facial priors exponentially increases the model’s parameters and computational complexity, contradicting the purpose of unsupervised methods for practical deployment. In this work, we fundamentally address the aforementioned challenges and propose Faith3D-FSR, a novel approach that incorporates faithful 3D facial priors into unsupervised FSR. Specifically, we introduce the Faith3D mechanism for faithful prior integration, which deconstructs super-resolution images into 3D elements and uses the 3D priors from real high-quality (HQ) images as reference for calibration solely during the training phase. This strategy enables more precise guidance on the super-resolution in a high-dimensional space, without requiring additional prior estimation during inference. It successfully overcomes the aforementioned challenges, making it more suitable for real-world applications, and offers a plug-and-play solution for incorporating 3D priors into unsupervised FSR. Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance on multiple benchmark datasets and across a range of evaluation metrics. The code is available at https://github.com/han265/Faith3D-FSR.

Abstract:
Multi-modal object tracking combines visible light with auxiliary modalities for better tracking accuracy, including sub-tasks like RGB-T (RGB and Thermal), RGB-D (RGB and Depth), and RGB-E (RGB and Event) tracking. However, existing algorithms, designed and trained for specific tasks, impose high costs on universal tracking systems as they rely on multiple independent trackers. Despite efforts to develop universal multi-modal trackers, these methods often fail to effectively integrate both modality-agnostic and modality-specific information across various modalities, constraining their performance. In this paper, we propose the Blind Multi-Modal Tracking with Route-Dynamic Mixture of Experts (BR-MoE). This approach can simultaneously model both modality-agnostic and modality-specific features of each modality within a unified structure and model parameters, all without prior knowledge of the modality types. In particular, BR-MoE employs a Vision Transformer (ViT) to extract modality-agnostic features, while embedding four parallel modality-specific experts in each ViT block to capture specific features. Additionally, BR-MoE incorporates a Modality-Aware Module (MAM) that adaptively assigns weights to the modality-specific experts based on the input and combines their outputs to generate the final feature representation. To achieve this, BR-MoE is trained in two phases. In the first, we manually provide modality types to supervise MAM and select the right expert. In the second, we perform a weighted aggregation of the experts’ features based on MAM’s decision and fine-tune the model to adapt to the resulting feature changes. Experiments on five multi-modal tracking datasets show that BR-MoE achieves state-of-the-art performance on the RGBT234, DepthTrack, and VisEvent datasets, and comparable performance on the LasHeR and VOT-RGBD22 datasets.
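
The weighted aggregation of expert outputs described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax gate, the function names, and the toy feature vectors are all assumptions chosen for illustration, since the abstract only states that MAM assigns weights to four experts and combines their outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_experts(gate_logits, expert_outputs):
    """Fuse expert feature vectors with gate weights (hypothetical helper).

    gate_logits: one score per expert, standing in for a gating module's output.
    expert_outputs: one feature vector (list of floats) per expert.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for weight, output in zip(weights, expert_outputs):
        for i in range(dim):
            fused[i] += weight * output[i]
    return weights, fused

# Toy example with four experts and 2-D features (values are made up).
gate_logits = [2.0, 0.0, 0.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
weights, fused = aggregate_experts(gate_logits, expert_outputs)
print(weights, fused)
```

Because the weights sum to one, the fused feature is a convex combination of the expert outputs, so no modality label is needed at inference time in this sketch.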

Abstract:
Self-supervised monocular depth estimation shows great promise since only a single camera is required. However, most existing methods fail to model the geometric structure of objects, leading to poor performance in object boundary depth estimation. To overcome these shortcomings, a dual attention guidance network (DAG-Net), containing two complementary modules termed depth-guided attention module (DAM) and semantic-guided multi-modal attention module (SAM), is proposed in this paper. The DAM utilizes depth features to guide semantic features through multi-head attention. When semantic features are well learned, they guide depth features to learn useful geometric representations through backpropagation. Besides, the SAM is proposed to incorporate multi-modal data from depth estimation and semantic segmentation predictions at different scales. To eliminate the mutual interference between DAM and SAM, we also propose a two-stage training strategy to adjust the convergence direction during the training process. The effectiveness of our proposed DAG-Net is qualitatively and quantitatively verified by various experiments on KITTI, Cityscapes, and Make3D datasets, showing outstanding performance compared with the state-of-the-art methods.

Abstract:
Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with a semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of the source and target domains, while a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA outperforms the most advanced method by 3% to 5%. The code will be available at https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA

Abstract:
Image descriptors are crucial in remote sensing image matching tasks. However, the presence of nonlinear transformation and dimensional collapse inherent in the perspective imaging process often poses challenges to achieving accurate matches. Existing descriptors lack a theoretical analysis of the perspective distortion process and fail to mine the patterns hidden in the perspective imaging process, consequently limiting their efficacy in remote sensing image matching. To uncover the underlying patterns in the image and devise a perspective-invariant descriptor, this paper proposes a perspective-invariant descriptor network (PIDNet). In our approach, we first analyze the remote sensing imaging process and demonstrate that it can be described in a new, conceptually simple linear space named the perspective distortion space. Second, we extract the bases from this space via the intersection-over-union (IoU) metric. As a result, each element in the space can be linearly expressed by the bases. Finally, we utilize these bases to design and learn a perspective-invariant descriptor. The core idea of our descriptor is based on the fact that each base corresponds to a unique imaging viewpoint. Therefore, any imaging viewpoint can be linearly represented as a combination of the bases. To implement our PIDNet, we propose a perspective sampling network module (PSNM) based on the spatial transformer network (STN), since no existing modules are available for our image sampling process. Furthermore, we introduce a perspective convolutional layer (PCLayer) to extract intermediate covariant features. Then, we concatenate the covariant features to learn a perspective-invariant descriptor. Experimental results on three datasets, including single-modal and multi-modal images, demonstrate the superior performance of PIDNet compared to state-of-the-art methods. Our source code will be publicly available at https://github.com/jaxwangkd04/

Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.

Abstract:
Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing the fine-grained local details, which are crucial for accurate pose estimation. Existing local feature extraction networks, such as Graph Convolutional Networks, often suffer from over-smoothing, while small-kernel CNNs have limited receptive fields and are highly sensitive to 2D pose errors. These limitations constrain the full potential of data-driven approaches. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs the skeleton selective refine attention (SSRA) mechanism, leveraging large kernels to capture fine-grained local dependencies in human pose sequences. This complements the global dependencies modeled by the Transformer, enabling a more comprehensive understanding of human motion. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. To the best of our knowledge, this is the first work to explore the application of large kernels in skeleton-based 3D human pose estimation. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Codes are available at SSR-STF.

Abstract:
Integrating LiDAR and camera data is crucial for precise 3D object detection. Existing methods resort to augmenting virtual points from 2D image space in a random manner to complete the appearance of 3D objects with sparse points. However, these augmented virtual points have unreasonable 3D positions and representations, which seriously degrades detection accuracy. To this end, we introduce a general 3D object detection framework called Virtual Point Augmenting (VPA) to enrich the 3D point cloud by controllably generating virtual points with accurate depth and position information as well as domain-gap-eliminated multi-modal representations from image and point cloud spaces. VPA contains two core designs, namely the Hybrid Sampling Method (HSM) and Fine-Grained Cross-modal Fusion (FGCF). HSM uses the constructed seed point distribution map based on the edge score and mask score map to sample high-quality seed points, and employs a feature similarity function to sample with k neighbors’ depth to obtain more accurate depth for the seed points, thereby enhancing the quality of the virtual points’ 3D positions. FGCF fuses the multi-modal features, i.e., the semantic feature, the geometric feature from the image space, and the 3D position feature, in an adaptive manner using a self-attention mechanism, thereby further improving the representation of the virtual points. We apply VPA to the LiDAR-based method CenterPoint and the fusion-based method Cross-modal transformer. Experimental results on the nuScenes, KITTI, and Waymo benchmarks validate the efficiency of our VPA, which achieves promising performance with 72.9% mAP and 74.8% NDS without using test-time augmentation and model ensemble techniques on the nuScenes test set. Code is available at https://github.com/jianpingZhonggit/vpa.git

Abstract:
Few-shot image generation aims to generate data of an unseen category based on only a few samples. Apart from basic content generation, a range of downstream applications could benefit from this task, such as low-data detection and few-shot classification. To achieve this goal, the generated images should guarantee category retention for classification beyond the visual quality and diversity. In our preliminary work, we present an “editing-based” framework, Attribute Group Editing (AGE), for reliable few-shot image generation, which largely improves the performance compared with existing methods that require re-training a GAN with limited data. Nevertheless, AGE’s performance on downstream classification is not as satisfactory as expected. Furthermore, existing generative models suffer from similar issues. This paper focuses on addressing the issue of universal class inconsistency in all generative models. It not only improves AGE to enhance its ability to preserve class information but also conducts a comprehensive analysis of the causes of this problem in generative models from multiple perspectives, proposing potential directions for resolution. We first propose Stable Attribute Group Editing (SAGE) for more stable class-relevant image generation. SAGE corrects the inaccurate assumptions in AGE and leverages the distribution information from seen categories to accurately estimate the data distribution of unseen categories, thereby eliminating the class inconsistency issue in the generated data. We apply SAGE to both GANs and diffusion models to verify its flexibility and further achieve promising generation performance. Going one step further, we find that even though the generated images look photo-realistic and require no category-relevant editing, they are usually of limited help for downstream classification. We systematically discuss this issue from both the generation and classification perspectives, and propose to boost the downstream classification performance of SAGE by enhancing the pixel and frequency components. Extensive experiments provide valuable insights into extending image generation to wider downstream applications. Code is available at https://github.com/UniBester/SAGE

Abstract:
In video-based point cloud compression (V-PCC), point clouds are projected as videos using a patch projection method and then compressed using video coding techniques. However, the lossy video compression and the down-sampling of occupancy maps (OMs) can lead to geometry compression artifacts, i.e., depth errors and OM errors, respectively. These errors can significantly affect the reconstruction quality of the point clouds. Existing methods can only eliminate one type of error and therefore have limited quality improvement. In this paper, to improve the quality maximally, a multi-task learning-based geometry compression artifact removal method is proposed to reduce both types of errors simultaneously. Considering the differences between the two tasks, the proposed method deals with the challenges of shared feature extraction and heterogeneous objective optimization. First, we propose a context-aware multi-task learning (CAML) model. The proposed CAML model can extract shared features that are context-aware and satisfy both tasks. Second, an improved optimization scheme is presented to train the proposed model. The improved optimization can fix the gradient imbalance of model updating. Cross-validation experiments show that the proposed method saves an average of over 45% Bjøntegaard Delta bitrate in terms of the D2 metric.

Abstract:
Text-based Person Retrieval (TPR) is a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description, and it has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which has shown notable performance improvements compared to uni-modal pre-training models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components, namely Prefix, LoRA, and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. First, S-Prefix is proposed to boost the attention of the prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.

Abstract:
Image inpainting attempts to fill in missing areas of corrupted images. Previous works used diverse prior information as constraints to recover high-quality images. Nevertheless, these priors rely on heuristic information and highly empirical selection. Moreover, CNN-based methods ignore the global long-range dependencies between spatial positions in images. This paper presents adaptive prior and long-range dependency-based learners (APLRL) for image inpainting. It mainly constructs an adaptive prior extractor (AdaPE) and an adaptive graph convolution (AdaGConv) operator. Specifically, AdaPE devises a learnable network by integrating partial convolution into residual learning. This enables it to mitigate the pollution of prior information caused by mask influence, effectively learn and extract any unknown explicit and implicit priors in a data-driven manner, and assist in image inpainting. Besides, an AdaGConv operator adaptively learns potential sparse graph structures in images by a learnable threshold strategy, and fuses graph convolution operators to acquire long-distance information on image spatial locations. This improves comprehension of the image’s overall structure and contributes to the network filling in the missing areas more effectively. Experiments reveal the superiority of APLRL over different baselines. Notably, AdaPE provides a readily transferable plug-and-play module. The source code is available at https://github.com/QijinXu/APLRL

Abstract:
Lesion segmentation on nasal endoscopic images is challenging due to complex lesion features. Fully-supervised learning methods achieve promising performance with pixel-level annotations but impose a significant annotation burden on experts. Although weakly supervised or semi-supervised methods can reduce the labelling burden, their performance is still limited. Some weakly semi-supervised methods employ a novel annotation strategy that labels weak single-point annotations for the entire training set while providing pixel-level annotations for a small subset of the data. However, the relevant weakly semi-supervised methods only mine the limited information of the point itself, while ignoring its label property and surrounding reliable information. This paper proposes a simple yet efficient weakly semi-supervised method called the Point-Neighborhood Learning (PNL) framework. PNL incorporates the surrounding area of the point, referred to as the point-neighborhood, into the learning process. In PNL, we propose a point-neighborhood supervision loss and a pseudo-label scoring mechanism to explicitly guide the model’s training. Meanwhile, we propose a more reliable data augmentation scheme. The proposed method clearly improves performance without increasing the parameters of the segmentation neural network. Experimental results indicate that our method consistently achieves better performance compared to SOTA methods. Additional validation on colonoscopic polyp segmentation datasets confirms our method’s generalizability.

Abstract:
Multi-spectral imaging senses objects from different perspectives, exhibiting the advantages of cross-modal collaboration. However, most existing cross-modal detection algorithms focus mainly on the design of fusion mechanisms, neglecting to assess the effectiveness of individual modalities. In fact, if a certain modality fails to provide distinguishable features, it will introduce unimodal interference and weaken the feature representation of dominant modalities. To address this problem, we propose an enhanced multi-spectral object detection algorithm via Confidence-driven unimodal Interference Removal (CIRDet). Specifically, we explicitly decompose unimodal visual contents into cross-modal consensus features and conflict features. For visual contents where both modalities express confidence, we employ an equal weighting fusion strategy to exploit the synergistic effect of modal information. In cases of modal discrepancy, we introduce the global and local feature confidence fusion mechanisms to induce the network to follow the guidance of the dominant modality, thereby removing conflicting interference from the inferior modality. By decoupling the features and processing them separately, the proposed method prevents the loss of valid information in the inferior modality and filters out unimodal interference more accurately. Extensive experiments on three widely used multi-spectral object detection benchmarks demonstrate that our method outperforms state-of-the-art methods by a large margin, e.g., with a CNN backbone, CIRDet achieves a 4.2 mAP@[0.5:0.95] improvement compared to Transformer-based methods. The code will be released after possible publication.

Abstract:
The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task due to the diverse video content and the complex spatial and temporal distortions, thus necessitating more advanced methods to address these issues. Nowadays, large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities for various visual understanding tasks, motivating us to leverage the powerful multimodal representation ability of LMMs to solve the VQA task. Therefore, we propose a Large Multi-Modal based Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we reformulate the quality regression problem as a question answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features to represent the quality characteristics of videos, which are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquired text tokens are aggregated as inputs for the large language model (LLM) to generate the quality score as well as the quality level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of 5% in generalization ability over existing methods. Furthermore, due to the advanced design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at https://github.com/Sueqk/LMM-VQA

Affiliations: School of Computer Science and Technology, Dalian University of Technology, Dalian, China; School of Electronic and Computer Engineering, Peking University, Shenzhen, China; National Astronomical Observatories, University of Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, Nankai University, Tianjin, China; School of Control Science and Engineering, Dalian University of Technology, Dalian, China; Department of Automation, Tsinghua University, Beijing, China

Abstract:
Addressing the impact of dynamic factors on localization accuracy and constructing a long-term consistent map containing only static elements are two crucial tasks in visual simultaneous localization and mapping (SLAM) for dynamic scenes. The introduction of dynamic elements can compromise the geometric constraints essential for visual SLAM, leading to a decrease in localization accuracy. Existing related research faces challenges in simultaneously ensuring localization accuracy in both low-dynamic and high-dynamic scenarios, while also maintaining the system’s real-time performance. To address this issue, we propose a two-stage, coarse-to-fine static-probability-based localization scheme. The construction of object-level maps offers strong support for tasks involving higher-level intelligent agent manipulation as well as augmented reality (AR). However, current research is inadequate for dynamic scenes where the objects to be modeled are frequently and irregularly obscured by dynamic objects, and where there are significant challenges such as severe image and point cloud noise, semantic noise, and lack of observational perspectives. To overcome these challenges, we first propose an object parameter estimation algorithm that combines clustering, weighted Principal Component Analysis (PCA) based on an energy function, and a minimum bounding rectangle. Then, we design a multi-modal object data association strategy based on appearance, semantic, and spatial features. The proposed object parameter estimation algorithm and data association strategy demonstrate improved accuracy and robustness in dynamic scenes with the aforementioned challenges. Finally, based on the entire system, we further develop a dynamic object tracking algorithm and construct an AR system to demonstrate the system’s application prospects. A series of public datasets and real-world scene results have been used to evaluate the effectiveness of the proposed system.
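
As a rough illustration of how weighted PCA can yield an object's orientation and extent, the sketch below works on 2-D points. It is not the paper's algorithm: the per-point weights are hypothetical stand-ins for whatever the energy function would produce, the helper names are invented, and the rectangle is simply aligned with the principal axis rather than being a true minimum bounding rectangle.

```python
import math

def weighted_pca_2d(points, weights):
    """Return the weighted centroid and principal-axis angle of 2-D points."""
    total = sum(weights)
    mean_x = sum(w * x for w, (x, _) in zip(weights, points)) / total
    mean_y = sum(w * y for w, (_, y) in zip(weights, points)) / total
    cov_xx = sum(w * (x - mean_x) ** 2 for w, (x, _) in zip(weights, points)) / total
    cov_yy = sum(w * (y - mean_y) ** 2 for w, (_, y) in zip(weights, points)) / total
    cov_xy = sum(w * (x - mean_x) * (y - mean_y) for w, (x, y) in zip(weights, points)) / total
    # Closed-form direction of the leading eigenvector of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * cov_xy, cov_xx - cov_yy)
    return (mean_x, mean_y), angle

def axis_aligned_extent(points, center, angle):
    """Length and width of the bounding rectangle aligned with the principal axis."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    along = [(x - center[0]) * cos_a + (y - center[1]) * sin_a for x, y in points]
    across = [-(x - center[0]) * sin_a + (y - center[1]) * cos_a for x, y in points]
    return max(along) - min(along), max(across) - min(across)

# Toy cluster elongated along x; the two off-axis points get lower (made-up) weights.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (1.5, 0.2), (1.5, -0.2)]
weights = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
center, angle = weighted_pca_2d(points, weights)
length, width = axis_aligned_extent(points, center, angle)
print(center, angle, length, width)
```

Down-weighting unreliable points reduces their pull on the estimated centroid and orientation, which is the general motivation for weighting in noisy, partially occluded observations.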

Abstract:
Despite recent advances in stereo matching, the extension to intricate underwater settings remains unexplored, primarily owing to: 1) the reduced visibility, low contrast, and other adverse effects of underwater images; 2) the difficulty in obtaining ground truth data for training deep learning models, i.e., simultaneously capturing an image and estimating its corresponding pixel-wise depth information in underwater environments. To enable further advances in underwater stereo matching, we introduce a large synthetic dataset called UWStereo. Our dataset includes 29,568 synthetic stereo image pairs with dense and accurate disparity annotations for the left view. We design four distinct underwater scenes filled with diverse objects such as corals, ships, and robots. We also introduce additional variations in camera model, lighting, and environmental effects. In comparison with existing underwater datasets, UWStereo is superior in terms of scale, variation, annotation, and photo-realistic image quality. To substantiate the efficacy of the UWStereo dataset, we undertake a comprehensive evaluation using eleven state-of-the-art algorithms as benchmarks. The results indicate that current models still struggle to generalize to new domains. Hence, we design a new strategy that learns to reconstruct cross-domain masked images before stereo matching training and integrate a cross-view attention enhancement module that aggregates long-range content information to enhance the generalization ability.

Abstract:
Clinical scoring in X-ray coronary angiography image sequences is widely used for revascularization decision-making in cases of coronary artery disease. Accurately recognizing coronary artery branches is a fundamental step in assessing the severity of quantitative stenosis. Existing methods employ a multistage process that includes view separation, skeletonization, graph building, and classification using topological features. However, the graph often suffers from skeleton errors, leading to incorrect topological connections during the classification stage, which requires manual correction. To address these issues, we propose a unified-stage coronary artery branch recognition network (UniCABR) that integrates the segmentation, skeletonization, and graph-building stages. Specifically, we design a dependency-aware module to build dependency graphs in both semantic and spatial domains, avoiding the use of rigid inter-branch topological connections and thus eliminating the need for manual correction of misconnections resulting from skeleton errors. Furthermore, to suppress nontarget branches according to clinical criteria and enhance the performance of side branches, we introduce a small feature supplementation module coupled with an adaptive merged binary supervision method at the pixel level. Extensive experiments on two datasets and a generalization study demonstrate the superiority of UniCABR in performance and generalization ability for coronary artery branch recognition tasks.

Abstract:
Recent works try to combine clustering and contrastive learning for unsupervised out-of-distribution (OOD) detection, since these two schemes can exploit semantic information and bring in discriminative representation learning. However, most methods based on clustering and contrastive learning struggle with the problems of hard assignment and low-level clustering, i.e., they usually assign each sample to one single cluster and obtain clusters of similar low-level features, which can easily bring in numerous incorrect assignments and hinder the learning of semantic information. To address these problems, this paper proposes a novel framework for unsupervised OOD detection named Soft Cluster-aware Equivariant Contrastive Learning (SCECL). Different from previous works, SCECL devises two modules named Soft Cluster-aware Semantic Relationship Mining (SCSRM) and Contrastive Learning with Invariance and Equivariance (CLIE): SCSRM assigns each sample to multiple clusters with soft assignment weights and utilizes the soft assignment weights with semantic relationships to guide unsupervised contrastive learning for OOD detection, while CLIE introduces the equivariance principle as an additional inductive bias, encouraging the model to learn more discriminative semantic features to avoid low-level clustering. Extensive experimental results on various OOD detection benchmarks demonstrate that the proposed SCECL can effectively utilize semantic information for discriminative representation learning and achieve state-of-the-art performance.
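
The contrast between hard and soft cluster assignment mentioned in this abstract can be made concrete with a small sketch. The temperature-scaled softmax over cosine similarities used here is a common choice and an assumption on our part; the abstract does not specify how SCSRM computes its soft assignment weights, and the function name, centroids, and temperature are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def soft_assign(feature, centroids, temperature=0.5):
    """Soft assignment weights of one feature over all cluster centroids."""
    scaled = [cosine(feature, c) / temperature for c in centroids]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: three centroids in 2-D and one sample close to the first.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = soft_assign([0.9, 0.1], centroids)
# A hard assignment keeps only the arg-max cluster and discards the rest.
hard = max(range(len(weights)), key=weights.__getitem__)
print(weights, hard)
```

The soft weights retain how close the sample is to the other clusters, information that a hard assignment throws away.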

Abstract:
Multi-label image classification, which involves recognizing multiple objects within a single image, is a fundamental task in computer vision. Recently, Visual-Language Models (VLMs) have made remarkable progress in this area. Many approaches combine textual and visual modalities to understand the entire image. In this paper, we find that there is a direct correlation between the accurate localization of objects and the accuracy of multi-label classification. However, previous methods did not specifically address localization accuracy, resulting in sub-optimal classification accuracy. Therefore, we propose AMITA, namely Attribute-guided Masked Image-Text Alignment for multi-label image representation. AMITA improves localization accuracy by segmenting object masks, thereby enhancing the accuracy of multi-label image classification. Additionally, AMITA introduces an AutoFocus method to handle the localization problem of small objects. AutoFocus conducts recognition on resized and cropped versions of the image, and automatically selects the images useful for the classification target. Moreover, AMITA incorporates Attribute-guided Prompting to strengthen the semantic distinction among different categories. It uses large language models to obtain the attributes of different categories and carefully designs prompts to enhance the attribute differences among them. Finally, extensive experiments on three popular datasets, including MS-COCO, Pascal VOC 2007, and NUS-WIDE, demonstrate the superiority of AMITA.

Abstract:
With the widespread deployment of intelligent vision applications, a substantial amount of visual data is being transmitted to machines for automated analysis to alleviate the human burden. However, research on the interaction between human and machine vision remains limited, significantly constraining the collaborative potential of both systems. In this paper, we investigate task-oriented high- and low-level representations, and how they can be used to construct a scalable coding model for human-machine collaboration. First, we propose a semantic-aware base layer combined with an implicit semantic module, designed to encourage the network to learn compact representations for machine tasks under semantic consistency constraints. Second, we propose a saliency-aware enhancement layer, which navigates the compression of visual signals with a saliency prior derived from the base layer so as to construct high-quality visual perceptions that fit the collaborative scene. By interacting and recombining the decoupled features, our model further bridges the gap between high- and low-level representations, so that the learned representations enjoy both machine analysis and visual reconstruction. Experimental results demonstrate that the proposed method outperforms state-of-the-art machine vision codecs on several machine vision tasks, and can also achieve comparable or even better reconstruction quality, while maintaining a modest bit rate cost.

Abstract:
As one of the latest features of ultra-high-definition media services, high frame rate can significantly enhance perceptual quality, but also increases codec complexity in the transmission chain, leading to additional overhead. In this paper, we carry out comprehensive offline experiments in which the codec overhead (e.g., energy and delay) shows a linear or even quadratic increase trend with various frame rates, while correspondingly, when the frame rate increases to 75 FPS, its bitrate is 24.2% lower than that under 15 FPS for several scenarios. This illustrates that the overhead is more significant than the load from data traffic in the frame rate control problem. Thus, we propose a Bilateral Adaptive video Transmission framework that establishes Bilateral game-theoretic Control (BAT-BC) between sender and viewer. By dynamically adjusting the frame rate for the sender and the service payment for the viewer, BAT-BC can flexibly adapt to the external environment, such as computational state and scenario changes, and is expected to provide viewers with a smoother experience. Furthermore, we extend it to scenarios with multiple concurrent viewers and discuss the effects of grouping utility. Finally, we design a prototype system and deploy the proposed solution on it to evaluate performance. The frame drop rate is reduced by 61%, resulting in a 31% improvement in subjective QoE. The objective metric achieves the same level of actual experience as a fixed 60 FPS in a dynamic environment.

Abstract:
Recently, audiences have increasingly watched diverse dramas and movies on streaming platforms, prompting platform administrators to enhance their understanding of video semantics. In particular, capturing and interpreting social relations among characters is critical for content-driven intelligent services and enhancing user experience. However, most existing research has solely approached social relations recognition as a classification problem, without justifiable interpretation of prediction results within the video context. To address this issue, we study cognitive science research and the Chain-of-Thought (CoT) strategy. Based on these foundations, we propose SaMo-CoT, an approach that leverages large language models (LLMs) in step-by-step reasoning. This approach simulates human cognitive processes by combining social scenarios in videos and incorporating empirical social interaction knowledge. In this way, we enable verbal rationales for determining social relations in video understanding. Furthermore, we present an innovative doubly-right social relation recognition framework that predicts both correct social relation labels and correct scenario rationales. Specifically, we translate verbal CoT into multimodal CoT by leveraging scenario-aware prompts and contrastive learning. Extensive experiments demonstrate significant improvements in classification accuracy and interpretability compared to traditional approaches.

Abstract:
In low-light environments, images often suffer from quality degradation issues such as low contrast, insufficient brightness, and color distortion due to inadequate light reaching the camera sensor. Most existing methods overlook the issue of color distortion caused by insufficient illumination when enhancing low-light images. In this work, we propose GWRetinex-Net, a novel deep learning network model based on Retinex theory and the gray world assumption. It consists of an image decomposition network, a reflection component enhancement network, and an illumination component enhancement network. During the training process of the image decomposition network, the gray world assumption is innovatively introduced to constrain the ill-posed decomposition problem caused by the absence of ground truth for reflection and illumination components, ensuring that the reflection components obtained after decomposition retain accurate color information. Based on the decomposition results, the reflection component enhancement network is responsible for mitigating degradation in the reflection components of low-light images, while the illumination component enhancement network focuses on adjusting the illumination distribution in the illumination components of low-light images. Comprehensive qualitative and quantitative experiments demonstrate that our GWRetinex-Net significantly outperforms comparative methods on multiple public datasets. Compared with the best-performing comparative methods, the images enhanced by the proposed method achieve an average improvement of 6.55% in SSIM and 8.87% in PSNR, along with an average decrease of 14.68% in MAE and 6.10% in NIQE. Additionally, object detection experiment in low-light environments further reveals the potential application value of GWRetinex-Net.
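
As a rough illustration of the gray world assumption mentioned above (that the average of each color channel over a scene should be approximately the same gray level), the following sketch measures how far a set of RGB pixels deviates from that assumption and applies the classic per-channel gain correction. It is not the GWRetinex-Net training constraint, whose exact form the abstract does not give; the function names and the toy pixel values are assumptions.

```python
def channel_means(pixels):
    """Mean of the R, G, and B channels over a list of (r, g, b) tuples in [0, 1]."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def gray_world_deviation(pixels):
    """Mean absolute gap between each channel mean and the overall gray level."""
    means = channel_means(pixels)
    gray = sum(means) / 3.0
    return sum(abs(m - gray) for m in means) / 3.0


def gray_world_balance(pixels):
    """Scale each channel so that its mean matches the overall gray level."""
    means = channel_means(pixels)
    gray = sum(means) / 3.0
    gains = tuple(gray / m if m > 0 else 1.0 for m in means)
    # Clip to 1.0 so corrected values stay in the valid range.
    return [tuple(min(1.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


# Toy pixels with a strong red cast (illustrative values only).
pixels = [(0.40, 0.20, 0.10), (0.20, 0.10, 0.05), (0.60, 0.30, 0.15)]
before = gray_world_deviation(pixels)
balanced = gray_world_balance(pixels)
after = gray_world_deviation(balanced)
```

A deviation term of this kind could in principle serve as a penalty on a predicted reflection component, but whether the paper uses this form is not stated.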

Abstract:
Graph-based methods have demonstrated strong performance in multi-view clustering (MVC) due to their capability to capture complex data structures. Among these, discrete spectral embedding learning has emerged as an effective strategy for directly producing clustering assignments, thereby avoiding potential suboptimality introduced by post-processing. However, most existing discrete MVC methods overlook the problem of skewed cluster assignments, which can significantly affect the quality and interpretability of clustering results in practical applications. To address this issue, we propose a novel framework for Balanced and Discrete Multi-view Clustering via Adaptive Graph Learning (BDMC-AGL). The proposed model jointly integrates adaptive graph construction and size-constrained spectral embedding learning into a unified optimization framework, enhancing the robustness of clustering while explicitly encouraging balanced partitioning. The introduction of size constraints into the discrete spectral embedding, however, poses a challenging optimization problem. To effectively solve it, we develop an efficient algorithm that guarantees convergence to a locally optimal solution. Extensive experiments conducted on benchmark datasets demonstrate that BDMC-AGL consistently outperforms state-of-the-art methods in terms of clustering accuracy and balance. Moreover, ablation studies validate the significant contribution of the size constraint mechanism in improving multi-view clustering performance. The source code is publicly available at: https://github.com/haha1206/BDMC-AGL.

Abstract:
Unsupervised visible-infrared person re-identification (US-VI-ReID) aims to match unlabeled pedestrian images captured under varying lighting conditions. The key challenge lies in generating accurate pseudo-labels, alongside alleviating the significant modality gap between visible and infrared modalities. Existing methods mainly focus on mitigating the effects of noisy labels through loss functions during backward propagation. However, these noisy labels already influence the forward propagation, leading to incorrect cross-modality correspondences. To address this issue, we propose a Hierarchical Centrality Collaborative Learning (HCCL) framework for US-VI-ReID, which proactively identifies noisy labels during the forward propagation. The rationale behind HCCL is that intra-modality refinement serves as the foundation for establishing cross-modality correspondences, reflecting the principle of learning from yourself to others. For intra-modality learning, we propose a Closeness Centrality Selection (CCS), quantifying sample confidence using closeness centrality to identify noisy samples. By discarding the noisy samples during forward propagation, CCS mitigates their adverse effects and ensures identity-consistent representation learning. For cross-modality learning, a Hierarchical Consistency Matching (HCM) is proposed to establish local instance-level label associations by leveraging bidirectional consistency with the most reliable samples identified during intra-modality learning. These local associations are then propagated to guide the global cluster-level cross-modality correspondences. Extensive experiments demonstrate that our HCCL achieves competitive performance on mainstream datasets, even surpassing some supervised counterparts. Additionally, outstanding results on corrupted datasets verify its generalizability and robustness.
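
The idea of scoring sample confidence with closeness centrality can be sketched as follows. The sketch assumes that each pseudo-label cluster is treated as a complete graph whose edge weights are Euclidean feature distances, so closeness is (n - 1) divided by the sum of distances to all other samples. How CCS actually builds its graph and sets its threshold is not given in the abstract; `closeness`, `select_reliable`, and `keep_ratio` are hypothetical.

```python
import math


def closeness(features):
    """Closeness centrality of each sample within one pseudo-label cluster.

    With Euclidean edge weights on a complete graph, the direct edge is the
    shortest path, so no path search is needed.
    """
    n = len(features)
    scores = []
    for i, fi in enumerate(features):
        total = sum(math.dist(fi, fj) for j, fj in enumerate(features) if j != i)
        scores.append((n - 1) / total if total > 0 else float("inf"))
    return scores


def select_reliable(features, keep_ratio=0.75):
    """Keep the indices of the most central samples; the rest are treated as noisy."""
    scores = closeness(features)
    keep = max(1, int(len(features) * keep_ratio))
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy cluster: three nearby features and one outlier (illustrative values).
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
scores = closeness(cluster)
reliable = select_reliable(cluster)
```

Here the outlier receives the lowest centrality and is dropped before it can contribute to the forward pass.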

Affiliations: School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China; Kailin Environmental Protection Equipment, Heze, China; Shandong Institute of Scientific and Technical Information, Jinan, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Electronics and Information Engineering, Soochow University, Suzhou, China; School of Software, Shandong University, Jinan, China

Abstract:
In clinical practice, obtaining a large amount of labeled CNV data is very difficult. Semi-supervised learning can effectively utilize a large amount of unlabeled CNV data. Since CNV has complex features such as blurred and unevenly distributed pixels on the edges, there are differences in the segmentation difficulty between pixels in the same image. Existing semi-supervised segmentation methods do not consider the segmentation difficulty of pixels, which will reduce the segmentation accuracy. To address this problem, we propose a dual difficulty-aware adaptive pseudo-label learning (D2APL) method for semi-supervised CNV segmentation. The proposed dual difficulty awareness includes segmentation difficulty perception of pixels in labeled and unlabeled data. For labeled data, we propose a classification confidence-guided difficulty perception method. For unlabeled data, we propose a model stability-guided difficulty perception method. Finally, we propose a difficulty-aware self-training method to dynamically adjust the threshold of pseudo-labels according to the difficulty, thereby improving the utilization of difficult-to-segment pixels in unlabeled data. Experimental results show that our method outperforms the state-of-the-art method in CNV segmentation.

Abstract:
No-Reference Image Quality Assessment (NR-IQA) plays a crucial role in various real-world applications by predicting image quality scores without the need for reference images. Despite the impressive performance of deep learning-based NR-IQA models, they remain vulnerable to adversarial attacks, which introduce imperceptible perturbations to input images, causing significant changes in predicted scores. In this study, we explore the use of simple JPEG compression techniques, as well as their combination with norm regularization training, to defend against these adversarial attacks. Our results demonstrate that image compression is an effective method to enhance model robustness, and it can further improve the robustness of NR-IQA models when combined with appropriate training strategies. Since excessive compression may reduce performance on clean images, it is essential to strike a balance. This work provides valuable insights into designing effective image compression methods for NR-IQA models.

Abstract:
Contemporary works in multi-modal image fusion often excessively rely on aligned source images, resulting in limited practicality when encountering misaligned data. However, there is still a significant gap in developing effective multi-modal image registration methods to address this problem. Moreover, existing multi-modal image registration models are largely restricted to specific types of multi-modal data, lacking a general model applicable to diverse multi-modal data types. To address the above issues, this study introduces a novel method named PGMR, which stands as the first plug-and-play general multi-modal image registration model. PGMR comprises three components: Modality Prompt Module (MPM), Universal Registration Framework (URF), and Detail Enhancement Module (DEM). URF serves as the fundamental registration framework, handling both rigid and non-rigid deformations to achieve basic multi-modal image registration. MPM, one core component of this paper, is embedded within URF. Leveraging prompt learning, MPM dynamically integrates modality prompts into the intermediate output of URF, not only alleviating modality discrepancies but also promoting the ability of the registration model across various multi-modal data types. DEM is a detail enhancement module for multi-modal image registration. It can enrich the details of registration results, thereby enhancing the effectiveness of subsequent tasks. We evaluate the performance of PGMR on four multi-modal types and extensive experiments validate the feasibility of PGMR, demonstrating the superiority of our method compared to state-of-the-art alternatives. The code will be available at https://github.com/stwts/PGMR

Abstract:
In this paper, we present a dynamic learnable label assignment (DLLA) method for indoor anchor-free one-stage 3D object detection. Existing methods principally depend on hand-crafted strategies with fixed thresholds, which fail to adapt to the inherent variability in object characteristics such as size, shape, and occlusion levels. This lack of adaptability results in suboptimal sample assignments and unstable detection performance. To address this challenge, we map the features of proposals and ground truths separately into the same embedding space, enabling a dynamic strategy of assigning appropriate positive samples to each instance. Specifically, we first interact with the features of all proposals to effectively integrate information from each proposal in the scene and capture long-range dependencies between different locations. Additionally, to extract more discriminative and generalized features for positive and negative samples, we employ a contrastive learning process to optimize the elemental relationships and distances between proposals and ground truths. Finally, we introduce a denoising task to alleviate the difficulty of the unsupervised learning process in DLLA. Experimental results show that our DLLA outperforms other methods on three popular indoor datasets (ScanNet V2, SUN RGB-D, and ScanNet200).

Abstract:
Recent research has yielded promising results by integrating online action detection and action anticipation tasks to explore the correlations between past, present and future. However, these approaches treat incomplete historical information equally and neglect intrinsic connections between actions, resulting in a limited perception of the throughout evolution. To address this limitation, we reconsider the patterns and dependencies in event evolution, innovatively constructing a comprehensive deductive process that inscribes the entire temporal spectrum via procedural features. Here, we propose the Throughout Procedural Transformer (TPT), comprising a Procedural History Evolution Encoder and a Progressive Deduction Decoder, to thoroughly span the entirety of time from history to the future through procedural modeling. TPT utilizes long-term procedural history acquired through procedure sampling to model long-term procedural future, thereby enhancing cognitive inference ability by enriching short-term history and short-term future with a broad grasp of throughout event evolution. We conduct extensive experiments to evaluate TPT on five demanding benchmarks (THUMOS’14, TVSeries, FineAction, HACS and EPIC-Kitchens-100) for online action detection and anticipation tasks, demonstrating significant improvements over existing methods.

Abstract:
The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are highly similar to the anchor sample; directly regarding them as negative samples makes optimization difficult and ignores the correlations between these similar samples and the anchor sample. To address this, we propose Positive Sample Mining (PSM), a novel framework that mines positive samples from the training set to provide more discriminative supervision. Specifically, for a given anchor sample, we partition the remaining training set into semantically similar and dissimilar subsets based on the similarity of their text queries. To effectively leverage these correlations, we introduce a PSM-guided contrastive loss to ensure that the anchor proposal is closer to similar samples and further from dissimilar ones. Additionally, we design a PSM-guided rank loss to ensure that similar samples are closer to the anchor proposal than to the negative intra-video proposal, aiming to distinguish the anchor proposal and the negative intra-video proposal. Experiments on the WSTSG and grounded VideoQA tasks demonstrate the effectiveness and superiority of our method.

Abstract:
Kinship verification using facial information determines whether two faces share a familial relationship. Existing methods improve verification by leveraging negative sample information and addressing distribution differences but often extract independent features from parent and child images separately, ignoring variations in pairwise similarity. To overcome this, we propose CI3Former, a Swin-Transformer-based model that enables cross-image information interaction for joint feature extraction. By incorporating a Self-Attention based Interaction (SAI) module within each Swin-Transformer block, our method allows mutual querying between parent and child features, dynamically guiding region-level feature extraction and adaptively focusing on similar regions. Additionally, we introduce a Multi-metric Similarity based Interaction (MSI) module for feature fusion, which processes paired features through similarity measurements before final prediction. The model is trained with contrastive and binary cross-entropy losses to enhance coupled feature learning. Extensive experiments on four kinship verification datasets and a signature verification dataset demonstrate that CI3Former outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and strong cross-task generalization.

Abstract:
Distributed video coding (DVC) transfers the complex process of the encoder to the decoder, which is suitable for video applications with limited encoding resources. Deep learning has shown impressive performance in video coding tasks in learning nonlinear compact representations of input frames and reconstructing video frame details. It is worth exploring whether deep learning implementation of the DVC paradigm is feasible and whether performance gains can be obtained. This paper proposes a deep DVC scheme (DDVC) using a quality enhancement network (QEN), which maps pixels to a more compressible latent space via an autoencoder resulting in a compact representation of Wyner-Ziv (WZ) frames. Moreover, considering the spatio-temporal correlation between the WZ frame and the Key frame, the QEN on the decoder side, using CNN and LSTM iteratively extracts common information between the WZ frame and the Key frame, which could further finetune the WZ frame reconstruction. We evaluated DDVC in limited encoding resources application scenarios with 19 related video sequences. Results on the video sequences with different motion intensity levels show that DDVC significantly outperforms existing schemes in reconstruction quality with the same compression ratio. We open-sourced the implementation at https://github.com/yixiangbo/DDVC

Abstract:
Private inference outsourcing ensures the privacy of both clients and model owners when model owners deliver inference services to clients through third-party cloud servers. Existing solutions either reduce inference accuracy due to model approximations or rely on the unrealistic assumption of non-colluding servers. Moreover, their efficiency falls short of HELiKs, a solution focused solely on client privacy protection. In this paper, we propose Skybolt, a single-server private inference outsourcing framework without resorting to model approximations, achieving greater efficiency than HELiKs. Skybolt is built upon efficient secure two-party computation protocols that safeguard the privacy of both clients and model owners. For the linear calculation protocol, we devise a ciphertext packing algorithm for homomorphic matrix multiplication, effectively reducing both computational and communication overheads. Additionally, our nonlinear calculation protocol features a lightweight online phase, involving only the addition and multiplication on secret shares. This stands in contrast to existing protocols, which entail resource-intensive techniques such as oblivious transfer. Extensive experiments on popular models, including ResNet50 and DenseNet121, show that Skybolt achieves a 5.4-7.3 × reduction in inference latency, accompanied by a 20.1-39.6 × decrease in communication cost compared to HELiKs.
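
To make "addition and multiplication on secret shares" concrete, here is a minimal two-party sketch using additive secret sharing over a prime field, with multiplication done through a precomputed Beaver triple. This is a standard textbook construction used for illustration only; the abstract does not say that Skybolt uses Beaver triples, and the modulus `P`, the single-process simulation of both parties, and the trusted-dealer offline phase are all assumptions.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime, chosen here only for illustration


def share(x):
    """Split x into two additive shares that sum to x modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % P


def add_shares(a, b):
    """Addition is local: each party adds the shares it holds."""
    return (a[0] + b[0]) % P, (a[1] + b[1]) % P


def beaver_multiply(x, y):
    """Multiply two shared values using one Beaver triple (a, b, c = a * b)."""
    # Offline phase (assumed trusted dealer): generate and share the triple.
    a = secrets.randbelow(P)
    b = secrets.randbelow(P)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % P)
    # Online phase: the parties open the masked values d = x - a and e = y - b.
    d = reconstruct((x[0] - a_sh[0]) % P, (x[1] - a_sh[1]) % P)
    e = reconstruct((y[0] - b_sh[0]) % P, (y[1] - b_sh[1]) % P)
    # x * y = c + d * b + e * a + d * e; only party 0 adds the public d * e term.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1


total = reconstruct(*add_shares(share(12), share(30)))
product = reconstruct(*beaver_multiply(share(6), share(7)))
```

The online phase in this sketch uses only modular additions and multiplications on shares, with no oblivious transfer, which matches the general flavor of a lightweight online phase.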

Abstract:
Event cameras detect per-pixel brightness changes and output asynchronous event streams with high temporal resolution, high dynamic range, and low latency. However, the unstructured nature of event streams means that humans cannot analyze and interpret them in the same way as natural images. Event-based video reconstruction is a widely used method aimed at reconstructing intuitive videos from event streams. Most reconstruction methods based on traditional artificial neural networks (ANNs) have high energy consumption, which counteracts the low-power advantage of event cameras. Spiking neural networks (SNNs) are a new generation of event-driven neural networks that encode information via discrete spikes, which leads to greater computational efficiency. Previous methods based on SNNs overlooked the asynchronous nature of event streams, leading to reconstructions that suffer from artifacts, flickering, low contrast, etc. In this work, we analyze event streams and spiking neurons and explain poor reconstruction quality. We specifically propose a novel spatial-temporal heterogeneous (STH) spiking neuron suitable for reconstructing asynchronous event streams. The STH neuron adjusts the membrane decay coefficient adaptively and has better spatiotemporal perception. In addition, we propose a temporal-frequency calibration module (TFCM) based on the Fourier transform to improve the contrast of the reconstructions. On the basis of the above proposed neuron and module, we construct two SNN-based models, referred to as the STHSNN and TFCSNN. The goal of the former is to reduce the artifacts and flickering in reconstructions, whereas the latter focuses on enhancing the contrast. The experimental results demonstrate that our models can yield reconstructions in various scenarios, achieving better quality and lower energy consumption than previous SNNs. Specifically, the TFCSNN and STHSNN achieve top-2 performance among the SNN-based models, with energy consumption reductions of 3.48 times and 12.40 times, respectively.

Abstract:
Most existing non-blind deblurring methods formulate the problem into a maximum-a-posteriori framework and address it by manually designing a variety of regularization terms and data terms of the latent clear images. However, explicitly designing these two terms is quite challenging, which usually leads to complex optimization problems. In this paper, we propose a Discriminative Shrinkage Deep Network for fast and accurate deblurring. Most existing methods use deep convolutional neural networks (CNNs) or radial basis functions to learn only the regularization term. In contrast, we formulate both the data and regularization terms while splitting the deconvolution model into data-related and regularization-related sub-problems. We explore the properties of the Maxout function and develop a deep CNN model with Maxout layers to learn discriminative shrinkage functions, which directly approximate the solutions of these two sub-problems. Moreover, we develop a U-Net according to the Krylov subspace method to restore the latent clear images effectively and efficiently, which plays the same role as, but performs better than, the conventional fast-Fourier-transform-based or conjugate gradient method. Experimental results show that the proposed method performs favorably against the state-of-the-art methods regarding efficiency and accuracy.

Abstract:
In video lane detection, there are rich temporal contexts among successive frames, which is under-explored in existing lane detectors. In this work, we propose LaneTCA to bridge the individual video frames and explore how to effectively aggregate the temporal context. Technically, we develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term temporal context, respectively. The accumulative attention module continuously accumulates visual information during the journey of a vehicle, while the adjacent attention module propagates this lane information from the previous frame to the current frame. The two modules are meticulously designed based on the transformer architecture. Finally, these long-short context features are fused with the current frame features to predict the lane lines in the current frame. Extensive quantitative and qualitative experiments are conducted on two prevalent benchmark datasets. The results demonstrate the effectiveness of our method, achieving several new state-of-the-art records. The codes and models are available at https://github.com/Alex-1337/LaneTCA.

Abstract:
Most methods address occluded pedestrian Re-Identification (Re-ID) by employing external auxiliary models in the feature output stage of the backbone network to locate visible appearance areas. Nevertheless, these approaches suffer from issues such as occlusion information diffusion and imprecise masks generated by external models, indicating the need for further exploration in the decoupling of pedestrian features from occlusion information. In light of these challenges, we propose an innovative algorithm called Pose-Skeleton guided Cross-attention Representation fusion (PSCR) method. Firstly, we introduce the Visible Appearance Region Attention (VARA) model designed to leverage pose information for guiding the backbone network in effectively distinguishing between occlusion information and pedestrian features at the intermediate layer. By employing a suppression strategy, the model is able to effectively suppress occlusion interference and alleviate the diffusion of occlusion information. Next, to achieve precise localization of pedestrian-specific semantic regions, a groundbreaking Skeletal Area Modeling (SAM) is proposed. Leveraging the principles of mathematical modeling and capitalizing on the efficacy of human keypoint confidence, this module generates finely-grained masks for local skeleton regions and extracts an exhaustive set of local features. Lastly, under the constraints imposed by spatial attention masks, a cross-attention mechanism is employed to fuse the features acquired from the previous two steps with local features. This fusion process results in the generation of enhanced local features that seamlessly integrate aligning high-level semantic information. Extensive experimentation demonstrates that the proposed algorithm exhibits notable performance advancements when compared to existing methodologies.

Abstract:
This paper focuses on few-shot action recognition (FSAR), where the machine is required to understand human actions given only a few video samples of each. Although this direction has been explored only a little, the most cutting-edge methods employ the action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, motion-aware visual-language representation modulation network (MoveNet). The proposed MoveNet utilizes dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, as a way to modulate the video prototypes. In doing so, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. Having the motion cues at hand, a motion-conditional prompting module (MCPM) utilizes the motion cues as conditions to boost the semantic associations between textual features and action classes. We further develop a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance in enhancing local frame features. The proposed components complement each other and contribute to significant performance gains on the FSAR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, considerably outperforming current state-of-the-art methods.

Abstract:
Query-based 3D object detection has gained significant success in the application of autonomous driving due to its ability to achieve good performance while maintaining low computational cost. However, it still struggles with the reliable detection of small objects such as bicycles and pedestrians. To address this challenge, this paper introduces a novel sparse query-based approach, termed DinoQuery. This approach utilizes Grounding-DINO with textual prompts to select small-sized objects and generate 2D category-aware queries. These 2D category-aware queries combined with 2D global queries are then lifted to 3D queries by associating each sampled query with its respective 3D position, orientation, and size. The validity of these 3D queries, along with the 2D queries, is verified by the Comprehensive Contrastive Learning (CCL) mechanism. This is achieved by aligning all 2D and 3D queries with their respective 2D and 3D ground truth labels, and computing similarity to select true positive and false positive queries. Then a contrastive loss is introduced to enhance true positive queries and weaken false positive ones based on geometric and semantic similarity. The DinoQuery was tested on the nuScenes dataset and demonstrated excellent performance. Notably, the largest increase of our method is 3.2% on NDS and 3.1% on mAP.

Abstract:
Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone through using current prominent architectures (e.g., CNNs, MLPs, pooling layer, and transformer encoder), giving little attention to the transformer decoder. However, we argue that its strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations. Specifically, we do this by formulating deep features as the keys and values, as well as a set of learnable parameters as the queries. Our EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to output the global representations. Moreover, to provide more powerful deep features for EDTformer and further facilitate the robustness, we use the foundation model DINOv2 as the backbone and propose a Low-rank Parallel Adaptation (LoPA) method to enhance its performance in VPR, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at https://github.com/Tong-Jin01/EDTformer.

Abstract:
Cloud environments enhance diffusion model efficiency but introduce privacy risks, including intellectual property theft and data breaches. As AI-generated images gain recognition as copyright-protected works, ensuring their security and intellectual property protection in cloud environments has become a pressing challenge. This paper addresses privacy protection in diffusion model inference under cloud environments, identifying two key characteristics—denoising-encryption antagonism and stepwise generative nature—that create challenges such as incompatibility with traditional encryption, incomplete input parameter representation, and inseparability of the generative process. We propose PPIDM (Privacy-Preserving Inference for Diffusion Models), a framework that balances efficiency and privacy by retaining lightweight text encoding and image decoding on the client while offloading computationally intensive U-Net layers to multiple non-colluding cloud servers. Client-side aggregation reduces computational overhead and enhances security. Experiments show PPIDM offloads 67% of Stable Diffusion computations to the cloud, reduces image leakage by 75%, and maintains high output quality (PSNR = 36.9, FID = 4.56), comparable to standard outputs. PPIDM offers a secure and efficient solution for cloud-based diffusion model inference.

Abstract:
The (k, n)-threshold Secret Image Sharing (SIS) is a naturally fault-tolerant technique for image privacy protection. A secret image is processed through secret sharing to generate n shadow images, which are then distributed to n different recipients. During the recovery phase, the complete secret image can be reconstructed from any k of the n shadow images. Although (k, n)-threshold SIS itself tolerates the loss of up to n-k shadow images, recovery of the secret image fails if the remaining k shadow images contain pixel errors. Robust Secret Image Sharing (RSIS) has been proposed to address this issue. However, existing RSIS schemes have demonstrated only limited robustness against noise attacks. This paper presents a novel k-consistency-based RSIS scheme to resist malicious attacks, including noise, JPEG compression, tampering, and cropping. In the sharing phase, a dual-SIS mechanism is first designed to perform two rounds of secret sharing on the secret image. In the recovery phase, a high-quality secret image can be reconstructed based on k-consistency even after an attack. The experimental results demonstrate that our scheme not only provides comprehensive robustness but also allows for flexible adjustment of the shadow images’ sizes, ensuring both security and efficiency during image sharing.
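As background for the (k, n)-threshold property and its sensitivity to pixel errors, the following is a minimal sketch of classic Shamir polynomial sharing applied to a single pixel value. It is not the dual-SIS or k-consistency scheme of the paper. The field prime 257 and the function names are assumptions chosen for illustration.

```python
import random

PRIME = 257  # smallest prime above 255, so every 8-bit pixel value is a field element


def share_pixel(secret, k, n, rng):
    """Split one pixel into n shares; any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_pixel(shares):
    """Lagrange interpolation at x = 0 over the given shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


rng = random.Random(42)
shares = share_pixel(200, k=3, n=5, rng=rng)
print(recover_pixel(shares[:3]))                         # any 3 shares suffice
print(recover_pixel([shares[0], shares[2], shares[4]]))  # a different subset of 3

# A single-pixel error in one of the k shares changes the recovered value.
x0, y0 = shares[0]
corrupted = [(x0, (y0 + 1) % PRIME), shares[1], shares[2]]
print(recover_pixel(corrupted))
```

Share values in this sketch can equal 256, which does not fit in one byte; practical image-sharing schemes handle that case separately, and the sketch ignores it.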

Abstract:
Image steganography, the technique of hiding secret messages within images, has recently advanced with generative image steganography, which hides messages during image creation. However, current generative steganography methods often face criticism for their low extraction accuracy and poor robustness, particularly their vulnerability to JPEG compression. To address these challenges, we propose a novel generative image steganography method based on the text-to-image multimodal generative model (StegaMGM). StegaMGM utilizes the initial random normalization distribution in the generative process of latent diffusion models (LDMs): the secret message is hidden in the generated image through message sampling, ensuring that it follows the same probability distribution as typical image generation. The content of the stego image can also be controlled through the prompts. On the receiver side, the secret message can be extracted with high accuracy using the shared prompt and diffusion inversion. In the experimental section, we conduct detailed experiments to demonstrate the advantages of our proposed StegaMGM framework in extraction accuracy, resistance to JPEG compression, and security.

Abstract:
Radiance fields, including NeRFs and 3D Gaussians, demonstrate great potential in high-fidelity rendering and scene reconstruction, but they require a substantial number of posed images as input. COLMAP is frequently employed for preprocessing to estimate poses. However, COLMAP requires a large number of feature matches to operate effectively, and struggles with scenes characterized by sparse features, large baselines, or few-view images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images, without relying on COLMAP initialization. Inspired by the idea of calibration boards in traditional pose calibration, we propose a novel approach that uses everyday objects, commonly found in both images and real life, as “pose probes”. By initializing the probe object as a cube shape, we apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. PnP matching is used to initialize poses between images incrementally, for which only a few feature matches are enough. PoseProbe achieves state-of-the-art performance in pose estimation and novel view synthesis across multiple datasets in experiments. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance, showing that PoseProbe is robust to the choice of probe objects. Our project page is available at: https://zhirui-gao.github.io/PoseProbe.github.io/

Affiliations: Guangzhou Institute of Technology, Xidian University, Guangzhou, China; School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; National Center for Applied Mathematics, Chongqing Normal University, Chongqing, China; Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

Abstract:
Thermal infrared (TIR) object tracking is a significant subject within the field of computer vision. Currently, TIR object tracking faces challenges such as insufficient representation of object texture information and underutilization of temporal information, which severely affect the accuracy of TIR tracking methods. To address these issues, we propose a TIR object tracking method, called FFTR, based on fine-grained features and template reconstruction. Specifically, to capture the fine-grained information of the TIR object, we employ a frequency channel attention mechanism that transforms TIR images into the frequency domain using discrete cosine transform features. By capturing the fine-grained features of TIR images from the frequency domain, we enhance the model’s ability to comprehend these images. To better leverage temporal information, we utilize a template region reconstruction method. This method reconstructs the template from the previous frame based on the search area of the current frame, which is then incorporated into the attention computation for the subsequent frame, thereby improving the tracking capability for TIR objects. Extensive quantitative and qualitative experiments show that our method achieves competitive tracking performance on the TIR benchmarks.
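The idea of a channel attention driven by discrete cosine transform (DCT) features can be sketched as follows. This is a generic illustration, not the FFTR implementation. It assumes one fixed frequency index per channel and a plain sigmoid gate with no learned layers, and the function names are hypothetical.

```python
import math


def dct2_coefficient(channel, u, v):
    """Unnormalized 2D DCT-II coefficient (u, v) of an H x W channel."""
    h, w = len(channel), len(channel[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += (
                channel[i][j]
                * math.cos(math.pi * (i + 0.5) * u / h)
                * math.cos(math.pi * (j + 0.5) * v / w)
            )
    return total


def frequency_channel_attention(feature_map, freqs):
    """Rescale each channel by a gate computed from one of its DCT coefficients."""
    gated = []
    for channel, (u, v) in zip(feature_map, freqs):
        h, w = len(channel), len(channel[0])
        descriptor = dct2_coefficient(channel, u, v) / (h * w)
        gate = 1.0 / (1.0 + math.exp(-descriptor))  # sigmoid, no learned weights here
        gated.append([[gate * x for x in row] for row in channel])
    return gated


flat = [[2.0, 2.0], [2.0, 2.0]]     # constant channel: only the (0, 0) term is nonzero
edges = [[1.0, -1.0], [1.0, -1.0]]  # horizontal alternation: energy at (0, 1)
print(dct2_coefficient(flat, 0, 0), dct2_coefficient(flat, 0, 1))
output = frequency_channel_attention([flat, edges], [(0, 0), (0, 1)])
```

With frequency index (0, 0) the descriptor reduces to the channel mean, i.e. global average pooling; other indices respond to spatial variation that the mean discards.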

Abstract:
With the continuous advancement of deep learning, object detection has made remarkable progress in accurately identifying a wide range of object categories, even within increasingly complex scenes. However, as the number of categories grows, visual concepts naturally organize into a label hierarchy. We contend that existing hierarchical classification and detection methods predominantly prioritize fine-grained prediction, potentially leading to inconsistencies with realistic human perception. From this perspective, we investigate the Hierarchical Object Detection (HOD) problem to better align with real-world perception. To address the lack of benchmarks in the field, we build a large-scale HOD benchmark termed RHOD with open-source datasets, comprising 740 categories. To better align the hierarchical object detectors towards realistic perception, we propose a new evaluation metric named Hierarchical Average Precision (HAP). Furthermore, we present a novel hierarchical object detection method that includes two components, Tree Soft Labeling (TSL) and Hierarchical Extension and Suppression (HES). Our method mitigates the issue of overconfidence in fine-grained predictions, which has been prevalent in previous approaches. We evaluate a range of existing methods on the RHOD benchmark, including plain, hierarchical, and open-vocabulary models. Additionally, we perform comprehensive experiments to assess the performance of our proposed method. The experimental results show that our method achieves state-of-the-art performance on the RHOD benchmark.

Abstract:
Learned Image Compression (LIC) has attracted considerable attention due to its outstanding rate-distortion (R-D) performance and flexibility. However, its substantial computational cost poses challenges for practical deployment. The issue of feature redundancy in LIC is rarely addressed. Our findings indicate that many features within the LIC backbone network exhibit similarities. This paper introduces ShiftLIC, a novel and efficient LIC framework that employs parameter-free shift operations to replace large-kernel convolutions, significantly reducing the model’s computational burden and parameter count. Specifically, we propose the Spatial Shift Block (SSB), which combines shift operations with small-kernel convolutions to replace large-kernel convolutions. This approach maintains feature extraction efficiency while reducing both computational complexity and model size. To further enhance the representation capability in the channel dimension, we propose a channel attention module based on recursive feature fusion. This module enhances feature interaction while minimizing computational overhead. Additionally, we introduce an improved entropy model integrated with the SSB module, making the entropy estimation process more lightweight and thereby comprehensively reducing computational costs. Experimental results demonstrate that ShiftLIC outperforms leading compression methods, such as VVC Intra and GMM, in terms of computational cost, parameter count, and decoding latency. Additionally, ShiftLIC sets a new SOTA benchmark with a BD-rate gain per MACs/pixel of −102.6%, showcasing its potential for practical deployment in resource-constrained environments. The code is released at https://github.com/baoyu2020/ShiftLIC.
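A parameter-free spatial shift of the kind this abstract refers to can be sketched in a few lines. The exact shift pattern of the Spatial Shift Block is not given in the abstract, so the four-direction, one-pixel, zero-padded assignment below is an assumption borrowed from generic shift-based networks, and the function names are hypothetical.

```python
def shift_channel(channel, dy, dx):
    """Move an H x W channel by (dy, dx) pixels, filling vacated cells with zeros."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source position that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = channel[sy][sx]
    return out


# Assumed direction cycle: right, left, down, up.
DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def spatial_shift(feature_map):
    """Shift channel i in direction i mod 4; no multiplications, no parameters."""
    return [
        shift_channel(channel, *DIRECTIONS[i % 4])
        for i, channel in enumerate(feature_map)
    ]


channel = [[1.0, 2.0], [3.0, 4.0]]
shifted = spatial_shift([channel, channel, channel, channel])
for result in shifted:
    print(result)
```

After such a shift, a small-kernel convolution that mixes channels at one position sees values that originally came from neighbouring positions, which is how a shift can stand in for part of a larger kernel at no parameter cost.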

Abstract:
Continuous sign language recognition technology enables effective communication for hearing-impaired individuals by recognizing and interpreting sign language. However, existing research has not fully addressed the large amount of temporal and spatial redundancy in sign language videos, which limits recognition accuracy. Additionally, the scarcity of frame-level annotated data hinders the widespread utilization of gloss-level features from sign language videos by existing weakly supervised learning methods, thereby preventing the model from being sufficiently trained. To address the above challenges, we propose a novel Cross-Modal Adaptive Prototype Learning model for Continuous Sign Language Recognition (CAP-SLR), which leverages prototype learning to fuse features across different modalities and improve recognition accuracy. Initially, we propose a lightweight Keyframe Extractor and a Multi-Scale Dilated Convolutional Attention to alleviate data redundancy and bolster the visual representation of sign language based on spatial-temporal information. Subsequently, we employ the Contextual Position Encoding-assisted (CoPE) transformer to learn the semantics of sign language, ameliorating the issue of cross-modal semantic prior bias inherent in pre-trained models. Finally, we design a Cross-Modal Adaptive Prototype Updating mechanism (CAP), which adaptively fuses visual features, gloss prototypes, and textual features, and subsequently updates the gloss prototypes through the classifier-aware feature states, effectively mitigating the tendency of traditional momentum update methods to introduce erroneous features. Extensive experiments on the PHOENIX14, PHOENIX14-T, and CSL-Daily datasets demonstrate that the proposed CAP-SLR method can effectively align cross-modal features at the gloss level and achieves competitive recognition performance.

Abstract:
Compared to supervised learning methods, self-supervised learning methods address the domain gap between light field (LF) datasets collected under varying acquisition conditions, a gap that typically degrades performance when the training and test distributions differ. However, current self-supervised light field angular super-resolution (LFASR) techniques primarily focus on exploiting discrete spatial-angular features while neglecting continuous LF information. In contrast to previous work, we propose a self-supervised unconstrained neural light field (UNeLF) to continuously represent LF for LFASR. Specifically, any LF can be described by the camera pose for each sub-aperture image (SAI) and the two-plane that captures these SAIs. To describe the former, we introduce a SAIs-dependent pose optimization method to solve the issue that arises from the narrow baseline of most LF data, which hinders robust camera pose estimation. This mechanism reduces the number of trainable camera parameters from a quadratic to a constant scale, thereby alleviating the complexity of joint optimization. For the latter, we propose a novel adaptive two-plane parameterization strategy to determine the two-plane that captures these SAIs, facilitating refocusing. Finally, we jointly optimize the camera parameters, near-far planes and neural light field, efficiently mapping each adaptive two-plane parameterized ray to its corresponding color in a continuous manner. Comprehensive experiments demonstrate that UNeLF achieves faster training and inference with fewer computational resources while exhibiting superior performance on both synthetic and real-world datasets.
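The two-plane description of a ray can be made concrete with a short sketch. It assumes both planes are perpendicular to the z-axis and takes their depths `z_near` and `z_far` as fixed inputs, whereas the abstract describes the planes as determined adaptively during optimization. The function name is hypothetical.

```python
def two_plane_coords(origin, direction, z_near, z_far):
    """Return (u, v, s, t): where the ray crosses the planes z = z_near and z = z_far."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the two planes")
    # Ray parameter at which the ray reaches each plane.
    step_near = (z_near - oz) / dz
    step_far = (z_far - oz) / dz
    u, v = ox + step_near * dx, oy + step_near * dy
    s, t = ox + step_far * dx, oy + step_far * dy
    return (u, v, s, t)


# A ray from the origin heading diagonally in x and z.
coords = two_plane_coords((0.0, 0.0, 0.0), (1.0, 0.0, 1.0), z_near=1.0, z_far=3.0)
print(coords)
```

The four numbers identify the ray independently of where along it the camera sits, so a network can map them directly to a color.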

Abstract:
Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M3amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters in the fusion architecture to avoid the domain-specific understanding bias caused by direct inference, giving the original CLIP encoder modality-specific perception. This unified framework achieves a comprehensive semantic understanding of different modalities with minimal training, thereby guiding cross-modal feature fusion. To further enhance the consistent association between modality mappings, a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module, Cross-SS2D, are designed, which fully consider effective and efficient information interaction to achieve complete fusion. Extensive experiments show that M3amba achieves an average performance improvement of at least 5.98% over state-of-the-art methods in multi-modal hyperspectral image classification tasks in the remote sensing field, while also demonstrating excellent training efficiency, improving both accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.

Abstract:
Recent rapid advancements in intelligent vehicular systems and deep learning techniques have led to the emergence of diverse applications utilizing high-quality automotive videos in the Internet-of-Vehicles (IoV), often assisted by uncrewed aerial vehicles (UAVs). These applications aim to provide convenience and security for users. However, transmitting automotive videos at high quality and low bit rate poses a challenge due to the inherent lossiness of traditional compression codecs in current UAV-assisted IoV systems, thereby affecting the performance of subsequent tasks. To address this, we propose a spatial-temporal hybrid video compression framework (STHVC), which integrates Space-Time Super-Resolution (STSR) with conventional codecs to enhance the compression efficiency on automotive videos. In our hybrid design, the encoder generates a low-frame-rate and low-resolution version of the source video, which is then compressed using a traditional codec. During the decoding stage, an effective STSR network is developed to increase both the resolution and the frame rate while simultaneously mitigating compression artifacts in automotive videos. Additionally, we introduce a rectified intermediate flow estimation technique (RecIFE) within the proposed STSR network to address the challenge of noisy and inaccurate motion during the compression pipeline. Extensive experiments on various benchmark datasets demonstrate that our approach achieves bit-rate reductions of 29.97% compared to H.265 (slow) and 31.27% compared to H.266, while also exhibiting superior restoration performance compared to other state-of-the-art learning-based approaches.

Abstract:
We present a novel Task-aware Attentional Dynamic Alignment (TADA) framework for visual-based few-shot video classification (FSVC) that addresses two key challenges in this field: efficiency and nuanced spatio-temporal reasoning. Existing methods are often hindered by computationally expensive video decoding processes and neglect the temporal order of videos. In contrast, our method harnesses compressed domain data to extract rich spatio-temporal cues at a fraction of the cost of traditional video processing methods. Specifically, we propose an embedding module to extract informative features from compressed domain data while minimizing computational overheads. Furthermore, to exploit the temporal order of frames, we develop a prototypical ADA module to align and classify videos with an explicit temporal order constraint. Our framework also incorporates a contextual mixer to enrich video embeddings with task-specific context. Extensive experiments on multiple datasets demonstrate that TADA achieves state-of-the-art performance and outperforms existing methods in accuracy and efficiency.

Abstract:
3D Visual Grounding based on natural language is a fundamental task in Embodied AI. One of the fundamental challenges in localizing objects in 3D scenes through natural language descriptions arises from the variable perception of spatial relationships among objects when viewed from different perspectives. To address this issue, we introduce a model named Pseudo-EV, which decomposes the problem of 3D visual grounding into two stages: (1) predicting an embodied viewpoint and (2) determining the target object within that viewpoint, thereby eliminating viewpoint ambiguity. Given the scarcity of annotations for embodied viewpoint prediction, we employ a large language model (LLM) to generate pseudo-labels for existing datasets as intermediate training targets. However, directly predicting viewpoints in continuous Euclidean space proves inefficient, leading to weaker alignment with textual queries and scene semantics, as well as higher training overhead. To overcome these limitations, we introduce two streamlined strategies: an Embodied Viewpoint with Semantic Structure and a Decoupled Target Prediction Strategy. Extensive experiments demonstrate that predicting intermediate embodied viewpoints substantially boosts the performance of 3D visual grounding, achieving state-of-the-art results on both ScanRefer and Nr3D/Sr3D. Moreover, our framework significantly reduces computational cost compared to other viewpoint-aware approaches.

Abstract:
Establishing global contextual relationships between objects is crucial for weakly supervised semantic segmentation (WSSS) tasks that lack pixel-level labels. Because convolutional operations, with their limited receptive field, capture long-range dependencies inefficiently, and to bridge the gap between image-level annotations and pixel-level labels, we propose a Dual Graph Reasoning Mapping (DGRM) module. When integrated into a convolutional network, it conducts contextual graph reasoning on both spatial and interaction spaces of visual features. The first component of this graph reasoning module involves incorporating commonsense knowledge extracted from an external knowledge base into visual features to promote global contextual reasoning for visual graphs. The second component focuses on reasoning in the projected interaction space, utilizing abstracted object class attributes from high-level visual features to establish dependencies among channels in a potential low-dimensional space. Moreover, to capture correspondences at different semantic levels, we model the feature maps in a pyramid-like structure for graph reasoning at various levels. Extensive experiments on popular datasets, such as PASCAL VOC 2012 and MS COCO 2014, demonstrate the superiority of our approach. Our code is provided at https://github.com/JIA-ZHANG666/DGRM.

Abstract:
Visible-Infrared person Re-Identification (VI-ReID) involves querying images of the same person across visible and infrared modalities. To minimize annotation costs, Unsupervised Visible-Infrared person Re-Identification (UVI-ReID) using pseudo-label contrastive learning has emerged. Traditional UVI-ReID approaches often neglect camera domain information, rely on inadequate update strategies during training, and use only cosine distance for testing, which leads to incorrect mapping of cross-modal relationships. To address these issues, we propose Camera-proxy Enhanced Identity-recalibration Learning (CEIL). It consists of two main stages. In the first stage, it employs intra-modal contrastive learning in conjunction with the camera-proxy, updates the memory bank using our innovative Difficulty-aware Cluster-based Memory Updating (DCMU) strategy, and applies the Camera Domain-driven Local correlation (CDL) Loss to enhance the learning process. In the second stage, it utilizes cross-modal contrastive learning, featuring our Proxy-enhanced Cross-modal Mapping (PCM) module, to recalibrate the identity relationships between different modalities. A Graph network-based Camera constraint adjustment Re-ranking (GCR) method is adopted during testing, utilizing camera domain information to recalibrate the correspondence between identities. Extensive experiments demonstrate that CEIL achieves state-of-the-art performance on the SYSU-MM01, RegDB, and LLCM datasets, and that GCR, as a general unsupervised re-ranking method, can further enhance model performance on these datasets. The code will be released at https://github.com/maybeextra/CEIL.

Abstract:
The third generation audio video coding standard (AVS3) is the latest video coding standard developed by the China AVS working group. The advanced entropy coding (AEC) tool in AVS3 has critical bin-to-bin data dependencies, which make parallelization difficult. The use of a 16384-entry lookup table (LUT) in the AEC context update algorithm poses challenges in balancing area and performance. To address these issues, we propose a high-performance, area-efficient hardware design. First, we introduce a novel multicycle-path parallel architecture and optimize area cost through hardware reuse. Next, we construct a context modeling processing unit to replace the large LUT, significantly reducing area. Finally, we propose a new LUT-free dual-context modeling processing unit, effectively resolving critical paths introduced by parallel context conflicts. As a result, our design processes 2.63 bins per cycle. The synthesis results based on the GlobalFoundries 28 nm process indicate that its maximum frequency is 990 MHz, with an overall throughput of 2604 Mbin/s. Compared to state-of-the-art designs, our design leads in performance by 24.7% while reducing area by 28%.
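The reported throughput can be checked against the other two figures in this abstract. The sketch below assumes throughput is simply bins per cycle multiplied by clock frequency; the function name is hypothetical.

```python
def throughput_mbin_per_s(bins_per_cycle, frequency_mhz):
    """Bins per cycle times millions of cycles per second gives Mbin/s."""
    return bins_per_cycle * frequency_mhz


# Figures quoted in the abstract: 2.63 bins/cycle at 990 MHz.
estimate = throughput_mbin_per_s(2.63, 990)
print(round(estimate, 1), round(estimate))
```

The product is about 2603.7 Mbin/s, which rounds to the 2604 Mbin/s stated in the abstract.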

Abstract:
Hyperspectral image (HSI) clustering, which partitions pixels into different clusters, is challenging due to the complex spatial distribution and highly correlated spectra. Subspace clustering is a representative learning paradigm and has shown competitive performance on HSIs. Most existing methods ignore potential spatial or structural information and have difficulty dealing with large-scale HSIs. In this paper, we propose an elastic graph fusion subspace clustering (EGFSC) framework that can flexibly incorporate spectral, spatial and structural information for large HSI clustering. Instead of performing pixel-level learning, superpixel-level learning is conducted on the generated superpixels to lessen the computational burden and memory cost. To explore structural information from two perspectives, a superpixel graph and a band graph are constructed based on the superpixel features. Considering the incompatible sizes of the two graphs, we present three effective dual graph fusion strategies to fuse them in different ways. With these graph fusion strategies, EGFSC is able to improve clustering performance by simultaneously considering spatial and structural information. To solve the proposed framework, we present a closed-form solution for easy implementation. Experiments demonstrate that the proposed EGFSC obtains 70.08%, 75.76%, 87.28% and 77.23% clustering accuracies on the four HSI datasets and outperforms the state-of-the-art methods. The source code is released at https://github.com/ZhangYongshan/EGFSC.

Abstract:
Light field (LF) atmospheric descattering methods using multi-view images from camera arrays offer significant advantages for handling strong scattering due to their ability to exploit high-dimensional light information. However, the relationship between performance and scattered LF sampling rate (i.e., the density of samples per unit area) is an unknown coupling, affecting acquisition and processing complexity. In this paper, we define the minimum atmospheric scattered LF sampling rate under optimal descattering quality, based on attenuated spectral support in scattering scenarios derived from the proposed atmospheric point spread function (APSF). The proposed APSF integrates the camera model, radiative transfer equation, and modified generalized Gaussian distribution (GGD) to describe multiple scattering. For any scattering parameters, the proposed APSF can be directly derived without infinite series, ensuring full adaptability to all acquisition systems through the integration of the system model. Combining the APSF with scene and acquisition system information, the scattered LF spectrum is determined, and consequently the minimum atmospheric scattered LF sampling rate is derived for the first time. Experimental results demonstrate the accuracy, effectiveness, and robustness of the proposed atmospheric scattered LF sampling theory through comparisons of atmospheric descattering performance across different LF sampling rates, object types, scene depths, and scattering intensities. The proposed method reduces the number of acquisition cameras by an average of 78.4% while maintaining processing quality, which significantly enhances the applicability of LF atmospheric descattering methods.

Abstract:
Multimodal sarcasm detection (MSD) requires predicting the sarcastic sentiment by understanding diverse modalities of data (e.g., text, image). Beyond the surface-level information conveyed in the post data, understanding the underlying deep-level knowledge, such as the background and intent behind the data, is crucial for understanding the sarcastic sentiment. However, previous works have often overlooked this aspect, limiting their potential to achieve superior performance. To tackle this challenge, we propose DeepMSD, a novel framework that generates supplemental deep-level knowledge to enhance the understanding of sarcastic content. Specifically, we first devise a Deep-level Knowledge Extraction Module that leverages large vision-language models to generate deep-level information behind the text-image pairs. Additionally, we devise a Cross-knowledge Graph Reasoning Module to model how humans use prior knowledge to identify sarcastic cues in multimodal posts. This module constructs cross-knowledge graphs that connect deep-level knowledge with surface-level knowledge. As such, it enables a more profound exploration of the cues underlying sarcasm. Experiments on the public MSD dataset demonstrate that our approach significantly surpasses previous state-of-the-art methods.

Abstract:
Existing computer vision methods mainly focus on the recognition of rigid objects, whereas the recognition of flexible objects remains unexplored. Recognizing flexible objects poses significant challenges due to their inherently diverse shapes and sizes, translucent attributes, ambiguous boundaries, and subtle inter-class differences. In this paper, we claim that these problems primarily arise from the lack of object saliency. To this end, we propose the Flexible Vision Graph Neural Network (FViG) to optimize the self-saliency and thereby improve the discrimination of the representations for flexible objects. Specifically, on the one hand, we propose to maximize the channel-aware saliency by extracting the weight of neighboring graph nodes, which is employed to identify flexible objects with minimal inter-class differences. On the other hand, we maximize the spatial-aware saliency based on clustering to aggregate neighborhood information for the centroid graph nodes. This introduces local context information and enables the extraction of consistent representations, effectively adapting to the shape and size variations in flexible objects. To thoroughly verify the performance of flexible object recognition, we propose, for the first time, the Flexible Dataset (FDA), which consists of various images of flexible objects collected from real-world scenarios or online. Extensive experiments on our FDA, FireNet, CIFAR-100 and ImageNet-Hard datasets demonstrate the effectiveness of our method in enhancing the discrimination of flexible objects.

Abstract:
Video frame interpolation (VFI) synthesizes new frames from original video frames to produce high frame-rate videos and enhance their visual appeal. The quality of these interpolated frames significantly affects the perceptual experience of the synthesized video. Recent research in VFI has increasingly focused on perceptual quality of the interpolated frames and the overall video. However, most existing quality metrics do not align well with human perceptual experiences and often suffer from unnatural artifacts in the interpolated frames. Consequently, there is an urgent need for VFI video quality assessment (VFIVQA) methods to assess the quality of the synthesized videos. In this paper, we propose both a full-reference (FR) method and a no-reference (NR) method for VFIVQA. The FR method employs two feature extraction blocks to measure continuous frame changes, extracting flow features with short temporal spans and motion features with long temporal spans. By calculating multilevel similarities in the temporal dimension of 3D convolutional neural networks and fusing these similarity features, the quality score of the VFI video is obtained from the quality regression network. Since the flow feature extraction block does not utilize the reference VFI video, the proposed NR method consists solely of this feature block. Extensive validation on several VFIVQA datasets demonstrates that the proposed methods outperform state-of-the-art FR and NR methods.

Abstract:
Images captured at nighttime often face challenges such as low light and blur, primarily caused by dim environments and the frequent use of long exposure. Existing methods either handle the two types of degradations independently or rely on carefully designed priors generated by complex mechanisms, resulting in poor generalization ability and high model complexity. To address these challenges, we propose an end-to-end framework named LIEDNet to efficiently and effectively restore high-quality images on both real-world and synthetic data. Specifically, the introduced LIEDNet consists of three essential components: the Visual State Space Module (VSSM), the Local Feature Module (LFM), and the Dual Gated-Dconv Feedforward Network (DGDFFN). The integration of VSSM and LFM enables the model to capture both global and local features while maintaining low computational overhead. Additionally, the DGDFFN improves image fidelity by extracting multi-scale structural information. Extensive experiments on real-world and synthetic datasets demonstrate the superior performance of LIEDNet in restoring low-light, blurry images. The code is available at https://github.com/MingyuLiu1/LIEDNet.

Abstract:
Long video action quality assessment (AQA) aims to evaluate the performance of long-term actions depicted in a video and produce an overall assessment for action quality. A video of long-term actions often contains more complicated temporal and spatial information than that of short-term actions. However, existing approaches that segment a video into individual clips for independent analysis potentially disrupt the narrative flow and diminish contextual details within and across clips, impeding comprehensive video understanding. To address this challenge, we propose an adaptive spatiotemporal graph transformer network (ASGTN) that combines multiple graph structures and transformer attention mechanisms to capture both local and global contextual information within and across clips in a long video. Specifically, the adaptive spatiotemporal graph (ASG) combines a spatial graph branch, designed to enrich the local nuanced spatiotemporal relations within an individual clip, and a temporal graph branch, tailored to dynamically learn the semantic context across different clips. Furthermore, a transformer encoder is integrated to amplify the global dependencies across clips in the entire video. This structure is designed to preserve narrative coherence and maintain essential contextual details in video-level features. Finally, we employ a level-focused decoder to predict the action quality score distribution. Experiments demonstrate that our model achieves state-of-the-art results on popular AQA datasets. Our code is available at https://github.com/jiangliu5/ASGTN_AQA.

Abstract:
3D single object tracking (3D SOT) in LiDAR point clouds plays a crucial role in autonomous driving. It remains a challenging problem due to the incompleteness and the sparsity of points caused by occlusion and limited sensor capabilities. Previous methods design various modules to propagate target perceptual cues to the current frame for target localization. However, perceptual cues may contain less information for occluded or distant objects, which brings great challenges to estimating the target state accurately. To address the above limitations, we propose a novel 3D SOT framework based on the adaptive conceptual prototypes named ACPTrack, which first learns the conceptual prototype from the prior knowledge of the category structure, and then associates weak perceptual cues with the learned conceptual prototypes to improve tracking performance. The proposed ACPTrack enjoys several merits. First, we propose a universal learning method of adaptive conceptual prototype, which can quickly adapt to target-specific structure with given perceptual cues. Second, we design two modules based on the conceptual prototype for structure completion and positioning refinement, which can exploit the rich structure information of the conceptual prototype to deal with sparse and incomplete targets for robust tracking. Third, our framework is generic and compatible with various 3D trackers and brings consistent performance gains. Extensive experiments validate that our method achieves competitive performance on three large-scale datasets.

Abstract:
Inserting objects into scenes and performing realistic relighting are common applications in augmented reality (AR). Previous methods focused on inserting virtual objects using CAD models or real objects from single-view images, resulting in highly limited AR application scenarios. We introduce a novel pipeline based on Neural Radiance Fields (NeRFs) for seamlessly integrating objects into scenes, from two sets of images depicting the object and scene. This approach enables novel view synthesis, realistic relighting, and supports physical interactions such as shadow casting between objects. The lighting environment is in a hybrid representation of Spherical Harmonics and Spherical Gaussians, representing both high- and low-frequency lighting components very well, and supporting non-Lambertian surfaces. Specifically, we leverage the benefits of volume rendering and introduce an innovative approach for efficient shadow rendering by comparing the depth maps between the camera view and the light source view and generating vivid soft shadows. The proposed method achieves realistic relighting effects in extensive experimental evaluations.

Abstract:
Semi-supervised learning (SSL) has proven effective in assigning a pseudo-label to a confident sample whose largest class probability is above a fixed threshold. However, in the context of semi-supervised facial attribute recognition (SSFAR), where a sample is associated with multiple presence and absence pseudo-labels, directly applying existing SSL methods is challenging due to two issues: 1) the lack of a clear boundary between presence and absence predictions for an attribute makes it difficult to distinguish them using a single threshold; 2) the learning difficulty varies across attributes, so the fixed strategy fails to adaptively learn different attributes. To address these challenges, we propose Dynamic thrEShold Pairs (DESP), a simple yet effective method to handle the SSFAR problem. Specifically, during each training stage, we derive two sets for each attribute from labeled samples, which contain the predicted probabilities of presence and absence, respectively. We then compute the mid-ranges of the two sets as paired presence and absence thresholds. Finally, we assign a presence or absence pseudo-label for the attribute to an unlabeled sample when its prediction exceeds the presence threshold or falls below the absence threshold. Extensive experiments on the CelebA and LFWA datasets demonstrate that DESP achieves superior performance compared to state-of-the-art methods, especially in the case of scarce labeled samples. Also, DESP performs well on multi-label datasets such as Pascal VOC and MS-COCO. The code will be publicly available at https://github.com/yihanxxu/DESP.
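The threshold-pair step described in the abstract above can be sketched for a single attribute as follows. This is a minimal illustration, not the authors' implementation: it assumes the "two sets" are the predicted presence probabilities of labeled samples annotated as present and as absent, respectively, and it takes the mid-range in its usual sense of (min + max) / 2. The function names `mid_range`, `threshold_pair`, and `pseudo_label` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def mid_range(values: Sequence[float]) -> float:
    """Mid-range of a non-empty collection: (min + max) / 2."""
    if not values:
        raise ValueError("mid_range needs at least one value")
    return (min(values) + max(values)) / 2.0


def threshold_pair(probs: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (presence_threshold, absence_threshold) for one attribute.

    probs  : predicted presence probabilities on labeled samples
    labels : ground truth, 1 = attribute present, 0 = attribute absent
    Assumption: the two sets are split by ground-truth label.
    """
    presence = [p for p, y in zip(probs, labels) if y == 1]
    absence = [p for p, y in zip(probs, labels) if y == 0]
    return mid_range(presence), mid_range(absence)


def pseudo_label(prob: float, presence_thr: float, absence_thr: float) -> Optional[int]:
    """Pseudo-label an unlabeled prediction, or return None if it is not confident."""
    if prob > presence_thr:
        return 1  # confident presence
    if prob < absence_thr:
        return 0  # confident absence
    return None   # between the two thresholds: leave unlabeled


if __name__ == "__main__":
    # Toy labeled predictions for one attribute (hypothetical numbers).
    probs = [0.5, 1.0, 0.75, 0.0, 0.25, 0.5]
    labels = [1, 1, 1, 0, 0, 0]
    pres_thr, abs_thr = threshold_pair(probs, labels)
    for p in (0.9, 0.5, 0.1):
        print(p, pseudo_label(p, pres_thr, abs_thr))
```

Because the thresholds are recomputed from the current predictions on labeled data, calling `threshold_pair` at each training stage would make the pair track how well each attribute has been learned so far, which is the adaptive behavior the abstract describes.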

Affiliations: School of Artificial Intelligence, Hebei University of Technology, Tianjin, China; Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China; School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; College of Computer, National University of Defense Technology, Changsha, China

Abstract:
The goal of single domain generalization is to use data from a single domain (source domain) to train a model, which is then deployed over several unknown domains for testing (target domains). This study introduces a practical approach diverging from traditional domain generalization (DG), which typically relies on multiple source domains. We focus on Single Long-Tailed Domain Generalization, which refers to a scenario in the context of long-tail distribution, where although minority classes may have fewer samples in a single domain, these minority classes could become more prevalent and dominant in other domains. We introduce the Graph Convolutional Mixture-of-Experts Learners Network for Long-Tailed Domain Generalization (GCML) as a solution to this problem. Our approach presents two novel tactics. First, we utilize a skill-diverse expert learning technique, which trains multiple experts inside a single long-tailed source domain and combines their knowledge in order to properly handle the unknown target domain. Second, we use a graph convolutional network to facilitate domain generalization, leveraging joint data structure modeling to learn more domain-invariant features. Experiments conducted on four established benchmarks reveal that our GCML algorithm outperforms contemporary domain generalization techniques, demonstrating its efficacy in this complex task.

Abstract:
This paper presents a novel approach that leverages two models to integrate features from numerous unlabeled images, addressing the challenge of semi-supervised salient object detection (SSOD). Unlike conventional methods that rely on selecting high-quality pseudo labels, our method identifies, among the two models, the one that produces consistent predictions for original images and their color-transformed versions, and uses it to infer reliable pseudo labels for all unlabeled images, improving the diversity of the training set. Specifically, we propose adaptive selection indicators to quantify prediction differences and guide the updates of the two models using the unlabeled set alternately. Initially, the two models used in our framework are trained on the labeled set. Once the adaptive selection indicator conditions are satisfied, one model is designated as the proxy, generating pseudo labels, while the other serves as the saliency model, which is further trained using these pseudo labels. Subsequently, the updated saliency model optimizes the proxy model’s parameters according to another adaptive selection indicator. Experimental results and ablation studies on six benchmark salient object detection datasets confirm the effectiveness and robustness of our method. Our approach achieves performance comparable to recent fully supervised methods while using only one eighth of the labeled data, demonstrating its potential for efficient and scalable SSOD. Our code is publicly available at https://github.com/Liyuan0905/CATNet.

Abstract:
Pedestrian trajectory prediction aims to forecast future movements based on historical paths. Spatial-temporal (ST) methods often separately model spatial interactions among pedestrians and temporal dependencies of individuals. They overlook the direct impacts of interactions among different pedestrians across various time steps (i.e., high-order cross-time interactions). This limits their ability to capture ST inter-dependencies and hinders prediction performance. To address these limitations, we propose UniEdge with three major designs. Firstly, we introduce a unified ST graph data structure that simplifies high-order cross-time interactions into first-order relationships, enabling the learning of ST inter-dependencies in a single step. This avoids the information loss caused by multi-step aggregation. Secondly, traditional GNNs focus on aggregating pedestrian node features, neglecting the propagation of implicit interaction patterns encoded in edge features. We propose the Edge-to-Edge-Node-to-Node Graph Convolution (E2E-N2N-GCN), a novel dual-graph network that jointly models explicit N2N social interactions among pedestrians and implicit E2E influence propagation across these interaction patterns. Finally, to overcome the limited receptive fields and challenges in capturing long-range dependencies of auto-regressive architectures, we introduce a transformer encoder-based predictor that enables global modeling of temporal correlation. UniEdge outperforms state-of-the-art methods on multiple datasets, including ETH, UCY, and SDD.

Abstract:
The complementary characteristics of visible (VIS) and infrared (IR) modalities play a crucial role in scene perception for autonomous driving, especially under poor lighting conditions. However, effectively leveraging the complementary information from visible and infrared images to further enhance perception performance remains a challenging task. These challenges stem from the difficulty of adaptively balancing the contributions of visible and infrared information under dynamic illumination conditions, the reliance on static fusion strategies that fail to fully utilize cross-modal complementarities, and the limitations of existing datasets in terms of diverse scenes, fine-grained illumination annotations, and high imaging quality. To address the challenges, we propose an Edge-guided Illumination-aware Interactive learning-based Detector (EI2Det). It includes three novel modules. The cross-modal interaction module uses visible-priority and infrared-priority multi-head cross-attention mechanisms to refine inter-modality and intra-modality feature representations, improving the model’s robustness and adaptability. The illumination-aware weighting module predicts illumination intensity levels to dynamically adjust the contributions of visible and infrared features, ensuring effective fusion under various lighting conditions. The edge-guided fusion module leverages critical edge information to guide the detector’s attention to object boundaries, significantly enhancing its localization capability. Additionally, we introduce a Multi-modality Full-time dataset for Autonomous Driving (MFAD), featuring 12,370 image pairs with fine-grained annotations of illumination intensity, covering diverse driving scenarios and weather conditions. Extensive experiments on the public M3FD, KAIST, FLIR, LLVIP, and our MFAD datasets demonstrate superior performance and generalization ability of our approach. The code and dataset will be available at https://github.com/hukefy/EI2Det.

Affiliations: Department of Computer Science, City University of Hong Kong, Hong Kong, China; School of Science, Harbin Institute of Technology at Shenzhen, Shenzhen, Guangdong, China; College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China; School of Electronics and Information Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China

Abstract:
Under large-scale data with missing views, fast incomplete multi-view clustering (IMVC) with anchor learning is of critical importance due to its linear complexity $\mathcal{O}(n)$. However, existing anchor-based methods only explore the column orthogonality of anchor points, where their arbitrary column orthogonal basis vectors have weak constraint relationships with real samples and significant deviations from more representative anchors, thereby impeding the precise representation of sample similarities. To solve this issue, we propose a Reliable Entropy-induced anchor learning for incomplete Multi-view subspace Clustering (REMC), which introduces an entropy approximation term to learn more representative anchors, and we prove that the information entropy minimization can be relaxed into the $\ell_{2,1}$-norm paradigm. Specifically, the proposed REMC first integrates anchor learning and subspace clustering to produce multiple view-specific bipartite graphs and capture the high-order correlations by imposing the tensor nuclear norm on these bipartite graphs. Then, we fuse all the view-specific bipartite graphs to build a consensus bipartite graph with entropy approximation regularization, and hence the proposed REMC can produce a more discriminative similarity graph, preserving each non-zero element in its column close to 1, while the other elements approach 0. Besides, an efficient algorithm is designed to solve the proposed REMC. Extensive experimental results show the superior performance of our method on both complete and incomplete data.
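For readers unfamiliar with the norm named in the abstract above, the following is the commonly used definition of the ℓ2,1-norm of a matrix, given here only as background. The abstract does not state whether REMC sums over rows or columns, so the row-wise form below is an assumption, and the symbols Z, n, and m are illustrative.

```latex
% Row-wise l_{2,1} norm of Z in R^{n x m}: the sum of the Euclidean norms of its rows.
% (Some papers apply the same definition to columns instead.)
\|Z\|_{2,1} \;=\; \sum_{i=1}^{n} \|Z_{i,:}\|_{2}
            \;=\; \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} Z_{ij}^{2}}
```

Minimizing this quantity tends to drive whole rows (or columns) toward zero while leaving a few with large magnitude, which is consistent with the sparse, near-binary similarity graph the abstract describes.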

Abstract:
Source-Free Domain Generalization (SFDG) aims to develop a model that performs well on unseen domains without relying on any source domains. However, the implementation remains constrained due to the unavailability of training data. Research on SFDG focuses on knowledge transfer of multi-modal models and style synthesis based on the joint space of multiple modalities, thus eliminating the dependency on source domain images. However, existing works mainly target the multi-domain, less-category configuration, and their performance in the multi-domain, multi-category configuration is relatively poor. In addition, the efficiency of style synthesis also deteriorates in multi-category scenarios. How to efficiently synthesize sufficiently diverse data and apply it to the multi-category configuration is a direction with greater practical value. In this paper, we propose a method called BatStyler, which improves the capability of style synthesis in multi-category scenarios. BatStyler consists of two modules: Coarse Semantic Generation and Uniform Style Generation. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the compression of space for style diversity learning in the multi-category configuration, while the Uniform Style Generation module provides a template of styles that are uniformly distributed in space and implements parallel training. Extensive experiments demonstrate that our method exhibits comparable performance on less-category datasets, while surpassing state-of-the-art methods on multi-category datasets. Code is available at: https://github.com/Xuxiusheng/BatStyler.

Abstract:
Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search models are mostly trained and deployed in same-origin scenarios. However, collecting and annotating training samples for each scene is difficult due to limited resources and labor costs. Moreover, large-scale intra-domain training data are generally not legally available to common developers, due to privacy and public security regulations. Leveraging easily accessible large-scale User Generated Video Contents (i.e., UGC videos) to train person search models can fit the real-world distribution, but still suffers from a performance gap caused by the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework covering both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in UGC training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain. The code is available at https://github.com/caposerenity/GPS.

Abstract:
Cell localization and counting in pathological images play an important role in the diagnosis and treatment of life-threatening diseases (e.g., tumors). However, they remain challenging due to cell clustering and adhesion, blurred boundaries, deformation, and the difficulty of annotation. In this work, we address these problems by introducing multi-granularity topological constraints in model training. First, a loss function imposing a topological structure constraint on single cells is proposed, which encourages the trained model to avoid wrongly predicting multiple cells within an instance (false positives). Second, a loss function constraining the spatial topological structure distribution is proposed for clustered cells, which helps the trained model to reduce the wrong prediction of several crowded cells as one (false negatives). Third, a loss is proposed from the expert check of annotation and inference errors, which enables the localization of difficult samples and facilitates the correction of errors. The multi-granularity loss under topological feature constraints significantly enhances the performance of the trained model. Experimental results on a self-collected COVID-19 pathological dataset and two public pathological datasets validate the performance advantages of the proposed method over some state-of-the-art methods. Our code will be available at https://github.com/MedicalYajieChen/MGTopology.

Abstract:
The vision-based semantic scene completion task aims to predict dense geometric and semantic 3D scene representations from 2D images. However, 3D modeling from a single view is an ill-posed problem, limited by the field of view and occlusion problems caused by image input. Moreover, existing methods tend to produce erroneous scene hallucinations and overly smooth boundary segmentation due to a lack of information. To address this problem, we propose MixSSC, which mixes the sparsity of forward projection with the denseness of depth-prior backward projection. The aim is to use sparse features to fill information-poor regions and dense features to enhance visible regions. Specifically, we develop the forward-backward mixture module, which enables the generation of scene mixture voxel representation by leveraging the benefits of both forward and backward projection. Subsequently, we design the semantic-spatial fusion module, which utilizes a coarse-to-fine approach to process mixture voxel features at the semantic-spatial level. Extensive experimental results on the SemanticKITTI, SSCBench-KITTI-360 and nuScenes datasets demonstrate the superiority of MixSSC. Our code is available at https://github.com/willemeng/MixSSC.

Abstract:
Generative artificial intelligence has made great progress in enabling clients to create a variety of realistic content (such as images, videos and audio), and diffusion models, as an emerging class of generative models, can produce higher-quality images than generative adversarial networks (GANs). For resource-constrained devices, high-definition images can be generated by outsourcing the AI model to the server, but the client’s local prompts and generated images may contain private information, increasing the risk of privacy disclosure. To alleviate these concerns, we combine split learning with homomorphic encryption to protect data privacy and prevent attackers from stealing the data. Specifically, we first adopt homomorphic encryption to ensure the confidentiality of the text embedding and prevent the server from obtaining the prompt and text embedding. Secondly, during the denoising process, a secure cross-attention mechanism is designed for ciphertext based on the embedding matrix to ensure that subsequent steps proceed normally and to prevent denoising model parameters from leaking to clients. In addition, since the decoder is deployed locally, the output of the denoising model consists of latent-space features rather than pixel-space features, so the generated image is not visible to the server. Finally, theoretical analysis and quantitative experimental results show that the image generated under a ciphertext prompt is almost the same as the image generated under a plaintext prompt, which indicates the security and effectiveness of the proposed protocol.

Abstract:
As sensor technology evolves, RGB+X systems combine traditional RGB cameras with another type of auxiliary sensor, which enhances perception capabilities and provides richer information for important tasks such as semantic segmentation. However, acquiring massive RGB+X data is difficult due to the need for specific acquisition equipment. Therefore, traditional RGB+X segmentation methods often perform pretraining on relatively abundant RGB data. However, these methods lack corresponding mechanisms to fully exploit the pretrained model, and the scope of the pretraining RGB dataset remains limited. Recent works have employed prompt learning to tap into the potential of pretrained foundation models, but these methods adopt a unidirectional prompting approach, i.e., using the X or RGB+X modality to prompt pretrained foundation models in the RGB modality, neglecting the potential in non-RGB modalities. In this paper, we are dedicated to developing the potential of pretrained foundation models in both RGB and non-RGB modalities simultaneously, which is non-trivial due to the semantic gap between modalities. Specifically, we present CPAL (Cross-prompting Adapter with LoRAs), a framework that features a novel bi-directional adapter to simultaneously exploit the complementarity and bridge the semantic gap between modalities. Additionally, CPAL introduces low-rank adaptation (LoRA) to fine-tune the foundation model of each modality. With the support of these elements, we unleash the potential of RGB foundation models in both RGB and non-RGB modalities simultaneously. Our method achieves state-of-the-art (SOTA) performance on five multi-modal benchmarks, including RGB+Depth, RGB+Thermal, RGB+Event, and a multi-modal video object segmentation benchmark, as well as four multi-modal salient object detection benchmarks. The code and results are available at: https://github.com/abelny56/CPAL.

Abstract:
With the assistance of language descriptions, Visual-Language (VL) object tracking can obtain more accurate semantic information compared to traditional Visual-Only object tracking. However, the ability of current VL trackers to obtain target semantic information has not been fully developed due to limitations such as wasted modeling capabilities and insufficient utilization of historical temporal information. On the one hand, the modeling output from shallow Transformer encoders often does not directly participate in the prediction of tracking results, which wastes part of the model's capability. On the other hand, the semantic information of historical tracking results has not been fully utilized in the tracking process, which limits the available semantic assistance. Therefore, we propose a novel hierarchical multi-stage VL tracker called SIEVL-Track to enhance target semantic information. Specifically, we first design a multi-stage visual-language tracking framework for modeling multi-scale semantic information in the Visual-Language tracking pipeline. Secondly, we propose a selective deep and shallow semantic information fusion module (S-DSFM) that explicitly integrates shallow output features into deep output features, so as to reduce the waste of modeling capabilities and obtain more high-frequency semantic information related to the target. Finally, we design a temporal cue modeling module based on linguistic classification and multi-frame historical information (MHLS-TCM), with the aim of utilizing historical temporal semantic information more comprehensively. Benefiting from the above designs, our VL tracker can obtain stronger target semantic information. Extensive experimental results on five popular vision-language tracking benchmarks, including LaSOT, OTB99-Lang, WebUAV-3M, LaSOT_ext and TNL2K, demonstrate the competitive performance and effectiveness of our SIEVL-Track.

Abstract:
Fine-grained action recognition typically faces challenges with lower inter-class variances and higher intra-class variances. Supervised contrastive learning is inherently suitable for this task, as it can decrease intra-class feature distances while increasing inter-class ones. However, directly applying it to fine-grained action recognition encounters two main problems. The first problem stems from the heavy training cost associated with supervised contrastive learning, which requires numerous training epochs, each involving two augmented views per instance. To address this issue, we propose the late-stage supervised contrastive learning (late-SC) strategy, which effectively reduces the number of training epochs needed for the contrastive learning process. The second problem is that supervised contrastive loss does not explicitly consider the semantic distances between fine-grained actions when adjusting representation distances. This results in less reasonable and less efficient adjustments to the representation space. To overcome this limitation, we introduce the semantic-aware temperature adaptation (STA) mechanism, enhancing the suitability of the supervised contrastive loss for fine-grained action recognition. We conduct experiments on several benchmark datasets for fine-grained action recognition, including Epic-Kitchens-55/100, SomethingSomething-V1, and Diving48-V2. The results demonstrate that our proposed method (referred to as LSC-STA) consistently enhances performance across various base feature extractors, without introducing additional inference overhead and incurring only a marginal increase in training expenses.

Abstract:
In skeleton-based action recognition, self-supervised pre-training paradigms have been extensively investigated. In particular, masked autoencoder (MAE)-like methods based on masked target reconstruction, which are devoted to choosing a better reconstruction target, have pushed pre-training performance to new heights. In this work, we propose an asymmetric context-guided adaptive alignment network (ACA2Net) for self-supervised skeleton-based action recognition, in which a transformer-based teacher encoder guides the student encoder to learn richer action contextual information. To tackle the misalignment caused by the asymmetry, we devise an adaptive alignment module to better align the student representations with the teacher’s. Additionally, considering that the differential operation for temporal motion might lose prior information related to changes of direction, we propose a motion compass-aware masking strategy with a fusion prior supplemented by motion and direction intensity. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets demonstrate that our proposed ACA2Net outperforms previous MAE-like methods.

Abstract:
Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field primarily rely on facial landmarks to convey emotional changes. However, although spatial key-points are valuable, they are limited in capturing the intricate dynamics and subtle nuances of emotional expressions due to their restricted spatial coverage. Consequently, this reliance on sparse landmarks can result in decreased accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which seamlessly integrates facial Action Units (AUs) into the generation process. Unlike previous works that solely rely on facial landmarks, ETAU employs both Action Units and landmarks to comprehensively represent facial expressions through interpretable representations. Our method provides a detailed and dynamic representation of emotions by capturing the complex interactions among facial muscle movements. Moreover, ETAU adopts a multi-modal strategy by seamlessly integrating emotion prompts, driving videos, and target images, and by leveraging various input data effectively, it generates highly realistic and emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID and HDTF, ETAU outperforms previous methods, showcasing its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. Moreover, ETAU exhibits a significant improvement in the emotion accuracy of the generated results, reaching an average accuracy of 84% on the MEAD dataset.

Abstract:
The Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks due to its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) presents challenges. ICS ReID requires independent identity labeling within each camera, without associations across cameras. This limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Accordingly, two custom modules are designed to guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. Then, we design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning finer-grained intra-camera pedestrian features. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model’s ability to predict the camera from which a pedestrian image originates, thus enhancing the model’s capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. In particular, on the challenging MSMT17 dataset, we achieve 58.9% mAP, surpassing state-of-the-art methods by 7.6%. Code is available at https://gitee.com/swjtugx/classmate/tree/master/OurGroup/CCAFL.

Abstract:
In this paper, we explore a novel Text-supervised Egocentric Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images, weakly supervised by texts from image-level labels. In this promising task, egocentric scenes contain dense wearer-object relations and inter-object interference. However, most recent third-view methods leverage the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on semantic-oriented third-view data and fails in the egocentric view due to the “relation insensitive” problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations by correlating the image and text. In addition, a Cognition Transferring Module (CTM) is developed to distill cognitive knowledge from the large-scale pre-trained model into our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions, mitigating false activation areas caused by foreground-background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at https://github.com/ZhaofengSHI/CTDN.

Abstract:
The 3D Gaussian Splatting method has recently shown significant advancements in rendering speed and scene composition quality, enhancing its industrial applications and boosting the demand for 3D Gaussian asset generation. However, existing mature 3D generation technologies predominantly rely on implicit representations, which often struggle to balance geometric quality with editability. The production of 3D Gaussian assets generally involves diffusion models that require a dual-stage process of reconstruction and generation, resulting in substantial training and inference costs. To overcome these challenges, we introduce GET3DGS, an innovative approach that combines 3D-aware GANs with 3D Gaussian Splatting representations. This method facilitates the manipulation of the physical attributes of 3D Gaussians, such as geometry and texture, via point deformation fields. Offering faster inference speeds and end-to-end training capabilities, our model outperforms existing diffusion model-based methods. By deriving high-quality Gaussian point cloud geometric representations from 2D images, our approach reduces material accumulation costs and produces data compatible with 3D Gaussian rendering engines. We have evaluated the generative performance of our model on ShapeNet and OmniObject3D and demonstrate competitive results in terms of image and geometric quality relative to previous methods.

Abstract:
Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information of the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method across eight datasets spanning three tasks. The performance demonstrates the effectiveness and versatility of our method.

Abstract:
Most Camouflaged Object Detection (COD) methods heavily rely on mask annotations, which are time-consuming and labor-intensive to acquire. Existing weakly-supervised COD approaches exhibit significantly inferior performance compared to fully-supervised methods and struggle to simultaneously support all the existing types of camouflaged object labels, including scribbles, bounding boxes, and points. Even for Segment Anything Model (SAM), it is still problematic to handle the weakly-supervised COD and it typically encounters challenges of prompt compatibility of the scribble labels, extreme response, semantically erroneous response, and unstable feature representations, producing unsatisfactory results in camouflaged scenes. To mitigate these issues, we propose a unified COD framework in this paper, termed SAM-COD, which is capable of supporting arbitrary weakly-supervised labels. Our SAM-COD employs a prompt adapter to handle scribbles as prompts based on SAM. Meanwhile, we introduce response filter and semantic matcher modules to improve the quality of the masks obtained by SAM under COD prompts. To alleviate the negative impacts of inaccurate mask predictions, a new strategy of prompt-adaptive knowledge distillation is utilized to ensure a reliable feature representation. To validate the effectiveness of our approach, we have conducted extensive empirical experiments on three mainstream COD benchmarks. The results demonstrate the superiority of our method against state-of-the-art weakly-supervised and even fully-supervised methods. Our source codes and trained models will be publicly released.

Abstract:
Unsupervised point cloud registration is crucial in 3D computer vision. However, most unsupervised methods struggle to construct effective optimization objectives and reliable unsupervised signals to enhance the performance of the model. To address these issues, observing the significant alignment between the registration process and the Markov Decision Process (MDP), we model point cloud registration as an MDP, which can provide more reliable unsupervised signals through the reward. We propose a colored-noise-based cross-entropy method, which introduces colored noise into the sampling process, regulating the power spectral density of the action sequence and expanding the search space, thereby improving registration. In particular, to strengthen constraints on the MDP and on training in the transformation space, we utilize equivariance theory to construct a transformation-equivariant constraint as a new optimization objective and derive equivariant constraint solutions for optimization, providing more reliable unsupervised signals. Extensive experiments demonstrate the superior performance of our method on benchmark datasets.
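
As a rough illustration of the sampling idea only, the sketch below runs a cross-entropy method (CEM) over an action sequence whose perturbations are temporally correlated. It is not the authors' implementation: AR(1) noise is swapped in as a simple stand-in for colored noise (its power spectral density is non-flat), and the names `ar1_noise`, `cem_plan`, `beta`, and the toy cost are all hypothetical. In a registration setting the cost would presumably be the negative alignment reward.

```python
import math
import random


def ar1_noise(horizon, beta, rng):
    """Unit-variance correlated noise: e_t = beta * e_{t-1} + sqrt(1 - beta^2) * w_t."""
    noise = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - beta * beta)
    for _ in range(horizon - 1):
        noise.append(beta * noise[-1] + scale * rng.gauss(0.0, 1.0))
    return noise


def cem_plan(cost, horizon, beta=0.7, samples=64, elites=8, iters=20, seed=0):
    """Minimise `cost` over length-`horizon` action sequences with CEM.

    beta = 0 gives white noise; larger beta shifts power to low frequencies.
    AR(1) noise is an assumption standing in for the paper's colored noise.
    """
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate sequences around the current mean.
        candidates = []
        for _ in range(samples):
            noise = ar1_noise(horizon, beta, rng)
            candidates.append([m + s * n for m, s, n in zip(mean, std, noise)])
        # Keep the lowest-cost sequences and refit mean and std to them.
        candidates.sort(key=cost)
        best = candidates[:elites]
        mean = [sum(seq[t] for seq in best) / elites for t in range(horizon)]
        std = [
            max(1e-3, math.sqrt(sum((seq[t] - mean[t]) ** 2 for seq in best) / elites))
            for t in range(horizon)
        ]
    return mean


def toy_cost(seq):
    """Hypothetical cost: squared distance of every action from 1.0."""
    return sum((a - 1.0) ** 2 for a in seq)


plan = cem_plan(toy_cost, horizon=5)
```

The correlation coefficient `beta` is the only knob that shapes the noise spectrum here; a learned or frequency-domain noise generator would replace `ar1_noise` in a fuller treatment.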

Abstract:
Weakly supervised segmentation methods have garnered considerable attention due to their potential to alleviate the need for labor-intensive pixel-level annotations during model training. Traditional weakly supervised nuclei segmentation approaches typically involve a two-stage process: pseudo-label generation followed by network training. The performance of these methods is highly dependent on the quality of the generated pseudo-labels, which can limit their effectiveness. In this paper, we propose a novel domain-adaptive weakly supervised nuclei segmentation framework that addresses the challenge of pseudo-label generation through cross-task interaction strategies. Specifically, our approach leverages weakly annotated data to train an auxiliary detection task, which facilitates domain adaptation of the segmentation network. To improve the efficiency of domain adaptation, we introduce a consistent feature constraint module that integrates prior knowledge from the source domain. Additionally, we develop methods for pseudo-label optimization and interactive training to enhance domain transfer capabilities. We validate the effectiveness of our proposed method through extensive comparative and ablation experiments conducted on six datasets. The results demonstrate that our approach outperforms existing weakly supervised methods and achieves performance comparable to or exceeding that of fully supervised methods. Our code is available at https://github.com/zhangye-zoe/DAWN.

Abstract:
Image compression for both human and machine vision has become prevalent to accommodate rising demands for machine-machine and human-machine communications. Scalable human-machine image compression has recently emerged as an efficient alternative that simultaneously achieves high accuracy for machine vision in the base layer and high-fidelity reconstruction for human vision in the enhancement layer. However, existing methods achieve scalable coding with heuristic mechanisms, which cannot fully exploit the inter-layer correlations and evidently sacrifice rate-distortion performance. In this paper, we propose task-adapted learnable embedded quantization to address this problem in an analytically optimized fashion. We first reveal the relationship between the latent representations for machine and human vision and demonstrate that the optimal representation for machine vision can be approximated with post-training optimization on the learned representation for human vision. On this basis, we propose task-adapted learnable embedded quantization that leverages a learnable step predictor to adaptively determine the optimal quantization step for diverse machine vision tasks, such that inter-layer correlations between representations for human and machine vision are sufficiently exploited using embedded quantization. Furthermore, we develop a human-machine scalable coding framework by incorporating the proposed embedded quantization into pre-trained learned image compression models. Experimental results demonstrate that the proposed framework achieves state-of-the-art performance on machine vision tasks such as object detection, instance segmentation, and panoptic segmentation with negligible loss in rate-distortion performance for human vision.
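
To make the notion of embedded quantization concrete, here is a minimal two-layer scalar sketch: a coarse step yields the base-layer index, and a finer step refines the residual for the enhancement layer. It is an illustration under assumptions, not the paper's method: the steps are fixed by hand rather than predicted by a learnable step predictor, and `embedded_quantize`, `reconstruct`, `base_step`, and `refine_factor` are hypothetical names.

```python
def embedded_quantize(x, base_step, refine_factor):
    """Quantize scalar latent `x` into (base_index, enh_index).

    The base layer uses `base_step`; the enhancement layer refines the
    base-layer residual with the finer step base_step / refine_factor.
    """
    base_index = round(x / base_step)
    fine_step = base_step / refine_factor
    residual = x - base_index * base_step
    enh_index = round(residual / fine_step)
    return base_index, enh_index


def reconstruct(base_index, enh_index, base_step, refine_factor):
    """Return (base-layer value, base + enhancement value)."""
    base = base_index * base_step
    fine = base + enh_index * (base_step / refine_factor)
    return base, fine


# Example: the coarse value is decodable alone; the refinement reuses it.
b, e = embedded_quantize(0.37, base_step=0.5, refine_factor=4)
coarse, fine = reconstruct(b, e, base_step=0.5, refine_factor=4)
```

Because the enhancement index only encodes what the base layer left over, the base-layer bits are shared by both decoders, which is the inter-layer reuse the abstract refers to.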

Abstract:
Real-time object detection in Unmanned Aerial Vehicle (UAV) videos remains a significant challenge due to the fast motion and small scale of objects. Existing streaming perception models struggle to accurately capture fine-grained motion cues between consecutive frames, leading to suboptimal performance in dynamic UAV scenarios. To address these challenges, StreamFlow is proposed to integrate optical flow information and enhance real-time object detection in UAV videos. StreamFlow incorporates Flow-Guided Dynamic Prediction (FGDP) to refine position predictions using local optical flow information and Optical Flow Guided Optimization (OFGO) to optimize model parameters considering both localization loss and optical flow reliability. Central to OFGO is the Adaptive Flow Weighting (AFW) module, which focuses on reliable flow samples during training. The proposed integration of optical flow and adaptive weighting scheme significantly enhances the ability of streaming perception models to handle fast-moving objects in dynamic UAV environments. Extensive experiments on four challenging UAV video datasets demonstrate the superior performance of StreamFlow compared to state-of-the-art methods in terms of accuracy.

Abstract:
The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at https://github.com/lsa1997/ReferSAM.

Abstract:
Image captioning is a fundamental task in computer vision that aims to generate precise and comprehensive descriptions of images automatically. Intuitively, humans initially rely on the image content, e.g., “cake on a plate”, to gradually gather relevant knowledge facts, e.g., “birthday party”, “candles”, a process referred to as divergence. Then, we perform step-by-step reasoning based on the images to refine and rearrange these knowledge facts for explicit sentence generation, a process referred to as focus. However, existing image captioning methods mainly rely on the encoder-decoder framework, which does not fit the “divergence-focus” nature of the task well. To this end, we propose the knowledge “divergence-focus” method for Image Captioning (K-DFIC) to gather and polish knowledge facts for image understanding, which consists of two components: 1) The Knowledge Divergence Module leverages the divergence capability of a large-scale pre-trained model to acquire knowledge facts relevant to the image content. To achieve this, we design a scene-graph-aware prompt that serves as a “trigger” for GPT-3.5, encouraging it to “diverge” and generate more sophisticated, human-like knowledge. 2) The Knowledge Focus Module refines the acquired knowledge facts and rearranges them in a coherent manner. We design an interactive refining network to encode knowledge, which is refined with the visual features to remove irrelevant words. Then, to generate fluent image descriptions, we design a large-scale pre-trained model-based rearrangement method to estimate the importance of each knowledge word for an image. Finally, we fuse the refined knowledge and visual features to assist the decoder in generating captions. We demonstrate the superiority of our approach through extensive experiments on the MSCOCO dataset. Our approach surpasses state-of-the-art performance across all metrics on the Karpathy split. For example, our model obtains the best CIDEr-D score of 148.4%. Additional ablation studies and visualizations further validate the effectiveness of our approach.

Abstract:
Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using a Large Language Model (LLM). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

Abstract:
Click-based interactive segmentation is the most concise and widely used data labeling method. While existing interactive segmentation methods excel in handling simple targets, they encounter challenges in obtaining high-quality masks from some complex scenes, even with a large number of clicks. Also, the cost of retraining the model from scratch for special scenarios is unacceptably high. To address these issues, we propose ClickAdapter, a simple yet powerful interactive segmentation model adapter that does not require pre-training. By introducing a small number of additional parameters and computations, the adapter module effectively enhances the ability of interactive segmentation models to obtain high-quality predictions with limited clicks. Specifically, we incorporate a detail extractor that aims to extract spatial correlations and local detail features of images. These fine-grained data are then integrated into a model with our adapter to generate segmentation masks with sharp and precise edges. During the training process, only the parameters of our adapter are learnable, thereby reducing the training cost. Features in special scenarios can also be infused more efficiently. To verify the efficiency and performance advantages of the proposed method, a series of experiments on a wide range of benchmarks were conducted, demonstrating that the proposed algorithm achieves cutting-edge performance compared to current state-of-the-art (SOTA) methods.

Abstract:
This work explores barely-supervised brain tumor segmentation where minimal supervision, i.e., fewer than ten labeled samples, is available. Current methods often neglect two key problems in barely-supervised segmentation: i) the insufficient labeled data may not be able to offer enough information for networks to accurately segment tumor areas across various cases; ii) networks might overfit to the relation among multiple modalities in the limited labeled data, thus overly depending on certain modalities while overlooking other valuable modalities during segmentation. To tackle these two problems, we propose a barely-supervised training framework, called BarelySAM. BarelySAM first employs the Segment Anything Model (SAM) during training by generating pseudo labels for unlabeled data. In this manner, pre-trained knowledge in SAM can be exploited to compensate for the limited knowledge in labeled data, boosting network training and thus improving performance. For the overfitting problem, Multi-modality Dependency Minimization (MDM) is designed in BarelySAM to construct various partial combinations for full-modal samples, thus forcing networks to exploit each modality effectively. Experimental results on two benchmark datasets validate the effectiveness of the integrated SAM and the designed MDM module. In particular, our method attains an 89.92% Dice score for whole tumor segmentation on BRATS2020 with just 6 (2%) labeled samples, only 1.09% lower than the performance of a fully supervised approach. Moreover, experiments on barely-supervised multi-modal brain tumor segmentation also validate that our method is inherently robust against missing modalities.
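
The idea of constructing partial modality combinations, and the Dice score used for evaluation, can be sketched as follows. This is a minimal illustration under assumptions, not the MDM module itself: the four modality names are the usual BraTS-style channels, zero-filling of dropped modalities is a guess at one plausible masking scheme, and `partial_combinations`, `mask_sample`, and `dice_score` are hypothetical helper names.

```python
from itertools import combinations

# Assumed modality names (typical for BraTS-style multi-modal MRI).
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def partial_combinations(modalities):
    """Return every non-empty subset of the modalities, in order of size."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


def mask_sample(sample, keep):
    """Zero out every modality of `sample` that is not listed in `keep`.

    `sample` maps a modality name to a flat list of voxel intensities.
    Zero-filling is an assumption; other schemes are possible.
    """
    return {
        name: list(values) if name in keep else [0.0] * len(values)
        for name, values in sample.items()
    }


def dice_score(pred, target):
    """Dice = 2|P and T| / (|P| + |T|) for flat binary masks (1 = tumor)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

With four modalities there are 15 non-empty subsets, so each full-modal sample can be presented to the network in up to 15 partial forms.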

Abstract:
3D Gaussian Splatting (3DGS) has become an emerging tool for dynamic scene reconstruction. However, existing methods mainly focus on developing various strategies to extend static 3DGS into a time-variant representation, while overlooking the rich motion information implicitly carried by 2D observations, thus suffering from performance degradation and model redundancy. To address the above problem, we propose a novel motion-aware enhancement framework for dynamic scene reconstruction, which mines useful motion cues from optical flow to improve different paradigms of dynamic 3DGS. Specifically, we first step beyond the vanilla render-based cross-dimensional supervision that suffers from ambiguity and instability, and establish a more robust and effective dense correspondence between 3D Gaussian movements and pixel-level flows. Then a novel flow augmentation method is introduced with additional insights into uncertainty and loss collaboration. Furthermore, for the prevalent deformation-based paradigm that presents a harder optimization problem, a transient-aware deformation auxiliary module is proposed. We conduct extensive experiments on both multi-view and monocular scenes to verify the merits of our work. Compared with the baselines, our method shows significant superiority in both rendering quality and efficiency. The code will be publicly available at https://github.com/jasongzy/MAGS.

Abstract:
Recently, large-scale synthetic datasets have effectively alleviated the issue of insufficient person re-identification (Re-ID) datasets. However, synthetic datasets grapple with inherent challenges, including the subpar quality of synthetic pedestrians and single data collection. This paper presents InfinitePerson, a costless pipeline that fully utilizes the infinite generation capability of diffusion models to produce diverse UV texture images and effortlessly constructs high-quality synthetic datasets by simulating a real surveillance network. Specifically, we propose using diffusion models to generate high-quality, realistic, and diverse UV texture images to address the limitations of clothing textures. This ensures that our 3D character models have complete clothing texture information and look very similar to real-world pedestrians. Moreover, in response to the challenges in replicating synthetic data collection pipelines, we propose a sub-monitoring network data collection method, which can collect pedestrian data from different viewpoints, backgrounds, and lighting conditions through a simple scene layout. Finally, a more scalable and realistic large synthetic dataset called InfinitePerson is created, containing 4,700 identities and 535,636 images. Experimental evidence shows that models trained on InfinitePerson exhibit superior generalization performance, surpassing those trained on both popular real-world and synthetic person Re-ID datasets. The InfinitePerson project is available at https://github.com/zhguoqing/InfinitePerson.

Abstract:
Real-time video matting is essential for applications like online video conferencing but faces challenges in human-object interaction (HOI) scenarios, known as the HOI-matting problem. This problem is challenging due to its open-recognition nature, where no dataset can cover the wide range of potential HOI cases, making it difficult for feature-learning-based methods to generalize effectively. To address this issue, we present an HOI-matting dataset and introduce a Model-Agnostic Meta-Learning-based rule-aware learning approach (MAML-RAL). MAML-RAL combines transfer learning and meta-learning to capture domain-invariant HOI rules, complemented by a fast local adaptation strategy to counter domain shifts and background interference. Our method achieves a mean intersection-over-union (mIoU) of 92.3%, outperforming current algorithms, with local adaptation further boosting performance to a remarkable mIoU of 95.84%.

Abstract:
In recent years, significant progress has been achieved in urban dense prediction tasks, particularly with advancements in deep learning models and novel architectures that enhance segmentation accuracy and computational efficiency. However, the following challenges persist: i) Existing modal fusion methods typically adopt convolutional neural networks (CNNs) or transformer (Trans)-based methods, which lead to inadequate global modeling or excessive computation owing to the introduction of quadratic complexity modeling; and ii) existing dense prediction networks typically utilize discriminative networks (codecs), which result in networks with insufficient discriminative properties. To address these issues, we propose the Mamba-effective diffusion-distillation network (MDNet) for RGB-thermal urban dense prediction. First, a new Mamba-effective fusion module is proposed, which efficiently models long-range pixel-level features using Mamba and generates pixel-level adaptive weights to fully utilize complementary modal information. Second, inspired by human self-reflection, a new diffusion self-distillation (DSD) strategy is proposed. The DSD generates coarse-grained binary semantic information via conditional multimodal image diffusion, which serves as self-distillation labels to improve the discriminative properties of the network. Experimental results demonstrate that the proposed MDNet achieves state-of-the-art performance on the MFNet dataset with fewer parameters and reduced computational effort. Extended experiments on the PST900 dataset further illustrate the effectiveness and generalizability of MDNet. The source code and results are available at https://github.com/Tortoisewhp/MDNet.

Abstract:
Referring Image Segmentation (RIS) aims to semantically segment the target object (referent) in alignment with the provided natural language query. Existing works still mistakenly segment non-referent objects, which can be attributed to insufficient comprehension of vision and language. To tackle this problem, we propose a Cross-Modal Interactive Reasoning Network (CMIRNet) to explore the semantic information that is consistent between vision and language. Specifically, we first devise a novel Text-Guided Multi-Modality Joint Encoder (TGMM-JE), where the key expression is extracted and the important visual features are encoded under the continuous guidance of the language expression. Then, we design a Cross-Graph Interactive Positioning (CGIP) module to locate the key pixels of the referent object in the deepest layer. Multi-modality graph data is constructed between visual and linguistic features, and the important pixels can be located through cross-graph interaction and intra-graph reasoning. Finally, a novel Cross-Modal Attention Enhanced DEcoder (CMAE-DE) progressively refines the referent object mask from coarse to fine, where hybrid cross-modal attentions are explored to enhance the representation of the referent object. Extensive ablation studies validate the efficacy of our key modules, and comprehensive experimental results show the superiority of our proposed model over 22 state-of-the-art (SOTA) models.

Abstract:
Multi-focus image fusion aims to integrate clear segments from different partially focused images, creating an ‘all-in-focus’ composite. Due to the lack of ground-truth for multi-focus image fusion, supervised deep learning methods are deemed inappropriate for this task. In this paper, we present an unsupervised approach for multi-focus image fusion, named Fusion2Void. Fusion2Void ingeniously tackles the challenge of missing ground-truth by framing image inpainting as an auxiliary task. Specifically, Fusion2Void utilizes a fusion network to merge focused regions from multiple source images. Following the fusion process, image patches in the source images are randomly dropped to construct an additional image inpainting task. Subsequently, an image inpainting network uses the fused image as a guide to restore the missing content in the source images. The missing content in the source images includes both focused and defocused regions. Restoring focused image patches is significantly more challenging than restoring their defocused counterparts due to their inclusion of more high-frequency details. If the focused image patches are effectively restored, the repair of the defocused image patches becomes notably easier. Therefore, the image inpainting network implicitly compels the fused image to incorporate all focused content from the source images, as these can be utilized to restore the missing focused regions in the source images perfectly. Based on image inpainting, the fusion network generates ‘all-in-focus’ images in an unsupervised manner. Experiments on several synthetic and real-world datasets highlight Fusion2Void’s state-of-the-art performance relative to other methods.
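
The patch-dropping step that creates the auxiliary inpainting task can be sketched as below. This is only an illustration under assumptions: the patch size, the drop probability, zero-filling of dropped pixels, and the function name `drop_patches` are all hypothetical, and the fusion and inpainting networks themselves are out of scope.

```python
import random


def drop_patches(image, patch, drop_prob, rng):
    """Zero out randomly chosen patch x patch blocks of a 2D image.

    `image` is a list of rows of floats. Returns (masked_image, mask),
    where mask[r][c] == 1 marks a dropped pixel that an inpainting
    network would be asked to restore. The input image is not modified.
    """
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]
    mask = [[0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < drop_prob:
                for r in range(top, min(top + patch, height)):
                    for c in range(left, min(left + patch, width)):
                        masked[r][c] = 0.0
                        mask[r][c] = 1
    return masked, mask


# Example: drop roughly half of the 2x2 patches of a 4x6 source image.
source = [[float(r * 6 + c + 1) for c in range(6)] for r in range(4)]
masked, mask = drop_patches(source, patch=2, drop_prob=0.5, rng=random.Random(0))
```

The returned mask would let a reconstruction loss be restricted to the dropped regions, which is where the fused image has to supply the missing focused content.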

Abstract:
The family of regularization by denoising (RED) methods introduces a denoising operator as the regularization term to perform compressed sensing (CS) reconstruction, which offers higher flexibility and scalability. However, the traditional RED framework imposes strict requirements on several properties of the denoiser, which makes it hard to design a specific denoiser and limits the quality of reconstructed images. Although these requirements can be relaxed by incorporating fixed-point projection into the iteration process, the parameters involved have a great impact on the effectiveness and efficiency of the algorithm, and setting them properly is non-trivial. In this paper, we propose an innovative Deep Unfolding Network framework termed FP-DUN based on the iterative process of Regularization by Denoising via Fixed-Point Projection (RED-PRO). In FP-DUN, the fixed-point projection module is implemented with learnable neural network weights, and an effective denoiser based on a dual attention mechanism (DAM) is developed to capture the details of the reconstructed image. Additionally, we propose a new loss function based on fixed-point constraints, which overcomes the over-smoothing caused by multi-stage denoising and maintains structural details to progressively improve the reconstruction quality. By training the DUN model, the parameters of the fixed-point projection and the denoiser are learned automatically. Extensive experimental comparisons with state-of-the-art CS algorithms and the traditional RED-PRO approach validate the effectiveness of FP-DUN, especially on images with complex details.

Abstract:
Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an “Unfolding Attribution Analysis Fusion network” (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high-level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method’s superiority in image fusion and applicability to semantic segmentation. The code is available at https://github.com/HaowenBai/UAAFusion.

Abstract:
The progression of medical image analysis methodologies has significantly assisted fundus clinical decision-making, such as disease diagnosis and lesion segmentation. However, low-quality fundus images bring a series of challenges to the automatic screening of diseases and the segmentation of lesions. Most existing methods enhance image quality under the supervision of paired fundus images, which are difficult to collect in real medical applications. High-quality reference images are essential for guiding quality enhancement. To this end, we propose an enhancement method for low-quality fundus images, called RF-IQE, which alleviates the requirement for paired training images and requires only low-quality fundus images. Specifically, we first construct the patch-level high-/low-quality domains by employing a rule-based quality assessment scheme. Then, to achieve fundus image quality enhancement and unified illumination styles simultaneously, we formulate them as a patch quality domain adaptation and a multi-style domain adaptation, respectively. We qualitatively and quantitatively demonstrate that our reference-free image quality enhancement network outperforms the conventional methods and performs comparably to deep learning-based image enhancement methods trained with paired images on both the EyeQ and Messidor datasets. Furthermore, we investigate the influence of the RF-IQE method on various fundus image analysis tasks, including vessel segmentation, optic disc segmentation, lesion segmentation, and disease classification.

Abstract:
Domain generalization aims at learning a model with transferable knowledge from one or more source domains in the presence of domain shift, enabling the model to generalize effectively to an unseen target domain. Most existing methods pursue domain-invariant representations of samples to address the challenges of heterogeneous distributions across domains. However, most such methods are limited to simple data manipulation at the instance level or to computing style statistics in feature space for distribution alignment. Such operations fail to effectively capture the contextual semantics across domains from both the intra- and inter-domain views. In this paper, we propose contextual Distribution Alignment via a Contrastive Learning strategy with domain correlation, called DACL, which fully exploits both intra- and inter-domain invariant representations for image domain generalization classification. First, a new Fourier-based augmentation method is developed to capture high-level semantic invariant features. Second, a domain-based feature fusion module is proposed to increase the diversity of features; it extracts both intra- and inter-domain prototypes via clustering to learn cross-domain representations. Finally, we propose a contrastive learning strategy that takes domain correlation into account and uses spatial second-order statistics as a metric to measure the relevance between multiple source domains. Extensive experiments are conducted on two domain generalization tasks over six benchmarks, demonstrating that DACL achieves state-of-the-art performance against baseline models. A series of ablation studies and in-depth visualization analyses further verify the rationality and effectiveness of the proposed method.

Abstract:
Recent years have witnessed the success of deep learning-based techniques in no-reference point cloud quality assessment (NR-PCQA). For more accurate quality prediction, many previous studies have attempted to capture global and local features in a bottom-up manner, but have ignored the interaction and mutual promotion between them. To solve this problem, we propose a novel asynchronous feedback quality prediction network (AFQ-Net). Motivated by human visual perception mechanisms, AFQ-Net employs a dual-branch structure to deal with global and local features, simulating the left and right hemispheres of the human brain, and constructs a feedback module between them. Specifically, the input point clouds are first fed into a transformer-based global encoder to generate attention maps that highlight semantically rich regions, which are then merged into the global feature. Then, we utilize the generated attention maps to perform dynamic convolution for different semantic regions and obtain the local feature. Finally, a coarse-to-fine strategy is adopted to merge the two features into the final quality score. We conduct comprehensive experiments on three datasets and achieve superior performance over the state-of-the-art approaches on all of them. The code will be available at https://github.com/zhangyujie-1998/AFQ-Net.

Abstract:
End-to-end optimization via deep neural networks has facilitated lossy image compression. Existing neural network-based entropy models for end-to-end optimized image compression are limited by parameterized Gaussian distributions with deterministic mean and variance, and cannot achieve accurate rate estimation for bottleneck representations with varying statistics. In this paper, we propose a novel entropy model based on deep Gaussian process regression (DGPR) to address this problem. Specifically, the proposed entropy model leverages autoregressive DGPR to flexibly predict the channel-wise posterior distributions of the high-dimensional bottleneck representation for entropy coding. Consequently, we develop a well-established bit-rate estimation scheme via posterior inference of DGPR using the learned probabilistic distribution. Furthermore, scalable training is achieved via tensor train decomposition and Monte Carlo sampling to enable tractable variational inference of DGPR. To the best of our knowledge, this paper is the first attempt to develop a learnable probabilistic model for flexible parameter estimation in entropy modeling. Experimental results show that the proposed model outperforms conventional image compression methods (e.g., JPEG2000 and BPG) as well as recent end-to-end optimized methods on the Kodak and Tecnick datasets in terms of rate-distortion performance.

Abstract:
In HTTP Adaptive Streaming (HAS), each video is divided into smaller segments, and each segment is encoded at multiple pre-defined bitrates to construct a bitrate ladder. To optimize bitrate ladders, per-title encoding approaches encode each segment at various bitrates and resolutions to determine the convex hull. From the convex hull, an optimized bitrate ladder is constructed, resulting in an increased Quality of Experience (QoE) for end-users. With the ever-increasing efficiency of deep learning-based video enhancement approaches, they are more and more employed at the client-side to increase the QoE, specifically when GPU capabilities are available. Therefore, scalable approaches are needed to support end-user devices with both CPU and GPU capabilities (denoted as CPU-only and GPU-available end-users, respectively) as a new dimension of a bitrate ladder. To address this need, we propose DeepStream, a scalable content-aware per-title encoding approach to support both CPU-only and GPU-available end-users. (i) To support backward compatibility, DeepStream constructs a bitrate ladder based on any existing per-title encoding approach. Therefore, the video content will be provided for legacy end-user devices with CPU-only capabilities as a base layer (BL). (ii) For high-end end-user devices with GPU capabilities, an enhancement layer (EL) is added on top of the base layer comprising lightweight video super-resolution deep neural networks (DNNs) for each bitrate-resolution pair of the bitrate ladder. A content-aware video super-resolution approach leads to higher video quality, however, at the cost of bitrate overhead. To reduce the bitrate overhead for streaming content-aware video super-resolution DNNs, DeepCABAC, context-adaptive binary arithmetic coding for DNN compression, is used. Furthermore, the similarity among (i) segments within a scene and (ii) frames within a segment are used to reduce the training costs of DNNs. 
Experimental results show bitrate savings of 34% and 36% to maintain the same PSNR and VMAF, respectively, for GPU-available end-users, while the CPU-only users get the desired video content as usual.

Abstract:
In this age of information, images are a critical medium for storing and transmitting information. With the rapid growth in the amount of image data, visual compression and visual data perception have become two important research topics that attract a lot of attention. However, the two topics are rarely discussed together and follow separate research paths. Thanks to the compact compressed-domain representation offered by learning-based image compression methods, it becomes possible to use a single stream that serves both efficient data storage and compression and machine perception tasks. In this paper, we propose a layered generative facial image compression model that achieves high reconstruction quality for human vision, even at extreme compression ratios. To obtain analysis efficiency and flexibility, a task-agnostic learning-based compression model is proposed, which effectively supports various compressed-domain analytical tasks while preserving outstanding reconstructed perceptual quality compared with traditional and learning-based codecs. In addition, a joint optimization schedule is adopted to find the best balance among compression ratio, reconstructed image quality, and downstream perception performance. Experimental results verify that our proposed compressed-domain multi-task analysis method achieves analysis results comparable to those of RGB image-based methods with up to 99.6% bit rate saving (i.e., compared with taking the original RGB image as the analysis model input). The practicality of our model is further justified in terms of model size and information fidelity.

Abstract:
Adversarial attacks pose a huge challenge to the deployment of deep neural networks (DNNs) in security-sensitive applications. Adversarial defense methods are developed to resist adversarial perturbation. However, most defenses overlook generalization to various attacks. In the medical field, targeted therapy is a treatment approach at the cellular and molecular levels that targets already identified carcinogenic sites. Inspired by the popular targeted therapies for cancer, we view adversarial attacks as local lesions of natural benign samples. The mechanism behind this assumption implies our key finding that the salient attack components in an adversarial sample dominate the attacking process, while trivial attack components unexpectedly provide trustworthy evidence for obtaining generalizable robustness. Based on this finding, an explainable yet efficient Adversarial Surgery and Regeneration (ASR) model following the targeted therapy mechanism is developed to improve the adversarial generalization of DNNs, which has three merits: 1) A score-based Pixel Surgery (PS) module is proposed to remove the salient attack components while retaining the trivial attack components as a kind of attack-invariant information. 2) A Semantic Regeneration (SR) module based on a conditional alignment extrapolator is proposed to restore the discriminative content from the attack-free trivial components, which achieves pixel and semantic consistency for adversarial samples. 3) To further harmonize robustness and accuracy, an intractable problem in adversarial defense, a self-augmentation regularizer with adversarial R-drop (ARD) is designed. Experiments on numerous benchmarks show the superiority of the proposed ASR approach. The code can be found at https://github.com/fxw13/ASR.

Abstract:
Learning local features is a fundamental task for many computer vision applications. Existing methods often struggle to maintain robustness and accuracy in extracting local features, especially in complex environments with numerous interfering objects. Although some studies have integrated semantic information into local feature extraction networks to enhance discrimination, their effectiveness remains limited. Therefore, this paper fully considers the importance of semantic information for feature extraction and proposes a semantically enhanced local feature extraction network framework. This framework includes a local feature network, a semantic segmentation network, and a reinforcement learning framework. Semantic information is incorporated into feature heatmaps and feature descriptors to improve the accuracy of feature points. Subsequently, the local feature network is continuously optimized by a reinforcement learning algorithm based on semantic information and matching ground truth to enhance robustness, ensuring that the final local features achieve optimal performance. Extensive experiments on three publicly available datasets validate the effectiveness of the proposed local feature network.

Abstract:
Video question answering (VideoQA) is the challenging task of accurately responding to natural language questions based on a given video. Most previous methods focus on designing complex cross-modal interactions to perform question-oriented video scene mining and semantic reasoning, and utilize straightforward classification and matching strategies with different decoders to forcibly associate the predicted representation with the ground-truth answer. However, the limitations of question-oriented reasoning and the overlapping semantic co-occurrences between questions and candidates may cause them to fall into spurious correlation reasoning. In this paper, we propose a Collaborative aware Bidirectional Semantic Reasoning (CBSR) model to alleviate this challenging problem. Specifically, we first propose a collaborative aware adaptive correlation reasoning module to collaboratively mine multi-granularity text-aware critical video scenes and reason about the complex intrinsic correlations between them via bottom-up cross-granularity adaptive aggregation. By progressively performing video reasoning from object-level to frame-level, we can obtain a set of semantically rich critical video representations. Then, we collaboratively decode it together with question and knowledge semantics into an implicit representation through the proposed unified answer semantic collaborated decoding module. Finally, a novel bidirectional semantic reasoning learning strategy is proposed to bridge and strengthen the unique positive semantic correlation between the learned implicit representation and the ground-truth answer, and explicitly alleviate the challenge of overlapping semantic co-occurrence. Benefiting from the same model structure and learning strategy, our method can achieve seamless transfer between Open-Ended and Multi-Choice tasks. Extensive experimental results on seven commonly tested datasets (i.e., MSVD-QA, MSRVTT-QA, NExT-QA, Causal-VidQA, NExT-OOD, ActivityNet-QA and EgoSchema) verify the superior performance of our method and the effectiveness of each reasoning module. We provide our source codes and experimental datasets at https://github.com/XizeWu/CBSR.

Abstract:
Few-shot video object segmentation (FSVOS) aims to achieve accurate segmentation of novel objects in given video sequences, where the target objects are specified by limited annotated images as support. Most previous top-performing methods adopt the support-query semantic correlation learning paradigm or the intra-query temporal correlation learning paradigm. Nevertheless, they either fail to model temporal consistency across frames, resulting in inconsecutive segmentation, or lose diverse support object information, leading to incomplete segmentation. Therefore, we argue that it is more desirable to achieve both correlations in a collaborative manner. In this work, we delve into the issues present in the combination of few-shot image segmentation methods and video object segmentation methods and propose a dedicated Collaborative Correlation Network (CoCoNet) to address these problems, including a pixel correlation calibration module and a temporal correlation mining module. The proposed CoCoNet enjoys several merits. First, the pixel correlation calibration module aims to mitigate the noise issue in support-query correlation by integrating the affinity learning strategy and the prototype learning strategy. Specifically, we employ Optimal Transport to enrich pixel correlation with contextual information, thereby reducing intra-class differences between support and query. Second, the temporal correlation mining module is responsible for alleviating the issue of uncertainty in the initial frame and establishing reliable guidance for subsequent frames of the query video. With the collaboration of these two modules, our CoCoNet can effectively establish support-query and temporal correlation simultaneously and achieve accurate FSVOS. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art FSVOS methods.

Abstract:
The main challenge for Few-Shot Fine-Grained (FSFG) image classification is to learn discriminative feature representations with few labeled samples. In response to this challenge, task-aware few-shot learning methods have been introduced. However, existing approaches focus only on how to correlate task information with feature representations, while overlooking two critical issues. The first one is how to obtain accurate task representations with few labeled samples, to accurately get the feature region related to the task. The second one is how to reduce the impact of the background noise introduced in the process of acquiring feature regions, to alleviate the overfitting problem under the few-shot setting. To address these issues, we propose the Adaptive Task-Aware Refining Network (ATR-Net). Unlike previous approaches that use the center of class as task representations, ATR-Net enables the model to adaptively select task-specific information by interacting with the patches of all local features, resulting in a more accurate task representation. Moreover, a Channel Region-Aware Module (CRAM) and a Refinement Filtering Module (RFM) are designed to better acquire task-level information and instance-level information, to overcome the impact of background noise. We conduct extensive experiments on four public fine-grained datasets. The results demonstrate that the proposed method achieves superior performance. Our codes are anonymously available at: https://anonymous.4open.science/r/ATR-Net.

Abstract:
Hyperspectral image (HSI) clustering has attracted increasing attention in recent years because it does not rely on labeled pixels. However, it is a challenging task due to the complex spectral-spatial structure. The emergence of large-scale HSIs introduces a new challenge in terms of heightened computational complexity. To address the above challenges, in this paper, we propose a structured anchor projected clustering (SAPC) model for large-scale HSIs. Specifically, we exploit the spatial information reflected in the generated superpixels to perform denoising and generate anchors. Based on this preprocessing, we simultaneously learn a pixel-anchor graph and an anchor-anchor graph in a projected feature space. Meanwhile, a rank constraint is imposed on the Laplacian matrix related to the anchor-anchor graph. To uncover the clustering structure, we design a clustering inference strategy to propagate clustering labels from anchors to pixels based on the dual graphs. Additionally, we propose an efficient optimization strategy for the formulated SAPC model with linear time complexity in terms of the number of pixels. Since the anchor-anchor graph is much smaller, the structured anchors with pseudo labels can be obtained very efficiently. Thus, the clustering process is significantly accelerated. Extensive experiments on multiple large-scale HSI datasets demonstrate the superiority of our SAPC over the state-of-the-art methods. The source code is released at https://github.com/ZhangYongshan/SAPC.

Abstract:
The inherent imaging properties of sensors result in two distinct differences between the data from the two modalities in RGB-T Salient Object Detection (SOD) tasks: differences in imaging effectiveness due to varying sensitivities to specific scenes, and fundamental domain differences resulting from differences in how scene characteristics are reflected. Existing methods primarily focus on pursuing unique cross-modal fusion designs to enhance model performance. However, not only do direct cross-modal fusion modes fail to improve the effectiveness of original features, but intricate cross-modal fusion designs also increase the domain differences between modalities, thereby resulting in suboptimal performance. Therefore, in this paper, we no longer insist on pursuing unique cross-modal fusion designs but instead consider how to enhance the effectiveness of original features within modalities (mitigating differences in imaging effectiveness) and utilize a concise cross-modal fusion mechanism (alleviating the impact of domain differences) to achieve satisfactory performance. In this spirit, we propose the Intra-modality Self-enhancement Mirror Network (ISMNet) for RGB-T salient object detection. The core of ISMNet is the proposed Intra-modality Cross-scale Self-enhancement Module (ICSM). The main insight of ICSM is to exploit saliency clues by modeling the correlation between intra-modality cross-scale features (which exhibit strong correlations and small domain differences), thereby enhancing the effectiveness of original multi-scale features within modalities. We employ the proposed paradigm to mirror-expand existing typical paradigms and obtain a more robust model architecture. Extensive experiments demonstrate that the proposed architecture and the universal Intra-modality Cross-scale Self-enhancement Module effectively improve the effectiveness of original features and help achieve state-of-the-art performance.

Abstract:
Prediction-error expansion (PEE) is the most efficient approach in reversible data hiding (RDH). However, in PEE, to ensure the reversibility, significant distortion is introduced since many pixels are shifted without embedded data. Based on this consideration, a novel double-layered RDH framework called S+PEE is proposed in this paper. Unlike the conventional PEE, by S+PEE, shifted pixels can also be utilized for carrying secret data. The secret data is embedded in the first embedding layer with steganography and a specifically designed PEE-like mechanism. Then, to ensure the reversibility, the irreversible modifications introduced by the first embedding layer are recorded and embedded in the second embedding layer. Moreover, the corresponding capacity-distortion model is established to minimize the embedding impact, so that the marked image quality can be optimized. Experimental results demonstrate that the proposed method can provide high marked image quality, and it outperforms some state-of-the-art methods.
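
To make the shifting cost concrete, here is a minimal sketch of conventional PEE (not the proposed S+PEE) on a 1-D pixel sequence. It assumes a left-neighbor predictor, a single expandable bin at error 0, and pixel values far enough from 0 and 255 that overflow handling can be ignored; the function names `pee_embed` and `pee_extract` are hypothetical.

```python
def pee_embed(pixels, bits):
    """Embed bits into a 1-D pixel list with conventional PEE (sketch)."""
    marked = [pixels[0]]                 # first pixel is kept as the prediction seed
    k = 0                                # number of capacity slots consumed
    for i in range(1, len(pixels)):
        e = pixels[i] - pixels[i - 1]    # prediction error, predictor = left neighbor
        if e == 0:                       # expandable bin: carries one bit
            b = bits[k] if k < len(bits) else 0   # pad with 0 once the payload ends
            k += 1
            e_marked = b
        elif e > 0:                      # shifted by 1 to keep decoding unambiguous;
            e_marked = e + 1             # this pixel is distorted but carries no data
        else:                            # negative errors are left untouched
            e_marked = e
        marked.append(pixels[i - 1] + e_marked)
    if k < len(bits):
        raise ValueError("payload exceeds embedding capacity")
    return marked


def pee_extract(marked, n_bits):
    """Recover the original pixels and the first n_bits embedded bits."""
    restored = [marked[0]]
    bits = []
    for i in range(1, len(marked)):
        e_marked = marked[i] - restored[i - 1]   # left neighbor is already restored
        if e_marked in (0, 1):           # expanded bin: the error itself is the bit
            bits.append(e_marked)
            e = 0
        elif e_marked > 1:               # undo the shift
            e = e_marked - 1
        else:
            e = e_marked
        restored.append(restored[i - 1] + e)
    return restored, bits[:n_bits]
```

In this sketch every pixel with a positive prediction error is modified without carrying any payload, which is the distortion source that the abstract says S+PEE puts to use.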

Affiliations: School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; School of Information and Communication Engineering, Xi’an Jiaotong University, Xi’an, China; School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China; Ministry of Education Key Laboratory for Intelligent Networks and Network Security, School of Information and Communication Engineering, and SMILES Laboratory, Xi’an Jiaotong University, Xi’an, China

Abstract:
Super-resolution (SR) aims to restore a high-resolution (HR) image from its low-resolution (LR) counterpart. Existing works try to achieve an overall average recovery over all regions to provide better visual quality for human viewing. If super-resolution is instead intended for machine recognition rather than human viewing, the solution should change accordingly. Based on this insight, we propose a new SR pipeline, called InstanceSR, which treats each region in the LR image differently and devotes more resources to the recovery of the foreground regions where instances exist. In particular, InstanceSR consists of an encoder that converts the LR image into a set of tokens of varying difficulty according to the distribution of instances in each sub-region, and a decoder based on a multi-exit network structure that recovers the sub-regions corresponding to these tokens with different amounts of computation. Extensive quantitative and qualitative evaluations on three widely used benchmarks containing small instances demonstrate the superiority of the proposed InstanceSR over state-of-the-art models, especially in recovering regions where instances exist. In addition, comparisons using the SR results on three challenging small object detection benchmarks verify that InstanceSR consistently boosts detection accuracy and has great potential for subsequent machine recognition.

Abstract:
In this paper, we present two novel approaches for improving intra and inter chroma prediction in video coding. Our research demonstrates that treating the cross-component predictor as a two-dimensional convolutional model can significantly enhance chroma prediction performance. The two proposed convolutional models incorporate multiple spatial neighbors, a bias term, and a nonlinear term. For intra-coded blocks, we derive the model coefficients from the reconstructed neighborhood of the block, while for inter-coded blocks, the model coefficients are determined using prediction samples. To evaluate our methods, we implemented them on top of the ECM software that is currently under exploration by the ITU-T/ISO/IEC Joint Video Experts Team. Over ECM-5.0, our intra cross-component predictor achieves BD-rate savings of −1.47%, −2.90%, and −3.02% (Y, U, V) for the all-intra configuration and −0.92%, −2.04%, and −2.32% (Y, U, V) for the random access configuration. Over ECM-9.0, our inter cross-component predictor achieves BD-rate savings of −0.09%, −1.25%, and −1.46% for the random access configuration and −0.04%, −3.42%, and −3.85% for the low-delay B configuration. Both proposed methods have been adopted into the ECM software.

Affiliations: Department of Information Security, Naval University of Engineering, Wuhan, Hubei, China; School of Computer Science, Wuhan University, Wuhan, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China; School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China

Abstract:
As an intermediate task in computer vision, multiple pedestrian tracking (MPT), which aims to track the pedestrians in a given video, has attracted attention due to its potential academic and commercial value. However, pedestrians commonly suffer from occlusion in diverse and complex scenarios, which increases the difficulty of this task. This survey provides a comprehensive review of the occlusion scenarios encountered during MPT and investigates the robustness of existing methods in these scenarios. First, this survey introduces the various states of occlusion. Second, the related occlusion datasets are introduced. Subsequently, we categorize existing occlusion handling methods according to the tracking process and detail their pros and cons. In addition, an occlusion handling precision (OHP) metric is proposed to evaluate the ability of a tracker to handle occlusion. Moreover, comprehensive analyses and discussions on several public datasets are provided to verify the effectiveness of these methods. Finally, the existing issues and future directions for occlusion handling methods are discussed. In doing so, this work serves as a foundation for future research by providing researchers with information about occlusion handling methods in MPT.

Abstract:
We explore the impact of transformers on accurate and reliable salient object detection. For accuracy, we integrate the transformer with a deterministic model and delineate its advantages in structural modeling. Regarding reliability, we address the transformer’s tendency to produce overly confident, incorrect predictions. To gauge reliability implicitly, we introduce a latent variable model within the transformer framework, termed the inferential generative adversarial network (iGAN). The stochastic nature of the latent variable facilitates the estimation of predictive uncertainty, which serves as an auxiliary measure of the model’s prediction reliability. Unlike the conventional GAN, which fixes the distribution of the latent variable to the standard normal distribution $\mathcal{N}(0, \mathbf{I})$, the proposed iGAN infers the latent variable by gradient-based Markov Chain Monte Carlo (MCMC), namely Langevin dynamics, leading to an input-dependent latent variable model. We apply the proposed iGAN to fully supervised salient object detection, showing that iGAN within the transformer framework leads to both accurate and reliable salient object detection. The source code and experimental results are publicly available via our project page: https://npucvr.github.io/TransformerSOD.
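
As a reminder of what Langevin dynamics does here, the generic update for sampling a latent variable $z$ from a target density $\pi(z)$ is sketched below. The abstract does not state the exact target density or step schedule used by iGAN, so $\pi$, the step size $s$, and the number of steps are assumptions of this sketch.

```latex
z_{t+1} = z_t + \frac{s^{2}}{2}\,\nabla_{z}\log \pi(z_t) + s\,\epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, \mathbf{I}), \qquad z_0 \sim \mathcal{N}(0, \mathbf{I})
```

When $\pi$ depends on the input image, the resulting samples of $z$ are input-dependent, in contrast to drawing $z$ once from a fixed $\mathcal{N}(0, \mathbf{I})$.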

Abstract:
The recent one-to-one label assignment plays a crucial role in removing the last non-differentiable component, i.e., Non-Maximum Suppression (NMS), used in the post-processing step of the one-to-many label assignment, thus building an efficient end-to-end detection system. However, due to the limited number of foreground samples, the one-to-one label assignment often suffers from insufficient representation learning, and its performance is inferior to that of traditional detectors trained using the one-to-many label assignment. To solve these problems, we introduce a novel Dynamic Hybrid Label Assignment (DHLA) method, including a Hybrid Sample Selection (HSS) strategy and a Stage-aware Soft-label Adjustment (SSA) mechanism. In order to enhance the ability of representation learning of the one-to-one label assignment, the HSS strategy subtly integrates the one-to-many and the one-to-one label assignment rules to form a simple and effective hybrid assignment rule, where high-quality samples are selected for training according to an effective task consistency metric. Moreover, the SSA mechanism dynamically adjusts the contributions of different foreground samples at different training stages, thus effectively achieving the transition from one-to-many to one-to-one label assignment. In addition, we leverage a ranking loss function to widen the score gaps between the highest scoring position and surrounding areas for effectively removing duplicate bounding boxes. As a result, our method not only learns robust feature representations during training but also performs efficient end-to-end detection during inference. Extensive experiments demonstrate our method achieves competitive performance compared to state-of-the-art detectors on the challenging COCO and CrowdHuman datasets.
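
For readers unfamiliar with the post-processing step that one-to-one assignment removes, a minimal sketch of greedy NMS follows. It is the textbook procedure, not code from the paper; the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The hard keep-or-discard decision inside the loop is what makes this step non-differentiable, which is why detectors trained with one-to-many assignment cannot be optimized end to end through it.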

Abstract:
Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. The predicted plane coefficients can then be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method achieves outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves results competitive with state-of-the-art methods on the KITTI dataset, and generalizes effectively to unseen scenes.
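
The conversion from plane coefficients to depth can be illustrated with standard pinhole geometry. The sketch below assumes a plane written as n · X = d in camera coordinates and intrinsics `fx`, `fy`, `cx`, `cy`; the abstract does not give the paper's exact plane parameterization, so the function name and signature are hypothetical.

```python
def plane_to_depth(u, v, normal, offset, fx, fy, cx, cy):
    """Depth Z at pixel (u, v) for the plane normal . X = offset (camera frame)."""
    # Back-projected ray K^{-1} [u, v, 1]^T; a 3-D point on it is X = Z * ray.
    rx, ry, rz = (u - cx) / fx, (v - cy) / fy, 1.0
    # Substituting X = Z * ray into normal . X = offset gives Z = offset / (normal . ray).
    denom = normal[0] * rx + normal[1] * ry + normal[2] * rz
    if abs(denom) < 1e-9:
        raise ValueError("viewing ray is parallel to the plane")
    return offset / denom
```

For a fronto-parallel wall with normal `(0, 0, 1)` and offset 2.0, every pixel maps to depth 2.0. Because depth follows from a handful of plane parameters instead of local texture, a formulation of this kind is plausibly what helps in low-texture regions.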

Abstract:
Unsupervised anomaly detection methods can identify surface defects in industrial images by leveraging only normal samples for training. Due to the risk of overfitting when learning from a single class, anomaly synthesis strategies are introduced to enhance detection capability by generating artificial anomalies. However, existing strategies heavily rely on anomalous textures from auxiliary datasets. Moreover, their limitations in the coverage and directionality of anomaly synthesis may result in a failure to capture useful information and lead to significant redundancy. To address these issues, we propose a novel Progressive Boundary-guided Anomaly Synthesis (PBAS) strategy, which can directionally synthesize crucial feature-level anomalies without auxiliary textures. It consists of three core components: Approximate Boundary Learning (ABL), Anomaly Feature Synthesis (AFS), and Refined Boundary Optimization (RBO). To make the distribution of normal samples more compact, ABL first learns an approximate decision boundary by center constraint, which improves the center initialization through feature alignment. AFS then directionally synthesizes anomalies with more flexible scales guided by the hypersphere distribution of normal features. Since the boundary is so loose that it may contain real anomalies, RBO refines the decision boundary through the binary classification of artificial anomalies and normal features. Experimental results show that our method achieves state-of-the-art performance and the fastest detection speed on three widely used industrial datasets, including MVTec AD, VisA, and MPDD. The code will be available at: https://github.com/cqylunlun/PBAS.

Abstract:
Point cloud registration is a fundamental task for estimating the rigid transformation matrix between two point clouds, and is regarded as a prerequisite for downstream vision tasks. Recent works have sought to address the registration problem using the obtainable RGB-D sequence, rather than relying solely on point clouds, which may not always be available. However, most existing unsupervised RGB-D point cloud registration works struggle to obtain fine-grained, robust, discriminative correspondences due to the simple concatenation of multimodal features and the increase in vector dimensions. These methods typically follow a common paradigm: extracting features from the input data, estimating correspondences, and obtaining the transformation matrix through geometric fitting. In this work, we design a generative feature extraction module to fully leverage multimodal information, and seek a novel perspective for correspondence estimation which expands the points in the source and target point clouds into hyperrectangle-based embeddings and considers their inner relationships, based on intersections in n-dimensional space, as the basis for estimating correspondences. Each hyperrectangle-based embedding is built upon the natural and discriminative semantics from the proposed generative feature extraction module, which involves a diffusion branch, a geometric branch, and point-pixel fusion. We harness the capability of the generative model to fully leverage the information from both complementary modalities in RGB-D frames. Furthermore, this distinctive geometry space allows for efficient calculation of intersection volumes and modeling of conditional probabilities for estimating correspondences. Extensive experiments on the 3DMatch and ScanNet datasets show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches. Our code will be released at: https://github.com/cbyan1003/DCE.

Affiliations: School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China; Department of Automation, Tsinghua University, Beijing, China; Faculty of Actuarial Science and Insurance, Bayes Business School, City, University of London, London, U.K.; Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Department of Statistical Science, University College London, London, U.K.

Abstract:
Few-shot fine-grained image classification is prominent but challenging in computer vision, aiming to distinguish sub-classes under the same parent class with only a few labeled support samples. Data augmentation techniques have been explored to address the few-shot issue, but they often fail to mitigate the bias between support and query samples. Therefore, in this paper we propose a query-aware cross-mixup and cross-reconstruction method to address both few-shot and fine-grained issues. Specifically, in the training phase, we randomly select query samples and mix them with the support samples from the same class to augment the support set. This first strategy makes the augmented support set query-aware within each sub-class. Then, we reconstruct both query samples and support samples from both original and cross-mixed support samples, thus leveraging both cross-reconstruction and self-reconstruction to enhance classification. This second strategy, which makes the reconstruction query-aware as well, further mitigates the bias between support and query samples, leading to more reliable generalization. We evaluate our proposed method on four widely used few-shot fine-grained image classification datasets, and experimental results demonstrate its effectiveness in achieving state-of-the-art classification performance.
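
The first strategy, mixing same-class query samples into the support set, can be sketched as follows. This is only an illustration using the standard mixup convex combination on flat feature vectors. The abstract does not specify the mixing coefficient, its sampling, or the level (pixel or feature) at which mixing happens, so `lam`, `cross_mixup`, and `augment_support_set` are hypothetical names and choices.

```python
import random


def cross_mixup(support, query, lam):
    """Convex combination of one support and one query sample of the same class."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(support) != len(query):
        raise ValueError("samples must have the same length")
    return [lam * s + (1.0 - lam) * q for s, q in zip(support, query)]


def augment_support_set(support_set, query_set, lam=0.5, rng=None):
    """Append one query-mixed copy per support sample, class by class.

    support_set / query_set: dict mapping a class label to a list of flat
    feature vectors (assumed layout, for illustration only).
    """
    rng = rng or random.Random(0)
    augmented = {}
    for label, supports in support_set.items():
        queries = query_set.get(label, [])
        # Only mix within the same class; skip classes without query samples.
        mixed = [cross_mixup(s, rng.choice(queries), lam) for s in supports] if queries else []
        augmented[label] = list(supports) + mixed
    return augmented
```

Because mixing is restricted to samples sharing a label, the augmented support set stays class-consistent while moving toward the query distribution.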

Abstract:
Compared with unimodal image aesthetics assessment (IAA), multimodal IAA has demonstrated superior performance. This indicates that critiques can provide rich aesthetics-aware semantic information, which also enhances the explainability of IAA models. However, images are not always accompanied by critiques in real-world situations, rendering multimodal IAA inapplicable in most cases. Therefore, it would be interesting to investigate whether we can generate aesthetic critiques to facilitate image aesthetic representation learning and enhance model explainability. Motivated by these facts, this paper presents an attribute-oriented Critiques Generation framework for explainable IAA, dubbed CG-IAA, which consists of three major components, i.e., Vision-Language Aesthetic Pretraining (VLAP), Multi-Attribute Experts Learning (MAEL) and Multimodal Aesthetics Prediction (MAP). Specifically, the vanilla CLIP is first finetuned on a multimodal IAA database. Considering that aesthetic critiques typically consist of multiple attributes, a new multimodal IAA database which contains over 1 million critiques with up to four aesthetic attributes is constructed with language model-based knowledge transfer. Then, CLIP-based multi-attribute experts are trained based on this database. Finally, the pretrained experts are utilized to generate aesthetic critiques for assisting unimodal image aesthetics prediction. Extensive experiments have been done on four popular IAA databases, and the results demonstrate the advantage of CG-IAA over state-of-the-art methods. Furthermore, CG-IAA features better explainability and generalization with the assistance of generated critiques. The source code is available at https://github.com/sxfly99/CG-IAA.

Abstract:
Facial highlight removal aims to identify and remove the specular highlight components in the facial image, ensuring that the generated image has a consistent facial tone and high-fidelity texture detail. Existing methods struggle to remove the highlight and recover the details in disturbed areas simultaneously, often resulting in specular residues or distorted local details (i.e. texture, illumination, and color). To rectify these issues, this work proposes a novel two-stage facial highlight removal network (FHR-Net), which mainly consists of a Cross-Context Attention Module (CCAM) and a Texture Enhancement Module (TEM). In the first stage, according to the detected highlight mask, the CCAM explicitly integrates cross-context information to obtain coarse highlight removal results consistent with the surrounding facial context. Building upon the coarse result, the TEM in the second stage utilizes patch-wise attention to refine the texture details in the highlight areas, thereby producing a high-fidelity facial image. To improve coherence between the removed highlight areas and non-highlight areas, this work introduces a face feature loss that makes the processed highlight-disturbed areas align well with the surrounding facial architecture. Additionally, to address the lack of high-quality datasets in the research community and satisfy the training demands for data-driven facial highlight removal, this work builds a real-world Paired Facial Specular-Diffuse (PFSD) dataset through cross-polarization. Experimental results on PFSD and other datasets demonstrate that FHR-Net can effectively remove the facial highlight and recover original color and texture details.

Abstract:
Semi-supervised symmetric non-negative matrix factorization (SNMF) utilizes the available supervisory information (usually in the form of pairwise constraints) to improve the clustering ability of SNMF. Previous methods introduce the pairwise constraints from a local perspective, i.e., they either directly refine the similarity matrix element-wise or restrain the distance of the decomposed vectors in pairs according to the pairwise constraints. This overlooks the global perspective, i.e., in the ideal case, the pairwise constraint matrix and the ideal similarity matrix possess the same low-rank structure. To this end, we first propose a novel semi-supervised SNMF model by seeking low-rank representation for the tensor synthesized by the pairwise constraint matrix and a similarity matrix obtained by the product of the embedding matrix and its transpose, which could strengthen those two matrices simultaneously from a global perspective. We then propose an enhanced SNMF model, making the embedding matrix tailored to the above tensor low-rank representation. We finally refine the similarity matrix by the strengthened pairwise constraints. We repeat the above steps to continuously boost the similarity matrix and pairwise constraint matrix, leading to a high-quality embedding matrix. Extensive experiments substantiate the superiority of our method. The code is available at https://github.com/JinaLeejnl/TSNMF.

Abstract:
An efficient robust watermarking method should be resistant to various distortions, including distortions from image processing and geometric attacks. Geometric attacks are significant challenges for watermarking methods because they destroy the synchronization of the watermark between the embedding side and extracting side. It is a considerable challenge to accomplish watermark synchronization for watermarking methods. To address this challenge, a novel robust watermarking method with synchronization is proposed. At the embedding side, the watermark and the template are embedded to generate the watermarked image. If the watermarked image is attacked, the watermark and template are also distorted. At the extracting side, a template enhanced-extracted network is proposed to achieve watermark synchronization. The template enhanced-extracted network effectively extracts the distorted template from the distorted image. The template-enhanced subnet can indirectly enhance the strength of the distorted template in the distorted image and improve the accuracy of the template-extracted subnet. The visual quality of the watermarked image is guaranteed because there is no need to embed the template with high strength. Then, the attack factor is predicted based on the distorted template. By leveraging this prediction, correct watermark extraction with synchronization is achieved. The experimental results demonstrate that the proposed watermarking method with synchronization yields excellent robustness under image processing, geometric attacks and combined attacks.

Abstract:
Integrating dynamic effects has shown its significance in enhancing the accuracy and robustness of Visual-Inertial Odometry (VIO) systems in dynamic scenarios. Existing methods either prune dynamic features or rely heavily on prior semantic knowledge or kinetic models, which has proved unfriendly to scenes with a multitude of dynamic elements. This work proposes a novel dynamic feature fusion method for monocular VIO, named DFF-VIO, which requires no prior models or scene preference. By combining IMU-predicted poses with visual clues, it initially identifies dynamic features during the tracking stage by constraints of consistency and degree of motion. Then, we innovatively design a Dynamic Transformation Operation (DTO) to separate the effect of dynamic features on multiple frames into pairwise effects and construct a Dynamic Feature Cell (DFC) to preserve the eligible information. Subsequently, we reformulate the VIO nonlinear optimization problem and construct dynamic feature residuals with the transformed DFC as a unit. Based on the proposed inter-frame model of moving features, a so-called motion compensation is developed to resolve the reprojection issue of dynamic features, allowing their effects to be incorporated into the VIO’s tight coupling optimization, thereby realizing robust positioning in dynamic scenarios. We conduct accuracy evaluations on ADVIO and VIODE, degradation tests on the EuRoC dataset, as well as ablation studies to highlight the joint optimization of dynamic residuals. Results reveal that DFF-VIO outperforms state-of-the-art methods in pose accuracy and robustness across various dynamic environments.

Abstract:
Feature compression has attracted much attention in recent years due to its promising applications in scenarios where features are transmitted and analyzed by machine vision. However, existing research mainly focuses on coarse-grained features extracted from recognition tasks such as classification and detection, neglecting fine-grained features extracted from identification tasks. In this paper, we make a pioneering attempt to study fine-grained feature compression in the context of identification tasks. Our main focus is on the distortion metric, given its critical importance in optimizing the performance of a compression network. We initiate our discussion by reviewing the instance-level metrics in existing literature, highlighting their oversight of the inter-feature relationships. The inter-feature relationships are especially important for identification tasks as they involve similarity comparison among different identities. To address this problem, we propose to consider inter-feature relationships from the perspective of identity information. Specifically, we propose an identity-level metric to incorporate both intra-identity similarity and inter-identity discriminability. The intra-identity similarity constraint aims to cluster features from the same identity, while the inter-identity discriminability constraint ensures that features from different identities deviate from each other. We implement the identity-level metric on four different feature compression networks designed based on feature characteristics. Experimental results show the effectiveness of the proposed identity-level metric on person re-identification and face verification tasks.

Abstract:
In the rapidly evolving image processing domain, transformers have emerged as powerful tools, yet significant challenges are encountered when they are applied to underwater image enhancement, such as visual disparity and computational inefficiency. Transformers do not have a unique module to maintain their performance while reducing the number of parameters. This study addresses the gap in the literature by introducing the globally deformable selection transformer (GS-Transformer), which is a model designed to enhance global feature selection and pixel connectivity, thereby reducing the computational complexity of the model while maintaining the image processing effect. Our novel multiresolution encoder-decoder module explicitly incorporates global information, overcoming the limitations of traditional transformers, whereas the multilocal coherence preserving loss (MCPL) mechanism ensures content integrity and coherence. Compared with the latest transformer-based underwater image algorithms, this method is 15 times faster and uses only 41.7% of the number of parameters (i.e., fewer than half). The experimental results on the UIEB, EUVP, and Synthesize datasets reveal that GS-Transformer achieves state-of-the-art performance in underwater image enhancement, with a reduced parameter number and improved efficiency, representing a significant advancement in the field. Our research will promote the application of the transformer in scenarios with high real-time performance.

Affiliations: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Department of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China; College of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China; College of Computer Science and Technology, Tsinghua University, Beijing, China; Department of Computer Science, University of Rochester, Rochester, NY, USA

Abstract:
Image classification models including convolutional neural networks (CNN) and vision transformers (ViT) commonly employ a fully connected (FC) layer as the classifier. However, the fully connected nature of FC brings large amounts of weight parameters, limits the efficiency of inference, tends to over-fit the training data, and struggles to learn distinct class weights. To solve these problems, we propose a discrete representation classifier (DRC), a generic parameter-free classifier that offers efficiency, robustness, and more discriminative categorization. Specifically, the DRC discards numerous unimportant features and focuses solely on the salient features which are reinforced during training and presented in short discrete form during inference. Unlike FC, which learns pseudo-prototypes (weights) from data laden with complex patterns and noise, the DRC introduces discriminative fixed prototypes that are almost uniformly distributed across the high-dimensional feature space, and thus helps the model to learn more distinct boundaries between categories. Further leveraging the advantage of DRC’s focus on salient features, we propose Salient-CAM, which is able to locate the most important region in an image without the need for weighting feature maps. The experiments demonstrate that simply replacing the model’s classifier from FC to DRC can lead to a significant acceleration in the whole model’s inference and a more robust classification. Additionally, the proposed Salient-CAM exhibits excellent object localization ability in complex natural scenes.

Abstract:
Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU), and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences, with an average length of 20 words, and its annotated videos total 110.7 hours. Moreover, we evaluate SOTA models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU, from small to large models. Our experiments reveal that mainstream models, which perform well on previously public datasets, exhibit poor performance on surveillance video, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we also propose novel methods for multimodal anomaly detection tasks and finetune multimodal large language models using our dataset. All the experiments highlight the necessity of constructing this multimodal dataset to advance surveillance AI. Based on the experimental results mentioned above, we conduct further in-depth analysis and discussion. The dataset and codes are provided at https://xuange923.github.io/Surveillance-Video-Understanding.

Abstract:
We propose MoBox, a low-cost solution for semi-supervised video object segmentation that requires only bounding boxes as manual annotations for training. Built upon a mature semi-supervised video object segmentation network, we redesign the training losses and employ a more stringent training strategy. Specifically, we introduce a well-designed constraint term that enhances traditional spatial projection by simultaneously leveraging the projections of both the ground-truth box and the predicted mask across two axes, rather than evaluating discrepancies along the x-axis and y-axis independently. To harness the intrinsic properties of videos, considering the underlying correspondence between motion represented by optical flow and the original image, we incorporate motion coherence information into the color consistency loss as supplementary information and propose a motion discrepancy loss to obtain accurate boundaries. Additionally, to mitigate the ambiguity of weak supervision, we further introduce the pseudo strict constraint during training, which significantly improves model performance. Our approach yields competitive scores on popular benchmarks, achieving a $\mathcal{J}\&\mathcal{F}$ score of 78.6 on the DAVIS 2017 validation set and an Overall score of 78.0 on the YouTube-VOS 2018 validation set. These results highlight the efficacy of MoBox, demonstrating that the semi-supervised video object segmentation model can be effectively trained using only motion-augmented box supervision and intrinsic information of videos.
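
For context, the "traditional spatial projection" that the abstract says MoBox enhances can be sketched as below. This is the conventional per-axis formulation commonly used in box-supervised segmentation (max-projection of the mask onto each axis, compared with the projection of the box via a Dice loss). It is not MoBox's joint two-axis constraint, which the abstract does not specify, and all function names are hypothetical.

```python
def project(mask):
    """Max-project a 2D mask (list of rows) onto the x-axis and the y-axis."""
    proj_y = [max(row) for row in mask]        # one value per row
    proj_x = [max(col) for col in zip(*mask)]  # one value per column
    return proj_x, proj_y


def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between two 1D profiles."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def box_to_mask(box, width, height):
    """Rasterize an inclusive pixel box (x0, y0, x1, y1) into a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0 for x in range(width)]
        for y in range(height)
    ]


def projection_loss(pred_mask, box):
    """Per-axis projection loss: the x and y terms are evaluated independently."""
    height, width = len(pred_mask), len(pred_mask[0])
    pred_x, pred_y = project(pred_mask)
    box_x, box_y = project(box_to_mask(box, width, height))
    return dice_loss(pred_x, box_x) + dice_loss(pred_y, box_y)
```

Note that this loss only constrains the extent of the mask: a diagonal mask such as `[[1, 0], [0, 1]]` has the same projections as a full 2×2 box and therefore incurs zero loss, which is one reason box-supervised methods pair projection with additional terms such as color consistency.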

Abstract:
Unsupervised image restoration methods relying on a single data source often face challenges in achieving high-quality visual data completion due to the absence of additional supplementary information. This paper presents a novel optimization framework to address this limitation and further enhance the performance of image restoration. The framework generates pseudo side information (PSI) and utilizes it to guide the process of visual data completion. We introduce a pseudo side information regularizer (PSIR) tailored specifically for visual data completion tasks. The PSIR comprises two components: the PSI generator and updater, responsible for generating and refining the PSI, and the neural self-expressive prior (NSEP), which identifies a prior matching the desired result and PSI during optimization. Notably, our method achieves comprehensive visual data completion across various data types without the need for additional reference side information or training data. Extensive experimental evaluations conducted on spectral data (including color images, multispectral images, and hyperspectral images), video data (including gray video, color video, and hyperspectral video), magnetic resonance image, and real cloud data demonstrate the superiority of our approach over other state-of-the-art completion methods under different missing rate scenarios.

Affiliations: School of Information Science and Technology, the Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, Beijing Laboratory of Smart Environmental Protection, Beijing Key Laboratory of Computational Intelligence and Intelligent Systems, and Beijing Artificial Intelligence Institute, Beijing University of Technology, Beijing, China; Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Relying on high-quality images, industrial vision technologies can oversee essentially all industrial production processes, such as workpiece processing and assembly automation, and play a highly significant role in promoting detection automation and production capacity in assembly lines. Unlike natural scene images, which consist of richer colors and natural lines, industrial images that cover complex industrial goods and equipment are made up of fewer colors, more regular shapes, massive graphic elements, etc., causing existing image processing methods for quality estimation, enhancement and monitoring to fail. Human beings usually act as the final receiver of an industrial image, so in research on image quality estimation, it is necessary to take into consideration how the human eyes and brain perceive the input images. On this basis, we propose a novel perceptual information fidelity based image quality estimation model, abbreviated as PIF. Particularly, we first introduce a visual-cell low-pass filter and an optical-nerve noise model, which are inspired by two respective processes: one is that an image in the form of optical signals arrives at the retina through the eye’s optical system to form the stimuli; the other is that the aforesaid stimuli in the form of electrical signals transfer to the human brain through the optical nerve. Second, we construct a novel image content-aware adjustor to optimize the above visual-cell low-pass filter and optical-nerve noise model. Third, we compare two quantities, namely the information present in the clean image and the amount of that information that can be extracted from the lossy image, to generate the overall quality score. Experiments on two large-size industrial image quality databases demonstrate the excellent performance achieved by our proposed PIF model, with a remarkable performance gain over the existing state-of-the-art competitors.

Abstract:
Chart images are widely employed as the intuitive form to express information, which renders them highly valuable. Consequently, there is an urgent demand to develop a watermarking algorithm for copyright protection and leakage prevention of chart images. Nevertheless, existing chart watermarking methods fail to thoroughly consider the chart image’s special characteristics and simply rely on the previous natural image-based watermarking framework. Compared to natural images, the chart image generally exhibits relatively simple layouts and textures, containing fewer complex texture regions that watermarks are typically embedded in. Therefore, the embedding locations of watermarks for different distortions can be relatively dispersed in natural images, while for chart images, watermark embedding regions under various distortion conditions tend to be relatively concentrated and share more overlaps. Inspired by the above special characteristics of chart images, to sufficiently leverage them and design a better framework, this paper proposes C3hartMark, a chart watermarking scheme with consecutive-encoding and concurrent-decoding. Instead of using the combined noise layer as existing methods to ensure multiple robustness, a novel consecutive training framework is introduced in this paper, which efficiently utilizes the overlapping of embedded watermark features in chart images, and simultaneously, mitigates the poor convergence brought by the combined noise layer. During the extraction stage, multiple concurrent decoders are introduced to extract the potential embedded watermarks for different distortions independently. Moreover, we also incorporate two special noise layers, namely Captioning and Fusion, to address the corresponding realistic distortions in chart images, and an agnostic noise layer to accommodate potential channel transmission distortions unknown during training. 
Through extensive experiments, we demonstrate that, with better visual quality, C3hartMark simultaneously outperforms existing state-of-the-art (SOTA) watermarking methods in terms of robustness, achieving 99.57% extraction accuracy under JPEG compression (QF=60).

Abstract:
Deep learning methods excel in Polarimetric SAR (PolSAR) image classification. However, existing methods typically sample an image block for each pixel with a fixed-size square window, which often contains content that is inconsistent or incomplete with respect to the central pixel, resulting in many misclassifications, especially in boundary and heterogeneous regions. A fixed-size square window is therefore insufficient for representing various terrain objects. To address this issue, we develop a content-adaptive multi-region deep network to obtain contextually consistent sampling windows for diverse terrain objects. First, a complex PolSAR image scene is partitioned into homogeneous, heterogeneous and boundary regions. Then, sampling windows with adaptive direction and scale are designed for the three distinct regions. In addition, windows with central and global regions are proposed to provide additional local and global information. Finally, a fusion network is designed to adaptively combine different sampling windows to enhance classification performance. Experimental results on three real data sets demonstrate that the proposed method can achieve superior performance in both edge details and heterogeneous terrain objects compared with the state-of-the-art methods.

Abstract:
Hyperspectral image (HSI) provides detailed spectral and spatial information, essential for precise earth observation and various applications. Deep learning has advanced HSI classification, but the scarcity of labeled data and large model parameters necessitate semi-supervised methods to enhance performance and generalization. In this paper, we propose a novel semi-supervised framework dubbed Knowledge-Aware Geometric Contourlet Semantic Learning (KGCSL), aiming to achieve high-precision HSI classification with limited samples leveraging geometric and semantic knowledge. Specifically, to fully leverage geometric knowledge, KGCSL incorporates multi-scale and multi-directional representations of the contourlet transform within the neural network, enhancing the robustness of feature extraction and interpretability. Furthermore, to fully utilize semantic knowledge, an entropy-weighted prototype loss function is designed that exploits the attribute relationships between labeled and unlabeled samples to guide the optimization of unlabeled samples, promoting comprehensive semantic learning. Comprehensive evaluations of the proposed KGCSL framework on three public HSI datasets show that it outperforms existing state-of-the-art HSI classification methods and exhibits excellent generalization capabilities in limited-sample scenarios. The source code is available at https://github.com/ShirlySmile/KGCSL.

Abstract:
Integrating low-resolution hyperspectral images with high-resolution multispectral images is an effective approach to derive high-resolution hyperspectral images. Recently, numerous deep learning-based approaches have been employed to model the mapping relationships for the fusion directly. However, these methods often neglect the spectral characteristics and fail to facilitate comprehensive interactions among global features from heterogeneous modalities. In this paper, we propose a novel cyclic Transformer based on the cross-modality spatial-spectral interaction, exploiting diverse interaction modes to explore the similarity and complementarity among cross-modality features. Specifically, we design a cyclic interactive architecture to fully exploit the abundant spectral prior information in low-resolution hyperspectral images and the rich spatial prior information in high-resolution multispectral images. By incorporating spatial and spectral priors into the attention mechanisms in Transformer modules, we explore the long-range dependency information within the cross-modality features. Furthermore, to enhance interaction among features from different modalities, we devise the cross-modality adaptive interaction mechanisms in both spatial and spectral dimensions to facilitate information reciprocity between different modalities. Extensive experiments demonstrate that the proposed approach outperforms the state-of-the-art fusion methods both quantitatively and visually. The code is available at https://github.com/Tomchenshi/CYformer.

Abstract:
Class incremental learning (CIL) sequentially increases the number of classes, which often leads to catastrophic forgetting when fine-tuning on new classes. Existing approaches typically employ linear classifiers and expand them to accommodate new classes. However, conducting conventional classification inherently introduces feature drift in the image space upon the introduction of new classifiers, potentially disrupting the established distributions, and resulting in forgetting. In this paper, we propose a novel insight to reformulate the conventional classification as image-class matching (ICM) to mitigate the disruption. ICM independently encodes the image and the category and allows for the sharing of a matching classifier across all tasks, effectively stabilizing the feature space during the CIL process. To apply ICM to CIL, we introduce the Binary Matching Classification (BMC) framework, which employs cross attention to encode the matching relationship between images and each category to predict matching scores. When learning new tasks, BMC only requires the addition of category inputs without any structural changes. Moreover, we present a series of strategies to enhance the adaptation of BMC to CIL. Through simple regularization, our BMC framework achieves outstanding performance on various benchmarks including CIFAR-100, ImageNet-100, and ImageNet-1000 datasets. Our code is available at https://github.com/Ethanhuhuhu/BMC.

Affiliations: School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China; Department of Biomedical Engineering, Duke University, Durham, NC, USA; Global Big Data Technologies Centre (GBDTC), University of Technology Sydney (UTS), Ultimo, NSW, Australia; School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; School of Software Engineering, Tongji University, Shanghai, China; College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China

Abstract:
In this paper, a novel image enhancement method, called all-inclusive image enhancement (AIIE), is proposed that effectively enhances degraded images to improve the visibility of image content. Such images are acquired under various adverse conditions, such as haze, low light, underwater, and sandstorm. One commonality shared by this class of degradations is that the resulting loss of visual quality or visibility is caused by low-frequency interference. Existing image enhancement methods lack the ability to deal with all types of degradations from this class, while our proposed AIIE offers a unified treatment for them. To achieve this goal, a statistical property is obtained from a study of the discrete cosine transform (DCT) coefficients of 1,000 high-quality and 1,000 low-quality images. It shows that about 95% of the normalized DCT coefficients (between 0 and 1) of high-quality images fall in the interval [0, 0.2]; for low-quality images, almost all the coefficients are in the same interval. This fundamental property, called the DCT prior (DCT-P), is instrumental to the development of the AIIE algorithm proposed in this paper. Since the proposed DCT-P clearly delineates the attributes of high- and low-quality images, it becomes a highly effective ‘tool’ for converting low-quality images to their enhanced versions. Extensive experimental results have clearly validated the superior performance of AIIE on different types of deteriorated images in terms of visual quality and efficiency, as well as its significant advantage in computational complexity, which is essential for real-time applications.
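
As an illustration of the statistic behind the DCT prior, the following minimal sketch measures the fraction of normalized DCT coefficients of a small grayscale block that fall in [0, 0.2]. The abstract does not say how coefficients are normalized, so dividing absolute values by their maximum is an assumption here, and the function names `dct2` and `fraction_in_low_interval` are hypothetical. This is not the authors' AIIE algorithm.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square list-of-lists (naive O(N^4) version)."""
    n = len(block)

    def alpha(k):
        # Scaling factors that make the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def fraction_in_low_interval(block, upper=0.2):
    """Fraction of normalized |DCT| coefficients that lie in [0, upper].

    Assumption: normalization divides absolute coefficients by their maximum.
    """
    coeffs = [abs(c) for row in dct2(block) for c in row]
    peak = max(coeffs)
    if peak == 0.0:
        return 1.0  # an all-zero block has every coefficient at 0
    normalized = [c / peak for c in coeffs]
    return sum(1 for c in normalized if c <= upper) / len(normalized)
```

For a perfectly flat block only the DC coefficient is non-zero, so all but one of the normalized coefficients fall in the low interval.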

Abstract:
The application of multimodal image fusion has become increasingly widespread across various fields in the era of deep learning. Existing fusion methods integrate infrared and visible images to provide complementary content and enhance the robustness of complex real-world scenes for high-level visual tasks, such as semantic segmentation and object detection. In return, high-level visual tasks facilitate the fusion of infrared and visible images by providing mid-level semantic information. However, such frameworks rely heavily on multimodal data and require strict registration of images from different modalities before fusion, which seriously limits their practical applications because missing modalities and misregistration are common in realistic situations. To move beyond this limitation, we propose a novel hierarchical knowledge distillation (HKD) framework tailored for unimodal image segmentation with the guidance of multi-modality. This framework aims to retain as much diverse information from multimodal image fusion as possible, thereby enhancing downstream high-level visual tasks when only the unimodal images are available during the inference phase. Our proposed method is two-stage, and we construct a robust multimodal fusion and segmentation interaction network in the first stage as a powerful teacher model. In the second stage, we design a hierarchical distillation method to transfer the fused and segmented multi-layer knowledge from the multimodal teacher model to the unimodal student model. Extensive experimental results on two public datasets, i.e., MFNet and FMB, demonstrate that the proposed hierarchical knowledge distillation framework can effectively transfer multimodal knowledge into the unimodal student model for image enhancement and segmentation under incomplete multimodal conditions, and achieves highly competitive results compared to multimodal image fusion and segmentation models.

Abstract:
We introduce Frame Interpolation Pre-training (FIP), a simple learning technique for lifting deep image denoisers to video denoising with improved implicit temporal alignment. Modern video denoising networks typically rely on explicit motion estimation and alignment which are computationally intensive and harder to re-design and re-train, restricting their application scope and usability. Conversely, stacking frames and image denoisers, without incorporating explicit motion estimation modules, improves speed and benefits from a simpler design, thereby facilitating their generalizability to the video domain. However, it leads to lower accuracy due to suboptimal capture of temporal dependencies. To better leverage the adjacent frames in this setting and reduce the accuracy gap, we propose a novel training regime that divides the standard supervised training of the denoising task into two phases. In the initial phase, FIP guides the network to interpolate a fully masked central frame using only adjacent noisy input frames. In the subsequent phase, the pre-trained network is fine-tuned on denoising the central frame, now using all noisy input frames. Extensive diagnostics indicate that FIP-based networks provide better implicit motion estimation and temporal alignment. In effect, qualitative and quantitative evaluation on standard video denoising datasets with synthetic and real noise demonstrates that FIP consistently improves video denoising accuracy of motion-aware, video-lifted image denoisers without additional computational overhead during training and test time. Our code is available at https://github.com/camalab-ai/FIP

Abstract:
While diffusion-based art image synthesis has witnessed great success in terms of quality, there are still deficiencies in integrating artist-specified subjects with artistic style. In this paper, we propose Canvas, a framework that leverages the capabilities of text-guided latent diffusion models (LDMs) for flexible art image composition driven by diverse customized subject concepts. Specifically, we start by collecting art images manually drawn by proficient artists and annotating the corresponding subject concepts, forming the CreaCulture dataset. Based on this dataset, we build our Canvas with two generation stages. Firstly, a stable diffusion-based stylistic LDM is fine-tuned on the original CreaCulture dataset, aiming to generate an art-style background with annotated subject concepts. To alleviate the limited scope of tagged subject concepts, we propose nature-to-art (N2A) transition to expand the CreaCulture using the natural/art concepts from pre-trained/stylistic LDM, facilitating the fine-tuning of the tailor-made concept-derived LDM. Additionally, the Subject-Infused Attention (SIA) is integrated into the concept-derived LDM, which seamlessly composites the user-specified natural foreground with the pre-generated art background image in a training-free manner. Extensive experiments demonstrate that Canvas outperforms state-of-the-art alternatives under the setting of art image synthesis. The code and dataset are available at https://github.com/wangyunnan/Canvas

Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention for its exceptional performance in real-time rendering and novel view synthesis. However, the traditional Gaussian densification method struggles to effectively capture the complexity of regions with insufficient point cloud density. Although this method improves overall rendering quality by expanding the point cloud to millions of points, blurring and distortion issues still persist in edge details and high-intensity lighting regions. To address these limitations, this paper proposes Gaussian-Enhanced Detail Reconstruction (GEDR), which enhances 3DGS with two key innovations: (1) Multi-scale adaptive Gaussian kernels, dynamically adjusted based on geometric features such as gradient and curvature, enabling finer reconstruction in high-detail regions while maintaining efficiency. (2) Opacity optimization leveraging illumination information, reducing artifacts caused by ambient lighting variations and ensuring efficient and stable rendering, even in large-scale scenes. Evaluations on Mip-NeRF 360 and Tanks & Temples datasets demonstrate that GEDR significantly improves detail preservation, complex region restoration, and robustness to lighting changes while maintaining controlled storage overhead. These results highlight GEDR’s advantages over traditional 3DGS in high-fidelity scene reconstruction.

Abstract:
Current deepfake detection methods commonly use data augmentation and authenticity-content disentanglement to extract more generalized features for detection tasks. However, these methods rely exclusively on low-level spatial artifacts to distinguish real from fake images, which presents significant challenges in accurately capturing the rich forgery cues. Deepfakes create discrepancies between forged and original facial features within the face-recognition (FR) embedding space, which can serve as an additional cue for detection. To better exploit the artifacts in deepfake images, we propose a novel detection method that enhances the detector’s perception capability by incorporating not only the real and fake samples during training, but also the visual residual between real and fake images. Meanwhile, we integrate the discrepancy in facial embedding between the real and fake samples into the training procedure of artifact extraction, serving as a guidance signal with strong knowledge provided by the pretrained face recognition model. A specialized distillation loss, along with additional cross-entropy losses, is designed to enhance the detection capability. Experiments on multiple benchmarks demonstrate the superiority of the proposed approach in deepfake detection over existing methods.

Abstract:
As Vision Transformers (ViTs) become increasingly popular in various vision tasks, one may ask whether a new training scheme for ViTs exists that can improve performance without increasing training and inference computation cost. In this paper, we answer this question affirmatively with a novel Sparse-to-Dense (S2D) training scheme. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Network (FFN) layers of ViTs with computationally efficient RUP-Mixture-of-FFN (RUP-MoF) layers, each comprising multiple FFN experts and allocating tokens to experts via Random Uniform Partition (RUP). Furthermore, an additional Experts Weights Averaging (EWA) update is performed specifically on these RUP-MoF layers after each gradient update. After training, we convert each RUP-MoF layer back to a single FFN layer by averaging the experts, transforming the training-time sparse model back to the original dense ViT model for inference. We further provide theoretical analysis to illustrate why and how it works. Comprehensive experiments across various 2D and 3D vision tasks, ViT architectures, and datasets validate the effectiveness and generalization ability of the proposed S2D training scheme. Besides, we show that the S2D training scheme can also be applied to improve the performance of Transformer-based language models, and that the EWA update technique can also significantly improve the effectiveness of classic Mixture-of-Experts on various small-scale 2D vision datasets and 3D vision tasks.
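
The two mechanical steps named in the abstract above, allocating tokens to experts by random uniform partition and collapsing the experts into one FFN by averaging, can be sketched as follows. This is a minimal illustration, not the authors' implementation: equal-sized token groups, flat parameter lists, and the function names are all assumptions, and the per-step EWA update rule is not shown because the abstract does not specify it.

```python
import random


def random_uniform_partition(num_tokens, num_experts, rng):
    """Shuffle token indices and split them into equal-sized groups, one per expert.

    Assumption: "uniform" means every expert receives the same number of tokens.
    """
    if num_tokens % num_experts != 0:
        raise ValueError("num_tokens must be divisible by num_experts")
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    size = num_tokens // num_experts
    return [indices[i * size:(i + 1) * size] for i in range(num_experts)]


def average_experts(experts):
    """Merge experts into a single parameter set by element-wise averaging.

    `experts` is a list of dicts mapping a parameter name to a flat list of floats
    (a hypothetical stand-in for real FFN weight tensors).
    """
    n = len(experts)
    merged = {}
    for name in experts[0]:
        columns = zip(*(expert[name] for expert in experts))
        merged[name] = [sum(column) / n for column in columns]
    return merged
```

Because the merged layer has the shape of a single expert, the inference-time model keeps the parameter count and compute of the original dense FFN.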

Abstract:
Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others’ emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs’ video emotion recognition capabilities.

Abstract:
6D object tracking plays an important role in various applications, including robotic manipulation and virtual reality. While current methodologies have achieved significant advancements through the use of CAD models, multi-modal sensor data, and category-level assumptions, such resources are often inaccessible in open-world scenarios. Consequently, tracking 6D object poses using only RGB data in such scenarios remains a challenging task. In this paper, we introduce Zero6DOT, an innovative and efficient method for real-time tracking of unknown 6D object poses in monocular RGB video sequences at 8Hz. Our approach requires only the mask of the initial frame, eliminating the need for additional data. The core of Zero6DOT lies in its ability to establish high-quality correspondences across images, from which accurate poses are derived. To achieve this, we employ a transformer-based neural network to predict initial long-term correspondences across frames and integrate a robust Dynamic Units System to refine these predictions. This combination facilitates precise pose tracking while maintaining both efficiency and robustness, even under challenging conditions such as object disappearance, reappearance, and handheld motion. The effectiveness of our approach has been rigorously evaluated through both qualitative and quantitative analyses on the OnePose, YCB-V, and RBOT datasets. The results demonstrate the potential of our proposed Zero6DOT to redefine 6D object pose tracking for real-world scenarios. The source code of the proposed method is available at https://github.com/pangbo1997/Zero6DOT

Abstract:
Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.

Abstract:
Hyperspectral imaging offers significant potential for precise object tracking, yet the scarcity of dataset volumes specifically tailored for hyperspectral tracking algorithms hinders progress, particularly for deep models with complex structures. Additionally, current deep learning-based hyperspectral trackers typically enhance model accuracy via online or adversarial learning, adversely affecting tracking speed. To address these challenges, this paper introduces the Constrained Object Adaptive Learning hyperspectral Tracker (COALT), an effective parameter-efficient fine-tuning tracker tailored for hyperspectral tracking. COALT integrates Pixel-level Object Constrained Spectral Prompt (POCSP) and Temporal Sequence Trajectory Prompt (TSTP) through Adaptive Learning with Parameter-efficient Fine-tuning (ALPEFT), enabling a transformer-based tracker to capture detailed spectral features and relationships in hyperspectral image sequences through trainable rank decomposition matrices. Specifically, POCSP is designed to retain optimal spectral information with low internal correlation and high object representativeness, enabling rapid image reconstruction. Then, the most representative spectral template and search are fused into a single stream as spectral prompts for the Encoder and Decoder layers. Concurrently, the previous coordinates within the same sequence are tokenized and utilized as temporal prompts by TSTP in the decoder layers. The model is trained with ALPEFT to optimize spectral information learning, which substantially reduces the number of training parameters, alleviating overfitting issues arising from limited data. Meanwhile, the proposed tracker not only retains the ability of the pre-trained model to estimate object trajectories in an autoregressive manner but also effectively utilizes spectral information and enhances target location perception during the fine-tuning process. Extensive experiments and evaluations are conducted on two public hyperspectral tracking datasets. The results demonstrate that the proposed COALT tracker achieves satisfactory performance with leading processing speed. The code will be available at https://github.com/PING-CHUANG/COALT

Abstract:
Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state-of-the-art coding performance.

Abstract:
Recent advances in steganography leverage generative adversarial networks (GANs) as a robust framework for securing covert communications through adversarial training between stego-generators and steganalytic discriminators. This paradigm facilitates the synthesis of secure steganographic images by harnessing the competition between network components. However, existing GAN-based approaches suffer from asymmetric capacity between generators and discriminators: suboptimally trained discriminators provide inadequate gradient guidance for generator optimization, causing premature convergence and security degradation. To overcome this critical limitation, we propose an enhanced multi-steganalyzer adversarial architecture incorporating maximum mean discrepancy (MMD) regularization. Our framework introduces two key innovations: 1) an MMD-based regularization mechanism that mitigates distributional discrepancies among multiple steganalyzers through kernel embedding optimization, and 2) a reward function that fuses gradients derived from multiple steganalyzers to boost reinforcement learning-based adversarial training. This dual strategy enables the discriminator to learn generalized forensic features while maintaining equilibrium in adversarial training dynamics, ultimately allowing the generator to produce stego images resistant to multiple steganalyzers simultaneously. Comprehensive experiments validate our method’s superiority: when evaluated across five steganalysis networks, including YedNet, CovNet, LWENet, SRNet, and SwT-SN, at 0.1-0.4 bpp payloads, the proposed framework achieves improvements in average detection error rates over state-of-the-art techniques such as SPAR-RL and GMAN. Ablation studies further confirm that MMD regularization contributes significantly to security enhancement.
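
For readers unfamiliar with the statistic, maximum mean discrepancy between two sample sets can be estimated as in the minimal sketch below, which uses the standard biased estimator with a Gaussian kernel. How the paper applies MMD to the outputs or features of its steganalyzers is not stated in the abstract, so the inputs here are generic feature vectors, and the kernel choice, bandwidth, and function names are assumptions.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(p, q):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in p for b in q)
        return total / (len(p) * len(q))

    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when both sample sets are identical and grows as the two distributions move apart, which is what makes it usable as a regularization term.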

Affiliations: School of Software Engineering, South China University of Technology, Guangzhou, China; College of Information Engineering and Shaanxi Engineering Research Center for Intelligent Perception and Analysis of Agricultural Information, Northwest A&F University, Xianyang, Shaanxi, China; School of Future Technology, South China University of Technology, Guangzhou, China; School of Computer Science, School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University, Xi’an, China

Abstract:
Deep Multi-View Clustering (MVC) methods partition multi-view data into disjoint clusters in an unsupervised manner, showing significant promise across various domains. However, current MVC methods primarily focus on capturing the consistency information shared across all views and undervalue the specificity information inherent in each view that reflects its unique characteristics. Furthermore, the underexploration of the separability of learned representations limits the overall clustering performance of existing MVC methods and leads to undesirable clustering results. In this paper, we propose a fully differentiable and end-to-end deep MVC framework, named Comprehensive Information Extraction with Separable Representation Learning (CIRSEL), to address these issues. CIRSEL recasts specificity information extraction as a high-order graph pooling process to capture the view-specific characteristics of individual views. Utilizing the cross-attention mechanism, CIRSEL adaptively fuses the consistent and view-specific representations to achieve comprehensive information extraction. Subsequently, CIRSEL maps representations into a unit hypersphere space with evenly distributed prototypes and maximizes the variational estimation of Mutual Information, which enhances the inter-cluster separability and intra-cluster compactness in the embedding space and further benefits the following clustering learning. Finally, CIRSEL introduces a nuclear norm-based balance regularization, which ensures balanced clustering results can be directly retrieved by the cosine similarity between the representations and prototypes. Extensive experiments on ten benchmark datasets demonstrate the effectiveness of CIRSEL compared to sixteen current MVC methods.

Abstract:
The Sketch-Less Facial Image Retrieval (SLFIR) framework facilitates the retrieval of target images with minimal strokes through a human-computer interactive approach, thereby circumventing the need for high-quality sketches required by traditional frameworks. The primary approach utilizes a contrastive learning framework that minimizes the distance between sketch images and their target images in the embedding space, while maximizing the distance from non-target images, thus efficiently learning representations of sketches and images. However, during the initial stages of sketching, the sparse strokes that capture only partial facial features can inadvertently match non-target facial images, blurring the distinctions between positive and negative samples and impairing early retrieval performance. To overcome this challenge, we introduce a multimodal retrieval model based on diversified feedback reinforcement learning, which not only enhances the semantic integrity of sketches but also optimally ranks the sketches corresponding to positive samples using diversified feedback. Specifically, (1) we develop a Facial Language-Image Pre-training (FLIP) model and, leveraging this model, construct an on-the-fly multimodal retrieval model that excels in recognizing sparse and exaggerated sketches by extracting and integrating multiscale features from both sketches and textual descriptions. (2) Furthermore, we implement a novel reward mechanism that adjusts the rewards for target images, accommodating reasonable fluctuations in sketch rankings on actual images. This mechanism effectively differentiates similar images during retrieval, ensuring a more consistent and progressively improving ranking list. Extensive experiments validate that our proposed method significantly enhances early retrieval accuracy and generalization capability.

Abstract:
Turbid underwater images often suffer from color distortion, contrast degradation, and detail loss. To improve the visual quality of these images, this paper proposes an illumination-constrained, structure-preserved retinex variational model. The proposed approach consists of three main components: a nonlinear model based on the classical retinex theory to represent the multiple adverse deformations of turbid underwater images; an adaptive channel compensation method to correct the color cast; and an illumination-constrained structure-preserved variational retinex model that simultaneously estimates a smooth illumination component and a detail display reflection component and uniformly predicts the noise pattern of preprocessed underwater images. Specifically, an adaptive weight matrix is proposed to reveal the structural details in reflectance. The overall smoothness of illumination is constrained by exponential guided filtering and the l_{1/2} norm. The total intensity of the noise pattern is constrained by the l_2 norm. To solve the resulting optimization problem, we employ alternating direction minimization of logless transformations of Lagrange multipliers. Extensive experiments demonstrate the effectiveness of the proposed method in improving the quality of turbid underwater images. Beyond subjective visual observations, the method also exhibits competitive performance in objective image quality evaluations.

Abstract:
Recent advances in semantic correspondence have witnessed growing interest in vision foundation models, particularly stable diffusion (SD) and self-distillation with no labels (DINO). However, existing methods underutilize the matching potential of SD and DINOv2 features and show similar background interference patterns. They lack texture-to-semantic learning and intra- and inter-image feature interaction. This study proposes Tex2Sem, a framework learning from textures to semantics, to address the two problems. For the first problem, we propose a texture-to-semantic learning paradigm that achieves texture-semantic trade-offs on features and correlation maps, including progressive fusion and correlation map computation. The SD and DINOv2 features are aggregated from textures to semantics to produce multi-stage progressive fusion features. The resulting multi-stage progressive fusion correlation maps improve semantic correspondence significantly. For the second problem, MamFormer, a hybrid architecture of Mamba-2 and Transformer, is proposed to improve intra- and inter-image feature aggregation and interaction. It enhances foreground focus and background suppression. Given the high computational cost of processing all-stage progressive fusion features, the terminal-stage aggregation and interaction mechanism (TAIM) is proposed to enhance feature learning efficiency. Experiments demonstrate that Tex2Sem achieves state-of-the-art performance on SPair-71k, AP-10K, and PF-PASCAL. Furthermore, Tex2Sem shows remarkable generalization capabilities in cross-species, cross-family, and cross-dataset matching and demonstrates the potential for applications in video swap and human pose estimation. Code is available at https://github.com/wzhlearning/Tex2Sem.

Abstract:
Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Recent methods introduce a pseudo label learning framework to bridge the gap between classification-based training and localization-oriented inference. Typically, this framework employs a classification-based teacher model to generate pseudo labels, which are then used to train a regression-based student model for precise boundary prediction. However, the quality of these pseudo labels, which is critical to the student model’s performance, has not been systematically investigated, leading to suboptimal localization accuracy. In this paper, we propose a set of simple yet efficient mechanisms for pseudo label quality enhancement to build our FuSTAL framework. Unlike previous one- or two-stage methods, FuSTAL decomposes the learning process into three stages and enhances pseudo label quality at each one: cross-video contrastive learning for more informative initial pseudo labels at the Generation-Stage, prior-based filtering to remove false positive proposals at the Selection-Stage, and EMA-based distillation for smoother pseudo labels at the Training-Stage. These designs complement each other and enhance the quality of action proposals with respect to accuracy, true positive rate, and smoothness. With the help of these comprehensive designs at all three stages, FuSTAL achieves an average mAP of 50.8% on the THUMOS’14 benchmark, outperforming the previous best method by 1.2%.
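
The EMA-based distillation mentioned above relies on an exponential moving average of model parameters. The sketch below shows only that generic update; which FuSTAL network plays the averaged role, the decay value, and the scalar-per-name parameter layout are assumptions made for illustration, since the abstract does not give these details.

```python
def ema_update(averaged, current, decay=0.99):
    """Return new averaged parameters: decay * averaged + (1 - decay) * current.

    Both arguments map a parameter name to a float (a hypothetical stand-in
    for real weight tensors). The inputs are not modified.
    """
    return {
        name: decay * averaged[name] + (1.0 - decay) * current[name]
        for name in averaged
    }
```

Applied after every training step, the averaged parameters change slowly, so the pseudo labels produced from them vary less between iterations than those of the rapidly updated model.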

Abstract:
Recently, the topic of multi-view multi-label classification has attracted significant attention from scholars. Many methods adopt an average weighting scheme to merge the features obtained from multiple views, which ignores the quality differences in the information provided by the views and thus limits the credibility of the fused feature for the overall task. Besides, most of these methods assume that the views and labels are complete, neglecting that both views and labels may be incomplete. To solve these problems, we propose a quality-aware representation fusion network for partial multi-view incomplete multi-label classification, named QARF-net. Since assigning equal fusion weights to each view may not be in line with the actual contributions of individual views, a view quality-aware module is proposed to learn suitable weights for different views dynamically based on the quality of each view’s information, which provides a reliable guide for fusing the information of multiple views. In addition, considering the consistency characteristics of multi-view data, we impose a sample-level dual constraint to preserve the consistency property of the feature in multi-view space and constrain the sample structure in the fused feature space, respectively. Last but not least, QARF-net can not only deal with complete multi-view multi-label classification tasks but also tackle partial multi-view incomplete multi-label classification tasks. Experimental results on five real-world datasets indicate that our proposed method outperforms state-of-the-art methods.
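
To make the contrast with average weighting concrete, here is a minimal sketch of quality-weighted fusion over the available views only. It is an assumption-laden illustration, not QARF-net's module: the function fuse_views, the softmax over quality scores, and the availability mask are all hypothetical choices for this example.

```python
import math


def fuse_views(features, quality_scores, available):
    """Fuse per-view feature vectors with softmax weights over available views.

    features:       list of feature vectors (entries for missing views are ignored)
    quality_scores: one scalar score per view (hypothetical; learned in practice)
    available:      list of booleans marking which views are present
    """
    idx = [i for i, present in enumerate(available) if present]
    if not idx:
        raise ValueError("at least one view must be available")
    # Softmax restricted to available views, shifted by the max for stability.
    top = max(quality_scores[i] for i in idx)
    exps = {i: math.exp(quality_scores[i] - top) for i in idx}
    total = sum(exps.values())
    fused = [0.0] * len(features[idx[0]])
    for i in idx:
        weight = exps[i] / total
        for d, value in enumerate(features[i]):
            fused[d] += weight * value
    return fused


# Two available views with equal scores reduce to a plain average;
# the third view is missing and contributes nothing.
print(fuse_views([[1.0, 0.0], [0.0, 1.0], None], [0.0, 0.0, 0.0], [True, True, False]))
# [0.5, 0.5]
```

With unequal scores, the higher-quality view dominates the fused feature, whereas equal scores recover the average weighting scheme.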

Abstract:
Weakly supervised temporal sentence grounding aims to temporally locate events described by a sentence in a video, relying solely on video-level visual-language correspondences. Because of the absence of precise boundary information, existing works primarily focus on multiple instance learning methods to establish segment-level video-language alignment. In this work, we propose Prompt-augmented Boundary Attentive Learning (PBAL) to enable the explicit modeling of the segment boundaries in a weakly supervised context. To represent the boundaries with sentences, we first generate sentences describing the start and end of an event, leveraging the capabilities of large language models (LLMs). With the augmented sentences, we then model the boundary-level video-language correspondence using a novel boundary-attentive learning module. This module generates probability maps of the starting and ending points, and is learned through boundary type prediction and self-supervised reconstruction. Experiments on two standard datasets, Charades-STA and ActivityNet Captions, demonstrate PBAL’s state-of-the-art performance. The results of our ablation study further demonstrate the effectiveness of our boundary-attentive learning and prompt augmentation techniques.

Abstract:
In recent years, a number of effective Few-Shot Fine-Grained Image Classification (FS-FGIC) methods have been proposed, which mainly focus on extracting discriminative information within high-level features in a single episode/task. However, this is insufficient for addressing the cross-task challenges of FS-FGIC, which is represented in two aspects. On the one hand, from the perspective of the Fine-Grained Image Classification (FGIC) task, there is a need to supplement the model with mid-level features containing rich fine-grained information. On the other hand, from the perspective of the Few-Shot Learning (FSL) task, explicit modeling of cross-task general knowledge is required. In this paper, we propose a novel Enhanced Bi-directional Task-Guided Network (BTG-Net++) to tackle these issues. Specifically, from the FGIC task perspective, we design the Semantic-Guided Noise Filtering (SGNF) module to filter noise on mid-level features rich in detailed information with the assistance of high-level features. Further, from the FSL task perspective, the General Knowledge Prompt Modeling (GKPM) module is proposed to retain the cross-task general knowledge by utilizing the prompting mechanism, thereby enhancing the model’s generalization performance on unseen novel classes. We have conducted extensive experiments on five fine-grained benchmark datasets, and the results demonstrate that BTG-Net++ shows considerable improvements compared with state-of-the-art methods.

Abstract:
Continual learning aims to acquire new knowledge while retaining past information. Class-incremental learning (CIL) presents a challenging scenario where classes are introduced sequentially. For video data, the task becomes more complex than for image data because it requires learning and preserving both spatial appearance and temporal action involvement. To address this challenge, we propose a novel exemplar-free framework that equips separate spatiotemporal adapters to learn new class patterns, accommodating the incremental information representation requirements unique to each class. While separate adapters are proven to mitigate forgetting and fit unique requirements, naively applying them hinders the intrinsic connection between spatial and temporal information increments, affecting the efficiency of representing newly learned class information. Motivated by this, we introduce two key innovations from a causal perspective. First, a causal distillation module is devised to maintain the relation between spatial-temporal knowledge for a more efficient representation. Second, a causal compensation mechanism is proposed to reduce the conflicts during increment and memorization between different types of information. Extensive experiments conducted on benchmark datasets demonstrate that our framework can achieve new state-of-the-art results, surpassing current exemplar-based methods by 4.2% in accuracy on average. The code is available at https://github.com/tychen-SJTU/CSTA.

Abstract:
Stochastic Human Motion Prediction (HMP) has received increasing attention due to its wide applications. Despite the rapid progress in generative fields, existing methods often face challenges in learning continuous temporal dynamics and predicting stochastic motion sequences. They tend to overlook the flexibility inherent in complex human motions and are prone to mode collapse. To alleviate these issues, we propose a novel method called STCN, for stochastic and continuous human motion prediction, which consists of two stages. Specifically, in the first stage, we propose a spatio-temporal continuous network to generate smoother human motion sequences. In addition, an anchor set, which represents potential human motion patterns, is introduced into the stochastic HMP task to prevent mode collapse. In the second stage, STCN endeavors to acquire the Gaussian mixture distribution (GMM) of observed motion sequences with the aid of the anchor set. It also focuses on the probability associated with each anchor, and employs the strategy of sampling multiple sequences from each anchor to alleviate intra-class differences in human motions. Experimental results on two widely-used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.

Abstract:
Existing point cloud upsampling methods typically treat upsampling as a local interpolation problem, neglecting the importance of global correlations within point sets, which can limit their performance. To address this limitation, we exploit the inherent self-similarity of point clouds from a global perspective and propose PU-GSM, a latent geometry-guided self-similarity model for upsampling. We first generate a lower-resolution sparse sub-point cloud (SPC) by downsampling the input point cloud (IPC). Then, we introduce a latent geometry-guided self-similarity model (LGSM) that learns a point distribution on the underlying surface of SPC by exploiting the inherent self-similarity of IPC. Next, we reuse the LGSM for the remaining points (i.e., the points left after removing SPC from IPC). Afterward, we introduce a gradient-aware dual domain refiner to generate and calibrate the upsampled point cloud from the learned point distribution. Finally, we propose an inference-free latent vector matching approach to regularize the upsampled point cloud by enhancing the feature similarity between the upsampled point cloud and the ground truth in latent space. Extensive experiments show that PU-GSM achieves better upsampling results compared to state-of-the-art methods. Our code will be available at: https://github.com/liuhaoyun/PU-GSM

Abstract:
The likelihood of encountering scenarios that lead to accidents, namely safety-critical scenarios, is minimal compared to long-term safe driving environments. The generation of repeatable and scalable safety-critical scenarios is essential for the advancement of human and autonomous driving capabilities. Compared with the high complexity and low practicality of existing scenario generation methods, in this paper we propose a real-time approach to automatically generate challenging scenarios and instantiate them in a CARLA-based simulator. First, the safety-critical scenario is decomposed into a perturbed and optimized vehicle trajectory and the remaining reusable Unreal Engine assets based on a hierarchical model. Second, a model that is based on a graph conditional variational autoencoder (VAE) is employed to predict future trajectories and head angles based on past information. Third, the safety-critical scene generation model is used to enhance the diversity of the scene by diversifying the latent variables over a pre-trained trajectory representation model. Finally, the trajectories of real-world vehicles are placed into the simulator by adapting them to enable the generation of safety-critical scenes in a three-dimensional environment. The results demonstrate that the proposed approach generates scenarios that are more plausible than those generated by the baselines, with a performance improvement of over 10% in collision metrics for scenario generation. The research facilitates the simplification of the long-tail scenario construction process for autonomous vehicles, which in turn facilitates the optimization of algorithms such as autonomous trajectory planning.

Abstract:
It remains extremely difficult to capture high-quality photographs of low-light scenes. Low light causes a low signal-to-noise ratio (SNR), which makes the image noisy. Such scenes almost always also have a high dynamic range (HDR) problem caused by uneven lighting, where a small area surrounding the light source is very bright while the rest of the scene is very dark, making it very difficult to simultaneously obtain high-quality signals in both the dark and bright areas. This paper presents a new image restoration method for tackling the problems in low-light scenes. Fundamentally differing from existing approaches, the new method borrows ideas from inverse graphics rendering and re-renders the image with a canonical light source, thus correcting the image from first principles. A deep learning based simplified inverse rendering model (SIRM) featuring implicit regularization is first developed for correcting uneven lighting, and then an end-to-end convolutional neural network is constructed for reducing noise. Extensive experimental results are presented to demonstrate that the new method outperforms state-of-the-art methods, and is capable of effectively brightening up dark image regions while at the same time preserving details and color consistency. Our code is available at: https://github.com/pj0927/SIRNet

Abstract:
Underwater images suffer from light absorption and scattering, which impairs their visibility and applications. Existing underwater image restoration (UIR) methods based on generative models struggle to adapt to the complex and dynamic underwater environments characterized by illumination interference, low-light conditions, and non-uniform turbidity. To address these issues, we propose Water-CDM, a novel Adaptive Double-Branch Fusion Conditional Diffusion Model for underwater image restoration. Specifically, an adaptive double-branch fusion conditional diffusion model is presented utilizing a U-shaped full-attention network and Guided Multi-Scale Retinex with Brightness Correction (GMSRBC) to restore the challenging regions within underwater images. More precisely, to correct color casts and enhance the sharpness of underwater images, a U-shaped full-attention network incorporating Attention Blocks is designed for noise estimation during the reverse process of the conditional diffusion model. Concurrently, to mitigate overexposure during the enhancement of low-light underwater images under illumination interference, the GMSRBC method, featuring an Adaptive Brightness Correction Module, is proposed to efficiently adjust the brightness of underwater images. Experimental results demonstrate that the proposed Water-CDM significantly improves the quality of underwater images in challenging scenarios. Encouragingly, our proposed Water-CDM yields superior restoration outcomes compared to current state-of-the-art methods on three challenging publicly available datasets. Our code will be released at: https://github.com/HKandWJJ/Water-CDM

Abstract:
High Definition (HD) maps, containing detailed road information, are essential for autonomous driving and many geo-related tasks. Recent developments in computer vision make it possible to automate the labor-intensive HD map maintenance work, such as localizing traffic signs within a road network. However, updating traffic signs to HD maps is non-trivial, as it not only requires precise geo-location but also requires confirming whether a sign belongs to a specific road. In our work, we develop an end-to-end automated traffic sign update system, termed AutoTS, which is capable of using an image sequence collected during vehicle operation to extract the geo-location of a traffic sign and determine whether it belongs to the road driven on, from its orientation. In AutoTS, we design a noise and sparsity adaptive localization module, which can filter noisy location points and derive a geo-location from sparse location points. To identify the orientation of traffic signs, we devise a position-aware orientation classification module, which uses the ROI feature and the position-aware SIFT feature to explore the orientation characteristic and understand the road context. To facilitate the evaluation of the proposed method, we construct a traffic sign localization and orientation classification benchmark, KITTI-TS. Our AutoTS achieves an MAE of 2.38 meters in traffic sign localization, while the accuracy in orientation classification reaches 88.89%.

Abstract:
Deep-learning-based structured light 3D reconstruction technology (SL3D) provides excellent solutions for intelligent manufacturing. However, the scarcity of real-world datasets covering full-process data and diverse objects hampers the validation of new ideas. Moreover, limited research on dataset construction strategies, such as scene backgrounds and sample distribution, reduces network performance. We investigate the impact of background stability on foreground accuracy (BS-FA) and find a whiteboard background improved foreground prediction accuracy by up to 82% over a black background. Guided by BS-FA, we develop the SL3D-BF, a background-effective SL3D dataset for industrial use, featuring approximately 2,100 scenes with diverse objects like metal/plastic workpieces, plaster sculptures, and standard parts for precise evaluation. It uniquely includes shadow and foreground masks, absent in prior datasets, and offers full-process data from gratings to 3D point clouds, totaling 100,800 gratings. We also establish an initial benchmark for future research by conducting evaluation experiments with advanced methods. Furthermore, we investigate the relationship between the spatial frequency of sample occurrence and the model predictive ability to minimize the time and resource demands of dataset construction. Most importantly, SL3D-BF is also a valuable resource for tasks like depth estimation, defect detection, and semantic segmentation. The dataset is available at: https://github.com/LiYiMingM/Dataset_SL3D_BF

Abstract:
Text-to-3D generation enables the creation of 3D content with infinite possibilities. Existing methods typically involve training 3D generative models, which suffer from poor semantic alignment due to the scarcity of paired 3D data, or optimizing a 3D representation with 2D diffusion guidance, resulting in slow inference, low diversity, and Janus problems. In this paper, we introduce InstantDreamer, a model designed for text-guided 3D-aware generation in a single forward pass without requiring paired training datasets, thereby enhancing efficiency. To accomplish this, we extend score distillation to learn a 3D-aware semantics distribution. We distill priors from diffusion models into a 3D-aware generator, amortizing the optimization time required for new prompts and eliminating the necessity of paired training data. We equip the generator with hierarchical semantics conditioning, explicitly allowing the model to perceive the correspondence between the text distribution and the 3D latent space. Our elaborate designs empower our 3D generative model with multi-view semantic consistency and feed-forward 3D generation capabilities, thus eliminating the need for score distillation-based optimization for each prompt. Both quantitative and qualitative results on the mainstream benchmarks demonstrate that our InstantDreamer generates competitive multi-view semantic consistent 3D assets compared with state-of-the-art methods. Our method outperforms previous approaches in terms of CLIP R-Precision (66.31) and FID (28.47) while also exhibiting a significant boost in generation speed.

Abstract:
Screen-shooting watermarking technology plays a critical role in copyright protection and traceability. However, existing methods often lack sufficient robustness under strong noise interference and tend to introduce noticeable visual artifacts when embedding watermarks in smooth image regions, thereby degrading visual quality and increasing the risk of watermark exposure. To address these limitations, this paper proposes a Contrastive Learning and Mask-guided Embedding (CLME) framework for robust screen-shooting watermarking. The framework comprises two key components: 1) a mask-guided watermark embedding module that utilizes a Residual Dense Feature Extraction Block (RDFEB) and an Attention Mask Generation Block (AMGB) to adaptively embed watermarks into texture-rich regions, improving watermark invisibility; and 2) a contrastive learning-based watermark decoding network that employs contrastive loss to enhance the consistency of decoded features by treating features from the same watermarked image under different noise conditions as positive samples and features from different watermarked images as negative samples, thereby improving the robustness of watermark extraction. Experimental results demonstrate that the proposed CLME framework outperforms existing methods in terms of both robustness and visual quality. Specifically, at a shooting distance of 100 cm and a shooting angle of 40°, the watermark extraction accuracy reaches 99.58%, and the peak signal-to-noise ratio (PSNR) of the watermarked images reaches 42.624 dB, highlighting the framework’s strong potential for real-world applications.
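
The positive/negative pairing described above can be sketched with a standard InfoNCE-style contrastive loss. This is a generic formulation assumed for illustration, not the CLME loss itself: the functions cosine and info_nce, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive than to negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical decoded features: the same watermarked image under two noise
# conditions (anchor, positive) and a different watermarked image (negative).
anchor = [1.0, 0.0]
same_image_other_noise = [1.0, 0.1]
different_image = [0.0, 1.0]

consistent = info_nce(anchor, same_image_other_noise, [different_image])
inconsistent = info_nce(anchor, different_image, [same_image_other_noise])
print(consistent < inconsistent)  # True
```

Minimizing such a loss pulls together features of the same watermarked image across noise conditions and pushes apart features of different watermarked images, which matches the consistency goal stated in the abstract.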

Abstract:
With the widespread adoption of smart devices and social networking platforms, the development of robust image steganography techniques for public lossy channels has become increasingly crucial. Among JPEG-resistant steganographic methods, PMAS (Postprocessing and precise dither Modulation based robust Adaptive Steganography) has demonstrated superior performance by utilizing high-quality images and maintaining resilience against aggressive compression. This method achieves remarkable concealment in user-shared images, presenting substantial challenges to public communication security. To counter this threat, we propose a specialized lightweight Scale-Free Network for mining Clues in downward JPEG-resistant steganography (SF-ClueNet), specifically designed to identify vulnerabilities in PMAS despite its sophisticated anti-detection mechanisms. Departing from conventional approaches that depend on high-pass filter residuals, SF-ClueNet extracts comprehensive global statistical features, enabling effective detection of dispersed steganographic artifacts. When integrated with a lightweight residual feature miner, our method maintains pattern recognition capabilities as image dimensions increase, ensuring consistent detection performance. Experimental results demonstrate that SF-ClueNet significantly enhances detection accuracy, exhibits robust performance against data distribution shifts with minimal transfer loss, and supports direct analysis of high-resolution images. These advanced capabilities position SF-ClueNet as a viable and efficient solution for practical steganalysis applications across diverse operational environments.

Abstract:
In recent years, the multimedia forensics and security community has seen remarkable progress in multitask learning for DeepFake (i.e., face forgery) detection. The prevailing approach has been to frame DeepFake detection as a binary classification problem augmented by manipulation-oriented auxiliary tasks. This scheme focuses on learning features specific to face manipulations with limited generalizability. In this paper, we delve deeper into semantics-oriented multitask learning for DeepFake detection, capturing the relationships among face semantics via joint embedding. We first propose an automated dataset expansion technique that broadens current face forgery datasets to support semantics-oriented DeepFake detection tasks at both the global face attribute and local face region levels. Furthermore, we resort to the joint embedding of face images and labels (depicted by text descriptions) for prediction. This approach eliminates the need for manually setting task-agnostic and task-specific parameters, which is typically required when predicting multiple labels directly from images. In addition, we employ bi-level optimization to dynamically balance the fidelity loss weightings of various tasks, making the training process fully automated. Extensive experiments on six DeepFake datasets show that our method improves the generalizability of DeepFake detection and renders some degree of model interpretation by providing human-understandable explanations.

Abstract:
Multi-label image classification aims to classify all categories in images simultaneously. When current multi-label classification methods meet fine-grained objects in a single image, extreme inter-class similarity and over-prediction are two major challenges that hinder model performance. To solve these two problems, we propose the Voronoi density based Locally Unique Network (VoLUNet). First, due to the high correlation between predictions of different classes, following the Kolmogorov-Arnold Network (KAN), we design the Weak Inter-class Correlation Classifier (WIC-Classifier) to replace the linear weight setting in the MLP architecture, promoting the potential of fine-grained discrimination. Second, we propose a Local Non-Maximum Suppression (Local-NMS) loss for the multi-label classification model, predicting only one unique class with a high prediction value for each local region. Third, because different classes may have different pixel proportions and the Local-NMS loss would be imbalanced across diverse fine-grained classes, we design the Voronoi Density based Superpixel Module (VDSM) to balance the quantities of local feature vectors of different classes. Finally, comprehensive experiments are conducted on four datasets, TreeSatAI, GeoLifeCLEF, FothemNet and ShipRSImageNet, and our VoLUNet significantly improves classification performance compared to current state-of-the-art models. The code of this paper is publicly available at https://github.com/cv516Buaa/BinghaoLiu/tree/main/VoLUNet

Affiliations: School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China; School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Sai Kung, Hong Kong

Abstract:
Mirror detection is a challenging task, due to the reflective properties of mirrors. Most existing approaches rely on exploiting the relationship between the content inside the mirror and the surrounding environment to aid in locating mirrors. A typical solution is to utilize contextual contrasted features. However, the discontinuity in content at the edges of mirrors may not always be prominent. To overcome this limitation, we propose a novel mirror detection framework called S2MD including two main modules, multi-directional similarity perception module (MSPM) and spectral saliency enhancement decoder module (SSEDM). Specifically, we employ a backbone network to extract multi-scale global information from images using a dual-path approach. Then, we feed these high-level dual-path features into MSPMs to generate direction-sensitive similarity-consistent features. MSPM utilizes active rotating filters and oriented response pooling to model the similarity relations in different orientations. Moreover, the SSEDM is utilized to enhance the spatial contextual contrasted features using feature spectral residuals and fuse the dual-path features to obtain the final predicted mirror mask. Extensive experiments demonstrate that our method achieves state-of-the-art performance on challenging MSD, PMD, and RGBD-Mirror benchmarks. The code is available at https://github.com/RuiChen-stack/M2SD

Abstract:
The semantic bird’s-eye-view (BEV) map is an efficient data representation for environment perception in autonomous driving. In real driving scenarios, the collected sensory data usually exhibit class imbalance. For example, road layouts are often the majority classes and road objects are the minority. Such imbalanced data could lead to inferior performance in BEV map generation, particularly for minority objects due to insufficient learning samples. This work attempts to mitigate this issue from the perspective of network and loss function design. To this end, a diffusion-guided semantic BEV map generation network with a boundary-aware loss is proposed. The network learns the underlying distribution of the data, including the relationship between majority and minority classes. The boundary-aware loss increases weighting for minority classes during training, making the network focus on these classes. Experimental results on a public dataset demonstrate the superiority of our method over state-of-the-art methods and its effectiveness in addressing the class imbalance issue.
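
The idea of increasing the loss weighting for minority classes can be illustrated with a simple inverse-frequency weighted cross-entropy. This is a generic stand-in, not the boundary-aware loss proposed in the paper: the function names, the inverse-frequency rule, and the toy pixel counts are assumptions for this sketch.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Hypothetical pixel counts: class 0 (road layout) dominates class 1 (road object).
weights = inverse_frequency_weights([90, 10])
probs = [0.5, 0.5]
majority_loss = weighted_cross_entropy(probs, 0, weights)
minority_loss = weighted_cross_entropy(probs, 1, weights)
print(minority_loss > majority_loss)  # True
```

For the same predicted probability, an error on the minority class costs more, which pushes training to attend to the under-represented classes.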

Abstract:
Multi-view clustering (MVC), which integrates information from multiple views to enhance performance, has garnered increasing attention in recent years. Partially View-aligned Clustering (PVC), which is a particularly critical aspect of this process, requires a thorough exploration of complementary and consistent information under conditions of partial view alignment. However, most existing PVC methods primarily focus on semantic consistency, employing semantic consistency features for both view alignment and clustering tasks. These methods neglect the effects of noise and complementary information across multiple views and the suitability of these features for clustering. To address these limitations, our approach aims to leverage three distinct types of consistency to extract semantic consistency features and clustering consistency features, which are specifically designed for view alignment and clustering tasks, respectively. By omitting the reconstruction process, we mitigate the adverse effects of mutual information and noise on view alignment. Specifically, we first exploit the structural consistency of similarity graphs across different views to guide feature extraction in view-specific autoencoders. This process produces structural consistency features that are both cluster-discriminative and structurally coherent. Subsequently, two separate multilayer perceptrons (MLPs) are trained via contrastive learning to extract semantic consistency features and clustering consistency features from the structural features. These features are optimized for their respective tasks. Ultimately, a self-paced style view alignment strategy is used to iteratively re-align the data based on semantic and clustering consistency while the model is optimized via the re-aligned data. Extensive experiments on multiple real-world benchmark datasets demonstrate that our method outperforms the state-of-the-art multi-view approaches, highlighting its effectiveness in tackling the challenges of PVC. The code is available at https://github.com/kongyiH/TCLPVC

Abstract:
Transductive zero-shot learning (TZSL) has been proposed to address the domain shift problem by leveraging additional unlabeled unseen data to enhance the generalization ability from seen classes to unseen target classes. Existing TZSL methods primarily focus on mitigating the distribution bias problem by incorporating these unlabeled samples into the generative models. Although these methods have achieved great success, they do not fully exploit the potential of these unlabeled target data. In this paper, we propose a bidirectional weakly guided conditional generative modeling approach, which utilizes the attribute regressor and the visual generator to synthesize paired training data of unseen classes for each other, thus converting unlabeled target data into matched feature-attribute pairs. Additionally, on top of the generative modeling, we also propose to progressively estimate the associations between visual features and attributes among the unlabeled target data through a semi-supervised pseudo-labeling approach, so as to further facilitate the generative model and enhance the learning of target distributions. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method, which achieves superior performance over state-of-the-art methods. Our source code is released at https://github.com/LevisWei/semi-zero-master.

Abstract:
Most existing approaches for image and video compression perform transform coding in the pixel space to reduce redundancy. However, due to the misalignment between pixel-space distortion and human perception, such schemes often have difficulty achieving both high realism and high fidelity at ultra-low bitrates. To solve this problem, we propose Generative Latent Coding (GLC) models for image and video compression, termed GLC-image and GLC-video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel space, such a latent space offers greater sparsity, richer semantics and better alignment with human perception, and shows its advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, a code-prediction-based loss function is proposed to enhance semantic consistency. Experiments demonstrate that our scheme shows high visual quality at ultra-low bitrates for both image and video compression. For image compression, GLC-image achieves an impressive bitrate of less than 0.04 bpp, matching the FID of the previous SOTA model MS-ILLM while using 45% fewer bits on the CLIC 2020 test set. For video compression, GLC-video achieves 65.3% bitrate saving over PLVC in terms of DISTS.

Abstract:
In recent years, 3D point cloud compression (PCC) has emerged as a prominent research area, attracting widespread attention from both academia and industry. As one of the PCC standards released by the Moving Picture Experts Group (MPEG), geometry-based PCC (G-PCC) adopts two lossy attribute coding schemes, namely the prediction-based Lifting Transform and the region adaptive hierarchical transform (RAHT). Statistical analysis shows that increasing the prediction distance gradually weakens the attribute correlation between points, resulting in larger prediction errors. To address this issue, we propose a prediction enhancement method that uses a smoothing filter to improve attribute coding efficiency, and we integrate it into both the Lifting Transform and RAHT. For the former, a neighbor point smoothing method based on the prediction order is proposed via a weighted average strategy. The proposed smoothing is applied only to points in the lower levels of detail (LoDs) by adjusting the distance-based predicted attribute values. For the latter, we design a neighbor node smoothing method after the inter depth up-sampling (IDUS) prediction, where the sub-nodes in the same unit node are filtered for lower levels. Experimental results demonstrate that, compared with the two latest MPEG G-PCC reference software versions, TMC13-v23.0 and GeSTM-v3.0, the proposed enhanced prediction method achieves superior Bjøntegaard delta bit rate (BDBR) gains with a small increase in time complexity.

Abstract:
Unsupervised Visible-Infrared Person Re-Identification (USVI-ReID) aims to match visible and infrared person images without relying on prior annotations. Recently, unsupervised contrastive learning methods have become the mainstream approach for USVI-ReID, leveraging clustering algorithms to generate pseudo-labels. However, these methods often suffer from inherently noisy pseudo-labels, which significantly hinder their performance. To address this challenge, we propose an Adaptive Pseudo-label Purification and Debiasing (APPD) framework for USVI-ReID, which is designed to calibrate noisy pseudo-labels and dynamically detect clean pseudo-labels, thereby enhancing the model’s performance and reliability. Specifically, we propose an Adaptive Pseudo-label Calibration and Division (APCD) module, which calibrates noisy pseudo-labels by assessing their reliability and divides pseudo-labels into clean and noisy subsets, ensuring a more focused and accurate learning process. Based on the calibrated pseudo-labels, we develop an Optimal Transport Prototype Matching (OTPM) module to establish robust cross-modality correspondences. For clean pseudo-labels, we propose a Debiased Memory Hybrid Learning (DMHL) module, which jointly captures modality-specific and modality-invariant information while addressing sampling bias to enhance feature representation. To effectively utilize noisy pseudo-labels, we introduce a Neighbor Relation Learning (NRL) module that mitigates intra-class variations by exploring neighbor relationships in the feature space. Comprehensive experiments conducted on two widely recognized USVI-ReID benchmarks demonstrate that APPD achieves state-of-the-art performance, significantly outperforming existing methods. The source code will be made available at https://github.com/XiangboYin/RPNR.

Abstract:
Deep learning models are increasingly being employed in steganographic schemes for the embedding and extraction of secret information. However, steganographic models themselves are also at risk of detection and attacks. Although there are approaches proposed to hide deep learning models, making these models difficult to detect while achieving high-quality image steganography performance remains a challenging task. In this work, a robust image steganography method based on a camouflage model CamStegNet is proposed. The steganographic model is camouflaged as a routine deep learning model to significantly enhance its concealment. A sparse weight-filling paradigm is designed to enable the model to be flexibly switched among three modes by utilizing different keys: routine machine learning task, secret embedding task and secret recovery task. Furthermore, a residual state-space module and a neighborhood attention mechanism are constructed to improve the performance of image steganography. Experiments conducted on the DIV2K, ImageNet and COCO datasets demonstrate that the stego images generated by CamStegNet are superior to existing methods in terms of visual quality. They also exhibit enhanced resistance to steganalysis and maintain over 95% robustness against noise and scale attacks. Additionally, the model demonstrates high robustness which can achieve excellent performance in machine learning tasks and maintain stability across various weight initialization methods.

Affiliations: School of Computer Science and Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China; Oxford Suzhou Centre for Advanced Research, Suzhou, China; Department of Computer Science, University of Bath, Bath, U.K.; School of Mathematics and Statistics, University of Glasgow, Glasgow, U.K.; School of Computer Science, Fudan University, Shanghai, China; Bayes Business School, City, University of London, London, U.K.; Department of Electrical Engineering and Computer Science College of Engineering, University of Michigan, Ann Arbor, MI, USA; Department of Computer Science, University of Colorado at Boulder, Boulder, CO, USA

Abstract:
Denoising-based diffusion models have attained impressive image synthesis; however, their application to videos can lead to unaffordable computational costs due to the per-frame denoising operations. In pursuit of efficient video generation, we present a Diffusion Reuse MOtion (Dr. Mo) network to accelerate the video-based denoising process. Our crucial observation is that the latent representations in early denoising steps between adjacent video frames exhibit high consistency with motion clues. Inspired by this discovery, we propose to accelerate the video denoising process by incorporating lightweight, learnable motion features. Specifically, Dr. Mo computes all denoising steps only for base frames. For a non-base frame, Dr. Mo propagates the pre-computed base latents of a particular step with inter-frame motions to obtain a fast estimation of its coarse-grained latent representation, from which the denoising continues to obtain more sensitive and fine-grained representations. On top of this, Dr. Mo employs a meta-network named Denoising Step Selector (DSS) to dynamically determine the step at which to perform motion-based propagation for each frame, ensuring the correct transformation of multi-granularity visual features. Extensive evaluations on video generation and editing tasks indicate that Dr. Mo delivers widely applicable acceleration for diffusion-based video generation while effectively retaining the visual quality and style. Video generation and visualization results can be found at https://drmo-denoising-reuse.github.io.

Abstract:
Quad meshes are essential in geometric modeling and computational mechanics. Although learning-based methods for triangle meshes have advanced considerably, quad mesh generation remains less explored due to the challenge of ensuring coplanarity, convexity, and quad-only meshes. In this paper, we present Point2Quad, the first learning-based method for quad-only mesh generation from point clouds. The key idea is learning to identify quad meshes with fused pointwise and facewise features. Specifically, Point2Quad begins with a k-NN-based candidate generation that considers coplanarity and squareness. Then, two encoders extract geometric and topological features that address the challenge of quad-related constraints, especially by combining in-depth quadrilateral-specific characteristics. Subsequently, the extracted features are fused to train the classifier with a designed compound loss. The final results are obtained after refinement by a quad-specific post-processing step. Extensive experiments on both clean and noisy data demonstrate the effectiveness and superiority of Point2Quad compared to baseline methods under comprehensive metrics. The code and dataset are available at https://github.com/cognaclee/Point2Quad.

Abstract:
Digital images captured by unstable imaging systems often simultaneously suffer from random noise and stripe noise. Due to the complex noise distribution, denoising and destriping methods based on simple handcrafted priors may leave residual noise. Although supervised methods have achieved some progress, they rely on large-scale noisy-clean image pairs, which are challenging to obtain in practice. To address these problems, we propose a self-supervised image denoising and destriping method based on blind-spot regularization, named Self-BSR. This method transforms the overall denoising and destriping problem into a modeling task for two spatially correlated signals: image and stripe. Specifically, blind-spot regularization leverages spatial continuity learned by the improved blind-spot network to separately constrain the reconstruction of image and stripe while suppressing pixel-wise independent noise. This regularization has two advantages: first, it is adaptively formulated based on implicit network priors, without any explicit parametric modeling of image and noise; second, it enables Self-BSR to learn denoising and destriping only from noisy images. In addition, we introduce the directional feature unshuffle in Self-BSR, which extracts multi-directional information to provide discriminative features for separating image from stripe. Furthermore, the feature-resampling refinement is proposed to improve the reconstruction ability of Self-BSR by resampling pixels with high spatial correlation in the receptive field. Extensive experiments on synthetic and real-world datasets demonstrate significant advantages of the proposed method over existing methods in denoising and destriping performance. The code will be publicly available at https://github.com/Jocobqc/Self-BSR

Abstract:
Reconstructing urban street scenes is crucial due to its vital role in applications such as autonomous driving and urban planning. These scenes are characterized by long, narrow camera trajectories, occlusion, complex object relationships, and sparse data across multiple scales. Despite recent advancements, existing surface reconstruction methods, which are primarily designed for object-centric scenarios, struggle to adapt effectively to the unique characteristics of street scenes. To address this challenge, we introduce StreetSurfGS, the first method to employ Gaussian Splatting specifically tailored for scalable urban street scene surface reconstruction. StreetSurfGS utilizes a planar-based octree representation and segmented training to reduce memory costs, accommodate unique camera characteristics, and improve scalability. Additionally, to mitigate depth inaccuracies caused by object overlap, we propose a guided smoothing strategy within regularization to eliminate inaccurate boundary points and outliers. Furthermore, to address sparse views and multi-scale challenges, we use a dual-step matching strategy that leverages adjacent and long-term information. Extensive experiments validate the efficacy of StreetSurfGS in both novel view synthesis and surface reconstruction.

Abstract:
Omnidirectional image quality assessment (OIQA) has become an increasingly vital problem in recent years. Most previous no-reference OIQA methods extract only local features from the distorted viewports, or only global features from the entire distorted image, and thus lack interaction and fusion between local and global features. Moreover, the lack of reference information also limits their performance. Thus, we propose a no-reference OIQA model that consists of three novel modules: a bidirectional pseudo-reference module, a Mamba-based global feature extraction module, and a multi-scale local-global feature aggregation module. Specifically, by considering the image distortion degradation process, a bidirectional pseudo-reference module capturing the error maps on viewports is first constructed to refine the multi-scale local visual features, which supplies rich quality degradation reference information without the reference image. To complement the local features, the VMamba module is adopted to extract representative multi-scale global visual features. Inspired by human hierarchical visual perception characteristics, a novel multi-scale aggregation module is built to strengthen feature interaction and fusion so that deep semantic information can be extracted. Finally, motivated by the multi-task managing mechanism of the human brain, a multi-task learning module is introduced to assist the main quality assessment task by mining the hidden information in compression type and distortion degree. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art performance on the no-reference OIQA task compared to other models.

Abstract:
LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel-based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on Cartesian or cylindrical coordinates. Although these methods show impressive performance, they have two drawbacks: 1) they require a sufficiently large input voxel resolution, which incurs high computation cost and memory consumption; 2) they do not handle the unbalanced point distribution of LiDAR point clouds well. In this paper, we propose a non-uniform cylindrical partition network named NUC-Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non-uniformly partition the radial axis and generate a voxel representation that is representative and efficient. Moreover, we propose a non-uniform multi-scale aggregation method to improve contextual information. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Moreover, our method can serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of its uniform counterpart with 4× faster training, 2× GPU memory reduction, and 3× inference speedup. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at https://github.com/alanWXZ/NUC-Net.
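
The idea of partitioning the radial axis with intervals that follow an arithmetic progression can be illustrated with a minimal sketch. This is not the NUC-Net implementation: the function names, the parameters (`r_max`, `num_bins`, `first_width`), and the choice to fix the first interval width are all assumptions made for illustration. The sketch only shows that bins can be made narrow near the sensor, where points are dense, and wider far away.

```python
from bisect import bisect_right


def api_bin_edges(r_max, num_bins, first_width):
    """Radial bin edges whose widths form an arithmetic progression.

    Hypothetical helper: widths are first_width, first_width + d, ...,
    with the common difference d chosen so the widths sum to r_max.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    # Sum of widths: num_bins * first_width + d * num_bins * (num_bins - 1) / 2
    d = 2.0 * (r_max - num_bins * first_width) / (num_bins * (num_bins - 1))
    edges = [0.0]
    for k in range(num_bins):
        edges.append(edges[-1] + first_width + k * d)
    edges[-1] = r_max  # guard against floating-point drift
    return edges


def radial_bin(r, edges):
    """Index of the radial bin containing distance r (clamped to valid bins)."""
    idx = bisect_right(edges, r) - 1
    return min(max(idx, 0), len(edges) - 2)


edges = api_bin_edges(r_max=50.0, num_bins=5, first_width=2.0)
print(edges)                    # bin widths grow with distance
print(radial_bin(9.0, edges))   # bin index of a point 9 m from the sensor
```

With a uniform partition the same five bins would each be 10 m wide; here the first bin covers 2 m and the last covers 18 m, so resolution is concentrated where the point density is highest.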

Abstract:
High-quality LiDAR point cloud (LPC) coding is essential for efficiently transmitting and storing the vast amounts of data required for accurate 3D environmental representation. The Octree-based entropy coding framework has emerged as the predominant method; however, previous studies usually rely excessively on large-scale attention-based context prediction to encode Octree nodes, overlooking the inherent correlational properties of this structure. In this paper, we propose a novel Graph-driven Attention-based Entropy Model (GAEM), which adopts partitioned graph attention mechanisms to uncover contextual dependencies among neighboring nodes. Unlike the Cartesian coordinate-based coding mode, which has higher redundancy, GAEM uses a multi-level spherical Octree to organize point clouds, improving the quality of LPC reconstruction. GAEM combines graph convolution for node feature embedding and grouped-graph attention for exploiting dependency among contexts, which preserves performance at low computational cost by using localized nodes. Besides, to further increase the receptive field, we design a high-resolution cross-attention module that introduces sibling nodes. Experimental results show that our method achieves state-of-the-art performance on the LiDAR benchmark SemanticKITTI and the MPEG-specified dataset Ford, compared to all baselines. Compared to the benchmark GPCC, our method achieves gains of up to 53.9% and 53.6% on SemanticKITTI and Ford, while compared to the sibling-introduced methods, we achieve up to 42.3% and 44.7% savings in encoding/decoding time. In particular, our GAEM allows for extension to downstream tasks (i.e., vehicle detection and semantic segmentation), further demonstrating the practicality of the method.

Abstract:
Applying knowledge distillation to virtual try-on tasks is challenging because current methods fail to fully and efficiently exploit responsible teacher knowledge. In other words, existing approaches merely transfer prior knowledge to the student model via pseudo-labels generated by the teacher model, resulting in shallow knowledge representation and low training efficiency. To address these limitations, we propose a novel teacher-student architecture for parser-free virtual try-on, named GLV, which generates high-quality try-on results with realistic body details. Specifically, we propose a deformation-related prior distillation method to effectively leverage the valuable deformation information contained in the teacher warpage model. This enhances the convergence efficiency of the student warpage model, preventing it from getting stuck in a local minimum. Moreover, we are the first to propose a geometric correlation distillation, which models the underlying geometric relationship between clothing and the person and transfers this relationship from the teacher to the student. This enables the student warpage model to reduce the entanglement of deformation-irrelevant features, such as color and texture. Finally, we propose a clothing-body retouching method for try-on result synthesis, which refines the denoising process in the latent space of a well-trained diffusion model, thereby preventing catastrophic forgetting. This method seamlessly transforms the parser-based inpainting synthesis paradigm into a parser-free synthesis paradigm and enables efficient convergence of the diffusion model with only fine-tuning. Extensive experiments demonstrate the generality of our approach and highlight its superiority over previous methods.

Abstract:
Few-shot segmentation (FSS) methods aim to segment objects using only a few pixel-level annotated samples. Current approaches either derive a generalized class representation from support samples to guide the segmentation of query samples, which often discards crucial spatial contextual information, or rely heavily on spatial affinity between support and query samples, without adequately summarizing and utilizing the core information of the target class. Consequently, the former struggles with fine detail accuracy, while the latter tends to produce errors in overall localization. To address these issues, we propose a novel FSS framework, CCFormer, which balances the transmission of core semantic concepts with the modeling of spatial context, improving both macro and micro-level segmentation accuracy. Our approach introduces three key modules: 1) the Concept Perception Generation (CPG) module, which leverages pre-trained category perception capabilities to capture high-quality core representations of the target class; 2) the Concept-Feature Integration (CFI) module, which injects the core class information into both support and query features during feature extraction; and 3) the Contextual Distribution Mining (CDM) module, which utilizes a Brownian Distance Covariance matrix to model the spatial-channel distribution between support and query samples, preserving the fine-grained integrity of the target. Experimental results on the PASCAL-5^i and COCO-20^i datasets demonstrate that CCFormer achieves state-of-the-art performance, with visualizations further validating its effectiveness. Our code is available at github.com/lourise/ccformer.

Abstract:
Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.

Abstract:
Recent advancements in learned rate-distortion optimization (RDO) showcase that by making the intra coding decisions based on a learned measure, the encoding can be significantly accelerated without incurring much coding loss. Despite great progress in complexity reduction, the dependency issue has been largely neglected in the current learned RDO research. In this study, aiming to tap the full potential of dependent learned RDO, we first derive a probabilistic RDO framework for theoretical analysis, under which the classic and the learned RDO problems are equivalent to the maximum a posteriori (MAP) inference and the distribution imitation, respectively. Subsequently, we probabilistically revisit dependency considerations in the intra RDO research. Our key finding is that the existing learned RDO scheme can only produce a measure that indicates the local “goodness” of coding decisions. We therefore further discuss the opportunities for learning a dependent measure that is more optimal in the long run. Finally, as learning an accurate measure for the full decision space could be extremely challenging, taking the High Efficiency Video Coding (HEVC) intra coding as a case study, we experimentally identify that the prediction decision accounts for the majority of the dependent optimization gain and is of the utmost value to be learned, paving the way for future research on dependent learned RDO.

Abstract:
Due to the challenges of obtaining data from valuable targets, few-shot learning plays a critical role in synthetic aperture radar (SAR) target recognition. However, the high noise levels and complex backgrounds inherent in SAR data make this technology difficult to implement. To improve the recognition accuracy, in this paper, we propose a novel vision-language framework, VLF-SAR, with two specialized models: VLF-SAR-P for polarimetric SAR (PolSAR) data and VLF-SAR-T for traditional SAR data. Both models start with a frequency embedded module (FEM) to generate key structural features. For VLF-SAR-P, a polarimetric feature selector (PFS) is further introduced to identify the most relevant polarimetric features. Also, a novel adaptive multimodal triple attention mechanism (AMTAM) is designed to facilitate dynamic interactions between different kinds of features. For VLF-SAR-T, after FEM, a multimodal fusion attention mechanism (MFAM) is correspondingly proposed to fuse and adapt information extracted from frozen contrastive language-image pre-training (CLIP) encoders across different modalities. Extensive experiments on the OpenSARShip2.0, FUSAR-Ship, and SAR-AirCraft-1.0 datasets demonstrate the superiority of VLF-SAR over some state-of-the-art methods, offering a promising approach for few-shot SAR target recognition.

Abstract:
Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention akin to how humans perceive dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection, little attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges these two distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn the spatio-temporal representations at progressively increased resolutions, while incorporating effective cross-scale saliency information to produce a robust representation. Furthermore, task-specific decoders are proposed to perform the final prediction for each task. To the best of our knowledge, this is the first work to explore the design of a unified framework for both saliency modeling tasks. Convincing experiments demonstrate that the proposed UniST achieves superior performance across eight challenging benchmarks for the two tasks, outperforming other state-of-the-art methods in most metrics. The project page is https://junwenxiong.github.io/UniST.

Abstract:
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives. Traditional methods like morphing often lack artistic appeal and require specialized skills, limiting their effectiveness. Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes. We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training. Our method leverages Gaussian Process Regression ($\mathcal{GPR}$) to model latent representations, ensuring smooth and dynamic transitions between frames. Additionally, we introduce interpolation-based conditional controls and a Frequency-aware Bidirectional Fusion (FBiF) architecture to enhance temporal control and transition reliability. Evaluations on benchmark datasets and custom image pairs demonstrate the effectiveness of our approach in generating high-quality smooth transition videos. The project is provided at https://sobeymil.github.io/tvg.com.
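
As a rough illustration of how Gaussian process regression can produce smooth intermediate latents between two endpoint frames, the sketch below computes the posterior mean of a zero-mean GP with an RBF kernel, conditioned on a start latent at t = 0 and an end latent at t = 1. It is a simplified assumption, not the TVG method: the kernel, the `length_scale` value, the noise-free observations, and the function names are all hypothetical, and latents are represented as plain lists of floats.

```python
import math


def rbf(a, b, length_scale):
    """Squared-exponential kernel between two scalar time stamps."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def gpr_interpolate(z_start, z_end, t, length_scale=0.5):
    """Posterior mean at time t of a zero-mean GP observed at t=0 and t=1.

    With two noise-free observations the 2x2 kernel matrix
    [[1, k01], [k01, 1]] is inverted in closed form.
    """
    k01 = rbf(0.0, 1.0, length_scale)
    det = 1.0 - k01 * k01
    k_t0 = rbf(t, 0.0, length_scale)
    k_t1 = rbf(t, 1.0, length_scale)
    # Weights are k_*^T K^{-1}, applied to every latent dimension.
    w0 = (k_t0 - k01 * k_t1) / det
    w1 = (k_t1 - k01 * k_t0) / det
    return [w0 * a + w1 * b for a, b in zip(z_start, z_end)]


z0, z1 = [0.0, 2.0], [4.0, 2.0]
for step in range(5):
    t = step / 4
    print(t, gpr_interpolate(z0, z1, t))
```

The posterior mean reproduces the endpoint latents exactly at t = 0 and t = 1. Because the prior mean is zero, the two weights need not sum to one at intermediate times, which is one way this differs from plain linear interpolation.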

Abstract:
Catastrophic forgetting is a crucial challenge for continual learning. One of the state-of-the-art approaches is orthogonal projection, which aims to learn each task by updating model parameters in the direction orthogonal to the subspace spanned by the inputs of previous tasks. Although such strict orthogonal weight constraints ensure no interference with previously learned tasks and thus achieve model stability, they greatly sacrifice model plasticity. In this paper, we propose an adaptive balanced orthogonal projection (AdaBOP) method to search for the optimal network parameter updating direction and thereby address the plasticity-stability dilemma in continual learning. The proposed AdaBOP method can adaptively adjust its tendency in the plasticity-stability trade-off based on the layer-wise feature space correlations of the model between old and new tasks. To further improve the training efficiency, we also implement the AdaBOP method in the uncentered covariance matrix space of the previous tasks, and efficiently achieve a better stability-plasticity trade-off in continual learning. Experimental results demonstrate the effectiveness of the proposed method, which achieves performance superior to state-of-the-art continual learning approaches. The code is available at https://github.com/hyscn/AdaBOP.
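
The strict orthogonal projection baseline described above can be sketched in a few lines. This is not AdaBOP itself, only the strict projection it relaxes, and the helper names are hypothetical: a gradient is stripped of every component lying in the subspace spanned by previous-task inputs, so the update cannot change the layer's response to those inputs.

```python
def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of previous-task input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, basis):
    """Remove from grad its component inside the span of the basis."""
    g = list(grad)
    for b in basis:
        coeff = sum(gi * bi for gi, bi in zip(g, b))
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


old_inputs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy previous-task inputs
basis = orthonormal_basis(old_inputs)
update = project_out([3.0, 4.0, 5.0], basis)
print(update)  # only the direction unused by old inputs survives
```

In this toy case two of the three directions are already occupied by old inputs, so most of the gradient is discarded. That loss of usable update directions is the plasticity cost the abstract refers to.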

Abstract:
An airborne monocular visual pose measurement method is proposed for autonomous landing in unknown environments, as there is currently no effective technical method to perform such tasks. The method begins by designing a multi-task neural network to extract semantic segmentation, depth, and slope information from monocular images. Next, a 2D grid-scene information map is constructed to evaluate candidate regions. Candidate regions are identified from this map, and a Radon transform operation is applied to search for suitable landing regions. The optimal landing region is then chosen based on the characteristics of the aircraft and represented using keypoints. This representation facilitates precise relative pose measurement between the aircraft and the optimal landing region. To validate the method, the absolute scale of the environment is obtained from consecutive monocular images, and the method is tested on both synthetic and real images. The results confirm that this method ensures efficient and accurate autonomous landing measurements.

Abstract:
Weakly-supervised video anomaly detection (WS-VAD) aims to identify fine-grained anomalies from sparse video-level labels and has gained increasing attention in recent years due to its various applications such as disaster warning and public security. Recent studies typically formulate WS-VAD as a multi-instance learning (MIL) problem. However, they neglect the instance creation process and simply apply a uniform temporal pooling (UTP) operation to obtain the training instances, leading to severe anomaly contamination and dilution. In this paper, we emphasize the importance of the instance modeling procedure and propose two simple yet effective modules, i.e., the dynamic segment merging (DSM) module and the retrieval-augmented anomaly restoration (RA2R) module, to tackle the problem at the segment level and the feature level, respectively. We equip various state-of-the-art WS-VAD models with the proposed methods and conduct thorough experiments on challenging datasets, e.g., UCF-Crime and XD-Violence. Results demonstrate that the proposed method brings consistent performance improvement and establishes a new state of the art.
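
To make the dilution problem concrete, the following minimal sketch implements one common form of uniform temporal pooling: snippet features are split into a fixed number of equal-length segments and averaged within each. The function name and the exact boundary rule are assumptions for illustration and are not taken from the paper.

```python
def uniform_temporal_pooling(features, num_segments):
    """Average per-snippet feature vectors into num_segments equal chunks."""
    t = len(features)
    if t == 0 or num_segments <= 0:
        raise ValueError("need at least one feature and one segment")
    pooled = []
    for i in range(num_segments):
        start = (i * t) // num_segments
        end = ((i + 1) * t) // num_segments
        if end <= start:  # fewer snippets than segments: reuse one snippet
            start = min(start, t - 1)
            end = start + 1
        chunk = features[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# One anomalous snippet (value 8.0) among normal snippets (value 0.0).
snippets = [[0.0], [0.0], [0.0], [8.0]]
print(uniform_temporal_pooling(snippets, num_segments=2))
```

The anomalous snippet is averaged with a normal neighbor, so the resulting instance carries only half of the anomaly response. Segment boundaries that ignore content in this way are what the abstract calls contamination and dilution.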

Abstract:
When restoring low-light images, most methods largely overlook the ambiguity due to dark noise and lack discrimination for region and shape representations, resulting in invalid feature enhancement. In this work, we propose a physically explainable, prior-guided model for low-light image enhancement, termed Dual-Conditional Guidance Sparse Diffusion (DCGSD). Specifically, we introduce an elaborately designed Luminance Structure Guidance Head, which can be easily plugged into an existing diffusion model to emphasize the value of the luminance and structural representation. Furthermore, for reliable noise analysis, we provide a novel Sparse Attention Enhancement Module that adaptively exploits the most useful region-to-region dependencies. This dynamic selection shifts the diffusion process from dense to sparse, thus improving the efficiency of reasoning about noise distributions. To avoid noise amplification, we further present a Skip Calibration Module, which can be used to refine the local neighborhood that contains noisy and structural information. Extensive experiments have been performed to verify the superiority of the proposed method. The results show that leveraging dual-conditional guidance helps the diffusion model produce sharper and more realistic results.

Abstract:
Underwater images encounter a range of quality degradation issues caused by the differential scattering and absorption of light in water. To address these challenges, we introduce a WFAC method, a wavelet decomposition fusion method that combines global and local contrast for underwater image enhancement. Specifically, we begin with a color transfer compensation strategy to correct the colors in a degraded underwater image. Subsequently, we utilize the pixel gradient distribution to create a matrix weight map that dynamically adjusts the weight distribution in overly bright or dark areas of the color-corrected image, enhancing its global contrast. Simultaneously, we apply a rapid integration statistical strategy to adaptively fine-tune the local contrast of color-corrected images using the local mean and variance statistics. To combine the strengths of various enhanced images, we implement a wavelet decomposition fusion strategy to break down different scale components of globally and locally contrast-enhanced images and merge the benefits of varying scale images to obtain a high-quality underwater image. Comprehensive experimental assessments across three underwater image datasets demonstrate that our WFAC method efficiently recovers colors and boosts contrast in degraded underwater images. The code is publicly available at: https://www.researchgate.net/publication/386508762_2024WFAC.
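
Local mean and variance statistics of the kind mentioned above are often computed with integral images (summed-area tables), which give every window sum in constant time. The sketch below shows that general technique under the assumption that it is comparable to the "rapid integration" strategy. It is not the WFAC code, and the function names and the `radius` parameter are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, y0, x0, y1, x1):
    """Sum of pixels in rows y0..y1-1 and columns x0..x1-1."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def local_mean_var(img, radius):
    """Per-pixel mean and variance over a (2*radius+1)^2 window, clipped at borders."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    sat_sq = integral_image([[v * v for v in row] for row in img])
    means = [[0.0] * w for _ in range(h)]
    variances = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            n = (y1 - y0) * (x1 - x0)
            m = box_sum(sat, y0, x0, y1, x1) / n
            v = box_sum(sat_sq, y0, x0, y1, x1) / n - m * m
            means[y][x] = m
            variances[y][x] = max(v, 0.0)  # clamp tiny negative rounding errors
    return means, variances


means, variances = local_mean_var([[1.0, 2.0], [3.0, 4.0]], radius=1)
print(means[0][0], variances[0][0])
```

A local contrast adjustment could then rescale each pixel's deviation from its local mean by a gain that depends on the local variance. The exact gain rule is specific to the method and is not reproduced here.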

Abstract:
Viewport in immersive media corresponds to the field of view (FoV), playing a critical role in both data transmission volume and user experience. However, instantaneous and highly dynamic interactions often conflict with segment-based transmission modes, resulting in substantial redundant data transmission and wastage of valuable resources. In this paper, we analyze data from an open-source dataset and our self-collected records to investigate the interactive characteristics of viewers in immersive scenes, including focus time, viewing area scope, movement direction, and tile access probability. Based on empirical statistical inference, we innovatively introduce the concept of an irregular, expandable, and directional extended field of view (EoV) to describe the dynamically variable area mimicking human visual motion. Furthermore, we propose a motion-aware tile-based adaptive control scheme for viewing areas, named VAAC-IM, designed to enable flexible transmission of immersive media. Specifically, we developed an FoV prediction model based on ConvLSTM, leveraging spatiotemporal features from historical viewing records to provide advanced predictions of visual motion preferences. Subsequently, we model the viewing area control process as a constrained submodular minimization problem, dynamically managing irregular EoV area using marginal effects. Finally, we perform a comprehensive validation. The results demonstrate that VAAC-IM significantly enhances performance in terms of reducing black edge coverage, minimizing data volume, lowering latency, and improving overall user experience.

Abstract:
The generation of 3D meshes is critical in numerous applications, evidenced by the growing popularity and attention towards interactive generative models. Although diffusion models currently stand out as powerful interactive generative models, they are confined to the 2D domain. Performing direct diffusion and denoising on complex 3D meshes with dense vertices and faces is impractical, time-consuming, and resource-intensive. In this work, we discretize the 3D space and incorporate the intricate 3D mesh topology within the Truncated-Signed Distance Fields (T-SDFs) of each discrete cell vertex and propose an efficient discrete index graph diffusion model for T-SDFs. We further divide T-SDFs into multiple local shapes and encode the complete object as a discretized 3D grid based on codebook indices, with each index labeled for its position to preserve its discretization while reducing the input dimensionality. A graph neural network is trained on these latent spaces to jointly denoise the diffusion process on continuous coordinates and discrete codebook indices to incorporate local and global information. As we delete the most frequently repeated codebook indices in the 3D grid, the input size of the diffusion model becomes variable. We employ diverse conditional embeddings from task-specific autoencoders to estimate the quantity of codebook indices in various 3D grids and achieve interactive conditional synthesis by utilizing classifier-free guidance to sample from diverse normal distributions. Our model exhibits exceptional generative performance, supported by experimental results showcasing its effectiveness in various generative tasks, including shape completion, single-view 3D generation, and text-driven generation.
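The variable input size mentioned above can be illustrated with a small sketch. It assumes that only the single most frequent codebook index is dropped and that each surviving index is stored together with its grid position; the function names are hypothetical and the paper's actual procedure may differ:

```python
import numpy as np

def sparsify_index_grid(index_grid):
    """Drop the most frequent codebook index; keep (position, index) pairs."""
    grid = np.asarray(index_grid)
    values, counts = np.unique(grid, return_counts=True)
    background = values[np.argmax(counts)]  # most frequently repeated index
    keep = grid != background
    coords = np.argwhere(keep)              # positions, in row-major order
    indices = grid[keep]                    # matching indices, same order
    return coords, indices, background

def densify(coords, indices, background, shape):
    """Rebuild the dense grid from the sparse (position, index) pairs."""
    grid = np.full(shape, background, dtype=indices.dtype)
    grid[tuple(coords.T)] = indices
    return grid
```

The number of surviving pairs depends on the shape, so the sequence fed to a downstream model has a different length for each object.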

Abstract:
Visible-Infrared Person Re-identification (VI ReID) aims to achieve cross-modality re-identification by matching pedestrian images captured under visible and infrared illumination. A crucial challenge in this task is mitigating the impact of modality divergence so that the VI ReID model can learn cross-modality correspondence. To address this challenge, existing methods primarily focus on eliminating the information gap between modalities by extracting modality-invariant information or supplementing inputs with specific information from the other modality. However, these methods may focus too heavily on bridging the information gap, a challenging issue that could overshadow the inherent complexities of cross-modality ReID itself. Based on this insight, we propose a straightforward yet effective strategy that gives the VI ReID model sufficient flexibility to adapt to diverse modality inputs and perform cross-modality ReID effectively. Specifically, we introduce a Modality-aware and Instance-aware Visual Prompts (MIP) network, which leverages a transformer architecture with customized visual prompts. In MIP, a set of modality-aware prompts enables the model to dynamically adapt to diverse modality inputs and effectively extract information for identification, thereby alleviating the interference of modality divergence. In addition, we propose instance-aware prompts, which guide the model to adapt to individual pedestrians and capture discriminative clues for accurate identification. Extensive experiments on four mainstream VI ReID datasets evaluate the effectiveness of the designed modules. Furthermore, the proposed MIP network outperforms most current state-of-the-art methods.

Abstract:
Temporal Action Localization (TAL) aims to classify and localize all actions within untrimmed videos. Existing TAL methods often struggle with inaccurate boundary predictions due to the similarity of action content and the uncertainty of boundaries between adjacent frames. Many of these methods rely on fixed or global proposal learning strategies and lack a finer-grained mechanism for improving localization accuracy. In this paper, we propose BRTAL, a new Boundary Refinement framework for TAL based on an offset-driven diffusion model, specifically designed to enhance action boundary precision through iterative refinement. Unlike traditional TAL methods that emphasize global target predictions, BRTAL adopts a local refinement perspective by leveraging an offset-driven strategy. Specifically, our framework employs diffusion to iteratively generate local offsets between predictions and ground truth, gradually reducing these offsets to achieve better alignment with the ground truth. This approach is particularly effective in addressing the ambiguous boundaries frequently encountered in TAL, enabling BRTAL to localize boundaries more precisely than existing methods. Furthermore, we introduce a lightweight yet powerful Temporal Context Modeling (TCM) module to enhance temporal information modeling for accurate action localization. TCM features a Temporal Representation Perception (TRP) layer, which captures temporal evolution and long-term contextual dependencies through a squeeze-and-excitation design combined with large convolutional kernels, ensuring robust temporal representation learning. Extensive experiments on the THUMOS14, ActivityNet-1.3, and EPIC-KITCHENS 100 datasets highlight the significant advantages of BRTAL. Notably, BRTAL achieves an average mAP of 69.6% on THUMOS14, establishing a new state of the art and demonstrating its outstanding boundary refinement capability.

Abstract:
The evolution of 3D visualization techniques has fundamentally transformed how we interact with digital content. At the forefront of this change is point cloud technology, offering an immersive experience that surpasses traditional 2D representations. However, the massive data size of point clouds presents significant challenges in data compression. Current methods for lossy point cloud attribute compression (PCAC) generally focus on reconstructing the original point clouds with minimal error. However, for point cloud visualization scenarios, the reconstructed point clouds with distortion still need to undergo a complex rendering process, which affects the final user-perceived quality. In this paper, we propose an end-to-end deep learning framework that seamlessly integrates PCAC with differentiable rendering, denoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of rendered multiview images for viewing. In a differentiable manner, the impact of the rendering process on the reconstructed point clouds is taken into account. Moreover, we characterize point clouds as sparse tensors and propose a sparse tensor-based transformer, called SP-Trans. By aligning with the local density of the point cloud and utilizing an enhanced local attention mechanism, SP-Trans captures the intricate relationships within the point cloud, further improving feature analysis and synthesis within the framework. Extensive experiments demonstrate that the proposed RO-PCAC achieves state-of-the-art compression performance, compared to existing reconstruction-oriented methods, including traditional, learning-based, and hybrid methods. The code will be released at https://github.com/net-F/RO-PCAC.git.

Abstract:
Emotion recognition in conversations (ERC) has garnered significant attention for its critical role in human-computer interaction systems. ERC benefits from multimodal data, which offers diverse perspectives on emotional states, and commonsense knowledge (CSK), which enriches the context by incorporating real-world understanding of human behavior. However, existing ERC studies have not fully exploited the potential of multimodal-CSK interactions for complementary information learning from these sources. To address this, we innovatively propose a Knowledge-Aware Multimodal Interaction Network (KA-MIN). KA-MIN is designed to capture complementary emotional information from CSK-multimodal interactions, thereby facilitating the ERC task. To achieve this, KA-MIN begins by combining six relation types of CSK, leveraging their differences between multimodal emotional information. The fused CSK features are then refined to incorporate context and emotional information using multimodal contextual guidance. Subsequently, we construct a novel knowledge-aware multimodal graph structure that allows the CSK information to interact with multimodal information, leading to more comprehensive multimodal and context modeling. During the graph learning process, the CSK-multimodal interactions capture the complementary emotional information between CSK and multimodal features. Finally, we dynamically fuse the multimodal emotional information with the informative CSK and textual guidance to obtain the final utterance representations, which encompass effective emotional information from both multimodal and CSK features. Extensive experiments on two popular multimodal ERC datasets demonstrate the superiority and effectiveness of the proposed KA-MIN framework.

Abstract:
Cross-modal hashing enables efficient cross-modal retrieval by compressing multi-modal data into compact binary codes, but traditional methods primarily rely on centralized training, which is limited when handling large-scale distributed datasets. Federated learning presents a scalable alternative, yet existing federated frameworks for cross-modal hashing face challenges of data heterogeneity and imbalance, such as non-IID data distribution across clients. To address these challenges, we propose the Personalized Federated learning with Lookahead for Adaptive cross-modal Hashing (PFedLAH) method, which combines Feature Adaptive Personalized Learning (FAPL) with a Weight-aware Lookahead Adaptive Selection (WLAS) mechanism. On the client side, the FAPL module enables personalized learning to mitigate the divergence between server and client that results from non-IID data distribution, while an integrated local optimization constraint mechanism avoids local optimization shift and ensures better alignment with global convergence. On the server side, the WLAS module combines weight-aware adaptive client selection and gradient momentum lookahead to form a dynamic and intelligent client selection scheme, while enhancing overall convergence and consistency through lookahead gradient prediction. Comprehensive experiments on widely used datasets, including MIRFlickr-25K, MS COCO, and NUS-WIDE, with comparisons against state-of-the-art federated hashing methods, demonstrate the superior retrieval performance, robustness, and scalability of the PFedLAH method.

Abstract:
Realistic image restoration is a crucial task in computer vision, and diffusion-based models for image restoration have garnered significant attention due to their ability to produce realistic results. Restoration can be seen as controllable generation conditioned on priors. However, due to the severity of image degradation, existing diffusion-based restoration methods cannot fully exploit priors from low-quality images and still face many challenges in perceptual quality, semantic fidelity, and structural accuracy. To address these challenges, we introduce a novel image restoration method, SSP-IR. Our approach aims to fully exploit semantic and structure priors from low-quality images to guide the diffusion model in generating semantically faithful and structurally accurate natural restoration results. Specifically, we integrate the visual comprehension capabilities of Multimodal Large Language Models (explicit) and the visual representations of the original image (implicit) to acquire an accurate semantic prior. To obtain a degradation-independent structure prior, we introduce a Processor with RGB and FFT constraints that extracts the structure prior from the low-quality images, guiding the diffusion model and preventing the generation of unreasonable artifacts. Lastly, we employ a multi-level attention mechanism to integrate the acquired semantic and structure priors. The qualitative and quantitative results demonstrate that our method outperforms other state-of-the-art methods overall on both synthetic and real-world datasets. Our project page is https://zyhrainbow.github.io/projects/SSP-IR.

Abstract:
Task-oriented point cloud sampling is a fundamental technique in 3D computer vision and has become a crucial step in numerous 3D applications. However, most state-of-the-art task-oriented sampling methods adopt a point-wise analysis strategy, making them susceptible to data redundancy. Taking inspiration from the abstract-to-detailed recognition process of the human visual system, we propose a novel voxel-to-point network framework called V2PNet for task-oriented point cloud sampling. Specifically, we first design a lightweight coarse-grained sampling module named Important Voxel Prediction (IMVP). This module adaptively outputs points from significant regions of the point cloud by explicitly modeling inter-region relationships, thereby reducing interference from redundant points. Then, the V2PNet framework seamlessly integrates the IMVP module with existing point-wise and task-oriented sampling networks, enabling joint training with downstream tasks. This creates a task-oriented coarse-to-fine-grained sampling pipeline that effectively samples representative and informative points from significant regions to represent the original point cloud. Moreover, to mitigate disturbances across similar regions, we introduce a voxel simplification loss function to enhance the discriminative voxel prediction. Extensive experiments demonstrate that V2PNet improves the performance of existing state-of-the-art task-oriented sampling models.

Abstract:
Accurate depth information is crucial for roadside perception in Cooperative Vehicle Infrastructure Systems. Beyond existing radar and LiDAR solutions, monocular depth estimation using surveillance cameras is emerging as a superior approach due to its cost-effectiveness and dense depth output. Unlike onboard cameras, roadside cameras are relatively fixed in position. Many existing monocular depth estimation methods, which do not independently model camera pose, tend to overfit to training data and produce suboptimal results when confronted with slight variations in camera poses, which may be caused by external forces within the same camera or across different cameras. To address this issue, a pose decoupled monocular depth estimation method specifically designed for roadside perception systems is proposed. This method separates depth estimation into two components: a pose-dependent modeling portion that recovers ground depth based on the current camera pose, and a pose-agnostic portion that estimates pixel height relative to the ground plane. Additionally, a knowledge distillation framework is introduced to improve the robustness of the proposed method against variations in roadside cameras. To validate the method, we propose the first open source dataset for roadside monocular depth estimation, DAIR-MDE, and a roadside instance segmentation dataset, DAIR-Ins, both derived from the DAIR dataset. The proposed method demonstrates significant advances over the state-of-the-art methods on DAIR-MDE. The proposed dataset and source code are publicly available at https://github.com/441599828/PDDepth.
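The pose-dependent portion described above can be pictured with simple pinhole geometry. The following is a minimal sketch, assuming a flat ground plane, zero camera roll, and a camera pitched downward by a known angle; the function name and parameters are hypothetical and are not taken from the PDDepth code:

```python
import numpy as np

def ground_depth_map(height, width, fy, cy, cam_height, pitch):
    """Depth (along the optical axis) of the ground plane seen at each pixel.

    Assumptions: flat ground, no roll, camera `cam_height` metres above the
    ground, optical axis tilted `pitch` radians below the horizontal.
    """
    v = np.arange(height, dtype=np.float64)
    y = (v - cy) / fy  # normalized image coordinate, positive downward
    # In camera coordinates the downward vertical is (0, cos(pitch), sin(pitch));
    # a ray (x, y, 1) hits the ground at depth cam_height / (y*cos + sin).
    denom = y * np.cos(pitch) + np.sin(pitch)
    depth_row = np.full(height, np.inf)  # rays at or above the horizon never hit
    valid = denom > 1e-9
    depth_row[valid] = cam_height / denom[valid]
    # Without roll, ground depth depends only on the image row.
    return np.tile(depth_row[:, None], (1, width))
```

Under these assumptions, a change in camera pose changes only `cam_height` and `pitch`, which is one way to see why separating the pose-dependent ground depth from a pose-agnostic height estimate could reduce overfitting to a particular installation.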

Abstract:
LiDAR-based 3D object detection is essential for autonomous driving. Existing high-performance 3D object detectors usually design complex structures in the 3D backbone to capture long-range dependencies among features. However, introducing these complex structures into the 3D backbone significantly increases computational cost and inference latency, limiting the efficiency and feasibility of detectors in practical applications. In this work, we rethink the long-range dependency capturing problem from a new perspective, namely transferring this task from the 3D backbone to the 2D feature space. To accomplish this goal, we propose a Long-Range Dense Feature Capture Network (LDFCNet). LDFCNet retains the basic structure of the 3D backbone to extract preliminary 3D features but shifts the complex long-range dependency capturing task to a 2D dense feature map, thereby enhancing detection performance while reducing computational cost. Importantly, a robust 2D dense feature capture (2D-DFC) backbone is devised to capture long-range dependencies effectively and efficiently. In addition, we introduce a re-parameterization technique to decouple the training and inference of the 2D backbone, further reducing inference latency. Extensive experiments on the Waymo Open and nuScenes datasets show that LDFCNet achieves competitive performance. Notably, LDFCNet is 1.5× faster than the state-of-the-art hybrid detector HEDNet and 2.1× faster than the transformer-based detector DSVT. Code and results are released at https://github.com/asd291614761/LDFCNet.

Abstract:
Object detection approaches are advancing by leaps and bounds with recent progress in deep learning. However, many environments hamper and challenge generic detectors in open-world scenarios, and these have received quite limited attention. In this paper, we focus on three specific challenging conditions: 1) targets presented under low illumination, 2) camouflaged objects that blend into backgrounds, and 3) complex acquisition scenarios, and accordingly present a novel end-to-end detector, termed Context-awareness Network (CANet). Specifically, we propose a Global Context Encoder and a Context Feature Fusion module to explicitly model the context-awareness (CA) mechanism that plays a crucial role in the human visual system (HVS). This mechanism integrates both latent global and local context information to make each region of interest (RoI) more informative, and thus more discriminative. To our knowledge, such high-level mechanisms are under-explored for object detection in the literature. In addition, a Global Semantic Awareness module is designed to improve position regression and classification during feature extraction. Experiments demonstrate that CANet achieves very competitive performance on ExDark, DARK FACE, COD10K, and CURE-TSD, suggesting the effectiveness and efficiency of CANet in various challenging conditions as well as common scenarios.

Abstract:
The registration of time-varying 3D shapes with high degrees of freedom remains a challenging task. Most existing techniques attempt to address this issue by solving an optimization problem defined on a deformation graph with an as-rigid-as-possible smoothness prior, and they usually struggle to capture large-scale displacements. Motivated by the insight that a set of points tends to collectively undergo significant rigid motion accompanied by slight nonrigid deformation, we propose a two-step approach to address nonrigid registration in a coarse-to-fine manner. In the first step, coarse correlations between source and target points are constructed by estimating a set of rigid transformations for local patches, which are regional clusters of points. To leverage more contextual information, a bidirectional registration module is introduced that estimates both the forward and backward patch-wise rigid transformation fields (PRTFs). Subsequently, in the second step, the source point set is warped by blending both forward and backward PRTFs and fed into a deformation optimization module. Here, unidirectional point-based correspondences are sought to refine the global nonrigid transformation fields (GNTFs) while adhering to local rigidity constraints. To illustrate the efficacy of our method, we conduct tests on challenging scenarios involving human datasets, including large displacements resulting from fast inter-frame motions or pose changes. Both qualitative and quantitative results demonstrate that our approach outperforms several state-of-the-art methods in terms of robustness and registration accuracy.

Abstract:
Accurate 6D object pose estimation from RGB images is crucial for various computer vision applications, such as augmented reality, robotic manipulation and autonomous driving. Existing methods often rely on extensive labeled data, either manually annotated or synthetically generated, which can be laborious and impractical for real-world deployment. To address these challenges, we propose OK-POSE, a keypoint-based 6D object pose estimation method that leverages relative transformations between viewpoints for training. By utilizing pairs of images with object annotations and relative transformation information, OK-POSE automatically learns to detect 3D keypoints of objects, enabling geometrically and visually consistent pose estimation. The simplicity and accessibility of obtaining relative transformation information, which can be acquired from inexpensive binocular cameras or common smartphone devices, significantly reduce labeling costs and mitigate domain gap issues associated with synthetic data. Experimental results demonstrate that OK-POSE achieves competitive performance compared to methods relying on explicit 3D annotations or object 3D models. Moreover, we provide insights into the data collection process and introduce OK-POSE++, an enhanced version with optimized network architecture and loss functions, yielding further improvements in performance. Our approach offers a practical solution for 6D object pose estimation, suitable for real-world applications in scenarios where extensive 3D annotations or object models are unavailable. The code is released at https://github.com/acmff22/OKPOSE.

Abstract:
Object detectors have demonstrated vulnerability to adversarial examples, which are crafted with small perturbations that can deceive the detector. Existing adversarial attacks mainly focus on white-box settings and are valid only at a specific viewpoint, while the universal multi-view black-box attack is less explored, limiting their generalization in practice. In this paper, we propose a novel universal multi-view black-box attack against object detectors, which optimizes a universal adversarial UV texture constructed from multiple image stickers for a 3D object via a designed layout optimization algorithm. Specifically, we treat the placement of image stickers on the UV texture as a circle-based layout optimization problem, whose objective is to find the optimal circle layout filled with image stickers so that it can deceive the object detector in the multi-view scenario. To ensure reasonable placement of image stickers, two constraints are elaborately devised. To optimize the layout, we adopt a random search algorithm enhanced by the devised important-aware selection strategy to find the most appropriate image sticker for each circle from the image sticker pools. Extensive experiments on four common object detectors show that detection performance decreases by 74.29% on average in multi-view scenarios. Additionally, a novel evaluation tool based on a photo-realistic simulator is designed to assess texture-based attacks fairly.

Abstract:
Recent neural image codecs (NICs) achieve strong compression performance, but limited attention has been paid to their error resilience. The resulting NICs tend to be sensitive to packet losses, which are prevalent in real-time communications. In this paper, we investigate how to improve the resilience of NICs against packet losses. We propose ResiComp, a pioneering neural image compression framework with feature-domain packet loss concealment (PLC). Motivated by the inherent consistency between generation and compression, we advocate merging the tasks of entropy modeling and PLC into a unified framework focused on latent space context modeling. To this end, we take inspiration from the impressive generative capabilities of large language models (LLMs), particularly the recent advances of masked visual token modeling (MVTM). Specifically, ResiComp develops a bi-directional masked Transformer to model the contextual dependencies among latents with dual functionality: 1) it iteratively acts as a conditional entropy model to boost compression efficiency; 2) it performs latent PLC to improve resilience. During training, we integrate MVTM to mirror the effects of packet loss, enabling the dual-functional Transformer to restore the masked latents by predicting their missing values and conditional probability mass functions. ResiComp jointly optimizes compression efficiency and loss resilience. Moreover, ResiComp provides flexible coding modes, allowing explicit adjustment of the efficiency-resilience trade-off in response to varying Internet or wireless network conditions. Extensive experiments demonstrate that ResiComp significantly enhances the NIC's resilience against packet losses while exhibiting a worthwhile trade-off between compression efficiency and packet loss resilience. Additionally, packet-level simulations, conducted using diverse network models based on real traces, demonstrate that ResiComp exhibits much better robustness to fluctuating network conditions than redundancy-based approaches such as VTM + FEC.

Abstract:
A conditional lossless point cloud attribute compression method, dubbed ConPCAC, is proposed. The previous work typically codes point attributes in a point cloud in an autoregressive way, incurring unbearable coding time. By contrast, ConPCAC proposes a group-wise conditional entropy model for fast coding while preserving coding performance. Specifically, ConPCAC adopts a “Group Decomposition - Attribute Initialization - Latent Distribution Prediction” framework. First, it flexibly decomposes the original point cloud into multiple groups according to the geometry coordinate distribution. Then, the first group is coded using a base coder, e.g., the standardized G-PCC, and the following groups are progressively coded using a neural coder conditioned on their preceding groups. Two key units, Attribute Initialization (Init) and Latent Distribution Prediction (LDP), are devised in the neural coder. The Init unit employs the nearest neighbor to initialize the attributes of a group, and the LDP unit further predicts the attribute probability distribution for the group. In this way, ConPCAC enables full correlation exploration across groups and parallel processing among points in a group. Finally, the predicted probabilities are fed into the arithmetic engine to code the true attribute values of each group. Extensive experiments demonstrate the performance of ConPCAC. It achieves 14.59%, 10.32%, and 12.26% improvements over the latest G-PCC on the widely used 8iVFB, Owlii, and MVUB datasets, respectively, significantly outperforming state-of-the-art lossless PCAC methods. Moreover, its computational complexity is comparable to G-PCC and much lower than existing learning-based methods. Associated code and models will be released on the website https://github.com/3dpcc/ConPCAC.
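The Init unit described above assigns each point in a new group a starting attribute taken from its nearest neighbor among points that are already coded. A minimal brute-force sketch of that idea, in which the function name and array layout are hypothetical and the actual ConPCAC implementation may use a different search structure:

```python
import numpy as np

def init_attributes(coded_xyz, coded_attr, group_xyz):
    """Initialize a group's attributes from the nearest already-coded point.

    coded_xyz:  (M, 3) coordinates of points in preceding groups
    coded_attr: (M, C) their decoded attributes (e.g. RGB)
    group_xyz:  (N, 3) coordinates of the group about to be coded
    """
    coded_xyz = np.asarray(coded_xyz, dtype=np.float64)
    group_xyz = np.asarray(group_xyz, dtype=np.float64)
    # Pairwise squared distances, shape (N, M); fine for small illustrative inputs.
    d2 = ((group_xyz[:, None, :] - coded_xyz[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return np.asarray(coded_attr)[nearest]
```

Because each point in the group looks only at preceding groups, all N lookups are independent, which is consistent with the parallel processing among points in a group that the abstract mentions.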

Abstract:
Sketch-based 3D shape retrieval (SBSR) has been a challenging task for decades, crucially depending on aligning shared semantic attributes between sketches and 3D shapes. Previous efforts mainly aimed at creating a common embedding space to bridge domain gaps. However, the subjective and abstract nature of sketches, which acts as a confounder, potentially reduces the learning performance of matching with 3D shapes. To address this issue, we propose sketch causal disentangled learning for SBSR, named SCDL, which, for the first time, introduces causal intervention to explicitly disentangle sketches into an inherent shared semantic part and confounders unrelated to classification (styles, abstraction levels, etc.). Specifically, we construct a structural causal model (SCM) in the sketch branch under a dual variational autoencoder (VAE) architecture to alleviate the negative impact of confounders by learning the semantic attributes in the latent variable space. Next, we adopt a learning strategy on the separated semantic latent variables to construct a shared semantic embedding space that makes cross-modal features of the same class more similar, effectively alleviating cross-modality discrepancies and establishing a new state of the art on three benchmarks. Comprehensive experimental results, ablation studies, and visualizations validate the effectiveness of our approach.

Abstract:
Coded Aperture Snapshot Spectral Imaging (CASSI) systems provide an efficient approach to acquiring Hyperspectral Images (HSI), yet the reconstruction process still presents challenges. Traditional Deep Unfolding Networks (DUN) applied to CASSI often face constraints due to inadequate feature utilization and poor handling of multi-scale frequency-domain information, leading to the loss of image detail and global information. Furthermore, most DUN methodologies oversimplify degrading factors and fail to account for issues such as distortions found in actual imaging, thus affecting accuracy and robustness. This paper presents MIDET, a novel DUN tailored for CASSI systems, which integrates the fusion of band information, spatial information, and multi-scale information to meaningfully improve feature utilization and information interaction efficiency. Additionally, MIDET introduces a degradation-guided learning strategy and a frequency feature extraction module, enhancing the capability to handle real imaging distortions and preserve more details in HSI reconstruction. Experimental results demonstrate that MIDET significantly outperforms existing technologies on both simulated and real datasets, effectively enhancing the quality of HSI reconstruction.

Abstract:
Audio-Visual Wake Word Spotting (AVWWS) aims to accurately detect user-defined keywords by leveraging the complementary nature of different modalities in challenging acoustic environments. However, two primary challenges hinder the application of AVWWS models in real-world scenarios: the increase in model parameters introduced by the video modality and the scarcity of paired audio-visual data. To address these issues, we propose a novel diverse acoustic knowledge distillation (DAKD) framework, which utilizes easily accessible single-modality audio data to train two teacher models and employs cross-modal knowledge distillation to transfer the generalization and de-noising capabilities of the teachers to the audio-visual student model. This approach mitigates the overfitting risk associated with large parameter counts and limited data. The DAKD framework consists of an audio-visual student model based on the lightweight multi-scale temporal-spatial attention (LMTSA) architecture, a multi-conditional teacher (MCT) model, and a de-noising teacher (DNT) model. The LMTSA model integrates compact 3D and 2D blocks based on the ResNet architecture through a simple attention module and accepts multi-scale supervision from word-level and phone-level labels, achieving joint temporal-spatial modeling with minimal parameter usage. The MCT and DNT models are trained on extensive real or simulated far-field speech and on paired near-field and far-field speech, respectively, so that generalization to unseen acoustic environments and de-noising capabilities can be transferred to the audio-visual student model. The effectiveness of the proposed DAKD framework is validated through comprehensive experiments on the MISP2021 and the updated MISP2021 Eval Hard datasets, establishing new benchmarks with fewer parameters. Our code will be available at https://github.com/wikkk-tp/AVWWS_DAKD.

Abstract:
Bio-inspired vision sensors, which emulate the human retina by recording light intensity as binary spikes, have gained increasing interest in recent years. Among them, the spike camera is capable of perceiving fine textures by simulating a small retinal region called the fovea and producing high temporal resolution (20,000 Hz) spatiotemporal spike streams. To bridge the gap between binary spike streams and human vision in high-speed scenes, reconstructing intensity and optical flow from high temporal resolution spikes is particularly important. In this paper, we present a hybrid SNN-ANN network designed for simultaneous intensity and optical flow learning from spike streams. To adaptively extract spatial and temporal features from continuous spike streams, we propose a spiking neuron module with dense connections that efficiently processes both short-term and long-term spike data, while maintaining low power consumption characteristics. Subsequently, we introduce two decoders for optical flow and intensity estimation that complement each other. A temporal-aware warping module, based on flow features, is specifically designed to align the temporal features of the intensity decoder, thereby reducing motion artifacts. Concurrently, improved intensity features contribute to more accurate flow feature predictions, resulting in a mutually beneficial relationship within our network. To evaluate the effectiveness of our proposed network, we conduct experiments on both simulated and real spike datasets. Our network outperforms existing state-of-the-art spike-based reconstruction and optical flow estimation methods, demonstrating its potential for advancing the field of bio-inspired vision sensors. Our code is available at https://github.com/LinZhu111/SLIO.

Abstract:
Retrieval plays an important role in knowledge-based visual question answering (KB-VQA), which relies on external knowledge to answer questions related to an image. However, not all information in the external knowledge is beneficial in retrieval, e.g., knowledge that is only semantically similar to the query but is not useful for question answering. To improve the effectiveness and efficiency of retrieval, in this paper, we propose efficient multimodal selection to filter out irrelevant information and increase the retriever performance for KB-VQA. First, to exclude most irrelevant knowledge from the large external knowledge base, multimodal selection uses a query-aware sample selection method, which uses the pretrained answer generator’s prediction to obtain better positive and negative training samples to help retrievers distinguish knowledge that is semantically relevant to the multimodal query. Then, question-aware visual feature selection is proposed to select the distinguishable visual information related to the question, where cross-attention between questions and images is used to obtain question-aware visual features. These visual features are used to perform fine-grained multimodal retrieval within the small set to obtain the final top-related knowledge. The experimental results show that the proposed approach achieves state-of-the-art retrieval performance on the OK-VQA and FVQA datasets, indicating the effectiveness of our selection strategy for retrieval.

Abstract:
Road crack detection is a key computer vision task that identifies and locates cracks in road surface images; such cracks usually have an irregular shape and are only a few pixels wide. Generative and unsupervised methods have become popular in recent years, but generative methods require large amounts of training data and computational power, while unsupervised methods perform unsatisfactorily in pixel-level segmentation. The task is further challenged by the irregularity of crack shapes and complex road image backgrounds. To alleviate these problems, we propose a novel method in this paper, CDS-Net, that significantly improves road crack detection performance through multiple practical modules, including the Multi-Directional Hierarchical Attention (MDHA) module and the Difference Sensitivity Reconstruction Block (DSRB). Specifically, the MDHA module employs a multi-directional feature extraction strategy to capture detailed information of cracks, thereby enhancing the discriminative power of the features. The DSRB module, designed to address the inefficiency of traditional skip-connections, utilizes masked convolution and graph convolution attention to reconstruct and refine feature representations. Additionally, we propose an improved weighted cross-entropy loss function to address the inherent class imbalance problem in road crack detection. Extensive experiments on five public datasets demonstrate that CDS-Net achieves superior performance compared to other state-of-the-art methods, showcasing its effectiveness and robustness in road crack detection. It also has a stronger generalization ability compared with other methods. Code is available at https://github.com/ttttqz/CDS-Net/tree/master.
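As a rough illustration of the class-imbalance issue mentioned in this abstract, the sketch below implements a standard class-weighted binary cross-entropy in plain Python. It is not the improved loss proposed for CDS-Net, whose form the abstract does not give; the function name `weighted_bce` and the inverse-frequency weighting are assumptions made purely for illustration.

```python
import math

def weighted_bce(probs, labels, eps=1e-7):
    """Class-weighted binary cross-entropy over a flat list of pixels.

    probs  : predicted crack probabilities in [0, 1]
    labels : ground-truth labels, 1 = crack, 0 = background
    Weights are inverse class frequencies (an assumed, common choice),
    so the rare crack class contributes as much as the background.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Guard against images in which only one class is present.
    w_pos = n / (2 * n_pos) if n_pos else 0.0
    w_neg = n / (2 * n_neg) if n_neg else 0.0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        if y == 1:
            total -= w_pos * math.log(p)
        else:
            total -= w_neg * math.log(1 - p)
    return total / n
```

With these weights, one crack pixel among three background pixels carries the same total weight as the background, so an uninformative prediction of 0.5 everywhere yields a loss of ln 2 regardless of the imbalance.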

Abstract:
Existing state-of-the-art methods for few-shot action recognition (FSAR) achieve promising performance by spatial and temporal modeling. However, most current methods ignore the importance of edge information and motion cues, leading to inferior performance. For the few-shot task, it is important to effectively explore limited data. Additionally, effectively utilizing edge information is beneficial for exploring motion cues, and vice versa. In this paper, we propose a novel edge guided network with motion enhancement (EGME) for FSAR. To the best of our knowledge, this is the first work to utilize the edge information as guidance in the FSAR task. Our EGME contains two crucial components, including an edge information extractor (EIE) and a motion enhancement module (ME). Specifically, EIE is used to obtain edge information on video frames. Afterward, the edge information is used as guidance to fuse with the frame features. In addition, ME can adaptively capture motion-sensitive features of videos. It adopts a self-gating mechanism to highlight motion-sensitive regions in videos from a large temporal receptive field. Based on the above designed components, EGME can capture edge information and motion cues, resulting in superior recognition performance. Experimental results on four challenging benchmarks show that EGME performs favorably against recent advanced methods.

Abstract:
Unsupervised hyperspectral change detection (UHCD), detecting subtle changes between bi-temporal images without manual annotations, is an essential but challenging task in the earth observation community. The current modus operandi often performs it in a feature comparison manner, which is limited by variations in imaging conditions. We observe that fully supervised paradigms using limited annotations are capable of overcoming this challenge. Based on this, we introduce a novel Observational Learning Paradigm (OraL) for UHCD by mimicking fully supervised paradigms. OraL comprises two sequential stages: Observation, which designs a spatial-temporal observation strategy (STO) that records the learning consistency of pixels under different training steps and views to obtain reliable pseudo-labels; and Reproduction, which retrains the model with these pseudo-labels and introduces a distribution-aware spectral learning strategy (DSL) to adaptively increase their learning difficulty according to spectral distributions, enhancing the robustness and generalization of the model. Extensive experiments on several public hyperspectral image datasets demonstrate its state-of-the-art performance and pluggability for previous unsupervised methods. The code is available at: https://github.com/GC-WSL/OraL.

Abstract:
Lightweight video representation techniques have advanced significantly for simple activity recognition, but they still encounter several issues when applied to complex activity recognition: 1) The presence of numerous individuals and varying spatial positions makes it difficult for traditional token pruning methods to maintain accuracy. 2) Simply discarding entire frames may result in the loss of crucial clues. 3) To maintain parallel computing, applying the same pruning rate to every frame leads to significant redundancy in frames with low information content. To this end, we propose a lightweight and novel Spatial-Temporal Token Pruning and Merging (STPM) framework, specifically designed for complex action videos where human actors occupy a small spatial resolution within video frames. Our framework considers two critical factors: semantic importance and spatial-temporal redundancy, to further reduce overhead. For semantic importance, STPM captures class-specific attention scores by learning multiple class tokens within the transformer to guide token pruning. For spatial-temporal redundancy, STPM employs an anchor graph and temporal attention to perform spatial and temporal token merging, preserving appearance and temporal cues while eliminating semantic duplication and redundancy. We conduct extensive experiments on JRDB-PAR primarily using recently introduced video transformer backbones, e.g., MViT and ViT. Our framework achieves similar results while requiring 40% less computation.
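The idea of guiding token pruning with class-specific attention scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the STPM implementation: the function name `prune_tokens`, the `keep_ratio` parameter, and the choice of max-pooling over class tokens are all hypothetical.

```python
def prune_tokens(tokens, class_scores, keep_ratio=0.5):
    """Keep the tokens with the highest class-specific attention.

    tokens       : list of token features (any objects)
    class_scores : class_scores[c][i] is the attention that class
                   token c assigns to patch token i
    keep_ratio   : fraction of tokens to retain (hypothetical knob)
    """
    n = len(tokens)
    # Score each token by its strongest response over all class tokens
    # (max-pooling over classes is an assumption, not taken from the paper).
    importance = [max(scores[i] for scores in class_scores) for i in range(n)]
    k = max(1, int(n * keep_ratio))
    # Pick the k most important tokens, then restore their original order.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept], kept
```

A token survives if at least one class token attends to it strongly, which is one simple way to avoid discarding small actors that matter for only a single activity class.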

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) has gained considerable attention in recent years for its pivotal role in addressing continuously arriving classes. However, it encounters additional challenges. The scarcity of samples in new sessions intensifies overfitting, causing incompatibility between the output features of new and old classes, thereby escalating catastrophic forgetting. A prevalent strategy involves mitigating catastrophic forgetting through the Explicit Memory (EM), which comprises class prototypes. However, current EM-based methods retrieve memory globally by performing Vector-to-Vector (V2V) interaction between features corresponding to the input and prototypes stored in EM, neglecting the geometric structure of local features. This hinders the accurate modeling of their positional relationships. To incorporate information about local geometric structure, we extend the V2V interaction to Graph-to-Graph (G2G) interaction. To enhance local structures for better G2G alignment and to prevent local feature collapse, we propose the Local Graph Preservation (LGP) mechanism. Additionally, to address sample scarcity in classes from new sessions, the Contrast-Augmented G2G (CAG2G) is introduced to promote the aggregation of same-class features, thus helping few-shot learning. Extensive comparisons on CIFAR100, CUB200, and the challenging ImageNet-R dataset demonstrate the superiority of our method over existing methods.

Abstract:
In realistic open-set scenarios, where the labels of part of the testing data are entirely unknown, vision-language (VL) prompt learning methods that encounter inputs from unknown classes (i.e., not seen during training) always predict them as one of the training classes. The exhibited label bias causes difficulty in open set recognition (OSR), in which an image should be correctly predicted as one of the known classes or the unknown one. To achieve this goal, we propose a vision-language prompt tuning method with mitigated label bias (M-Tuning). It introduces open words from WordNet to extend the range of words forming the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario. Besides, inspired by the observation that classifying directly on large datasets causes a much higher false positive rate than on small datasets, we propose a Combinatorial Tuning and Testing (CTT) strategy for improving performance. CTT decomposes M-Tuning on large datasets into multiple independent group-wise tunings on fewer classes, then makes accurate and comprehensive predictions by selecting the optimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the literature, especially for prompt methods, we contribute new baselines for fair comparisons. Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.

Abstract:
Object detection has made promising progress in recent years and plays an important role in various applications. However, the performance of object detection networks generally drops significantly when subjected to adversarial attacks. As an effective technique for defending against adversarial attacks, adversarially robust object detection has attracted increasing interest. In this paper, a novel deviation-calibrated and content-preserved network (DCCP-Net) is proposed for adversarially robust object detection by effectively exploring and mitigating the essential negative impact of noise disturbance in the feature space. Specifically, a deviation-calibrated robust feature enhancement module is designed to enhance the feature robustness of adversarial images by removing noise disturbance and supplementing rectified information. Besides, by enabling adversarial image features to imitate corresponding clean image features, a content-preserved consistency information imitation mechanism is proposed to obtain more accurate content information of adversarial images. Extensive experimental results have verified the superiority of the proposed DCCP-Net.

Abstract:
Video anomaly detection (VAD) confronts significant challenges arising from data scarcity in real-world open scenarios, encompassing sparse annotations, labeling costs, and limitations on closed-set class definitions, particularly when scene diversity surpasses available training data. Although current weakly-supervised VAD methods offer partial alleviation, their inherent confinement to closed-set paradigms renders them inadequate in open-world contexts. Therefore, this paper explores open vocabulary video anomaly detection (OVVAD), leveraging abundant vision-related language data to detect and categorize both seen and unseen anomalies. To this end, we propose a robust framework, PLOVAD, designed to prompt-tune large-scale pretrained image-based vision-language models (I-VLMs) for the OVVAD task. PLOVAD consists of two main modules: the Prompting Module, featuring a learnable prompt to capture domain-specific knowledge and an anomaly-specific prompt crafted by a large language model (LLM) to capture semantic nuances and enhance generalization; and the Temporal Module, which integrates temporal information using graph attention network (GAT) stacking atop frame-wise visual features to address the transition from static images to videos. Extensive experiments on four benchmarks demonstrate the superior detection and categorization performance of our approach in the OVVAD task without introducing excessive parameters.

Abstract:
Diffusion-based generative models have exhibited considerable success in conditional video synthesis and editing. Nevertheless, prevailing video diffusion models primarily rely on conditioning with specific input modalities, predominantly text, restricting their adaptability to alternative modalities without necessitating retraining of modality-specific components. In this work, we present EnergyViD, a universal spatio-temporal Energy-guided Video Diffusion model designed for zero-shot video synthesis and editing across diverse conditions. Specifically, we leverage off-the-shelf pre-trained networks to construct generic energy functions, guiding the generation process under specific conditions without the need for retraining. To precisely capture temporal dynamics related to motion conditions (e.g., pose sequences), we introduce a novel kernel Maximum Mean Discrepancy (MMD)-based energy function, which minimizes the global distribution discrepancy between the conditioning input and the generated video. Our extensive qualitative and quantitative experiments demonstrate that our algorithm consistently produces high-quality results across a wide range of motion and non-motion conditions, including text, face ID, style, poses, depths, sketches, canny edges, and segmentation maps, in the context of zero-shot video synthesis and editing. We will release source code upon acceptance of the paper.
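For readers unfamiliar with kernel Maximum Mean Discrepancy, the following sketch computes the standard biased estimate of squared MMD between two sets of feature vectors using a Gaussian (RBF) kernel. It illustrates only the textbook quantity; how EnergyViD builds its energy function from it is not specified in the abstract, and the names `rbf_kernel`, `mmd_squared`, and the bandwidth `sigma` are assumptions.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with expectations replaced by averages over all sample pairs.
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

The estimate is zero when the two sample sets coincide and grows as their distributions move apart, which is what makes it usable as a guidance signal that compares distributions rather than individual frames.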

Abstract:
Recently, the bio-inspired spike camera with continuous motion recording capability has attracted tremendous attention due to its ultra-high temporal resolution. This imaging characteristic results in a huge data storage and transmission burden compared with that of traditional cameras, creating a severe challenge and an imminent need for compression of spike camera content. Existing lossy data compression methods cannot compress spike streams efficiently because of their integrate-and-fire characteristic and binarized data structure. Considering the imaging principle and information fidelity of spike cameras, we propose a novel Reconstruction-based Contextual Spike Compression (RCSC) framework, which contains scene reconstruction, contextual image compression and spike generation. To our knowledge, it is the first learning-based model for efficient and robust spike stream compression with informative fidelity. Extensive experimental results show that our model outperforms the state-of-the-art conventional codec VVC intra by 6.14% and surpasses the state-of-the-art learned codec by 2.53% in BD-rate reduction, establishing a strong baseline for spike compression.

Abstract:
Video-text cross-modal retrieval is widely studied to improve retrieval accuracy. However, the security of video-text cross-modal retrieval models receives little attention. If attackers exploit the security vulnerabilities in these models, they pose a significant threat to the retrieval models. Thus, identifying security flaws in video-text cross-modal retrieval models becomes the focus of our research. Existing poisoning models have certain limitations when it comes to exploiting vulnerabilities in retrieval models. These include failing to comprehensively embed malicious information into the original video and being unable to maintain visual consistency between the original and poisoned videos. These limitations can result in unsuccessful attacks on retrieval models and an inability to effectively identify security flaws within them. To address these shortcomings, we design an efficient poisoning model that embeds malicious information thoroughly into the original clean data to attack video-text cross-modal retrieval models. First, we are the first to use a poisoning model to attack retrieval models, thereby uncovering their security vulnerabilities. Second, we introduce a bi-level poisoning module to ensure that malicious information is thoroughly embedded into the original video, thereby enhancing the attack capability of the poisoning model. Finally, we design an adversarial module to improve visual consistency between the original and poisoned videos, thus enhancing the concealment of malicious information within the training data of retrieval models. Our poisoning model can identify security flaws in video-text cross-modal retrieval models, providing insights into improving the security of retrieval models. The effectiveness of our model is validated on the MSR-VTT, LSMDC, and MSVD datasets.

Abstract:
Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations: only a few moments are annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network (HGNN) to aggregate node information, update the hyperedge, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments’ boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.
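The aggregate-then-propagate pattern described in this abstract (nodes to hyperedges, then hyperedges back to nodes) can be illustrated with a parameter-free sketch. This is not the HyperAux network: the function name `hypergraph_step`, the use of plain mean aggregation, and the toy node identifiers are assumptions, and a real hypergraph neural network would add learnable weights and nonlinearities.

```python
def hypergraph_step(node_feats, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    node_feats : dict mapping node id -> list of floats
    hyperedges : list of lists of node ids; unlike an ordinary edge,
                 each hyperedge may connect any number of nodes
                 (e.g. all frames of a moment plus its query)
    """
    dim = len(next(iter(node_feats.values())))

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    # Step 1: each hyperedge summarises all nodes it connects.
    edge_feats = [mean([node_feats[n] for n in edge]) for edge in hyperedges]

    # Step 2: each node averages the hyperedges it belongs to.
    updated = {}
    for n, feat in node_feats.items():
        incident = [edge_feats[i] for i, edge in enumerate(hyperedges) if n in edge]
        # Nodes outside every hyperedge keep their original feature.
        updated[n] = mean(incident) if incident else list(feat)
    return updated, edge_feats
```

Because a frame shared by two candidate moments receives information from both hyperedges, its updated feature reflects the context of every moment-query pair it participates in.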

Abstract:
Weakly supervised temporal action localization (WTAL) aims to localize action instances with only video-level labels for supervision. Recent methods convert category labels to natural language through prompting and utilize pre-trained vision-language models to generate text representations from natural language for supervision. This is because natural language can provide richer and more generalized semantic supervision to compensate for the lack of supervision in weakly supervised scenarios. However, current prompting methods are limited in their ability to generate dynamic prompts that adapt to each video, which leads to difficulties in accurately aligning text and video representations. In this work, we propose a novel Text-Video Knowledge Guided Prompting (TVKP) framework for WTAL, which generates video-aware prompts based on text-video knowledge to enhance semantic alignment between text and video representations and introduces more video-related external category labels to enrich semantic supervision. We introduce the video-aware prompting (VAP) module to learn text-video knowledge from the joint distribution of text and video representations to generate video-aware text representations. Meanwhile, to help VAP learn text-video knowledge more effectively, a text-video contrastive loss is proposed to ensure semantic consistency between text and video representations. In addition, we propose the external knowledge prompting (EKP) module to introduce more video-related text labels from an external knowledge base to enrich prompts for accurate semantic alignment. Extensive experiments are conducted on three public datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, demonstrating that our approach outperforms state-of-the-art methods.

Affiliations: School of Automation Science and Engineering, South China University of Technology, Guangzhou, China; School of Automation Science and Engineering and the School of Future Technology, South China University of Technology, Guangzhou, China; College of Computing and Data Science, Nanyang Technological University, Jurong West, Singapore; Key Laboratory for Mechanics in Fluid Solid Coupling Systems, Institute of Mechanics, Chinese Academy of Sciences, Beijing, China

Abstract:
Predicting per-voxel occupancy status and corresponding semantic labels in 3D scenes is pivotal to 3D intelligent perception in autonomous driving. In this paper, we propose a novel semantic scene completion framework that can generate complete 3D volumetric semantics from a single image at a low cost. To the best of our knowledge, this is the first endeavor specifically aimed at mitigating the negative impacts of incorrect voxel query proposals caused by erroneous depth estimates and enhancing interactions for positive ones in camera-based semantic scene completion tasks. Specifically, we present a straightforward yet effective Semantic-aware Guided (SAG) module, which seamlessly integrates with task-related semantic priors to facilitate effective interactions between image features and voxel query proposals in a plug-and-play manner. Furthermore, we introduce a set of learnable object queries to better perceive objects within the scene. Building on this, we propose an Interactive Refinement Transformer (IRT) block, which iteratively updates voxel query proposals to enhance the perception of semantics and objects within the scene by leveraging the interaction between object queries and voxel queries through query-to-query cross-attention. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches, achieving overall improvements of 0.30 and 2.74 in mIoU metric on the SemanticKITTI and SSCBench-KITTI-360 validation datasets, respectively, while also showing superior performance in the aspect of small object generation.

Abstract:
Image dehazing is an important preliminary step for downstream vision tasks. Existing deep learning-based methods have limited generalization capabilities for real hazy images because they are trained on synthetic data and exhibit high domain-specific properties. This work proposes a new Diffusion Model for Synthetic-to-Real dehazing (DMSR) based on the haze-aware density. DMSR mainly comprises a physics-based dehazing model and a Conditional Denoising Diffusion Model (CDDM)-based model. The coarse transmission map and coarse dehazing result estimated by the physics-based dehazing model serve as conditions for the subsequent CDDM-based model. In this process, the CDDM-based dehazing model progressively refines the coarse transmission map while generating the dehazing result, enabling the model to remove haze with accurate haze density information. Next, we propose a haze density-aware resampling strategy that incorporates the coarse dehazed result into the resampling process using the transmission map, thereby fully leveraging the diffusion model for heavy haze removal. Moreover, a new synthetic-to-real training strategy with the prior-based loss function and the memory loss function is applied to DMSR to improve generalization capabilities and narrow the gap between the synthetic and real domains at low computational cost. Extensive experiments on various real datasets demonstrate the effectiveness and superiority of the proposed DMSR over state-of-the-art methods.

Abstract:
Image-based virtual try-on aims to transfer target in-shop clothing to a dressed model image. Its objectives are to completely remove the original clothing while preserving the contents outside the try-on area, to fit the target clothing naturally, and to correctly inpaint the gap between the target clothing and the original clothing. Tremendous efforts have been made in this popular research area, but existing methods cannot preserve the type of the target clothing when the try-on area is affected by the original clothing. In this paper, we focus on the unpaired virtual try-on situation where the target clothing and the original clothing on the model are different, i.e., the practical scenario. To break the correlation between the try-on area and the original clothing and make the model learn the correct information to inpaint, we propose an adaptive mask training paradigm that dynamically adjusts training masks. It not only improves the alignment and fit of clothing but also significantly enhances the fidelity of the virtual try-on experience. Furthermore, we propose, for the first time, two metrics for unpaired try-on evaluation, the Semantic-Densepose-Ratio (SDR) and Skeleton-LPIPS (S-LPIPS), to evaluate the correctness of clothing type and the accuracy of clothing texture. For unpaired try-on validation, we construct a comprehensive cross-try-on benchmark (Cross-27) with distinctive clothing items and model physiques, covering a broad range of try-on scenarios. Experiments demonstrate the effectiveness of the proposed methods, contributing to the advancement of virtual try-on technology and offering new insights and tools for future research in the field. The code, model and benchmark will be publicly released.

Abstract:
Most existing single-image super-resolution (SISR) methods focus on addressing predefined uniform degradations, such as bicubic. However, these methods often perform poorly in real-world scenarios due to complicated and varying realistic degradations. In this paper, we propose a novel information bottleneck-based self-distillation method (IBSD) to boost lightweight networks for real-world image super-resolution. The proposed IBSD leverages the principle of information bottleneck to guide SR networks to learn invariant correlations from low-resolution (LR) to high-resolution (HR) across various degradations, thereby improving their generalization capacity. Specifically, the target super-resolution network (i.e., student) is interpreted as a Markov chain, and the distillation process is carried out through two modules. Mutual information (MI) estimation networks are used to quantify the mutual information between adjacent nodes within the Markov chain. To enhance robustness against blur and noise in real-world scenarios, an auxiliary loss with a progressive soft target is employed to better identify what is effective for reconstruction in the high-frequency domain. Minimizing the mutual information while preserving task-relevant features can help remove information that reflects spurious correlations between specific degradations and reconstructed targets. Experiments conducted on real-world image super-resolution datasets demonstrate that our proposed method can significantly improve the performance of recent lightweight SR models without adding any extra inference complexity, and it outperforms existing self-distillation approaches. Code is publicly available at https://github.com/hanzhu1121/IBSD.

Abstract:
Video compression artifacts arise from quantization applied in the frequency domain. Video quality enhancement aims to reduce such compression artifacts and reconstruct a visually pleasant result. While existing methods effectively reduce artifacts in the spatial domain, they often overlook the rich frequency domain information, especially in addressing multi-scale compression artifacts. This work introduces a frequency-domain upsampling strategy within a multi-scale framework, specifically designed to focus on high-frequency details rather than simply blending neighboring pixels during the upsampling process. Our proposed hierarchical frequency-based upsampling and refinement neural network (HFUR) consists of two modules: implicit frequency upsampling (ImpFreqUp) and hierarchical and iterative refinement (HIR). ImpFreqUp exploits the DCT-domain prior derived through an implicit DCT transform, and accurately reconstructs the DCT-domain signal via a coarse-to-fine transfer. Additionally, HIR is introduced to facilitate cross-collaboration and information compensation between the scales, further refining the feature maps and promoting the visual quality of the final output. We demonstrate the effectiveness of the proposed modules via ablation experiments and visualized results. Experimental results demonstrate that HFUR outperforms the state-of-the-art methods by up to 0.13 dB/0.17 dB in the constant bit rate and constant QP modes, respectively. The code is available at https://github.com/zqqqyu/HFUR.

Abstract:
Oriented object detection, which aims to detect multi-oriented objects, is a fundamental task for visual analysis in complex scenarios, such as aerial images. However, powerful detection performance relies on abundant and accurate annotations. Therefore, semi-supervised oriented object detection, which utilizes unlabeled data to improve performance, is a promising method to address this problem. In this work, we explore Dense Pseudo-Label (DPL), which directly selects pseudo labels from the original output of the teacher model without any complicated post-processing steps, and expose the shortcomings of existing methods. Through analysis, we identify that the imbalance between obtaining potential positive samples and removing the interference of inaccurate pseudo labels hinders the effectiveness of DPL. To further improve DPL efficiency, we propose Denser Teacher, a new semi-supervised oriented object detection method. In this method, we design a simple yet effective adaptive mechanism called global dynamic k estimation to guide the selection of DPLs in densely-distributed scenes. Additionally, to improve scale adaptation, we introduce dense multi-scale learning for DPL, where DPLs from different scales are utilized to bridge the scale gap. We conduct extensive experiments on several benchmarks to demonstrate the effectiveness of our proposed method in leveraging unlabeled data for performance improvement. Our code will be available at https://github.com/Haru-zt/DenserTeacher.

Abstract:
Multi-image steganography ensures privacy protection while avoiding suspicion from third parties by embedding multiple secret images within a cover image. However, existing multi-image steganographic methods fail to model global spatial correlations to reduce image damage at a low computational cost. Moreover, they do not account for the anti-distortion capability of the cover image, which is crucial for achieving imperceptibility and ensuring security. To overcome these limitations, we propose StegMamba, a multi-image steganography architecture with a distortion-free immune-cover and a state space model. Specifically, we first explore the potential of Mamba, a model with linear computational cost, for data hiding tasks through a steganography Mamba block (SMB), whose efficiency makes it suitable for real-time applications. Subsequently, considering that images with distortion resistance reduce embedding damage, the original cover image is reconstructed through an immune-cover construction module (ICCM) and associated with the steganography task. Moreover, well-coupled features facilitate fusion, and thus a wavelet-based interaction module (WIM) is designed for effective communication between the immune-cover and the secret images. Compared with state-of-the-art global attention-based methods, the proposed StegMamba obtains PSNR gains of 3.30 dB, 1.37 dB, and 1.92 dB for the stego image and the two recovered secret images, respectively, and a 2.87% reduction in detection accuracy for anti-steganalysis. The code is available at https://github.com/YuhangZhouCJY/StegMamba.

Abstract:
Accurate and fast segmentation of 3D medical images is crucial in clinical analysis. CNNs struggle to capture long-range dependencies because of their inductive biases, whereas the Transformer can capture global features but faces a considerable computational burden. Thus, efficiently integrating global and detailed insights is key for precise segmentation. In this paper, we propose an effective and lightweight architecture named GCI-Net to address this issue. The key characteristic of GCI-Net is the global-guided feature enhancement strategy (GFES), which integrates the global context and facilitates the learning of local information; 3D convolutional attention, which captures long-range dependencies; and a progressive downsampling module, which perceives detailed information better. The GFES can capture the local range of information through global-guided feature fusion and global-local contrastive loss. All these designs collectively contribute to lower computational complexity and reliable performance improvements. The proposed model is trained and tested on four public datasets, namely MSD Brain Tumor, ACDC, BraTS2021, and MSD Lung. The experimental results show that, compared with several recent SOTA methods, our GCI-Net achieves superior computational efficiency with comparable or even better segmentation performance. The code is available at https://github.com/qintianjian-lab/GCI-Net.

Abstract:
Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundary, flickering) in spatial-temporal domains or high-level semantic inconsistencies (e.g., abnormal lip movements) between visual-auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they consider only single-semantic-mode inconsistencies and ignore the complementarity of forgery traces at different levels and in different modalities. In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts the teacher-student protocol to focus on both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, our framework primarily comprises the Spatial-Temporal Pattern Learning module and the Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network’s ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to enhance the proximity of real faces in feature space without explicitly penalizing the distance from manipulated faces, which prevents overfitting to particular manipulation methods and improves generalization capability. Extensive experiments show that our method substantially improves generalization and robustness. In particular, our approach outperforms the SOTA detector by 1.4% in generalization performance on DFDC with large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at https://github.com/TianXie834/M2SD.

Abstract:
Effectively leveraging snow image formulation, which accounts for atmospheric light and snow masks, is crucial for enhancing image desnowing performance and improving interpretability. However, current direct-learning approaches often neglect this formulation, while model-based methods use it in overly simplistic ways. To address this, we propose a novel unfolding network that iteratively refines the desnowing process for more thorough optimization. Additionally, model-based techniques usually rely on real-world snow masks for supervision, a requirement that is impractical in many real-world applications. To overcome this limitation, we introduce a snow shape prior as a surrogate supervision signal. We further integrate the physical properties of atmospheric light and heavy snow by decomposing the optimization task into manageable sub-problems within our unfolding network. Extensive evaluations on multiple benchmark datasets confirm that our method outperforms current state-of-the-art techniques.

Abstract:
Exploring robust and efficient association methods has always been an important issue in multi-object tracking (MOT). Although existing tracking methods have achieved impressive performance, congestion and frequent occlusions still pose challenging problems in multi-object tracking. We reveal that performing sparse decomposition on dense scenes is a crucial step to enhance the performance of associating occluded targets. To this end, we first propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images. Second, we design a depth cascading matching (DCM) algorithm, which uses the obtained depth information to convert a dense target set into multiple sparse target subsets and performs data association on these subsets in order from near to far. By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack. SparseTrack provides a new perspective for solving the challenging crowded-scene MOT problem. Using only IoU matching, SparseTrack achieves performance comparable to state-of-the-art (SOTA) methods on the MOT17 and MOT20 benchmarks. Code and models are publicly available at https://github.com/hustvl/SparseTrack.
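
The near-to-far association described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released SparseTrack code: the pseudo-depth proxy (distance from a box's bottom edge to the bottom of the image), the equal-width depth levels, the greedy IoU matcher, and all function names are assumptions introduced here.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pseudo_depth(box, image_height):
    """Assumed proxy: a box whose bottom edge is lower in the image is nearer."""
    return image_height - box[3]

def split_levels(boxes, image_height, num_levels):
    """Partition box indices into equal-width depth levels, nearest level first."""
    levels = [[] for _ in range(num_levels)]
    if not boxes:
        return levels
    depths = [pseudo_depth(b, image_height) for b in boxes]
    lo, hi = min(depths), max(depths)
    step = (hi - lo) / num_levels
    if step == 0:
        step = 1.0  # all boxes share one depth, so everything falls into level 0
    for idx, d in enumerate(depths):
        levels[min(int((d - lo) / step), num_levels - 1)].append(idx)
    return levels

def greedy_match(track_ids, det_ids, tracks, dets, iou_thresh):
    """Greedy IoU matching within one level (a stand-in for any assignment solver)."""
    candidates = sorted(
        ((iou(tracks[t], dets[d]), t, d) for t in track_ids for d in det_ids),
        reverse=True,
    )
    pairs, used_t, used_d = [], set(), set()
    for score, t, d in candidates:
        if score < iou_thresh:
            break
        if t in used_t or d in used_d:
            continue
        pairs.append((t, d))
        used_t.add(t)
        used_d.add(d)
    rest_t = [t for t in track_ids if t not in used_t]
    rest_d = [d for d in det_ids if d not in used_d]
    return pairs, rest_t, rest_d

def depth_cascade_match(tracks, dets, image_height, num_levels=3, iou_thresh=0.3):
    """Associate level by level from near to far, carrying leftovers to the next level."""
    track_levels = split_levels(tracks, image_height, num_levels)
    det_levels = split_levels(dets, image_height, num_levels)
    matches, carry_t, carry_d = [], [], []
    for lvl in range(num_levels):
        t_ids = carry_t + track_levels[lvl]
        d_ids = carry_d + det_levels[lvl]
        pairs, carry_t, carry_d = greedy_match(t_ids, d_ids, tracks, dets, iou_thresh)
        matches.extend(pairs)
    return matches, carry_t, carry_d

# Usage: two tracks at different depths are matched in separate, sparse subsets.
tracks = [(0, 50, 10, 100), (5, 10, 15, 60)]
dets = [(6, 11, 16, 61), (1, 51, 11, 100)]
print(depth_cascade_match(tracks, dets, image_height=100, num_levels=2))
```

Because each level contains only a few targets, overlapping boxes at different depths no longer compete for the same detection within a single matching step.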

Abstract:
Images captured in low-light environments often suffer from significant degradation. However, most existing Retinex-based methods require an additional decomposition network and overlook the degradation caused by the illumination adjustment process, which results in the consumption of significant computational resources to achieve only average performance. To address the above issues, this paper proposes a more efficient Retinex-based approach named RetinexMac that allows training without an additional decomposition network or regularization functions. RetinexMac first employs an illumination coefficient estimation network to estimate the transform map and light up the global illumination and the local contrast of input images, then a multiscale degradation estimation network is used to suppress the degradation amplified by the illumination adjustment. In order to accurately estimate the degradation, a convolution and attention mixed module integrates the global and local spatial information. This is shown to also significantly improve the performance of other previous Retinex-based methods. Extensive experiments on several representative datasets show that our RetinexMac achieves both current state-of-the-art (SOTA) performance and more ideal visual appearance in terms of illumination and detail, as well as computational efficiency.

Abstract:
In recent years, object detection models have been extensively applied across various industries, leveraging learned samples to recognize and locate objects. However, industrial environments present unique challenges, including complex backgrounds, dense object distributions, object stacking, and occlusion. To address these challenges, we propose the Global Dynamic Matching Transformer Network (GMTNet). GMTNet partitions images into blocks and employs a sliding window approach to capture information from each block and their interrelationships, mitigating background interference while acquiring global information for dense object recognition. By reweighting key-value pairs in multi-scale feature maps, GMTNet enhances global information relevance and effectively handles occlusion and overlap between objects. Furthermore, we introduce a dynamic sample matching method to tackle the issue of excessive candidate boxes in dense detection tasks. This method adaptively adjusts the number of matched positive samples according to the specific detection task, enabling the model to reduce the learning of irrelevant features and simplify post-processing. Experimental results demonstrate that GMTNet excels in dense detection tasks and outperforms current mainstream algorithms. The code will be available at http://github.com/yikuizhai/GMTNet.

Abstract:
Self-supervised monocular depth estimation has exploited semantics to reduce depth ambiguities in texture-less regions and object boundaries. However, existing methods struggle to obtain universal semantics across scenes for effective depth estimation. This paper proposes VFM-Depth, a novel self-supervised teacher-student framework, that effectively leverages the vision foundation model as semantic regularization to significantly improve the accuracy of monocular depth estimation. Firstly, we propose a novel Geometric-Semantic Aggregation Encoding, integrating universal semantic constraints from the foundation model to reduce ambiguities in the teacher model. Specifically, semantic features from the foundation model and geometric features from the depth model are first encoded and then fused through cross-modal aggregation. Secondly, we introduce a novel Multi-Alignment for Depth Distillation to distill semantic constraints from the teacher, further leveraging knowledge from the foundation model. We obtain a lightweight yet effective student model through an innovative approach that combines distance category alignment with complementary feature and depth imitation. Extensive experiments on KITTI, Cityscapes, and Make3D datasets demonstrate that VFM-Depth (both teacher and student) outperforms state-of-the-art self-supervised methods by a large margin.

Abstract:
Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.

Abstract:
Recently, Transformer-based offline video instance segmentation (VIS) solutions have made significant progress by decomposing the whole task into global segmentation map generation and instance discrimination. We argue that the quality of video queries that represent all instances in a video clip is crucial for offline VIS methods. Existing methods typically interact video queries with dense spatio-temporal features, resulting in significant computational complexity and redundant information. Thus, we propose a novel video instance segmentation framework, LBVQ, dedicated to learning better video queries. Specifically, we first obtain the frame queries for each frame independently without any complex inter-frame spatial-temporal association operations. Secondly, we propose an adaptive query initialization module (AQI), which adaptively integrates frame queries to initialize video queries instead of traditional random initialization strategies. This initialization method preserves rich instance clues and accelerates the optimization of the whole model. Finally, to enhance the quality of video queries, we propose a query propagation module (QPM) that captures relevant instance information in frame queries frame by frame, greatly improving the model’s understanding of long videos. By learning higher quality video queries, LBVQ achieves state-of-the-art results on VIS benchmarks with a ResNet-50 backbone: 52.2 AP and 44.8 AP on YouTube-VIS 2019 and 2021, respectively. Moreover, LBVQ achieves 39.7 AP on YouTube-VIS 2022 and 22.2 AP on OVIS, demonstrating superior potential for long videos. To further improve the quality of segmentation masks, a large-scale pretrained SAM is employed to refine the segmentation results. Code is available at https://github.com/fanghaook/LBVQ.

Abstract:
Interpretable image classification is crucial for making decisions in high-stakes scenarios. Recent advancements have demonstrated that interpretable models can achieve performance comparable to black-box models by integrating Visual Language Models (VLMs) with Concept Bottleneck Models (CBMs). These models explain their predictions by calculating the weighted sum of similarities between the image representation and predefined text embeddings. However, selecting textual descriptors is subjective, and relying solely on textual information may not capture the complexities of visual data, impacting both interpretability and performance. To address these limitations, this work explores the cross-modality interpretation of class-related concepts in image classification. Specifically, we propose the decomposed concept bottleneck model (DCBM), which utilizes a set of decomposed visual concepts that are extracted directly from images instead of predefined text concepts. The decomposition of concepts is achieved through vector projection onto concept decomposition vectors (CDVs), which can be interpreted across both textual and visual modalities. We introduce a quintuple notion of concepts and a concept-sample distribution theorem, which enables the localization of decomposed concepts in images using the Segment Anything Model (SAM) with automatically generated prompts. Experimental results demonstrate that DCBM achieves competitive performance compared to non-interpretable models, with a 3.42% improvement in classification accuracy and a 66.27% improvement in image-text groundability compared to other VLM-based CBMs. Furthermore, we evaluate the benefits of employing automatically generated prompts in SAM for interpreting visual concepts, in contrast to prompts created by human operators.
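
The weighted-sum explanation used by VLM-based CBMs, as described above, can be sketched as follows. This is a generic illustration, not the DCBM implementation: the embeddings, weights, concept names, and function names are hypothetical toy values, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def concept_scores(image_emb, concept_embs):
    """Bottleneck activations: one similarity score per concept."""
    return [cosine(image_emb, c) for c in concept_embs]

def class_logits(scores, weights):
    """Each class logit is a weighted sum of concept scores (weights: classes x concepts)."""
    return [sum(w * s for w, s in zip(row, scores)) for row in weights]

def explain(scores, weights, class_idx, names):
    """Per-concept contributions to one class logit, largest first."""
    contrib = [(names[k], weights[class_idx][k] * scores[k]) for k in range(len(scores))]
    return sorted(contrib, key=lambda item: item[1], reverse=True)

# Usage with toy 2-D embeddings and two hypothetical concepts.
names = ["striped", "spotted"]
scores = concept_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
weights = [[2.0, 0.0], [0.0, 2.0]]
print(class_logits(scores, weights))
print(explain(scores, weights, class_idx=0, names=names))
```

The explanation is readable because each class logit decomposes exactly into per-concept terms; the open question raised in the abstract is where the concept vectors come from, text descriptors or visual decomposition.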

Abstract:
In dynamic real-world scenarios, continuous learning without forgetting old knowledge is essential, particularly in environments with stricter privacy protection or resource-constrained edge devices where storing old exemplars is infeasible. Therefore, Non-Exemplar Class-Incremental Learning (NECIL) has garnered significant attention. Compared with normal settings, it faces a more severe plasticity-stability dilemma and classifier bias. To address these challenges, we propose a framework based on the vision transformer architecture, called the Continual Expansion and Absorption Transformer (CEAT), which consists of two core components. First, we propose the Continual Expansion and Absorption (CEA) method to alleviate the trade-off between new and old classes by expanding a set of parameters (i.e., EF layers) in parallel with the backbone to learn new tasks, while freezing the backbone to retain old task knowledge. The EF layers can be seamlessly absorbed into the ViT backbone through parameter recombination before inference, mitigating storage and computational burdens. Second, we propose a Dynamic Boundary-Aware (DBA) method to generate dynamic pseudo-features for classifier calibration to address the classifier bias. Extensive experiments demonstrate that our approach achieves state-of-the-art performance, with significant improvements of 4.82% and 5.92% on TinyImageNet and ImageNet-Subset, respectively.
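
One way to see why a parallel branch can be absorbed without changing the output is the purely linear case sketched below. This assumes the expansion branch is a linear layer added to the output of a frozen linear layer; the abstract does not specify the EF layer's exact form, so the structure and the function names here are assumptions for illustration only.

```python
def linear(weight, bias, x):
    """y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def forward_expanded(w, b, e_w, e_b, x):
    """Training-time path: frozen layer plus a parallel linear expansion branch."""
    main = linear(w, b, x)
    extra = linear(e_w, e_b, x)
    return [m + s for m, s in zip(main, extra)]

def absorb(w, b, e_w, e_b):
    """Fold the parallel branch into the frozen layer by adding the parameters."""
    merged_w = [[a + c for a, c in zip(r1, r2)] for r1, r2 in zip(w, e_w)]
    merged_b = [a + c for a, c in zip(b, e_b)]
    return merged_w, merged_b

# Usage: the merged layer reproduces the two-branch output with a single matmul.
w, b = [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]
e_w, e_b = [[0.25, 0.0], [0.0, -1.0]], [1.0, 1.0]
x = [2.0, 3.0]
merged_w, merged_b = absorb(w, b, e_w, e_b)
print(forward_expanded(w, b, e_w, e_b, x), linear(merged_w, merged_b, x))
```

Since (W + E)x + (b + e) = (Wx + b) + (Ex + e), the merged layer has the same size as the original one, which is why absorption adds no inference-time storage or compute in this linear setting.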

Abstract:
In real-world applications, multi-view data has attracted intensive attention due to the complex and complementary relationships across views. Multi-view representation learning (MvRL) focuses on obtaining a consistent feature representation from multi-view data and has become a popular topic in the multi-view research field. However, the relationship between different samples, i.e., the graph information, is usually ignored or insufficiently exploited in most existing MvRL methods, which regard the graph structure only as a regularization term rather than as a graph embedding for multi-view data. Besides, the limited learning capacity of the adopted shallow models is another challenge for MvRL. To tackle these problems, we propose a novel unsupervised deep multi-view representation learning model guided by a learnable graph structure, termed LGG-DMRL. It first captures a multi-view consistent graph from the original data based on self-representation learning, and explores the view-specific feature representation of each view with the designed graph-guided attention network using the learned graph. After that, the information bottleneck principle is employed to identify the shared representation across views integrated with the view-specific feature representations, promoting multi-view complementarity and completeness. Experimental results on five real-world datasets demonstrate the superiority and effectiveness of our proposed LGG-DMRL compared with recent state-of-the-art multi-view approaches.

Abstract:
Homography estimation is essential for aligning images captured from different viewpoints by accurately modeling the geometric relationship between them. In homography estimation, global information plays a critical role. To establish global correspondences, cross-attention has been widely used in recent studies. However, vanilla cross-attention mechanisms treat queries in redundant and low-texture areas the same as those in richly textured areas, leading to the accumulation and propagation of erroneous information. We define this phenomenon, where the model excessively attends to queries in redundant and low-texture areas, as query over-focusing. To alleviate query over-focusing and achieve fine-grained homography estimation, we propose a novel homography estimation network, termed AGNet, which integrates an Adaptive Query Transformer (AQFormer) and a Gated Interaction Module (GIM). The AQFormer is designed to dynamically adjust attention by applying a mask to queries, allowing the model to adaptively emphasize feature-rich regions while suppressing redundant or weakly textured areas. Meanwhile, the GIM selectively captures local information by adjusting convolutional kernels based on input, enhancing the extraction of shared features between image pairs. Extensive experiments on various datasets demonstrate that AGNet significantly improves accuracy in homography estimation, particularly in challenging scenarios with low overlap and large viewpoint variations.

Abstract:
Few-shot classification aims to develop a classifier that adapts to new tasks using only a limited number of labeled images. To overcome the shortage of training images in few-shot image classification, dense features have been extensively used to represent images, as they provide more subtle and discriminative clues. However, dense-feature-based methods still face challenges despite leveraging local details in images. First, these methods process the support set images of each category independently, which ignores information across different categories. Second, dense features suffer from background noise: when similarity is computed over a large number of dense feature pairs, these methods are susceptible to interference from task-irrelevant pairs. In this paper, we propose a cross dense feature learning method with task guidance to address these issues. Our method has two key components. First, we propose a transformer-based dense feature extraction approach that better utilizes inter-class information within the support set. We design two cross-attention mechanisms, Support-Support Attention (SSA) and Support-Query Attention (SQA), to capture information across categories for a better dense feature representation. Second, a task-relevant model is trained for computing the similarity of dense feature pairs, aiming to select the feature pairs that contribute more effectively to classification. The final similarity used to predict the label of a query image is then obtained by aggregating the weighted local similarities. Experimental results show that our method achieves a promising improvement in few-shot classification by taking cross-category information and task-attention similarity into consideration.

Abstract:
Computer-aided medical image segmentation assists physicians in locating lesion areas for subsequent diagnosis and treatment. Due to the irregular shape of the target and the imbalance in sample size between the target and background areas, automatic segmentation of medical images is a challenging task. Many CNN-based and Transformer-based models add network layers or introduce complex modules to improve segmentation accuracy. Given limited computational resources, such large models are not suitable for actual clinical environments. Inspired by the speed, accuracy, and low consumption of bio-visual processing, we construct the Ultra-Lightweight Network Inspired by Bio-Visual Interaction (BVI-Net). The Global Pathway is constructed by simulating the dorsal stream in order to extract global features rapidly, and the Local Pathway is constructed by simulating the ventral stream in order to process local features finely. At the same time, a skip connection module integrating a Graph Convolutional Network (GCN) attention mechanism is constructed to simulate the ability of the visual pathway to integrate multi-level features synchronously. The International Skin Imaging Collaboration (ISIC) dataset, the Liver Tumor Segmentation (LiTS) dataset, and the Brain Tumor Segmentation Challenge (BraTS) dataset are used for experiments. The proposed BVI-Net requires only 0.026M parameters to achieve excellent performance on three representative medical image segmentation datasets, showing advantages over state-of-the-art (SOTA) methods. This paper integrates the biological vision mechanism with an artificial intelligence algorithm, which provides new ideas for the construction of biological vision-guided deep learning models and promotes the development of biomimetic computational vision.

Abstract:
Constrained by imaging systems, hyperspectral images (HSIs) always have a low spatial resolution. Deep learning-based HSI super-resolution methods have achieved impressive results through learning the nonlinear mapping between low-resolution (LR) and high-resolution (HR) images. However, most of them take the LR image or its upsampled version through bicubic interpolation as input, leading to low-quality features and limited details captured by the network. As a powerful generative model, the diffusion model has the ability to learn both contextual semantics and textural details from distinct timesteps, enabling the effective exploration of spatial-spectral distributions in high-dimensional data. In this paper, we propose a novel method that extracts high-quality prior information from original images to assist in super-resolution through pretraining a diffusion model. Specifically, we first train a diffusion model using original HSI patches in a self-supervised manner and then obtain prior features from the pretrained denoising U-Net decoder. To efficiently incorporate the prior features into the super-resolution model, we propose an adaptive fusion module based on spatial and spectral attention mechanisms, which enhances features in both dimensions while preserving the original characteristics. Additionally, to leverage the complementarity of spatial and spectral information, we design a spatial-spectral aggregation Transformer module that incorporates an adaptive interaction module to facilitate information exchange across different dimensions, thereby enhancing the representation capability. Extensive experiments on three public hyperspectral datasets demonstrate that the proposed method achieves excellent super-resolution performance and outperforms state-of-the-art methods in terms of quantitative quality and visual results.

Abstract:
The proliferation of Artificial Intelligence-Generated Images (AIGIs) has greatly expanded the Image Naturalness Assessment (INA) problem. Different from early definitions that mainly focus on tone-mapped images with limited distortions (e.g., exposure, contrast, and color reproduction), INA on AI-generated images is especially challenging, as such images have more diverse content and can be affected by factors from multiple perspectives, including low-level technical distortions and high-level rationality distortions. In this paper, we take the first step to benchmark and assess the visual naturalness of AI-generated images. First, we construct the AI-Generated Image Naturalness (AGIN) dataset by conducting a large-scale subjective study to collect human opinions on the overall naturalness as well as perceptions from the technical quality and rationality perspectives. AGIN verifies, for the first time, that naturalness is universally and disparately affected by both technical and rationality distortions, while its manifestations vary with different generation tasks. Second, to automatically assess the naturalness of AIGIs in a way that aligns with human opinions, we propose the Joint Objective Image Naturalness evaluaTor (JOINT). Specifically, JOINT imitates human reasoning in naturalness evaluation by jointly learning technical and rationality features with several specific designs to guide model behavior from the respective perspectives. Experiments demonstrate that JOINT significantly outperforms existing methods by providing more subjectively consistent results on naturalness assessment. The dataset can be accessed at https://github.com/zijianchen98/AGIN.

Abstract:
To tackle the challenge of single-spectral object re-identification in complex and dynamic lighting scenarios, multi-spectral object re-identification, which integrates visible light and infrared information, is gradually taking the lead. Nevertheless, the significant heterogeneity across spectra causes formidable obstacles for this task. Most existing approaches alleviate inter-spectral disparities by amalgamating representations from different spectra, ignoring the selection of spectrum-specific crucial information. To address this issue, we propose a novel Representation Selective Coupling Network (RSCNet) for multi-spectral object re-identification. Specifically, we design an Attention-Fourier Token Sparsification (AFTS) module to adaptively sparse and join tokens from multi-spectral images in the attention domain and Fourier domain. This not only preserves spectrum-specific crucial information but also reduces inter-spectral gaps by selective coupling of multi-spectral representation. Meanwhile, to further align multi-spectral information and guide the model to learn more discriminative representation, we propose an Information Unification Constraint (IUC) learning strategy. Both feature-level information constraint and distribution-level information constraint are simultaneously deployed in IUC. Finally, we conduct extensive experiments on three multi-spectral object re-identification benchmarks, and the experimental results verify the effectiveness of our proposed method.

Abstract:
Today, many image coding scenarios do not have a human as the final intended user, but rather a machine fulfilling computer vision tasks on the decoded image. There, the primary goal is not to preserve visual quality but to maintain the task accuracy of the machine for a given bitrate. Due to the tremendous progress of deep neural networks, which set the benchmark results, neural networks are mostly employed to solve the analysis tasks at the decoder side. Moreover, neural networks have also recently found their way into the field of image compression. These two developments allow for end-to-end training of the neural compression network with an analysis network as the information sink. Therefore, we first carry out such a training with a task-specific loss to enhance the coding performance of neural compression networks. Compared to the standard VVC, this method saves 41.4% of the bitrate with Mask R-CNN as the analysis network on the uncompressed Cityscapes dataset. As a main contribution, we propose LSMnet, a network that runs in parallel to the encoder network and masks out elements of the latent space that are presumably not required by the analysis network. With this approach, an additional 27.3% of the bitrate is saved compared to the basic neural compression network optimized with the task loss. In addition, we are the first to utilize a feature-based distortion in the training loss within the context of machine-to-machine communication, which allows for training without annotated data. We provide extensive analyses on the Cityscapes dataset, including cross-evaluation with different analysis networks, and present exemplary visual results.

Abstract:
On-demand video streaming continues to dominate the Internet, posing a formidable challenge in designing efficient adaptive bitrate (ABR) algorithms to enhance user quality-of-experience (QoE), particularly amplified by increasing video resolutions (e.g., from 1080P to 2K, 4K, and even 8K) and dynamic Internet conditions. Through a comprehensive study, we identify a common limitation in both existing throughput-based and hybrid-based ABR algorithms: they rely on coarse-grained network bandwidth estimation, missing detailed and accurate (i.e., millisecond-level) network variations. This often leads to misguided resolution (corresponding to bitrate level) decisions, resulting in unsatisfactory QoE. In this work, we propose SuperABR, a fine-grained throughput-driven ABR solution aimed at achieving the optimal bitrate adaptation. To accomplish this, SuperABR first incorporates a two-stage learning module, generating fine-grained future throughput to provide a near-Oracle network view. SuperABR then uses this fine-grained throughput to accurately calculate the download duration for a video chunk, transforming it into the optimal resolution decision via a custom-designed QoE benefit model. We have implemented SuperABR as a lightweight plug-in interface on a standard DASH framework and evaluated it over extensive real-world network traces. Extensive experiments demonstrate that SuperABR can generate accurate future throughput, resulting in a remarkable 1.21× to 1.46× QoE improvement over classic ABR solutions.
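
To make the download-duration step concrete, the following is a minimal sketch that walks a fine-grained throughput forecast slot by slot until a chunk's bits are consumed. The function name `download_duration`, the fixed-width slot representation, and the example numbers are assumptions made for illustration; the abstract does not specify how SuperABR actually performs this calculation.

```python
def download_duration(chunk_bits, throughput_bps, slot_s):
    """Estimate how long a chunk takes to download (hypothetical helper).

    chunk_bits     -- size of the video chunk in bits
    throughput_bps -- forecast throughput per slot, in bits per second
    slot_s         -- width of each forecast slot in seconds (e.g. 0.001)

    Returns the duration in seconds, or None if the forecast ends
    before the chunk is fully downloaded.
    """
    remaining = float(chunk_bits)
    if remaining <= 0:
        return 0.0
    elapsed = 0.0
    for rate in throughput_bps:
        capacity = rate * slot_s  # bits deliverable within this slot
        if rate > 0 and capacity >= remaining:
            # The chunk finishes partway through this slot.
            return elapsed + remaining / rate
        remaining -= capacity
        elapsed += slot_s
    return None


# Example: 1500 bits over slots of 1000 bps and 2000 bps, 1 s each.
print(download_duration(1500, [1000, 2000], 1.0))
```

A coarse estimator that divides the chunk size by a single average rate would hide the slow first slot in this example, which is the kind of variation the abstract says coarse-grained estimation misses.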

Abstract:
To estimate depth maps from monocular videos in a self-supervised way, existing methods simultaneously predict the pose changes between adjacent frames and the depth maps of each frame, and then reconstruct the forward or backward frames using them, thereby casting depth estimation as a frame reconstruction problem. The corresponding reconstruction loss, which serves as a key supervision signal for training the whole network, can adversely affect the depth estimation accuracy if it is not properly established. In this paper, we propose a novel self-supervised monocular depth estimation method from videos via adaptive reconstruction constraints, i.e., designing the loss functions by establishing more accurate reconstruction constraints. Specifically, we first propose a pose-adaptive reconstruction loss to adaptively select the optimal pose parameterizations that yield the minimum reconstruction errors, reducing the impact of inaccurate posture on frame reconstruction. Then, we propose a region-sensitive reconstruction loss that fully utilizes the pretrained image reconstruction model to adaptively identify the poorly reconstructed regions and characterize the deviation of these regions on feature space. Finally, we additionally construct a multi-frame depth estimation network and design a reconstruction-guided bidirectional distillation loss to adaptively adjust the direction of distillation between networks of multi-frame and monocular depth estimation based on their current reconstruction quality, which encourages them to learn from each other and benefits the core task of monocular depth estimation. With our proposed losses, we achieve superior performance in comparison with state-of-the-art methods on benchmark datasets.

Abstract:
In recent years, transformer-based models have exhibited considerable potential in point cloud instance segmentation. Despite the promising performance achieved by existing methods, they encounter challenges such as instance query initialization problems and excessive reliance on stacked layers, rendering them incompatible with large-scale 3D scenes. This paper introduces a novel method, named SGIFormer, for 3D instance segmentation, which is composed of the Semantic-guided Mix Query (SMQ) initialization and the Geometric-enhanced Interleaving Transformer (GIT) decoder. Specifically, the principle of our SMQ initialization scheme is to leverage the predicted voxel-wise semantic information to implicitly generate the scene-aware query, yielding adequate scene prior and compensating for the learnable query set. Subsequently, we feed the formed overall query into our GIT decoder to alternately refine instance query and global scene features for further capturing fine-grained information and reducing complex design intricacies simultaneously. To emphasize geometric property, we consider bias estimation as an auxiliary task and progressively integrate shifted point coordinates embedding to reinforce instance localization. SGIFormer attains state-of-the-art performance on ScanNet V2, ScanNet200, S3DIS datasets, and the challenging high-fidelity ScanNet ++ benchmark, striking a balance between accuracy and efficiency. The code, weights, and demo videos are publicly available at https://rayyoh.github.io/SGIFormer/.

Abstract:
LiDAR-based single object tracking plays a key role in intelligent vehicles. Current methods typically follow appearance matching or motion-centric frameworks. However, point clouds are usually sparse and incomplete, providing insufficient appearance information for matching. While the motion-centric framework predicts inter-frame motion of targets instead of performing appearance matching for tracking, it neglects contextual information matching of consecutive frames that is conducive to target motion modeling. In this paper, we propose an elegant and effective framework by leveraging Context Matching to guide motion modeling for accurate Tracking (CMTrack). The novel framework possesses two attractive properties: 1) It incorporates a context matching encoder-decoder network to match contextual information of consecutive frames, fully exploring informative cues relevant to target motion. 2) Benefiting from informative motion cues being modeled, CMTrack allows for accurate prediction of inter-frame motion of targets in a one-stage manner. Extensive experiments are conducted on several widely adopted datasets, i.e., KITTI, NuScenes and Waymo Open Dataset. Without bells and whistles, our CMTrack demonstrates competitive tracking accuracy (e.g., 87.3% and 69.3% precision on KITTI and NuScenes, respectively) compared to state-of-the-art methods, while running at a high speed of 48 FPS on a single Titan Xp GPU.

Abstract:
Compressive learning (CL) is an emerging framework that enables machine learning inference tasks to be performed directly on the measurements of compressed sensing (CS), which can reduce memory usage and improve compute and transmission efficiency. However, as a typical CL task, CS object detection is still challenging due to severe feature loss during sampling and is hard to deploy on terminal equipment with limited computing resources. In recent years, one-stage object detection models have achieved remarkable detection speed and accuracy in the image domain. In this paper, we introduce a CS object detection architecture (CSDet), which can perform object detection in the compressed domain and reconstruct color images with high quality at arbitrary sampling rates. CSDet is composed of a CS object detection module (CSDM) and a CS reconstruction module (CSRM) with lightweight networks. The CSDM integrates an optimizable joint multi-channel sampling matrix and a lightweight one-stage object detection network, which are trained jointly. The CSRM employs a joint multi-channel global reconstruction network. Experimental validation on the MS-COCO-Person dataset demonstrates that the proposed CS compressed domain object detection method achieves an 80% decrease in floating-point operations with even better accuracy compared to the image domain method. Meanwhile, the method exhibits notable improvements in reconstruction quality and speed compared to recent approaches.

Abstract:
In object goal navigation tasks, the robot’s understanding of semantic relationships in the environment is a key factor in its ability to localize target objects. Previously, learning-based methods trained robots using 3D scene datasets to learn semantic relationships. However, these approaches perform poorly in new environments with unfamiliar semantic contexts. In this paper, we propose ChatNav which leverages the powerful knowledge summarizing and reasoning capabilities of a Large Language Model (LLM) for zero-shot inference of explicit semantic relationships. These relationships are further integrated into the navigation system for efficient localization of target objects. ChatNav employs a spatial object clustering algorithm to collect semantic clues and designs common-sense-based prompts for interacting with LLM. It then uses a gravity-repulsion model to convert inference results into heuristic factors for robust navigation decision-making. Our approach requires no additional training and can consistently obtain accurate semantic relationships from LLM, making it well-suited for navigating unknown environments. Experimental results demonstrate the outstanding navigation performance of our proposed method on the Gibson and HM3D datasets, surpassing the current state-of-the-art object goal navigation methods.

Abstract:
In the field of Infrared-Visible Image Fusion (IVIF), the preservation of details, edges, and texture is crucial for generating high-quality fused images. However, a major challenge arises due to the inevitable loss of high-frequency information during feature extraction, resulting in fused images that lack significant details. In this paper, we propose a dual-branch auto-encoder by exploiting an invertible high-frequency branch for detailed feature preservation and a transformer-based low-frequency branch for global dependencies modeling. First, the high-frequency branch employs the wavelet transforms and an Invertible Neural Networks (INN)-based encoder to model high-frequency features through an invertible transformation, including a forward process for image fusion and an inverse process for original image reconstruction. Additionally, a high-frequency loss is designed to enhance the high-frequency feature representation for high-quality image fusion. Second, a low-frequency branch based on a transformer encoder and an adaptive fusion module is introduced to capture the global contextual features of the infrared and visible images. Finally, the decoder integrates the low- and high-frequency features from both branches to generate the final fused image. Image fusion, object detection, and semantic segmentation experiments conducted on public datasets such as TNO, MFNet, and M3FD, show that our method outperforms the state-of-the-art (SOTA) image fusion methods.

Abstract:
Text-based person retrieval (TBPR) is a challenging task that aims at retrieving candidate pedestrian images from a gallery, using textual descriptions as queries. Existing methods generally assume that the textual query and the unique candidate image have a certain cross-modal relationship under one-to-one constraint, and optimize their conditional probability via a discriminative paradigm. However, in real scenarios of TBPR, a textual query may associate with multiple candidate images at one time, indicating that the uncertainty resides in the one-to-many cross-modal relationship. Moreover, the learnt conditional probabilities from the discriminative paradigm of existing methods may be less effective in reflecting the joint probabilities of the textual query and candidate images. To tackle these problems, we propose a novel method termed Cross-modal Uncertainty Modeling with Diffusion-based Refinement (CUMDR) for the TBPR task. First, we implicitly model the cross-modal uncertainty to capture richer semantics and complex correlations, thus generating diverse yet plausible retrievals. Additionally, to reasonably mitigate the impact of noisy data with high uncertainty, we quantify the uncertainty to allocate the importance of raw and complement annotations, which are generated from the multi-modal large language model based on the retrieval-augmented template. Finally, we propose a novel diffusion-based denoiser to progressively refine cross-modal alignments by learning joint probabilities. Extensive experiments on three TBPR datasets demonstrate the superior performance and generalizability of our CUMDR approach compared to the latest methods. Our anonymous implementation repository is available at https://github.com/Shenshen7/CUMDR.

Abstract:
Deep learning has been extensively applied in medical image segmentation, providing significant support for disease diagnosis. However, traditional encoder-decoder networks struggle with segmenting scale-sensitive point target lesions. To address this challenge, this paper proposes an innovative incremental fusion architecture that can integrate different models and achieve significant performance improvements through complementary fusion. Based on this architecture, we developed Focus-TransUnet3D by combining the Trans-FusionNet3D model and the 3D Unet model. This model adopts a global-to-local segmentation strategy, effectively addressing the challenges of medical point target segmentation, thereby expanding the application of deep learning in the field of medical image processing. Furthermore, we design a deep fusion strategy suitable for the transformer model to adapt to multi-scale feature learning. The integration of the transformer model with convolutional neural networks brings improvements in local and global feature extraction capabilities, enhancing the applicability of our model. We evaluate our model on three clinical datasets with different target scales: the Intracranial Artery dataset, the Intracranial Aneurysm dataset, and the LiTS17 dataset. The results indicate that in the external test for intracranial aneurysm auxiliary diagnosis, the model trained with only 47 annotated samples achieved the state-of-the-art performance, attaining a Dice coefficient of 84.14% and a sensitivity of 100%. This effectively addresses the challenges of annotation scarcity and tiny targets. Our code will be released at https://github.com/caijilia/FTUnet3D.

Abstract:
Catastrophic forgetting is the core problem of class incremental learning (CIL). Existing work mainly adopts memory replay, knowledge distillation, and dynamic architecture to alleviate this problem, but seldom from the aspect of parameter regularization. However, existing parameter regularization methods struggle to achieve an appropriate balance between old and new tasks. To bring it back to CIL, we first propose constrained incremental learning with less forgetting direction (LFD) to leave more plasticity for the new task under a strong stability constraint for old tasks. Specifically, the new parameters are constrained to be close to the LFD of old tasks instead of a single group of old parameters. To validate the effectiveness of this regularization, we investigate the connectivity between the old parameters and the new parameters, and additionally find that a higher accuracy interval exists along the linear connection. Therefore, we further propose a post-processing procedure to find an equilibrium point in this interval for better balance between old and new tasks. Extensive classification experiments on CIFAR-100, ImageNet-100, and ImageNet-1K show our method can significantly improve performance compared with existing CIL methods and the object detection experiments on PASCAL-VOC show its broad generality on other tasks.
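
The post-processing idea of searching along the linear connection between old and new parameters can be sketched as follows. This is only an illustration under stated assumptions: the helper names `interpolate` and `find_equilibrium`, the uniform grid search, and the `score` callback (standing in for some balanced old/new-task accuracy) are hypothetical and are not taken from the paper.

```python
def interpolate(old, new, alpha):
    """Point on the line segment between two flat parameter vectors."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old, new)]


def find_equilibrium(old, new, score, steps=10):
    """Grid-search alpha in [0, 1] for the best-scoring interpolation.

    score is a caller-supplied function mapping a parameter vector to a
    number (higher is better); here it is a placeholder for a validation
    metric that balances old and new tasks.
    """
    best_alpha, best_score = None, float("-inf")
    for i in range(steps + 1):
        alpha = i / steps
        s = score(interpolate(old, new, alpha))
        if s > best_score:
            best_alpha, best_score = alpha, s
    return best_alpha, best_score


# Toy example: the score peaks when the single parameter equals 0.3.
alpha, value = find_equilibrium([0.0], [1.0], lambda p: -(p[0] - 0.3) ** 2)
print(alpha, value)
```

With `alpha = 0` the result is the old parameters and with `alpha = 1` the new ones, so the search covers the whole connecting segment.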

Abstract:
Machine learning techniques can help us deal with many difficult problems in the real world. Proper ensemble of multiple learners can improve the predictive performance. Each base learner usually has different predictive ability on different instances or in different instance regions. However, existing ensemble methods often assume that base learners have the same predictive ability for all instances without consideration of the specificity of different instances or categories. To address these issues, we propose an adaptive ensemble learning framework with category-aware attention and local contrastive loss, which can adaptively adjust the ensemble weight of each base classifier according to the characteristics of each instance. Specifically, we design a category-aware attention mechanism to learn the predictive ability of each classifier on different categories. Furthermore, we design a local contrastive loss to capture local similarities between instances and further enhance the model’s ability to discern fine-grained patterns in the data. Extensive experiments on 20 public datasets demonstrate the effectiveness of the proposed model.
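
As a rough picture of instance-adaptive ensembling, the sketch below turns per-classifier scores for one instance into softmax weights and uses them to mix the base classifiers' class probabilities. The function names and the use of a plain softmax are assumptions for illustration; the abstract does not describe how the category-aware attention computes its scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_ensemble(base_probs, attention_scores):
    """Weighted average of base-classifier outputs for ONE instance.

    base_probs       -- one class-probability list per base classifier
    attention_scores -- one score per base classifier for this instance
                        (hypothetical output of an attention module)
    """
    weights = softmax(attention_scores)
    n_classes = len(base_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, base_probs))
        for c in range(n_classes)
    ]


# Two base classifiers that disagree; the first is trusted more here.
print(adaptive_ensemble([[0.9, 0.1], [0.2, 0.8]], [2.0, 0.0]))
```

Because the weights are recomputed per instance, a classifier that is reliable on one category can dominate there and be down-weighted elsewhere, in contrast to a fixed-weight ensemble.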

Abstract:
Many robotics and industry applications have a high demand for the capability to estimate the 6D pose of novel objects in cluttered scenes. However, existing classic pose estimation methods are object-specific and can only handle the specific objects seen during training. When applied to a novel object, these methods necessitate a cumbersome onboarding process, which involves extensive dataset preparation and model retraining. The extensive duration and resource consumption of onboarding limit their practicality in real-world applications. In this paper, we introduce ZeroPose, a novel zero-shot framework that performs pose estimation following a Discovery-Orientation-Registration (DOR) inference pipeline. This framework generalizes to novel objects without requiring model retraining. Given the CAD model of a novel object, ZeroPose needs only seconds of onboarding time to extract visual and geometric embeddings from the CAD model as a prompt. With the prompting of the above embeddings, DOR can discover all related instances and estimate their 6D poses without additional human interaction or presupposing scene conditions. Compared with existing zero-shot methods solved by the render-and-compare paradigm, the DOR pipeline formulates object pose estimation as a feature-matching problem, which avoids time-consuming online rendering and improves efficiency. Experimental results on seven datasets show that ZeroPose as a zero-shot method achieves comparable performance with object-specific training methods and outperforms the state-of-the-art zero-shot method with a 50x inference speed improvement.

Abstract:
The complementary properties exhibited upon RGB-T data involve context complementarity as well as content complementarity. During cross-modal feature fusion, most existing RGB-T semantic segmentation methods are dedicated to highlighting the exploitation of content-complementary information. Unfortunately, these methods usually overlook the excavation of cross-modal context-complementary information (i.e., the contextual dependencies among different regions that only exist in one certain modality data) or try to exploit such cross-modal context-complementary information in an implicit way, yielding fragmentary semantic segmentation results. To remedy this problem, in this paper, a novel Cross-modal Context- and Content-Complementarity Network ($\mathbf{C}^4$Net) is presented for RGB-T semantic segmentation, in which both the cross-modal context-complementary information and the cross-modal content-complementary information are fully excavated and exploited during cross-modal feature fusion. Specifically, a Context-Complementary Information Aggregation (CxCIA) module is carefully designed, in which the cross-modal context-complementary information is explicitly excavated by measuring the discrepancies between contextual dependencies from different modality data. Then, such cross-modal context-complementary information is further exploited to enhance the original RGB and thermal contextual dependencies for boosting the integrity of objects in the fused features. In the meantime, a Content-Complementary Information Aggregation (CnCIA) module is presented, which highlights the utilization of cross-modal content-complementary information from a multi-scale perspective. Furthermore, an MLP-based Multi-level Feature Interaction (MFI) decoder is presented, in which the semantic gaps among different levels of fused features are mitigated by establishing the interactions of multi-level fused features along spatial and channel dimensions. Comprehensive experimental results on several public datasets demonstrate that our proposed $\mathbf{C}^4$Net surpasses other state-of-the-art models.

Abstract:
Considering that the nature of the stego signal caused by spatial domain steganography and joint photographic experts group (JPEG) domain steganography is different, existing deep-learning steganalysis networks typically cannot work well in both spatial and JPEG domains. We propose a unified steganalysis network named ESNet to effectively preserve and identify the stego signal from spatial and JPEG domains. Specifically, dual-branch preprocessing extracts noise residuals by using fixed SRM kernels (branch 1) and randomly initialized kernels (branch 2), fuses the features from two branches and exchanges the fused complementary information through two carefully designed bidirectional fusion blocks, thereby effectively enhancing the signal-to-noise ratio. During feature extraction, considering that low-level features, such as texture and edge, are indispensable for steganalysis, we gather multi-level feature maps at different layers of the network to provide richer feature representations and merge them by using a multi-level feature fusion module, which learns the weight of different features in single-level feature map to enhance the expression of steganographic features. During classification, the multi-scale attention pooling module is employed to extract multi-scale features by designing convolution kernels of different sizes. After concatenating features of different scales, gated channel transformation is exploited to weight the importance of each channel to further strengthen the representations of steganographic features. Finally, stylepooling in combination with global standard deviation pooling and global average pooling, is used to compress channels and preserve the representation ability of channels as much as possible for classification. The experimental results show that the proposed ESNet exhibits state-of-the-art detection performance in both spatial and JPEG domains, and achieves satisfactory robustness against the cover source mismatch.
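
For readers unfamiliar with the two standard pooling operations named in the final stage, the sketch below computes global average pooling and global standard deviation pooling per channel and concatenates the results. It is a generic illustration only: the function names are hypothetical, the population standard deviation is an assumed choice, and it does not reproduce the stylepooling step or ESNet's actual classifier head.

```python
import math


def global_avg_pool(feature_map):
    """Mean of every channel; feature_map is a list of 2D channel grids."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def global_std_pool(feature_map):
    """Population standard deviation of every channel (assumed variant)."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        pooled.append(math.sqrt(variance))
    return pooled


def pooled_descriptor(feature_map):
    """Concatenate both statistics into one vector per input."""
    return global_avg_pool(feature_map) + global_std_pool(feature_map)


# One 2x2 channel with values 1 and 3: mean 2.0, standard deviation 1.0.
print(pooled_descriptor([[[1.0, 3.0], [1.0, 3.0]]]))
```

Each channel collapses to one number per statistic, so the descriptor length depends only on the channel count and not on the spatial size of the input.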

Abstract:
VCOD (Video Camouflage Object Detection) is a crucial security technology that identifies camouflaged objects in videos, bolstering security measures across diverse applications. On one hand, appearance-based VCOD methods face challenges because camouflaged appearances cause objects to blend into their surroundings, and current VCOD methods typically utilize optical flow to represent motion information. However, over-reliance on accurate estimation renders the model overly fragile. On the other hand, there is a shortage of effectively annotated camouflaged video datasets, coupled with the time-consuming and labor-intensive annotation process, severely constraining the development of this field. To address this, we propose a novel weakly-supervised framework for VCOD based on cross-domain querying of preceding and succeeding frames. Specifically, we propose a time-efficient and labor-saving manual annotation approach based on large visual models to rapidly generate pseudo-labels. Furthermore, we design a network based on Spatio-Temporal Memory (STM) that performs cross-modal feature querying with the current frame against preceding and succeeding frames to acquire useful information, thereby enhancing the focus on temporal information. Extensive experiments conducted on two common VCOD datasets have proven the effectiveness of our method, achieving state-of-the-art performance on the challenging camouflaged video data.

Abstract:
Unsupervised Cross-Domain 3D Model Retrieval (UCD3DMR) has emerged as an effective tool for managing 3D model data recently. However, existing UCD3DMR algorithms typically demand accessibility to source data and cross-domain label consistency, limiting their deployment in real-world industrial scenarios. Therefore, we relax the two demanding constraints and explore to address a newly challenging task, source-free universal 3D model retrieval (SFU3DMR). However, the inaccessibility to source data results in significant label noise in target pseudo-labels, while cross-domain label inconsistency introduces interference from target-private models, presenting tremendous challenges to model transfer. To address these challenges, we propose a novel SFU3DMR algorithm, Progressive Contrastive Label Optimization (PCLO). Specifically, we introduce the Neighbor-based Soft Label Optimization (NSLO) strategy, which refines target pseudo-labels based on the pseudo-label confidence of their nearest neighbors. Additionally, we design the Adaptive Hybrid Label Optimization (AHLO) strategy, which conducts positive label optimization to maximize label semantics for target-common models and executes negative label optimization to minimize label noise for target-private models. Experimental results confirm that the combined NSLO and AHLO strategies effectively refine the target pseudo-labels, and our PCLO achieves state-of-the-art performance for SFU3DMR on two well-established cross-domain benchmarks (MI3DOR and NTU/PSB).

Abstract:
In recent years, deep learning has been significantly advancing the field of image deblurring. However, existing deep learning models usually rely on overloaded large kernel convolutions or overweighted attention modules. This leads to a heavy computational burden and restricts real applications. To address this issue, we propose a lightweight deblurring network, termed RGE-Net. Our RGE-Net possesses two novel features: 1) We propose a recurrent path into the convolutions to ensure each kernel weight can learn better and stronger feature information, thus increasing the parameter efficiency and reducing the parameters. Furthermore, we propose gated attention to suppress incorrect features flowing in the recurrent path, thus improving performance. 2) We decouple the kernels into spatial and channel components to reduce learning difficulty by reducing parameters and then perform an attention mechanism to obtain significant performance. Extensive experiments on benchmark datasets demonstrate the superiority of RGE-Net over state-of-the-art deblurring models in terms of both effectiveness and efficiency.

Abstract:
Data drift is a thorny challenge when deploying person re-identification (ReID) models into real-world devices, where the data distribution is significantly different from that of the training environment and keeps changing. To tackle this issue, we propose a federated spatial-temporal incremental learning approach, named FedSTIL, which leverages both lifelong learning and federated learning to continuously optimize models deployed on many distributed edge clients. Unlike previous efforts, FedSTIL aims to mine spatial-temporal correlations among the knowledge learnt from different edge clients. Specifically, the edge clients first periodically extract general representations of drifted data to optimize their local models. Then, the learnt knowledge from edge clients will be aggregated by centralized parameter server, where the knowledge will be selectively and attentively distilled from spatial- and temporal-dimension with carefully designed mechanisms. Finally, the distilled informative spatial-temporal knowledge will be sent back to correlated edge clients to further improve the recognition accuracy of each edge client with a lifelong learning method. Extensive experiments on a mixture of five real-world datasets demonstrate that our method outperforms others by nearly 4% in Rank-1 accuracy, while reducing communication cost by 62%. All implementation codes are publicly available on https://github.com/MSNLAB/Federated-Lifelong-Person-ReID.

Affiliations: School of Computer Science, University of Electronic Science and Technology of China, Zhongshan Institute, Zhongshan, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Complex Laboratory of New Finance and Economics, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China; School of Software Engineering, South China University of Technology, Guangzhou, China

Abstract:
Natural image matting is a widely used image processing technique that extracts the foreground by predicting the alpha values of the unknown region based on the alpha values of the known foreground and background regions. However, existing image matting methods may not yield optimal results when applied to images containing transparent objects because the known foreground region is small or even absent. To address this shortcoming, in this paper, we propose a novel method named Transparent Object Matting using Predicted Definite Foreground and Background (TOM-PDFB), which can explore and utilize the definite foreground and background in the unknown region. For this purpose, a newly developed foreground-background confidence estimator is applied to predict the confidence level of the definite foreground and the definite background, thus providing the priors required for transparent object matting. Next, a foreground-background guided progressive refinement network, developed as part of this work, is adopted to incorporate the estimated definite foreground and background into the alpha matte refinement process. Extensive experimental results demonstrate that TOM-PDFB outperforms state-of-the-art methods when applied to transparent objects. Project page: https://github.com/yihuiliang/TOM-PDFB.

Abstract:
Contrastive Learning (CL) is an effective self-supervised learning method. It performs instance-level contrastiveness based on the image representations, which enables the model to extract abstract information from images. However, when training data is insufficient, abstract information fails to distinguish samples from different classes. This problem is more severe in the pre-training of Vision Transformer (ViT). In general, detailed information is crucial for enhancing the discrimination of representations. Patch representations, which focus on the details of images, are often overlooked in existing methods that train ViT through CL, resulting in the confusion of similar samples. To address this problem, we propose a Contrastive Learning model with Enhancing Detailed Information (CL-EDI) for pre-training ViT. Our model consists of dual ViT contrastive modules. The first module is similar to MoCo V3, which can learn abstract information about images. The role of the second ViT contrastive module is to enhance detailed information in data representations by aggregating patch representations of images. Extensive experiments demonstrate the necessity of learning detailed information. Across several datasets, our model surpasses existing approaches in image classification, transfer learning and object detection tasks.

Abstract:
Visual place recognition is crucial to accurate localization in large-scale environments. Existing methods combine convolutional neural networks and deep metric learning to improve performance; however, it remains challenging to improve the adaptability of features under various environmental conditions. To address the problem, this paper proposes a hardness-aware metric learning method with cluster-guided attention. By leveraging the affinity between each local feature and the corresponding scene cluster center, the model is guided to focus on the local features proximate to their clusters, while suppressing outlier local features that deviate from their clusters. In this way, the scene-related reliable local features are concentrated to construct the global feature of the whole image with adaptability to environmental conditions. Meanwhile, a hardness-aware metric loss is designed to train the proposed network, which determines the hardness of negative samples based on their similarity to the query images and the training iterations. Subsequently, the hardness is used to reweight each term of the loss function to promote network optimization. In addition, a condition normalization layer is introduced to regularize the feature distributions under different environmental conditions to a canonical space, improving the feature robustness to condition variations. Our method achieves top-10 recalls of 97.2%, 99.0%, and 94.9% on the Pitts250k-test, TokyoTM-val, and Tokyo 24/7 datasets, respectively. Extensive experiments demonstrate that the proposed method learns robust global features with adaptability to various environmental conditions.

Abstract:
Recent studies have shown that video action recognition models are also vulnerable to adversarial samples. However, existing video attack methods usually incur high computational overhead (e.g., they generate adversarial perturbations for all frames by default), and most of them are difficult to realize as printable attacks in the physical world. To address these issues, we devise a novel, efficient, and effective framework for attacking video action recognition: Bullet-Screen-Emoji Attack with Temporal Difference Noise (BSE), a reinforcement learning-based black-box attack method that fools the model by simply generating adversarial bullet screens for key frames and scrolling them over the clean video. The agent is optimized to take the optimal actions, i.e., searching for key frames. Moreover, we introduce a simple and effective temporal difference noise to enhance the attack capability of the adversarial bullet screen and accelerate the convergence speed. Most importantly, BSE enables printable physical attacks. Extensive experiments show that our proposed BSE achieves promising attack performance on mainstream datasets (HMDB51, UCF101 and Kinetics-400) and in the physical world with high efficiency.

Abstract:
In reversible data hiding (RDH) in the plaintext domain, the reversibility of the data and the image is the greatest strength but also comes with limitations, such as low embedding capacity and weak generalization ability. These limitations make it challenging for RDH to be applied in scenarios that require the concealment of high-capacity data. To address these issues, we propose a compatible reversible data hiding with high capacity and generalization (CRDH), which can perform a second embedding based on all existing RDH methods and the two extractions are independent of each other. The nearest-neighbor interpolation (NNI) algorithm and integer wavelet transform are initially designed to create additional redundancy room, diverging from existing RDH methods that typically exploit the inherent redundancy within the image itself. Following this, we derive a novel method to prevent pixel value overflow or underflow, which is employed to guide the data embedding process. In the experimental results on standard test images, the average maximum embedding capacity of the CRDH method reaches 4.41 bits per pixel (BPP), which is 1.98 times that of other methods. As the embedded data increases, the peak signal-to-noise ratio (PSNR) of CRDH’s stego-images becomes higher compared to other methods. Furthermore, CRDH exhibits a significantly superior generalization ability in terms of both capacity and quality compared to state-of-the-art RDH methods.
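As a rough illustration of why interpolation can create extra room, the sketch below upscales an image with nearest-neighbor interpolation, so that every source pixel is replicated into a block of identical values. It shows only the generic NNI operation; how CRDH combines it with the integer wavelet transform and its overflow/underflow handling is not described in the abstract, and the function name `nni_upscale` is hypothetical.

```python
def nni_upscale(image, factor=2):
    """Nearest-neighbor upscaling of a 2D list of pixel values.

    Each source pixel is copied into a factor x factor block, so the
    enlarged image carries factor*factor - 1 redundant copies per pixel.
    """
    rows, cols = len(image), len(image[0])
    return [
        [image[r // factor][c // factor] for c in range(cols * factor)]
        for r in range(rows * factor)
    ]


small = [[1, 2],
         [3, 4]]
large = nni_upscale(small, factor=2)
for row in large:
    print(row)
```

In this toy case three of every four output pixels are exact copies, which is the kind of added redundancy a second embedding stage could, in principle, exploit.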

Abstract:
As deepfake technology poses severe threats to information security, significant efforts have been devoted to deepfake detection. To enable model generalization for detecting new types of deepfakes, existing models must learn knowledge about new types of deepfakes without losing prior knowledge, a challenge known as catastrophic forgetting (CF). Existing methods mainly utilize domain adaptation to learn about the new deepfakes for addressing this issue. However, these methods are constrained to utilizing a small portion of data samples from the new deepfakes, and they suffer from CF when the size of the data samples used for domain adaptation increases. This results in poor average performance in source and target domains. In this paper, we introduce a novel approach to boost the generalizability of deepfake detection. Our approach follows a two-stage training process: training in the source domain (prior deepfakes that have been used for training) and domain adaptation to the target domain (new types of deepfakes). In the first stage, we employ expansive learning to train our expanded model from a well-trained teacher model. In the second stage, we transfer the expanded model to the target domain while removing assistant components. For model architecture, we propose the frequency extraction module to extract frequency features as complementary to spatial features and introduce spatial-frequency contrastive loss to enhance feature learning ability. Moreover, we develop a confidence judgement module to eliminate conflicts between new and prior knowledge. Experimental results demonstrate that our method can achieve better average accuracy in source and target domains even when using large-scale data samples of the target domain, and it exhibits superior generalizability compared to state-of-the-art methods.

Affiliations: School of Computer Science, China University of Geosciences, Wuhan, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China; School of Artificial Intelligence, Hubei University, Wuhan, China; School of Computer, National University of Defense Technology, Changsha, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Normal University, Jinhua, China

Abstract:
Hyperspectral band selection, aimed at identifying key spectral bands from the original image, is crucial for reducing dimensionality and enhancing computational efficiency in hyperspectral image (HSI) analysis. Graph learning-based methods have attracted considerable attention due to their efficiency in representing structural correlations between bands and their powerful capability to extract features. However, existing methods have limitations in utilizing spatial relationships among bands and learning their discriminative characteristics. To address these limitations, we propose a Diversity Learning Guided Dual Graph Autoencoder (DLG-DGAE) for unsupervised hyperspectral band selection. In our framework, we integrate a Dual Graph Autoencoder (DGAE) module designed to extract information from both the spatial and spectral relationships among bands, thus fully capturing the structural similarity of the bands. Additionally, we introduce a Spectral Diversity Learning (SDL) strategy to reduce redundant information in the latent representation and enhance the discriminative properties of each band. In the final step, we proceed to cluster the fused latent embeddings. Within each cluster, we select the band exhibiting the highest information entropy as the representative band. Through extensive experimentation on three publicly available datasets, our results consistently indicate that the proposed method surpasses other state-of-the-art techniques. The code is available at https://github.com/fengwe1/DLG-DGAE.
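The final selection step described above (cluster the bands, then keep the highest-entropy band in each cluster) can be sketched as follows. This is a minimal illustration, not the released DLG-DGAE code: the cluster labels are assumed to be given, entropy is computed on already-quantized pixel values, and the names `band_entropy` and `select_bands` are hypothetical.

```python
import math
from collections import Counter


def band_entropy(pixels):
    """Shannon entropy (in bits) of one band's quantized pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_bands(bands, cluster_ids):
    """Return, for each cluster, the index of its highest-entropy band."""
    best = {}  # cluster id -> (entropy, band index)
    for idx, (pixels, cid) in enumerate(zip(bands, cluster_ids)):
        h = band_entropy(pixels)
        if cid not in best or h > best[cid][0]:
            best[cid] = (h, idx)
    return sorted(idx for _, idx in best.values())


# Toy data: four bands of four pixels each, grouped into two clusters.
bands = [
    [0, 0, 0, 0],  # constant band, entropy 0 bits
    [0, 1, 2, 3],  # uniform over four values, entropy 2 bits
    [0, 0, 1, 1],  # entropy 1 bit
    [5, 5, 5, 6],  # entropy of about 0.81 bits
]
cluster_ids = [0, 0, 1, 1]
selected = select_bands(bands, cluster_ids)
print(selected)  # [1, 2]
```

In the paper the clustering operates on the fused latent embeddings; here the labels are simply supplied so that the entropy-based choice within each cluster is visible.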

Abstract:
Human pose estimation is a challenging research task in the computer vision community due to the semantic ambiguity problem caused by inevitable occlusions, varying body shapes, and complex articulations. Although deep learning-based methods have significantly improved the performance of this task, existing feature upsampling operations, e.g., bilinear interpolation and transposed convolution, within current convolutional neural networks and Transformer frameworks suffer from several limitations, including the inability to adapt to specific tasks and the loss of fine-grained semantic details. In this work, we propose a simple yet effective two-step stable feature upsampling (SIU) strategy that addresses these limitations by leveraging a learnable and efficient upsampling operation. Specifically, we first apply periodic shuffling to increase the resolution of the feature maps. Then, we utilize convolution layers to adjust the number of feature channels to match that of the input feature maps. The proposed SIU enables the entire network to adapt to the specific feature requirements of the human pose estimation task, making it more effective in preserving spatial information. Quantitatively, extensive experimental results on the challenging COCO-WholeBody dataset validate that our approach outperforms state-of-the-art methods in both accuracy and efficiency and possesses strong transferability, making it applicable to a wide range of baselines. Moreover, the qualitative results validate that SIU can effectively eliminate the semantic ambiguity problem in challenging pose scenarios, such as occlusions and overlapping. The code and weights have been released at: SIU.
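The first of the two steps, periodic shuffling, trades channels for spatial resolution. The sketch below is a plain-Python version of that rearrangement on nested lists, assuming the common channel ordering in which output pixel (h*r + i, w*r + j) of channel c is read from input channel c*r*r + i*r + j. It is an illustration only: the second step (the channel-adjusting convolution) is omitted, and `periodic_shuffle` is a hypothetical name, not the authors' code.

```python
def periodic_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into shape (C, H*r, W*r)."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [
        [[0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                # Each group of r*r input channels fills one r x r sub-grid.
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become a single 2x2 channel.
x = [[[k]] for k in range(4)]
y = periodic_shuffle(x, 2)
print(y)  # [[[0, 1], [2, 3]]]
```

Because the shuffle divides the channel count by r*r, a following convolution is needed to bring the number of channels back to that of the input feature maps, which is the role the abstract assigns to the second step.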

Abstract:
In the classical radar imaging framework, the echo signals can be well compressed and focused by matched filtering. Yet such methods suffer from model mismatch in non-ideal scenarios, such as active jamming and motion errors. In these situations, the imaging results are defocused or blurred. To solve these problems, a new learnable SAR imaging method is proposed in this paper. First, a hierarchical U-shaped network, ImagingNet, is constructed to learn the imaging mechanism from historical data. The base model is formed by a training strategy that minimizes the errors between the learned imaging result and the reference image. On this basis, a new teacher-student training strategy is developed to refine the base model and form the advanced model accordingly. Different from the classical imaging framework, the proposed method can focus the echo signals in both ideal and non-ideal scenarios. In addition, the proposed method can achieve real-time imaging when deployed on a parallel computing platform. Multiple rounds of experiments were performed to verify the proposed method. Compared with the classical method, improvements of 0.209 and 0.502 in SSIM and 4.8 dB and 4.4 dB in PSNR are achieved in the active jamming and motion error scenarios, respectively.

Abstract:
Visible-infrared image pairs provide complementary information, enhancing the reliability and robustness of object detection applications in real-world scenarios. However, most existing methods face challenges in maintaining robustness under complex weather conditions, which limits their applicability. Meanwhile, the reliance on attention mechanisms in modality fusion introduces significant computational complexity and storage overhead, particularly when dealing with high-resolution images. To address these challenges, we propose the Cross-modality Fusion Mamba with Weather-removal (CFMW) to augment stability and cost-effectiveness under adverse weather conditions. Leveraging the proposed Perturbation-Adaptive Diffusion Model (PADM) and Cross-modality Fusion Mamba (CFM) modules, CFMW is able to reconstruct visual features affected by adverse weather, enriching the representation of image details. With efficient architecture design, CFMW is 3 times faster than Transformer-style fusion (e.g., CFT). To bridge the gap in relevant datasets, we construct a new Severe Weather Visible-Infrared (SWVI) dataset, encompassing diverse adverse weather scenarios such as rain, haze, and snow. The dataset contains 64,281 paired visible-infrared images, providing a valuable resource for future research. Extensive experiments on public datasets (i.e., M3FD and LLVIP) and the newly constructed SWVI dataset conclusively demonstrate that CFMW achieves state-of-the-art detection performance. Both the dataset and source code will be made publicly available at https://github.com/lhy-zjut/CFMW

Abstract:
Few-shot food recognition aims to first train a meta-model based on an extensive labeled dataset, and then adapt it to recognize novel food classes with limited labeled data. Although existing studies have achieved compelling success, they still rely heavily on a large amount of labeled food data for training the initial meta-model. To save the annotation cost, we propose the unsupervised food recognition task, which aims to train a meta-model using only unlabeled food data. Because food images present two challenges, 1) high intra-class variation and 2) high inter-class similarity, directly applying existing unsupervised few-shot learning methods could lead to sub-optimal results. To this end, we propose a novel framework, i.e., Unsupervised Few-shot Food Recognition with Intra-class Variation and Inter-class Similarity (UFFR-IVIS). It consists of two key components: 1) dual diversity-injected support/query representation learning that introduces instance-level and representation-level diversities for the representation learning of support/query instances to model the characteristics of high intra-class variation; and 2) dual regularization-enhanced meta learning that designs two regularizations: auxiliary task-based intra-class regularization and similarity-guided inter-class regularization to regularize the intra-class variation and inter-class similarity modeling, respectively. Extensive experiments on two food datasets demonstrate the superiority of our UFFR-IVIS.

Abstract:
Two-Tower Vision–Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower

Abstract:
Grounding DINO (GDINO) has strong potential for use in zero-shot detection and data annotation, but its use is limited by high computational costs. In addition, YOLOX allows real-time detection but struggles to perform well in complex scenes. To address this challenge, we propose an edge-cloud collaborative framework for an Advanced Driver Assistance System (ADAS) to enhance real-time detector performance on edge devices by leveraging the robust capabilities of cloud-based multimodal detectors to improve perception in complex environments. Our framework consists of cloud and edge components: on the cloud side, we propose a distillation method for multimodal object detectors, which is referred to as MMKD, to optimize the performance of GDINO. Specifically, we use a two-stage distillation strategy, including Cross-modal Listwise Distillation (CLD) and Risk-focused Pseudo-label Distillation (RPLD). With MMKD, we successfully deploy the GDINO model to the cloud, achieving a 1.4% improvement in average precision (AP) and a 1.7× increase in inference speed. On the edge side, leveraging this streamlined version of GDINO, we propose an ADAS data engine to construct a 1.5 Million-scale GDINO-based Dataset for ADAS, named GDDA1.5M. Impressively, on the basis of YOLOX-Lite, we develop a lightweight object detector that is optimized for the application of an ADAS on edge devices through pruning and architectural refinements. Leveraging the GDDA1.5M dataset and the RPLD training strategy, the model achieves a 7.5% improvement in AP, substantially surpassing its counterparts that were trained on 300K manually labeled images. After the YOLOX-Lite detector is deployed on edge devices within our proposed edge-cloud collaborative framework, it achieves an inference speed of 18 milliseconds on the Horizon X3E chip, while the cloud-based distilled model functions efficiently in complex environments.

Abstract:
As an important technology in the fields of intelligent transportation and public safety, crowd counting that can obtain pedestrian flow information has attracted extensive attention from academic and industrial communities. However, existing RGB-T crowd counting methods cannot effectively balance the counting accuracy and computational complexity in practical applications. To this end, we propose a Mutual Head Knowledge Distillation Framework (MHKDF) to obtain a lightweight RGB-T crowd counting network for efficient and accurate pedestrian number estimation. Specifically, to avoid the influence of parameter and structure differences between teacher and student networks on the distillation effect, we propose a Cooperative Mutual Knowledge Distillation (CMKD) strategy to comprehensively and dynamically transfer the crowd analysis ability of the complex teacher model (MHKDF-T) to the lightweight student model (MHKDF-S). In addition, the upper bound of the performance of the student network depends on the teacher model with high accuracy. Therefore, to take advantage of the complementary advantages of frequency domain and spatial domain feature fusion, we propose a Multi-Modal Spatial-Frequency Hybrid Fusion Module (MSFHFM) to further improve the counting accuracy of MHKDF-T. Comprehensive experiments on two RGB-T crowd counting datasets demonstrate that our MHKDF-S achieves competitive performance with only 5.68 FLOPs and 4.89M parameters. Our code will be released at https://github.com/BaoYangCC/MHKDF

Abstract:
Monocular depth estimation infers the relative depth of objects by analyzing visual cues in images, ultimately enhancing the comprehension of complex scenes in computer vision systems. Although existing Transformer architectures effectively capture long-range visual dependencies, two significant challenges persist: (a) insufficient integration of spatial context leads to inconsistent depth estimation, particularly under varying perspectives or lighting conditions; (b) the model’s difficulty in capturing global features hinders the parsing of subtle object differences, causing confusion in depth information and reducing sensitivity to variations in object distance and scene layout. To address these issues, we introduce an Adaptive Clustering Mechanism (ACM) module coupled with a Deformable Frequency Division Fusion (DFDF) module. Specifically, the ACM module refines and adjusts features via cosine similarity, thereby enhancing cluster center similarity and stabilizing depth estimation. The DFDF module leverages frequency decomposition to extract differential features between objects, enhancing high-frequency information to improve the discrimination of subtle features. Integrating these components, the Frequency Division Adaptive Clustering Enhancement (FDACE) module emerges as the decoder’s core within the Adaptive Clustering and Frequency Division Network (ACFD-Net), facilitating both the precise generation of depth information and the efficient recovery of spatial resolution. Furthermore, we present a progressive depth estimation strategy that seamlessly integrates non-gradient output features from FDACE modules across various scales, and conducts independent optimizations, merging multi-scale information with localized details, and progressively calibrating depth estimates to enhance congruence with actual scenes. The ACM and DFDF modules concentrate on pivotal features, selectively enhancing high-frequency information, thereby minimizing redundant computations and optimizing resource allocation, which significantly boosts computational efficiency. Experimental results demonstrate that ACFD-Net significantly improves both the accuracy and efficiency of depth estimation. Code is released at https://github.com/Songlei7664/ACFD-Net

Abstract:
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
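The basic splitting idea, separating a feature sequence at a timestamp into past and future parts, can be illustrated with a short sketch. The details here are assumptions made for illustration and are not taken from the paper: frame t is included in both halves, the backward half is reversed so that it reads from the future towards t, and `split_at` is a hypothetical helper name.

```python
def split_at(features, t):
    """Split a per-frame feature sequence around timestamp index t.

    Assumption: frame t belongs to both halves, and the backward half is
    reversed so that it is read from the future towards t.
    """
    if not 0 <= t < len(features):
        raise IndexError("t must index into features")
    forward = features[: t + 1]      # past information, oldest first
    backward = features[t:][::-1]    # future information, latest first
    return forward, backward


frames = ["f0", "f1", "f2", "f3", "f4"]
forward, backward = split_at(frames, 2)
print(forward)   # ['f0', 'f1', 'f2']
print(backward)  # ['f4', 'f3', 'f2']
```

In BFSTAL the two halves are processed by the Bi-MHSA, Bi-SSM, and Bi-CONV modules; the sketch covers only the split itself.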

Abstract:
Edge devices require low power consumption and compact area, which poses challenges for visual signal processing. This work introduces an energy-efficient heterogeneous neuromorphic system-on-chip (SoC) for edge visual computing. The neuromorphic core design incorporates advanced technologies, such as sparse-aware synaptic calculation, partial membrane potential update, non-uniform weight quantization, and partial parallel computing, achieving excellent energy efficiency, computing performance, and area utilization. Twenty neuromorphic cores and twelve multi-mode connected-matrix-based routers form a network-on-chip (NoC) with fullerene-like topology. Its average degree of communication nodes exceeds traditional topologies by 32% and maintains a minimum degree variance of 0.93, thereby enabling advanced decentralized on-chip communication. Moreover, the NoC can be scaled up through extended off-chip high-level router nodes. At the top layer of the SoC, a RISC-V CPU and a 20-core neuromorphic processor are tightly coupled to form a heterogeneous architecture. Finally, the chip is fabricated within a 3.41 mm² die area under 55 nm CMOS technology, achieving a low power density of 0.52 mW/mm² and a high neuron density of 30.23 K/mm². Its effectiveness is verified across different visual tasks, with a best energy efficiency of 0.96 pJ/SOP. This work is expected to promote the development of neuromorphic computing in edge visual applications.
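The reported densities can be turned into whole-chip figures with a short calculation. The sketch below assumes that both densities are normalized by the full 3.41 mm² die area, which the abstract does not state explicitly, so the results are implied estimates rather than reported values.

```python
# Figures quoted in the abstract above.
die_area_mm2 = 3.41        # die area in mm^2
power_density = 0.52       # mW per mm^2
neuron_density_k = 30.23   # thousand neurons per mm^2

# Assumption: both densities are normalized by the full die area.
total_power_mw = die_area_mm2 * power_density
total_neurons_k = die_area_mm2 * neuron_density_k

print(f"implied chip power:   {total_power_mw:.2f} mW")
print(f"implied neuron count: {total_neurons_k:.1f} K")
```

Under that assumption the chip would draw roughly 1.77 mW and host on the order of 103 K neurons.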

Abstract:
The vision transformer (ViT) architecture offers significant advantages in object detection tasks. However, some limitations affect improving task performance. Firstly, the ViT relies heavily on inflexible position embedding, which causes poor performance when processing images with complex semantic dependencies. Secondly, class imbalance in large-scale datasets can easily cause training instability and inference bias. To overcome these limitations, we propose a generalized concordant ViT scheme for object detection (GCViTDet). Specifically, we first introduce a relevance enhancement strategy (RES) into the encoder-decoder structure, which is composed of the spatial enhanced position embeddings (SEPE) component, the cross multi-pooling attention (CMPA) component, and a global-local path. This strategy establishes semantic-rich dependencies through enhanced position embedding information and omni-feature representations. Subsequently, a bottom-up feature aggregation pathway is employed, utilizing a cross multi-pooling attention to improve the model’s capacity to capture semantic dependencies. This scheme enables the extraction of high-dimensional features that exhibit complex positional relationships. Besides, we propose a focal unified cross-entropy (FUCE) loss to solve the class imbalance problem during training by introducing a uniform threshold to regulate the similarity between positive and negative samples of different classes. Compared with existing methods, GCViTDet can not only capture more intricate positional relationships and semantic-rich dependencies but also alleviate the class-imbalance problem. Experimental results on the challenging MS-COCO dataset validate that GCViTDet can consistently improve performance over state-of-the-art object detection baseline models.

Abstract:
Multiview clustering task groups objects using multiple properties, such as RGB images, infrared images, and texture information. However, incomplete multi-view clustering faces significant challenges due to missing views that hinder clustering performance. This paper proposes a Discriminative Feature Recovery and Tensorized Matrix Factorization method (DFRTMF) that effectively recovers missing views, learns low-dimensional discriminative embeddings, and enables direct clustering. DFRTMF addresses high dimensionality through projection learning and enables the output of soft indicators. To improve projection and facilitate the recovery of missing views, we propose an uncorrelated constraint based on the scatter matrix of the recovered complete data, exploring the correlations between observed and missing views. To capture high-order correlations among views, a low-rank tensor constraint based on tensor Schatten p-norm regularization is applied to a third-order tensor composed of soft indicator matrices. DFRTMF adaptively controls the inter-coordination between these factorizations using view weights to optimally explore complementary information. Furthermore, we propose an alternating optimization algorithm based on the Alternating Direction Method of Multipliers to effectively solve the proposed objective function. Extensive experiments across diverse datasets demonstrate the effectiveness of DFRTMF compared to the state-of-the-art methods.

Abstract:
Adapting Vision Transformers (ViTs) to medical image analysis is challenging due to the scarcity of annotated data and the significant domain shift from natural to medical images. Traditional fine-tuning approaches, while effective, require storing separate model parameters for each task, leading to high computational costs. Existing prompt tuning methods reduce this overhead by introducing task-specific prompt tokens, but they often fail to fully leverage label semantics, resulting in suboptimal performance for medical tasks. To address these limitations, we propose a label-semantic-based prompt tuning method (LPT), which transforms the visual prompt learning problem into a text-image alignment task. Unlike traditional prompt methods that only focus on visual prompts, LPT incorporates label semantics through a cross-attention-based module to better align image features with the target labels. This approach not only captures rich semantic information from the labels but also enhances the model’s ability to extract fine-grained image details relevant to specific medical conditions. By leveraging label-text alignment during training, LPT improves both label utilization and model adaptability, enabling more accurate predictions. Extensive experiments on eight diverse medical datasets demonstrate that LPT significantly improves diagnostic accuracy and generalization, outperforming both traditional fine-tuning and current prompt-based methods, especially in data-limited scenarios.

Abstract:
Learning-based Multi-View Stereo (MVS) methods, typically reliant on cascaded cost volume formulations, perform well on small-scale scenes. However, as the depth range of captured images becomes broader and more varied, the coarse-to-fine depth sampling process, which depends solely on feature matching, is increasingly prone to local optima. Despite recent advancements in feature representation, depth sampling patterns, and cost aggregation techniques, challenges related to model generalization and computational efficiency persist. In this paper, we propose SR-MVSNet, a novel framework that integrates multi-view feature matching and RGB-D cross-modal structural consistency learning to achieve high-quality 3D reconstruction. Our approach begins with the construction of Low-Resolution (LR) cost volumes for initial LR depth estimation, which are then enhanced to full-resolution via a tailored uncertainty-aware guided depth super-resolution module. To ensure cross-view consistency, the depth maps undergo further refinement through multi-view feature matching. By avoiding high-resolution cost volume processing, our framework improves depth estimation robustness and efficiency. Additionally, we introduce an iterative depth fusion post-processing strategy during inference to improve reconstruction in ambiguous matching regions, a critical challenge for MVS methods. Experiments show that our method achieves top-3 performance on the DTU and Tanks & Temples datasets and ranks first on the ETH3D dataset. Furthermore, it uses significantly fewer GPU resources than most high performing methods, offering a favorable trade-off between reconstruction quality and computational efficiency.

Abstract:
Computing power has evolved into a foundational and indispensable resource in the area of deep learning, particularly in tasks such as Face Recognition (FR) model training on large-scale datasets, where multiple GPUs are often a necessity. Recognizing this challenge, some FR methods have started exploring ways to compress the fully-connected layer in FR models. Unlike other approaches, our observations reveal that without prompt scheduling of the learning rate (LR) during FR model training, the loss curve tends to exhibit numerous stationary subsequences. To address this issue, we introduce a novel LR scheduler leveraging Exponential Moving Average (EMA) and Haar Convolutional Kernel (HCK) to eliminate stationary subsequences, resulting in a significant reduction in converging time. However, the proposed scheduler incurs a considerable computational overhead due to its time complexity. To overcome this limitation, we propose FastFace, a fast-converging scheduler with negligible time complexity, i.e., $\mathcal{O}(1)$ per iteration, during training. In practice, FastFace is able to accelerate FR model training to a quarter of its original time without sacrificing more than 1% accuracy, making large-scale FR training feasible even with just one single GPU in terms of both time and space complexity. Extensive experiments validate the efficiency and effectiveness of FastFace. The code is publicly available at: https://github.com/amoonfana/FastFace
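One way to picture how an EMA and a Haar-shaped kernel could flag a stationary stretch of the loss curve is sketched below. This is not the FastFace scheduler (whose exact rule is not given in the abstract) but an assumed, simplified detector: the loss is smoothed with an EMA, a Haar-like window compares the mean of the older half against the newer half, and a near-zero response is treated as "stationary". The function names, window size, and tolerance are all hypothetical.

```python
def ema(values, beta=0.9):
    """Exponential moving average of a sequence, seeded with values[0]."""
    smoothed, avg = [], values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed


def haar_response(values, half_width):
    """Mean of the older half minus mean of the newer half per window."""
    out = []
    for end in range(2 * half_width, len(values) + 1):
        older = values[end - 2 * half_width:end - half_width]
        newer = values[end - half_width:end]
        out.append((sum(older) - sum(newer)) / half_width)
    return out


def is_stationary(losses, half_width=3, tol=1e-3, beta=0.9):
    """True if the most recent Haar response of the smoothed loss is ~0."""
    resp = haar_response(ema(losses, beta), half_width)
    return bool(resp) and abs(resp[-1]) < tol


flat = [1.0] * 10
falling = [10.0 - i for i in range(10)]
print(is_stationary(flat))     # True: the loss has stopped moving
print(is_stationary(falling))  # False: the loss is still decreasing
```

A scheduler built on such a signal would lower the learning rate when the response stays near zero. This naive version recomputes the windows from scratch on every call, which gives a sense of why a constant-time-per-iteration variant is attractive.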

Abstract:
Tracking multiple objects in videos captured by uncrewed aerial vehicles (UAVs) is challenging due to sudden viewpoint changes, non-linear motion trajectories, and rapid variations in target size and appearance. Existing methods often struggle to handle these complexities, as they rely heavily on handcrafted geometric constraints and fail to adapt to significant field-of-view changes. To address these issues, this paper presents the Dynamic Field-Aware Multi-Object Tracker (DFA-MOT), a joint detection and tracking framework that integrates detection and motion prediction into a unified model, enhancing tracking performance in dynamic UAV environments. The proposed Dynamic Field-of-View Consistency Learning (DFCL) module mitigates geometric distortions caused by UAV movement by leveraging optical flow and learnable deformation operations to achieve progressive spatial alignment. A Scale-Aware Tracking (SAT) mechanism is also introduced, which accurately predicts both position and scale variations, enhancing the model’s adaptability to variations in target size. By combining detection with predictive motion modeling, DFA-MOT effectively overcomes the limitations of traditional manual constraints. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that DFA-MOT significantly outperforms state-of-the-art tracking methods in UAV scenarios.

Abstract:
Despite the excellent performance achieved by Siamese tracking in most scenarios, its performance remains unsatisfactory in some scenarios. Traditional convolutional operations tend to use a fixed-size convolution kernel, which results in a small receptive field; thus, the model focuses only on local target features, keeping the model from considering global dependencies. Intercorrelation operations also do not explicitly model and capture information about local targets’ corner regions, which prevents the model from considering critical information about target edges and reduces the accuracy of locating the target. In this paper, we propose a feature extraction subnetwork based on a long-range spatial representation module that captures long-range spatial dependencies between the foreground and background. The network allows the model to learn more discriminative feature representations. We also construct a feature fusion network based on a local feature enhancement module that considers features contained in local targets’ corner regions more strongly. The proposed model can learn more comprehensive and detailed feature representations that lead to more accurate tracking. The proposed tracker is compared with SOTA trackers on six tracking datasets, and an average tracking speed of 45 FPS is achieved. In particular, it achieves 86.5% precision and a 67.2% success rate on the UAV123 dataset at 40 FPS.

Abstract:
Image anomaly detection and localization have received widespread interest in the community, and knowledge distillation (KD) has been widely explored. Recently, the reverse distillation (RD) paradigm has successfully mitigated the homogenization of anomaly representations in traditional KD arising from identical or similar teacher-student (T-S) architecture. However, in RD, the lack of an effective means to prevent anomalous patterns in the teacher encoder from being leaked into the student decoder undermines potential modeling discrepancies between the T-S model in anomaly representations. To settle this problem, we propose the REverse distillation with latent Anomaly SuppressiON (REASON) method, which prevents the student decoder from receiving anomalous patterns by means of extra anomaly filtering during the inference phase, so that the student model can only restore representations of anomaly-free images. Specifically, we construct a Siamese teacher encoder architecture, with one branch extracting features from anomaly-free samples and the other synthesizing anomaly features with spatial noise injection from the latent feature level. Next, we design a latent anomaly suppression module to recover normal features from perturbed anomalous features. In this sense, the follow-up student decoder will receive input without abnormal patterns. Thus, representations of the anomaly-free images can be described well, while those of the anomalous images can be well-differentiated between the T-S model. Furthermore, to enhance the model’s anomaly detection and localization capabilities, we propose a multi-granularity KD loss to optimize the student decoder to focus on context and local details. Extensive experiments are performed on three benchmark datasets, i.e., MVTec AD, AeBAD, and OCT2017, and the results show the effectiveness and robustness of our proposed approach, which achieves superiority over the current state-of-the-art methods.

Affiliations: Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China; School of Software Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China; School of Mechanical Engineering, Dalian University of Technology, Dalian, China; Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand

Abstract:
Conventional imaging devices often struggle to produce high-dynamic-range (HDR) images that accurately represent natural scenes. To overcome this limitation, multi-exposure image fusion (MEF) techniques have been introduced as a viable solution. Existing MEF approaches aim to enhance performance by optimizing or searching architectures. However, they face challenges in precise feature extraction and scene reconstruction, leading to distortion in the fused images. Additionally, most methods do not adequately address luminance variations across different image regions, which may result in the loss of essential details. To address these challenges, we present a novel luminance-aware MEF framework that integrates text-correlation priors (LarTap). By embedding textual information into the fusion process, the proposed framework enhances content extraction and comprehension. Specifically, it consists of two key components: the text-image correlation network (N1) and the multi-exposure fusion network (N2). First, N1 performs correlation training to achieve a holistic alignment between text and image pairs. Its iterative vision encoders (VEs) generate text-correlated prior knowledge to facilitate the fusion process in N2. Second, N2 leverages these priors for scene reconstruction and dynamically adjusts luminance based on comparative perception. Extensive experiments on multiple datasets demonstrate that LarTap outperforms state-of-the-art methods. The source code is available at https://github.com/EnLong-wang/LarTap

Abstract:
Underwater target detection is primarily achieved through two methods: optical imaging and underwater sonar. 3D sonar, as the most advanced underwater detection technology, is characterized by strong penetration and long scanning distance, making it more suitable for tasks such as deep-sea exploration, murky water detection, and long-distance target identification. However, acquiring underwater sonar images is challenging, and there is no open-source 3D sonar dataset. Traditional three-dimensional target detection methods typically require high-quality data and face significant challenges when dealing with weak heterogeneous sonar point clouds caused by high noise, low resolution, and occlusions. To address the aforementioned issues, we first propose a novel fuzzy decoupling module that differs from traditional foreground-background segmentation. This module simultaneously extracts valuable information about the target and its surrounding environment, mitigating the reduction in heterogeneity caused by noise and sonar side lobes. To achieve efficient fusion and capture global information after fuzzy decoupling, a multi-hop Mamba seamless adaptive decoupling point is introduced. It effectively enhances the connection between the two decoupled parts. To address missing and occlusion problems, a second-stage refinement based on Markov prediction is proposed. This low-cost approach, in contrast to using the original point cloud for contour and detail completion, enriches target boundary information. To validate our method, we have designed a practical 3D sonar imaging system and tested it through lake-based experiments. We have collected extensive raw data from Qiandao Lake and conducted annotation work. Through qualitative and quantitative experiments, our method outperforms the most advanced methods by 11.4%.

Abstract:
Neuromorphic Vision Sensors (NVS) have attracted increasing attention due to their sparsity, low latency, and high dynamic range. However, they suffer from background activity noise, which causes unnecessary computational waste. Existing learning-based denoising methods usually achieve better performance than rule-based methods but require larger computational and storage resources. To make rule-based filters as competitive as learning-based filters, this paper proposes a novel filter, namely the Event-based Bilateral Filter (EBF), that utilizes both spatiotemporal and polarity information. EBF first assigns two types of weights to each nearest neighborhood pixel based on the temporal and polarity information of the event to be classified. Next, EBF multiplies and accumulates the weights to get a correlation score, which is then compared with a threshold to predict the label of the event. We evaluate the proposed method on three neuromorphic datasets, including both simulated data and real-world data. EBF significantly improves the denoising accuracy compared with rule-based filters and can exceed or compete with learning-based methods across different noise levels. The corresponding codes, datasets, and results are available at https://github.com/shicy17/EBF
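
The weight-multiply-accumulate-threshold procedure described above can be sketched roughly as follows. The abstract does not give the weight functions, so the exponential temporal decay, the binary polarity match, and the parameters `tau`, `radius`, and `threshold` are assumptions for illustration only, not the published EBF definition.

```python
import math


def ebf_score(event, last_events, tau=10_000.0, radius=1):
    """Correlation score of an event against its spatial neighbours.

    event: (x, y, t, p) with polarity p in {-1, +1}.
    last_events: dict mapping (x, y) -> (t, p) of the latest event at that pixel.
    """
    x, y, t, p = event
    score = 0.0
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue  # skip the pixel of the event itself
            neighbour = last_events.get((x + dx, y + dy))
            if neighbour is None:
                continue
            n_time, n_pol = neighbour
            # Assumed temporal weight: recent neighbours count more.
            w_time = math.exp(-(t - n_time) / tau) if t >= n_time else 0.0
            # Assumed polarity weight: only matching polarity contributes.
            w_pol = 1.0 if n_pol == p else 0.0
            score += w_time * w_pol  # multiply the two weights, then accumulate
    return score


def is_signal(event, last_events, threshold=1.0, **kwargs):
    """Label an event as signal when its score reaches the threshold."""
    return ebf_score(event, last_events, **kwargs) >= threshold
```

An event surrounded by recent same-polarity activity gets a high score and is kept, whereas an isolated event scores zero and is labelled as noise.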

Abstract:
Images captured under hazy weather conditions usually suffer from visual quality degradations, such as blurred details, faded colors, and decreased saturation. Existing physics-based dehazing methods mainly have two drawbacks: 1) the atmospheric light is treated as a constant for the entire image, and 2) pixel- or patch-based strategies are employed to estimate the model parameters, resulting in inaccurate haze density estimations. Therefore, these methods may lead to over-dehazing or under-dehazing due to insufficient utilization of features from regions with similar haze densities. To address these issues, a novel single image dehazing framework based on fuzzy region segmentation and haze density decomposition is proposed. Specifically, a region-based physical model that considers the non-uniform atmospheric light is first constructed based on the classic atmospheric scattering model. Then, a fuzzy segmentation algorithm is improved to divide the input hazy image into several separate regions. Subsequently, we formulate a simple linear relationship between the atmospheric light and brightness to estimate region-based atmospheric light. In addition, we develop a novel haze density decomposition algorithm based on boundary constraints to separate the atmospheric veil into two components: a thin part and a dense part. Three haze-related features, namely contrast, gradient, and clarity, are extracted from the input hazy image to construct weight maps, and a multi-scale fusion is further exploited to combine weight maps and boundary veils to acquire the refined atmospheric veil. Finally, the model inversion is performed to acquire the haze-free result. Experiments on six diverse hazy datasets demonstrate that the proposed algorithm outperforms several state-of-the-art dehazing methods in both visual quality and objective evaluation.
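
For readers unfamiliar with the classic atmospheric scattering model, I = J*t + A*(1 - t), the final inversion step can be sketched as below. This is only the textbook model with a per-pixel atmospheric light value standing in for the region-based estimate; the function names and the lower bound `t_min` are illustrative assumptions, and none of the paper's segmentation or veil decomposition is reproduced.

```python
def add_haze(radiance, transmission, airlight):
    """Forward model per pixel: I = J * t + A * (1 - t)."""
    return [j * t + a * (1.0 - t)
            for j, t, a in zip(radiance, transmission, airlight)]


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the model per pixel: J = (I - A) / max(t, t_min) + A.

    airlight holds one value per pixel, so pixels in different regions
    may use different atmospheric light. t_min (an assumed value) avoids
    amplifying noise where the transmission is close to zero.
    """
    return [(i - a) / max(t, t_min) + a
            for i, t, a in zip(hazy, transmission, airlight)]
```

Applying `remove_haze` to the output of `add_haze` with the same transmission and atmospheric light recovers the original radiance wherever the transmission is above `t_min`.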

Abstract:
While existing Few-Shot Learning (FSL) techniques demonstrate strong performance on uniform datasets, they encounter domain shift challenges when presented with domain-agnostic queries in real-world scenarios. We therefore investigate this problem in Cross-Domain Few-Shot Learning (CD-FSL) and propose to learn more universal feature representations to enhance generalization on unseen domains. To this end, we pinpoint two issues in current multi-model fusion approaches: 1) the entanglement of domain and class information, and 2) feature overlap across distinct domains. To address these challenges, we introduce a Bi-level Feature Relation Alignment method, BFRA, which facilitates the acquisition of more versatile features by decoupling domain-class relationships and aligning feature relations. Through the segregation of domain and class feature learning, we devise a smoothing layer prior for domain feature alignment to mitigate inter-domain discrepancies. This approach enables our model to acquire domain-consistent features, diminishing interference in subsequent class feature alignment procedures. During the class feature alignment, we notice that class feature representations from various in-domain models may intersect, leading to a diminished distinction between classes. To address this, we adopt a topological perspective to train our target model, by aligning feature relations instead of features between our target model and multiple in-domain models. The integration of these components results in the establishment of a bi-level feature relation alignment framework aimed at acquiring more universal features. Furthermore, we partially fine-tune the plug-in layer-wise affine adapter on domain-agnostic queries to expedite adaptation without impacting the known domains. Experiments on 21 datasets from the Meta-Dataset and BSCD-FSL benchmarks demonstrate the effectiveness of our method. The code is publicly available at https://github.com/leaves162/BFRA

Abstract:
JPEG XS is a wavelet-based lightweight image coding standard that features low complexity and low latency. As there are currently no efficient intra-compensation prediction techniques conforming to these features, we propose a frequency domain intra-copy prediction framework named Intra Pattern Copy (IPC) to improve its coding efficiency on screen content. In IPC, prediction methods that leverage the diverse decomposed patterns of two-dimensional wavelets, including the directional and frequency characteristics, are proposed to achieve efficient predictions under low-complexity and low-latency constraints. Specifically, we perform in-band compensation predictions in a multi-band synchronized approach, with coefficients of similar pattern distributions predicted simultaneously. A coefficient grouping scheme is derived from the band characteristics to facilitate this compensation process. Based on the grouping scheme, a multi-band synchronized side information coding method is also proposed to code the pattern offset vector of coefficients. Moreover, pattern search schemes incorporating strict limitations on the search range and prediction block size are further developed. Simulation results on JPEG XS demonstrate that an average improvement of 0.75 dB and 1.99 dB in BD-PSNR can be achieved on screen content for two different wavelet decomposition configurations, respectively, with a moderate increase in complexity.

Affiliations: School of Design, Foshan University, Foshan, China; School of Mechatronic Engineering and Automation, Foshan University, Foshan, China; Department of Mathematics, College of Science, Shantou University, Shantou, China; Department of Electronic Engineering, Shantou University, Shantou, China; School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, China; Department of Mechanical and Aerospace Engineering, Western Michigan University, Kalamazoo, MI, USA

Abstract:
Flow visualization through motion estimation using time-sequenced images plays a significant role in analyzing and understanding complex flow phenomena, and it is widely used in meteorology, oceanography, medicine, astronomy, experimental fluid mechanics, etc. However, it is difficult for current motion estimators to adapt to illumination changes, remove unstable perturbation, and capture diverse motion patterns. In this paper, a novel flow visualization tool is developed to address these issues by employing a structure-enhanced motion estimator composed of a data term and a regularization term. Specifically, a statistical correlation descriptor is designed for the data term to improve the accuracy of motion estimation by enhancing both illumination robustness and matching discrimination. Inspired by the strong distinguishability of a structure-texture distribution in a local window, a structure-enhanced regularizer that considers the physical mechanism of fluid diffusion is introduced to capture different motion patterns, enhance prominent flow structures, and remove unnecessary ripples or textures caused by unstable perturbation or noise. The experimental results demonstrate that our approach significantly outperforms current motion estimators in handling illumination changes and predicting complex fluid flows, and it also achieves state-of-the-art evaluation results on the public fluid flow datasets. Furthermore, the designed flow visualization tool successfully captures diverse motion patterns in Jupiter’s White Ovals, which is crucial for understanding the physical mechanisms behind their formation and sustenance.

Abstract:
Current LiDAR-Camera fusion methods for 3D object detection achieve considerable accuracy at an immense cost in computation and storage, posing challenges for deployment at the edge. To address this issue, we propose a lightweight 3D object detection framework, namely TinyFusionDet. Specifically, we put forward a Hybrid Scale Pillar Strategy in LiDAR point cloud feature extraction to efficiently improve the detection accuracy of small objects. Meanwhile, a low-cost Cross-Modal Heatmap Attention module is presented to suppress background interference in image features and thereby reduce false positives. Moreover, a Cross-Modal Feature Interaction module is designed to enhance the cross-modal information fusion among channels, further improving detection precision. Extensive experiments demonstrate that TinyFusionDet achieves competitive accuracy with the lowest memory consumption and inference latency, making it suitable for hardware-constrained edge devices. Furthermore, TinyFusionDet is implemented on a customized FPGA-based prototype system, yielding a record-high energy efficiency of up to 114.97 GOPS/W. To the best of our knowledge, this marks the first real-time LiDAR-Camera fusion detection framework for edge applications.

Abstract:
In recent years, raw video denoising has garnered increased attention due to the consistency with the imaging process and well-studied noise modeling in the raw domain. However, two problems still hinder the denoising performance. Firstly, there is no large dataset with realistic motions for supervised raw video denoising, as capturing noisy and clean frames for real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen with high-low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named ReCRVD) with 120 groups of noisy-clean videos, with ISO values ranging from 1600 to 25600. Secondly, while non-local temporal-spatial attention is beneficial for denoising, it often leads to heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore the patch correlations from the local window, local low-resolution window, global downsampled window, and neighbor-involved window, and then fuse them together. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving the best performance compared with state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with the previous benchmark dataset (CRVD) when evaluated on real-world outdoor noisy videos. Our code and dataset will be released.

Abstract:
Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to utilize visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only the adapters, reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on DAD and CCD datasets, demonstrating significant improvements in both the earliness and correctness compared to state-of-the-art methods.

Abstract:
In cross-modal unsupervised domain adaptation, a model trained on source-domain data (e.g., synthetic) is adapted to target-domain data (e.g., real-world) without access to target annotation. Previous methods seek to mutually mimic cross-modal outputs in each domain, which enforces a class probability distribution that is agreeable in different domains. However, they overlook the complementarity brought by the heterogeneous fusion in cross-modal learning. In light of this, we propose a novel fusion-then-distillation (FtD++) method to explore cross-modal positive distillation of the source and target domains for 3D semantic segmentation. FtD++ realizes distribution consistency between outputs not only for 2D images and 3D point clouds but also for source-domain and augment-domain. Specifically, our method contains three key ingredients. First, we present a model-agnostic feature fusion module to generate the cross-modal fusion representation for establishing a latent space. In this space, the two modalities are constrained to exhibit maximum correlation and complementarity. Second, the proposed cross-modal positive distillation preserves the complete information of multi-modal input and combines the semantic content of the source domain with the style of the target domain, thereby achieving domain-modality alignment. Finally, cross-modal debiased pseudo-labeling is devised to model the uncertainty of pseudo-labels in a self-training manner. Extensive experiments report state-of-the-art results on several domain adaptive scenarios under unsupervised and semi-supervised settings. Code is available at https://github.com/Barcaaaa/FtD-PlusPlus

Abstract:
Point cloud analysis is essential in accurately perceiving and analyzing real-world scenarios. Recently, transformer-based models have demonstrated great performance superiority in diverse domains. Nonetheless, directly applying transformers to point clouds is still challenging, primarily due to the computational intensity of transformers, which may significantly compromise their efficacy. Moreover, most methods typically rely on the relative 3D coordinates of point pairs to generate geometric information without fully exploiting the inherent local geometric properties. To tackle these challenges, we propose DGAS-Net, a novel architecture to enhance point cloud analysis. Specifically, we propose a Dual Geometry Learning (DGL) module to generate explicit geometric descriptors from triangular representations. These descriptors capture the local shape and geometric details of each point, serving as the foundation for deriving informative geometric features. Subsequently, we introduce a Dual Geometry Context Aggregation (DGCA) module to efficiently merge local geometric and semantic information. Furthermore, we design an Adaptive Sparse Attention (ASA) module to capture long-range information and expand the effective receptive field. ASA adaptively selects globally representative points and employs a novel vector attention mechanism for efficient global information fusion, thereby significantly reducing the computational complexity. Extensive experiments on four datasets demonstrate the superiority of DGAS-Net for various point cloud analysis tasks. The codes of DGAS-Net are available at https://github.com/zcustc-10/DGAS-Net

Abstract:
Point clouds serve as the foundational representation of 3D objects, playing a pivotal role in both computer vision and computer graphics. Recently, the acquisition of point clouds has been effortless because of the development of hardware devices. However, the collected point clouds may be incomplete due to environmental conditions, such as occlusion. Therefore, completing partial point clouds becomes an essential task. The majority of current methods address point cloud completion via the utilization of shape priors. While these methods have demonstrated commendable performance, they often encounter challenges in preserving the global structural and geometric details of the 3D shape. In contrast to those mentioned earlier, we propose a novel cross-modal coarse-to-fine network (CMNet) for point cloud completion. Our method utilizes additional image information to provide global information, thus avoiding the loss of structure. To ensure that the generated results contain sufficient geometric details, we propose a coarse-to-fine learning approach based on multiple patches. Specifically, we encode the image and use multiple generators to generate multiple coarse patches, which are combined into a complete shape. Subsequently, based on the coarse patches generated in advance, we generate fine patches by combining partial point cloud information. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.

Abstract:
Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired postures. Thus, we explore a new Text-Frame-to-Motion (TF2M) generation task that aims to generate motions from text and very few given frames. Intuitively, the closer a frame is to a given frame, the lower the uncertainty of this frame is when conditioned on this given frame. Hence, we propose a novel Progressive Motion Generation (PMG) method to progressively generate a motion from the frames with low uncertainty to those with high uncertainty in multiple stages. During each stage, new frames are generated by a Text-Frame Guided Generator conditioned on frame-aware semantics of the text, given frames, and frames generated in previous stages. Additionally, to alleviate the train-test gap caused by multi-stage accumulation of incorrectly generated frames during testing, we propose a Pseudo-frame Replacement Strategy for training. Experimental results show that our PMG outperforms existing T2M generation methods by a large margin with even one given frame, validating the effectiveness of our PMG. Code is available here.

Abstract:
The authenticity of audio-visual content is being challenged by advanced multimedia editing technologies inspired by Artificial Intelligence-Generated Content (AIGC). Temporal forgery localization aims to detect suspicious contents by locating forged segments. So far, most of the existing methods are based on Convolutional Neural Networks (CNNs) or Transformers, yet neither of them has fully considered the complex relationships within forged audio-visual content. To address this issue, in this paper, we propose a novel method, named TransHFC, which innovatively introduces hypergraphs to model group relationships among segments while considering point-to-point relationships through Transformers. Through its dual hypergraph filtering convolution branch, TransHFC captures both temporal and spatial level group relationships, enhancing the representation of forged segment features. Furthermore, we propose a new hypergraph filtering convolution Auto-Encoder that uses a multi-frequency filter bank for adaptive signal capture. This design compensates for the limitation of a single hypergraph filter. Our extensive experiments on Lav-DF, TVIL, Psynd, and HAD datasets demonstrate that TransHFC achieves state-of-the-art performance.

Abstract:
Entropy modeling is the core component of learned image compression (LIC) that models the distribution of latent representation learned from input images via neural networks for bit-rate estimation. However, existing entropy models employ presumed parameterized distributions such as Gaussian models and are limited for the learned latent representation characterized by complex distributions. To address this problem, in this paper, we achieve, for the first time, generative probabilistic entropy modeling of latent representation based on conditional diffusion models. Specifically, we propose a conditional diffusion-based probabilistic entropy model (CDPEM) to parameterize the latent representation with distributions of arbitrary forms that are generated by a well-designed training-test consistent denoising diffusion implicit model (TC-DDIM) without introducing any presumption. TC-DDIM is designed to leverage ancestral sampling to gradually approximate the distribution of latent representation with guaranteed consistency in generation for training and test. Furthermore, we develop a hierarchical spatial-channel context model to be incorporated with TC-DDIM to sufficiently exploit spatial correlations with the approximate contextual information produced by ancestral sampling and channel-wise correlations using channel-wise information aggregation with reweighted training loss. Experimental results demonstrate that the proposed entropy model achieves state-of-the-art performance on the Kodak, CLIC, and Tecnick datasets compared to existing LIC methods. Remarkably, when incorporated with recent baselines, the proposed model outperforms the latest VVC standard by an evident gain in R-D performance.

Abstract:
Point clouds, which directly record the geometry and attributes of scenes or objects by a large number of points, are widely used in various applications such as virtual reality and immersive communication. However, due to the huge data volume and unstructured geometry, efficient compression of point clouds is crucial. In recent years, the Moving Picture Experts Group has been establishing a geometry-based point cloud compression (G-PCC) standard for both static and dynamic point clouds. Although lossy compression of G-PCC can achieve a very high compression ratio, the reconstruction quality is relatively low, especially at low bitrates. To mitigate this problem, we propose a high-efficiency Wiener filter that can be integrated into the encoder and decoder pipeline of G-PCC to improve the reconstruction quality as well as the rate-distortion performance for dynamic point clouds. Specifically, we first propose a basic Wiener filter, and then improve it by introducing coefficient inheritance and variance-based point classification for the Luma component. Besides, to reduce the complexity of the nearest neighbor search during the application of the Wiener filter, we also propose a Morton code-based fast nearest neighbor search algorithm for efficient calculation of filter coefficients. Experimental results demonstrate that the proposed method can achieve average Bjøntegaard delta rates of -6.1%, -7.3%, and -8.0% for Luma, Chroma Cb, and Chroma Cr components, respectively, under the lossless-geometry-lossy-attributes configuration compared to the latest G-PCC encoding platform (i.e., geometry-based solid content test model version 7.0 release candidate 2) at an affordable computational cost.
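
To illustrate why Morton codes help with neighbor search, the sketch below interleaves the bits of integer coordinates into a single code and then looks for neighbors only within a small window of the Morton-sorted order. The abstract does not specify the search procedure, so this windowed search is a common generic approach rather than the paper's algorithm; it is approximate, and the names `morton3d`, `nearest_neighbours`, `bits`, and `window` are illustrative assumptions.

```python
from bisect import bisect_left


def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z (x in the lowest position)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def nearest_neighbours(points, query, k=3, window=8):
    """Approximate k nearest neighbours of `query` among integer 3D points.

    Points are sorted by Morton code; only `window` entries on each side
    of the query's position in that order are examined.
    """
    ordered = sorted((morton3d(*p), p) for p in points)
    pos = bisect_left(ordered, (morton3d(*query), query))
    lo, hi = max(0, pos - window), min(len(ordered), pos + window)
    candidates = [p for _, p in ordered[lo:hi] if p != query]
    # Rank the few candidates by squared Euclidean distance.
    candidates.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
    return candidates[:k]
```

Because points that are close in space tend to be close in Morton order, the candidate set stays small, which replaces an exhaustive scan with a sort and a local lookup. In practice the sorted order would be built once and reused across queries.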

Abstract:
The application of structured light (SL) techniques has achieved remarkable success in three-dimensional (3D) measurements. Traditional methods generally calculate SL information pixel by pixel to obtain the measurement results. Recently, the rise of deep learning (DL) has led to significant developments in this task. However, existing DL-based methods generally learn all features within the image in an end-to-end manner, ignoring the distinction between SL and non-SL information. Therefore, these methods may encounter difficulties in focusing on subtle variations in SL patterns across different scenes, thereby degrading measurement precision. To overcome this challenge, we propose a novel SL Image Planar-Topography Feature Decomposition Network (SIDNet). To fully utilize the information from different SL modality images (fringe and speckle), we decompose different modalities into topography features (modality-specific) and planar features (modality-shared). A physics-driven decomposition loss is proposed to make the topography/planar features dissimilar/similar, which guides the network to distinguish between SL and non-SL information. Moreover, to obtain modality-fused features with global overview and local detail information, we propose a wrapped phase-driven feature fusion module. Specifically, a novel Tri-modality Mamba block is designed to integrate different sources with the guidance of the wrapped phase features. Extensive experiments demonstrate the superiority of our SIDNet in multiple simulated 3D measurement scenes. Moreover, our method shows better generalization ability than other DL models and can be directly applied to unseen real-world scenes.

Abstract:
In recent years, research on multimodal content with explicit semantics has made significant progress. However, research on content with implicit semantics, such as online memes, remains insufficient. Memes often convey implicit semantics through metaphors and may sometimes contain hateful information. To address this issue, researchers have proposed the task of detecting hateful memes, opening up new avenues for exploring implicit semantics. Hateful meme detection currently faces two main problems: 1) the rapid emergence of meme content makes continuous tracking and detection difficult; 2) current methods often lack interpretability, which limits understanding of and trust in the detection results. To better understand memes, we analyze the definition of metaphor from social science and identify three key factors of metaphor: socio-cultural knowledge, metaphorical tenor, and metaphorical representation pattern. According to these key factors, we guide a multimodal large language model (MLLM) to infer the metaphors expressed in memes step by step. In particular, we propose a hateful meme detection and interpretation framework with four modules. We first leverage a multimodal generative search method to obtain socio-cultural knowledge relevant to the visual objects of memes. Then, we use this socio-cultural knowledge to instruct the MLLM to assess socio-cultural relevance scores between visual objects and textual information, and to identify the metaphorical tenor of memes. Meanwhile, we apply a representative interpretation method to provide representative cases of memes and analyze these cases to explore the metaphorical representation pattern. Finally, a chain-of-thought prompt is constructed to integrate the outputs of the above modules, guiding the MLLM to accurately detect and interpret hateful memes. Our method achieves state-of-the-art performance on three hateful meme detection benchmarks and performs better than supervised training models on the hateful meme interpretation benchmark.

Abstract:
This work proposes to learn blind image super-resolution (SR) using deep constrained least squares deconvolution with low-resolution (LR) space kernels. Our method recovers the high-resolution (HR) image with a kernel estimation step and a kernel-based image restoration process. Specifically, we first reformulate the classical degradation model to transfer the deblurring kernel estimation into the LR space. We show that the LR space kernel has a closed-form solution given a pair of LR-HR images, which can be learned without ground truth kernels. Next, we introduce a dynamic deep linear filter module, which can generate deblurring kernel weights adaptively. Subsequently, the estimated kernel is integrated with a deep constrained least squares filtering module to produce clean features. For reconstruction, we adopt a dual-path structured SR network that takes both the deblurred feature and the original feature as input to suppress deconvolution artifacts. Finally, we learn discriminative features for deblurring and then restore the HR image in a single branch, producing a lighter-weight network that achieves comparable performance while using only 56% of the parameters and 60% of the inference time. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves better accuracy and visual improvements than state-of-the-art approaches.
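
For orientation, the classical (non-deep) constrained least squares filter that the method's name refers to can be written in the frequency domain as below. This is the textbook form, given only as background; the abstract does not state the exact formulation used in the deep module, and the symbols here are the conventional ones rather than the authors' notation.

```latex
% Classical constrained least squares (CLS) deconvolution.
% G: Fourier transform of the observed blurred signal
% H: Fourier transform of the blur kernel, H^* its complex conjugate
% P: Fourier transform of a smoothness operator (commonly the Laplacian)
% \gamma \ge 0: regularization weight
\hat{F}(u,v) \;=\; \frac{H^{*}(u,v)}{\lvert H(u,v)\rvert^{2} + \gamma\,\lvert P(u,v)\rvert^{2}}\; G(u,v)
```

With $\gamma = 0$ this reduces to plain inverse filtering; a larger $\gamma$ trades sharpness for suppression of amplified noise.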

Abstract:
Scene flow estimation from 4D radar sensors has become increasingly popular in recent years. In this paper, we propose a matching and refinement decoupling method to estimate scene flow from 4D radar point clouds. Since 4D radar point clouds are much sparser and noisier than LiDAR point clouds, it is challenging to effectively establish correspondences between two frames and properly refine flow fields in the 3D space. To address this issue, we present decoupled correlation fields and decoupled flow fields for scene flow estimation, named DMRFlow. On the one hand, we propose a position-velocity decoupled matching approach that decouples the positional features from the velocity features of two adjacent point clouds and matches them separately. On the other hand, we design a dynamic-static decoupled refinement approach that splits initial flow fields into two groups according to motion segmentation maps and refines them separately. By integrating the matching and refinement decoupling method, our DMRFlow is able to effectively reduce mutual interference between different features during the matching and refinement process. We evaluate the proposed approach on the View-of-Delft (VoD) dataset. Experimental results show that DMRFlow yields competitive performance in autonomous driving scenarios compared to recent 4D radar scene flow estimation methods.

Abstract:
Anomaly detection in surveillance videos aims to differentiate anomalies from regular events through discriminative representations, and has attracted considerable attention due to its importance for public security. However, most existing works are limited by the lack of annotated samples, and many approaches struggle to avoid reconstructing anomalous data well. To alleviate these issues, we propose a dual distillation fusion framework for weakly supervised anomaly detection. We reformulate the anomaly detection problem into two steps, namely filtering anomalies and inpainting normal patterns. Each step corresponds to one branch of the dual distillation. Specifically, the dual distillation comprises a contrastive distillation module and an inpainting distillation module. The contrastive distillation optimizes the encoder to filter out abnormal features and capture key normal features, while the inpainting distillation refines the decoder to inpaint normal patterns on the encoded features. The two modules are optimized iteratively in a self-training manner with video-level labeled data. Moreover, a joint optimization module is devised to effectively fuse the distilled encoder and decoder, thereby collectively improving the anomaly detection performance. During the training phase, we take into account the diversity of normal samples by selecting high-confidence pseudo normal and abnormal samples from abnormal videos. These selected samples, along with the original normal frames, are then fed into subsequent training iterations to enhance the distinguishing ability of the model. Experimental results show that our proposed method performs competitively on five benchmark datasets.

Abstract:
Traditional LiDAR SLAM approaches prioritize localization over mapping, yet high-precision dense maps are essential for numerous applications involving intelligent agents. Recent advancements have introduced methods leveraging neural fields to enhance mapping capabilities; however, these approaches still face several limitations. Firstly, concerning scene representation, they typically employ neural fields with high-dimensional features and multi-layer perceptron decoders utilizing non-continuous activation functions. This results in low learning efficiency and challenges in capturing high-frequency signals. Secondly, in terms of scene organization, these methods often treat the entire scene as a singular neural field, leading to inefficiencies, inflexibility, and difficulties in rectifying accumulated errors when mapping large-scale environments over extended periods. To tackle the first issue, we propose a lightweight continuous SDF regression approach by encoding the scene in single-valued embeddings and decoding SDF values from a Kolmogorov-Arnold Network. By minimizing discrepancies in measuring range, sampling distance, and decoded SDF values, we facilitate iterative frame-to-model tracking and bundle adjustment neural mapping. To mitigate the second challenge, we propose structuring the whole scene into multiple neural SDF submaps. By establishing node-node, node-submap, and loop closure constraints into a global pose graph, the system can create dense neural maps with global consistency across large-scale scenes. Experimental evaluations in both real-world and simulated settings indicate that our system achieves superior mapping completeness and accuracy, enhanced learning efficiency, reduced memory consumption, and greater flexibility compared to its counterparts.

Abstract:
To obtain high-quality PET scans while minimizing potential radiation hazards for patients, various GAN-based methods have been developed to reconstruct high-quality standard-count PET (SPET) images from low-count PET (LPET) ones. While recent efforts try to integrate MRI or CT to enhance reconstruction in a multi-modal way, current architectures mainly face two limitations: 1) CNN backbones or simple Transformer bottleneck layers are insufficient for robust semantic understanding; and 2) the identical strategies for multi-modal feature extraction and fusion overlook each modality’s respective importance for the reconstruction task. In this work, we propose the Multi-modal Long-Short Distance Attention-based Transformer-GAN (MLSDA-GAN), a novel network combining 3D transformer and CNN architecture for PET image reconstruction. Specifically, to extract fine-grained features with a small number of parameters, our MLSDA-GAN integrates multi-scale convolution into the embedding part of the transformer. As for our multi-modal design, given the strong correlation between LPET and SPET in structural characteristics, we treat MRI as an auxiliary modality to LPET and achieve effective multi-modal extraction and fusion strategies. These strategies include 1) a PET-specific Self-attention Extraction (PSE) block for comprehensive feature extraction of the primary LPET and 2) a Multi-modality Cross-attention Fusion (MCF) block for effective multi-modal interaction and fusion, enabling us to more efficiently model both long- and short-range relationships in the corresponding feature extraction and fusion processes. Experiments demonstrate superiority of our method quantitatively and qualitatively. Code is available at https://github.com/Aru321/MLSDA-GAN.

Abstract:
Scanpath prediction for omnidirectional images aims to simulate the human visual perception mechanism and generate dynamic, realistic fixation trajectories. However, most scanpath prediction methods for omnidirectional images are still in their infancy: they fail to accurately capture the time-dependency of viewing behavior and suffer from sub-optimal performance and limited generalization capability. A desirable solution should achieve a better trade-off between prediction performance and generalization ability. To this end, we propose a novel dual-temporal modulation scanpath prediction (ScanDTM) model for omnidirectional images. The model is designed to capture long-range time-dependencies between fixation regions across both internal and external time dimensions, thereby generating more realistic scanpaths. In particular, we design a Dual Graph Convolutional Network (Dual-GCN) module comprising a semantic-level GCN and an image-level GCN. This module serves as a robust visual encoder that captures spatial relationships among object regions within an image and uses similar images as complementary information to capture similarity relations across relevant images. Notably, the proposed Dual-GCN focuses on modeling temporal correlations from both local and global perspectives within the internal time dimension. Furthermore, drawing inspiration from the promising generalization capabilities of diffusion models across various generative tasks, we introduce a novel diffusion-guided saliency module. This module formulates the prediction problem as a conditional generative process for the saliency map, using the extracted semantic-level and image-level visual features as conditions. With the diffusion-guided saliency module acting as an external temporal modulator, our ScanDTM model can progressively refine the generated scanpath from the noisy map. We conduct extensive experiments on several benchmark datasets, and the results demonstrate that our ScanDTM model significantly outperforms other competitors. Moreover, when applied to tasks such as saliency prediction and image quality assessment, our ScanDTM model consistently achieves superior generalization performance.

Affiliations: College of Computer Science and Technology, Jilin University, Changchun, China; School of Software, Tsinghua University, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; National Key Laboratory of Science and Technology on Advanced Composites in Special Environments, Harbin Institute of Technology, Harbin, China; Faculty of Computing, Harbin Institute of Technology, Harbin, China

Abstract:
3D object detection from a Bird’s Eye View (BEV) has emerged as a novel perception paradigm for autonomous driving scenarios. Most current 3D object detection methods still rely on conventional Cartesian coordinates, which do not align with the coordinate system inherent in image geometry. Polar coordinates, on the other hand, better fit the geometry of camera perception. However, transforming between coordinate systems introduces distortions in the perception information, resulting in issues such as “Weak Adaptability to Heatmap Distribution” and “Offset in the Center Point of the Bounding Box.” To address these challenges, this paper proposes a 3D object detection model named PolarBEVU, which leverages the bird’s-eye view under Polar coordinates along with multi-camera unprojection. The model introduces a “Deformable Uniform Heatmap Distribution” method that adjusts heatmap computations based on box shapes, generating high-quality heatmaps and effectively resolving the issue of “Weak Adaptability to Heatmap Distribution.” Moreover, the model incorporates the concept of a “Dynamic High-risk Regression Region” to enhance the accuracy and robustness of the center point regression of the bounding box, thus mitigating the issue of “Offset in the Center Point of the Bounding Box.” In extensive experiments on the nuScenes dataset, PolarBEVU achieves 49.9% mAP and 57.4% NDS on the test set, surpassing the compared approaches and reaching state-of-the-art (SOTA) performance among methods utilizing Polar coordinates. This demonstrates the efficacy of PolarBEVU. In addition, the model is successfully deployed on an Nvidia Jetson AGX Orin, achieving a real-time inference latency of 31.42 ms. These findings affirm PolarBEVU’s potential for practical applications. Code is available at https://github.com/JLUrob/PolarBEVU.
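
To make the Cartesian versus Polar distinction concrete, the sketch below converts a ground-plane point to Polar coordinates and assigns it to a cell of a Polar BEV grid. It is a generic illustration only: the grid resolution, the range limit `r_max`, and the function names are assumptions for this example and are not taken from PolarBEVU.

```python
import math


def cart_to_polar(x, y):
    """Return (range, azimuth) for a point on the BEV ground plane."""
    return math.hypot(x, y), math.atan2(y, x)


def polar_to_cart(r, theta):
    """Inverse of cart_to_polar."""
    return r * math.cos(theta), r * math.sin(theta)


def polar_cell(x, y, r_max=50.0, n_r=64, n_theta=256):
    """Map a point to (range_bin, azimuth_bin) of a Polar grid, or None if out of range.

    All grid parameters are hypothetical defaults chosen for illustration.
    """
    r, theta = cart_to_polar(x, y)
    if r >= r_max:
        return None
    i_r = int(r / r_max * n_r)
    # Shift azimuth from [-pi, pi] to [0, 2*pi) before binning.
    i_t = int((theta + math.pi) / (2 * math.pi) * n_theta) % n_theta
    return i_r, i_t
```

In such a grid, the arc length covered by one azimuth bin grows linearly with range (`r * 2 * math.pi / n_theta`), so cells far from the sensor are wider than cells near it. This is one source of the distortions that arise when boxes and heatmaps defined in Cartesian space are moved into a Polar grid.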

Abstract:
In asteroid exploration and orbital servicing missions with space robots, planning landing trajectories and controlling movements typically rely on an accurate 3D structure of the target. Unlike conventional neural radiance fields (NeRF) studies, which rely on full-view random sampling of targets that can be easily achieved on the ground, spacecraft operations present unique challenges due to kinematic orbit constraints, the high cost of controlled motion, and limited fuel reserves. This results in limited observation of space targets. To obtain 3D structure under close flybys and restricted observation, we propose the Uncertainty Neural Surfaces (UNS) model based on Bayesian uncertainty estimation. UNS enhances the precision of reconstructed target surfaces under constrained views and provides guidance for subsequent imaging view design. Specifically, UNS introduces Bayesian estimation-based surface uncertainty on neural implicit surfaces. The estimate is calculated from the degree of self-occlusion of the target and the difference between rendered and actual colors. This approach enables uncertainty estimation for both 3D space and arbitrary views. Finally, extensive systematic evaluations and analyses of spacecraft model sampling in a local darkroom validate the advantages of UNS in uncertainty estimation and surface reconstruction quality. Code is available at https://github.com/YD-96/UNS.

Abstract:
Fine-grained object detection (FGOD) in remote sensing images is an emerging and challenging task in the field of intelligent image interpretation. It aims to localize objects while classifying them into different fine-grained categories. Modern FGOD methods are mainly derived from well-developed detectors and have made compelling progress. Despite this, these methods struggle to classify objects at the subordinate level due to the limitations of their representations. In this paper, we propose a network capable of learning discriminative representation (DR) for fine-grained object detection in remote sensing images, named DRNet. First, a fine-grained branch that works in parallel with other task branches is introduced, where objects’ features are re-encoded with dual refinement to generate a discriminative representation, enabling accurate fine-grained classification. Second, to train the fine-grained branch, we design a confusion-minimized loss that automatically scales loss contributions according to the separability of samples, further boosting the discriminative ability of the representation and better addressing hard-to-distinguish objects. Moreover, we devise an interaction verification strategy that enables the network to fully utilize the results of fine-grained classification and coarse classification for robust inference. On the large-scale FAIR1M-1.0 and FAIR1M-2.0 datasets, our DRNet with ResNet50 and a 1× training schedule obtains 40.87% mAP and 47.04% mAP, respectively, establishing a new state of the art for fine-grained object detection in remote sensing images. The source code is available at https://github.com/54wb/DRNet.

Abstract:
Variable-rate coding is challenging but indispensable for learned image compression (LIC), which is by nature characterized by nonlinear transform coding (NTC). Existing methods for variable-rate LIC are restricted by the non-smooth quantization process, whose gradients are zero almost everywhere, and consequently suffer from a training-test gap and degraded rate-distortion (R-D) performance. To address this problem, we propose sampling-based optimization for training NTC models along with non-uniform quantizers. Unlike gradient-based optimization, the proposed sampling-based optimization first randomly samples the parameters from Gaussian distributions with progressively reduced variance and then selects the optimal parameters with an R-D indicator. On the basis of sampling-based optimization, we develop a learnable non-uniform dead-zone quantizer by adaptively refining the quantization steps for variable-rate coding with nonlinear transforms. Furthermore, we incorporate the learnable dead-zone quantizer to achieve a variable-rate LIC model with enhanced R-D performance, and design rate and distortion control algorithms to adapt to dynamic network conditions. Experimental results show that the proposed method achieves state-of-the-art R-D performance in variable-rate image compression. It obtains an average 8.82% BD-rate reduction compared to the latest Versatile Video Coding (VVC) standard, and simultaneously achieves precise rate and distortion control with an average variation of 0.0087 bpp in bit-rate and 0.1265 dB in distortion on the Kodak dataset.
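
The two ideas named in this abstract can be illustrated on scalar data. The sketch below assumes a simple uniform dead-zone quantizer with a rounding offset and a toy R-D cost; neither is the paper's model. It tunes the step and offset by drawing candidates from Gaussians whose standard deviation shrinks each round and keeping whichever candidate has the lowest cost. All names, the rate proxy, and the hyperparameters are hypothetical.

```python
import math
import random


def deadzone_quantize(x, step, offset=0.5):
    """Quantization index of x; offset < 0.5 widens the zero bin (dead zone)."""
    k = math.floor(abs(x) / step + offset)
    return int(math.copysign(k, x)) if k else 0


def rd_cost(samples, step, offset, lam=0.1):
    """Toy R-D indicator: mean squared error plus lam times a crude rate proxy."""
    idx = [deadzone_quantize(s, step, offset) for s in samples]
    distortion = sum((s - k * step) ** 2 for s, k in zip(samples, idx)) / len(samples)
    rate = sum(math.log2(1 + abs(k)) for k in idx) / len(samples)  # not a real entropy model
    return distortion + lam * rate


def sample_search(samples, step=1.0, offset=0.5, rounds=20, pop=16,
                  sigma=0.5, decay=0.8, seed=0):
    """Gradient-free search: sample around the best parameters, shrink sigma each round."""
    rng = random.Random(seed)
    best = (rd_cost(samples, step, offset), step, offset)
    for _ in range(rounds):
        for _ in range(pop):
            s = max(1e-3, rng.gauss(best[1], sigma))             # keep step positive
            o = min(0.5, max(0.0, rng.gauss(best[2], sigma)))    # keep offset in [0, 0.5]
            c = rd_cost(samples, s, o)
            if c < best[0]:
                best = (c, s, o)
        sigma *= decay  # progressively reduced variance
    return best  # (cost, step, offset)
```

Because the search only evaluates the cost and never differentiates through `math.floor`, the zero-gradient problem of quantization does not arise. The price is many more cost evaluations than a gradient step would need.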

Abstract:
With the swift advancement of deep learning, state-of-the-art algorithms have been deployed in a wide range of social settings. Nonetheless, some algorithms have been found to exhibit biases and produce unequal results. Current debiasing methods face challenges such as poor utilization of data or intricate training requirements. In this work, we find that a backdoor attack can construct an artificial bias similar to the model bias that arises in standard training. Considering the strong adjustability of backdoor triggers, we are motivated to mitigate model bias by carefully designing a reverse artificial bias created by a backdoor attack. Based on this, we propose a backdoor debiasing framework based on knowledge distillation, which effectively reduces the model bias inherited from the original data and minimizes the security risks of the backdoor attack. The proposed solution is validated on both image and structured datasets, showing promising results. This work advances the understanding of backdoor attacks and highlights their potential for beneficial applications. The code for the study can be found at https://github.com/KirinNg/DwB.

Abstract:
Robot navigation in an unknown environment is a challenging task, due to the lack of spatial awareness and semantic understanding of the environment. Previous works predominantly relied on prior scene knowledge and semantic information, lacking generalization and transferability. This paper proposes an environment exploration and backtracking agent (E2BA) for visual language object navigation, which leverages the rich semantic prior knowledge and commonsense reasoning of large language models (LLMs) to explore the environment and find the object. By fusing LLM scores and spatial geometric costs using particle filters, we select a redefined optimal frontier as sub-goal for environment exploration. To avoid redundant exploration and paths, we design a backtracking discriminator to evaluate the state of the agent and determine the timing of backtracking triggering through a double-level cascade mechanism. Additionally, we design a random instruction fuzzy semantic guessing task to verify the application diversity of this method. Comprehensive experiments on the Habitat-Matterport 3D dataset show that our method achieves a success rate of 0.704, which is higher than the existing baseline method. This study explores the potential application of LLMs in environment exploration without the need for additional training and semantic supplementation.

Abstract:
As a useful remote sensing (RS) scene interpretation technique, multi-label RS scene classification (RSSC) has long attracted researchers’ attention and plays an important role in the RS community. To assign multiple semantic labels to a single RS image according to its complex contents, existing methods focus on learning valuable visual features and mining the latent semantic relationships from RS images. This is a feasible and helpful solution. However, these methods often incur high computational costs due to the widespread use of Transformers. To alleviate this problem, we propose an efficient Mamba-based network, called MLMamba, built on the newly emerged state space model. In addition to the basic feature extractor (convolutional neural network and language model) and classifier (multiple perceptrons), MLMamba consists of two key components: a pyramid Mamba and a feature-guided semantic modeling (FGSM) Mamba. Pyramid Mamba uses multi-scale scanning to establish global relationships within and across different scales, improving MLMamba’s ability to explore RS images. Under the guidance of the obtained visual features, FGSM Mamba establishes associations between different land covers. Combining these two components makes it possible to mine local features, multi-scale information, and long-range dependencies from RS images and to build semantic relationships between different surface covers. These strengths allow MLMamba to understand the complex contents within RS images and accurately determine which categories are present. Furthermore, the simple and effective structure and the linear computational complexity of the state space model ensure that pyramid Mamba and FGSM Mamba do not impose an excessive computational burden on MLMamba. Extensive experiments conducted on three benchmark multi-label RSSC data sets validate the effectiveness of MLMamba. The positive results demonstrate that MLMamba achieves state-of-the-art performance, surpassing existing methods in accuracy, model size, and computational efficiency. Our source codes are available at https://github.com/TangXu-Group/multilabelRSSC/tree/main/MLMamba.

Abstract:
Cross-domain detection frequently suffers a decline in detection accuracy, necessitating the application of domain adaptation techniques. One crucial approach to unsupervised domain adaptation is the pseudo label-based self-training method, which iteratively trains the model by treating the pseudo labels as ground truth. However, distribution differences between the source and target domains can lead to incorrect pseudo labels, so that threshold-based selection fails to pick out the reliable ones. Therefore, to tackle the challenge of determining pseudo label thresholds in self-training, we propose an unsupervised domain adaptation method for 3D object detection based on pseudo label regularization. Specifically, a self-training framework based on the fusion of two detection heads is used to obtain more accurate pseudo labels. The variance between the two detection heads is used as the noise information for the corresponding pseudo labels. This noise information is then incorporated as a regularization term into the bounding box regression loss, thereby addressing the challenge of determining pseudo label thresholds in self-training. The experimental results demonstrate that the proposed method achieves higher cross-domain detection accuracy than existing domain adaptation methods for 3D object detection.

Abstract:
Protecting personal privacy through selective encryption of facial images has become a research hotspot. This paper designs a new image encryption scheme using chaotic systems, optimization algorithms, and Semi-Tensor Product (STP) theory. First, we propose a 3D Coupled Ikeda Map with Bounded Amplitude (3D-CIMBA), which has controllable Lyapunov exponents and high complexity. Second, the Particle Swarm Optimization (PSO) algorithm is used to generate the control keys of the chaotic system and to produce chaotic sequences. Then, the DeepFace model is applied to recognize the facial regions to be encrypted. Next, the face image is encrypted by performing alternating row-column cyclic shifting and STP diffusion. Finally, cryptographic analysis is conducted using histograms, pixel correlation, information entropy, and SSIM. The simulation results show that the scheme is robust against differential attacks and noise attacks, and also offers fast encryption speed and a large key space. These results show that the proposed algorithm can encrypt facial information more efficiently and securely than traditional algorithms.
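
The row-column cyclic shifting step can be sketched on a small matrix as follows. This is a minimal illustration of the permutation stage only, assuming one pass over rows followed by one pass over columns. In the scheme described above the shift amounts would be derived from the chaotic sequences, whereas here they are plain hypothetical lists, and the STP diffusion stage is not shown.

```python
def rotate(seq, k):
    """Cyclically rotate a list to the right by k positions (k may be negative)."""
    k %= len(seq)
    return seq[-k:] + seq[:-k] if k else list(seq)


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists."""
    return [list(col) for col in zip(*matrix)]


def shift_rows(img, shifts):
    """Rotate row i to the right by shifts[i]."""
    return [rotate(row, s) for row, s in zip(img, shifts)]


def shift_cols(img, shifts):
    """Rotate column j downward by shifts[j]."""
    return transpose(shift_rows(transpose(img), shifts))


def scramble(img, row_shifts, col_shifts):
    """Permutation stage: shift rows first, then columns."""
    return shift_cols(shift_rows(img, row_shifts), col_shifts)


def unscramble(img, row_shifts, col_shifts):
    """Undo scramble by applying the negated shifts in reverse order."""
    undone_cols = shift_cols(img, [-s for s in col_shifts])
    return shift_rows(undone_cols, [-s for s in row_shifts])
```

Cyclic shifting only moves pixels and leaves the histogram unchanged, which is why such schemes pair it with a diffusion stage that alters pixel values.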

Abstract:
With the rapid development of single image super-resolution (SR) technology, there is an urgent need to develop a fair no reference Super-Resolution image Quality Assessment (SRQA) method. Existing no reference SRQA methods primarily concentrate on SR artifacts including structural distortion and texture distortion by extracting spatial features, but ignore the inductive bias of Deep Neural Network (DNN)-based SR models. As a result, they function effectively for interpolation-based and dictionary-based algorithms, but struggle to perform as effectively with DNN-based SR algorithms. We found that the visual content generated by DNN-based SR models under different inductive biases often carries a content-invariant model-specific style, which can be captured by the correlations between hierarchical representation channels. To that end, we propose a novel Scene-modulated High-order Statistical Representation network (SmHSR) built on a multi-scale over-complete transformation. We quantify the perceptual quality of SR images as the shift of high-order statistical properties in their multi-scale over-complete representation, where intra-channel statistics are used to capture spatial correlations and inter-channel statistics are used to capture the inductive bias of SR models. In addition, the scene information implicit in the deep over-complete representation is used to modulate the high-order statistical properties, which simulates the top-down regulation of cognition on perception. Under the modulation of scene information, SmHSR can learn more sophisticated scene-aware statistical representation. The MultiLayer Perceptron (MLP) is used to map the high-order statistical representation to an overall quality. We test our method on multiple SR image quality databases. Experimental results show that our method outperforms the state-of-the-art SRQA methods.

Abstract:
In the context of Industrial Anomaly Detection (IAD), ensuring the quality of manufactured products is critical. Traditional 2D-based methods often fail to capture anomalies present in complex 3D shapes. For effective anomaly detection in 3D shapes, it is essential to incorporate the global semantic context, local geometric structure, and color information of the object. To fully leverage these features, we propose a network named 2M3DF, which uses knowledge from multi-view RGB images and the corresponding point cloud information to enhance anomaly detection performance. Our model first employs pre-trained feature extractors that generate local features from multi-view RGB images and the corresponding point clouds. The novel inter-modality feature representation and fusion module first adapts these inter-modality features and then aligns and aggregates the multimodal features on a pixel-to-point basis. To learn normality from the point-wise fused multimodal features, we fit a multivariate Gaussian distribution to model the normal feature distribution. Comprehensive experimental evaluations on the MVTec3D-AD and Eyecandies datasets validate the effectiveness of our proposed model and demonstrate significant improvements over existing state-of-the-art methods. Our model achieves a 96.6% mean I-AUROC while delivering real-time results.

Abstract:
Generating a 3D human model from a single reference image is a challenging task as it involves inferring textures and geometries in unseen views while maintaining consistency with the reference image. Existing methods that rely on 3D generative models are limited by the availability of 3D training data. Optimization-based approaches that distill text-to-image diffusion models into 3D models often struggle to preserve the intricate texture details of the reference image, resulting in inconsistent appearances across different views. In this paper, we propose HumanRef-GS, a novel method for single image-to-3D clothed human generation based on 3D Gaussian Splatting (3DGS). To ensure the generated 3D model is both photorealistic and consistent with the input image, HumanRef-GS employs a unique technique called reference-guided score distillation sampling (Ref-SDS). This method effectively incorporates image guidance into the generation process, enhancing the quality of the results. Additionally, we introduce region-aware attention to Ref-SDS, which ensures accurate correspondence between different body regions. To mitigate the impact of view dependence in 3DGS and enhance the view-consistency of the generated results, we substitute the anisotropic Gaussians in the vanilla representation with isotropic Gaussians. By utilizing the 3D Gaussian representation, our method significantly enhances the generation efficiency and rendering speed of 3D clothed human models. This improvement allows for faster and more efficient generation of high-quality results. Experimental results demonstrate that HumanRef-GS surpasses state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances. We are committed to making our code and model available upon acceptance for further research and exploration.

Abstract:
End-to-end person search is a research domain that executes the task of locating and identifying a target individual from a large number of scene images via a multi-task framework. However, a major challenge for the learning of end-to-end methods is the inherent conflict between the two sub-tasks: person detection focuses on identifying generic features of persons, while person re-identification (Re-ID) strives to find unique, distinguishing features for matching the target person. Unlike previous research focusing on model architectures, this paper delves into the end-to-end person search training process. We find that the unbalanced and conflicting training issues significantly impair the learning efficiency of the Re-ID sub-task, which directly influences person search accuracy. To address this, we propose a novel Guiding Multi-Task Training (GMT) framework that facilitates end-to-end balanced learning for person search. We introduce a Guiding Multi-Task Harmonious Learning (GMHL) module, which decouples the features and then performs intra- and cross-task feature interaction to enhance the learning of each sub-task. Moreover, GMT employs a Balancing Multi-Task Oriented Fusing (BMOF) method to explicitly enhance Re-ID sub-task learning through additional Re-ID training and target-guided multi-model parameter fusion. Extensive experiments on two benchmark datasets, CUHK-SYSU and PRW, show that GMT achieves leading performance with 96.0% mAP and 61.3% mAP, respectively.

Abstract:
Deep learning-driven object detection models are capable of accurately identifying and localizing objects. However, small objects contain limited information relative to global features, so detection models often fail to learn small object features adequately. To enhance the precision in detecting small objects, we propose a bi-directional and triangular circulation fusion neural network (BTFN). First, to selectively strengthen the position features of small objects, we propose a feature circulation extraction module composed of a bi-directional triangular densely nested convolutional network (BTF), thus achieving repetitive multi-layer feature fusion. Second, to bridge the semantic gaps between features of different scales, we design a mixed dual attention module (MDA) in the bi-directional triangular densely nested network. Third, to mitigate information loss in deep neural networks as well as improve the inference time, we design a re-parameterization bi-directional composite feature fusion module (Rep-BFM) that fuses the features of multiple scales. The proposed model is evaluated extensively on the MS COCO, Tsinghua-Tencent 100k, and Haier dismantled parts of used home appliances datasets. The experimental results show that the proposed model improves the AP on MS COCO by 4% and, in particular, improves the APS of small objects by 7.7% compared with SOTA models.

Abstract:
Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes the two sub-tasks of pedestrian detection and Re-Identification (ReID). Despite significant progress, current methods face two primary challenges: 1) the pedestrian candidates learned within detectors are suboptimal for the ReID task; 2) the potential for collaboration between the two sub-tasks is overlooked. To address these issues, we present a novel Person Search framework based on the Diffusion model, PSDiff. PSDiff formulates person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Distinct from the conventional Detection-to-ReID approach, our denoising paradigm discards prior pedestrian candidates generated by detectors, thereby avoiding the local optimum problem of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize the detection and ReID sub-tasks in an iterative and collaborative way, which makes the two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.

Abstract:
Acknowledging different wavelengths by imaging mechanisms, optical images usually embed higher low-dimensional manifolds into ambient spaces than SAR images do. How to utilize their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) having a unique solution. Different from vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder to predict noises and clear images simultaneously, preserving data manifolds embedded in latent space. To fuse the manifolds of optical and SAR images, we train the model to generate optical images conditioned on SAR images, mapping them into a unified latent space. The learned features extracted from the model are fed to the K-means algorithm to produce the resulting clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show the superiority of our method over state-of-the-art (SOTA) methods overall in terms of clustering performance and time consumption. The source code is available at https://github.com/suldier/CDD.

Abstract:
Cross-view image geo-localization aims to estimate the geographic position of a query image from the ground platform (such as a mobile phone or vehicle camera) by matching it with geo-tagged reference images from the aerial platform (such as a drone or satellite). Although existing studies have achieved promising results, they usually rely only on depth features and fail to effectively handle the serious changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core is to extract and integrate semantic concept information reflecting the spatial position relationships between objects. Specifically, for a given pair of input images, a CNN stem with positional embedding is first adopted to extract depth features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby extracting semantic concept information. Furthermore, to obtain global descriptions of different views, a feature bidirectional injection fusion module based on an attention mechanism is proposed to exploit the long-range dependencies of semantic concept and depth features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results show that our proposed method achieves better performance than state-of-the-art methods on mainstream cross-view datasets.
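
As a rough illustration of a triplet loss with hard sample mining, the sketch below uses the standard batch-hard rule, which is a stand-in and not necessarily the flexible mining strategy of SCPNet. For each anchor it takes the farthest same-label embedding and the closest different-label embedding, then applies a margin. The function names and the margin value are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean triplet loss using the hardest positive and negative per anchor."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        negatives = [
            euclidean(anchor, e)
            for e, l in zip(embeddings, labels)
            if l != label
        ]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        hardest_positive = max(positives)  # farthest sample of the same class
        hardest_negative = min(negatives)  # closest sample of another class
        losses.append(max(0.0, hardest_positive - hardest_negative + margin))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: two locations (labels 0 and 1), two views each.
well_separated = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
interleaved = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
labels = [0, 0, 1, 1]
loss_easy = batch_hard_triplet_loss(well_separated, labels)
loss_hard = batch_hard_triplet_loss(interleaved, labels)
```

The loss is zero when every positive is closer than every negative by at least the margin, and grows when matching views are embedded farther apart than non-matching ones.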

Abstract:
Several existing works have revealed the effectiveness of arctangent-type penalties in exploiting sparsity for compressed sensing. However, addressing the subproblems associated with the arctangent penalty incurs considerable computational cost. Aiming to reduce complexity, in this paper we derive the closed-form proximity operator of an arctangent penalty, which is expressed as hyperbolic functions of sine and cosine. Accordingly, a computationally efficient arctangent regularization iterative thresholding (ARIT) algorithm for sparse approximation is proposed. Furthermore, we theoretically prove that, under certain conditions, the ARIT algorithm converges to a local minimizer of the arctangent regularization problem at an eventually linear rate. Extensive experiments are conducted to compare our scheme with conventional iterative thresholding algorithms, demonstrating the former's superiority in terms of the probability of successful recovery, rate of support recovery, phase transition, and robustness to noise.
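
The general structure of an iterative thresholding algorithm can be sketched as follows. The abstract does not give the closed-form arctangent proximity operator, so this example substitutes the well-known soft-thresholding operator (the proximity operator of the L1 penalty) purely as a placeholder. It is therefore the classical ISTA scheme, not ARIT. The function names, step size, and toy data are assumptions.

```python
def soft_threshold(value, threshold):
    """Proximity operator of the L1 penalty (placeholder for the arctangent one)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def iterative_thresholding(A, y, lam, step, iters=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by gradient step plus thresholding.

    A is a list of rows; step should not exceed 1 / (largest eigenvalue of A^T A).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - y.
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the data-fidelity term: A^T r.
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the proximity (thresholding) operator.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x

# Toy problem with an identity sensing matrix, so the solution is easy to check by hand.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [2.0, 0.05, -1.0]
x_hat = iterative_thresholding(A, y, lam=0.1, step=1.0)
```

Swapping `soft_threshold` for a different proximity operator changes the penalty while leaving the outer loop unchanged, which is why a closed-form operator matters for per-iteration cost.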

Abstract:
With the rapid development of Deepfake technology, social security is facing great challenges. Although numerous Deepfake detection algorithms based on traditional CNN frameworks perform well on specific datasets, they still suffer from overfitting due to an over-reliance on localized artifact information. This limitation leads to degraded detection performance across diverse datasets. To address this issue, this study proposes a dual-branch fusion network called LGDF-Net. LGDF-Net uses a dual-branch structure to process the local artifact features and global texture features generated by Deepfake separately, preserving their unique characteristics. Specifically, the local compression branch utilizes a specially designed local compression module (LCM) that allows the network to focus more accurately on key regions of localized artifacts in Deepfake faces. The global expansion branch enhances the analysis of the global facial context through a global expansion module (GEM), which captures image context information and subtle texture features more comprehensively. Additionally, the proposed multi-scale feature extraction module (MSFE) delves into image features at various scales, enriching the extraction of detailed information. Finally, the multi-level feature fusion strategy (MLFF) improves the integration of local and global features through multiple layers, enabling the network to learn the intrinsic connections between these two types of features. A series of experimental validations demonstrate that the proposed scheme outperforms many existing detection networks in terms of accuracy and generalization ability.

Abstract:
Video deblurring is a fundamental problem in low-level vision, and many methods have employed designs based on CNNs and transformers. Traditional CNNs often require deeper architectures to achieve a larger receptive field, which may not be optimal for spatially non-uniform blurs and intense motion blurs. While transformers offer a large receptive field, their quadratic complexity due to attention designs typically imposes a significant computational burden. In addressing these issues, we present an Attentive Large Kernel Network with Mixture of Experts (ALK-MoE). In ALK-MoE, an attentive large kernel backbone network is proposed. On one hand, it inherently extends the network’s receptive field through its large kernel design. On the other hand, it addresses the quadratic complexity of attention by employing a sophisticated attention design, thus maintaining its ability to capture long-range dependencies. Furthermore, to achieve more precise and robust alignment of inter-frame features using optical flow for better utilization of clear frames, a mixture of experts model is proposed. It involves integrating optical flow updates between different experts in a residual manner. Our ablation experiments and experiments on multiple datasets indicate that ALK-MoE achieves comparable or superior performance compared to Transformer-based methods, with lower complexity.

Abstract:
Due to the complex underwater imaging process, underwater images contain a variety of unique distortions. While existing underwater image quality assessment (UIQA) methods have made progress by highlighting these distortions, they overlook the fact that image content also affects how distortions are perceived, as different content exhibits varying sensitivities to different types of distortions. Both the characteristics of the content itself and the properties of the distortions determine the quality of underwater images. Additionally, the intertwined nature of content and distortion features in underwater images complicates the accurate extraction of both. In this paper, we address these issues by comprehensively accounting for both content and distortion information and explicitly disentangling underwater image features into content and distortion components. To achieve this, we introduce a dynamic content-distortion guiding and feature disentanglement network (DysenNet), composed of three main components: the feature disentanglement sub-network (FDN), the dynamic content guidance module (DCM), and the dynamic distortion guidance module (DDM). Specifically, the FDN disentangles underwater features into content and distortion elements, allowing us to more clearly measure their respective contributions to image quality. The DCM generates dynamic multi-scale convolutional kernels tailored to the unique content of each image, enabling content-adaptive feature extraction for quality perception. The DDM, on the other hand, addresses both global and local underwater distortions by identifying distortion cues from both channel and spatial perspectives, focusing on regions and channels with severe degradation. Extensive experiments on UIQA datasets demonstrate the state-of-the-art performance of the proposed method.

Affiliations: School of Information Science and Technology, University of Science and Technology of China (USTC), Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Information Science and Technology and the MoE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China (USTC), Hefei, China

Abstract:
Video corpus moment retrieval (VCMR) aims to retrieve a moment from a large corpus of untrimmed videos corresponding to a given language query. However, existing methods often fall short due to their reliance on simple cross-modal attention mechanisms and one-stop localization, which fail to handle the complex multimodal information and large search space effectively. To address these challenges, we propose a novel VCMR method with Query-specific Context Learning and Progressive Localization (QCLPL). First, we construct query-specific multimodal contexts that capture complementary and consistent semantics across subtitles and frames, ensuring informative and efficient context building. We further introduce a semantic contrastive loss to refine these multimodal contexts, filtering out query-irrelevant information. Additionally, we introduce a progressive localization strategy that transforms the moment localization task into a two-stage process. By classifying frames into foreground and background regions, we present a simplified binary classification problem before boundary prediction, constrained by a region-aware loss. This progressive approach leverages region priors to improve subsequent moment localization. Extensive experiments on the TVR and DiDeMo datasets demonstrate that our method significantly outperforms existing approaches, setting a new state of the art for VCMR.

Abstract:
The Text-to-Image Person Re-identification (TI-ReID) task aims to precisely identify images of a person from a textual description of that person. Mainstream methods focus on cross-modal alignment of local features and overlook the learning of intra-modal and cross-modal relationships between different features. As a result, the person features lack high-level semantic information. To resolve such issues, we propose the Progressive Relationship-Mining Graph Network (RMGNet), including the Intra-Modal Relationship-Mining (IMRM) module and the Cross-Modal Relationship-Mining (CMRM) module. These modules are employed to model and mine semantic relationship information among different features. Specifically, the IMRM module models and mines the high-level semantic interrelationships inherent in the image and text features. The CMRM module introduces the nearest neighbor method to model cross-modal semantic relationships and enhance the cross-modal semantic correspondence capabilities of person features. On this basis, we design the Adaptive Corner Center (Acc) loss and the Coarse-to-Fine Learning (C2FL) strategy. These ensure the network receives consistent and effective metric learning supervision throughout the training process. To validate the efficacy of the proposed method, extensive experiments are conducted on three prevalent datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. The achieved mAP values of 70.59%, 41.62%, and 49.58% surpass those of current state-of-the-art methods.

Abstract:
Existing visual-inertial-LiDAR localization methods lack consideration of sensor degradation in challenging scenes, such as underground spaces, where lighting conditions are poor, texture features are scarce, and geometric structure is degraded. To enhance the robustness of multi-sensor fusion, this paper proposes a confidence-factor-based robust localization algorithm. It mainly consists of two parts. In the front-end, a lightweight DCE-Net is used to improve image quality under low illumination, and then an improved feature extraction method based on Line Segment Detector and Manhattan World (LSD-MW) is proposed to extract robust point-line features and construct a Manhattan Frame (MF) structure constraint for subsequent optimization, making visual features more reliable in underground spaces. In the back-end, a novel confidence factor graph optimization strategy is proposed to enhance robustness in the case of sensor degradation, where adaptive confidence factors are designed to assess the reliability of LiDAR and camera features based on their feature matching degree. These confidence factors are leveraged to weight the sensor residual factors to construct an iterative function based on Graduated Nonconvexity (GNC), mitigating the influence of outliers on state estimation in degraded underground scenes. Experiments conducted on public data and on a real-world dataset collected by the UAV platform we built verify that the proposed method has promising performance. Ablation studies also show the robustness of our method.

Abstract:
Fine-grained visual classification (FGVC) is a challenging task characterized by interclass similarity and intraclass diversity and has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) in FGVC tasks since the data specificity of the multihead self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leads to the model being easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT contains a two-stage framework to identify crucial regions effectively within images and enhance features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural scores of the MSA, extracting finer low-level features to offer more comprehensive information through position mapping. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms the other methods in terms of performance, confirming the effectiveness of our proposed methods. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.

Abstract:
The intelligent reflective surface (IRS)-assisted symbiotic radio (SR) network has been proposed as a promising solution for the sixth generation (6G) mobile wireless system, which achieves mutualistic spectrum sharing and highly reliable backscattering communication with extremely low energy cost. On the other hand, exponential growth in video traffic makes wireless video transmission more challenging in the 6G era. With the assistance of the IRS-based secondary link in the SR network, an efficient soft video transmission scheme (IRSCast) is proposed to achieve linear quality transition under a drastically varying wireless channel. To minimize the transmission distortion of the video signal, a multivariable optimization problem is formulated to jointly optimize the wireless resources, including transmission power, active beamforming of the primary transmitter (PTx), and passive beamforming of the secondary transmitter (STx). Then, an alternating optimization method is utilized to decouple the multivariable optimization problem into multiple univariate sub-problems that are finally solved by semidefinite relaxation and Lagrange multiplier methods. The simulation results demonstrate that the proposed IRSCast method significantly improves the objective and subjective quality of the received video.

Abstract:
Recently, neural network-based in-loop filters have been rapidly developed, effectively improving the reconstruction quality and compression efficiency in video coding. Existing deep in-loop filters typically employ networks with fixed structures to process all image blocks. However, under various bitrate conditions, compressed image blocks with different textures exhibit varying degradations, which poses a challenge for high-quality and low-complexity filtering. Additionally, different complexity requirements for coding tools in various scenarios limit the versatility of fixed models. To address these problems, a content-aware dynamic in-loop filter (dubbed DILF) with adjustable complexity is proposed in this paper. Specifically, DILF comprises a policy network and a filtering network. For each reconstructed image block, the policy network dynamically generates a filtering network topology based on pixel information and the quantization parameter (QP), guiding the filtering network to skip redundant layers and conduct content-aware image enhancement, thereby improving the filtering performance. In addition, by introducing a user-defined balancing factor into the policy network, the content-aware filtering network topology can be further adjusted according to the user’s requirements, facilitating adjustable complexity with a single model. We integrate DILF into Versatile Video Coding (VVC) to replace the built-in deblocking filter. Extensive experiments demonstrate the efficiency of DILF in processing image blocks with varying degrees of degradation and its flexibility in controlling complexity. When the balancing factor is set to 2e-5, DILF achieves bitrate savings of 8.07%, 17.97%, and 20.93% on average for YUV components over VVC reference software VTM-11.0 under all-intra configuration. Compared to static networks with fixed structures, DILF demonstrates superior performance and lower computational complexity.

Abstract:
In recent years, the zero-shot sketch-based image retrieval (ZS-SBIR) task has attracted considerable attention. Although some ZS-SBIR approaches have been proposed, it remains challenging to handle the inherent linkages between the sketch and image domains. Moreover, how to transfer semantic knowledge from seen categories to unseen categories is still an open problem, significantly affecting retrieval performance. In this article, we propose a novel approach, Modality Fused Class-Proxy with Knowledge Distillation, named MFCPKD, which develops two novel schemes to remedy the above issues. Specifically, MFCPKD leverages a Modality Fusion Model to learn modality-fused feature embeddings and class proxies. Knowledge distillation is employed for the student network to learn the features of seen categories and infer unknown categories through class proxies. Furthermore, three losses constrain the student network to narrow the modality gap between the sketch and image domains. Finally, we conduct extensive experiments on three benchmark datasets (Sketchy Ext, TU-Berlin Ext, and QuickDraw Ext) and demonstrate that our MFCPKD method can achieve excellent performance compared to some existing methods in ZS-SBIR scenarios. The code for our project is available at https://github.com/li1changxing/MFCPKD.

Affiliations: School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, China; Spacetime Supercomputing (Beijing) Technology Company Ltd., Beijing, China; National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; Department of Automation, Tsinghua University, Beijing, China; ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany

Abstract:
Currently, many researchers aim to achieve automatic depression level prediction via speech and video behavior analysis. However, previous works have struggled to decompose audio and video sequences into the information related to and unrelated to depression scores, hindering the model’s perception of depression cues. Besides, previous works implement multimodal fusion using attention mechanisms or linear layers, but fail to simultaneously consider the Euclidean relationship among tokens and the non-Euclidean relationship among channels, which limits their ability to capture depression cues. In response to the above issues, we propose a depression scale dictionary decomposition framework, which mainly includes a Bidirectional Dictionary Decomposition (BDD) module and a Bidirectional Multimodal Fusion (BMF) module. The BDD module uses dictionaries generated from the depression scale to semantically decompose audio and video sequences into the information related to and unrelated to depression scores along the token and channel dimensions, promoting depression cue perception. Moreover, considering the respective characteristics of tokens and channels, the BMF module uses linear layers and graph convolution to achieve cross-modal mixing, which is used to aggregate audio and video sequences for predicting depression levels. Validation on the AVEC 2013, AVEC 2014, and DAIC-WOZ datasets demonstrates our method’s superiority.

Abstract:
This paper studies the domain generalized remote sensing semantic segmentation (RSSS), aiming to generalize a model trained only on the source domain to unseen domains. Existing methods in computer vision treat style information as domain characteristics to achieve domain-agnostic learning. Nevertheless, their generalizability to RSSS remains constrained, due to the incomplete consideration of domain characteristics. We argue that remote sensing scenes have layout differences beyond just style. Considering this, we devise a joint style and layout synthesizing framework, enabling the model to jointly learn out-of-domain samples synthesized from these two perspectives. For style, we estimate the variant intensities of per-class representations affected by domain shift and randomly sample within this modeled scope to reasonably expand the boundaries of style-carrying feature statistics. For layout, we explore potential scenes with diverse layouts in the source domain and propose granularity-fixed and granularity-learnable masks to perturb layouts, forcing the model to learn characteristics of objects rather than variable positions. The mask is designed to learn more context-robust representations by discovering difficult-to-recognize perturbation directions. Subsequently, we impose gradient angle constraints between the samples synthesized using the two ways to correct conflicting optimization directions. Extensive experiments demonstrate the superior generalization ability of our method over existing methods.

Abstract:
With the rapid progress of generative models, the current challenge in face forgery detection is how to effectively detect realistic manipulated faces from different unseen domains. Though previous studies show that pre-trained Vision Transformer (ViT) based models can achieve some promising results after full fine-tuning on the Deepfake dataset, their generalization performance is still unsatisfactory. To this end, we present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm for generalized face forgery detection, where the parameters in the pre-trained ViT are kept fixed while the designed adaptive modules are optimized to capture forgery features. Specifically, a global adaptive module is designed to model long-range interactions among input tokens, which takes advantage of the self-attention mechanism to mine global forgery clues. To further explore essential local forgery clues, a local adaptive module is proposed to expose local inconsistencies by enhancing the local contextual association. In addition, we introduce a fine-grained adaptive learning module that emphasizes the common compact representation of genuine faces through relationship learning in fine-grained pairs, driving these proposed adaptive modules to be aware of fine-grained forgery-aware information. Extensive experiments demonstrate that our FA-ViT achieves state-of-the-art results in the cross-dataset evaluation and enhances the robustness against unseen perturbations. In particular, FA-ViT achieves 93.83% and 78.32% AUC scores on the Celeb-DF and DFDC datasets in the cross-dataset evaluation. The code and trained model have been released at: https://github.com/LoveSiameseCat/FAViT.

Abstract:
Few-shot semantic segmentation (FSS) aims to segment the target object under the condition of a few annotated samples. However, current studies on FSS primarily concentrate on extracting information related to the object, resulting in inadequate identification of ambiguous regions, particularly in non-target areas, including the background (BG) and Distracting Objects (DOs). Intuitively, to alleviate this problem, we propose a novel framework, namely NTRENet++, to explicitly mine and eliminate BG and DO regions in the query. First, we introduce a BG Mining Module (BGMM) to extract BG information and generate a comprehensive BG prototype from all images. For this purpose, a BG mining loss is formulated to supervise the learning of BGMM, utilizing only the known target object segmentation ground truth. Subsequently, based on this BG prototype, we employ a BG Eliminating Module to filter out the BG information from the query and obtain a BG-free result. Following this, the target information is utilized in the target matching module to generate the initial segmentation result. Finally, a DO Eliminating Module is proposed to further mine and eliminate DO regions, based on which we can obtain a BG and DO-free target object segmentation result. Moreover, we present a prototypical-pixel contrastive learning algorithm to enhance the model’s capability to differentiate the target object from DOs. Extensive experiments conducted on both PASCAL-5i and COCO-20i datasets demonstrate the effectiveness of our approach despite its simplicity. Additionally, we extend our method to the few-shot video object segmentation task and achieve improved performance on a baseline model, demonstrating its generalization ability. Code is available at https://github.com/LIUYUANWEI98/NTRENet++.

Abstract:
This paper proposes a fully differentiable and end-to-end framework for learning Bézier decomposition on 3D point clouds. The framework aims to partition input point clouds into multiple Bézier primitive patches through a learned Bézier decomposition process. Unlike previous approaches that handle different primitive types separately, thus being limited to specific shape categories, our method seeks to achieve a generalized primitive segmentation on point clouds. Drawing inspiration from Bézier decomposition on NURBS models, we adapt it to guide point cloud segmentation without relying on pre-defined primitive types. To achieve this, we introduce a joint optimization framework that simultaneously learns Bézier primitive segmentation and geometric fitting in a cascaded architecture. Additionally, we propose a soft voting regularizer to enhance primitive segmentation and an auto-weight embedding module to effectively cluster point features, making the network more robust and applicable to various scenarios. Furthermore, we incorporate a reconstruction module capable of processing multiple CAD models with different primitives simultaneously. Extensive experiments were conducted on both synthetic ABC datasets and real-scan datasets to validate and compare our approach against several baseline methods. The results demonstrate that our method outperforms previous work in terms of segmentation accuracy, while also exhibiting significantly faster inference speed.

Abstract:
Reflection removal is a crucial issue in image reconstruction, especially for high-definition images. Removing undesirable reflections can greatly enhance the performance of various visual systems, such as medical imaging, autonomous driving, and security surveillance. However, the resolution of existing reflection removal datasets is not high and the training data heavily relies on synthetic data, which hampers the performance of reflection removal methods and restricts the development of effective techniques tailored for high-definition images. Therefore, this paper introduces a new dataset, Real-world Reflection Removal in 4K (RR4K). This novel dataset, with its large capacity and high resolution of 6000 × 4000 pixels, represents a significant advancement in the field, ensuring a realistic and high-quality benchmark. Furthermore, building upon the dataset, we propose an efficient method for single-image reflection removal, optimized for high-definition processing. This method employs the U-Net architecture, enhanced with large kernel distillation and scale-aware features, enabling it to effectively handle complex reflection scenarios while reducing computational demands. Comprehensive testing on the RR4K dataset and existing low-resolution datasets has demonstrated the method’s superior efficiency and effectiveness. We believe that our constructed RR4K dataset can support better evaluation and design of algorithms for removing undesirable reflections from real-world high-definition images. Our dataset and code are available at https://github.com/jengchauwei/RR4K.

Abstract:
Bi-modal (RGB-T and RGB-D) salient object detection (SOD) aims to enhance detection performance by leveraging the complementary information between modalities. While significant progress has been made, two major limitations persist. Firstly, mainstream fully supervised methods come with a substantial burden of manual annotation, while weakly supervised or unsupervised methods struggle to achieve satisfactory performance. Secondly, the indiscriminate modeling of local detailed information (object edge) and global contextual information (object body) often results in predicted objects with incomplete edges or inconsistent internal representations. In this work, we propose a novel paradigm to effectively alleviate the above limitations. Specifically, we first enhance the consistency regularization strategy to build a basic semi-supervised architecture for the bi-modal SOD task, which ensures that the model can benefit from massive unlabeled samples while effectively alleviating the annotation burden. Secondly, to ensure detection performance (i.e., complete edges and consistent bodies), we disentangle the SOD task into two parallel sub-tasks: edge integrity fusion prediction and body consistency fusion prediction. Achieving these tasks involves two key steps: 1) the explicitly disentangling scheme decouples salient object features into edge and body features, and 2) the exclusively fusing scheme performs exclusive integrity or consistency fusion for each of them. Eventually, our approach demonstrates significant competitiveness compared to 26 fully supervised methods, while effectively alleviating 90% of the annotation burden. Furthermore, it holds a substantial advantage over 15 non-fully supervised methods.

Affiliations: Department of Computer Science, Dalian Minzu University, Dalian, China; College of Civil Engineering, Dalian Minzu University, Dalian, China; School of Information Science and Technology, ShanghaiTech University, Shanghai, China; National Center for Computer Animation, Bournemouth University, Poole, U.K.; Department of Computer Science, State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Department of Machine Learning, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates

Abstract:
Most current Light Field Salient Object Detection (LFSOD) methods require full supervision with labor-intensive pixel-level annotations. Unsupervised Light Field Salient Object Detection (ULFSOD) has gained attention due to this limitation. However, existing methods use traditional handcrafted techniques to generate noisy pseudo-labels, which degrades the performance of models trained on them. To mitigate this issue, we present a novel learning-based approach to synthesize labels for ULFSOD. We introduce a prominent focal stack identification module that utilizes light field information (focal stack, depth map, and RGB color image) to generate high-quality pixel-level pseudo-labels, aiding network training. Additionally, we propose a novel model architecture for LFSOD, combining a multi-scale spatial attention module for focal stack information with a cross fusion module for RGB and focal stack integration. Through extensive experiments, we demonstrate that our pseudo-label generation method significantly outperforms existing methods in label quality. Our proposed model, trained with our labels, shows significant improvement on ULFSOD, achieving new state-of-the-art scores across public benchmarks.

Abstract:
Most facial expression recognition (FER) models are trained on large-scale expression data with centralized learning. Unfortunately, collecting a large amount of centralized expression data is difficult in practice due to privacy concerns of facial images. In this paper, we investigate FER under the framework of personalized federated learning, which is a valuable and practical decentralized setting for real-world applications. To this end, we develop a novel uncertainty-Aware label refineMent on hYpergraphs (AMY) method. For local training, each local model consists of a backbone, an uncertainty estimation (UE) block, and an expression classification (EC) block. In the UE block, we leverage a hypergraph to model complex high-order relationships between expression samples and incorporate these relationships into uncertainty features. A personalized uncertainty estimator is then introduced to estimate reliable uncertainty weights of samples in the local client. In the EC block, we perform label propagation on the hypergraph, obtaining high-quality refined labels for retraining an expression classifier. Based on the above, we effectively alleviate heterogeneous sample uncertainty across clients and learn a robust personalized FER model in each client. Experimental results on two challenging real-world facial expression databases show that our proposed method consistently outperforms several state-of-the-art methods. This indicates the superiority of hypergraph modeling for uncertainty estimation and label refinement on the personalized federated FER task. The source code will be released at https://github.com/mobei1006/AMY.

Abstract:
Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned and noise-free. Therefore, representations of those ideal synthetic point clouds have limited variations in the geometric perspective and can gain good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain-invariant geometric patterns from incomplete and noisy point clouds. To address this problem, we introduce a novel scheme for inducing geometric invariance of point cloud representations across domains, via regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate the centroid shift of point clouds due to occlusion and noise. On the other hand, we pioneer an integration of self-supervised relational learning on geometrically-augmented point clouds in a cascaded manner, utilizing the intrinsic relationship of augmented variants and other samples as extra constraints on cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance.

Abstract:
In real-world surveillance scenarios, person re-identification tasks are often seriously affected by occlusion, which requires the model not only to extract powerful features but also to recover them effectively when they are occluded. Although existing methods disentangle visible human bodies by clustering semantic information, they often damage discriminative appearance due to the introduction of background noise. To solve this problem, we propose the Heterogeneous Generative Tokens and Distance-aware Recovery (HGTDR) network, which aims to effectively extract discriminative appearance and recover the occluded body regions. HGTDR mainly contains two branches: a holistic stream and a part stream. The holistic stream utilizes ViT to capture the global context information and provide stable global features by establishing long-range relationships. In the part stream, we propose a Semantic Patch Generator (SPG), which combines the local attention mechanism to capture rich local semantics and further generate semantic patches. Further, considering the discrimination score and relevance score of semantic patches, we feed them into the proposed Adaptive Heterogeneous Semantic Token Generator (AHSTG) to gradually generate strong-response foreground and weak-response background features. In addition, to complete the features of occluded regions, the Distance-based Feature Recovery (DFR) module is designed. The module calculates the planar Euclidean distance between heterogeneous tokens and adaptively allocates the corresponding weights to dynamically recover the invisible body regions. Finally, we obtain discriminative and robust person descriptors. Extensive experiments on several challenging occluded, partial and holistic Re-ID datasets demonstrate that our proposed HGTDR network achieves superior performance and outperforms various state-of-the-art methods.
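
The abstract does not specify how the distance-dependent weights are computed, so the following is only a hedged sketch of one plausible reading: an occluded token is rebuilt as a weighted sum of visible token features, with weights given by a softmax over negative planar Euclidean distances. The softmax form, the `temperature` parameter, and the function name `recover_token` are all assumptions introduced here, not the DFR module itself.

```python
import math


def recover_token(target_pos, visible_pos, visible_feats, temperature=1.0):
    """Rebuild a feature at target_pos from visible tokens (illustrative).

    Weights are softmax(-distance / temperature), so nearer tokens
    contribute more. This weighting rule is an assumption.
    """
    dists = [math.dist(target_pos, p) for p in visible_pos]
    logits = [-d / temperature for d in dists]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(visible_feats[0])
    recovered = [
        sum(w * f[k] for w, f in zip(weights, visible_feats))
        for k in range(dim)
    ]
    return recovered, weights


# Two visible tokens at equal distance from the occluded location.
feat, w = recover_token((0.0, 0.0), [(1.0, 0.0), (0.0, 1.0)],
                        [[1.0, 0.0], [0.0, 1.0]])
```

With equidistant neighbours the weights are equal and the recovered feature is their average; moving one neighbour closer shifts the weight toward it.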

Abstract:
Light field (LF) imaging captures both spatial and angular information of the real world, enabling precise depth estimation. However, images are merely discrete expressions of scenes. Limited by imaging technology, LF cameras cannot capture the infinite rays emitted by scenes, leading to discrete information storage (e.g., pixels). Consequently, previous deep learning methods have encountered challenges in accurately extracting depth information from LF images. In this paper, we investigate a surface-continuous scene representation using a planarity prior and design PlaneNet, a Plane-based Network that successfully generates highly detailed depth maps for real scenes. Specifically, inspired by the plane assumption that real-world scenes generally yield piecewise smooth surfaces, we refine it to the pixel level for continuous surface approximation, which can overcome the limitations of discrete representation. Rather than explicitly parameterizing planes as multiple coefficients, we propose a novel plane regular sampling operator (PRSO), enabling the network to fit smooth depth surfaces easily. To explore the role of our theory at the feature level, we also introduce PRSO into the intermediate layers of PlaneNet. Experiments show that our method achieves state-of-the-art performance on both synthetic and real-world LF scenes, ranking 1st (MSE) on the HCI 4D Light Field benchmark. Furthermore, we explore the utilization of our representation in multiple LF depth estimation networks, and experiments demonstrate improved performance when the surface-continuous representation is applied. Code is available at https://github.com/crs904620522/PlaneNet.

Abstract:
Few-shot fine-grained classification entails notoriously subtle inter-class variation. Recent works address this challenge by developing attention mechanisms, such as the task discrepancy maximization (TDM) that can highlight discriminative channels. This paper, however, aims to reveal that, besides designing sophisticated attention modules, a well-designed input scheme, which simply blends two types of features and their interactions capturing different properties of the target object, can also greatly promote the quality of the learnt weights. To illustrate, we design a bi-feature interactive TDM (BiFI-TDM) module to serve as a strong foundation for TDM to discover the most discriminative channels with ease. Specifically, we design a novel mixing strategy to produce four sets of channel weights with different focuses, reflecting the properties of the corresponding input features and their interactions, as well as a proper feature re-weighting scheme. Extensive experiments on four benchmark fine-grained image datasets showcase the superior performance of BiFI-TDM in metric-based few-shot methods. Our code is available at https://github.com/Peiy-Lu/BiFI-TDM.

Abstract:
Transferability and imperceptibility of adversarial examples are pivotal for assessing the efficacy of black-box attacks. While diffusion models have been employed to generate adversarial examples, leveraging their advanced image generation capability to enhance transferability and imperceptibility, these methods typically focus only on perturbing the image or latent space. They often ignore the critical role of semantic information in the denoising process, thereby impeding the improvement of the transferability of adversarial examples. Furthermore, the modification of high-level semantics inevitably introduces image blurring. This degradation in visual quality makes the adversarial examples more susceptible to detection. To overcome the above limitations, we are the first to utilize image latent encoding and semantic embedding perturbations to enhance the performance of adversarial attacks. Then, the LESEP method is proposed. In the LESEP framework, we first apply image latent encoding attack to achieve deception of the target model. Second, the semantic embedding attack enhances the transferability of adversarial examples. Additionally, we utilize the image restoration technique to guarantee the high imperceptibility of the crafted adversarial examples. Through comprehensive experiments on diverse datasets, different network architectures and defense methods, we have demonstrated that the LESEP method achieves outstanding transferability and imperceptibility while displaying strong robustness.

Abstract:
Due to rich textures and ease of acquisition, finger-based biometric features have gained significant attention for personal authentication in recent years. However, the majority of current finger-based authentication techniques predominantly rely on features extracted solely from a single modality, such as fingerprint or finger vein. Additionally, most current authentication methods utilize contact-based capture and identification, which poses a risk of bacterial or viral infection. To overcome these limitations, we advocate for the adoption of touchless multimodal finger features, providing a hygienic and robust authentication solution. Specifically, we design a device which can capture touchless finger vein and fingerprint images from four fingers, creating the THU-FVFP dataset. To the best of our knowledge, the THU-FVFP dataset is the first publicly available dataset that includes touchless finger vein and fingerprint data from four fingers. Subsequently, we introduce the Attention-based Cross-domain Fusion Network (ACFNet), which can leverage both intra and inter-features of finger vein and fingerprint data. To achieve this, we develop an Intra Multi-Level Feature Fusion Module (IMLFFM) for merging features from different layers within a single modality and an Inter Multi-Modal Feature Fusion Module (IMMFFM) for achieving optimal fusion of diverse features. We extensively evaluate the model on the THU-FVFP database, proving its outstanding performance with an equal error rate of 0.07%. The THU-FVFP dataset is available at https://github.com/oneline-wsq/THU-FVFP-Dataset.

Abstract:
Pixel value ordering (PVO) is an efficient method for implementing reversible data hiding, which can achieve embedding based on overlapping pixel blocks when combined with the flexible patch moving (FPM) mode, especially the two-dimensional (2D) FPM mode. However, in the existing 2D FPM mode, the way prediction errors are paired is not conducive to generating more pixel blocks available for embedding, and the movement rules are too inefficient to fully exploit the potential of PVO, so many available blocks are wasted. Therefore, in this paper, a two-stage embedding mechanism is proposed for the 2D FPM mode, in which the combination of prediction errors is adjusted to improve the possibility of generating available blocks and the two-stage embedding doubles the number of pixel blocks available for embedding. Furthermore, an FPM mode selection is proposed, where four novel 2D FPM modes are designed to efficiently exploit the potential of PVO according to the different directional gradients. Lastly, a set of efficient 2D mappings is designed for multiple histograms to achieve lower embedding distortion. The extensive experimental results show that the proposed method outperforms other state-of-the-art methods in terms of embedding capacity and image fidelity. The average peak signal-to-noise ratio for the Kodak image dataset is as high as 63.62 dB after embedding 10,000 bits.
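
As background for the PVO idea referred to above, the following is a minimal sketch of the classic single-block, maximum-side PVO rule: the largest pixel is predicted by the second largest, a prediction error of 1 carries one bit, and larger errors are shifted by 1 so that extraction stays unambiguous. It is not the proposed two-stage 2D FPM scheme; the function names are assumptions introduced here, and overflow handling (pixels already at 255) is deliberately omitted.

```python
def pvo_embed_block(block, bit):
    """Embed at most one bit into the maximum pixel of a block.

    Returns (marked_block, used); used is True only if the bit was embedded.
    Ties are broken by index so the ordering is reproducible at extraction.
    """
    order = sorted(range(len(block)), key=lambda i: (block[i], i))
    i_max, i_2nd = order[-1], order[-2]
    error = block[i_max] - block[i_2nd]
    marked = list(block)
    if error == 1:            # embeddable: error becomes 1 (bit 0) or 2 (bit 1)
        marked[i_max] += bit
        return marked, True
    if error > 1:             # shift to keep the error ranges disjoint
        marked[i_max] += 1
    return marked, False      # error == 0 is left untouched


def pvo_extract_block(marked):
    """Invert pvo_embed_block. Returns (restored_block, bit_or_None)."""
    order = sorted(range(len(marked)), key=lambda i: (marked[i], i))
    i_max, i_2nd = order[-1], order[-2]
    error = marked[i_max] - marked[i_2nd]
    restored = list(marked)
    if error == 1:
        return restored, 0
    if error == 2:
        restored[i_max] -= 1
        return restored, 1
    if error > 2:             # undo the shift, no payload in this block
        restored[i_max] -= 1
    return restored, None


original = [162, 160, 161, 163]
marked, used = pvo_embed_block(original, 1)
restored, bit = pvo_extract_block(marked)
```

Because only the maximum pixel is ever increased, the pixel ordering inside the block is unchanged, which is what makes the embedding reversible.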

Abstract:
Atrous convolutions are employed as a method to increase the receptive field in semantic segmentation tasks. However, in previous works on semantic segmentation, they were rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs), and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines for applying atrous convolutions more efficiently: (1) do not only use atrous convolutions; (2) avoid the “Atrous Disasters”; (3) appropriate fusion mechanisms make it perfect. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture, which incorporates atrous convolutions in the shallow layers of the model architecture and pretrains nearly the entire encoder on ImageNet to achieve better performance. Our models achieve a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes and BDD datasets, demonstrating the effectiveness of our approach. Specifically, DSNet achieves 40.0% mIOU with an inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU with a speed of 81.9 FPS on Cityscapes. Additionally, we propose a novel multi-scale attention fusion module, MSAF. It demonstrates outstanding performance in classification as well as downstream tasks such as segmentation. Source code and models are available on GitHub: https://github.com/takaniwa/DSNet.
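
To make concrete how an atrous (dilated) convolution enlarges the receptive field without adding weights, here is a small sketch in one dimension. A kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions. The function names are assumptions introduced here, and the example is unrelated to the DSNet architecture itself.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input positions spanned by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def atrous_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D atrous cross-correlation.

    Kernel taps are applied to inputs spaced `dilation` apart.
    """
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + j * dilation]
                       for j, w in enumerate(kernel)))
    return out


# A 3-tap difference kernel with dilation 2 compares samples 4 apart.
response = atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], dilation=2)
```

With dilation 2 the three taps cover five input positions, so the same three weights see a wider context than an ordinary 3-tap convolution.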

Abstract:
Learning hash functions for approximate nearest neighbor search of high-dimensional data has received a surge of interest in recent years. Most existing methods are concerned with learning hash functions for nearest neighbor search on high-dimensional data from a single source. In many real-world applications, data can be collected from diverse sources or represented using different feature descriptors. This raises an open challenge, i.e., the Cross-View Nearest Neighbor Search (CVNNS), where the representation of a query instance can be different from that of target instances to be retrieved in the database. The key challenge of cross-view search is to learn an effective shared representation which can effectively connect the query instance and the target instances to be retrieved. In this paper, we present a new cross-view nearest neighbor search scheme by applying the emerging deep learning to hash techniques. In particular, we investigate two different architectures of deep Restricted Boltzmann Machines (RBMs) for learning to hash toward cross-view nearest neighbor search, and conduct extensive experiments to examine their empirical performance on diverse settings of cross-view image retrieval tasks. The encouraging results show that our technique outperforms the state-of-the-art approaches.

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China; School of Information Science & Technology, Dalian Maritime University, Dalian, China; School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

Abstract:
Aiming to mitigate image distortion caused by steganography algorithms at high-capacity information embedding and enhance the steganalysis resistance capability of generated stego images, this paper proposes a high-capacity differential steganography algorithm for color images based on multiple adversarial networks. Instead of directly modifying the pixels of the cover image, the algorithm embeds the secret information into the differential plane generated by the two most similar channels of the cover image. Consequently, the distortion of the stego image is minimized while embedding a secret image of the same size. At the same time, the fidelity of the stego and extracted secret images is continually improved through adversarial training between the generator and discriminator in the proposed steganography network. Furthermore, multiple steganalysis networks are parallelly utilized to enhance the steganalysis resistance capability of stego images. In addition, the Lion optimizer is utilized for the first time to improve the convergence speed of the proposed steganographic network. Experimental results show that the comprehensive performance of the proposed algorithm outperforms other state-of-the-art steganography algorithms significantly.

Abstract:
The rich motion and appearance cues between consecutive frames are crucial for robust visual tracking. However, most existing tracking methods are still limited to designing different components that separately employ the corresponding cues, and some even ignore one of them. This makes it difficult for them to maintain effective interaction between different cues, thus hindering the models from fostering a comprehensive understanding of the target objects. To address these issues, we propose a unified spatio-temporal cues learning framework (named USCLTrack) that comprehensively mines the variation patterns of targets between consecutive frames in complex video streams. Specifically, USCLTrack first aggregates motion and appearance cues into shared queries to provide a bridge for interaction between both cues. Then, it directly generates object locations conditioned on these shared queries in an autoregressive manner, unifying different cues to guide future inferences. To effectively learn multiple spatio-temporal cues aggregated in the shared queries, we develop a spatio-temporal attention mechanism. This mechanism integrates motion cues with appearance cues according to the time steps to ensure temporal consistency. Moreover, it concurrently captures motion trends and appearance changes to facilitate the understanding of the target objects. Extensive experiments on eight popular tracking benchmarks validate the effectiveness of the proposed USCLTrack.

Abstract:
The existing approaches for skeleton-based action recognition based on graph convolutional networks (GCNs) primarily emphasize the construction of human skeletal structure by leveraging inherent connections. However, the static skeletal topology used across all action categories fails to capture discriminative relationships between joint pairs, while current graph structures struggle to model dynamic motion information, limiting their ability to represent both temporal and motion-specific dependencies. To address this limitation, we propose the decoupled static-dynamic co-occurrence graph convolution (DSDC-GConv), which specifically aims to learn and adapt the graph topology by refining the inter-frame and intra-frame joint dependencies in a decomposed manner. Additionally, a multi-level context-aware module is proposed to comprehensively model the latent saliencies of multiple domains in skeletal sequences. This module refines the spatial nodes, temporal dynamics, channel-wise characteristics, and motional dependencies within the graph convolution block. Furthermore, a hierarchical densely connected temporal convolution is proposed to enhance the representation of local features through partial dense connections and enrich the temporal information during the convolution process. Findings from our evaluations on five large-scale benchmark datasets (i.e., NTU RGB+D 60, NTU RGB+D 120, Kinetics Skeleton 400, Northwestern-UCLA, PKU-MMD) demonstrate the effectiveness and superiority of our proposed method over competing approaches, with recognition accuracies of 93.0% and 97.1% on NTU RGB+D 60, 89.9% and 90.6% on NTU RGB+D 120, 38.6% and 63.4% on Kinetics Skeleton 400, 97.4% on Northwestern-UCLA, and 97.6% and 63.6% on PKU-MMD.

Abstract:
Salient object detection of underwater scenes (USOD) poses greater challenges than that of traditional terrestrial scenes due to the presence of diverse and complex underwater image degradation. Current deep learning-based USOD methods generally treat all samples equally while failing to account for the varying difficulty levels of different training samples, thus leading to a limited performance. To tackle this challenge, this paper introduces a novel deep USOD method which benefits from iterative Dual-stage Self-paced Learning (DSPL) and Salient Object Depth Emphasis (SODE). Specifically, a DSPL strategy, which enforces the network to only focus on simpler samples in the first stage and then shifts attention to more challenging samples in the second stage, is devised to imitate the learning process of humans. The whole network is iteratively trained with the DSPL strategy and thus gradually adapted to various underwater scenes with different difficulty levels. Additionally, the proposed method involves an SODE module, which adaptively enhances depth information to effectively locate salient objects, addressing the issue of unreliable depth data caused by underwater image quality degradation. Experimental results on two benchmark datasets demonstrate the superior performance of the proposed method against state-of-the-art methods. The source code of our method will be made available at https://github.com/NIT-JJH/SPDE.
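
The easy-then-hard schedule described above can be illustrated with the standard hard-threshold form of self-paced learning, in which a sample contributes to the loss only if its current loss falls below a threshold that is later relaxed. This is a generic sketch under that assumption; the thresholds, loss values, and function names are hypothetical, and the actual DSPL selection rule is not given in the abstract.

```python
def self_paced_weights(losses, threshold):
    """Binary sample weights: 1.0 for 'easy' samples (loss < threshold)."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def weighted_mean_loss(losses, weights):
    """Mean loss over the currently selected samples (0.0 if none)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total


# Hypothetical per-sample losses for one mini-batch.
batch_losses = [0.2, 0.9, 0.5, 1.4]

# Stage 1: a tight threshold keeps only the simpler samples.
stage1 = self_paced_weights(batch_losses, threshold=0.6)

# Stage 2: a relaxed threshold admits the harder samples as well.
stage2 = self_paced_weights(batch_losses, threshold=1.5)
```

In an iterative scheme the two stages would be repeated, with per-sample losses recomputed after each round of training.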

Abstract:
In few-shot fine-grained recognition (FS-FGR) tasks, the main challenge is to distinguish novel categories with high intra-class variations and low inter-class differences given scarce training data. Existing studies explore discriminative features through a compact network to avoid overfitting, but they achieve only marginal performance gains owing to the limited representation capability. Motivated by the significant progress of the vision foundation model, we introduce it to describe visual attributes and boost the performance of the compact feature extractor. A few-shot fine-grained recognition method with Variational Feature Imitation Conditioned on Visual Descriptions, VFI-CVD for short, is proposed in this paper. It simultaneously exploits the pre-trained knowledge from a vision foundation model and the expert knowledge mined by a feature extractor. Specifically, the intra-class variations shared across object categories are encoded into a common distribution, so that we can augment features by sampling latent variables. To enhance the learning of intra-class variations, a condition exchange strategy (CES) is put forward to exchange knowledge between samples through feature cross-imitation. In the inference stage, the learned knowledge is further integrated through the joint prediction of visual descriptions and cross-imitated features. Comprehensive experimental results on four fine-grained benchmark datasets show that the proposed VFI-CVD achieves state-of-the-art performance, e.g., 90.37% under the 5-way 1-shot setting on CUB-200-2011. It surpasses existing methods by a large margin, especially in the challenging 30-way recognition tasks and cross-domain evaluation. The source code is publicly available: https://github.com/Lx-zjwf/VFI-CVD.

Abstract:
Dynamic objects pose significant challenges to the accuracy of state estimation and map quality in Simultaneous Localization and Mapping (SLAM). While current dynamic SLAM methods often rely on semantic information to detect specific movable objects, this dependency on pre-trained models and semantic priors can lead to false dynamic detections. This paper presents a novel semantic-independent dynamic SLAM method that detects truly moving regions, without being constrained by the classes or motion patterns of dynamic objects. We introduce a geometric re-clustering approach to improve object clustering by addressing the under- and over-segmentation caused by the K-Means algorithm. Next, instead of simply classifying entire clusters as dynamic or static, we propose a method to detect dynamic regions within each cluster based on dense optical flow residuals. This enables the detection of partial object movements, such as a seated person moving only his hands. Dynamic detection results are propagated across consecutive frames as dynamic priors for calculating optical flow residuals. Additionally, to enhance map quality, we address the mis-detection of slowly or intermittently moving objects through depth consistency checks applied over a larger time interval. Extensive evaluations on public datasets (TUM and Bonn) and real-world scenes show that our method outperforms state-of-the-art semantic-based methods in terms of localization accuracy and generalizability across various scenarios, particularly when facing unknown dynamic objects. Our method also achieves clean and dense reconstructions, demonstrating its potential for applications like robot navigation in dynamic environments.

Abstract:
Recently, deep learning has significantly advanced satellite object detection. However, the effectiveness of these methods heavily relies on abundant and accurate annotations, which are extremely labor-intensive to obtain for satellite videos. Meanwhile, the robustness of traditional difference-based methods is limited by the hand-crafted features extracted from satellite videos with low resolution and frame misalignment. To address this problem, an unsupervised deep satellite video vehicle detection framework based on scene prior constraints and self-paced learning (S-SPL) is proposed in this paper. S-SPL obtains the initial pseudo labels using difference-based methods, and employs a deep learning-based detector and refiner to detect objects and update labels, respectively. In the training phase, to alleviate the deviation of feature expression caused by noisy samples, a novel cooperative self-paced learning scheme is designed to improve the label quality and model accuracy in an alternating optimization manner. Furthermore, considering the semantic relationship between the scene and the object distribution, multi-cue prior knowledge is introduced to provide scene-level constraints, with which samples in high-confidence scenes are emphasized to improve the self-paced learning process. The experimental results on Jilin-1 and SkySat satellite videos demonstrate the superiority of S-SPL.

Abstract:
Multi-expert networks have shown great superiority for imbalanced data classification tasks due to their complementarity and diversity. We have summarized two aspects for further exploration: (1) uncontrollable results, arising from the performance differences of individual experts and variations in sample difficulty; (2) insufficient exploration of the internal data structure. These factors result in inconsistent model performance across different data distributions, thereby impacting the model’s generalization ability. To address the above issues, we propose a Collaborative Global-Local Structure Network (CGL-Net) with knowledge distillation for imbalanced data classification. Firstly, CGL-Net, as a new framework, decouples the representation learning of imbalanced data into global and local structure, enhancing the controllability of the integrated model in a hierarchical manner. Secondly, CGL-Net innovatively combines knowledge distillation, data augmentation, and multiple expert networks, efficiently extracting the internal structure of the data and improving robust recognition on imbalanced data. In particular, the global structure learning introduces an independent student network that integrates knowledge from diverse experts, enabling the model to achieve comprehensive and balanced performance across categories in imbalanced data. The local structure learning incorporates augmented data, allowing the model to focus on discriminative regional learning of individual objects, thereby enhancing the robust representation for imbalanced data. After completing these two sequential learning stages, the model hierarchically integrates knowledge to achieve robust recognition performance on imbalanced data. Extensive experiments on six benchmark datasets demonstrate that the proposed CGL-Net significantly outperforms recent state-of-the-art methods.

Abstract:
Reversible data hiding (RDH) is a special type of data hiding and is widely used in many applications. Moreover, to protect the cover image, RDH in encrypted image (RDHEI) schemes have been proposed. In RDHEI, if the marked image is lost, the cover image and the secret data cannot be restored. To address the loss of sub marked images, RDH in shared image (RDHSI) based on secret sharing (SS) has been proposed to achieve fault tolerance. Recently, an anonymous application scenario using RDHSI was introduced, in which multiple data hiders in RDHSI serve as reviewers in a committee. When the number of agreement votes is above the threshold of SS, the application is approved. However, there are weaknesses in this application framework: (i) reviewers do not have an equal right to vote (namely, they cannot cast both “Yes” and “No” votes), and (ii) anonymous submission is not really achieved. In this paper, we propose a new RDHSI scheme that allows data hiders to hide approved sub data or disapproved sub data in sub images. Therefore, recovery and extraction from n marked images (some embedded with “Yes” votes and some with “No” votes) must be carefully designed. Based on the consistency property of polynomial interpolation, we design verification, recovery, and extraction algorithms for n marked images. In addition, we use Hamming codewords to represent pixel differences instead of directly hiding the pixel differences for reversibility, which also improves the embedding capacity. Compared with the current anonymous application scheme using RDHSI, the embedding rate in this paper can reach 3.5 bits per pixel (bpp), an improvement of 2 bpp. Thus, it is more suitable for the anonymous submission application framework.
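
The fault tolerance and the consistency check mentioned above rest on a standard property of (k, n) secret sharing: any k valid shares determine the same polynomial. The following is a minimal sketch of generic Shamir sharing over a prime field, not the RDHSI scheme of the paper; the prime 257, the function names, and the form of the consistency check are illustrative assumptions.

```python
import random

PRIME = 257  # assumed field size, just large enough for one 8-bit value


def make_shares(secret, k, n, rng=random):
    """Split `secret` into n shares so that any k of them recover it."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares


def interpolate_at(shares, x0):
    """Lagrange-interpolate through `shares` and evaluate at x0 (mod PRIME)."""
    total = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (x0 - xm) % PRIME
                den = den * (xj - xm) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + yj * num * pow(den, -1, PRIME)) % PRIME
    return total


def recover(shares):
    """The secret is the polynomial's value at x = 0."""
    return interpolate_at(shares, 0)


def is_consistent(shares, k):
    """True if every remaining share lies on the polynomial fixed by the first k."""
    base = shares[:k]
    return all(interpolate_at(base, x) == y for x, y in shares[k:])
```

With a (3, 5) split, any three shares recover the secret, and altering a single share makes `is_consistent` return False, which is the kind of check a verification step can build on.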

Abstract:
Change Detection (CD) is a crucial and challenging task in remote sensing observations. Despite the remarkable progress driven by deep learning in remote sensing change detection, several challenges remain regarding global information representation and efficient interaction. The traditional Siamese network structure extracts features from bitemporal images using a weight-sharing network and generates a change map, but often neglects phase interaction information between images. Additionally, multi-scale feature fusion methods frequently use FPN-like structures, leading to lossy cross-layer information transmission and hindering the effective utilization of features. To address these issues, we propose a multi-scale interaction fusion network (MIFNet) that fuses bitemporal features at an early stage, using deep supervision techniques to guide early fusion features in obtaining abundant semantic representation of changes. We also construct a dual complementary attention module (DCA) to capture temporal information. Furthermore, we introduce a collection-allocation fusion mechanism, which differs from previous layer-by-layer fusion methods in that it collects global information and embeds features at different levels to achieve effective cross-layer information transmission and promote global semantic feature representation. Extensive experiments demonstrate that our method achieves competitive results on the LEVIR-CD+ dataset and outperforms other advanced methods on both the LEVIR-CD and SYSU-CD datasets, with F1 improved by 0.96% and 0.61%, respectively, compared to the most advanced models.

Abstract:
Supervised dehazing models, trained on synthetic hazy-clean image pairs, often face a notable decline in performance when applied to real-world scenes. Consequently, CycleGAN-based unpaired dehazing methods are proposed to improve the model’s generalization. One successful approach among these methods involves decomposing the physical properties of the atmospheric scattering model (ASM). However, estimating physical properties individually from input images is difficult without supervised labels, which ignores the semantic consistency between different physical regions. We claim semantic region information can offer additional geometric spatial constraints for estimating physical properties, as natural images can be divided into regions with similar scene depths. Motivated by this, we propose a novel generalized and realistic unpaired image dehazing framework via region-aware physical constraints (RPC-Dehaze). Our approach utilizes fine-grained semantic region maps from the Segment Anything Model (SAM) in a specially designed region prompt enhancement module. This enables the dehazing and hazing cyclic networks to learn region-aware physical constraints, leading to accurate estimation of haze imaging physical properties. In contrast to existing unpaired methods that treat dehazing and hazing networks equally, we incorporate Retinex theory into the hazing network, allowing it to learn diverse illumination effects in different regions. We adaptively refine the Retinex-based illumination component, resulting in more realistic hazy images. To further facilitate unsupervised learning in our framework, we propose a physical consensual contrastive regularization to ensure compact representation constraints in the latent feature space. Extensive experiments on synthetic and real image datasets show our method surpasses state-of-the-art unpaired dehazing methods in both effectiveness and generalization capability.
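
For readers unfamiliar with the atmospheric scattering model (ASM) referred to above, its commonly used form is I = J * t + A * (1 - t) with transmission t = exp(-beta * d), where J is the clean intensity, A the atmospheric light, d the scene depth, and beta the scattering coefficient. The sketch below applies this per pixel value; the default values of `airlight` and `beta` are arbitrary assumptions, and it does not represent the RPC-Dehaze networks.

```python
import math


def transmission(depth, beta=1.0):
    """Transmission t = exp(-beta * depth); closer to 0 means denser haze."""
    return math.exp(-beta * depth)


def add_haze(clean, depth, airlight=0.9, beta=1.0):
    """Forward ASM: blend the clean intensity with the atmospheric light."""
    t = transmission(depth, beta)
    return clean * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight=0.9, beta=1.0):
    """Invert the ASM when depth, airlight and beta are known exactly."""
    t = transmission(depth, beta)
    return (hazy - airlight * (1.0 - t)) / t
```

Because t depends only on depth, pixels in a region of similar depth share nearly the same transmission, which is the intuition behind using region information as a constraint.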

Abstract:
Faces in a scene of human group, if coded with sufficient precision, can be computer analyzed for machine vision tasks involving faces. But this requires storing and communicating them at a very high bit rate. Traditional ROI-based image compression methods are ill suited to code many faces at high precision against a complex background. In this work, we propose a novel group image compression neural network (GICNet) of two layers: 1) the face layer dedicated to machine analysis, in which face bounding boxes are first cropped out of the background and converted to a compression-friendly canonical sketch-guided representation of fixed resolution for compact coding and facilitating downstream tasks without additional preprocessing; 2) the background layer dedicated to overall human vision perceptual quality, in which face residuals and background elements are coded and appended to the code stream. Experimental results demonstrate the effectiveness of our proposed GICNet, conserving up to 13%-57% bitrate for machine vision applications while maintaining competitive perceptual quality.

Abstract:
Multi-modal neural architecture search (MNAS) is an effective approach to obtain task-adaptive multi-modal classification models. Deep neural networks, currently the mainstream feature extractors, can provide hierarchical features for each modality. Existing MNAS methods have difficulty exploiting such hierarchical features because they coexist in different forms, such as tensorial multi-scale features and vectorized penultimate features. Moreover, existing methods always focus on the evolution of fusion operators or vectorized features of all modalities, which constrains the search space. In this paper, a novel two-stage method called multi-modal multi-scale evolutionary neural architecture search (MM-ENAS) is proposed. The first stage unifies the representation form of hierarchical features by the proposed evolutionary statistics strategy. The second stage identifies the optimal combination of basic fusion operations for all unified hierarchical features by the evolutionary algorithm. MM-ENAS enlarges the search space by simultaneously searching for feature statistical extraction methods, basic fusion operators, and a feature representation set consisting of tensorial multi-scale features and vectorized penultimate features. Experimental results on three multi-modal tasks demonstrate that the proposed method achieves competitive performance in terms of accuracy, search time, and number of parameters compared to existing representative MNAS methods. Additionally, the method exhibits fast adaptation to various multi-modal tasks.

Abstract:
Near infrared-visible (NIR-VIS) heterogeneous face recognition, which aims to match face identities in cross-modality settings, has achieved significant development recently. However, work on adversarial attacks and security issues in heterogeneous face recognition is still lacking. Existing adversarial face generation methods cannot be deployed directly because of the inevitable large modality discrepancy. Besides, ideal adversarial images should combine high attack capability with low detectability. Considering the properties of near-infrared face images, our basic idea is to construct adversarial shadows for good stealthiness and high attack capability. In this paper, we propose a novel face adversarial shadow generation framework for NIR-VIS heterogeneous face recognition, which can synthesize finely crafted lighting conditions with strong identity attacking ability. Specifically, we design the variance consistency-based symmetric face attacking loss to improve the attacking generalization and the synthesized image quality. Extensive qualitative and quantitative experiments on the public large-scale NIR-VIS heterogeneous face dataset show that the proposed method achieves superior performance compared with the state-of-the-art methods. The source code is publicly available at https://github.com/GEaMU/Devil-in-Shadow.

Abstract:
Generating high-quality high dynamic range (HDR) images in dynamic scenes is particularly challenging due to the influence of large motion. Despite the effectiveness of existing deep learning methods, they still suffer from ghosting artifacts when saturation and motion coexist. Inspired by fusion on static scenes, we propose an inpainting and fusion strategy to enhance the quality of the generated HDR images. The proposed method consists of pseudo-static LDR generation and detail-guided HDR generation, which creates pseudo-static images and then generates ghost-free HDR images. Specifically, the pseudo-static LDR generation network utilizes semantic information to identify the motion regions, and employs a diffusion model-based inpainting approach to produce pseudo-static LDR images that closely resemble real scenes. In the detail-guided HDR generation network, we employ a detail enhancement module to refine diverse high-frequency features with detailed information extracted from pseudo-static LDR images, which effectively enhances the visual quality. Extensive experiments on four public datasets demonstrate the superiority of the proposed method, both quantitatively and qualitatively.

Abstract:
Object counting aims to accurately count the number of object instances in images, and its operating efficiency is essential. However, most current CNN-based methods rely on complex network architectures, which consume a significant amount of memory, time, and other resources at runtime. This seriously limits their deployment in practical application scenarios, such as public safety and agricultural planting. Therefore, we propose a lightweight object counting method named EdgeCount to effectively balance inference speed and object counting accuracy. Specifically, we construct a network composed of a student model (EdgeCount) and a teacher model (EdgeCount-T) with the same encoder-decoder structure based on density map knowledge distillation (DMKD), allowing EdgeCount to learn the object density distribution from EdgeCount-T. After that, we introduce spatial and channel reconstruction convolution (SCConv), composed of a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU), to decrease spatial and channel redundancy with lower computational costs. Moreover, a low parameter weighted multi-scale feature fusion module (LWMFFM) is designed to further improve the counting ability by segmenting minor structural discrepancies among multi-scale features. Extensive experiments conducted on challenging remote sensing and dense crowd object counting datasets demonstrate the effectiveness and superiority of our method. In particular, on four NVIDIA Jetson devices, EdgeCount can accurately count objects with only 0.12M parameters and 19.87M floating-point operations (FLOPs) at an input size of 128, achieving the lowest latency and highest FPS compared with other state-of-the-art object counters.
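
In density-map-based counting, the predicted count is the sum of the density map, and a student can be trained to match the teacher's map. The sketch below illustrates both ideas on nested lists. The mean squared error used here is an assumed stand-in, since the abstract does not state the actual DMKD loss, and the function names are hypothetical.

```python
def predicted_count(density_map):
    """The object count is the integral (sum) of the density map."""
    return sum(sum(row) for row in density_map)


def density_distillation_loss(student_map, teacher_map):
    """Mean squared error between student and teacher density maps (assumed loss)."""
    squared_diffs = [
        (s - t) ** 2
        for student_row, teacher_row in zip(student_map, teacher_map)
        for s, t in zip(student_row, teacher_row)
    ]
    return sum(squared_diffs) / len(squared_diffs)


# Toy 2x2 maps: both sum to two objects, but the spatial layout differs.
teacher = [[0.5, 0.5], [1.0, 0.0]]
student = [[0.5, 0.5], [0.5, 0.5]]
```

The toy maps show why matching the map matters and not only the count: the two maps agree on the total of 2.0 while the distillation loss is still nonzero.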

Abstract:
Collecting training data for deep models from the Internet is a common data acquisition approach. However, there are challenges in using these data directly, as they often contain inaccurate annotations. This situation has drawn increasing attention to noisy label learning, the process of training a deep model with unreliable annotations. The typical strategy in noisy label learning is to identify potential mislabeled samples and assign pseudo-labels generated by the network to them, replacing the original labels. However, existing methods encounter the following problems: 1) they typically do not evaluate the pseudo-labels and directly use all of them, and 2) empirical parameter settings are often dataset-specific. These shortcomings limit the application of these methods in real-world scenarios. In this paper, we propose the Linear Feature Source Prediction and Recombination Network (LFSPR), which addresses the above problems through a new pretext task. The pretext task is designed to build the linear connection between the high-dimensional feature and the low-dimensional feature. The source of the latter is regarded as the high-dimensional feature, which passes through a non-linear head network to obtain the low-dimensional feature. The pretext task is designed in low-dimensional space by predicting the linear composition weights of the potential source. Based on the pretext task, our method can generate pseudo-labels for uncertain samples while dynamically evaluating and selecting them, rather than simply using all pseudo-labels or discarding a fixed proportion of pseudo-labels for a given dataset. To the best of our knowledge, this is the first approach in the noisy label learning domain to employ a pretext task for pseudo-label generation, evaluation, and selection. The experiments on CIFAR-10, CIFAR-100 and Clothing1M demonstrate the effectiveness of our method.

Abstract:
The complex entanglement between darkness and noise hinders the advance of low-light image enhancement. Most existing methods adopt a lightening-then-denoising pipeline or embed a special denoising module into the enhancement network, without specific noise knowledge as supervision, to restore low-light images. However, they either fail to remove the amplified noise or blur the detail information. To address these drawbacks, we propose a novel dual prior guidance method for low-light image enhancement that relights darkness and suppresses noise simultaneously. Concretely, the main novelties of our proposed method are three-fold. Firstly, our formulation originates from a statistical observation that darkness can be disentangled into the luminance channel, yet noise still exists in each channel, when low-light images are transformed from RGB space to YCbCr space. This inspires us to design an ingenious method, extracting noise and darkness, termed END, to enhance low-light images. Secondly, we propose a prior extraction network with a prior composition module to extract luminance and noise priors from different channels. Thirdly, an image enhancement network equipped with a prior guidance module is proposed to progressively lighten the darkness and remove noise. Extensive experiments on multiple benchmarks demonstrate that our proposed method achieves remarkable performance compared to other state-of-the-art low-light image enhancement methods. The source code and trained model can be found at https://github.com/WHK-Huake/END.
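
The observation above depends on the RGB to YCbCr transform. A minimal per-pixel sketch follows, using the full-range BT.601 coefficients found in JPEG/JFIF; the abstract does not say which YCbCr variant the authors use, so the coefficients and the [0, 1] value range are assumptions.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel in [0, 1] to full-range BT.601 YCbCr in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Darkening a gray pixel lowers Y and leaves both chroma channels at 0.5.
bright = rgb_to_ycbcr(0.8, 0.8, 0.8)
dark = rgb_to_ycbcr(0.2, 0.2, 0.2)
```

For a neutral pixel, a uniform brightness change moves only the Y channel, which matches the intuition that darkness concentrates in luminance while noise can appear in every channel.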

Abstract:
Underwater images often suffer from various issues such as low brightness, color shift, blurred details, and noise due to light absorption and scattering caused by water and suspended particles. Previous underwater image enhancement (UIE) methods have primarily focused on spatial domain enhancement, neglecting the frequency domain information inherent in the images. However, the degradation factors of underwater images are closely intertwined in the spatial domain. Although certain methods focus on enhancing images in the frequency domain, they overlook the inherent relationship between the image degradation factors and the information present in the frequency domain. As a result, these methods frequently enhance certain attributes of the improved image while inadequately addressing or even exacerbating other attributes. Moreover, many existing methods heavily rely on prior knowledge to address color shift problems in underwater images, limiting their flexibility and robustness. In order to overcome these limitations, we propose the Embedding Frequency and Dual Color Encoder Network (FDCE-Net) in our paper. The FDCE-Net consists of two main structures: 1) Frequency Spatial Network (FS-Net) aims to achieve initial enhancement by utilizing our designed Frequency Spatial Residual Block (FSRB) to decouple image degradation factors in the frequency domain and enhance different attributes separately; 2) To tackle the color shift issue, we introduce the Dual-Color Encoder (DCE). The DCE establishes correlations between color and semantic representations through cross-attention and leverages multi-scale image features to guide the optimization of adaptive color query. The final enhanced images are generated by combining the outputs of FS-Net and DCE through a fusion network. These images exhibit rich details, clear textures, low noise and natural colors. Extensive experiments demonstrate that our FDCE-Net outperforms state-of-the-art (SOTA) methods in terms of both visual quality and quantitative metrics. The code of our model is publicly available at: https://github.com/Alexande-rChan/FDCE-Net.

Abstract:
Recent image quality assessment (IQA) methods typically focus on predicting the mean opinion score (MOS) of image quality, ignoring the image quality score distribution. This distribution provides valuable information beyond the MOS, including the standard deviation of opinion scores (SOS) and opinion scores at different quality levels. This paper introduces a novel no-reference IQA method that predicts the image quality score distribution to estimate the MOS. The proposed method consists of three modules: a visual feature extraction module, a graph convolutional module, and a MOS prediction module. In the visual feature extraction module, a convolutional neural network is designed to extract both first- and second-order visual features of images. The graph convolutional module employs a graph convolutional network (GCN)-based mapper to map these visual features to the image quality score distribution by exploring correlations between quality labels. The MOS is then derived from the predicted image quality score distribution in the MOS prediction module. We are the first to jointly train the method using both the MOS and the image quality score distribution, enabling it to learn richer subjective information and improve prediction performance. To address the lack of the ground-truth image quality score distribution in some IQA databases, we propose to use an SOS assumption to generate a Gaussian-based image quality score distribution that better reflects subjective perception. Additionally, we design appropriate loss functions for training. Experimental results demonstrate that our method effectively predicts both the image quality score distribution and the MOS, outperforming most state-of-the-art IQA methods.
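
One plausible reading of the Gaussian-based distribution and of deriving the MOS from a distribution is sketched below. The five quality levels, sampling the Gaussian density at integer levels, and renormalizing are all assumptions made for illustration; the abstract does not specify the construction used in the paper.

```python
import math

LEVELS = [1, 2, 3, 4, 5]  # assumed discrete quality levels


def gaussian_score_distribution(mos, sos, levels=LEVELS):
    """Discretize a Gaussian with mean `mos` and std `sos` (> 0) over the levels."""
    weights = [math.exp(-((s - mos) ** 2) / (2.0 * sos ** 2)) for s in levels]
    total = sum(weights)
    return [w / total for w in weights]


def mos_from_distribution(probs, levels=LEVELS):
    """The MOS is the expectation of the score under the distribution."""
    return sum(p * s for p, s in zip(probs, levels))


probs = gaussian_score_distribution(mos=3.0, sos=0.8)
```

When the MOS lies near an end of the scale, truncating the Gaussian to the available levels pulls the expectation toward the middle, so the recovered MOS only approximates the input in that case.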

Abstract:
Most Image-Text Matching (ITM) models adopt Triplet loss with Hard Negative mining (T-HN) as the optimization objective. T-HN mines the hardest negative samples in each batch for training and achieves impressive performance. However, we observe that these ITM models have bad training behaviors in the early phases of training. Model training is difficult to converge, and matching performance is slow to improve. In this paper, we find that the cause of bad training behavior is that the model suffers from gradient vanishing. Optimizing an ITM model using only the hardest negative samples can easily lead to gradient vanishing. Through gradient analysis, we first derive the condition under which the gradient vanishes during training. We explain why the gradient tends to zero under certain conditions. To alleviate gradient vanishing, we propose a Triplet loss with Selectively Hard Negative mining (T-SelHN), which decides whether to mine the hardest negative samples according to the gradient vanishing condition. T-SelHN can be applied to ITM models in a plug-and-play manner to improve their training behaviors. To further ensure the back-propagation of gradients, we construct a Residual Visual Semantic Embedding model with T-SelHN, denoted RVSE++, which has a simple network structure and efficient training and inference speeds. Extensive experiments on two ITM benchmarks demonstrate the strength of RVSE++, achieving state-of-the-art performance. The code is available at https://github.com/AAA-Zheng/RVSEPP.
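
For context, the T-HN objective is usually written over a batch similarity matrix whose diagonal holds the matched image-text pairs: for each pair, only the hardest negative in each retrieval direction contributes a hinge term. The sketch below follows that common formulation; the margin of 0.2 and the function name are assumptions, and the selective rule of T-SelHN is not reproduced because the abstract does not state its condition.

```python
def triplet_hardest_negative(sim, margin=0.2):
    """Sum of hinge losses using only the hardest negative per matched pair.

    sim[i][j] is the similarity of image i and text j; sim[i][i] is the
    matched pair. The batch must contain at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total
```

Since each hinge term depends on a single negative, a batch where every hardest negative is already separated by the margin yields a loss of exactly zero.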

Abstract:
Semantic communication (SC) is an emerging communication paradigm that transmits only task-related semantic features to receivers, offering advantages in speed. However, existing robust steganography cannot extract messages correctly after SC. To address this issue, we propose a novel steganography framework based on Generative Adversarial Networks (GANs) for SC, called “Image Semantic Steganography”. Our framework embeds messages into semantic features to guarantee extraction while considering both pixel-level and semantic-level distortions to enhance security. Experimental results show that our framework not only achieves successful message extraction and behavioral covertness during and after SC, but also does not affect the implementation of SC.

Abstract:
Despite the remarkable success of monocular depth estimation, most works focus on ideal experimental conditions, such as favorable weather, where there are few environmental factors affecting the depth estimation system. In practice, under adverse weather conditions such as fog and rain, a model trained on favorable weather degrades sharply due to the domain shift caused by decreased visibility. To solve this problem, in this paper, we propose a Curriculum Domain Distribution Alignment (CDA) algorithm to learn the domain-invariant representation, progressively aligning data distributions across favorable weather and adverse weather in the feature space. Concretely, to construct a domain adaptation curriculum, we first separate the target domain into several subsets with increasing domain discrepancy based on an optical model. Then, we bridge the distribution discrepancy between domains from easier to harder data by matching the source and target representation subspace. Furthermore, to control the pace of distribution alignment, we introduce self-paced learning to learn a dynamic domain adaptation weight, promoting the generalization ability of monocular depth estimation networks against environmental factors. We conduct experiments with six monocular depth estimation frameworks on FoggyCityScapes, RainCityScapes, SnowCityscapes, and All-day Cityscapes, improving RMSE by 8.5%, 30.5%, 30.9%, and 20.9%, respectively. This performance demonstrates the effectiveness and generalizability of our method under adverse weather conditions.

Abstract:
Correspondence learning aims to identify correct correspondences from the initial correspondence set and estimate camera pose between a pair of images. At present, Transformer-based methods have made notable progress in the correspondence learning task due to their powerful non-local information modeling capabilities. However, these methods seem to neglect local structures during feature aggregation from all query-key pairs, resulting in computational inefficiency and inaccurate correspondence identification. To address this issue, we propose a novel Context-aware Local and Global interaction Transformer (CLGFormer), a lightweight Transformer-based module with dual branches that address local and global context perception in attention mechanisms. CLGFormer explores the relationship between neighborhood consistency observed in correspondences and context-aware weights appearing in vanilla attention and introduces an attention-style convolution operator. On top of that, CLGFormer also incorporates a cascaded operation that splits full features into multiple subsets and then feeds them to the attention heads, which not only reduces computational costs but also enhances attention diversity. Finally, we introduce a feature recombination operation with high jointness and a lightweight channel attention module. The culmination of our efforts is the Context-aware Local and Global interaction Network (CLG-Net), which accurately estimates camera pose and identifies inliers. Through rigorous experiments, we demonstrate that our CLG-Net outperforms existing state-of-the-art methods while exhibiting robust generalization capabilities across various scenarios. Code will be available at https://github.com/guobaoxiao/CLG.

Abstract:
Despite significant advances in continual semantic segmentation (CSS), existing methods still rely on pixel-level annotation to train models, which is time-consuming and labor-intensive. Continual learning from image-level labels is an emerging scheme in continual semantic segmentation to reduce the annotation cost. However, the incomplete and coarse pseudo-labels are insufficient to train a model to maintain a balance between stability and plasticity. To solve these issues, we propose a novel end-to-end framework based on Transformer, called L2A, for Weakly Supervised Continual Semantic Segmentation (WSCSS). In particular, to generate reliable annotations from the image-level supervision, we introduce a semantic affinity from multi-head self-attention (SA-MHSA) module to capture the semantic relationships among adjacent image coordinates. Subsequently, this acquired semantic affinity is employed to refine the initial pseudo-labels of new classes trained with the image-level annotations. Furthermore, to minimize catastrophic forgetting, we propose a semantic drift compensation (SDC) strategy to optimize the pseudo-label generation process, which can effectively improve the alignment of object boundaries across both new and old categories. Comprehensive experiments conducted on the PASCAL VOC 2012 and COCO datasets demonstrate the superiority of our framework in existing WSCSS scenarios and a newly proposed challenge protocol, while remaining competitive with pixel-level supervised CSS methods.

Abstract:
Open-vocabulary semantic segmentation (OVSS) aims to segment an image into regions of corresponding semantic vocabularies, without being limited to a predefined set of object categories. Existing works mainly utilize large-scale vision-language models (e.g., CLIP) to leverage their superior open-vocabulary classification abilities in a two-stage manner. However, their heavy reliance on the first-stage segmentation network leaves the full potential of CLIP untapped, creating an unresolved gap between the rich pre-training knowledge and the challenging per-pixel classification task. Although the recent one-stage paradigm has further leveraged pre-trained vision knowledge from CLIP, it fails to effectively utilize text information due to the inclusion of numerous unrelated semantics in the vocabulary list. How to avoid noise interference in text information and utilize language guidance remains a Gordian knot. In this paper, we propose a bi-directional bridge network (BBN) to bridge the gap between upstream pre-trained models and downstream segmentation tasks. It first purifies the noisy text embedding and then guides semantics-vision aggregation with the purified information in a purification-then-guidance manner, thereby facilitating effective semantic utilization. Specifically, we design an optimal purification modulator to purify noisy text information via the optimal transport algorithm, and a reliable guidance modulator to integrate proper textual information into vision embedding via the designed reliable attention in an adaptive manner. Extensive experimental results on five challenging benchmarks demonstrate that our BBN performs favorably against state-of-the-art open-vocabulary semantic segmentation methods.

Abstract:
In this work, we observe that the generators, which are pre-trained on massive natural images, inherently hold the promising potential for superior low-light image enhancement against varying scenarios. Specifically, for the low-light image enhancement process of a single image, we introduce the pre-trained generators to restore the details and colors degraded by low-light conditions, thereby improving the visual effect. Taking one step further, we introduce a novel optimization strategy, which backpropagates the gradients to the input seeds rather than the parameters of the low-light image enhancement model, thus intactly retaining the generative knowledge learned from natural images and achieving faster convergence speed. Benefiting from the pre-trained knowledge and seed-optimization strategy, the low-light image enhancement model can significantly regularize the visibility and fidelity of the enhanced result, thus rapidly generating high-quality images without training on any low-light dataset. Extensive experiments on various benchmarks demonstrate the effectiveness of the proposed method, showing its potential advantages over numerous state-of-the-art methods both qualitatively and quantitatively.

Abstract:
Traditional fusion methods based on deep learning mainly employ convolutional or self-attention operations to model local or global dependencies, which often lead to the oversight of frequency-domain information. To address this deficiency, we introduce a unified frequency adversarial learning network, termed FreqGAN. Our method involves a frequency-compensated generator that employs discrete wavelet transformation to decompose encoded spatial features into multiple frequency bands. Leveraging skip connections, low and high-frequency components are respectively directed into the encoder and decoder, compensating for additional outline and detail. Moreover, we construct a hybrid frequency aggregation module, which enables a progressive optimization of activity levels across multiple scales and makes the various frequency bands correlated. Complementing our generative model, we devise dual frequency-constrained discriminators. These discriminators are tasked with dynamically adjusting weights for each input frequency band, thereby obligating the generator to accurately reconstruct salient frequency information from different modality images. Additionally, a frequency-supervised function is formulated to further safeguard against the loss of frequency information. Our comprehensive experimental evaluations, encompassing a wide range of fusion tasks and subsequent applications, distinctly highlight FreqGAN’s superior performance, establishing it as a frontrunner in comparison to existing state-of-the-art alternatives. The source codes are forthcoming at: https://github.com/Zhishe-Wang/FreqGAN.
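
To make the band decomposition concrete, the following sketch applies a single-level 1D Haar wavelet transform, which splits a signal into a low-frequency (outline) band and a high-frequency (detail) band and can be inverted exactly. The abstract does not name the wavelet used in FreqGAN, so Haar and the 1D setting are illustrative assumptions.

```python
import math

_SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt(signal):
    """Split an even-length signal into low- and high-frequency bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) * _SCALE for a, b in pairs]   # scaled pairwise sums
    high = [(a - b) * _SCALE for a, b in pairs]  # scaled pairwise differences
    return low, high


def haar_idwt(low, high):
    """Reconstruct the signal from its two Haar bands."""
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * _SCALE, (a - d) * _SCALE])
    return out
```

A constant signal produces an all-zero high band, which shows that the detail band carries only local variation while the low band carries the overall level.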

Abstract:
Zero-shot referring expression comprehension (zero-shot REC) is a crucial yet challenging task in the field of multi-modal understanding, which aims to locate an object described by a referring expression without training on task-specific datasets. Existing methods take advantage of a pre-trained CLIP model to align cropped proposal regions with referring expressions. However, our analysis reveals that this alignment strategy is heavily biased toward certain salient visual regions because CLIP focuses on global-level image-text matching. To mitigate this bias, we propose MCCE-REC, an MLLM-driven cross-modal contrastive entropy model for training-free zero-shot REC. Benefiting from the remarkable in-context comprehension ability of the multi-modal large language model (MLLM), we design a set of referring prompts for the MLLM to generate diverse, detailed, informative, and contrastive cues related to referring objects. Based on these cues, on the one hand, we propose a multi-cue cross-modal interaction network, which associates the visual features and referring object textual features from multiple perspectives and perceives surrounding context object information in a parameter-free manner, avoiding bias towards salient features. On the other hand, we introduce a contrastive similarity entropy selection mechanism that compares the positive and negative cues to suppress biased regions with high similarity scores and emphasize accurate regions that correlate with the referring descriptions. Extensive experiments demonstrate that our MCCE-REC outperforms existing zero-shot methods by a significant margin on various REC datasets.

Abstract:
Current low-light image enhancement (LLIE) techniques effectively enhance luminance but have largely overlooked another factor that harms nighttime visibility: the glow effects of multiple shapes found in the real world. The presence of glow is inevitable due to widespread artificial light sources, and direct enhancement can cause further glow diffusion. In the pursuit of Overall Nighttime Visibility Enhancement (ONVE), we propose a physical-model-guided framework, ONVE, to derive a Nighttime Imaging Model with Near-Field Light Sources (NIM-NLS), whose APSF prior generator is validated efficiently on six categories of glow shapes. Guided by this physical-world model as domain knowledge, we subsequently develop an extensible Light-aware Blind Deconvolution Network (LBDN) to address the blind decomposition challenge on the direct transmission map D and the light source map G based on APSF. Then, an innovative Glow-guided Retinex-based progressive Enhancement module (GRE) is introduced as a further optimization on reflection R from D to reconcile the conflict between glow removal and brightness boost. Notably, ONVE is an unsupervised framework based on a zero-shot learning strategy and uses physical domain knowledge to form the overall pipeline and network. Empirical evaluations on multiple datasets validate the remarkable efficacy of the proposed ONVE in improving nighttime visibility and the performance of high-level vision tasks.

Abstract:
While advanced lightweight models excel at real-time inference on resource-constrained end cameras in general scenarios, they often face limitations in adverse environments because of poor generalization ability. To achieve accurate inference in adverse environments, it becomes imperative to design adaptive model update strategies that can efficiently respond to the occurrence of adverse environments. In this paper, we propose a video analytics system that can continuously and responsively update the on-device lightweight model to handle various adverse environments. Our system consists of three modules, namely, a key frame extractor, a trigger controller, and a retraining manager. The key frame extractor identifies the most informative frames with minimal redundancy for bandwidth-efficient transmission. Those key frames are then used for potential model retraining and updating. Once the trigger controller detects a notable accuracy drop above an adaptive threshold within those selected key frames, it initiates the retraining process and evaluates the current urgency level. Then the retraining manager responds by generating the optimal retraining configuration that strikes a balance between inference accuracy and retraining latency. The retrained model is subsequently deployed to the end camera for a responsive update. The designed system is prototyped on typical end devices and an edge server. Extensive experimental results on real-world datasets demonstrate that the designed system handles various adverse environments robustly, significantly improving the overall detection accuracy (by up to 29%) and reducing the retraining time by more than 50%.

Abstract:
Image compression at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages pre-trained stable diffusion to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the stable diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates. The source code and trained models are available at https://github.com/huai-chang/DiffEIC.

Abstract:
Pedestrian facial attention plays an essential role in autonomous driving scenarios where a vehicle has to handle complex interactions with pedestrians. By inferring whether pedestrians are making eye contact with the ego-vehicle, the intention of pedestrians can be deduced. However, traditional gaze estimation and eye detection algorithms have limitations in complex traffic scenes due to the lower resolution caused by spatial distance and the lack of visual features caused by occlusion. To address these limitations, this study proposes an innovative pedestrian facial attention detection framework. The proposed framework adopts a deep feature fusion strategy to achieve a deep-level fusion of visual features and semantic pose features. Moreover, a multi-modal fusion classifier is proposed that helps discover the cross-modal spatial interactive representation from the feature maps, thus enhancing the robustness of model generalization. The proposed framework is verified by experiments on the public JAAD and LOOK datasets. The experimental results demonstrate the effectiveness of the proposed framework, indicating that it achieves better performance than existing methods.

Abstract:
Infrared and visible image fusion is an image enhancement technique that generates a single image with rich textures and significant objectives in a variety of scenarios, providing great convenience for human discrimination and computer recognition. However, in low-light environments, low-intensity visible images tend to blur valuable information, and these details are often ignored during image fusion, resulting in the loss of important information. Although existing methods take into account the damage of low illumination and highlight the illumination in the fusion process, a large amount of structural information is lost in the process of adjusting illumination, resulting in the lack of texture details and poor performance in high-level vision tasks. To address the above challenges, this paper proposes a structure-aware image fusion method for low illumination scenes, called SLFusion, which enhances the illumination while reducing the loss of structural information, leading to a fused image with richer texture details. We first design an illumination enhancement module to separate the degraded illumination from the scene information in the visible image, and mine more details from the low-intensity regions. Based on the fact that image edge information has a good capability of modeling structures, we design an edge extraction network for low-light visible images to model the structural information, which can accurately highlight important structural information and inject it into the fusion image. The proposed method produces fusion results that not only have good visual perception, but also minimize the loss of structural information. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms state-of-the-art (SOTA) methods in terms of visual quality, quantitative metrics as well as advanced vision tasks.

Abstract:
Scene text reading is an essential component of scene understanding. As its fundamental requirement, text detection has garnered increasing attention. Among the various methods, segmenting the text kernel and extending it to reconstruct text instances is efficient and effective. However, the incomplete semantic features of text kernels and the high similarity between kernels and texts make it hard to extract kernels from images accurately. Considering the above, we propose an efficient text detector, termed CIC, which comprises a bidirectional information transfer module (BITM), a dual knowledge integration module (DKIM), and a cross-verification module (CVM). BITM generates collaborative information between the predicted text and kernel via the proposed differentiable adaptive gap operator. It forces mutual restraint and collaborative progress between the predictions of text and kernel. Unlike BITM, DKIM designs a knowledge fusion scheme, which helps to locate kernels accurately under the guidance of the complete semantic feature of texts. Intuitively, as the kernel is generated by shrinking the text, kernel pixels are present only in the text area. Based on this criterion, the CVM further utilizes text predictions to constrain kernel predictions and reduce false positive predictions. Ablation experiments demonstrate the effectiveness of the proposed BITM, DKIM, and CVM. Extensive experiments show that the proposed CIC outperforms existing state-of-the-art (SOTA) methods on five public datasets from different scenes. The code is available at https://github.com/fengmulin/CIC

Abstract:
With the rapid development of immersive multimedia technology, the growing demand for high-quality visual experiences has driven the emergence of point cloud quality assessment (PCQA). While current deep learning-based PCQA models have achieved breakthroughs in performance, problems such as high computational complexity and limited model generalization ability still need to be solved. In this study, focusing on compression distortion, we analyze and verify that the compression quantization parameter (QP) can be used as a key feature for predicting perceptual quality. Based on this, a novel no-reference point cloud perceptual quality assessment metric, DQP-PCQA, is proposed. Unlike existing PCQA models that only use the mean opinion score (MOS) as a supervisory label, this study proposes a multi-objective constrained optimization scheme that adds the geometric quantization parameter (GQP) and texture quantization parameter (TQP) as auxiliary supervisory labels to help the model learn robust perceptual features that take into account both subjective quality and objective distortion. We conducted comparative experiments with other advanced PCQA models on several mainstream PCQA datasets. The results show that the DQP-PCQA model achieves fast convergence, excellent and stable performance, low complexity, and strong generalization. Further migration experiments show that applying our proposed method to other advanced PCQA models improves their performance as well. Our discovery provides new insight for PCQA research. To facilitate future reproducible research, the source code will be publicly released at https://github.com/Dds46/DQP-PCQA

Abstract:
Current one-stream trackers suffer from limitations in distinguishing targets from complex backgrounds owing to their uniform token division strategy. By treating all regions equally, these methods allocate inadequate attention to crucial target details while overemphasizing redundant background information. Consequently, their performance deteriorates significantly in scenarios involving similar distractors or background clutter. In this work, we propose a Local Implicit Neural Representation (LINR) module specifically designed for local fine-grained object modeling. It consists of two key components: (1) Local Window Selection: Leveraging template-guided CNN-based cross-correlation, it accurately identifies crucial target-relevant regions, reducing background redundancy and computational burden. (2) INR-based Window Refinement: Using implicit neural networks, it optimizes token density and spatial continuity to improve local fine-grained instance-level representations, making the target easier to discriminate from the background. Moreover, the LINR module exhibits three remarkable advantages as a generalized enhancement for visual tracking. Firstly, it is plug-and-play, seamlessly integrating into existing one-stream trackers, both non-real-time and real-time ones, without architectural modifications, achieving significant performance improvements. Secondly, it is highly portable since it does not introduce new loss functions, additional training strategies or data. Thirdly, it is efficiency-friendly, having minimal impact on model parameters and tracking speed, e.g., AQATrack-LINR increases the parameter count by only 1.9% and reduces the tracking speed by only 6 fps. We incorporate the LINR module into two non-real-time trackers, OSTrack based on ViT-B and AQATrack based on HiViT-B, and one real-time tracker, FERMT based on ViT-tiny. The resultant OSTrack-LINR, AQATrack-LINR, and FERMT-LINR achieve state-of-the-art performance across seven widely utilized datasets, such as TrackingNet, LaSOT, and NFS30. The source code is available at https://github.com/Xiaochen918/LINR

Abstract:
Accurate segmentation of diverse structures in pathological images is crucial for medical analysis. While widely used RGB images offer high spatial resolution, microscopic hyperspectral images (MHSIs) provide unique biomedical spectral signatures. Existing multi-modal segmentation methods, however, often suffer from insufficient uni-modal learning, ineffective cross-modal interaction, and nonadaptive multi-modal fusion. Therefore, we propose a novel synergistic multi-modal learning paradigm for co-registered RGB-MHSIs, instantiated within the Synergistic Fusion Network (SyFusNet), which comprises: modality-specific modules and objectives to ensure uni-modal feature extraction, the Mutual Knowledge Sharing Module (MKSM) for explicit cross-modal interaction, and the Adaptive Dual-level Co-decision Module (ADCM) for collaborative multi-modal segmentation. Alongside uni-modal learning, MKSM disentangles MHSI- and RGB-specific features into band- and position-aware guidance, respectively, sharing as cross-modal knowledge to enhance each other’s representations. To fuse multi-modal predictions, ADCM generates global attention from integrated multi-modal features to adaptively refine decision-level outputs, yielding reliable segmentation. Experiments demonstrate that SyFusNet outperforms state-of-the-art methods with statistical significance (p < 0.01), achieving relative IoU gains of 9.35%, 4.63%, and 2.47% on the public PLGC, MDC, and WBC datasets, respectively, while also exhibiting strong generalizability and diagnostic potential through practical applications in multi-class segmentation and tumor regression grading.

Abstract:
Digital image forensics, which aims to verify the authenticity of digital images, has emerged as a prominent research area. To reveal the manipulation history of an image, existing methods either detect only specific image operations or rely on a general forensic feature with high dimensions. Moreover, these methods perform well only when the operation chain length is no greater than 2; their detection accuracy drops significantly for images with longer operation chains, which are more representative of real-world scenarios. To overcome these limitations, we propose a novel forensics frequency Feature based on Histogram and Detail Map (FHDM^(79D)), which can distinguish various operation chains containing different numbers of operations. Specifically, we have discovered that the traces left by image manipulation are more distinct in the frequency domain than in the spatial domain. This observation has prompted us to extract features from the frequency domain of images by analyzing their histograms and detail maps to capture the manipulation traces of the images. Notably, the proposed feature extracted in the frequency domain has almost 90% fewer dimensions than the commonly used general forensic features, such as SRM^(714D), which greatly reduces the computational complexity. Meanwhile, the experiments show that the proposed method achieves a detection accuracy of over 95% for image operations across multiple datasets, while the compared deep learning-based methods do not exceed 90% accuracy. Extensive experimental results show that the proposed method is more versatile and effective, showing good performance in complex operation chain detection and local forgery detection. The code is available at https://github.com/CherishL-J/Op-detection

Abstract:
Anti-aliasing is a crucial research topic in computer graphics, which can significantly enhance the rendering quality of neural radiance fields (NeRF). Recent studies have introduced effective anti-aliasing NeRF methods, utilizing cone-casting to replace ray-casting and modeling the 3D observation area of pixels as circular cones. The cone-casting strategy has successfully reduced blurring and aliasing in novel view rendering. However, we have observed that the light cones are not standard circular cones because the camera projection model distorts them into elliptic cones of diverse sizes and shapes. This finding motivates us to model pixel light cones as anisotropic elliptic cones and propose an elliptic cone-casting-based anti-aliasing NeRF method called “ECC-NeRF”. Specifically, we first derive the elliptic cone models for common pinhole, fisheye, and panoramic cameras based on their camera projection models. Then, we integrate the proposed elliptic cone-casting into two representative cone-casting-based anti-aliasing NeRF methods: Mip-NeRF and Zip-NeRF. Our experimental evaluations on multiple datasets demonstrate that our method can achieve more accurate multi-scale anisotropic representation and better novel view rendering quality with negligible additional computation cost.

Abstract:
Underwater Image Quality Assessment (UIQA) plays an important role in assessing the effectiveness of Underwater Image Enhancement (UIE) algorithms and in evaluating the quality of underwater images. However, accurate UIQA that is consistent with human perception remains challenging. This difficulty is attributed, on the one hand, to the lack of UIQA data reflecting real human visual perception and, on the other hand, to the fact that the quality feature representations used by existing UIQA algorithms are inconsistent with human perception. To address these issues, we introduce a Large-scale Underwater Image Quality Dataset (LUIQD) and propose a UIQA network named Perception-Aware Underwater image Quality Assessment Network (PAUQA-Net). Specifically, the LUIQD includes 64,180 real and enhanced underwater images covering a wide range of scenes, targets, and imaging conditions, together with their perceptual quality scores. Based on the analysis of the mechanisms of human perception, we further design the data-driven PAUQA-Net, which integrates an efficient convolutional attention vision Transformer to extract multi-scale features through a multi-path structure. Considering the specificity of human perception of underwater images, color and sharpness features from the chrominance and luminance domains are extracted and fused with local and global image features for joint feature interaction. Extensive experiments conducted on LUIQD and other datasets demonstrate that the proposed PAUQA-Net achieves superior assessment performance compared with the most popular UIQA and IQA methods. The code and dataset can be found at https://github.com/CatchACat083/PAUQA

Abstract:
Continuous sign language recognition (CSLR) plays a crucial role in facilitating communication between hearing individuals and those who cannot hear. A key aspect of achieving precise CSLR is the alignment of the video segment of each sign with its gloss, namely its corresponding text representation in natural language. However, the coarticulation phenomenon, where contextual dependencies between adjacent signs blur the boundaries of individual signs, poses a significant challenge to this task. In this paper, we propose a novel boundary-aware sentence-gloss alignment network for CSLR to address this challenge. Our network first introduces a task-relevant boundary-aware similarity measurement, which evaluates sign frames by both their appearance and their recognition contribution, mitigating coarticulation-induced transition noise to restore precise boundaries. For enhanced alignment, we propose a hierarchical sentence-gloss alignment: coarse sentence-level alignment reduces cross-modal disparity, while fine-grained gloss-level alignment refines video-to-token mapping. Finally, an adaptive class-divergence loss sharpens gloss decoding by maximizing inter-class discrimination. Our proposed framework provides a simple and effective solution to mitigate the boundary ambiguity caused by coarticulation, optimizing continuous sign language recognition algorithms from a new perspective. Extensive experiments conducted on four public sign language recognition (SLR) datasets demonstrate that our proposed boundary-aware sentence-gloss alignment network learns precise alignments and achieves state-of-the-art performance.

Abstract:
Image-text matching remains challenging in big data processing. Matching accuracy is influenced by various factors, including the correlation between images and texts, feature extraction and fusion. Although activation functions play a crucial role in image-text matching, their design has received limited attention. This paper proposes an image-text matching model that utilizes multi-scale feature fusion based on a piecewise polynomial activation function. On one hand, a feature correlation optimization method is proposed to minimize the distance between paired images and texts. This method introduces large-scale downsampling odd-even feature embeddings and mean downsampling feature embeddings. After feature enhancement using a self-attention module, the odd-even feature embeddings are corrected with large-scale mean downsampling features to improve their representational ability. Additionally, a new multi-scale feature fusion method is utilized to enhance the robustness of the feature enhancement algorithm. On the other hand, we propose PCPAF (Piecewise Cubic Polynomial Activation Function), which offers advantages such as low computational cost, C^1 continuity, and superior generalizability. The PCPAF significantly improves model accuracy compared to existing activation functions. By adjusting the parameters of PCPAF, different activation functions can be derived, thereby improving matching accuracy in other image-text matching models. Experimental results on the Flickr30k and MS-COCO datasets demonstrate that the proposed model outperforms state-of-the-art models in terms of overall performance.

Abstract:
Video grounding tasks have recently gained significant attention. However, existing methods fail to fully comprehend the semantics within queries and videos, often overlooking key content. Moreover, the lack of fine-grained cross-modal alignment and interaction to guide the semantic matching of complex texts and videos leads to inconsistent representational modeling. To address these issues, we propose a Semantic Hierarchical Grounding model, referred to as SHG, and design a cross-modal semantic hierarchical graph to achieve fine-grained semantic understanding. SHG decomposes both the query and each video moment into three levels: global, action, and element. This topology, ranging from global to local, establishes multi-granularity intrinsic connections between the two modalities, fostering a comprehensive understanding of dynamic semantics and fine-grained cross-modal matching. Accordingly, to fully leverage the rich information within the cross-modal semantic hierarchical graph, we employ contrastive learning by seeking samples with the same action and element semantics, and then achieve node-moment cross-modal hierarchical matching for global alignment. This approach can unearth fine-grained clues and align semantics across multiple granularities. Moreover, we incorporate the designed hierarchical graph interaction for coarse-to-fine fusion of text and video, thereby enabling highly accurate video grounding. Extensive experiments conducted on three challenging public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed approach outperforms state-of-the-art techniques, validating its effectiveness.

Abstract:
Polarimetric synthetic aperture radar (PolSAR) image change detection (CD) aims to accurately analyze the difference and detect changes in PolSAR images. Recently, graph transformer (GT), which combines the advantages of graph convolutional network and transformer, has increasingly attracted attention in the field of remote sensing. However, the direct application of GT for PolSAR image CD with limited training samples is challenging owing to polarimetric scattering confusion and random speckle noise. Here, we propose a novel unsupervised representation learning framework for CD in PolSAR images, named statistic-guided difference enhancement GT (SDEGT). Our motivation is that polarimetric statistics can effectively guide GT to extract robust and highly discriminative features from the raw polarimetric graphs and thus accurately detect changes. The SDEGT follows the architecture based on neighborhood aggregation GT and innovatively introduces polarimetric statistics to guide feature difference enhancement, thereby capturing the structural interaction between graph nodes and aggregating the local-to-global change correlations at low computational cost. First, SDEGT innovatively introduces noise-robust polarimetric statistics to improve its noise suppression ability and learn sufficient change-aware features from the PolSAR data. Subsequently, guided by the polarimetric statistical difference, a difference enhancement module (DEM) is designed and cleverly embedded in the SDEGT to adaptively enhance the difference between changed and unchanged nodes, thus improving the discrimination of the change-aware features. Finally, symmetric cross-entropy (SCE) is employed to facilitate the robust learning of SDEGT and attenuate the detrimental effect of label noise. Visual and quantitative experimental results on five measured PolSAR datasets with different scenes and dimensions demonstrate the competitiveness of our SDEGT over other state-of-the-art methods.
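
The symmetric cross-entropy mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly used formulation from the label-noise literature, in which a standard cross-entropy term is combined with a reverse cross-entropy term and log(0) is clamped to a negative constant; the abstract does not state how SDEGT configures the loss, so the weights `alpha` and `beta` and the constant `log_zero` are placeholder assumptions.

```python
import math


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross-entropy for one sample with a hard (one-hot) label.

    probs    : predicted class probabilities (sums to 1)
    label    : index of the annotated class
    alpha    : weight of the standard cross-entropy term (assumed value)
    beta     : weight of the reverse cross-entropy term (assumed value)
    log_zero : stand-in for log(0) in the reverse term (assumed value)
    """
    eps = 1e-12  # avoids log(0) in the forward term
    # Forward CE: -sum_k q_k * log(p_k) with a one-hot q.
    ce = -math.log(max(probs[label], eps))
    # Reverse CE: -sum_k p_k * log(q_k); log(1) = 0 for the labelled class,
    # and log(0) is replaced by log_zero for every other class.
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce
```

With a one-hot label the reverse term reduces to `-log_zero * (1 - probs[label])`, which is bounded; this boundedness is what makes the combined loss less sensitive to mislabelled samples than cross-entropy alone.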

Affiliations: Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Shenzhen, Guangdong, China; School of Computing Science, Simon Fraser University, Burnaby, Canada; Yale School of Medicine, Yale University, New Haven, USA; Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada; Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada; Shenzhen University of Advanced Technology, Faculty of Computer Science and Control Engineering, Shenzhen, Guangdong, China

Abstract:
The event camera, an asynchronous vision sensor that captures scene dynamics, presents new opportunities for highly efficient 3D human pose tracking. Existing approaches typically adopt modern-day Artificial Neural Networks (ANNs), such as CNNs or Transformers, where sparse events are converted into dense images or paired with additional gray-scale images as input. Such practices, however, ignore the inherent sparsity of events, resulting in redundant computations, increased energy consumption, and potentially degraded performance. Motivated by these observations, we introduce the first sparse Spiking Neural Networks (SNNs) framework for 3D human pose tracking based solely on events. Our approach eliminates the need to convert sparse data to dense formats or incorporate additional images, thereby fully exploiting the innate sparsity of input events. Central to our framework is a novel Spiking Spatiotemporal Transformer, which enables bi-directional spatiotemporal fusion of spike pose features and provides a guaranteed similarity measurement between binary spike features in spiking attention. Moreover, we have constructed a large-scale synthetic dataset, SynEventHPD, that features a broad and diverse set of 3D human motions, as well as much longer hours of event streams. Empirical experiments demonstrate the superiority of our approach over existing state-of-the-art (SOTA) ANN-based methods, while requiring only 19.1% of the FLOPs and 3.6% of the energy cost. Furthermore, our approach outperforms existing SNN-based benchmarks in this task, highlighting the effectiveness of our proposed SNN framework. The dataset will be released upon acceptance, and code can be found at https://github.com/JimmyZou/HumanPoseTracking_SNN

Affiliations: Institute of Intelligent Information Processing, Taizhou University, Taizhou, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; Huawei Cloud and AI, Shenzhen, China

Abstract:
Multimodal Emotion Recognition (MER) leverages multiple input signals to identify the expressed emotions in user-generated data. Currently, effectively addressing both modality heterogeneity and homogeneity in MER tasks is challenging due to the diversity of multimodal inputs in videos. To address this issue, this work proposes an efficient Multimodal Decoupling Method with Knowledge Aggregation and Transfer (MDKAT) for robust multimodal feature learning in emotional videos. MDKAT consists of three key steps: modality-independent feature extraction, modality-specific feature extraction, and multi-loss integration for decoupling. Across these three steps, four crucial modules are individually designed to improve different aspects of multimodal learning on MER tasks: a Cross-modal Feature Fusion (CFF) module for enhancing modality-independent features, an Adaptive Masked Self-Attention (AMSA) module for feature refinement, a Knowledge Aggregation (KA) module for ensuring the semantic similarity of modality-independent features, and a Knowledge Transfer (KT) module for balancing the strengths of different modalities. Experimental results on the typical CMU-MOSI and CMU-MOSEI datasets show that MDKAT obtains superior performance over state-of-the-art methods, demonstrating the effectiveness of MDKAT on MER tasks.

Abstract:
The purpose of deep image hiding is to embed the secret image imperceptibly in an equally sized cover image, and then recover the secret image almost perfectly at the receiver end. How to improve the quality of recovered secret images while ensuring the visual quality and security of stego images is an important challenge. In order to address this issue, a novel deep image hiding framework called EFCA-DIH (Edge Features and Coordinate Attention-based Invertible Network for Deep Image Hiding) is proposed. Firstly, an important feature extraction module is proposed to extract wavelet sub-band features coupled with edge features, thereby hiding the secret image better in the cover image. Secondly, a coordinate attention mechanism is introduced into the invertible hidden module to embed the secret information in the complex texture regions. Finally, an edge feature loss function is designed to constrain the edge differences between the stego image and the cover image, and between the secret image and the recovered secret image, thereby improving the quality of both the stego image and the recovered secret image. Experimental results have demonstrated that our EFCA-DIH significantly improves the quality of recovered secret images compared with other state-of-the-art methods, while maintaining the visual quality and security of stego images.

Abstract:
Deep learning performance may decrease substantially with unseen heterogeneous data. While most unsupervised domain adaptation (UDA) methods seek to address this through image alignment, they often ignore uncertain style fluctuations within the target domain. When testing image styles vary in both direction and intensity, such models may fail to adapt. Furthermore, existing UDA methods tend to over-rely on domain-level alignment of entire features, potentially over-exploiting cues that are independent of semantic content (e.g., intensity) as shortcut features. To address these limitations, this paper introduces an innovative and model-agnostic Causality-inspired Representation Learning Based on Target Style Imitation method for UDA. Specifically, we propose a novel Target Style Imitation (TSI) data augmentation approach to diversify the training data and align training and unseen target testing image styles. TSI constructs a Gaussian distribution for the target domain style and simulates unseen testing style variations through random sampling. Additionally, inspired by the stable and generalizable causal mechanism, we propose Causality-inspired Representation Learning (CRL) based on the TSI method to make feature representations adhere to the causal properties (i.e., Separation and Independence) essential for robust UDA, thus encouraging the model to focus on domain-invariant semantic features. Our method surpasses state-of-the-art methods on two cross-modality medical image segmentation datasets.
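
The idea of fitting a Gaussian to the target-domain style and sampling from it can be sketched as below. This is only an illustration under strong assumptions: it treats "style" as the per-image intensity mean and standard deviation and transfers a sampled style by renormalizing a source image, which is a common choice in style-augmentation work but is not confirmed by the abstract. All function names are hypothetical, and images are flattened to plain lists of pixel values.

```python
import random
import statistics


def image_style(pixels):
    """Assumed style descriptor: intensity mean and standard deviation."""
    return statistics.fmean(pixels), statistics.pstdev(pixels)


def fit_style_gaussian(target_images):
    """Fit independent Gaussians over the target-domain style statistics."""
    means, stds = zip(*(image_style(img) for img in target_images))
    return {
        "mean": (statistics.fmean(means), statistics.pstdev(means)),
        "std": (statistics.fmean(stds), statistics.pstdev(stds)),
    }


def imitate_target_style(source, style_gaussian, rng):
    """Restyle a source image with a style sampled from the fitted Gaussians."""
    mu_s, sigma_s = image_style(source)
    new_mu = rng.gauss(*style_gaussian["mean"])
    # A standard deviation cannot be negative, so take the absolute value.
    new_sigma = abs(rng.gauss(*style_gaussian["std"]))
    scale = new_sigma / (sigma_s + 1e-8)
    # Remove the source statistics, then apply the sampled target statistics.
    return [(p - mu_s) * scale + new_mu for p in source]
```

For example, `imitate_target_style(src, fit_style_gaussian(targets), random.Random(0))` returns a copy of `src` whose intensity statistics follow one random draw from the target style distribution; calling it repeatedly yields different simulated target styles for the same content.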

Abstract:
Research on emotion recognition in conversations emphasises the importance of complex relationships between conversational context and multimodality. Graph-based methods, particularly hypergraph-based methods, have shown promise in capturing these relationships. However, challenges persist in avoiding redundant context while capturing essential information for optimal context embeddings and fully leveraging cross-modal complementarities for sufficient fusion. In contrast, the human brain flexibly retrieves relevant memories and integrates multimodal data for accurate recognition. Motivated by this capability, we propose BrainyHGNN, a brain-inspired hypergraph neural network. It integrates a Dynamic Memory Selector for contextual hyperedges, mimicking selective memory retrieval mechanisms for adaptive and modality-specific context retrieval. HierSensNet is designed for multimodal hyperedges, mirroring hierarchical cross-modal interaction mechanisms to ensure effective multimodal fusion. Experimental results on two benchmark datasets validate the superior performance of BrainyHGNN, confirming the effectiveness of its approach. This work highlights the potential of brain-inspired methods to advance flexible context retrieval and sufficient multimodal fusion, presenting a promising direction for future research in this domain.

Abstract:
Unsupervised video object segmentation aims to detect the most salient object in a video without any external guidance regarding the object. Salient objects often exhibit distinctive movements compared to the background, and recent methods leverage this by combining motion cues from optical flow maps with appearance cues from RGB images. However, because optical flow maps are often closely correlated with segmentation masks, networks can become overly dependent on motion cues during training, leading to vulnerability when faced with confusing motion cues and resulting in unstable predictions. To address this challenge, we propose a novel motion-as-option network that treats motion cues as an optional component rather than a necessity. During training, we randomly input RGB images into the motion encoder instead of optical flow maps, which implicitly reduces the network’s reliance on motion cues. This design ensures that the motion encoder is capable of processing both RGB images and optical flow maps, leading to two distinct predictions depending on the type of input provided. To make the most of this flexibility, we introduce an adaptive output selection algorithm that determines the optimal prediction during testing. Code and models are available at https://github.com/suhwan-cho/TMO.
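
The training-time input swap described above can be sketched as follows. This is an illustration only: the function name `pick_motion_encoder_input` and the swap probability `rgb_prob` are assumptions, since the abstract does not state how often RGB images replace optical flow maps.

```python
import random

def pick_motion_encoder_input(rgb, flow, rgb_prob, rng):
    """Choose what the motion encoder sees for one training step.

    With probability `rgb_prob` (a hypothetical hyperparameter) the RGB image
    is fed in place of the optical flow map, so the network cannot rely on
    motion cues always being present.
    """
    return rgb if rng.random() < rgb_prob else flow

rng = random.Random(0)
# Placeholder strings stand in for image tensors.
choices = [pick_motion_encoder_input("rgb", "flow", 0.5, rng) for _ in range(1000)]
rgb_fraction = choices.count("rgb") / len(choices)
```

Because the same encoder is trained on both input types, two predictions (one per input type) are available at test time, which is what the adaptive output selection step chooses between.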

Abstract:
Deep-learning-based image inpainting technology has achieved remarkable visual consistency but is vulnerable to malicious use. Existing detection methods overlook semantic inconsistencies between targets and backgrounds, leading to ambiguous results due to low discriminability. To tackle these challenges, we draw inspiration from human strategies in visual tasks, which involve initially assigning uncertainty across the entire input and subsequently concentrating on highly uncertain regions using prior knowledge like boundary information. Building on this, we propose a Dual Information Guided Network (DIGNet). It combines object-background semantic modulation with uncertainty to precisely locate inpainting regions. This is the first work to address inpainting prediction inaccuracies by considering both edge uncertainty and semantic inconsistency. DIGNet consists of three key parts: the Edge Uncertainty Awareness Module (EUAM), the Edge Correction Module (ECM) based on semantic differences, and the Dual Information Guided Interaction Module (DIGIM). We use semantic inconsistency to get edge constraints and quantify uncertainty as feature variance to guide mainstream feature maps. The DIGIM effectively fuses guide information for accurate predictions. Comprehensive experiments show that our method outperforms existing CNN-based approaches. Specifically, it improves the F1 Score by at least 0.31% and the IOU by at least 0.19% on multiple datasets.

Abstract:
Human skeletons provide a compact representation for action recognition. Compared to 3D skeletons, 2D skeletons lack view-independence and depth, making them less robust for motion analysis. However, 3D skeleton data requires specialized hardware, limiting its practicality, especially in outdoor or dynamic settings. In contrast, 2D skeletons can be extracted from standard RGB videos, making them more accessible. To combine the strengths of both, we propose 2D3-SkelAct, a 2D skeleton-based action recognition model. It maps 2D inputs to a 3D latent space, where pose and view features are decoupled. Additionally, 2D3-SkelAct distills motion cues from 3D models, enhancing motion detail capture while keeping the benefits of 2D data. Specifically, the pipeline of our 2D3-SkelAct consists of two steps: pose-view decoupling and pose-view distilling. First, we use a spatio-temporal transformer to decouple 2D skeleton sequences into latent pose and view features, enhancing the model’s ability to learn motion dynamics. Next, these decoupled features are separately integrated into the 2D skeleton model through two cross-attention modules, allowing it to extract discriminative motion features while mitigating uncertainties in 3D viewpoint and depth. Additionally, we distill motion cues from 3D models to compensate for the limitations of 2D skeletons. Remarkably, our model can seamlessly integrate with various skeleton feature extractors. We validate the proposed 2D3-SkelAct through extensive experiments, demonstrating its adaptability across different model architectures, with consistent improvements achieved. When combined with advanced skeleton feature extractors, 2D3-SkelAct achieves state-of-the-art performance in 2D skeleton-based action recognition.

Abstract:
This paper introduces a high-quality talking head generation method that is jointly driven by keypoints and action units, aiming to strike a balance between low-bandwidth transmission and high-quality generation in video conference scenarios. Existing methods for talking head generation often face limitations: they either require an excessive amount of driving information or struggle with accuracy and quality when adapted to low-bandwidth conditions. To address this, we decompose the talking head generation task into two components: a driving task, focused on information-limited control, and an enhancement task, aimed at achieving high-quality, high-definition output. Our proposed method innovatively incorporates the joint driving of keypoints and action units, improving the accuracy of pose and expression generation while remaining suitable for low-bandwidth environments. Furthermore, we implement a multi-step video quality enhancement process, targeting both the entire frame and key regions, while incorporating temporal consistency constraints. By leveraging attention mechanisms, we enhance the realism of the challenging-to-generate mouth regions and mitigate background jitter through background fusion. Finally, a prior-driven super-resolution network is employed to achieve high-quality display. Extensive experiments demonstrate that our method effectively supports low-resolution recording, low-bandwidth transmission, and high-definition display.

Abstract:
Traditional JPEG image encryption that prioritizes solely confidentiality fails to account for the pressing usability requirements of cloud-based environments, which has driven the rise of thumbnail-preserving encryption (TPE) to balance image privacy and usability. However, existing TPE schemes for JPEG images face numerous challenges, such as insufficient security, inability to achieve lossless decryption, and high file extension. To address these challenges, we propose a TPE scheme based on dynamic M-ary decomposition and adaptive threshold constraints (TPE-MDTC). First, the valid ranges of quantized DC coefficients for JPEG images are determined. Then, a sum-preserving encryption method for quantized DC coefficients with compliance threshold constraints is designed using bit-plane permutation to preserve thumbnails with high accuracy. Next, the introduction of dynamic M-ary decomposition effectively changes the bit statistical characteristics preserved by bit-plane permutation, enhancing the ciphertext security. Finally, a quantized AC encryption method with RV (Run/Value) pair global permutation is proposed, effectively modifying the unit block features, thereby significantly improving the security and attack resistance of encrypted images. Experimental results show that the proposed TPE-MDTC scheme can reconstruct the original JPEG images without loss, and the generated ciphertext images exhibit significant advantages over previous schemes regarding file extension and security.

Abstract:
Understanding how the brain responds to external stimuli and decoding this process has been a significant challenge in neuroscience. While previous studies typically concentrated on brain-to-image and brain-to-language reconstruction, our work strives to reconstruct gestures associated with speech stimuli perceived by the brain. Unfortunately, the lack of paired brain, speech, and gesture data hinders the deployment of deep learning models for this purpose. In this paper, we introduce a novel approach, fMRI2GES, that allows training of fMRI-to-gesture reconstruction networks on unpaired data using Dual Brain Decoding Alignment. This method relies on two key components: 1) observed texts that elicit brain responses, and 2) textual descriptions associated with the gestures. Then, instead of training models in a completely supervised manner to find a mapping relationship among the three modalities, we harness an fMRI-to-text model, a text-to-gesture model with paired data, and an fMRI-to-gesture model with unpaired data, establishing dual fMRI-to-gesture reconstruction patterns. Afterward, we explicitly align the two outputs and train our model in a self-supervised manner. We show that our proposed method can reconstruct expressive gestures directly from fMRI recordings. We also investigate fMRI signals from different ROIs in the cortex and how they affect generation results. Overall, we provide new insights into decoding co-speech gestures, thereby advancing our understanding of neuroscience and cognitive science.

Abstract:
Panoramic depth estimation is crucial for acquiring comprehensive 3D environmental perception information, serving as a foundational basis for numerous panoramic vision tasks. The key challenge in panoramic depth estimation is how to address various distortions in 360° omnidirectional images. Most panoramic images are displayed as 2D equirectangular projections, which exhibit significant distortion, particularly with the severe fisheye effect near the equatorial regions. Traditional depth estimation methods for perspective images are unsuitable for such projections. On the other hand, cubemap projection consists of six distortion-free perspective images, allowing the use of existing depth estimation methods. However, the boundaries between faces of a cubemap projection introduce discontinuities, causing a loss of global information when using cube maps alone. In this work, we propose an innovative geometric priors assisted dual-projection fusion network (GADFNet) that leverages geometric priors of panoramic images and the strengths of both projection types to enhance the accuracy of panoramic depth estimation. Specifically, to better focus the network on key areas, we introduce a distortion perception module (DPM) and incorporate geometric information into the loss function. To more effectively extract global information from the equirectangular projection branch, we propose a scene understanding module (SUM), which captures features from different dimensions. Additionally, to achieve effective fusion of the two projections, we design a dual projection adaptive fusion module (DPAFM) to dynamically adjust the weights of the two branches during fusion. Extensive experiments conducted on four public datasets (including both virtual and real-world scenarios) demonstrate that our proposed GADFNet outperforms existing methods, achieving superior performance.

Abstract:
Scale variation of objects remains one of the crucial challenges in object detection. Currently, conventional dense detectors with fixed receptive fields and label weights are not conducive to the detection of multi-scale objects. However, the design limitations of unbalanced label weights and fixed refinement for multi-scale objects and multi-tasks in these studies make it difficult to achieve better detection performance. In this paper, we propose a novel dense detector named Balanced FCOS, which consists of two components: Balanced Label Assignment (BLA) and Flexible Shape-based Refinement (FSR). The BLA implements scale-balanced sample assignment by introducing reweighting factors consisting of localization and classification scores into the label assignment. Low-quality but high-weight samples can be weakened by the BLA. Furthermore, we design a cross-reweighting mechanism in the BLA to ensure score consistency between classification and localization. The FSR implements scale-balanced sample refinement by learning flexible sample points’ offsets for multi-scale objects and multi-tasks based on objects’ coarse features, yielding more discriminative features with an appropriate receptive field. In addition, the better features obtained by the FSR help produce better classification and localization scores, which can be used by the BLA to produce accurate label weights. Equipped with only the BLA, we can achieve 41.7/46.6 AP under R50/R101-FCOS without any additional parameters. When combining the BLA with the FSR, our Balanced FCOS achieves SOTA results among dense detectors on the COCO test-dev set. Experiments conducted on other heads (T-Head, DyHead), detectors (DINO), and datasets (AI-TOD) further demonstrate the effectiveness of our method.

Abstract:
Unlike natural videos, where artifacts are distributed evenly, the artifacts of compressed screen content videos mainly occur in the edge areas. Besides, these videos often exhibit abrupt scene switches, resulting in noticeable distortions in video reconstruction. Existing multiple-frame models using a fixed range of neighbor frames face challenges in effectively enhancing frames during scene switches and lack efficiency in reconstructing high-frequency details. To address these limitations, we propose a novel method that effectively handles scene switches and reconstructs high-frequency information. In the feature extraction part, we develop long-term and short-term feature extraction streams, in which the long-term feature extraction stream learns the contextual information, and the short-term feature extraction stream extracts more related information from shorter input to assist the long-term stream in handling fast motion and scene switches. To further enhance the frame quality during scene switches, we incorporate a similarity-based neighbor frame selector before feeding frames into the short-term stream. This selector identifies relevant neighbor frames, aiding in the efficient handling of scene switches. To dynamically fuse the short-term and long-term features, the multi-scale feature distillation focuses on adaptively recalibrating channel-wise feature responses to achieve effective feature distillation. In the reconstruction part, a high-frequency reconstruction block is proposed for guiding the model to restore the high-frequency components. Experimental results demonstrate the significant advancements achieved by our proposed Long Short-term Fusion by Multi-Scale Distillation (LSFMD) method in enhancing the quality of compressed screen content videos, surpassing the current state-of-the-art methods.

Abstract:
Traditional in-the-wild image quality assessment (IQA) models are generally trained with the quality labels of mean opinion score (MOS), while missing the rich subjective quality information contained in the quality ratings, for example, the standard deviation of opinion scores (SOS) or even the distribution of opinion scores (DOS). In this paper, we propose a novel IQA method named RichIQA to explore the rich subjective rating information beyond MOS to predict image quality in the wild. RichIQA is characterized by two key novel designs: 1) a three-stage image quality prediction network which exploits the powerful feature representation capability of the Convolutional vision Transformer (CvT) and mimics the short-term and long-term memory mechanisms of the human brain; 2) a multi-label training strategy in which rich subjective quality information such as MOS, SOS, and DOS is concurrently used to train the quality prediction network. Powered by these two novel designs, RichIQA is able to predict the image quality in terms of a distribution, from which the mean image quality can be subsequently obtained. Extensive experimental results verify that the three-stage network is tailored to predict rich quality information, while the multi-label training strategy can fully exploit the potential within subjective quality ratings and enhance the prediction performance and generalizability of the network. RichIQA outperforms state-of-the-art competitors on multiple large-scale in-the-wild IQA databases with rich subjective rating labels. The code of RichIQA will be made publicly available on GitHub.
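
The relationship between a predicted distribution of opinion scores and the mean quality derived from it can be made concrete with a short worked example. This sketch assumes a discrete five-level rating scale; the helper name `summarize_dos` and the scale are illustrative and not taken from the paper.

```python
import math

def summarize_dos(probabilities, scores):
    """Derive MOS and SOS from a discrete distribution of opinion scores (DOS).

    `probabilities[i]` is the predicted share of ratings equal to `scores[i]`.
    The distribution is renormalized first, so unnormalized histograms work too.
    """
    total = sum(probabilities)
    probs = [p / total for p in probabilities]
    mos = sum(p * s for p, s in zip(probs, scores))
    variance = sum(p * (s - mos) ** 2 for p, s in zip(probs, scores))
    return mos, math.sqrt(variance)

# A symmetric distribution over the ratings 1..5 (assumed scale).
mos, sos = summarize_dos([0.1, 0.2, 0.4, 0.2, 0.1], [1, 2, 3, 4, 5])
```

Two images can share the same MOS while differing in SOS, which is the kind of information lost when only MOS labels are used for training.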

Abstract:
Multimodal monocular depth estimation methods based on deep learning have achieved competitive performance in recent years. However, existing Contrastive Language-Image Pre-training (CLIP)-based multimodal networks often suffer from incomplete fusion of the two modalities and lack multi-scale contextual information. To remedy these issues, this paper proposes HoCLIP, a high-order feature and attention-assisted CLIP model for monocular depth estimation. Specifically, with the CLIP model as the backbone, the Matrix Power Normalization Covariance Pooling (MPN-COV) technique is employed for high-order statistical modeling of the image features captured by the visual encoder. These features are then combined with learnable deep prompts before being fed into the text encoder, facilitating enhanced fusion of text and image and enabling the extraction of more intricate statistical information and spatial structure. Furthermore, the Efficient Multi-Scale Attention (EMA)-Decoder is utilized for the reconstruction of depth maps. This structure captures contextual information across different scales, establishes long-range dependencies between features, and preserves spatial position information. Finally, a vertical discriminator with embedded vertical attention is integrated into the model’s final stages to capture vertical features and refine depth map generation. Extensive experiments are conducted on the NYU Depth V2 and KITTI datasets, and the results show that the proposed method has a decisive improvement over the state-of-the-art multimodal methods and exhibits robust competitiveness across all metrics.

Abstract:
Visual localization for unmanned aerial vehicles (UAVs) provides a promising solution for ensuring reliable navigation, especially in environments where Global Navigation Satellite System (GNSS) signals are obstructed or unavailable. It can be implemented without reliance on external signals or additional equipment. However, significant changes in the appearance of the ground surface, as well as variations in illumination, rotation, and viewpoint, make the matching between UAV imagery and reference satellite images challenging. In this study, a UAV visual localization method based on deep learning features with steerable semantic information and density-based clustering is proposed to enhance the robustness and accuracy of localization. The proposed lightweight semantic-aware steerable feature extraction network (SemSNet) integrates a multilevel reduced semantic segmentation block (MR-SSB) to extract local features with rotation-invariant semantic and structural information. MR-SSB learns pixel-level semantic features in a domain-adaptive manner by leveraging semantic information from an off-the-shelf semantic segmentation network combined with a semantic label remapping technique. During the matching phase, a hybrid density-based clustering integrating multiple Gaussian models (MGMDBC) identifies corresponding reference regions to maximize covisibility. Experiments conducted on public and self-collected challenging UAV datasets demonstrate that our method can effectively overcome severe changes in viewpoint and ground surface. Our method achieves a minimal average localization error of under 5 m and a maximum improvement of 6.38 m in accuracy compared to previous models.

Abstract:
Remote physiological measurement from facial videos, exemplified by remote photoplethysmography (rPPG) technology, has attracted considerable attention for its potential in many applications. While recent advances in remote physiological measurement have achieved great success, what is often overlooked in previous studies is the periodic nature of physiological signals. In this study, we present long short-term temporal shift (LSTS), a novel neural network designed to effectively model the periodicity in physiological signals. We propose the periodic channel shift (PCS) mechanism to represent the periodic nature of physiological signals by selectively shifting channels between frames in adjacent periods. Additionally, to help models focus more on inter-period variations in the videos, we also propose TSAug, a shift-based data augmentation technique to suppress intra-period variations. Furthermore, we propose a simple input preprocessing scheme through color space transformation, termed multi-scale plane-orthogonal-to-skin (MPOS), to better capture the rPPG clues in videos. Extensive experiments show that the proposed LSTS model not only achieves performance superior to or on par with the state of the art on four benchmark datasets, but also exhibits outstanding generalizability across people of different races and skin tones, making LSTS an inclusive model that can benefit a wide range of users. The code will be available at https://github.com/Promisery/LSTS.

Abstract:
Diffusion models show great potential in solving inverse problems, including MRI reconstruction. With its unique characteristics, medical imaging demands both efficiency and accuracy in the reconstruction process. However, existing MRI reconstruction methods based on diffusion models often fall short of fully leveraging the available measurements during sampling. Consequently, these methods suffer from compromised reconstruction quality and elevated bias, especially when dealing with large acceleration factors. In response to these challenges, we propose Dual Manifold Constraints (DMC), a fast MRI reconstruction method based on diffusion models. We treat the sampling process as a combination of denoising and adding noise processes, and we constrain these two processes using both pristine measurements and their noisy counterparts to adapt to the geometry of diffusion. It’s worth noting that we propose a method to estimate the noisy measurement that satisfies the sub-sampling process to maintain the current data manifold when performing data consistency constraints. Experimental results show that our method outperforms the latest diffusion-based methods regarding both reconstruction speed and accuracy, and exhibits strong out-of-distribution generalization performance.

Abstract:
6D object pose estimation from a single RGB-D image is a fundamental problem in computer vision and robot manipulation. Despite recent advancements, existing methods still suffer from several limitations. First of all, the object shape representation extracted from the depth map is often less expressive because the object point cloud parsed from the depth map is highly incomplete due to object self-occlusion and noisy due to sensor artifacts. This shape representation issue further intensifies when sufficient labeled data for model training is lacking, which unfortunately is another typical problem for object pose estimation considering the heavy annotation cost of real-world pose labeling. In this study, we propose to tackle the above issues in a unified way. First, we enhance the object shape representation from the partial point cloud with a novel canonical shape reconstruction module, in which an implicit canonical frame is established by incorporating SE(3) equivariance, achieving implicit feature alignment of the partial point cloud inputs and leading to robust shape recovery. Second, based on the enhanced object representation, we further utilize the de-canonicalized and pose-dependent completed object shape as the training signal, and develop a novel weakly-supervised learning framework to leverage both labeled synthetic data and unlabeled real data to train the pose estimation model in a label-efficient way. Extensive experiments on three widely used benchmarks demonstrate the effectiveness and superiority of our framework over state-of-the-art methods.

Affiliations: State Key Laboratory of Dynamic Testing Technology and the School of Information and Communication Engineering, North University of China, Taiyuan, China; School of Control Science and Engineering, Key Laboratory of Machine Intelligence and System Control, Ministry of Education, Shandong University, Jinan, China; School of Software, Shandong University, Jinan, China; Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China; Advanced Multimedia Research Laboratory, University of Wollongong, Wollongong, NSW, Australia

Abstract:
Unsupervised skeleton-based action recognition has achieved remarkable progress recently. Existing unsupervised learning methods suffer from a severe overfitting problem, and thus small networks are used, significantly reducing the representation capability. To address this problem, the overfitting mechanism behind unsupervised learning for skeleton-based action recognition is first investigated. It is observed that a skeleton is already a relatively high-level and low-dimensional feature, but does not lie in the same manifold as the features for action recognition. Simply applying existing unsupervised learning methods tends to produce features that discriminate between different samples rather than action classes, resulting in the overfitting problem. To address this problem, this paper proposes an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation (U-FEFP) learning framework to generate rich distributed features that contain all the information of a skeleton sample. A spatial-temporal feature transformation subnetwork is developed using a channel-wise topology refinement graph convolutional block and a graph convolutional gated recurrent unit block as the basic feature extraction network. Unsupervised Bootstrap Your Own Latent-based learning is utilized to generate rich distributed features, and unsupervised pretext task-based learning is employed to preserve the information contained in the skeleton. The two unsupervised learning approaches are combined as U-FEFP to produce robust and discriminative representations. Experimental results on four widely used benchmarks, namely NTU-RGB+D-60, PKU-MMD, NTU-RGB+D-120 and AAV-Human dataset, demonstrate that the proposed U-FEFP obtains the best result compared with the state-of-the-art unsupervised learning methods.

Abstract:
Diffusion-based image editing involves both preserving the source image content and generating new content or applying modifications. Although current editing approaches have made improvements under text guidance, they have two key drawbacks: (1) they overemphasize retaining the original image information at the expense of editability and text alignment, and (2) they cannot handle both structure-consistent and non-rigid editing tasks. In this paper, we propose a zero-shot image editing method, named Enhance Editability for text-based image Editing via Efficient CLIP guidance (E4C), which presents an innovative adaptive feature sharing mechanism to enable multi-task editing. Additionally, a novel random gateway mechanism is designed to efficiently introduce CLIP guidance into the multi-step sampling of diffusion, achieving high congruence between editing results and target text. Comprehensive quantitative and qualitative experiments demonstrate that our method effectively resolves the text alignment issues prevalent in existing methods while maintaining fidelity to the source image, and performs well across a wide range of editing tasks.

Abstract:
This paper proposes an end-to-end video saliency prediction network model, termed TM2SP-Net (Transformer-based Multi-level Spatiotemporal Feature Pyramid Network). Leveraging the strong encoding learning capability of Video Swin Transformer for video data, we design a Multi-level Spatiotemporal Feature Pyramid Network (MLSTFPN) that effectively detects and enriches salient regions and spatial details across different scales. In particular, a pre-trained image saliency detection encoder is employed to extract salient features from each frame, serving as prior knowledge to guide the multi-scale spatiotemporal feature fusion and decoding processes. Additionally, we introduce Inception Gate-Controlled Fusion (IGCF) and Layered Self-Attention Aggregation Fusion (LSAF) mechanisms to efficiently merge spatiotemporal features across various stages. Finally, extensive experiments conducted on the DHF1K, Hollywood-2, UCF-Sports, and six audio-visual saliency datasets demonstrate the superiority of our method over existing state-of-the-art approaches.

Abstract:
The spike camera is a retina-inspired neuromorphic camera which can capture dynamic scenes of high-speed motion by firing a continuous stream of spikes at an extremely high temporal resolution. The limitation of the current design is that each spike only represents the arrival of a fixed amount of photons. It cannot deal with strong light areas in which the amount of accumulated photons reaches the pre-specified threshold multiple times within a single readout interval. In this paper, we propose a new spike camera model of high-speed imaging for high dynamic range scenarios. In this scheme, each pixel accumulates the incoming photons persistently and generates a new type of spike stream in which each spike symbol may be associated with different levels, indicating the arrival of different amounts of photons since the last readout. This enables the camera to support dynamic scenes with a wider dynamic range. To achieve this, we propose a two-level buffer mechanism, one for photon accumulation and one for spike-firing encoding. We use a register to hold the number of spike-firings which have not been read out yet. At each readout time, the major part of the counter is read out via a carefully designed exponential encoding and the counter is updated. Such an encoding and readout strategy enables a very efficient expansion of the dynamic range using a small number of encoding bits. Furthermore, we propose an image reconstruction scheme for the proposed camera, utilizing both spike intervals and spike levels to recover the light intensity. We incorporate Mamba and propose a temporal-spatial selective scan mechanism to extract temporal-spatial correlation within spike streams. We employ a pyramid adaptive filtering and alignment module to achieve coarse-to-fine feature alignment. Experimental results show that the proposed scheme can achieve better imaging quality and outperform the existing spike camera in high dynamic range scenarios.
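
One way to picture the counter readout is the sketch below. It is an assumption, not the paper's encoding: it supposes that the "major part" of the pending count is its largest power of two, that the spike level is the exponent plus one (level 0 meaning "no spike"), and that `max_level` (a hypothetical parameter) caps the level to what the encoding bits can express.

```python
def read_out(counter, max_level):
    """Read the dominant power-of-two part of the pending spike-firing count.

    Returns (level, remaining_counter). Level 0 means nothing is read out;
    level k > 0 stands for 2 ** (k - 1) spike-firings, which are subtracted
    from the counter. The remainder stays in the register for later readouts.
    """
    if counter <= 0:
        return 0, 0
    # bit_length() is floor(log2(counter)) + 1 for positive integers.
    level = min(counter.bit_length(), max_level)
    return level, counter - (1 << (level - 1))

# Drain a pending count of 13 firings over successive readouts.
pending = 13
levels = []
while pending > 0:
    level, pending = read_out(pending, max_level=7)
    levels.append(level)
```

Under these assumptions, a 3-bit symbol (levels 0 to 7) can report up to 64 firings in a single readout, which shows how a few encoding bits can cover a wide dynamic range.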

Abstract:
Facial Expression Recognition (FER) has received considerable research attention owing to its poor robustness in real-world scenarios. This issue, defined as the uncertainty problem in FER, is often solved by recognizing the noise samples in FER datasets. Unlike noise samples with incorrect labels, ambiguous samples exhibit mixed emotions that align with multiple basic expressions. This makes them indistinguishable in training and harms model robustness. To address this issue, we propose an ambiguity-aware FER framework called Co-dance with Ambiguity (CoA). CoA combines an Emotion Extraction Module (EEM) and an Expression Description Module (EDM) to leverage ambiguity for better performance and robustness. Specifically, EEM employs a coupled-stream structure to extract both representative and detailed features through diverse-scale fusion and patch-attention sensing. EDM adjusts the ground-truth labels of ambiguous samples by introducing label pairs derived from the two highest predictions, describing the mixed-emotion nature. The pairs guide the model to align feature extraction with the inherent ambiguity of ambiguous samples during training. Extensive experiments on five in-the-wild FER datasets demonstrate the superiority of CoA over advanced methods. Moreover, introducing ambiguity-aware strategies enriches feature representations and significantly enhances robustness when faced with a high ratio of ambiguous samples in FER.
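
The label-pair idea in EDM can be illustrated with a minimal sketch. It assumes the pair is turned into a soft label whose two entries are the renormalized top-two probabilities; the abstract does not specify the weighting, and `top_two_label_pair` is a hypothetical helper name.

```python
def top_two_label_pair(probs):
    """Build a label pair from the two highest class predictions.

    Returns ((first, second), soft_label), where `soft_label` spreads all
    probability mass over the two top classes in proportion to the model's
    confidence in each. Ties are broken by the lower class index.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = order[0], order[1]
    total = probs[first] + probs[second]
    soft_label = [0.0] * len(probs)
    soft_label[first] = probs[first] / total
    soft_label[second] = probs[second] / total
    return (first, second), soft_label

# Predicted probabilities for four hypothetical expression classes.
pair, soft_label = top_two_label_pair([0.1, 0.5, 0.3, 0.1])
```

A sample that mixes two expressions then contributes to both classes during training instead of being forced onto a single one-hot label.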

Abstract:
Adverse weather conditions such as rain, fog, and snow reduce visibility and degrade image quality, challenging the reliability of outdoor vision systems. Previous research mainly focuses on network models tailored to specific adverse weather conditions, limiting their effectiveness in addressing diverse weather scenarios in video processing. Recent research focuses on unified models for weather removal, significantly improving video quality in adverse conditions. However, the performance of these methods notably deteriorates in real environments due to the domain gap between synthetic and actual environments. In this paper, we present a meta-learning framework featuring a self-supervised learning (SSL) branch, aimed at boosting adaptability. In particular, we employ a two-stage training process. Initially, joint training is implemented to establish a comprehensive model for weather reconstruction. Following this, Meta-BN training is applied to fine-tune the affine coefficients of the Batch Normalization (BN) layers, thus enabling the model to quickly adjust to different weather scenarios and maintain its efficacy in reconstruction. Moreover, an SSL-driven update strategy bolsters this targeted optimization, facilitating Test-time Weather Adaptation (TT-WA) and ensuring effective generalization to unfamiliar weather conditions. Experimental results across multiple benchmark datasets demonstrate that TT-WA consistently achieves state-of-the-art (SOTA) performance in both qualitative and quantitative evaluations under a variety of weather conditions, including rain, haze, and snow, outperforming existing methods. More critically, our approach exhibits robust adaptive reconstruction capabilities when applied to unseen real-world videos, further underscoring its effectiveness in generalizing to diverse and complex weather scenarios.

Abstract:
Video embedding is the pivot in Temporal Action Detection (TAD). Once the video embedding can robustly capture the essence of actions and perceive activities in complex scenes, the TAD model can more accurately localize action boundaries. Currently, video embedding is typically based on rule-based pixel convolution or cube-based transformer, wherein structured semantic information is intertwined, leading to the submergence of crucial spatial semantic information, such as the intrinsic motion of key semantic objects and interactions among semantic objects. To address these limitations, it is imperative to explore alternative approaches. With the remarkable performance of general semantic segmentation models in visual representation, we introduce the general segmentation model SEEM into the video embedding paradigm, constructing a semantically structured representation from perceptual semantics to cognitive semantics. To more effectively utilize SEEM for structured video representation, we designed the Semantic Adapter (Sem-Adapter) as a bridge to connect the two models. Firstly, we design a Self-Motion Module (SMM) to pay attention to the self-motion of key semantic regions. Secondly, we propose a Mutual Relation Module (MRM) to construct the interactions between semantic regions. Extensive experiments on ActivityNet-1.3, THUMOS-14 and EPIC-Kitchens-100 reveal that our method significantly outperforms state-of-the-art methods under the same input modality, and our method improves the average mAP from 60.6% to 64.2% on THUMOS-14 with the same backbone. The code is available on https://github.com/shouxiaozixuan/semtad.

Abstract:
Global intra prediction (GIP), including intra-block copy and template matching prediction (TMP), exploits the global correlation of the same image to improve the coding efficiency. In Beyond VVC, TMP uses template matching to determine the reference blocks for efficient prediction. There usually exists an error between the coding block and reference blocks, caused by the content mismatch or the coding distortion of the reference blocks. We propose an enhancement over the reference blocks, namely enhanced GIP (EGIP). Specifically, we design an enhanced filter according to the templates of the coding block and the reference blocks, with the reconstructed template of the coding block as the label for supervised learning. To support different enhancements, we design two types of inputs, i.e., EGIP based on neighboring samples (N-EGIP) and EGIP based on multiple hypothesis references (M-EGIP). Experimental results show that, based on enhanced compression model (ECM) version 8.0, N-EGIP achieves BD-rate reductions of 0.37%, 0.42%, and 0.40%, and M-EGIP brings 0.34%, 0.37%, and 0.34% BD-rate savings for Y, Cb, and Cr components, respectively. A higher coding gain, 0.46%, 0.54%, and 0.52% BD-rate savings, can be achieved by integrating N-EGIP and M-EGIP together. Owing to the coding gain and small complexity increase, the proposed EGIP has been adopted in the exploration of Beyond VVC and integrated into its reference software.

Abstract:
Video Moment and Highlight Retrieval (VMHR) aims at retrieving video events with a text query in a long untrimmed video and selecting the most related video highlights by assigning worthiness scores. However, we observe that existing methods mostly have two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-intensive and subjective, so it is very hard and expensive to gather qualified annotated training data. 2) Previous VMHR methods tend to fit the temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm in model robustness towards biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles the annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method learns multimodal alignments entirely within the semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it only requires low-cost annotations and also provides much better robustness towards Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA, and their biased training versions. Extensive experiments demonstrate that QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and shows better robustness against biased training data. Our code is available at https://github.com/CFM-MSG/Code_QaS.

Abstract:
The task of live video analytics relies on real-time object tracking that typically involves computationally expensive deep neural network (DNN) models. In practice, it has become essential to process video data on edge devices deployed near the cameras. However, these edge devices often have very limited computing resources and thus suffer from poor tracking accuracy. Through a measurement study, we identify three major factors contributing to the performance issue: outdated detection results, tracking error accumulation, and ignorance of new objects. We introduce a novel approach, called Predict & Correct based Tracking, or PCTrack, to systematically address these problems. Our design incorporates three innovative components: 1) a Predictive Detection Propagator that rapidly updates outdated object bounding boxes to match the current frame through a lightweight prediction model; 2) a Frame Difference Corrector that refines the object bounding boxes based on frame difference information; and 3) a New Object Detector that efficiently discovers newly appearing objects during tracking. Experimental results show that our approach achieves remarkable accuracy improvements, ranging from 19.4% to 34.7%, across diverse traffic scenarios, compared to state-of-the-art methods.

Abstract:
Color cast is one of the main degradations in underwater images. Existing data-driven methods, while capable of learning color correction rules from large datasets, often overlook the imaging characteristics and light behavior in underwater environments, making them unable to accurately restore colors in complex water bodies. To address this, we use color constancy and an underwater imaging model to heuristically model the underwater environment for accurate color restoration. On one hand, we propose a multi-scale joint prior network architecture to fully explore the rich feature-level information at different scales in underwater images. This is used to fit the complex parameters of the underwater imaging model, deriving high-quality potential undegraded images. On the other hand, to tackle the challenges of color distortion caused by complex imaging factors in different water environments, we estimate the background light of the water body through the color constancy of underwater objects and dynamically incorporate it into the underwater imaging model as a prior. This not only guides the learning process more effectively but also allows the model to consider key aspects of underwater optical propagation, making it adaptable to different water environments and improving the color accuracy of the enhanced images. We have also conducted extensive experiments to demonstrate the effectiveness of the proposed method, which not only achieves the best overall performance in qualitative analysis and quantitative comparison but also boasts the best color accuracy and the fastest inference speed. The code is available at https://github.com/JunyuFan/MJPNet.
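
To make the role of the imaging model and the background-light prior concrete, here is a minimal per-pixel sketch. The abstract does not state which imaging model is used, so the sketch assumes the widely used simplified formation model I = J·t + B·(1 − t), applied per color channel. In the paper the parameters are fitted by a network. Here the transmission `t` and background light `B` are simply supplied by hand, and the function names are hypothetical.

```python
def degrade(clean, transmission, background):
    """Forward model per channel: I = J * t + B * (1 - t)."""
    return [j * t + b * (1.0 - t)
            for j, t, b in zip(clean, transmission, background)]


def restore(observed, transmission, background, t_min=0.1):
    """Invert the model per channel: J = (I - B * (1 - t)) / t."""
    restored = []
    for i, t, b in zip(observed, transmission, background):
        t = max(t, t_min)                     # avoid amplifying noise when t is tiny
        j = (i - b * (1.0 - t)) / t
        restored.append(min(max(j, 0.0), 1.0))  # clamp to the valid intensity range
    return restored


# One RGB pixel: red is attenuated most strongly, as is typical under water.
clean = (0.8, 0.5, 0.3)
transmission = (0.2, 0.6, 0.7)
background = (0.1, 0.5, 0.6)

observed = degrade(clean, transmission, background)
print(observed)
print(restore(observed, transmission, background))
```

The inversion recovers the clean pixel only when `t` and `B` are accurate, which is why a good background-light estimate matters as a prior.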

Abstract:
In machine learning, generating realistic human motion is paramount for a range of applications that require lifelike movements. Traditional methods have often overlooked the adherence to physical principles, leading to motion sequences that exhibit unrealistic behaviors such as foot sliding, penetration, and floating. These issues are particularly pronounced in complex tasks like dance choreography, which demand a higher degree of fidelity and realism. To address these challenges, we introduce RF-Rotation, a novel approach to human pose representation that strategically repositions the root joint of the SMPL model to align with both feet, while representing other joints through recursive bone rotations. It not only aligns more closely with the natural dynamics of human movement but also integrates an advanced contact predictor to ascertain the ground contact status of both feet, thereby preventing physically implausible movements of the feet. We note that RF-Rotation is compatible with any motion generation task, including dance choreography, text-to-motion synthesis, and motion prediction, and can be seamlessly integrated into existing frameworks without modifications. Extensive experiments across three distinct tasks demonstrate the superior performance of RF-Rotation in enhancing the realism and stability of generated motion sequences. By significantly reducing foot sliding, floating, and penetration issues without affecting computational efficiency, the method shows potential to set new standards in human motion generation.

Abstract:
Assessing the aesthetic quality and visual appeal of artworks has become one of the hotspots in current research. The existing artistic image aesthetics assessment (AIAA) methods directly learn aesthetics from images, while ignoring the impact of variations in visual attributes on human aesthetic perception, which hampers the further development of AIAA. To address this issue, this paper presents a new AIAA method based on attribute knowledge amalgamation, named AKA-Net. Specifically, we initially learn common attribute aesthetic rules (e.g., composition and color) through pre-training on natural aesthetic images. Then, we devise a multi-model amalgamation strategy based on contrastive learning to transfer different types of prior attribute knowledge into a single target model, enabling flexible and efficient aesthetic prediction. Finally, an attribute-aware feature enhancement module (AFEM) is introduced to better establish the relationship between aesthetic quality and attribute knowledge. Experimental results on three public benchmark AIAA databases demonstrate that the proposed AKA-Net outperforms the state-of-the-art AIAA metrics.

Abstract:
LiDAR-based 3D pedestrian detection has recently been extensively applied in autonomous driving and intelligent mobile robots. However, it remains a highly challenging perceptual task due to the sparsity of pedestrian point cloud data and the significant deformation of pedestrian body postures. To address these challenges, we propose a Dense Cross Connections network with Linear Attention (DCCLA), which mitigates the semantic discrepancy between the encoder and decoder of the network by integrating multiple 3D sparse convolutional layers within the skip connections. Furthermore, we enhance these connections by introducing cross-connections, thereby effectively promoting information interaction among various channels. To effectively retain crucial information while summarizing diverse pedestrian representations, we propose the Linear Self-Attention module for 3D point clouds (LSA3D), which significantly reduces model complexity. The experimental results demonstrate that our DCCLA achieves state-of-the-art Average Precision (AP) for the 3D pedestrian detection task on the JRDB large-scale dataset, outperforming the second-ranked method by 2.7% AP. Furthermore, our DCCLA improves mIoU by 1.6% over the benchmark method on the SemanticKITTI dataset. Therefore, our method achieves excellent performance through a cross-scale feature fusion strategy and linear attention that fully combines the advantages of convolution and transformer architectures. The project is publicly available at https://github.com/jinzhengguang/DCCLA.
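
As background for why a linear attention module reduces complexity, the sketch below shows generic kernelized linear attention with the feature map φ(x) = elu(x) + 1. This is an illustrative stand-in, not the LSA3D module itself, whose exact form the abstract does not give. The function names are hypothetical.

```python
import math


def feature_map(x):
    """phi(x) = elu(x) + 1, which is strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Attention in O(N) time: summarize keys and values once, then reuse."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]    # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                        # sum_j phi(k_j)
    for k, v in zip(keys, values):
        pk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += pk[a]
            for b in range(d_v):
                kv[a][b] += pk[a] * v[b]
    outputs = []
    for q in queries:
        pq = feature_map(q)
        norm = sum(pq[a] * k_sum[a] for a in range(d_k))
        outputs.append([sum(pq[a] * kv[a][b] for a in range(d_k)) / norm
                        for b in range(d_v)])
    return outputs


points = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]   # toy per-point features
print(linear_attention(points, points, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The key-value summary is built in a single pass and shared by all queries, so the cost grows linearly with the number of points rather than quadratically as in softmax attention.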

Abstract:
The continuous development of Earth observation (EO) technology has significantly increased the availability of multi-sensor remote sensing (RS) data. The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data has become a research hotspot. Current mainstream convolutional neural networks (CNNs) excel at extracting local features from images but have limitations in modeling global information, which may affect the performance of classification tasks. In contrast, modern graph convolutional networks (GCNs) excel at capturing global information, particularly demonstrating significant advantages when processing RS images with irregular topological structures. By integrating these two frameworks, features can be fused from multiple perspectives, enabling a more comprehensive capture of multimodal data attributes and improving classification performance. The paper proposes a spatial-spectral-structural feature fusion network (S3F2Net) for HSI and LiDAR data classification. S3F2Net utilizes multiple architectures to extract rich features of multimodal data from different perspectives. On one hand, local spatial and spectral features of multimodal data are extracted using CNN, enhancing interactions among heterogeneous data through shared-weight convolution to achieve detailed representations of land cover. On the other hand, the global topological structure is learned using GCN, which models the spatial relationships between land cover types through graph structure constructed from LiDAR data, thereby enhancing the model’s understanding of scene content. Furthermore, the dynamic node updating strategy within the GCN enhances the model’s ability to identify representative nodes for specific land cover types while facilitating information aggregation among remote nodes, thereby strengthening adaptability to complex topological structures. By employing a multi-level information fusion strategy to integrate data representations from both global and local perspectives, the accuracy and reliability of the results are ensured. Compared with state-of-the-art (SOTA) methods, the framework’s validity is verified on three real multimodal RS datasets. The source code will be available at https://github.com/slylnnu/S3F2Net.

Affiliations: Hubei Key Laboratory of Optical Information and Pattern Recognition, Wuhan Institute of Technology, Wuhan, China; College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, China; School of Electrical Information Engineering, Wuhan Donghu University, Wuhan, China; School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China

Abstract:
In image fusion, the crucial task is to generate high-quality images that highlight the key objects while making the scene easier to understand. To accomplish this task with strong interpretability and generalization ability, and to produce fusion results that are well suited to vision tasks (such as object detection and segmentation), we present a novel interpretable decomposition scheme and develop a target-aware Taylor expansion approximation (T2EA) network for infrared and visible image fusion. T2EA includes the following key procedures. First, visible and infrared images are both decomposed into feature maps through a designed Taylor expansion approximation (TEA) network. Then, the Taylor feature maps are hierarchically fused by a dual-branch feature fusion (DBFF) network. Next, the fused map of each layer contributes to synthesizing the fusion result through the inverse Taylor expansion. Finally, a segmentation network is joined to the fusion network to refine its parameters, making the fusion results more suitable for segmenting objects. To validate the effectiveness of the T2EA network, we first discuss the selection of Taylor expansion layers and fusion strategies. Then, quantitative and qualitative experimental results are compared with those of selected SOTA approaches on three datasets (MSRS, TNO, and LLVIP) in testing, generalization, and target detection and segmentation, demonstrating that T2EA produces more competitive fusion results for vision tasks and is more powerful for image adaptation. The code will be available at https://github.com/MysterYxby/T2EA.

Abstract:
Recent advances in video object segmentation (VOS) highlight its potential across various applications. Semi-supervised VOS aims to segment target objects in video frames based on annotations from the initial frame. Collecting a large-scale video segmentation dataset is challenging and can introduce noisy labels. However, this issue has been overlooked, and most research efforts have been devoted to training VOS models under the assumption that the training dataset is clean. In this study, we first explore how noisy labels in the training dataset affect VOS models. To this end, we simulate noisy annotations on the DAVIS 2017 and YouTubeVOS datasets. Experiments show that the traditional training strategy is vulnerable to noisy annotations. To address this issue, we propose a novel noise-robust training method, named SMART (Spatial Mask-based Adaptive Robust Training), which is designed to train models effectively in the presence of noisy annotations. The proposed method employs two key strategies. First, the model focuses on the common spatial areas from clean knowledge-based predictions and annotations. Second, the model is trained with adaptive balancing losses based on their reliability. Comparative experiments demonstrate the effectiveness of our approach, which outperforms other noise-handling methods across various noise degrees.

Abstract:
Semantic segmentation is a fundamental task in computer vision and finds extensive applications in scene understanding, medical image analysis, and remote sensing. With the advent of deep learning, significant advancements have been made in segmentation tasks. However, deep learning models require a substantial amount of labeled data for training, and accurately annotating datasets is labor-intensive and costly. Recently, numerous studies have explored the semantic segmentation task through the lens of semi-supervised learning, with the pseudo-labeling (PL) method emerging as a straightforward and widely applicable approach. This paper provides a comprehensive review and analysis of various PL methods and their applications in semi-supervised semantic segmentation (SSSS) from multiple angles. Initially, it captures the essence of individual model self-training and the collaborative training of multiple models from a model-centric viewpoint. Next, it explores strategies for refining or dismissing unreliable methods. Then, it categorizes techniques for addressing noisy PL data and inspects improvements in PL methods from the perspective of data augmentation. It further provides insights into optimization strategies. Furthermore, it examines PL methods from an application-oriented standpoint, such as in medical image segmentation and remote sensing image segmentation. Lastly, this paper evaluates the performance of cutting-edge methods on public datasets and concludes by discussing the challenges and potential directions for future research.

Abstract:
Existing RGB-D salient object detection (SOD) models have large numbers of parameters, high computational complexity, and slow inference speeds, limiting their deployment on edge devices. To address this issue, we propose a highly efficient network (HENet), focusing on developing lightweight RGB-D SOD models. Specifically, to fairly handle multimodal inputs and capture long-range dependencies of features, we employ a dual-stream structure and use MobileViT as the network encoder. We introduce the Adaptive Edge-Aware Fusion Module (AEFM) that adaptively adjusts the contribution of features during the fusion process based on the amount of feature information, and perceives the edges of the fused features at the pixel level. To compensate for the insufficient feature extraction capability of the lightweight backbone network, we propose the Dual-Branch Feature Enhancement Module (DFEM) to enhance the representation capability of the fused features. Finally, we design the Feature Attention Regulation Module (FARM) to adjust the model’s focus in real time. HENet has fewer parameters (11.9M) and lower computational complexity (10.7 GFLOPs), achieving an inference speed of 121 FPS for images of size 384 × 384. Extensive experiments are conducted on seven challenging RGB-D SOD datasets. The experimental results demonstrate that HENet outperforms 16 state-of-the-art methods and shows great potential in downstream computer vision tasks. Codes and results are available at https://github.com/BojueGao/HENet.

Abstract:
Despite rapid advances in unsupervised anomaly detection and localization, most existing methods require training a separate model for each category, which increases computational and memory demands in real applications as the number of classes grows. A more practical task is to detect anomalies from different categories using one unified model. However, this unified setting makes it challenging to model the multi-class normal feature representation due to the diversity of data categories, and the performance of existing methods often drops under this setting. In this work, we propose UniSTAD, a novel and effective unified method for multi-class anomaly detection and localization, using a transformer-based triple-tower student-teacher model. The triple-tower design contains global and local student models, which predict features from global and local context features, respectively. UniSTAD learns the feature representation of normal data by jointly distilling features from a pre-trained teacher model and enforcing global/local context-based feature reconstruction and consistency. In the inference stage, UniSTAD identifies anomalous regions where expected feature consistencies are broken. Additionally, we integrate an untrained, category-agnostic localization refinement module, further improving multi-class anomaly detection and localization performance. Evaluated on real-world industrial datasets, UniSTAD demonstrates state-of-the-art performance, validating its efficacy for multi-class anomaly detection and localization.
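
The inference rule, flagging locations where expected feature consistencies break, can be sketched as follows. The abstract does not specify the scoring function, so this sketch assumes cosine distance and an unweighted average over the three tower pairs. The function names and toy features are hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity; close to 0 for aligned vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm + eps)


def anomaly_map(teacher, global_student, local_student):
    """Per-location score: mean inconsistency over the three tower pairs."""
    scores = []
    for t, g, l in zip(teacher, global_student, local_student):
        scores.append((cosine_distance(t, g)
                       + cosine_distance(t, l)
                       + cosine_distance(g, l)) / 3.0)
    return scores


# Two spatial locations with 2-D features; the towers disagree at the second one.
teacher = [[1.0, 0.0], [0.0, 1.0]]
global_student = [[1.0, 0.0], [1.0, 0.0]]
local_student = [[1.0, 0.0], [0.0, 1.0]]
print(anomaly_map(teacher, global_student, local_student))
```

On normal data all three towers are trained to agree, so a high score marks a location the students could not predict from context.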

Abstract:
With the development of computer vision technology, unsupervised depth estimation from single images has experienced significant advancements under normal weather conditions, demonstrating highly promising results. Nevertheless, its efficacy in estimating depth under less-than-optimal weather conditions, particularly those characterized by fog, continues to pose substantial challenges. In this paper, we propose FoggyDepth that is designed to utilize channel-wise Fourier transform to remedy this limitation. Specifically, to relieve the problem of photometric consistency assumption not holding in foggy scenes within the unsupervised framework, we employ a channel-dimension Fourier transform to obtain channel global statistical information, thereby enhancing the discriminative ability of global representation. Meanwhile, we generate a series of foggy scene samples corresponding to normal training samples and use them for self-supervised training to guide the model to accurately recover depth in foggy conditions. In addition, to further improve the model performance, we utilize a non-local network to capture long-range spatial dependencies in depth estimation. Comprehensive evaluations conducted on the Oxford RobotCar, nuScenes, and Driving Stereo datasets substantiate the precision and reliability of our proposed method. Through a meticulous comparison with existing leading-edge algorithms in depth estimation, our approach demonstrates superior performance, both qualitatively and quantitatively.

Abstract:
Skeleton data has become popular in human action recognition because of its efficacy in capturing human motion patterns while mitigating the influence of environmental noise. However, overlooking critical action-related environmental descriptors presents challenges in distinguishing actions characterized by similar body movements. To address this limitation, we propose a novel framework that integrates skeleton data with language descriptions to easily capture essential environmental information for fine-grained action recognition while maintaining the robustness of skeleton-based methods. We first develop a Language Environment Description Generation (LEDG) module that utilizes the open-world understanding ability of Large Multimodal Models to generate instance-level action-related language environment descriptions without the need to train additional modules. Then, we introduce a Skeleton-supported Environment Feature Extraction (SEFE) module that leverages the temporal dependency inherent in skeleton data to extract key semantic environmental features. Additionally, we propose an Entropy-based Feature Fusion (EFF) module to dynamically amalgamate complementary features from both skeleton and language domains. Experimental results demonstrate the superiority of our framework, which can improve the accuracy of existing skeleton-based action recognition methods and achieve state-of-the-art performance on four well-established skeleton-based action recognition benchmarks.

Abstract:
Non-blind rotary motion deblurring (RMD) aims to restore a latent image from its blurred image. Since the integration path of rotary motion blurring (RMB) is a circle, RMD is modelled as a typical motion deblurring in the polar coordinate system (PCS). However, existing PCS-based methods use hand-designed image priors and are limited by transformation errors, including Cartesian-to-polar transformation (CPT) error and polar-to-Cartesian transformation (PCT) error. In this paper, we analyze the impact of transformation errors on the restored image and propose a novel end-to-end network which introduces a convolutional neural network (CNN) to learn image priors. Specifically, considering the CPT error, we construct a degradation model and solve it in an unrolling way, effectively reducing the ringing artifacts. For the PCT error, we develop a PCT error correction module (PCM) to reconstruct the lost details and textures. Experiments show that our method outperforms state-of-the-art (SOTA) approaches on synthetic and real-world rotary motion blur datasets by a large margin. The code and model are available at https://github.com/Jinhui-Qin/RMD_PCS.
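
The reason polar coordinates help can be shown with a small sketch: after a Cartesian-to-polar resampling, a rotation about the center becomes a shift along the angle axis, so rotary blur reduces to a one-dimensional circular blur on each radius row. The nearest-neighbour sampling and box kernel below are simplifying assumptions for illustration, and the function names are hypothetical. Such resampling is also one plausible source of the transformation errors mentioned above.

```python
import math


def cartesian_to_polar(image, center, n_radii, n_angles):
    """Resample a 2-D list image onto a (radius, angle) grid by nearest neighbour."""
    height, width = len(image), len(image[0])
    cy, cx = center
    polar = [[0.0] * n_angles for _ in range(n_radii)]
    for r in range(n_radii):
        for a in range(n_angles):
            theta = 2.0 * math.pi * a / n_angles
            y = int(round(cy + r * math.sin(theta)))
            x = int(round(cx + r * math.cos(theta)))
            if 0 <= y < height and 0 <= x < width:
                polar[r][a] = image[y][x]
    return polar


def blur_along_angle(polar, length):
    """Circular box blur along the angle axis of each radius row."""
    n_angles = len(polar[0])
    return [[sum(row[(a - k) % n_angles] for k in range(length)) / length
             for a in range(n_angles)]
            for row in polar]


# A single bright pixel two pixels to the right of the rotation center.
image = [[0.0] * 5 for _ in range(5)]
image[2][4] = 1.0
polar = cartesian_to_polar(image, center=(2, 2), n_radii=3, n_angles=4)
print(blur_along_angle(polar, length=2)[2])
```

The bright pixel is smeared only along its own radius row, which is what lets a standard one-dimensional deblurring formulation be applied row by row.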

Abstract:
Rain in the dark poses significant challenges to deploying real-world applications such as autonomous driving, surveillance systems, and night photography. Existing low-light enhancement or deraining methods struggle to brighten low-light conditions and remove rain simultaneously. Cascade approaches, like “deraining followed by low-light enhancement” and vice versa, often result in problematic rain patterns or overly blurred and overexposed images. To address these challenges, we introduce a novel two-stage model called L2RIRNet, which innovatively integrates low-light enhancement and deraining into a single framework in real-world settings. Our model comprises two key components: a Dual Degradation Representation Network (DDR-Net) and a Restoration Network. The DDR-Net independently learns degradation representations for luminance effects in dark areas and rain patterns in light areas, which are constrained by dual degradation loss and have not been discussed in the previous methods. The Restoration Network restores the degraded image using a Fourier Detail Guidance (FDG) module, which focuses on texture details in frequency and spatial domains to inform the restoration process and leverages near-rainless detailed images. Furthermore, we contribute a dataset containing both synthetic and real-world low-light-rainy images. Extensive experiments demonstrate that our L2RIRNet performs favorably against existing methods in synthetic and complex real-world scenarios. All the code and dataset can be found in https://github.com/linxin0/Low_light_rainy.

Abstract:
Because points are unordered, point clouds need to be structured by sampling and neighbor query before being fed to Deep Neural Networks (DNNs). Structuring point clouds incurs high computational overhead, which limits the deployment of DNNs on embedded devices such as autonomous vehicles and robots. To address this problem, we design a novel data structure, i.e., Fast Spatial-Searching Tree (FSSTree), to accelerate point cloud structuring for DNNs on embedded devices. The FSSTree is constructed based on density distribution of point clouds to achieve semantic segmentation, which can guarantee that points with similar spatial positions are stored in adjacent storage sets. Based on FSSTree, we propose a point-sparsity-aware sampling method and a leafwise k-nearest neighbor query method to reduce the computation overhead of structuring point clouds. Meanwhile, the point-sparsity-aware sampling method achieves fair sampling on both dense and sparse parts, which can overcome the nonuniform distribution of point clouds caused by occlusion, lighting and other factors. The leafwise k-nearest neighbor query method skips a large number of dissimilar points to quickly obtain the neighbor points, which can significantly reduce the search scope. We also present a layerwise self-pruning algorithm to automatically adjust the FSSTree after each layer’s operation to match the hierarchical architecture of DNNs. Finally, we conduct extensive experiments on KITTI, S3DIS and ModelNet40 datasets and three devices (including an RTX 3090 server, a Jetson AGX Xavier and an Apple M2). The experimental results demonstrate the efficiency of our approach, which can reduce the time overhead by up to 97.46% compared with the other five methods. The code is released at https://github.com/EmbeddedAILab-UESTC/fsstree.

Abstract:
Anomaly detection in complex industrial processes plays a pivotal role in ensuring efficient, stable, and secure operation. Existing anomaly detection methods primarily focus on analyzing dominant anomalies using the process variables (such as arc current) or constructing neural networks based on abnormal visual features, while overlooking the intrinsic correlation of cross-modal information. This paper proposes a cross-modal Transformer (dubbed FmFormer), designed to facilitate anomaly detection by exploring the correlation between visual features (video) and process variables (current) in the context of the fused magnesium smelting process. Our approach introduces a novel tokenization paradigm to effectively bridge the substantial dimensionality gap between the 3D video modality and the 1D current modality in a multiscale manner, enabling a hierarchical reconstruction of pixel-level anomaly detection. Subsequently, the FmFormer leverages self-attention to learn internal features within each modality and bidirectional cross-attention to capture correlations across modalities. By decoding the bidirectional correlation features, we obtain the final detection result and even locate the specific anomaly region. To validate the effectiveness of the proposed method, we also present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples. Leveraging cross-modal learning, the proposed FmFormer achieves state-of-the-art performance in detecting anomalies, particularly under extreme interferences such as current fluctuations and visual occlusion caused by heavy water mist. The presented methodology and benchmark may be applicable to other industrial applications with some amendments. The benchmark will be released at https://github.com/GaochangWu/FMF-Benchmark.

Abstract:
Image restoration aims to recover high-quality images from their corrupted counterparts. Many existing methods focus on the spatial domain while overlooking frequency variations between sharp/degraded image pairs. Meanwhile, they typically establish skip connections between encoder and decoder features using addition or concatenation to enhance image restoration. However, since encoder features may contain degradation factors, this approach can inadvertently introduce implicit noise. In this paper, we introduce a multi-scale frequency selection network (MFSNet) that seamlessly integrates spatial and frequency domain knowledge, selectively recovering richer and more accurate information. Specifically, we first capture spatial features and input them into dynamic filter selection modules (DFS) at different scales to integrate frequency knowledge. DFS utilizes learnable filters to generate high- and low-frequency information and a frequency cross-attention mechanism (FCAM) to determine the most informative components to recover. To learn a multi-scale and accurate set of hybrid features, we develop a skip feature fusion block (SFF) that leverages contextual features to discriminatively determine which information should be propagated in skip connections. It is worth noting that our DFS and SFF are generic plug-in modules that can be directly employed in existing networks without any adjustments, leading to performance improvements. Extensive experiments across various image restoration tasks demonstrate that our MFSNet achieves performance that is either superior or comparable to state-of-the-art algorithms. The code and the pre-trained models are released at https://github.com/Tombs98/MFSNet_.

Abstract:
Omnidirectional image quality assessment (OIQA) has been widely investigated in the past few years and has achieved much success. However, most existing studies are dedicated to solving the uniform distortion problem in OIQA, which has a natural gap with the non-uniform distortion problem, and their ability to capture non-uniform distortion is far from satisfactory. To narrow this gap, in this paper, we propose a multitask auxiliary network for non-uniformly distorted omnidirectional images, where the parameters are optimized by jointly training the main task and other auxiliary tasks. The proposed network mainly consists of three parts: a backbone for extracting multiscale features from the viewport sequence, a multitask feature selection module for dynamically allocating specific features to different tasks, and auxiliary sub-networks for guiding the proposed model to capture local distortion and global quality change. Extensive experiments conducted on two large-scale OIQA databases demonstrate that the proposed model outperforms other state-of-the-art OIQA metrics, and that these auxiliary sub-networks contribute to improving the performance of the proposed model. The source code is available at https://github.com/RJL2000/MTAOIQA.

Abstract:
Camera-based stereo 3D object detection estimates 3D properties of objects with binocular images only, which is a cost-effective solution for autonomous driving. The state-of-the-art methods mainly improve the detection accuracy of general objects by designing ingenious stereo matching algorithms or complex pipeline modules. Moreover, additional fine-grained annotations, such as masks or LiDAR point clouds, are often introduced to deal with occlusion problems, which brings high manual costs for this task. To address the detection bottleneck caused by occlusion in a more cost-effective manner, we develop a novel stereo 3D object detection method named DSC3D, which achieves significant improvements for occluded objects without introducing additional supervision. Specifically, we first report the ambiguity in feature sampling, which refers to the presence of noisy features in the sampling for occluded objects. Then, we propose the Epipolar Constraint Deform-Attention (ECDA) module to address the unreliable left-right correspondence computation in stereo matching caused by occlusion, which reweights epipolar features by adaptively aggregating local neighbor information. Furthermore, to ensure that 3D property estimation is based on robust object features, we propose a visible-region-guided constraint to explicitly guide the offset learning for feature sampling. Extensive experiments conducted on the KITTI benchmark demonstrate that the proposed DSC3D outperforms the state-of-the-art camera-based methods.

Abstract:
Existing gait recognition methods are capable of extracting rich spatial gait information but often overlook fine-grained temporal features within local regions and temporal contextual information across different sub-regions. Considering that gait recognition is a fine-grained recognition task and that each individual moves uniquely across different temporal sequences, we propose a local multi-scale and global contextual spatio-temporal (LMGCS) network for gait recognition. It divides the whole gait sequence into sub-sequences with multiple spatial resolutions and extracts multi-scale temporal features. We extract the temporal context information of different sub-sequences with a transformer, and all sub-sequences are fused to form global features. Furthermore, a loss function that combines the triplet loss and the cross-entropy loss is used to train the proposed model for gait recognition. The proposed method achieves state-of-the-art results on two popular public datasets, with rank-1 accuracy of 98.0%, 95.4%, and 85.0% on the three walk states of the CASIA-B dataset and 90.9% on the OU-MVLP dataset.
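A combined triplet and cross-entropy objective of the kind mentioned above can be sketched as follows. This is a generic textbook formulation, not the paper's implementation: the `margin` and `ce_weight` values, the Euclidean distance, and all function names are assumptions made for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed in a numerically stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def combined_loss(anchor, positive, negative, logits, label,
                  margin=0.2, ce_weight=1.0):
    """Weighted sum of the two terms; ce_weight is a hypothetical balancing factor."""
    return (triplet_loss(anchor, positive, negative, margin)
            + ce_weight * cross_entropy(logits, label))
```

The triplet term shapes the embedding space so that sequences of the same identity lie closer together than those of different identities, while the cross-entropy term supervises identity classification directly.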

Abstract:
Few-shot image generation aims to train generative models using a small number of training images. When there are few images available for training (e.g., 10 images), Learning From Scratch (LFS) methods often generate images that closely resemble the training data, while Transfer Learning (TL) methods try to improve performance by leveraging prior knowledge from GANs pre-trained on large-scale datasets. However, current TL methods may not allow for sufficient control over the degree of knowledge preservation from the source model, making them unsuitable for setups where the source and target domains are not closely related. To address this, we propose a novel pipeline called Peer is your Pillar (PIP), which combines a target few-shot dataset with a peer dataset to create a data-unbalanced conditional generation. Our approach includes a class embedding method that separates the class space from the latent space, and we use a direction loss based on pre-trained CLIP to improve image diversity. Experiments on various few-shot datasets demonstrate the advantages of the proposed PIP, which in particular reduces the training requirements of few-shot image generation.

Abstract:
Deep learning-based video watermarking has been shown to be effective in improving robustness. However, existing methods neglect the enhancement of long-distance spatio-temporal features and the representation of the inter-frame difference and the intra-frame difference, which leads to poor robustness against H.264 compression and low compatibility with high-definition (HD) and full high-definition (FHD) videos for copyright protection, respectively. To address these issues, we propose a robust and compatible video watermarking network (RC-VWN) based on spatio-temporal enhancement and multiscale pyramid attention. For robustness, RC-VWN extracts long-distance spatio-temporal features using a central difference 3D U-Net and enhances them through multiscale spatio-temporal fusion, which alleviates the loss of the watermark caused by attacks through the association of long-distance spatio-temporal features. Then, a simulated compression network is developed to simulate H.264 compression with high accuracy, which guides the decoder to recover the watermark accurately. For compatibility, a multiscale pyramid attention is designed to represent the intra-frame difference and the inter-frame difference effectively. Experimental results demonstrate that RC-VWN outperforms the state-of-the-art methods with higher robustness and imperceptibility under quantitative evaluation and visual quality. Furthermore, RC-VWN exhibits high compatibility with various videos, including HD and FHD videos, ensuring effective copyright protection.

Abstract:
Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model’s focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model’s ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.

Abstract:
Image steganography discreetly embeds secret information within a carrier, allowing covert communication and enabling the receiver to extract the concealed data when needed. Previous techniques for image steganography had limitations in achieving imperceptibility and security when dealing with images containing intricate textures. In this paper, we introduce DenseJIN, an innovative model for dense depth image steganography. DenseJIN joins invertible and noninvertible mechanisms to achieve effective and secure information hiding. The invertible component of DenseJIN ensures that the stego image maintains high imperceptibility and security, while the noninvertible component enables high-quality recovery of the secret image. In the invertible component, we employ a dense connection for each invertible block in the forward process and a straightforward series connection during the reverse process. In the forward process of the network, the secret image is embedded, while the backward process is responsible for extracting the embedded secret image. To perform the noninvertible step, we incorporate a modified Unet architecture, enabling deep fine-grained feature extraction from cover images and secret images. Our experimental results indicate that DenseJIN surpasses other contemporary image steganography methods. On average, DenseJIN achieves a remarkable improvement of over 1.75 dB in PSNR for secret image recovery across DIV2K, COCO and ImageNet.
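For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it over flat lists of pixel values; the function name and the 8-bit default `max_value=255.0` are assumptions for illustration, not details taken from the paper.

```python
import math


def psnr(reference, recovered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(recovered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, recovered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1.75 dB at a fixed peak value corresponds to the mean squared error shrinking by a factor of 10^0.175, i.e. by roughly one third.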

Abstract:
Anterior segment optical coherence tomography (AS-OCT) is a popular imaging technique that can directly visualize the anterior segment structures, but inherent speckle noise severely impairs visual readability and subsequent clinical analysis. Though unpaired OCT image denoising algorithms have been developed to improve visual quality given the limited supervised clinical data, preserving the edge structures while denoising remains challenging, especially in AS-OCT images with little hierarchy and low contrast. This work proposes a contrast-aware edge enhancement generative adversarial network (E^2GAN), designed particularly for unpaired AS-OCT image denoising. Specifically, to improve edge-structure consistency, we design a contrast attention mechanism for exploiting diverse hierarchical knowledge from multiple contrast images and adopt gradient-guided speckle filtering modules with an edge preservation loss for stabilizing the network. Additionally, considering that bi-directional GANs often focus on global appearance rather than essential features, E^2GAN adds a perceptual quality constraint into the cycle consistency. Extensive experiments validate the superiority of E^2GAN for AS-OCT image denoising and the benefits for downstream clinical analysis. Further experiments on synthetic retinal OCT images demonstrate the generalization of E^2GAN.

Abstract:
Many studies have focused on utilizing convolutional neural networks (CNNs) to enhance loop filter performance in video encoding. However, existing methods primarily concentrate on improving the natural sequence quality rather than addressing the specific needs of screen content sequences, which have gained increased attention due to the growing demands of remote desktops and online meetings. This paper proposes to understand machine behavior from the machine’s point of view and applies machine intelligence to screen content coding. It presents a novel loop filter specifically tailored for screen content coding (SCC), referred to as video coding-SCC (VC-SCC). It employs a multiscale feature extraction structure and introduces two innovative non-local models to address distortions in different frame types across various coding setups. Specifically, considering regions of text and graphic textures in screen content, three types of prior maps, including screen content maps, coding configuration maps, and traditional filtering maps, are designed as auxiliary information in the model, promoting distortion pattern learning under different configurations. Two novel non-local models are proposed to enhance the model’s ability to capture global features in intra- and inter-frames while keeping computational complexity low. Finally, VC-SCC is implemented in parallel with the standard in-loop filter, and the optimal result is selected for each patch. Experimental results demonstrate significant performance improvements, with average BD-rate savings of 9.93%, 11.05%, and 10.73% for the all-intra (AI), low-delay (LD), and random-access (RA) configurations, respectively, outperforming other state-of-the-art approaches.

Abstract:
Sketch-based 3D shape retrieval has attracted increasing attention in recent years. Most existing methods fail to address the zero-shot scenario, and the few dedicated to zero-shot learning encounter the following two issues: 1) the features learned by these methods lack informativeness and generalization, rendering them ineffective in identifying unseen samples; 2) the generation of low-quality samples, aimed at facilitating the recognition of unseen categories, paradoxically diminishes their ability to identify these unseen classes. This paper introduces a novel contrastive disentanglement generative adversarial networks (CoDi) tailored for zero-shot sketch-based 3D shape retrieval. Initially, we introduce a paradoxical feature construction approach designed to assist the networks in capturing certain low-level features. Despite their weak semantic relevance, these features play a crucial role in sample recognition. Subsequently, a SemContrast fusion module is employed to align the semantic space with the prototype embedding space of categories. This alignment facilitates knowledge transfer to unseen classes and promotes the generation of high-quality samples. The networks are jointly trained on real and generated samples to achieve retrieval for unseen categories. Extensive experiments demonstrate a significant improvement in retrieval performance for unseen categories using our method.

Abstract:
Aircraft detection in synthetic aperture radar (SAR) images is a challenging task due to the discreteness of aircraft scattering, the diversity of aircraft size, and the interference of background. To deal with these problems, a novel method named scattering enhancement and feature fusion network (SEFFNet) is proposed to detect aircraft by combining traditional image processing and deep learning. First, a scattering information extraction and enhancement module (SIEEM) is proposed to highlight the scattering points of aircraft targets. Then, to focus more effectively on the location of aircraft targets, a space-to-depth coordinate attention module (SDCAM) is designed, after which an efficient multi-scale feature fusion pyramid (FFP) is introduced to fuse the semantic information of different layers. Finally, a contextual fusion head (CFH) is built to enlarge the receptive field for better aircraft detection. Experiments carried out on the popular datasets SADD and SAR-AIRcraft-1.0 show that SEFFNet is better suited to aircraft detection, especially small-size aircraft detection, than other state-of-the-art (SOTA) methods. Taking the SADD dataset as an example, the precision, recall, F1-score, and APs values are on average 2.8%, 2.6%, 2.7%, and 2.0% higher, respectively, than those of the baseline network YOLOv5.

Abstract:
Visual Grounding (VG) has become a prominent task in recent years, achieving significant advancements with the development of detection and vision transformers. However, existing VG methods struggle to handle the effects of inaccurate or irrelevant textual descriptions, tending to generate false-alarm objects. Moreover, existing methods fail to capture fine-grained features, accurate localization, and comprehensive context understanding from the whole image and textual descriptions. To address these issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with a Multi-stage False-alarm Sensitive Decoder (MFSD) to prevent the generation of false-alarm objects when presented with inaccurate expressions. The framework introduces Masked Reference based Centerpoint Supervision (MRCS) and Iterative Multi-level Vision-language Fusion (IMVF) to enhance localization accuracy and improve visual-language alignment. To further investigate the elements that affect VG robustness, we release a robust VG benchmark with 24,000 instances and provide a detailed classification of false alarms according to different parts of speech. Extensive experiments on existing state-of-the-art (SOTA) VG methods and foundation models show that existing models struggle with robust VG. Even foundation models, which have been pre-trained with a large amount of data, have difficulty understanding inaccurate language descriptions. Our IR-VG handles false-alarm issues in robust VG well and achieves new SOTA results on the newly proposed robust VG datasets. Ablation studies and visualization experiments demonstrate the effectiveness of the proposed components. Moreover, the proposed framework is also shown to be effective on five regular VG datasets. Code and models will be publicly available at https://github.com/cv516Buaa/IR-VG.

Abstract:
Equivariant networks have recently made significant strides in computer vision tasks related to robotic grasping, molecule generation, and 6D pose tracking. In this paper, we explore 3D mesh object analysis based on an equivariant masked autoencoder to reduce the model dependence on large datasets and predict the pose transformation. We employ 3D reconstruction tasks under rotation and masking operations, such as segmentation tasks after rotation, as pretraining to enhance downstream task performance. To mitigate the computational complexity of the algorithm, we first utilize multiple non-overlapping 3D mesh patches with a fixed face size. We then design a rotation-equivariant self-attention mechanism to obtain advanced features. To improve the throughput of the encoder, we design a sparse token merging strategy. Our method achieves comparable performance on equivariant analysis tasks of mesh objects, such as 3D mesh pose transformation estimation, object classification and part segmentation on the ShapeNetCore16, Manifold40, COSEG-aliens, COSEG-vases and Human Body datasets. In the object classification task, we achieve superior performance even when only 10% of the original sample is used. We perform extensive ablation experiments to demonstrate the efficacy of critical design choices in our approach.

Abstract:
Reversible data hiding in encrypted images (RDHEI) is a powerful security technology that aims to hide data in an encrypted image without any distortion in data extraction or image recovery. Most existing RDHEI methods using vacated room-based data embedding algorithms face challenges in improving embedding capacity and security. In this paper, we develop a novel data hiding strategy via fusion based on a reservoir computing (RC) system, upon which a new RDHEI scheme is further proposed. In the proposed scheme, the original image is first encrypted by a stream cipher-based encryption algorithm using the secret keys generated by an optical chaotic system. Then, by means of the RC system, the generated encrypted image can be fused with the secret data to produce the final masked image. Unlike the existing data embedding algorithms based on vacating rooms, the RC-based fusion strategy allows for hiding secret data comparable to the volume of the cover image in the encrypted image, so that a much higher embedding capacity can be achieved. Moreover, the proposed strategy involves a chaotic transformation via the reservoir of the RC system during data hiding, producing a masked image that is completely different from the encrypted image, which greatly enhances security. Experimental results show the contributions in improving the embedding capacity and security, and also demonstrate the superiority of the proposed scheme compared to some existing RDHEI methods.

Abstract:
The count supervision used in weakly-supervised crowd counting is derived from the number of point annotations, which means that the labeling cost is not effectively reduced. Moreover, due to the lack of spatial information about the pedestrians during training, previous works struggle to accurately learn the positions of individuals. To address these challenges, we propose a crowd counting and localization method based on scene-specific synthetic data for surveillance scenarios, which can accurately predict the number and locations of people without any manually labeled point-wise or count-wise annotations. Our method dynamically adjusts scene-specific synthetic data to minimize domain differences from surveillance scenes by learning the crowd scale and distribution. Specifically, based on realistic synthetic data, the models learn precise location and scale information, which can then be used to regenerate new synthetic data with a more reasonable pedestrian distribution and scale and to generate high-quality pseudo point-wise annotations. Subsequently, the counter is trained using our proposed robust soft-weighted loss function, under the joint supervision of auto-generated point-wise annotations on synthetic data and pseudo point-wise annotations on real data in an end-to-end manner. Our proposed loss function, based on the designed weighted optimal transport, effectively mitigates noise in pseudo point-wise labels and is not only insensitive to hyperparameters but also exhibits superior generalization ability on real data. We conduct comprehensive experiments across multiple scene-specific datasets, demonstrating our method’s superiority in counting and localization performance over count-supervised, fully-supervised, and state-of-the-art domain adaptation algorithms. Code is available at https://github.com/fyw1999/LCSD.

Abstract:
Multimodal knowledge graph completion (MKGC) seeks to enrich knowledge graphs by integrating information from diverse modalities, facilitating more comprehensive knowledge representation and enhancing reasoning accuracy. However, existing models lack the flexibility to adapt to different tasks, and their performance still requires further improvement. To tackle these challenges, we propose an MKGC model based on Dynamic prompt learning and Multi-granularity cross-modal Aggregation, namely DM-MKGC. To be specific, a novel dynamic prompt template is proposed, which employs an adaptive task-guided mechanism to dynamically adjust the structure of entities, relations, and textual information. This approach enables the generation of prompts tailored to diverse tasks, ensuring both functional flexibility and structural consistency. Furthermore, a multi-granularity cross-modal aggregation method, which aggregates cross-modal information by enabling interaction between coarse-grained and fine-grained image features and textual features, is designed to enhance the model’s performance. Extensive experiments conducted on four datasets (FB15k-237-IMG, WN18-IMG, MNRE, and Twitter-2017) demonstrate that our model outperforms other SOTA methods in knowledge completion, achieving an average improvement of 9.8%, 1.6%, and 1.23% in MR, Hits@n, and F1, respectively. Our model not only offers a novel method for multimodal knowledge graph completion but also contributes valuable insights for the advancement of knowledge graph technologies.
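The MR and Hits@n figures above are standard link-prediction metrics computed from the rank that a model assigns to each correct entity. A minimal sketch, assuming 1-based ranks and hypothetical function names:

```python
def mean_rank(ranks):
    """Average rank of the correct entity; lower is better."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of test triples whose correct entity is ranked in the top n; higher is better."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for rank in ranks if rank <= n) / len(ranks)


example_ranks = [1, 3, 10, 2]  # made-up ranks, for illustration only
```

For the made-up ranks above, the mean rank is 4.0, Hits@1 is 0.25 and Hits@3 is 0.75. Note that an improvement in MR means the value goes down, whereas an improvement in Hits@n means it goes up.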

Abstract:
The point spread function (PSF) characterizes the intensity distribution of each object point and is widely used in areas such as defocus estimation, non-blind image deblurring, and computational imaging. Estimating a spatially varying PSF from a single image is a typical inverse problem, which is constrained by multiple factors such as sensor noise and semantic information interference. In this paper, we propose a polynomial fitting-based method to model spatially varying PSFs. With this method, we generate a large-scale, high-quality dataset with pixel-level annotations that can be used for training deep learning networks. To solve the task of estimating defocus maps from a single image, we design a novel high-resolution coefficient regression network that achieves accurate defocus estimation and concurrent estimation of multiple aberrations. To the best of our knowledge, this work presents the first attempt at spatially varying PSF estimation based on polynomial coefficient regression. Extensive experimental results show that our methodology attains state-of-the-art performance across numerous evaluation metrics, verifying its effectiveness. The dataset and code are available on GitHub: https://github.com/67689E4F/PSFNet.git

Abstract:
Current camera pose estimation methods primarily rely on high-contrast pixels and feature points in images, but in practical applications, parallel and perpendicular artificial structures are more common. This paper proposes a low-texture three-dimensional reconstruction system based on Manhattan axes and 2D/3D line features. The proposed method combines point features with commonly occurring line features in structural scenes to reduce inaccuracies in camera pose estimation caused by texture loss. By combining structural constraints, Manhattan axes, and reprojection error, the extraction of 3D lines is optimized and the robustness of camera pose estimation is improved. An improved truncated signed distance function (TSDF) with implicit representation and a truncated distance function is used to obtain a highly readable 3D reconstruction model. The results of pose estimation and reconstruction are compared with state-of-the-art methods on public datasets, and the proposed algorithm is shown to outperform other solutions in low-texture artificial areas.
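As background for the truncated signed distance function mentioned above, the sketch below shows the conventional formulation: the signed distance to the surface is divided by a truncation band and clamped to [-1, 1], and successive observations of a voxel are fused by a weighted running average. This is the textbook version, not the improved variant the abstract refers to, and the function names and weights are assumptions.

```python
def truncate_sdf(signed_distance, truncation):
    """Normalise a signed distance by the truncation band and clamp it to [-1, 1]."""
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    return max(-1.0, min(1.0, signed_distance / truncation))


def fuse(old_value, old_weight, new_value, new_weight=1.0):
    """Weighted running average of a voxel's TSDF value; returns (value, weight)."""
    total = old_weight + new_weight
    if total <= 0:
        raise ValueError("combined weight must be positive")
    return (old_value * old_weight + new_value * new_weight) / total, total
```

The reconstructed surface is then taken as the zero crossing of the fused values, which is why only distances near the surface (inside the truncation band) need to be stored precisely.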

Abstract:
Photometric stereo (PS) methods recover surface normals from appearance changes under varying light directions, excelling in tasks like 3D surface reconstruction and defect inspection. However, collecting the illumination images is expensive, and current PS methods cannot obtain a light direction set that satisfies a pre-defined accuracy constraint, limiting their adaptability to applications with varying accuracy requirements. To address this issue, we propose LAC-PS, a light direction selection policy under the accuracy constraint for photometric stereo, which optimizes the light direction set to meet the target reconstruction accuracy. In our method, we develop an accuracy assessment network that estimates reconstruction accuracy without ground truth. With this estimated accuracy, we put forward a reinforcement learning-based method that uses a policy to sequentially select light directions and obtain a set of light directions satisfying the desired PS recovery accuracy constraint. Experimental results on real and synthetic datasets demonstrate that our method effectively selects light directions that satisfy accuracy constraints.
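To make the underlying principle concrete, the sketch below recovers one pixel's surface normal in the classical Lambertian setting, where the observed intensity under light direction l is albedo times the dot product of l and the normal. It assumes exactly three linearly independent light directions, no shadows or specular highlights, and solves the 3x3 system by Cramer's rule; it illustrates the basic technique only and is unrelated to the light selection policy proposed in the abstract. All names are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def recover_normal(lights, intensities):
    """Solve L g = I for g = albedo * normal; return (albedo, unit normal)."""
    d = det3(lights)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    g = []
    for col in range(3):
        # Cramer's rule: replace one column of L with the intensity vector
        m = [list(row) for row in lights]
        for row in range(3):
            m[row][col] = intensities[row]
        g.append(det3(m) / d)
    albedo = math.sqrt(sum(c * c for c in g))
    if albedo == 0:
        raise ValueError("all intensities are zero; normal is undefined")
    return albedo, tuple(c / albedo for c in g)


# Three axis-aligned lights observing a surface facing +z with albedo 0.8
albedo, normal = recover_normal([(1, 0, 0), (0, 1, 0), (0, 0, 1)], [0.0, 0.0, 0.8])
```

With more than three lights the same system is usually solved by least squares, and the choice of light directions affects how well conditioned it is, which is one reason light selection matters.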

Abstract:
Meta-learning provides a promising solution to the issue of insufficient training samples in Pumi spectrogram recognition. However, capturing model uncertainty remains a critical challenge, particularly for tasks influenced by lexical ambiguities. To overcome this problem, we propose a novel method, Synthetic Gradient Optimization-Based Implicit Amortized Bayesian Meta-Learning (SGO-IABML), which captures model uncertainty by evaluating posterior distributions within a hierarchical Bayesian framework, thereby facilitating few-shot Pumi spectrogram recognition. Specifically, SGO-IABML reformulates meta-learning as a bi-level variational inference problem, leveraging information bottleneck principles. At the lower level, a generative inference module is developed to implicitly model task-specific variational posteriors, thereby enhancing the model’s expressiveness. Given the lack of analytical forms for implicit distributions, we derive the Fenchel-Bayesian Bound Theorem to measure the divergence between arbitrary distributions. For the meta-learning of variational parameters, SGO-IABML constructs a synthetic gradient optimizer, integrating prior gradient information to facilitate rapid adaptation to new tasks. At the upper level, the model is calibrated by estimating the local geometry of the posterior distribution, utilizing the Generalized Gauss-Newton Matrix to capture the directional sensitivity of the loss function. Comprehensive experimental results on Pumi spectrograms demonstrate that SGO-IABML achieves state-of-the-art performance in generalization, calibration, expressiveness, versatility, and cross-domain adaptability. Furthermore, ablation studies confirm the contribution of each component to the overall performance improvement.

Affiliations: Chinese Academy of Sciences, Institute of Computing Technology, Beijing, China; School of Electrical Engineering and Computer Science, The University of Queensland (UQ), Brisbane, Australia; resides, Changsha, China; Cooperative MediaNet Innovation Center, Shanghai Jiao Tong University, Shanghai, China; DSLAB, School of Information Science and Engineering, Lanzhou University, Lanzhou, China; Chinese Academy of Agricultural Sciences, Agricultural Information Institute, Beijing, China

Abstract:
Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, existing approaches neglect the problem of deformation instability caused by easily overlooked edge-skipping, which can lead to matching distortions and leaves room for further improvement. To fill this gap, we propose SED-MVS, which adopts panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance for patch deformation, followed by a multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid the potential inaccuracy of random initialization, we combine sparse points from LoFTR with a monocular depth map from DepthAnything V2 to restore a reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate the segmentation image with the monocular depth map to exploit inter-instance occlusion relationships, then treat them as an occlusion map to implement two distinct edge constraints, thereby facilitating occlusion-aware patch deformation. Extensive results on the ETH3D, Tanks & Temples, BlendedMVS, Strecha and DL3DV-10K datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Abstract:
Establishing local feature matches between image pairs serves as a fundamental component of many vision tasks, such as visual localization and Structure from Motion (SfM). Recently, detector-free techniques equipped with the transformer have exhibited exceptional performance. Theoretically, optimizing the transformer architecture and stacking more transformer blocks could emphasize crucial features and filter out extraneous information by progressively narrowing the effective perception regions of the network within images, thereby enhancing the matching performance. Nevertheless, this paradigm results in a linear escalation of model size with respect to the number of blocks. In this study, we introduce VD-Matcher to address this issue. A principal innovation of VD-Matcher is a weight recycling technique (WRT) that enables partial weights to be reused across successive transformer blocks, along with specific transformations designed to sufficiently enhance feature representations. This approach enables VD-Matcher to construct a deep transformer architecture for accurate local feature matching while maintaining a manageable parameter size. Furthermore, we propose a lightweight multi-scale keypoint detection module that captures representative keypoints to replace all keypoints for compact global information aggregation within and across images, which reduces the computational overhead induced by excessively deep transformer layers while alleviating redundant information propagation to a certain extent. Extensive experiments verify that VD-Matcher exceeds state-of-the-art algorithms on multiple benchmarks while using fewer parameters. The source code is available at https://github.com/mooncake199809/VD-Matcher

Abstract:
3D human pose estimation (3DHPE) from a single monocular RGB image is fundamental in many image-related fields, such as virtual reality, motion analysis, and human-computer interaction. To improve estimation accuracy, existing works typically integrate complex networks or divide monocular 3DHPE into multiple stages. However, complicating the estimation process to improve the estimation accuracy sacrifices the estimation speed and limits its application. To alleviate this, we propose AITEPose, an end-to-end model, which achieves higher monocular 3DHPE accuracy with a simpler model structure. Specifically, inspired by online knowledge distillation, we design an Auxiliary-Information-Driven Training Enhancement (AITE) framework. In the AITE framework, during training, an adjustment network is introduced between the prediction network and the loss function to incorporate auxiliary information and enhance the training process. Notably, the adjustment network is constructed by developing a novel cascaded Disturbance-Correction Module (DCM). It adjusts the poses to get more accurate results based on ground-truth bone lengths. Both AITE and DCM are employed only during training, thereby improving training outcomes without complicating the inference process. The AITEPose model achieves state-of-the-art performance for single-frame monocular 3DHPE on the most comprehensive dataset Human3.6M. To further validate the effectiveness of AITE and DCM, we design a monocular 2DHPE model, AITEPose2D, and conduct extensive ablation experiments on the COCO2017 dataset, demonstrating the robustness and generalizability of our proposed AITEPose.

Abstract:
No-reference image quality assessment (NR-IQA), which functions without the need for a reference image, is a challenging yet essential task in various image processing systems and downstream vision applications, ranging from semantic recognition to image enhancement. Traditionally, numerous NR-IQA models have been developed using supervised learning methodologies, which rely heavily on the availability and quality of ground truth data. To improve the generalization capability and robustness of these models, recent studies have explored the application of contrastive learning, aiming to enhance the quality representation capacity of model backbones through a self-supervised approach. However, the training process for contrastive learning is computationally intensive, posing significant challenges in resource-constrained environments. To mitigate this issue, we propose a Lightweight Contrastive-learning-based IQA (LCIQA) framework, designed to be efficiently trained on a single GPU without relying on ground truth data. This framework maintains a fixed vision backbone and focuses on optimizing the parameters of subsequent IQA heads through contrastive learning. To accommodate a lightweight framework, we incorporate a quality task adapter to eliminate semantic biases introduced by the features extracted from the fixed-parameter backbone. A coarse-to-fine contrastive learning strategy is then employed to train the quality regression module. Extensive experiments demonstrate the superior performance of our model in terms of both accuracy and complexity. In addition, ablation studies validate the effectiveness of each component within the proposed framework.

Abstract:
Animated humans (AHs) have gained popularity due to their vivid appearance and smooth, natural movements. Various animation methods based on artificial intelligence (AI) have been introduced, which are viewed as “Imitators,” offering new solutions for designing AHs. However, the effectiveness of these AI-generated AHs varies significantly across different categories and within the same category, leading to visual distortions that adversely affect the viewer’s experience. Consequently, it is essential to evaluate the quality of AHs to provide reliable and objective indicators for their further development and to ensure the delivery of higher-quality AH videos to users. In this paper, the first Animated Human Quality Assessment (AHQA) dataset is constructed by selecting 6 advanced and popular imitators and 10 common actions to animate 20 AI-generated characters. The constructed dataset integrates character images of different genders and age groups, and two types of poses, standing and sitting, are selected, highlighting the comprehensiveness and diversity of the AHQA dataset. Subjective experiments reveal significant differences in the quality of AHs produced by different imitators. Finally, we propose a quality assessment method, VIP-QA, incorporating Video quality, Identity consistency, and Posture similarity for the AHQA dataset. Experimental results show that VIP-QA significantly outperforms existing assessment methods on multiple datasets by about 5%, more closely approximates human visual perception, and provides a valid objective metric for assessing imitators. All the work in this paper has been released at https://github.com/zyj-2000/Imitator.

Abstract:
Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). In addition, we evaluate on our novel dataset, which consists of 100 pairs of natural photos and historical paintings, achieving an FID of 219.05 and an SI-FID of 7.94. Our source code is available at https://github.com/satoshi-kosugi/powerful-attention

Abstract:
Real-world datasets often suffer from both noisy labels and imbalanced class distribution, presenting significant challenges for the effective deployment of deep neural networks (DNNs). Existing studies typically address these challenges separately and struggle to perform effectively when they occur simultaneously. In this paper, we introduce an unbiased Sample Selection method based on the Graph Attention Network (GAT), namely GSS. GSS can effectively divide the training set into clean and noisy subsets while avoiding sample selection bias by analyzing the intrinsic relationships between the training set and a small clean validation set. For the clean subset, we propose an Adaptive Label Refinement (ALR) strategy to improve the reliability of the labels within the clean subset. ALR dynamically integrates the network’s predictions with the given labels, mitigating the adverse impacts of misidentification. For the noisy subset, we introduce a Class-Balanced Pseudo Labeling (CBPL) method. CBPL addresses the cognitive bias in model predictions caused by class imbalance by integrating class distribution information into the pseudo-label generation process, resulting in more accurate pseudo-labels. Comprehensive evaluations on both synthetic and real-world datasets highlight the effectiveness and superiority of our approach, especially in scenarios characterized by noisy labels and imbalanced class distributions.
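
The two labeling steps described above can be illustrated with a minimal Python sketch. The blending rule in `refine_label`, the prior-division rule in `balanced_pseudo_label`, and every name and parameter here are illustrative assumptions; the abstract does not give the formulas actually used by ALR and CBPL.

```python
def refine_label(given_onehot, predicted_probs, trust):
    """Blend a given one-hot label with the network's prediction.

    `trust` in [0, 1] is a hypothetical weight on the prediction. The
    abstract only says the integration is dynamic, so how this weight
    would be scheduled is left open.
    """
    return [(1.0 - trust) * g + trust * p
            for g, p in zip(given_onehot, predicted_probs)]


def balanced_pseudo_label(predicted_probs, class_prior):
    """Choose a pseudo-label after discounting frequent classes.

    Dividing by the class prior is a common debiasing heuristic, used here
    only as a stand-in for the CBPL rule.
    """
    adjusted = [p / q for p, q in zip(predicted_probs, class_prior)]
    total = sum(adjusted)
    adjusted = [a / total for a in adjusted]
    label = max(range(len(adjusted)), key=adjusted.__getitem__)
    return label, adjusted
```

With a prior of `[0.9, 0.1]`, a prediction of `[0.55, 0.45]` is relabeled toward the rare class, which is the kind of correction a class-balanced scheme aims for.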

Abstract:
Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle with complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction and optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames, which enforces consistency between poses in order to escape the local minima that arise when poses are constrained only by consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojection error constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction, CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy.

Abstract:
Localized neural implicit representation methods have recently been proven effective for shape reconstruction. However, while some recent neural implicit representation-based approaches have investigated part awareness, there is still room for improvement in leveraging the rich geometry information contained in parts, which is crucial for accurate reconstruction. This study aims to enhance the accuracy of shape reconstruction by incorporating part awareness. This principle faces a fundamental technical challenge: manually defining parts across various categories is ambiguous and expensive. To address it, we propose a new self-supervised learning paradigm that automatically discovers meaningful parts. Our proposed paradigm has several prominent advantages compared with prior art: 1) It allows masked part modeling that scales well with available data; 2) It is a flexible formulation that allows a variable number of parts; 3) It allows the fusion of multi-scale (global-level and part-level) features at an arbitrarily given coordinate; 4) The semantic consistency of learned parts leads to transferable features. Extensive experiments validate our approach, named Masked PaCONet, showcasing its superiority in qualitative and quantitative results on public benchmarks, even under challenging settings. Code and models will be released.

Abstract:
Highly accurate 3D object detection is critical for autonomous driving and robotic sensing systems. However, objects with few foreground points significantly affect the accuracy of 3D object detection. As the network depth increases, the low-level features of these objects are gradually lost, especially for hard objects. Due to this issue, current LiDAR-only and multimodal methods often misclassify background as foreground. Therefore, how to leverage the low-level features that contain information about these objects in the high layers of the network becomes the key to addressing the issue. In this paper, we propose LDFCDet, a framework boosting 3D object detectors with low-high level feature crosses using the Laplace distribution (LD). In our proposed method, we design a low-high level feature crosses module (LHFCM) to embed low-level features into high-level features in the deeper layers of the network, and use the Laplace distribution to obtain a new low-high level feature that includes information about these objects with few foreground points. In addition, we propose a res-gated feature aggregation module (RGFAM) to fuse the multi-scale features. Our approach is well-suited for both LiDAR-based and multimodal methods. We evaluate LDFCDet on the widely used KITTI dataset, and our method outperforms almost all current 3D object detection methods on the challenging KITTI test set. Moreover, we conducted comparative experiments on the ONCE dataset, and the results further demonstrate the effectiveness and superiority of our method.

Abstract:
Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotations. However, it currently faces suboptimal solutions due to challenges in encoding the intricate, context-sensitive characteristics of sign language videos, mainly struggling to discern essential sign features using a non-monotonic video-text alignment strategy. Therefore, we introduce an innovative pretraining paradigm for gloss-free SLRL, called C2RL, in this paper. Specifically, rather than merely incorporating a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that the joint optimization of ICL and ECL results in robust sign language representation and significant performance gains in gloss-free SLT and SLRet tasks. Notably, C2RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.

Abstract:
Automating video and radar spatial registration without sensor layout constraints is crucial for enhancing the flexibility of perception systems. However, this remains challenging due to the lack of effective approaches for constructing and utilizing matching information between heterogeneous sensors. Existing methods rely on human intervention or prior knowledge, making it difficult to achieve true automation. Consequently, establishing a registration model that automatically extracts matching information from heterogeneous sensor data remains a key challenge. To address these issues, we propose a novel Video-Radar Automatic Registration (VRAR) method based on vehicle trajectory spatiotemporal feature encoding and a bidirectional mapping network. We first establish a unified representation for heterogeneous sensor data by encoding spatiotemporal features of vehicle trajectories. Based on this, we automatically extract a large number of high-quality matching points from synchronized trajectory pairs using a frame synchronization strategy. Subsequently, we utilize the proposed Video-Radar Bidirectional Mapping Network to process these matching points. This network learns the bidirectional mapping between the two sensor modalities, extending the alignment from discrete local observation points to the entire observable space. Experimental results demonstrate that the VRAR method exhibits significant performance advantages in various traffic scenarios, verifying its effectiveness and generalizability. This capability of automated and adaptive registration highlights the method’s potential for broader applications in heterogeneous sensor integration.

Affiliations: College of Railway Transportation, Hunan University of Technology, Zhuzhou, China; School of Robotics, Hunan University, Changsha, China; School of Electrical Engineering, Guangxi University, Nanning, China; Hubei Key Laboratory of Intelligent Geo-Information Processing, School of Computer Science, China University of Geosciences, Wuhan, China; Department of Technology of Computers and Communications, Escuela Politécnica, Hyperspectral Computing Laboratory, University of Extremadura, Cáceres, Spain

Abstract:
Processing high-dimensional data cubes and developing high-performance classifiers are core objectives in the field of hyperspectral image classification (HSIC). Superpixel-based methods are widely used in HSIC due to their efficacy in reducing redundant information and enhancing local features. However, imprecise segmentation, especially in complex structures and textures of hyperspectral images (HSIs), may lead to inconsistencies in the regions extracted by superpixels and the boundaries between different ground objects. Such inconsistencies significantly degrade the classification performance of HSIs. Alternatively, when parameter settings are inaccurate, edge-aware feature extraction methods often introduce sharpening artifacts at the image boundaries, resulting in a decrease in classification accuracy. To effectively address these challenges, we propose a novel probabilistic fusion method for HSIC. This method consists of the following stages. First, spatial information is extracted by a multiscale superpixel segmentation method and then probabilistically optimized by the extended random walk (ERW) method. Next, semantic-aware structural features (S2Fs) are extracted along with edge information of different objects. Lastly, a probabilistic framework is proposed to fuse the class probabilities of superpixel-based spatial information and semantic-aware structural features. Experimental results on three real datasets show state-of-the-art classification performance, even with limited training sets.

Abstract:
Image enhancement plays a crucial role in computer vision by improving visual quality while minimizing distortion. Traditional methods enhance images through pixel value transformations, yet they often introduce new distortions. Recent advancements in deep learning-based techniques promise better results but challenge the preservation of image fidelity. Therefore, it is essential to evaluate the visual quality of enhanced images. However, existing quality assessment methods frequently encounter difficulties due to the unique distortions introduced by these enhancements, thereby restricting their effectiveness. To address these challenges, this paper proposes a novel blind image quality assessment (BIQA) method for enhanced natural images, termed multi-scale local feature fusion and global feature representation-based quality assessment (MLGQA). This model integrates three key components: a multi-scale Feature Attention Mechanism (FAM) for local feature extraction, a Local Feature Fusion (LFF) module for cross-scale feature synthesis, and a Global Feature Representation (GFR) module using Vision Transformers to capture global perceptual attributes. This synergistic framework effectively captures both fine-grained local distortions and broader global features that collectively define the visual quality of enhanced images. Furthermore, in the absence of a dedicated benchmark for enhanced natural images, we design the Natural Image Enhancement Database (NIED), a large-scale dataset consisting of 8,581 original images and 102,972 enhanced natural images generated through a wide array of traditional and deep learning-based enhancement techniques. Extensive experiments on NIED demonstrate that the proposed MLGQA model significantly outperforms current state-of-the-art BIQA methods in terms of both prediction accuracy and robustness.

Abstract:
Dynamic reconstruction technology presents significant promise for applications in visual and interactive fields. Current techniques utilizing 3D Gaussian Splatting show favorable results and fast reconstruction speed. However, as the scene expands, using individual Gaussian structures 1) leads to instability in large-scale dynamic reconstruction, marked by abrupt deformation, and 2) suffers significant redundancy from the heuristic densification of individual Gaussians. To tackle these issues, we propose a joint-based Gaussian representation method named FRPGS, which learns the global information and the deformation using center Gaussians and generates neural Gaussians around them for local detail. Specifically, FRPGS employs center Gaussians initialized from point clouds, which are learned with a deformation field for representing global relationships and dynamic motion over time. Then, for each center Gaussian, attribute networks generate neural Gaussians that move as driven by their linked center Gaussian, thereby ensuring structural integrity during movement within this joint-based representation. Finally, to reduce Gaussian redundancy, a densification strategy is developed based on the average cumulative gradient of the associated neural Gaussians, imposing strict limits on the growth of center Gaussians without compromising accuracy. Additionally, we established a large-scale dynamic indoor dataset at the MuLong Laboratory of ZTE Corporation. Evaluations demonstrate that FRPGS significantly outperforms state-of-the-art methods in both training efficiency and reconstruction quality, achieving over a 50% (up to 74%) improvement in efficiency on an RTX 4090. FRPGS also supports simultaneous 4K-resolution reconstruction of 60 frames.

Affiliations: Faculty of Applied Sciences, Macao Polytechnic University, Macau, China; Department of Medical Ultrasonics, Institute of Diagnostic and Interventional Ultrasound, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Department of Cardiology, Conde de São Januário General Hospital, Macao Health Bureau, Macau, SAR, China; School of Automation, Guangxi University of Science and Technology, Liuzhou, China; School of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China

Abstract:
In echocardiography, accurate segmentation of the left ventricle at end-diastole (ED) and end-systole (ES) is crucial for quantitative assessment of left ventricular ejection fraction. However, echocardiography is a dynamic imaging modality that requires real-time analysis and is frequently performed in various clinical settings with portable devices. This challenges mainstream approaches, which primarily enhance model performance by increasing the number of parameters and computational costs while lacking targeted optimization for the modality's characteristics. To address these challenges, we propose BLENet, a lightweight segmentation model inspired by biological vision mechanisms. By integrating key mechanisms from biological vision systems with medical image features, our model achieves efficient and accurate segmentation. Specifically, the center-surround antagonism of retinal ganglion cells and the lateral geniculate nucleus exhibits high sensitivity to contrast variations, corresponding to the distinct contrast between the ventricular chamber (hypoechoic) and myocardial wall (hyperechoic) in ultrasound images. Based on this, we designed an antagonistic module to enhance feature extraction in target regions. Subsequently, the directional selectivity mechanism in the V1 cortex aligns with the variable directional features of the ventricular boundary, inspiring our direction-selective module to improve segmentation accuracy. Finally, we introduce an adaptive wavelet fusion module in the decoding network to address the limited receptive field of convolutions and enhance feature integration in cardiac ultrasound. Experiments demonstrate that our model contains only 0.16M parameters and requires no pre-training. On the CAMUS dataset, it achieves Dice coefficient values of 0.951 and 0.927 for ED and ES phases respectively, while on the EchoNet-Dynamic dataset, it achieves 0.933 and 0.909, with an inference speed of 112 FPS on NVIDIA RTX 2080 Ti. Evaluation on an external clinical dataset indicates our model’s promising generalization and potential for clinical application.
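
For readers unfamiliar with the two quantities named in this abstract, the following Python sketch shows the standard definitions of the Dice coefficient and the ejection fraction. The function names are illustrative, the masks are plain nested lists of 0/1, and the volumes passed to `ejection_fraction` are assumed to have been derived from the ED and ES segmentations by some separate geometric method that the abstract does not specify.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    intersection = sum(p * t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv, esv):
    """Left ventricular ejection fraction in percent.

    edv and esv are the end-diastolic and end-systolic volumes in the
    same unit (for example millilitres).
    """
    return 100.0 * (edv - esv) / edv
```

A Dice value of 1.0 means the predicted and reference masks coincide exactly, so the reported values above 0.9 indicate a large overlap between prediction and annotation.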

Affiliations: School of Aerospace Engineering, Beijing Institute of Technology, Beijing, China; College of Computing and Data Science, Nanyang Technological University, Jurong West, Singapore; State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an, China; School of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou, China; National Key Laboratory of Land and Air Based Information Perception and Control, Beijing Institute of Technology, Beijing, China

Abstract:
Cross-view geo-localization provides an offline visual positioning strategy for unmanned aerial vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, it still faces the following challenges, leading to suboptimal localization performance: 1) Existing methods primarily focus on extracting global features or local features by partitioning feature maps, neglecting the exploration of spatial information, which is essential for extracting consistent feature representations and aligning images of identical targets across different views. 2) Cross-view geo-localization encounters the challenge of data imbalance between UAV and satellite images. To address these challenges, the Spatial Hybrid Attention Network with Adaptive Cross-Entropy Loss Function (SHAA) is proposed. To tackle the first issue, the Spatial Hybrid Attention (SHA) method employs a Spatial Shift-MLP (SSM) to focus on the spatial geometric correspondences in feature maps across different views, extracting both global features and fine-grained features. Additionally, the SHA method utilizes a Hybrid Attention (HA) mechanism to enhance feature extraction diversity and robustness by capturing interactions between spatial and channel dimensions, thereby extracting consistent cross-view features and aligning images. For the second challenge, the Adaptive Cross-Entropy (ACE) loss function incorporates adaptive weights to emphasize hard samples, alleviating data imbalance issues and improving training effectiveness. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and DenseUAV, demonstrate that SHAA achieves state-of-the-art performance, outperforming existing methods by over 3.92%. Code will be released at: https://github.com/chennanhua001/SHAA.
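
The idea of weighting the cross-entropy so that hard samples count more can be sketched as follows. This uses a focal-loss-style weight `(1 - p) ** gamma` purely as a stand-in; the abstract does not state how the ACE loss computes its adaptive weights, and the function name and `gamma` parameter are assumptions.

```python
import math


def weighted_cross_entropy(probs, target, gamma=2.0):
    """Cross-entropy for one sample, scaled up for hard samples.

    probs  : predicted class probabilities (must be > 0 at `target`)
    target : index of the true class
    gamma  : hypothetical focusing strength; gamma = 0 gives plain
             cross-entropy
    """
    p_true = probs[target]
    weight = (1.0 - p_true) ** gamma  # near 0 for easy, near 1 for hard
    return -weight * math.log(p_true)
```

Under this weighting, a sample predicted with probability 0.9 contributes far less to the loss than one predicted with probability 0.1, so training effort shifts toward the under-fitted examples.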

Abstract:
Recently, the classical tensor-tensor product (T-product) has attracted considerable attention for capturing the interactions between tensor factors. However, the mode-3 consistency in the T-product restricts its flexibility and expressive ability. To break the restriction, we suggest a tensor-tensor product for arbitrary mode-3 dimension (termed Art-product), which enables us to flexibly and expressively capture the interactions between tensor factors. Concretely, by leveraging exclusive hierarchical nonlinear transforms along the third mode, two tensor factors with inconsistent dimensions are first transformed into corresponding latent factors with consistent dimensions. The face-wise product is then performed between these latent factors. Empowered with this Art-product, we can readily deconstruct and reconstruct new tensor network decompositions from an interaction perspective. As a representative example, we redesign the tensor train decomposition so that it can benefit from the advantages of the Art-product. Extensive experiments on multi-spectral images, color videos, and light field data support the superiority of tensor train decomposition equipped with the Art-product over classic tensor decomposition.
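
The two-step construction described above (transform along mode 3 to a common latent size, then multiply frontal slices) can be sketched in plain Python. Tensors are nested lists indexed `[i][j][k]`. The single linear map followed by `tanh` in `transform_mode3` is an assumed placeholder for the hierarchical nonlinear transforms, and all names are illustrative.

```python
import math


def transform_mode3(tensor, weights):
    """Map every mode-3 tube of length p to a latent tube of length r.

    weights is an r x p matrix. One linear map plus tanh is an assumed
    stand-in for the transforms used in the paper.
    """
    return [[[math.tanh(sum(w * x for w, x in zip(w_row, tube)))
              for w_row in weights]
             for tube in row_of_tubes]
            for row_of_tubes in tensor]


def facewise_product(a, b):
    """Face-wise product: C[:, :, k] = A[:, :, k] @ B[:, :, k].

    a has shape n1 x n2 x r and b has shape n2 x n3 x r.
    """
    n1, n2, n3, r = len(a), len(b), len(b[0]), len(a[0][0])
    return [[[sum(a[i][l][k] * b[l][j][k] for l in range(n2))
              for k in range(r)]
             for j in range(n3)]
            for i in range(n1)]
```

For example, a 2 x 2 x 3 factor and a 2 x 1 x 2 factor cannot be multiplied face-wise directly, but after mapping both to a latent mode-3 size of 2 the product is defined and has shape 2 x 1 x 2.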

Affiliations: School of Remote Sensing and Information Engineering and the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; School of Remote Sensing and Information Engineering and the Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan, China; School of Cyber Science and Engineering and the Engineering Research Center of Blockchain Application, Supervision and Management, Southeast University, Nanjing, China; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing and the Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan, China

Abstract:
Boundary point detection aims to outline the external contour structure of clusters and enhance the inter-cluster discrimination, thus bolstering the performance of the downstream classification and clustering tasks. However, existing boundary point detectors are sensitive to density heterogeneity or cannot identify boundary points in concave structures and high-dimensional manifolds. In this work, we propose a robust and efficient boundary point detection method based on Local Direction Dispersion (LoDD). The core of boundary point detection lies in measuring the difference between boundary points and internal points. It is a common observation that an internal point is surrounded by its neighbors in all directions, while the neighbors of a boundary point tend to be distributed only in a certain directional range. Based on this observation, we adopt the density-independent K-Nearest Neighbors (KNN) method to determine neighboring points and design a centrality metric, LoDD, that uses the eigenvalues of the covariance matrix to depict the distribution uniformity of the KNN. We also develop a grid-structure assumption of data distribution to determine the parameters adaptively. The effectiveness of LoDD is demonstrated on synthetic datasets, real-world benchmarks, and two applications: training set splitting for deep learning models and hole detection on point cloud data. The datasets and toolkit are available at: https://github.com/ZPGuiGroupWhu/lodd.
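
The observation that interior points have neighbors in all directions while boundary points do not can be made concrete with a small NumPy sketch. This is an illustration of the idea only, not the LoDD metric, whose exact form the abstract does not give. The assumed score is the sum of the eigenvalues of the covariance of unit direction vectors toward the k nearest neighbors, and the function name `direction_dispersion` is hypothetical.

```python
import numpy as np

def direction_dispersion(points, k=8):
    """Return one score per point: close to 1 when the KNN directions surround
    the point, smaller when they fall in a limited directional range.

    Assumption: the score is the total variance (sum of covariance eigenvalues)
    of the unit vectors pointing at the k nearest neighbors.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # Brute-force pairwise distances (adequate for small illustrative inputs).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    knn_idx = np.argsort(dist, axis=1)[:, :k]

    scores = np.empty(n)
    for i in range(n):
        vecs = points[knn_idx[i]] - points[i]
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        units = vecs / np.maximum(norms, 1e-12)
        cov = np.cov(units, rowvar=False, bias=True)
        scores[i] = np.linalg.eigvalsh(cov).sum()
    return scores

# A 7x7 grid: the center is an interior point, index 0 is a corner.
grid = np.array([[x, y] for x in range(7) for y in range(7)])
scores = direction_dispersion(grid, k=8)
center, corner = scores[3 * 7 + 3], scores[0]
print(center > corner)  # True
```

Because only directions are used (the neighbor vectors are normalized), the score does not depend on how far away the neighbors are, which is one way to stay insensitive to density differences.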

Abstract:
Multi-modal image fusion aims to amalgamate pivotal information from various sensor sources to provide informative visual representation in imaging scenes. Rapid and precise fusion of images is crucial for practical applications in fields such as autonomous driving and medical diagnostics. However, the primary challenge lies in balancing computational costs with the effectiveness of feature extraction, while ensuring the robust integration of salient features across modalities. This paper introduces WaveFusion, a wavelet vision transformer equipped with an advanced saliency-guided loss strategy to optimize multi-modal image fusion. First, to provide a comprehensive and efficient representation of multi-modal data, we introduce an adaptive wavelet transform module for feature decomposition and reconstruction. Following this, self-attention mechanisms and convolutional networks are applied in parallel to process low-frequency and high-frequency components, resulting in a wavelet-enhanced vision transformer. Second, WaveFusion utilizes a dual-aggregation attention approach that improves cross-modal feature complementarity and intra-modal feature coherence within a single fusion module. Furthermore, we propose a dynamic saliency-informed selective loss function to refine the optimization process, with the objective of enhancing critical feature retention and maintaining overall image consistency across fusion scenarios. The efficacy and versatility of our method are validated in both infrared-visible fusion and medical image fusion tasks. Experimental results demonstrate that WaveFusion provides a superior balanced approach that optimizes both fusion performance and cost-efficiency, and additionally improves performance in downstream tasks such as multi-modal semantic segmentation and object detection.

Abstract:
Deep neural network (DNN)-based watermarking algorithms have made significant strides in recent years. However, existing methods either demand substantial resources for image feature extraction during watermark embedding, sacrificing efficiency, or completely neglect image texture information, resulting in suboptimal performance. Moreover, current algorithms struggle with real-time watermark extraction. To address these limitations, we propose a lightweight conditional residual watermarking (CReW) architecture. Specifically, CReW employs a Conditional Generative Adversarial Network (CGAN) framework to generate an adaptive residual image guided by the structure of the cover image, which is decoupled from the network to reduce computational complexity. This design enables CReW to achieve an optimal balance between performance and efficiency. Additionally, by directly optimizing the residual image to capture variations in watermark behavior under distortion, CReW significantly enhances robustness. Furthermore, we design redundancy coding blocks to increase the mutual information of the watermark, along with a patch-level discriminator to improve local patch discrimination, thereby further enhancing image quality. Finally, by reducing channel redundancy and leveraging FasterNet, we develop a low-complexity network architecture, FasterCReW, which facilitates real-time watermark embedding and extraction. Extensive experimental results demonstrate that, despite having 36× fewer network parameters and 30× fewer floating point operations (FLOPs) than Adaptor, FasterCReW exhibits excellent robustness against distortions such as cropout, JPEG compression, and Gaussian noise. Furthermore, FasterCReW significantly outperforms other existing DNN-based watermarking algorithms in terms of running speed, achieving an 8× speed increase over UDH and a 28× increase over Adaptor on an Intel Core i7-8750H CPU.

Abstract:
Noisy label learning (NLL) in open-world scenarios poses a novel challenge due to the presence of noisy data from both known and unknown classes. Most existing methods operate under the closed-set assumption, rendering them vulnerable to open-set noise, which significantly degrades their performance. While some approaches attempt to mitigate the impact of open-set examples, they struggle to learn effective discriminative representations for them, leading to unsatisfactory recognition performance. To address these issues, we propose a unified Open Adapter (OpenAda) that identifies open-set noise from both data-centric and learning-based perspectives, and can be easily integrated into mainstream NLL methods to improve their performance and robustness. Specifically, the data-centric part leverages label clusterability to sequentially identify basic clean and basic open-set examples both with high neighbor agreement. The learning-based part integrates one-vs-all classifiers with a progressive open disambiguation strategy to learn a reliable “inlier vs. outlier” boundary for each class. This enables the model to detect challenging open-set examples that partially overlap in the representation space with closed-set ones. Extensive experiments on synthetic and real-world datasets validate the superiority of our approach. Notably, with minor modifications, DivideMix with OpenAda achieves performance improvements of 9.31% and 18.26% on the open-world CIFAR-80 dataset under 80% symmetric noise and 40% asymmetric noise. The code is available at https://github.com/chenchenzong/OpenAda.

Affiliations: Key Laboratory of Network Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou, China; New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China; Fujian Key Laboratory for Intelligent Processing and Wireless Transmission of Media Information, College of Physics and Information Engineering, Fuzhou University, Fuzhou, China; School of Electrical and Data Engineering, University of Technology Sydney, Ultimo, NSW, Australia; AI Research Center, SDIC Intelligence Xiamen Information Company Ltd., Xiamen, China

Abstract:
Multi-label Pedestrian Attribute Recognition (PAR) involves identifying a series of semantic attributes in person images. Existing PAR solutions typically rely on CNNs as the backbone network to extract pedestrian features. Unfortunately, CNNs process only one adjacent region at a time, resulting in the disappearance of long-range relations between different attribute-specific regions. To address this limitation, we adopt the Vision Transformer (ViT) instead of a CNN as the backbone for PAR, aiming to build long-range relations and extract more robust features. However, PAR suffers from an inherent attribute imbalance issue, causing ViT to naturally focus more on attributes that appear frequently in the training set and ignore some pedestrian attributes that appear less often. The native features extracted by ViT cannot tolerate this imbalanced attribute distribution. To tackle this issue, we propose a novel component and a dual-level loss: the Selective Feature Activation Method (SFAM), the Orthogonal Feature Activation Loss (OFALoss), and the Orthogonal Weight Regularization Loss (OWRLoss). SFAM suppresses the more informative attribute-specific features, thus compelling the PAR model to pay greater attention to attribute-specific regions that are often overlooked. The proposed OFALoss enforces an orthogonal constraint on the original features extracted by ViT and the suppressed features from SFAM, promoting the comprehensiveness of feature representation in each attribute-specific region. Furthermore, OWRLoss is employed to decrease correlations among entries of the last shared classification layer, which alleviates the high correlation among weight vectors caused by the non-uniform distribution. This prevents excessive mutual interference among different attributes during attribute recognition. Our model-agnostic approach is plug-and-play, requiring no additional training parameters in the training process. We conduct experiments on several benchmark PAR datasets, including PETA, PA100K, RAPv1, and RAPv2, demonstrating the effectiveness of our method. Specifically, our method outperforms existing state-of-the-art approaches.
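
The idea of decorrelating the weight vectors of a shared classification layer can be sketched with a common soft-orthogonality penalty. The abstract does not give the exact form of OWRLoss, so the formula below (squared off-diagonal entries of the Gram matrix of row-normalized weights) is an assumption, and `orthogonal_weight_penalty` is a hypothetical name.

```python
import numpy as np

def orthogonal_weight_penalty(weight):
    """Penalize correlation between per-attribute weight vectors.

    weight: array of shape (num_attributes, feature_dim), one row per attribute.
    Assumption: rows are L2-normalized and the penalty is the squared
    Frobenius norm of (W W^T - I), i.e. the sum of squared pairwise cosines.
    """
    w = np.asarray(weight, dtype=float)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    w_unit = w / np.maximum(norms, 1e-12)
    gram = w_unit @ w_unit.T                  # pairwise cosine similarities
    off_diag = gram - np.eye(w.shape[0])      # diagonal becomes zero
    return float(np.sum(off_diag ** 2))

print(orthogonal_weight_penalty(np.eye(3)))                   # 0.0
print(orthogonal_weight_penalty([[1.0, 0.0], [1.0, 0.0]]))    # 2.0
```

In training, a term like this would be added to the classification loss with a small coefficient. It introduces no new parameters, which is consistent with the plug-and-play claim above.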

Abstract:
Introducing saliency information to mitigate perceptual redundancy and achieve superior compression represents a novel approach to the development of video compression. Existing saliency-based compression coding methods rely on the determination of saliency regions and focus too much on saliency regions while ignoring the perceptible distortion in non-saliency regions. We propose a spatiotemporal visual perceptual rate-distortion optimization (PRDO) algorithm for Versatile Video Coding (H.266/VVC) that is more in line with the human visual system (HVS). Firstly, we establish a linear weighted distortion model based on spatiotemporal and saliency features. The distortion model makes effective use of saliency features while considering image content in non-saliency regions that is still perceptible to the human eye, thereby achieving an overall visual effect that conforms to human subjective perception. Based on this distortion model, we propose a saliency adaptive quantization parameter (SAQP) selection method with a more flexible quantization parameter selection range, adaptively allocating the optimal coding unit quantization parameter according to the saliency regions of the image, ensuring a balanced bitrate allocation between saliency and non-saliency regions. The proposed method is implemented for the first time on the H.266/VVC coding standard, attaining an average bitrate saving of 19.9% across all test sequences and an average PSNR improvement of 2.65 dB in saliency regions compared to VTM16.0. The BD-EWPSNR of the proposed PRDO and SAQP method improves by 1.34 dB and 1.45 dB in the All-Intra and Lowdelay_P encoding modes, respectively. Additionally, the BD-Rate based on EWPSNR is reduced by 25.86% and 33.73%, respectively, with an overall compression coding time saving of 19.76%. 
The experimental results demonstrate that the proposed method can significantly reduce the bit rate and coding time while improving the subjective perceptive quality, providing a competitive solution for video compression coding.

Abstract:
With the breakthrough of transfer learning and meta-learning, cross-domain few-shot hyperspectral image classification (CDFSL HSIC) technology has recently achieved satisfactory performance under limited annotations. Nevertheless, the most practical applications are zero-shot scenarios, which are intractable for CDFSL technology, such as the extraterrestrial detection scene, where unexplored objects are recognized by scientists to be more valuable for research. To conquer the zero-shot problem under domain shift, a two-stage contrastive MLP network (MAC-CDZS) is proposed, which constitutes a pioneering effort in the cross-domain zero-shot (CDZS) HSIC task. Firstly, given the remarkable performance of MLPs within a diminutive model size and their enhanced capacity for extracting spatial-spectral features of HSIs, the MLP framework has been strategically chosen as the foundational backbone of the first stage in the MAC-CDZS for facilitating efficient feature extraction. Secondly, to alleviate the potential category collapse, the second-stage fine-tuning framework is introduced, which extends the first-stage backbone by incorporating the elaborate adjacent coordinate module and contrastive learning paradigm for more harmonious classification performance. Specifically, the adjacent coordinate module is creatively designed to adequately mine the adjacent coordinates among samples for ameliorating category collapse from the perspective of grasping more reliable priors. Furthermore, a contrastive learning paradigm is innovatively constructed, comprising a Spatial Augmentation (SA) module tailored for hyperspectral patches and a construction strategy of sample pair under zero-shot conditions, which aims to boost the representation capability and alleviate the class collapse. The superior performance of the MAC-CDZS is demonstrated by experimental results on four benchmark datasets.

Abstract:
Facial image restoration has made tremendous progress with the boom of deep learning methods. Owing to the strong ill-posedness of the problem, different categories of a priori constraints have been harnessed or embedded in existing deep architectures. However, for face restoration with more complicated degradations, the challenge becomes greater. In this paper, a further step is taken by exploring the potential of graph convolutional networks (GCNs) in conjunction with structured priors for the degradation-unknown problem. Specifically, a lightweight yet physically more intuitive model termed FaceGCN is proposed. On the one hand, a dynamic generator of facial adjacency matrices is constructed, assisted by two self-supervised losses, allowing a sparse, accurate, and adaptive construction of case-specific face graphs with facial feature components as nodes. On the other hand, to model the joint local-nonlocal correlations among various facial feature components, a novel strip-attention GCN module is developed by splitting facial feature maps into intra- and inter-strips in both horizontal and vertical orientations. Extensive experimental results show that FaceGCN achieves comparable or even superior performance to state-of-the-art methods at a considerably lower computational cost.

Abstract:
Synthesizing 3D content from a single image has great potential in many real-world applications. To deal with the inherent ambiguity of a single image, existing methods usually leverage pre-trained 2D diffusion models for computationally intensive per-instance optimization. Although recent 3D foundation models can create 3D assets in a feed-forward manner, their efficacy is still limited because they neglect geometric cues from images. To address this issue, we propose an efficient 3D foundation model named LPM to synthesize 3D content from an image. Like masked modeling in the 2D image domain, the key of our approach is to learn 3D representations from incomplete visible shapes. By taking advantage of a synthesis-by-analysis paradigm, we establish an efficient pipeline to first estimate the visible portions and then generate the complete 3D representations. Based on the principle that an image is the projection of a 3D model, we initially estimate partial 3D voxel features from a single image, which are further projected onto orthogonal planes to form an incomplete yet efficient triplane representation. Subsequently, an autoencoder is employed to model a complete triplane representation based on the incomplete parts. We train our model, which has over 250 million parameters, on massive data to enhance its generalization capability. The experimental results show that LPM can generate high-fidelity 3D objects from an image within 0.1 seconds, which is more effective than the existing feed-forward approaches. Our implementation and pre-trained models will be made publicly available.

Abstract:
Large-scale fine-grained image retrieval aims to learn compact discriminative feature representations based on mining the subtle distinctions between visually similar objects. However, existing fine-grained image retrieval methods focus on enhancing the attention to the discriminative regions within single images, which barely exploit the high-order relational information between the global features and local region features across different images. Thus, the over-fitting problem of complex personalized differences cannot be effectively solved. In addition, existing unconstrained vector quantization methods tend to assign unquantized feature vectors to a few major codewords, which are unable to effectively distinguish the quantized features and reduce the redundant information. To address these issues, we propose a novel optimal transport quantization method based on cross-X semantic hypergraph learning for large-scale fine-grained image retrieval. Specifically, we first introduce a cross-layer multi-scale aggregation module to extract the global features and local region features. Subsequently, we build a semantic hypergraph to model the high-order correlations between the global features and local region features extracted from different layers, different scales and different images, which can alleviate the over-fitting problem of complex personalized differences by suppressing sample-level and background noise. Moreover, we introduce an error regularization term into the progressive asymmetric quantization loss to reduce the quantization errors and preserve the semantic similarity. Finally, we attempt to introduce the code balance and uncorrelated constraints into the multi-codebook quantization framework to improve the utilization efficiency of codewords and reduce the redundant information, which can be approximated by solving the optimal transport problem. 
Experimental results on several fine-grained image datasets demonstrate that the proposed method outperforms the state-of-the-art fine-grained image retrieval methods.

Abstract:
Omnidirectional images, offering immersive 360° views, have gained significant attention, but assessing their perceptual quality, especially for stereoscopic content, remains a complex challenge. A major limitation lies in the fact that head-mounted devices restrict the viewer’s experience to a single viewport at a time, necessitating a comprehensive understanding of how multiple viewport images interact and aggregate during the viewing process. Moreover, the depth dimension inherent in stereoscopic content further complicates the 360° visual experience, a factor often oversimplified by existing methods, limiting their ability to accurately differentiate perceptual quality across viewports. To address these challenges, we propose a novel no-reference quality assessment model for stereoscopic omnidirectional images. Our approach integrates binocular vision principles within a viewport hypergraph convolutional network framework. First, guided by the unique viewing patterns of stereoscopic omnidirectional images, our model selects panoramic viewports that align with human visual preferences. Next, we devise an image feature extraction network that simulates the binocular fusion and rivalry mechanisms within the human visual system, leveraging a twin encoder-decoder network and tensor decomposition to capture key features. Finally, to assess overall image quality, we introduce a hypergraph structure module that captures complex positional and content-based interactions among sampled viewports through the Graph Influence Network. Extensive experiments on the NBU-SOID, SOLID, and LIVE 3D VR databases demonstrate the superior accuracy and robustness of our model compared to state-of-the-art methods.

Abstract:
The escalating need for live video streaming has emerged as a significant catalyst for the business expansion of today’s content delivery networks (CDN). Selecting the right CDN live streaming architecture is fundamentally important in achieving the objective of enhancing users’ quality of experience (QoE) while reducing bandwidth costs. Regrettably, a limited number of studies have been conducted to systematically measure and compare the current typical solutions at production scale. Consequently, the performance and costs of different streaming architectures remain myths. This paper aims to address the existing research gap by undertaking a large-scale measurement study of three representative CDN live streaming architectures, defined by streaming protocol and overlay topology choices, currently running on Alibaba Cloud’s production video delivery network. By analyzing the results of over 500 million video plays over two months on a large live streaming platform hosted on Alibaba Cloud’s CDN, we reveal the impact of architectural compositions and operational factors on live streaming performance and bandwidth costs. In particular, our study reveals the trade-offs between QoE metrics and bandwidth costs for operational streaming architectures. Drawing upon the insights of this study, we further develop and deploy pragmatic strategies that yield remarkable real-world impact—our design saves over 17% bandwidth costs while maintaining the QoE.

Abstract:
Large Language Models (LLMs) have evolved into Multimodal Large Language Models (MLLMs), significantly enhancing their capabilities by integrating visual information and other modalities, thus aligning more closely with the nature of human intelligence, which processes a variety of data forms beyond just text. Despite these advancements, the undesirable generation of these models remains a critical concern, particularly due to vulnerabilities exposed by text-based jailbreak attacks, which pose a significant threat by challenging existing safety protocols. Motivated by the unique security risks posed by the integration of new and old modalities in MLLMs, we propose a unified multimodal universal jailbreak attack framework that leverages iterative image-text interactions and a transfer-based strategy to generate a universal adversarial suffix and image. Our work not only highlights that the interaction of image-text modalities can be exploited as a critical vulnerability but also validates that multimodal universal jailbreak attacks can bring higher-quality undesirable generations across different MLLMs. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP, and reveal significant multimodal safety alignment issues, highlighting the inadequacy of current safety mechanisms against sophisticated multimodal attacks. This study underscores the urgent need for robust safety measures in MLLMs, advocating for a comprehensive review and enhancement of security protocols to mitigate potential risks associated with multimodal capabilities.

Abstract:
Inspired by the human visual system (HVS), no-reference image quality assessment (NR-IQA) has made significant progress without relying on perfect reference images. The HVS is primarily influenced by the combined effects of representational information with different receptive fields and attribute categories when capturing subjective perceived quality. However, existing methods only roughly or partially utilize representations of multi-dimensional information. Furthermore, current NR-IQA methods either rely on convolutional neural networks (CNNs) with limited local perception or incur the high computational complexity of vision transformers (ViTs). To make up for the shortcomings of these two architectures, an emerging visual state space model (VMamba) is introduced. Motivated by this, this paper presents an NR-IQA method via VIsual State space model with Graph-based feature Aggregation (VISGA). Specifically, we utilize a plain, pre-training-free, and feature-enhanced VMamba as the backbone. To align with the perceptual mechanisms of the HVS by effectively using features with different dimensional information, a graph convolutional network-based multi-receptive field and multi-level aggregation module is designed to deeply explore the correlations and interactions of multi-dimensional representations. Additionally, we propose a gated local enhancement module with patch-wise perception to enhance the local perception of VMamba. Extensive experiments conducted on seven databases demonstrate that VISGA achieves outstanding performance. Notably, our model remains state-of-the-art when training with very few parameters. The code is released at https://github.com/xirihao/VISGA.

Affiliations: Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, School of Information Science and Technology, Beijing University of Technology, Beijing, China; Department of Radiation Oncology, Stanford University, Stanford, CA, USA; National Engineering Laboratory of Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, China; ReLER Laboratory, Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia; Discipline of Business Analytics, University of Sydney Business School, The University of Sydney, Camperdown, NSW, Australia

Abstract:
With the rapid growth of internet content, multimodal long document data has become increasingly prominent, drawing significant attention from researchers. However, most existing methods primarily focus on scenarios where all modalities are present, often overlooking more challenging and realistic cases involving a missing image modality. To address this limitation, we propose a robust multimodal long document classification (MLDC) framework that integrates hierarchical modeling and dynamic prompting to handle complex multimodal long document data. Our approach begins by leveraging hierarchical modeling combined with an Adaptive Correlation Multimodal Transformer (ACMT) to effectively capture relationships between text and images at both section and sentence levels. We also introduce a Dynamic Prompt Generation (DPG) module at both levels to enhance the model’s robustness in handling missing image data. By evaluating sample uncertainty, the DPG module dynamically adjusts both the number of prompts and the prompts themselves, allowing the model to better adapt to the varying needs of different samples. Finally, a Hierarchical Heterogeneous Graph (HHG) is introduced to enhance feature interactions across levels, further improving the coherence and accuracy of the model. Extensive experiments on four multimodal long document datasets demonstrate that our model outperforms existing state-of-the-art MLDC methods under various conditions.

Abstract:
Despite the widespread use of gadolinium-based contrast agents in clinical MRI examinations due to their significant advantages in structural localization and tumor identification, there is a risk of brain deposition and nephrogenic systemic fibrosis. Cross-modal image synthesis methods offer a new alternative, yet lesion synthesis remains challenging. On one hand, brain lesions vary significantly in location, shape, and size. On the other hand, the high background ratio associated with brain lesions makes their synthesis more difficult. To address these issues, we first introduce a Multi-Objective Local Perception Module (M-OLPM), which utilizes edge generation and lesion segmentation tasks to prioritize local lesions from the disentangled local perceptual feature subspaces. To better extend to multi-objective local perception, we propose a ‘Disentangle and Then Fuse’ learning strategy, including a Feature Disentanglement Module (FDM) and a Global Fusion Module (GFM). The FDM decouples multimodal deep features into low-frequency semantic features and high-frequency edge features, alleviating feature conflicts from weakly related perception tasks. To enhance feature interaction among multiple perception tasks, the GFM progressively integrates these local perceptual features and underlying detail features through an attention mechanism, further refining the global image quality. Evaluated on the publicly available BRaTS2020, BRaTS2021 datasets, and the private HPPH dataset, our method significantly outperforms the existing technology in both visual and quantitative assessments of gadolinium-enhanced MRI images in global and localized lesion areas, providing a safe alternative to gadolinium enhancement. The source code is publicly available at https://github.com/zengyangche/DTF-Net.

Abstract:
Multi-modal medical image fusion enhances the representation, aggregation, and comprehension of functional and structural information, improving accuracy and efficiency for subsequent analysis. However, the lack of explicit cross-channel modeling and interaction among modalities results in the loss of details and in artifacts. To this end, we propose a novel Explicit Channel-wise Interaction Network for unified multi-modal medical image Fusion, namely ECINFusion. ECINFusion encompasses two components: multi-scale adaptive feature modeling (MAFM) and an explicit channel-wise interaction mechanism (ECIM). MAFM leverages adaptive parallel convolution and transformer in a multi-scale manner to achieve global context-aware feature representation. ECIM utilizes the designed multi-head channel-attention mechanism for explicit modeling in the channel dimension to accomplish cross-modal interaction. Besides, we introduce a novel adaptive L-Norm loss that preserves fine-grained details. Experiments demonstrate that ECINFusion outperforms state-of-the-art approaches in various medical fusion sub-tasks on different metrics. Furthermore, extended experiments reveal the robust generalization of the proposed method in different fusion tasks. In brief, the proposed explicit channel-wise interaction mechanism provides new insight for multi-modal interaction.

Abstract:
All-in-one image restoration has recently become a new research trend in the low-level computer vision field, aiming to tackle multiple image degradation types simultaneously in a unified model. Treating it as a typical multi-task learning problem, existing approaches focus on modeling either the specificity or commonality among different image restoration tasks. To exploit the unique strengths of both worlds, we propose a method of Image Pyramid Transformer coupled with Information Loss Regularization (IPT-ILR), in which the multi-scale architecture can excavate more information for multiple restoration tasks concurrently, while the learning strategy can identify the difference among multiple restoration tasks depending on the degree of information loss in each restoration task. Specifically, it first establishes a new Image Pyramid Transformer Network (IPT-Network) to accommodate multiple image restoration tasks. Given original degraded images, the IPT-Network exploits the image pyramid technique to establish a series of images with different scales, which are then restored by transformer-like auto-encoders. Moreover, the restored image on a low-level scale is referenced to assist in restoring the degraded image on a high-level scale. Next, Information Loss Regularization (ILR) is presented to optimize the IPT-Network. ILR calculates the average distance between degraded images and their clean counterparts as the weights, which automatically implement different penalties for different image restoration tasks, thus avoiding the short-cut phenomenon for the easy task while encouraging the hard task. Extensive experiments have been conducted with 6 image restoration tasks in the all-in-one setting. The results show our method performs favorably against numerous state-of-the-art methods across most tasks, including image denoising, image deblurring, image dehazing, image deraining, image desnowing, as well as low-light enhancement.
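
The weighting step described for ILR (average distance between degraded images and their clean counterparts, used as per-task weights) can be sketched as follows. The abstract does not say which distance is used or whether the weights are normalized, so mean absolute difference and normalization to sum to one are assumptions here, and `information_loss_weights` is a hypothetical name.

```python
import numpy as np

def information_loss_weights(degraded_by_task, clean_by_task):
    """Compute one weight per restoration task from how far its degraded
    images are from the clean targets.

    Assumptions: distance is the mean absolute pixel difference, and the
    weights are normalized to sum to 1.
    """
    raw = {}
    for task, degraded in degraded_by_task.items():
        clean = np.asarray(clean_by_task[task], dtype=float)
        raw[task] = float(np.mean(np.abs(np.asarray(degraded, dtype=float) - clean)))
    total = sum(raw.values())
    if total == 0.0:
        # No measurable degradation anywhere: fall back to uniform weights.
        return {task: 1.0 / len(raw) for task in raw}
    return {task: value / total for task, value in raw.items()}

clean = {"denoise": np.zeros((2, 4, 4)), "dehaze": np.zeros((2, 4, 4))}
degraded = {"denoise": np.full((2, 4, 4), 0.1), "dehaze": np.full((2, 4, 4), 0.3)}
weights = information_loss_weights(degraded, clean)
```

With these toy inputs the task that loses more information ("dehaze") receives the larger weight, so its loss term is penalized more heavily than the easier task's.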

Abstract:
Detecting double JPEG compression with the same quantization matrix is a crucial yet challenging task in image forensics. Existing methods often fail to accurately identify and fully exploit the differences between singly and doubly compressed images, resulting in unsatisfactory detection performance, especially for cases with low quality factors (QFs). To address this issue, a novel method is proposed to extract highly discriminative features for performance enhancement. First, we design a new error block classification method that categorizes error blocks into stable error blocks, rounding error blocks (REBs), and truncation error blocks (TEBs). This classification method enables more accurate identification of TEBs, which are the most discriminative blocks in error images for cases with low QFs. Then, based on the theoretical analysis of REBs and TEBs, an intrinsic variable that directly leads to the differences between two classes of images is derived, providing more essential characteristics for the detection. Finally, a number of 25-dimensional highly discriminative features are extracted from REBs, TEBs, and flat blocks. Experimental results demonstrate that the proposed method outperforms several state-of-the-art works, especially on images with low QFs.

Affiliations: Hubei Key Laboratory of Internet of Intelligence, School of Electronics Information and Communications, Huazhong University of Science and Technology, Wuhan, China; School of Intelligent Systems Engineering, Sun Yat-sen University (Shenzhen Campus), Guangzhou, China; Graduate School of Informatics and Engineering, The University of Electro-Communications, Chofu, Tokyo, Japan; School of Information Management, Jiangxi University of Finance and Economics, Nanchang, China

Abstract:
Volumetric videos offer an incredibly immersive viewing experience but encounter challenges in maintaining quality of experience (QoE) due to their ultra-high bandwidth requirements. One significant challenge stems from users’ spatial interactions, which can lead to discrepancies between transmission bitrates and the actual quality of rendered viewports. In this study, we conduct comprehensive measurement experiments to investigate the impact of six degrees of freedom information on received video quality. Our results indicate that the correlation between spatial quality and transmission bitrates is influenced by the user’s viewing distance and varies among users. To address this, we propose a spatial quality oriented rate control system, namely sparkle, that aims to satisfy spatial quality requirements while maximizing long-term QoE for volumetric video streaming services. Leveraging richer user interaction information, we devise a tailored learning-based algorithm to enhance long-term QoE. To address the complexity brought by richer state input and precise allocation, we integrate pre-constraints derived from three-dimensional displays to intervene in action selection, efficiently reducing the action space and speeding up convergence. Extensive experimental results illustrate that sparkle enhances the average QoE by up to 29% under practical network and user tracking scenarios.

Abstract:
Fully supervised video salient object detection (VSOD) has made considerable breakthroughs using costly and time-consuming pixel-wise annotations. Recently, to achieve a trade-off between the annotation burden and the model performance, scribble-based VSOD tasks have attracted increasing attention. However, learning the complete object structure and precise boundary details from sparse scribble annotations remains challenging. In this paper, we propose a series of strategies to effectively exploit valid information from the recently proposed segmentation foundation model “Segment Anything Model (SAM)” from various perspectives to address these challenges. Specifically, due to the limited performance of SAM on videos, we propose a SAM-guided label enhancement method instead of directly using the results of SAM, which can introduce edge information while reducing the interference of erroneous information. Moreover, we propose a SAM-driven spatiotemporal network guided by general semantic features from the SAM encoder to help the model be aware of global connections. Additionally, we propose a SAM-based global-aware loss, which further considers the affinity constraint between predicted results and foreground labels or background labels from a global perspective, guiding the model to perceive the complete salient objects. Experimental results demonstrate that our method outperforms state-of-the-art weakly supervised VSOD methods and is comparable to fully supervised VSOD methods.

Abstract:
In recent years, Incomplete Multi-View Clustering (IMVC) has become an important and challenging task. Although several methods have been proposed to address IMVC, they still have the following drawbacks: i) Due to the presence of missing samples in the views, clustering prototypes obtained from different views may have positional deviations, leading to inaccurate positioning of cluster centers, thus affecting the accuracy of clustering results. ii) Repair strategies based on cross-view prediction and adversarial generation have high computational costs and heavily rely on model performance. Neighbor-based repair strategies may result in inaccurate neighbor selection due to the presence of noise. iii) Models learned solely from complete data often perform better than models learned from both complete and incomplete data, especially when there are semantic differences between the repaired data and the missing data. To address the aforementioned issues, this paper proposes a Dynamic Imputation and Triple Alignment with Dual-Optimization for Deep Incomplete Multi-View Clustering (DITA-IMVC). Specifically, by accurately representing advanced features, we propose a cross-view dynamic structure learning strategy for missing view repair, where the dynamic structural relationships between high-semantic features within each view are calculated to obtain highly related samples from different views. To address positional deviations from different views, we propose a triple cross-view alignment with prototype, feature, and clustering assignment, which preserves the consistency among different views. Finally, we design a dual-optimization process for both complete view features and repaired features via alternating iterations to fully utilize the incomplete view data, thereby improving clustering performance. 
Extensive experiments conducted on different standard datasets demonstrate the effectiveness of DITA-IMVC and show that it yields superior clustering results compared to existing methods.

Abstract:
Due to light attenuation and complex environments, underwater robots need to carry artificial light to improve visibility, which leads to issues such as brightness vignetting, low contrast, and color distortion in the captured underwater images. However, existing methods for enhancing underwater images often overlook the challenges caused by artificial light. To address these challenges, we construct a novel underwater vignetting image formation model and propose a correction method called UVIC. The method consists of three main modules: separating the vignetting component, separating the backscattering component, and adaptive brightness and color correction. In our model, based on the linear relationship between the image gradient and the coefficients of the binary polynomial, we introduce a binary polynomial regularization to separate the vignetting component without estimating the center of the vignetting. Additionally, the backscattering can be effectively separated by introducing a latent low-rank representation based on local consistency, without estimating atmospheric light and transmission parameters. Furthermore, we design an adaptive brightness and color correction module using the global brightness of the image L layer and the histogram distribution characteristics of the a and b layers to adjust the brightness and color bias of the image. In particular, because the model contains both additive and multiplicative operations, we decompose the objective function into two submodels and solve them by iteratively reweighted least squares and the alternating direction method of multipliers, respectively. Numerous experiments demonstrate that UVIC not only effectively corrects image brightness vignetting, but also improves color bias, contrast, and sharpness.

Abstract:
When capturing images under strong light sources at night, intense lens flare artifacts often appear, significantly degrading visual quality and impacting downstream computer vision tasks. Although transformer-based methods have achieved remarkable results in nighttime flare removal, they fail to adequately distinguish between flare and non-flare regions. This unified processing overlooks the unique characteristics of these regions, leading to suboptimal performance and unsatisfactory results in real-world scenarios. To address this critical issue, we propose a novel approach incorporating Location Prior Guidance (LPG) and a specialized flare removal model, LPFSformer. LPG is designed to accurately learn the location of flares within an image and effectively capture the associated glow effects. By employing Location Prior Injection (LPI), our method directs the model’s focus towards flare regions through the interaction of frequency and spatial domains. Additionally, to enhance the recovery of high-frequency textures and capture finer local details, we designed a Global Hybrid Feature Compensator (GHFC). GHFC aggregates different expert structures, leveraging the diverse receptive fields and CNN operations of each expert to effectively utilize a broader range of features during the flare removal process. Extensive experiments demonstrate that our LPFSformer achieves state-of-the-art flare removal performance compared to existing methods. Our code and a pre-trained LPFSformer have been uploaded to GitHub for validation.

Abstract:
The structured light (SL)-based three-dimensional (3D) measurement techniques with deep learning have been widely studied to improve measurement efficiency, among which fringe projection profilometry (FPP) and speckle projection profilometry (SPP) are two popular methods. However, they generally use a single projection pattern for reconstruction, resulting in fringe order ambiguity or poor reconstruction accuracy. To alleviate these problems, we propose a parallel dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet), to take advantage of convolutional operations and self-attention mechanisms for processing different SL modalities. Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images. To fully integrate complementary features, we design a double-stream attention aggregation module (DAAM) that consists of a parallel attention subnetwork for aggregating multi-scale spatial structure information. This module can dynamically retain local and global representations to the maximum extent. Moreover, an adaptive mixture density head with bimodal Gaussian distribution is proposed for learning a representation that is precise near discontinuities. Compared to the standard disparity regression strategy, this adaptive mixture head can effectively improve performance at object boundaries. Extensive experiments demonstrate that our method can reduce fringe order ambiguity while producing high-accuracy results on self-made datasets.

Abstract:
Multi-camera 3D semantic occupancy prediction is a critical task for autonomous driving, playing a vital role in understanding the environment. Current methods mainly rely on a uniform voxel representation to encode space, which greatly limits their resolution scalability. Most existing methods therefore struggle to scale to finer granularities, as the cubic growth of uniform voxels leads to a significant increase in the demand for computational and storage resources when scaling. To address this, we propose a multi-level hierarchical model, AdaptiveOcc. Using the octree structure, our model can adaptively represent different parts of space with varying voxel granularity. It can selectively extend resolution only for a small subset of voxels, thus mitigating the substantial computational and storage burden brought by scaling. To endow our model with adaptability, we propose a distance-adaptive octree construction rule for generating supervised labels. Considering that the voxel granularity requirements vary for different distance ranges in environmental perception, such a construction rule results in a higher likelihood of coarser granularity for distant regions and finer granularity for nearby regions. This ensures a more efficient and rational allocation of computational resources, further reducing the inference latency. Extensive experiments on the nuScenes, SemanticKITTI and Waymo datasets validate that our method can scale to finer granularities with faster speed and less training memory compared with other state-of-the-art methods. Our code is available at https://github.com/yty-sky/AdaptiveOcc.

Abstract:
A Light Field Image (LFI) records both angular and spatial information and provides immersive experiences for observers by rendering a scene from multiple perspectives. To cope with the resolution limitations of capture hardware, LFI angular reconstruction and spatial super-resolution are two widely used methods, but they can also induce some special types of distortions, especially when the two methods are adopted in combination. As a result, new challenges have arisen in assessing the quality of these distorted LFIs. In this paper, firstly, we conduct subjective experiments to evaluate the distorted LFI quality and present a novel perceptual quality assessment database with the associated subjective quality scores. Specifically, the proposed database focuses on the distortions introduced by deep learning-based LFI angular reconstruction and spatial super-resolution methods, both individually and in combination. Besides, in the case of multiple distortions, the order in which the two distortions are applied is taken into consideration. Further, our database presents three types of LFIs that suffer from distortions: real-world, dense synthesis, and sparse synthesis. The quality of the distorted LFIs was subjectively assessed by 32 valid observers using the Pairwise Comparison (PC) protocol. Secondly, we develop a novel objective No-Reference (NR) metric for LFI quality evaluation, based on the features extracted from spatial gradients, angular-spatial statistics, and binocular disparity. Finally, a benchmark of the proposed metric and numerous state-of-the-art quality assessment metrics on the proposed database is presented. Experimental results demonstrate the superiority of the proposed metric over most existing metrics in various aspects. The proposed database and metric will be publicly available at https://github.com/ZhengyuZhang96/IETR-LFI.

Abstract:
Clustering based on structured graph learning involves acquiring a proximity matrix with an explicit clustering structure from the original one. However, the original proximity matrix often lacks some must-links compared to the ground truth, constraining the upper bound of clustering performance. High-order proximity information can mitigate this limitation, yet traditional high-order proximity matrix-based methods are time-intensive. To tackle this, we propose the Tensorized High-order Bipartite Graphs-based structured proximity matrix learning method (THBG). Firstly, we introduce a high-order bipartite graph proximity matrix with a swift computation method, incorporating high-order information and significantly reducing computational overhead. Secondly, we apply tensor nuclear norm minimization to the tensor composed of high-order bipartite graphs, learning a low-rank tensor representation that effectively harnesses the consistency of high-order information. Concurrently, a structured bipartite graph proximity matrix with an explicit clustering structure is adaptively learned based on the low-rank tensor representation and a Laplacian rank constraint. Experimental results demonstrate the superiority and great potential of this method. Code available: https://anonymous.4open.science/r/THBG-D10D.

Abstract:
Event-based semantic segmentation in traffic scenes has attracted considerable attention in autonomous driving systems due to the advantages of event cameras such as high dynamic range, low latency, and low energy consumption. However, existing Artificial Neural Network (ANN)-based methods rely on conventional image frames, often neglecting the spatial-temporal dynamics inherent in event streams and incurring higher energy costs, which significantly limits their applicability in energy-constrained environments. In this study, we introduce Spike-BRGNet, a Spike-driven Boundary Region-Guided Network that efficiently extracts boundary information utilizing only events to guide the segmentation encoder, while preserving the energy efficiency of Spiking Neural Networks (SNNs). Specifically, to explore the implicit information from events, we design a three-branch spiking encoder that consists of semantic detail (SD), context aggregation (CA), and boundary aware (BA) branches to capture specific features. Then, a spiking multi-scale context aggregation (SMSCA) module is proposed to enhance the semantics of the CA branch. Finally, a novel boundary region-guided loss function and a dynamic surrogate gradient function, EvAF, are designed to optimize the model. Extensive experiments show that our model outperforms state-of-the-art (SOTA) SNN-based methods on the DDD17 (+1.57%) and DSEC (+1.91%) datasets. Furthermore, Spike-BRGNet consumes 17.76× less energy than ANN-based models, showing superior energy-saving performance.

Abstract:
Three-dimensional (3D) point clouds are becoming increasingly popular for representing 3D objects and scenes. Due to limited network bandwidth, efficient compression of 3D point clouds is crucial. To tackle this challenge, the Moving Picture Experts Group (MPEG) is actively developing the Geometry-based Point Cloud Compression (G-PCC) standard, incorporating innovative methods to optimize compression, such as the Region-Adaptive Hierarchical Transform (RAHT) nestled within a layer-by-layer octree structure. Nevertheless, a notable problem still exists in RAHT, i.e., the proportion of zero residuals in the last few RAHT layers leads to unnecessary bitrate consumption. To address this problem, we propose an adaptive skip coding method for RAHT, which adaptively determines whether to encode the residuals of the last several layers or not, thereby improving the coding efficiency. In addition, we propose a rate-distortion cost calculation method associated with an adaptive Lagrange multiplier. Experimental results demonstrate that the proposed method achieves average Bjøntegaard rate improvements of -3.50%, -5.56%, and -4.18% for the Luma, Cb, and Cr components, respectively, on dynamic point clouds, when compared with the state-of-the-art G-PCC reference software under the common test conditions recommended by MPEG.
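
A skip decision of this kind is commonly framed as a comparison of Lagrangian rate-distortion costs, J = D + λR. The sketch below shows only that generic comparison. It is not the G-PCC reference software or the authors' method: the function names are hypothetical, and the multiplier is passed in as a fixed number because the abstract does not state how the adaptive Lagrange multiplier is derived.

```python
def rd_cost(distortion, rate_bits, lagrange_multiplier):
    """Standard Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lagrange_multiplier * rate_bits


def should_skip_residuals(dist_coded, rate_coded, dist_skipped, rate_skipped,
                          lagrange_multiplier):
    """Return True when skipping the residuals of a layer has the lower RD cost.

    dist_coded / rate_coded: distortion and bits when residuals are encoded.
    dist_skipped / rate_skipped: distortion and bits (e.g. a skip flag) when
    the residuals are not encoded. All inputs are hypothetical measurements.
    """
    cost_coded = rd_cost(dist_coded, rate_coded, lagrange_multiplier)
    cost_skipped = rd_cost(dist_skipped, rate_skipped, lagrange_multiplier)
    return cost_skipped < cost_coded


if __name__ == "__main__":
    # Mostly-zero residuals: coding them costs many bits for little quality gain.
    print(should_skip_residuals(1.0, 100, 1.5, 1, lagrange_multiplier=0.01))
    # With a small multiplier, rate matters less and coding the residuals wins.
    print(should_skip_residuals(1.0, 100, 1.5, 1, lagrange_multiplier=0.001))
```

The multiplier sets the exchange rate between bits and distortion, so the same measurements can favor skipping at one operating point and coding at another.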

Abstract:
In this paper, a novel deep learning-based no-reference video quality assessment (NR-VQA) model for screen content videos (SCVs) is proposed, called the hierarchical spatiotemporal perceptual quality model (HSPQ). Firstly, the human visual system (HVS) perceives SCVs hierarchically, with varying sensitivity and attention to diverse attribute regions. Secondly, visual redundancies are copious in the spatiotemporal domain of SCVs, degrading video quality to some extent. Based on these characteristics, the SCVs are decomposed into three hierarchical levels (i.e., patch level, frame level, and video level), which contain quality-related spatiotemporal information. Specifically, visual saliency is first utilized to select the more salient textual and pictorial patches, and then a dual-channel convolutional neural network integrating a spatial-gate feature enhancement module (SGFEM) is designed to evaluate the quality of textual and pictorial patches separately at the patch level, based on their attributes. With spatial correlation, an adaptive blur-focused visual mechanism-based weighting strategy (BFWS) is proposed for converting quality scores from patch level to frame level. Finally, the video-level quality score, which reflects the temporal perceptual quality degradation, is combined to provide a comprehensive evaluation of distorted SCV quality. Experiments conducted on the Screen Content Video Database (SCVD) and Compressed Screen Content Video Quality (CSCVQ) databases demonstrate that our proposed HSPQ model aligns better with the visual perception of SCVs by the HVS. Moreover, it exhibits strong robustness compared to multiple classic and state-of-the-art image/video quality assessment models.

Abstract:
This paper proposes a novel spatiotemporal chaotic system: a dual-dynamic-coupling-coefficient coupled map lattice with delayed feedback (DDCMLDF). Building on the traditional Coupled Map Lattice (CML), we introduce two dynamic coupling coefficients and a random delay mechanism. These enhancements enable our system to exhibit more complex chaotic behavior, making it more suitable for applications in chaotic encryption and secure communication. Based on this system, a multi-image encryption algorithm is proposed. This encryption algorithm first utilizes chaotic dynamics theory to perform internal scrambling of the images. Then, using the concept of multi-path merging, scrambling is performed among the images, further disrupting the relationships between them. Finally, a diffusion operation is applied separately to each scrambled image. A series of simulation experiments demonstrates that this encryption algorithm has good security and robustness.
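
For orientation, the traditional CML that the abstract builds on can be sketched in a few lines. This is the textbook diffusively coupled lattice with a logistic local map and a constant coupling coefficient, not the proposed DDCMLDF: the dynamic coupling coefficients and the random delay mechanism are not specified in the abstract and are omitted. The parameter values are illustrative assumptions.

```python
def logistic_map(x, mu=3.99):
    """Logistic map f(x) = mu * x * (1 - x), chaotic for mu close to 4."""
    return mu * x * (1.0 - x)


def cml_step(state, coupling=0.1, mu=3.99):
    """One update of a traditional coupled map lattice with periodic boundaries.

    x_{n+1}(i) = (1 - e) * f(x_n(i)) + (e / 2) * (f(x_n(i-1)) + f(x_n(i+1)))
    where e is the constant coupling coefficient.
    """
    size = len(state)
    mapped = [logistic_map(x, mu) for x in state]
    return [
        (1.0 - coupling) * mapped[i]
        + 0.5 * coupling * (mapped[i - 1] + mapped[(i + 1) % size])
        for i in range(size)
    ]


def cml_iterate(state, steps, coupling=0.1, mu=3.99):
    """Apply cml_step repeatedly and return the final lattice state."""
    for _ in range(steps):
        state = cml_step(state, coupling, mu)
    return state


if __name__ == "__main__":
    initial = [0.11, 0.23, 0.37, 0.52, 0.68, 0.81]
    print(cml_iterate(initial, steps=100))
```

Because each update is a convex combination of logistic-map outputs, lattice values stay in [0, 1] whenever the initial values do and mu is at most 4, which is convenient when the sequence is later quantized into key material.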

Abstract:
It is well known that the diverse causes of low-light images challenge the adaptability of enhancement algorithms in uncertain environments. Most deep learning-based algorithms learn only a single illuminance estimation or mapping relationship, which inhibits the generalization ability of the model. To address this, we propose a novel multi-illumination estimation framework based on ghost imaging theory, dubbed Ghillie. Specifically, we consider low-light enhancement as a re-imaging process for objects in dark scenes. First, the light modulation network (LMN) is designed to modulate a series of estimated lights following a normal light distribution. These lights “illuminate” the low-light image, and the enhanced illuminance image can be reconstructed by a differential ghost imaging algorithm. Then, a gradient-guided denoising network (GDN) is constructed to eliminate noise and enhance details. Finally, we employ the color adaption network (CAN) to restore the color degradation. Additionally, a novel mean structural similarity loss (AM-SSIM) is proposed to guide the model to address the uneven image illumination. The qualitative and quantitative experimental results show that our enhancement method outperforms state-of-the-art methods on the vast majority of publicly available datasets. Our code is available at: https://github.com/zzj-dyj/Ghillie.

Abstract:
Accurate estimation of burden surface depth plays a crucial role in constructing the temperature field and optimizing reaction control in volatile kilns. However, most image-based depth estimation techniques require high-quality input images and achieve limited accuracy, which restricts their application in harsh working conditions such as high temperature, heavy dust and dense smoke. In this study, a deep learning-based monocular depth estimation model is proposed to measure the burden surface depth in the volatile kiln head zone. The proposed model integrates an encoder-decoder network with an attention module. The encoder-decoder network outputs a set of deep semantic features, while the attention module intelligently fuses multi-level features to predict a probability distribution over depth intervals for each pixel. A volatile kiln prototype is designed and constructed to generate image datasets of the kiln head zone which approximate real data collected from industrial production sites. Results demonstrate that the proposed model has a depth prediction error of RMSE = 11.008 mm for the burden surface region, outperforming state-of-the-art neural networks and the traditional depth-from-defocus method. Code and datasets are available at https://github.com/LLLcong/Attention-MonoDepth.
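
One common way to turn a per-pixel probability distribution over depth intervals into a single depth value is to take the probability-weighted average of the interval centers. The sketch below illustrates that readout only. Whether the proposed model uses this expectation, and the interval edges chosen here, are assumptions and are not stated in the abstract.

```python
def bin_centers(edges):
    """Centers of consecutive depth intervals defined by sorted edges (in mm)."""
    return [0.5 * (low + high) for low, high in zip(edges, edges[1:])]


def expected_depth(probabilities, edges):
    """Depth estimate for one pixel as the expectation over interval centers.

    probabilities: one non-negative score per interval (renormalized here).
    edges: interval boundaries, so len(edges) == len(probabilities) + 1.
    """
    centers = bin_centers(edges)
    if len(probabilities) != len(centers):
        raise ValueError("need exactly one probability per depth interval")
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    return sum(p * c for p, c in zip(probabilities, centers)) / total


if __name__ == "__main__":
    # Hypothetical intervals of 100 mm each and a hypothetical network output.
    edges_mm = [0.0, 100.0, 200.0, 300.0]
    pixel_probs = [0.2, 0.5, 0.3]
    print(expected_depth(pixel_probs, edges_mm))
```

Compared with picking the most likely interval, the expectation yields a continuous value, so the estimate is not limited to the resolution of the intervals.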

Abstract:
Reversible data hiding-based contrast enhancement can be applied to medical images, which not only allows the storage of patient information through reversible embedding, but also achieves image contrast enhancement, thereby assisting doctors in accurately diagnosing patient diseases. In response to the existing problems of mainstream methods, a novel reversible data hiding-based local contrast enhancement method for medical images is proposed. This method utilizes superpixel segmentation to segment medical images into multiple pixel blocks, and performs reversible data embedding and contrast enhancement for the pixel blocks within the region of interest (ROI). Additionally, a new embedding strategy is proposed. According to the contrast and texture features of each pixel block, histogram expansion of different degrees is carried out to effectively enhance the pixel blocks with low contrast, while avoiding excessive enhancement of the pixel blocks with high contrast. Experimental results demonstrate that, compared with the state-of-the-art mainstream methods, the proposed method not only improves the contrast in the ROI but also ensures high visual quality of the medical images.

Abstract:
The major paradigm of weakly supervised video anomaly detection (WSVAD) is to treat it as a multiple instance learning (MIL) problem, with only video-level labels available for training. Due to the rarity and ambiguity of anomalies, the selection of potential abnormal training samples is the prime challenge for WSVAD. Considering the temporal relevance and length variation of anomaly events, how to integrate the temporal information is also a controversial topic in the WSVAD area. To address the aforementioned problems, we propose a novel method named Inter-clip Feature Similarity based Video Anomaly Detection (IFS-VAD). In the proposed IFS-VAD, to make use of both the global and local temporal relations, a Multi-scale Temporal MLP (MT-MLP) is leveraged. To better capture the ambiguous abnormal instances in positive bags, we introduce a novel anomaly criterion based on the Inter-clip Feature Similarity (IFS). The proposed IFS criterion can assist in discerning anomalies, serving as an additional anomaly score in the prediction process of the anomaly classifier. Extensive experiments show that IFS-VAD demonstrates state-of-the-art performance on ShanghaiTech with an AUC of 97.95%, UCF-Crime with an AUC of 86.57% and XD-Violence with an AP of 83.14%. Our code implementation is accessible at https://github.com/Ria5331/IFS-VAD.

Abstract:
Automatic land cover classification from high-resolution remote sensing (RS) images remains challenging due to the complex composition of classes. Given the potential of a graph to simulate latent class composition, the latest development of graph convolutional networks (GCNs) has received increasing attention. However, most existing methods only use a single-perspective graph structure, which largely limits their ability to capture the complementary features that would better represent the underlying data structure of images. Therefore, this paper proposes a novel multi-view GCN-based representation learning network (MvRLNet) for RS image classification. First, a superpixel-based spectral component decomposition module (SSCDM) is designed to enhance the uniqueness and homogeneity of graph nodes, because mixed superpixels may lead to miscellaneous information in graph aggregations. Second, a multi-view graph learning module (MGLM) is proposed to integrate topology and spectral graph information into a unified network with an effective feature learning strategy. Finally, the effectiveness of the proposed MvRLNet is validated on a variety of datasets with different resolutions. The experimental results show that the proposed MvRLNet performs better than state-of-the-art techniques.

Abstract:
Partial point cloud registration is an essential and fundamental component of generating complete 3D shapes, aiming at converting partial scans into a unified coordinate system. However, existing point cloud registration methods suffer from insufficiently rich local feature extraction and feature interaction. In addition, these methods still face challenges in sufficiently modeling the global contextual information of point clouds, which limits the improvement in registration effectiveness, especially for partial-to-partial point cloud registration with high noise. To overcome these issues, this paper proposes an enhanced PointNet and assignable weights transformer network (LFA-Net) for partial point cloud registration. The model achieves coarse-to-fine point cloud registration through three core modules. First, the Sufficient Local Feature Extraction Module (LFM) is constructed to extract various local feature information. Then, the Adequate Feature Aggregation Module (FAM) is designed to integrate the feature information from different point clouds. Finally, the Assignable Weights Transformer Module (ATM) is presented to stimulate the model’s global modeling ability during the registration process, enabling the selection of representative points for optimal point cloud registration. Extensive experiments conducted on ModelNet40 using partially overlapping point clouds illustrate the superior registration performance of LFA-Net compared with other state-of-the-art methods. Moreover, numerous experiments on synthetic and real-world datasets further indicate that LFA-Net also has significant advantages in registering partial point clouds with noise and unseen categories, which demonstrates its excellent robustness and generalization ability for real-world practical applications.

Abstract:
3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementarily, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for weather-robust 3D human reconstruction is still an open challenge, given the sparse nature of mmWave and the vulnerability of RGB images. The limited research on the impact of missing points and sparsity features of mmWave data on reconstruction performance, as well as the lack of available datasets for paired mmWave-RGB data, further complicate the process of fusing the two modalities. To fill these gaps, we build an automatic 3D body annotation system with multiple sensors to collect a large-scale mmWave dataset. The dataset consists of synchronized and calibrated mmWave radar point clouds and RGB(D) images under different weather conditions and skeleton/mesh annotations for humans in these scenes. With this dataset, we conduct a comprehensive analysis of the limitations of single-modality reconstruction and the impact of missing points and sparsity on the reconstruction performance. Guided by this analysis, we design ImmFusion, the first mmWave-RGB fusion solution to robustly reconstruct 3D human bodies in various weather conditions. Specifically, our ImmFusion consists of image and point backbones for token feature extraction and a Transformer module for token fusion. The image and point backbones refine global and local features from original data, and the Fusion Transformer Module aims for effective information fusion of two modalities by dynamically selecting informative tokens. Extensive experiments demonstrate that ImmFusion can efficiently utilize the information of two modalities to achieve robust 3D human body reconstruction in various weather environments.
In addition, our method achieves superior accuracy compared to that of the state-of-the-art Transformer-based LiDAR-camera fusion methods.

Affiliations: Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, School of Computer Science and Technology, Xidian University, Xi’an, China; Key Laboratory of Intelligent Perception and Image Understanding, Ministry of Education of China, School of Artificial Intelligence, Xidian University, Xi’an, China; Chinese Academy of Science, Institute of High Energy Physics, Beijing, China; Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China

Abstract:
The unstructured, unordered, and inherently irregular sampling properties of large-scale 3D point clouds make accurate and efficient semantic segmentation difficult. Complexity and the exploitation of long-distance information are the key challenges for large-scale 3D point cloud semantic segmentation. Therefore, in order to efficiently exploit long-distance global information and improve the segmentation accuracy of points located at the edges of distinct categories, a novel Trigonometric Bilinear attention and Global-aware Aggregation network (TBGA-Net) is designed to integrate and supplement the local-global contextual features for large-scale point cloud segmentation. The proposed Global-aware Context Aggregation block (GCA) can implicitly excavate the global context for each 3D point by utilizing the surface-to-volume ratio of its neighborhood to the global point cloud. Furthermore, aiming at refining local-global features, the Trigonometric Bilinear Attention block (TBA) utilizes trigonometric functions to embed the point cloud coordinates of local regions, and applies bilinear attention to enhance features and obtain more discriminative local-global features. Additionally, we design a novel Dynamic-adjusting Cross-entropy Loss (DCLoss) to work with TBGA-Net in addressing the issue of class imbalance in training data for large-scale 3D point cloud semantic segmentation. The experimental results on three 3D point cloud datasets demonstrate that the proposed algorithm achieves better segmentation accuracy, especially for points located at the boundaries between distinct categories.
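
The idea of embedding local coordinates with trigonometric functions can be illustrated with a minimal sketch. The abstract does not give the exact form used by the TBA block, so the sine/cosine layout, the power-of-two frequencies, and the names `trig_embed` and `num_freqs` below are assumptions made for illustration only.

```python
import math

def trig_embed(offset, num_freqs=2):
    """Embed a relative 3D offset (dx, dy, dz) with sine/cosine features.

    Hypothetical layout: for each coordinate and each frequency 2**k,
    append sin(freq * value) and cos(freq * value).
    """
    features = []
    for value in offset:
        for k in range(num_freqs):
            freq = 2.0 ** k
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features

# A neighbor that coincides with its center point has a zero offset.
zero_embedding = trig_embed((0.0, 0.0, 0.0))
```

With three coordinates and two frequencies, each offset maps to a 12-dimensional feature vector that a subsequent attention layer could consume.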

Abstract:
Efficiently aggregating 4D light field information to achieve accurate semantic segmentation has always faced challenges in capturing long range dependency information (CNN-based) and the memory limitations of quadratic computational complexity (Transformer-based). Recently, the Mamba architecture, which utilizes the state space model (SSM), has achieved high performance under linear complexity in various vision tasks. However, directly applying Mamba to 4D light field scanning will lead to an inherent loss of multi-spatial-angular information. To address the above challenges, we introduce LFSSMam, a novel Light Field Semantic Segmentation architecture based on the selective state space model (Mamba). Firstly, LFSSMam presents an innovative spatial-angular selective scanning mechanism to decouple and scan 4D multi-dimensional light field data. It separately captures the rich spatial context, complementary angular and structural information of light field 2D slices within the state space. In addition, we design an SSM-attention Cross-Fusion Enhance Module to perform preferential scanning and fusion across multi-spatial-angular-modal light field information, adaptively aggregating and enhancing the central view features. Comprehensive experiments on synthetic and real world datasets demonstrate that LFSSMam achieves leading edge SOTA (State-Of-The-Art) performance (with a 6.97% improvement to LF-based methods) while reducing memory and computational complexity. This work provides valuable guidance for the efficient modeling and application of multi-spatial-angular information in light field semantic segmentation. Our code is available at https://github.com/HNU-WQW/LFSSMam

Abstract:
In the field of no-reference image quality assessment (NR-IQA), the visual masking effect has long been a challenging issue. Although existing methods attempt to alleviate the interference caused by masking by generating pseudoreference images, the quality of these images is often constrained by the accuracy and reconstruction capabilities of image restoration algorithms. This can introduce additional biases, thereby affecting the reliability of the evaluation results. To address this problem, we propose a novel generative “noise” estimation framework (GNE-Vim) that eliminates the need for pseudoreference images. Instead, it deeply decouples the distortion components from degraded images and performs quality-aware modelling of these components. During the training phase, the model leverages both reference images and distortion components to guide the learning of the true distortion distribution. In the inference phase, quality prediction is conducted directly on the basis of the decoupled distortion components, making the evaluation results more aligned with human subjective perception. The experimental results demonstrate that the proposed method achieves strong performance across datasets containing various types of distortions. The source code is publicly available at the following website: https://github.com/opencodelxt/GNE-Vim

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Department of Computer and Network Engineering, The University of Electro-Communications, Tokyo, Japan; Department of Electrical and Computer Engineering, Aarhus University, Aarhus, Denmark; School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China

Abstract:
Volumetric video, also referred to as hologram video, is an emerging medium that represents 3D content in extended reality. As a next-generation video technology, it is poised to become a key application in 5G and future wireless communication networks. Because each user generally views only a specific portion of the volumetric video, known as the viewport, accurate prediction of the viewport is crucial for ensuring an optimal streaming performance. Despite its significance, research in this area is still in the early stages. To this end, this paper introduces a novel approach called Saliency and Trajectory-based Viewport Prediction (STVP), which enhances the accuracy of viewport prediction in volumetric video streaming by effectively leveraging both video saliency and viewport trajectory information. In particular, we first introduce a novel sampling method, Uniform Random Sampling (URS), which efficiently preserves video features while minimizing computational complexity. Next, we propose a saliency detection technique that integrates both spatial and temporal information to identify visually static and dynamic geometric and luminance-salient regions. Finally, we fuse saliency and trajectory information to achieve more accurate viewport prediction. Extensive experimental results validate the superiority of our method over existing state-of-the-art schemes. To the best of our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. We also make the source code of this work publicly available.

Abstract:
Speech-driven talking head synthesis technology has made remarkable progress, but it still faces the challenge of one-to-many pathological mapping. The challenge results in inaccurate lip movements, ambiguity in facial expressions, and a lack of coherence during transitions between facial motions. The phenomenon is primarily caused by: (1) for one speaker, the same phoneme corresponds to a wide range of mouth shapes and facial expressions due to contextual variations, and (2) for the same spoken content, different speakers exhibit diverse facial motions as a result of unique speaking styles. In this work, we propose a novel framework, called AllTalk, to alleviate one-to-many pathological mapping, which enables a more vivid and natural talking head. Specifically, considering the asymmetry and dynamic nature of mouth shapes’ dependence on phoneme context, we propose a Dynamic Adaptive Context encoder to capture the context around the phoneme and its dynamics, thereby reducing the ambiguity in mapping speech to facial movements. Moreover, to alleviate the uncertainty caused by differences in speaking style, we propose a Style Adapter that expands a generic discrete motion space for the target speaker. The Style Adapter not only effectively represents general facial motions but also captures the personalized nuances of facial movements. To further enhance the fidelity of output, we introduce a Dynamic Gaussian Renderer based on 3D Gaussian Splatting, capable of producing stable and realistic rendering videos. Extensive qualitative and quantitative experiments demonstrate that AllTalk surpasses existing state-of-the-art methods, providing an effective solution to the challenge of one-to-many mapping. Project page: https://zjchu.github.io/projects/AllTalk.

Abstract:
Existing unsupervised video anomaly detection methods based on prediction typically employ a memory module to limit the generalization ability of the network so that normal frames can be accurately reconstructed or predicted while abnormal frames cannot. These memory-based methods usually utilize memory to record the fusion prototypes of appearance and motion. However, the motion part of the fusion prototypes is obtained from implicit motion representation, which is incomplete and constrains the ability to detect anomalies. To tackle this issue, we propose a Two-Stage framework with Memory via video Decomposition and bidirectional Consistency (TSMDC), which employs explicit motion data to learn comprehensive motion prototypes and uses video decomposition and bidirectional consistency to learn fine-grained and advanced prototypes. We first decompose the video clip into three components: motion, scene, and object. In the first stage, the features of the motion are extracted to obtain a comprehensive motion representation, and its prototypes are stored in the motion memory. To further exploit the bidirectional information of video, we present a cascaded frame prediction network that is utilized to learn and record the advanced spatio-temporal prototypes in the second stage. Specifically, the fine-grained features of the scene and object are extracted and fused to predict the future frame. Then the initial frame of the video clip is predicted based on bidirectional consistency and motion prototype enhancement. The advanced spatio-temporal prototypes of the video are recorded in this process. Anomalies are evaluated using the combined anomaly score of the predicted future and initial frames. Extensive experimental results on three public datasets indicate the effectiveness of the proposed method. Code will be available at https://github.com/yangugu/TSMDC.

Abstract:
End-to-end optimized Learned Image Compression (LIC) has demonstrated remarkable performance in terms of Rate-Distortion (R-D) efficiency. However, the R-D characteristics of LIC codecs remain underexplored. Previous research has attempted to investigate the R-D behavior through numerical and statistical approaches, but these methods often provide only empirical results, lacking theoretical insights. In this work, we introduce a novel methodology for studying the R-D characteristics of LIC. By rethinking the LIC paradigm from a fresh perspective, we propose a plug-and-play module, the Latent-domain Auto-Encoder (LAE). This innovative approach not only naturally leads to Variable Bit-Rate (VBR) compression, but also allows for a theoretical modeling of the R-D behavior of LIC codecs. Our findings reveal that the bit-rate is the logarithmic sum of the neurons $n_\lambda$ in our designed network’s last layer, plus a constant $C$ introduced by image content, formally expressed as $R_\lambda = \sum \log n_\lambda + C$. This insight is pivotal, as it underscores how the bit-rate can be systematically derived from the latent representations. Further analysis demonstrates that our proposed $R$-$\lambda$ model enables effective rate control for learned image codecs, enhancing their adaptability and accuracy. Experimental results validate that our VBR method surpasses fixed-rate coding by 2.9% in terms of BD-rate. Additionally, the proposed $R$-$\lambda$ model exhibits superior rate control performance, suggesting that it not only elucidates the underlying R-D characteristics of LIC but also significantly enhances its practical deployment in real-world applications.
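
The stated relation between bit-rate and the last-layer values can be evaluated directly. The sketch below is only a numerical illustration of the formula as written in the abstract: the input values, the constant, the base-2 logarithm, and the function name `rate_from_neurons` are assumptions, since the abstract specifies none of them.

```python
import math

def rate_from_neurons(n_lambda, c):
    """Evaluate R = sum(log2(n)) + C for positive values n.

    n_lambda: hypothetical positive last-layer values.
    c: hypothetical content-dependent constant.
    The base-2 logarithm (bits) is an assumption; the abstract only says "log".
    """
    if any(n <= 0 for n in n_lambda):
        raise ValueError("all values must be positive")
    return sum(math.log2(n) for n in n_lambda) + c

# Hypothetical example: three values and a content constant of 0.5.
example_rate = rate_from_neurons([2.0, 4.0, 8.0], 0.5)
```

Because the sum of logarithms equals the logarithm of the product, doubling any single value raises the modeled rate by exactly one bit under this base-2 convention.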

Abstract:
Video, as an information carrier, provides a vast amount of important information to people. Therefore, the method of obtaining video becomes particularly important, which drives the research on text-video cross-modal retrieval technology. However, current text-video cross-modal retrieval models still face several issues. First, these models do not fully utilize the powerful reasoning and generative capabilities of large models to address the issues of missing critical objects and insufficient high-quality video-text paired training data. Second, existing retrieval models do not adequately research the bidirectional cross-modal semantic interaction and reasoning mechanism, which hinders the ability to fully capture and learn the implicit semantic features between different modalities. To address these issues, we propose an innovative bidirectional semantic reasoning and large model data augmentation cross-modal retrieval model (BiSeR-LMA). This model first leverages the strong reasoning and generative capabilities of large models to perform semantic reasoning on the textual descriptions of videos, then generates multiple semantically rich video frames, thereby compensating for the missing critical objects in the original video and improving the quality of video-text paired training data. Second, we design a bidirectional text-video semantic reasoning module, which uses features from one modality as auxiliary information to assist the model in reasoning the implicit semantic information of another modality. This enhances the model’s capability to establish semantic relationships and perform reasoning on implicit semantics, promoting text-video semantic alignment. Finally, we verify the effectiveness of the proposed cross-modal retrieval model on the MSR-VTT, LSMDC, and MSVD datasets.

Abstract:
Snapshot compressive imaging (SCI) captures a 3D hyperspectral image (HSI) using a 2D compressive measurement and reconstructs the desired 3D HSI from that 2D measurement. An effective reconstruction method is thus crucial in SCI. Despite recent successes of deep learning (DL)-based methods over traditional approaches, they often ignore the intrinsic characteristics of HSI and are trained for a specific imaging system using sufficient paired datasets. To address this, we propose a novel self-supervised HSI reconstruction framework called low-rank tensor meets deep prior (LDMeet), which couples model-driven and data-driven methods. The design of LDMeet is inspired by the traditional model-driven low-rank tensor prior constructed based on domain knowledge, which can explore the intrinsic global spatial-spectral correlation of HSI and make the reconstruction method interpretable. To further utilize the powerful learning ability of DL-based approaches, we introduce a self-supervised spatial-spectral guided network (SSG-Net) into LDMeet to learn the implicit deep spatial-spectral prior of HSI without requiring training data, making it adaptable to various imaging systems. An efficient alternating direction method of multipliers (ADMM) algorithm is designed to solve the LDMeet model. Comprehensive experiments confirm that our LDMeet achieves superior results compared to self-supervised HSI reconstruction methods, while also yielding competitive results with supervised learning methods.

Abstract:
Sequential ground moving target imaging (GMTIm) is an imperative and challenging task under terahertz video synthetic aperture radar (THz-ViSAR), which contributes to fine-grained situational awareness and moving target (MT) recognition. However, traditional GMTIm methods are usually designed for single-frame images, which involves repetitive parameter estimation for the sequential imaging problem and lacks efficiency due to parameter sensitivity. To tackle the aforementioned problems, this paper proposes a sequential GMTIm method based on hybrid THz-ViSAR-inverse SAR (ISAR) image formation. With respect to ViSAR processing, the sequential imaging results are first obtained. Considering the similarity of the scene among inter-frame images, MTs can be detected based on target-level change detection after image registration and shadows left on the road. Following this, the defocused target region is transformed to obtain raw echoes, which is beneficial for parallel processing and reduces the computational load. As for ISAR processing, the envelope alignment and auto-focus methods are employed to eliminate the residual motion errors and compensate for phase errors without constructing prior motion patterns. Thereafter, the ratio of equivalent rotational velocities between MTs and the scene is estimated to achieve the azimuth scaling. Finally, sparsity-based imaging enhancement is employed to further enhance the imaging quality. Simulations and airborne experiments are carried out to validate the effectiveness of the proposed method.

Abstract:
Infrared and visible image fusion (IVIF) is increasingly applied in critical fields such as video surveillance and autonomous driving systems. Significant progress has been made in deep learning-based fusion methods. However, these models frequently encounter out-of-distribution (OOD) scenes in real-world applications, which severely impact their performance and reliability. Therefore, addressing the challenge of OOD data is crucial for the safe deployment of these models in open-world environments. Unlike existing research, our focus is on the challenges posed by OOD data in real-world applications and on enhancing the robustness and generalization of models. In this paper, we propose an infrared-visible fusion framework based on Multi-View Augmentation. For external data augmentation, Top-k Selective Vision Alignment is employed to mitigate distribution shifts between datasets by performing RGB-wise transformations on visible images. This strategy effectively introduces augmented samples, enhancing the adaptability of the model to complex real-world scenarios. Additionally, for internal data augmentation, self-supervised learning is established using Weak-Aggressive Augmentation. This enables the model to learn more robust and general feature representations during the fusion process, thereby improving robustness and generalization. Extensive experiments demonstrate that the proposed method exhibits superior performance and robustness across various conditions and environments. Our approach significantly enhances the reliability and stability of IVIF tasks in practical applications.

Abstract:
Existing projection-related point cloud quality assessment (PCQA) methods commonly adopt a straightforward but content-independent projection strategy, which selects a certain number of viewpoints to obtain projected images of degraded point clouds for further assessment. Through experimental studies, however, we observed the instability of final predicted quality scores, which change significantly over different viewpoint settings. Inspired by the “wooden barrel theory”, given the default content-independent viewpoints of existing projection-related PCQA approaches, this paper presents a novel content-aware viewpoint generation network (CAVGN) to learn better viewpoints by taking the distribution of geometric and attribute features of degraded point clouds into consideration. Firstly, the proposed CAVGN extracts multi-scale geometric and texture features of the entire input point cloud, respectively. Then, for each default content-independent viewpoint, the extracted geometric and texture features are refined to focus on its corresponding visible part of the input point cloud. Finally, the refined geometric and texture features are concatenated to generate an optimized viewpoint. To train the proposed CAVGN, we present a self-supervised viewpoint ranking network (SSVRN) to select the viewpoint with the worst quality projected image to construct a default-optimized viewpoint dataset, which consists of thousands of paired default viewpoints and corresponding optimized viewpoints. Experimental results show that the projection-related PCQA methods can achieve higher performance using the viewpoints generated by the proposed CAVGN. The source code can be found at https://github.com/yokeno1/CAVGN1.

Abstract:
High dynamic range (HDR) imaging technology has received increasing attention in recent years, and HDR image quality assessment (IQA) metrics are indispensable during the capturing, processing and displaying of HDR images. However, existing HDR-IQA datasets and methods neglect the complex distortions introduced by HDR image processing schemes, leading to limited generalization performance in practical applications. In this work, to facilitate the development of HDR-IQA datasets, we present HDRQAD, a large-scale HDR Quality Assessment Dataset, which features diversified distortions from HDR imaging technologies, abundant scenes, and a considerable number of images. Specifically, the HDRQAD dataset contains 1409 HDR images, which are derived from source scenes with six types of distortions during the HDR imaging schemes. In contrast to existing datasets that contain only compression artifacts, the HDRQAD includes under-exposure, over-exposure, motion blur and ghosting in HDR images obtained with multi-exposure fusion technology, conversion artifacts in HDR images obtained with single image reconstruction technology, and compression artifacts during the transmission of HDR images. Furthermore, during the process of constructing the dataset, we identified three key challenges in HDR-IQA tasks: 1) dynamic range variations, 2) HDR visual artifacts with large overall gap, 3) inter-regional non-uniform image quality. Based on these observations, we propose a new end-to-end network for HDR-IQA tasks, which consists of a Distortion-aware Representation Learning (DRL) module and an Inter-Regional Quality Interaction (IRQI) module. The DRL learns the representations of dynamic range variations and HDR visual artifacts, enhancing the reliability of prior information extraction. The IRQI captures inter-regional quality dependencies by interacting and fusing intermediate distortion features to predict image quality more accurately. Extensive experiments prove the superiority of the proposed HDRQAD and demonstrate that the proposed network achieves state-of-the-art performance. The Dataset and Code will be made publicly available at HDR-IQA-Dataset.

Abstract:
Robust depth completion of transparent objects would be beneficial for industrial automation such as vision-based robotic grasping and manipulation. However, although some methods try to learn a compact intra-layer feature representation with the boost of the attention mechanism or the vision Transformer, they neglect the corner regions and sparse geometry information that are important for accurate depth completion. To tackle these issues, we propose a novel sim-to-real transferable model, named CAGT, with interactive embedding aggregation and geometry awareness to reconstruct severely sparse depth maps of transparent objects. We design a Depth-clue Interaction Aggregation Module (DIAM) to enhance the Transformer’s ability to extract boundary corner features and thus supplement depth clues. Then, we propose a Geometric Information Augmentation Module (GIAM) to fuse the geometry-aware feature containing shape and surface details. Moreover, we introduce a contrastive learning mechanism to facilitate the sim-to-real generalization of the completion model. Extensive experimental results on two challenging datasets, ClearGrasp and TransCG, demonstrate that our proposed CAGT can obtain superior performance over the state-of-the-art methods. We also demonstrate through a robotic grasping generalization experiment that CAGT can improve the grasp accuracy of transparent objects. The code and supplementary video will be available at: https://github.com/xingshuojing/CAGT.

Abstract:
The current surge in video content highlights the tasks of moment retrieval (MR) and highlight detection (HD), which involve localizing video segments of events and predicting clip-wise saliency scores based on text queries. The recent methods, while effective, may overlook two aspects: 1) Multimodal features often show weak alignment from frozen encoders, hindering thorough semantic exploration of video clips through fine-grained cross-modal interaction. 2) Due to the absence of significant distinction between adjacent video clips, it is challenging for clip-level context modeling to accurately locate query-relevant content. To mitigate these gaps and inspired by the human routine in understanding visual events, we propose a progressive framework dubbed “what and where” to initially grasp the aligned semantics of each video clip, and then proceed to scan moment-level contextual features temporally to identify events matching the query. In the ‘what’ stage, to enable explicit alignment of modal features and achieve a thorough semantic understanding, we firstly devise the Initial Semantic Projection (ISP) loss to bring closer different modal features with similar semantics. Additionally, we develop a Clip Semantic Mining module to deeply mine the relevance of these identified semantics to the specific query (at both word- and sentence-level). In the ‘where’ stage, to enhance feature distinctiveness, we design a Multi-Context Perception module that models moment-level context. It includes an Event Context (EC) branch and a Chronological Context (CC) branch, focusing on possible query-relevant event moments and temporal moments of various lengths. Finally, extensive experiments validate the state-of-the-art performance of our W2W model on three benchmark datasets without additional pre-training. Codes are available at https://github.com/TJUMMG/W2W.

Abstract:
Distributed storage systems, such as Hadoop distributed file system (HDFS), are widely used for video storage due to their outstanding scalability. However, they frequently face challenges related to data unavailability arising from issues like network disconnection, server downtimes, and storage failures. This necessitates file replication, leading to significant storage requirements. To address this, we propose a novel deep reinforcement learning (DRL) algorithm that relies on the tradeoff between video quality and storage demands. We first formulate an optimization problem with the objective of maximizing video quality while constraining storage requirements necessary for replication. The video storage system is then modeled with time-varying video streaming workloads as the DRL environment, where the agent determines the placement of replica files without foreknowledge of future storage availability and video popularity. To address this uncertainty, we use a deep double-Q network (D3QN), which includes an action space that finds the number of replicas for each file, an observation space featuring storage utilization and file placement, and a reward model calculating the expected video quality under various data unavailability probabilities. The implementation of our method is examined within the HDFS. Experimental results show that our method improves video quality by up to 39% compared to benchmarks, achieves quality comparable to Oracle when all bitrates are accessible, and even surpasses HDFS’s triple redundancy method while using only 20% of the storage space.
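
The notion of expected video quality under data unavailability can be made concrete with a small sketch. It is not the reward model of the paper: it assumes that each replica is unreachable independently with the same probability, that the best reachable bitrate version is served, and that quality is zero when no version is reachable. The names `availability` and `expected_quality` are hypothetical.

```python
def availability(p_unavailable, replicas):
    """Probability that at least one of `replicas` independent copies is reachable."""
    if replicas <= 0:
        return 0.0
    return 1.0 - p_unavailable ** replicas

def expected_quality(versions, p_unavailable):
    """Expected quality of one video when the best reachable version is served.

    versions: list of (quality, replicas) pairs, one per bitrate version.
    p_unavailable: assumed per-replica unavailability probability.
    """
    expected = 0.0
    p_all_better_missing = 1.0  # probability that every higher-quality version is unreachable
    for quality, replicas in sorted(versions, reverse=True):
        p_avail = availability(p_unavailable, replicas)
        expected += quality * p_all_better_missing * p_avail
        p_all_better_missing *= 1.0 - p_avail
    return expected

# Hypothetical video: a high-quality version with 1 replica, a lower one with 2.
example_quality = expected_quality([(40.0, 1), (30.0, 2)], p_unavailable=0.5)
```

Under these assumptions, adding a replica to a version raises its availability from 1 - p^r to 1 - p^(r+1), which is the quantity a replica-count policy would trade against the extra storage.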

Abstract:
Polarization imaging is extensively employed in underwater image restoration due to its effectiveness in removing backscattered light. However, existing polarization imaging methods generally assume the degree of polarization (DoP) of the backscattering is spatially constant and estimate it from the background region, limiting their practical applications. To address these challenges, we propose an underwater image restoration method based on a polarization imaging optimization model (PIOM). First, we develop a novel polarization image formation model by fusing the DoP and angle of polarization (AoP) of backscattered light. Second, we introduce an adaptive particle swarm local optimization (APSLO) method based on the PIOM. This method decomposes the image into small blocks and employs an objective optimization function to estimate the local optimal fusion parameters. Additionally, we propose a robust polynomial spatial fitting method to reduce block artifacts and noise disturbances, achieving globally optimal fusion parameters. Finally, we fully consider the advantages of gamma correction, and propose an adaptive contrast enhancement method to balance brightness and contrast. Experimental results show that our PIOM effectively removes backscattering while preserving finer details, colors, and contours. The code and datasets will be available at https://github.com/liyafengLYF/UIRPIOM.
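
Gamma correction, which the adaptive contrast enhancement step builds on, can be sketched as follows. The rule used here to choose the exponent (map the mean intensity to a target value) is a common heuristic adopted purely for illustration and is not claimed to be the authors' method; `adaptive_gamma`, `gamma_correct`, and `target_mean` are hypothetical names.

```python
import math

def adaptive_gamma(pixels, target_mean=0.5):
    """Pick gamma so that the mean intensity value itself is mapped to target_mean.

    pixels: intensities normalized to [0, 1]. Heuristic choice, assumed for illustration.
    """
    mean = sum(pixels) / len(pixels)
    mean = min(max(mean, 1e-6), 1.0 - 1e-6)  # keep the logarithm finite and non-zero
    return math.log(target_mean) / math.log(mean)

def gamma_correct(pixels, gamma):
    """Apply out = in ** gamma to each normalized intensity."""
    return [p ** gamma for p in pixels]

# A dark hypothetical patch with mean 0.25 gets gamma < 1 and is brightened.
dark_patch = [0.25, 0.25]
gamma = adaptive_gamma(dark_patch)
brightened = gamma_correct(dark_patch, gamma)
```

An exponent below 1 brightens dark regions and an exponent above 1 darkens bright ones, which is why a data-dependent choice can balance brightness against contrast.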

Abstract:
In autonomous driving, LiDAR and radar are crucial for environmental perception. LiDAR offers precise 3D spatial sensing information but struggles in adverse weather like fog. Conversely, radar signals can penetrate rain or mist due to their specific wavelength but are prone to noise disturbances. Recent state-of-the-art works reveal that the fusion of radar and LiDAR can lead to robust detection in adverse weather. Current approaches typically fuse features from various data sources using basic convolutional/transformer network architectures and employ straightforward label assignment strategies for object detection. However, these methods have two main limitations: they fail to adequately capture feature interactions and lack consistent regression constraints. In this paper, we propose a bird’s-eye view fusion learning-based anchor box-free object detection system. Our approach introduces a novel interactive transformer module for enhanced feature fusion and an advanced label assignment strategy for more consistent regression, addressing key limitations in existing methods. Specifically, experiments show that our approach ranks first in average precision and significantly outperforms the state-of-the-art method by 13.1% and 19.0% at an Intersection over Union (IoU) of 0.8 under “Clear+Foggy” training conditions for “Clear” and “Foggy” testing, respectively. Our code repository is available at: https://github.com/yyxr75/RaLiBEV.
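
For readers unfamiliar with the IoU criterion quoted above, the following sketch computes it for two boxes and applies the 0.8 threshold. It assumes axis-aligned bird's-eye-view boxes given as `(x_min, y_min, x_max, y_max)`, which is a simplification: detectors of this kind commonly use rotated boxes, and the abstract does not state the box parameterization. The function names are hypothetical.

```python
def iou_axis_aligned(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.8):
    """A prediction matches a ground-truth box when their IoU reaches the threshold."""
    return iou_axis_aligned(pred_box, gt_box) >= threshold

# Hypothetical boxes: the prediction is slightly shorter than the ground truth.
example_iou = iou_axis_aligned((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 9.0))
```

A threshold of 0.8 is strict: a prediction shifted by half its width against an equally sized box reaches an IoU of only one third and would count as a miss.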

Abstract:
In this paper, we present an end-to-end holographic video conferencing system that enables real-time high-quality free-viewpoint rendering of participants in different spatial regions, placing them in a unified virtual space for a more immersive display. Our system offers a cost-effective, complete holographic conferencing process, including multiview 3D data capture, RGB-D stream compression and transmission, high-quality rendering, and immersive display. It employs a sparse set of commodity RGB-D cameras that capture 3D geometric and textural information. We then remotely transmit color and depth maps via standard video encoding and transmission protocols. We propose a GPU-parallelized rendering pipeline based on an image-based virtual view synthesis algorithm to achieve real-time and high-quality scene rendering. This algorithm uses an on-the-fly Truncated Signed Distance Function (TSDF) approach, which marches along virtual rays within a computed precise search interval to determine surface intersections. We then design a multiweight projective texture mapping method to fuse color information from multiple views. Furthermore, we introduce a method that uses a depth confidence map to weight the rendering results from different views, which mitigates the impact of sensor noise and inaccurate measurements on the rendering results. Finally, our system places conference participants from different spaces into a virtual conference environment with a global coordinate system through coordinate transformation, which simulates a real conference scene in physical space, providing an immersive remote conferencing experience. Experimental evaluations confirm our system’s real-time, low-latency, high-quality, and immersive capabilities.
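
The role of a depth confidence map in blending several views can be illustrated for a single output pixel. This is a minimal sketch under the assumption that fusion is a confidence-normalized weighted average of per-view colors; the system's actual weighting and its multiweight texture mapping are not specified in the abstract, and `fuse_pixel` is a hypothetical name.

```python
def fuse_pixel(colors, confidences, eps=1e-8):
    """Blend per-view RGB colors of one pixel using depth confidences as weights.

    colors: list of (r, g, b) tuples, one per view.
    confidences: list of non-negative weights, one per view.
    Falls back to a plain average when no view is considered reliable.
    """
    if not colors or len(colors) != len(confidences):
        raise ValueError("need one confidence per color and at least one view")
    total = sum(confidences)
    if total < eps:
        count = len(colors)
        return tuple(sum(c[k] for c in colors) / count for k in range(3))
    return tuple(
        sum(w * c[k] for c, w in zip(colors, confidences)) / total
        for k in range(3)
    )

# Hypothetical pixel seen by two views; the first view has a more reliable depth.
fused = fuse_pixel([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [3.0, 1.0])
```

Views with noisy or inaccurate depth receive small weights, so their color contributes little to the rendered pixel.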

Abstract:
Tensor ring (TR) decomposition demonstrates superior performance in handling high-order tensors. However, traditional TR-based decomposition algorithms face limitations in real-world applications due to large data sizes, missing entries, and outlier corruption. To address these challenges, we propose a scalable and robust TR decomposition algorithm for large-scale tensor data that effectively handles missing entries and gross corruptions. Our method introduces a novel auto-weighted scaled steepest descent approach that adaptively identifies outliers and completes missing entries during decomposition. Additionally, leveraging the tensor ring decomposition model, we develop a Fast Gram Matrix Computation (FGMC) technique and a Randomized Subtensor Sketching (RStS) strategy, significantly reducing storage and computational complexity. Experimental results demonstrate that the proposed method outperforms existing TR decomposition and tensor completion methods.

Abstract:
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. Due to its ability to capture long-range dependencies, the Transformer model provides a powerful mechanism for integrating multimodal features during feature extraction. This capability significantly enhances the accuracy of multimodal object detection by addressing the limitations of local feature extraction inherent in traditional methods. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient multimodal object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer’s guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features’ detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.

Abstract:
Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior Guided Hybrid Sense Underwater Image Restoration framework (GuidedHybSensUIR). This framework operates on multiple scales, employing the proposed Detail Restorer module to restore low-level detailed features at finer scales and utilizing the proposed Feature Contextualizer module to capture long-range contextual relations of high-level general features at a broader scale. The hybridization of these different scales of sensing results effectively addresses color casts and restores blurry details. In order to effectively point out the evolutionary direction for the model, we propose a novel Color Balance Prior as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, including three paired and three unpaired, sourced from four real-world underwater datasets. Subsequently, we tested 14 traditional methods and retrained 23 existing deep learning underwater image restoration methods on this benchmark, obtaining metric results for each approach. This effort aims to furnish a valuable benchmarking dataset as a standard basis for comparison. The extensive experimental results demonstrate that our method outperforms 37 other state-of-the-art methods overall on various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at https://github.com/CXH-Research/GuidedHybSensUIR.

Abstract:
Feature-based approaches have been the focal point of previous research on knowledge distillation (KD) for dense object detection. These methods employ feature imitation and result in competitive performance. Despite being able to achieve comparable performance in image recognition, response-based KD methods cannot reach the same level in dense object detection. Inspired by the idea of improving distillation performance from two key aspects, where to distill and how to distill, this paper introduces parallel distillation (PD) to fully utilize the sophisticated detection head and transfer all the output responses from the teacher to the student efficiently. In particular, the proposed PD takes into account the specific location of distillation, which is crucial for effective knowledge transfer. Regarding the discrepancies in output responses between the localization branch and the classification branch, we propose a novel Dynamic Localization Temperature (DLT) module to enhance the precision of distilling localization information. As for the classification branch, a Classification Temperature-Free (CTF) module is also designed to increase the robustness of distillation in heterogeneous networks. By incorporating the DLT and CTF into the PD framework to avoid setting temperature values manually, the Flexible Temperature Parallel Distillation (FTPD) is proposed to achieve state-of-the-art (SOTA) performance, and it can be further combined with mainstream feature-based methods for better results. Extensive experiments show that the proposed FTPD outperforms other KD methods in accuracy and robustness in the task of dense object detection.
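
As background for the role of temperature in response-based distillation, here is a minimal sketch of the classic temperature-scaled KD loss (softened softmax plus KL divergence, scaled by the squared temperature). This is the standard formulation with a manually chosen temperature, not the DLT or CTF modules described above; the function names `softmax_t` and `kd_loss` are hypothetical.

```python
import math


def softmax_t(logits, temperature):
    """Softmax over logits divided by a temperature (higher means flatter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In this formulation the temperature is a hyperparameter that has to be tuned by hand, which is the manual step the abstract says FTPD is designed to avoid.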

Abstract:
Place recognition is a fundamental task in robotics, enabling loop closure detection in simultaneous localization and mapping (SLAM), and re-localization on prior maps. Current range image-based networks use single-column convolution to maintain feature invariance to shifts in image columns caused by light detection and ranging (LiDAR) viewpoint change. However, this raises issues such as “restricted receptive fields” and “excessive focus on local regions”, degrading the performance of networks, especially in scenarios with movable objects. In this paper, a lightweight circular convolutional Transformer network named CCTNet is proposed, which aims to boost performance by capturing structural information in point clouds and facilitating cross-dimensional interaction of spatial and channel information. Initially, a Circular Convolution Module (CCM) is introduced, expanding the network’s perceptual field while maintaining feature consistency across varying LiDAR perspectives. Then, a Range Transformer Module (RTM) is proposed, which enhances place recognition accuracy in scenarios with movable objects by employing a combination of channel and spatial attention mechanisms. Furthermore, we propose an overlap-based loss function, transforming the place recognition task from a binary loop closure classification into a regression problem linked to the overlap between LiDAR frames. In extensive experiments on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) and Ford Campus datasets, CCTNet surpasses comparable methods, achieving Recall@1 of 0.924 and 0.965, and Recall@1% of 0.990 and 0.993 on the test set, showcasing superior performance. Results on the self-collected dataset further demonstrate the proposed method’s potential for practical implementation in complex scenarios to handle movable objects, showing improved generalization across various datasets.

Abstract:
Versatile Video Coding (VVC) employs Affine Motion Compensation (AMC) to process scenes with high-order motion. To improve AMC efficiency, the Affine Motion Estimation (AME) process based on the gradient-based iterative algorithm (GIA) and block match algorithm (BMA) is introduced to the VVC Test Model (VTM). However, the AME process is highly complex and difficult to implement in hardware for real-time applications. In this context, this paper proposes a hardware-friendly AME algorithm and implements the corresponding accelerator. Firstly, weighted least squares regression is used to reduce the number of GIA iterations. Then an iteration-free search scheme is proposed to remove the search dependence during the GIA and BMA process. In addition, a motion vector clamping mechanism and a four-level memory organization are proposed to solve the problem of reference pixel reading conflicts, reducing the internal bandwidth of the AME accelerator by 51.7% and 67.5%. Compared with the default AME process of VTM 16.0, experimental results show that the proposed algorithm reduces AME run time by 81.63% while the corresponding Bjontegaard Delta Bit Rate (BDBR) loss is only 0.492%. The proposed AME accelerator can flexibly support AME search tasks in various configurations. Synthesized with the TSMC 28nm process, the proposed architecture has a gate count of 1313K and a power consumption of 156.83 mW. It can process 7680×4320 video at 1.7 fps to 30 fps, with a corresponding BDBR loss of 0.492% to 1.835%.

Abstract:
Clinical research has demonstrated that exploring behavioral signal differences between depressed patients and non-depressed people using audiovisual technology is an effective approach for achieving depression recognition. Hence, in this paper we propose an emotion word reading experiment (EWRE), and extract features from facial expressions and audios for depression recognition. Building upon this, we propose a depression recognition model (DEP-Former), which deeply integrates multimodal features. DEP-Former first designs a modality adapter to achieve emotion space mapping and the sharing of multimodal features, addressing cross-modal inconsistencies. Simultaneously, it proposes a mechanism of attention index sharing, exceeding the limitations of cognitive subjectivity by calculating confidence in key emotional information across modalities. Finally, we propose a multimodal cross-attention module and a Bernoulli distribution feature fusion prediction module to achieve deep integration of multilevel information, thereby enabling depression recognition. Compared with existing advanced multimodal models, DEP-Former demonstrates superior performance in EWRE, achieving an accuracy of 0.9500 and an F1 score of 0.9499, significantly enhancing depression recognition over the single-modality methods. Furthermore, its robust generalization ability is validated on the AVEC 2014 dataset. Through the attention query of the interpretability analysis module, we discover that depressed patients exhibit heightened sensitivity to negative emotional words, such as dismissal and tragedy. In contrast, healthy individuals tend to be more attuned to positive emotional words, including passion, purity, and justice. Additionally, depressed patients exhibit a degree of psychological state diversity, showing sensitivity to some positive emotional words as well. Our codes and data are available at https://github.com/QLUTEmoTechCrew/DEP-Former.

Abstract:
In the High Efficiency Video Coding standard, rate estimation based on context-based adaptive binary arithmetic coding (CABAC) typically achieves high accuracy. However, due to serial data dependencies, hardware implementations suffer from lower throughput. In the latest-generation video coding standard, Versatile Video Coding (VVC), the increased data dependency and computational complexity of the coding process pose further challenges for the hardware design of rate estimation. To solve these problems, this paper presents a hardware implementation of a high-accuracy and high-throughput rate estimation unit for VVC. In terms of throughput improvement, we propose two optimization algorithms to eliminate the majority of data dependencies in coefficient coding with nearly negligible loss in Bjontegaard Delta (BD)-rate performance. To save hardware resources, we introduce a rate estimation table compression algorithm and an optimized local statistical information storage strategy. Based on these optimizations, we present a hardware implementation for the rate estimation unit and a parallel scheme for the rate-distortion optimization process. The proposed algorithm shows an increase of 0.29% in the BD-rate compared to the VVC test model 19.2. Synthesis results show that the proposed design supports real-time coding of 7680×4320 video at 30 fps with a 500 MHz operating frequency. These results indicate that our proposed design performs well in terms of BD-rate performance and throughput. To the best of our knowledge, this is the first hardware implementation of rate estimation for VVC.

Affiliations: National Engineering Research Center of RVC, College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, China; College of Robotics, Hunan University, Changsha, Hunan, China; Department of Radiation Oncology, Stanford University, Stanford, CA, USA; School of Computer Science and Technology, Xidian University, Xi'an, China; ReLER Laboratory, Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia

Abstract:
In the electric power scene, fasteners play a pivotal role in securing and connecting electrical equipment, with small fastener detection (SFD) being crucial for ensuring operational stability. Despite the replacement of manual inspection methods by non-destructive techniques employing deep learning, these approaches often demand substantial computational resources and involve numerous parameters. While knowledge distillation (KD) can be a viable solution, existing KD methods may often fail to achieve satisfactory performance when dealing with small object presentation and little inter-class variability in SFD tasks. To alleviate this, we propose a Focal Multi-scale Shape-feature Distillation Network (FMSD) to achieve efficient and precise fastener detection in electric power scenarios. Specifically, we propose a novel Multi-Scale Shape-Aware Feature Aggregation module (MSFA) to augment the network's perception of object shape and scale during the KD process. Additionally, we propose a Contour-Guided Distillation (CGD) module to optimize the transfer of the extracted shape-sensitive knowledge between the teacher and student models. Through a series of experiments compared with existing state-of-the-art (SOTA) methods, our method demonstrates superior performance over existing SOTA techniques, both efficiently and effectively. Furthermore, validation on publicly available power scene datasets confirms the generalizability and adaptability of our proposed FMSD across various settings.

Affiliations: School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China; School of Mathematics and Computational Science, Xiangtan University, Xiangtan, China; School of Artificial Intelligence and Automation and the Key Laboratory of Image Processing and Intelligent Control of Education Ministry of China, Huazhong University of Science and Technology, Wuhan, China; School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, China

Abstract:
Owing to their ability to effectively characterize the memory effect of magnetic flux, specifically in relation to the effect of external electromagnetic radiation, memristors have elicited widespread interest in the construction of neural networks with complex dynamics. This work proposes a novel memristive multiscroll multistable neural network (MMSMSNN), wherein multistable threshold memristors are used to describe external electromagnetic radiation effects. Numerical simulations show that the MMSMSNN can yield any number of cubic lattice multiscroll attractors by adjusting the internal parameters of memristors. Another highlight is that it can also yield abundant initial offset boosting behaviors, i.e., different kinds of infinitely many homogeneous coexisting attractors, including linearly arranged homogeneous coexisting attractors, planar lattice-distributed homogeneous coexisting attractors, and cubic lattice-distributed homogeneous coexisting attractors. In addition, hardware experiments based on the CH32V307 microcontroller are carried out to validate the numerical findings. Finally, a new secure medical image communication scheme is designed to investigate the MMSMSNN in practical applications, and performance analyses reveal its superiority and high security.

Abstract:
Ultrasound is an important routine screening modality for breast cancer. Breast ultrasound screening is a dynamic process, and clinical practice involves radiologists recording representative frames during dynamic breast scanning for subsequent diagnosis. However, existing computer-assisted diagnosis methods often concentrate on dull diagnostic results by analyzing these representative frames and ignore the valuable information in the dynamic examination process that facilitates diagnosis. Moreover, breast lesions could exhibit various characteristic differences during scanning, and effective learning of lesion representations is challenging and may affect the clinical interpretability of the methods. To this end, we draw insights from the behavior of radiologists during the dynamic breast examination and leverage the knowledge of breast anatomy to propose a clinical knowledge-aware framework for lesion detection and classification of breast lesions in ultrasound videos. It is equipped with global-local attentive aggregation and a dynamic allocation mechanism that simulates the behavior of radiologists searching for diagnostic clues, thus integrating local localization and global semantic information from the video into the feature representation of the lesion. An anatomically-aware transformer is also designed to refine the lesion feature representation using spatial relationships within and across different anatomical layers of the breast anatomy. Extensive experiments show that the proposed framework can achieve competitive performance in both lesion detection and video classification tasks while exhibiting good clinical availability and interpretability, with an average precision of 40.80% and an AUC of 85.86% on our constructed breast video dataset and an average precision of 39.79% and an AUC of 87.04% on a publicly available dataset.

Abstract:
Semi-supervised Object Detection (SSOD) is a method that uses a small amount of labeled data and a large amount of unlabeled data to improve the performance of object detection. However, existing SSOD methods face the challenges of scale imbalance and class inconsistency, resulting in large differences in detection results across different scales and classes. To overcome these challenges, we propose a Scale-Rebalanced Global Proposal Contrast Consistency (SGPC) approach, which has the following three advantages: 1) we design a Scale-Rebalanced Input (SRI) structure, which adjusts the distribution of objects of different scales by resampling the input images at low magnification, thereby enhancing the ability of small object detection; 2) we design a Global Proposal Contrast Consistency Loss (GPCC), which can enhance the intra-class compactness and inter-class diversity of Region of Interest (RoI) features, thereby reducing the class inconsistency in pseudo-labels; and 3) we adopt a loss blending optimization strategy, which optimizes the localization accuracy of pseudo-labels by combining supervised loss and unsupervised loss. We conduct extensive experiments on multiple datasets, and the results show that SGPC significantly outperforms other recent methods on the SSOD task. On the PASCAL VOC dataset, SGPC achieves 55.90 mAP. On the MS-COCO dataset, SGPC exceeds the supervised methods by more than 10 mAP at different scales. We also verify the significant improvement and robustness of SGPC on the small object detection datasets VisDrone-2019 and EDD.

Abstract:
With the rise of telemedicine and intelligent diagnostics, the efficiency and accuracy of healthcare services have been significantly enhanced. However, the highly sensitive nature of medical images makes protecting patient privacy during transmission and storage a critical challenge. In this paper, we propose a secure medical image encryption scheme based on cross-ring Josephus scrambling and two-dimensional cellular automata, designed to safeguard medical images. First, we introduce a novel two-dimensional chaotic map (2D-CICM) with an expanded parameter range to generate high-quality key sequences for encryption. Next, we design a cross-ring Josephus scrambling algorithm for pixel permutation, where the eliminated pixel is determined by both inter-ring and intra-ring step sizes. Following this, we develop a diffusion mechanism based on interaction rules defined by six types of two-neighbor structures within a cellular automaton framework. To enhance key sensitivity and image-specific security, we also incorporate the 512-bit hash value of the plaintext image to dynamically update the initial keys, ensuring that the encryption key sequences are unique for each image. Comprehensive security analyses and performance evaluations confirm that the proposed scheme provides strong encryption performance and effectively resists common attacks, while maintaining computational efficiency suitable for medical applications.
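
To illustrate the permutation idea behind Josephus scrambling, the sketch below implements the classic single-ring Josephus problem with one fixed step and uses the elimination order to permute a flattened pixel list. It is deliberately simpler than the cross-ring variant described above, which uses both inter-ring and intra-ring step sizes; all function names here are hypothetical, and in a real scheme the step would be derived from the key stream rather than fixed.

```python
def josephus_order(n, step):
    """Elimination order of positions 0..n-1 when every `step`-th is removed."""
    remaining = list(range(n))
    order = []
    idx = 0
    while remaining:
        idx = (idx + step - 1) % len(remaining)
        order.append(remaining.pop(idx))
    return order


def scramble(pixels, step):
    """Permute pixels: output position t holds the t-th eliminated pixel."""
    return [pixels[i] for i in josephus_order(len(pixels), step)]


def unscramble(scrambled, step):
    """Invert `scramble` by regenerating the same elimination order."""
    order = josephus_order(len(scrambled), step)
    restored = [None] * len(scrambled)
    for dst, src in enumerate(order):
        restored[src] = scrambled[dst]
    return restored
```

Scrambling of this kind only moves pixel positions and leaves the histogram unchanged, which is why such schemes pair it with a diffusion stage that alters pixel values.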

Abstract:
Wide-angle videos shot with short-focus lenses often exhibit deformation distortions, which poses significant challenges for video quality assessment (VQA). Although current VQA methods focus primarily on video content and distortion perception, there has been little explicit research on the impact of deformation characteristics on the perception of wide-angle video quality. To this end, this paper makes the first attempt to construct a novel wide-angle video quality assessment method based on deformation representation learning and multi-dimensional feature fusion, termed DRLMF. Specifically, we first analyze the deformation distribution characteristics of wide-angle videos based on the deformation camera model. Based on this, a three-stream video perception and assessment network is proposed. The first branch extracts global semantics using the image encoder of CLIP. The second branch introduces an effective deformation region selection strategy and proposes an interpretable deformation representation learning module. This module leverages the perception advantages of convolutional neural networks (CNNs) in local distortions and considers the correlation between patch size and distortion perception. The third branch extracts motion features using an action recognition network. Finally, an effective multi-dimensional feature fusion module is proposed to integrate more refined and richer semantic, deformation, and motion features. Extensive experiments on wide-angle VQA datasets and standard video datasets show that the DRLMF outperforms the state-of-the-arts in terms of prediction monotonicity and accuracy. The codes will be available at https://github.com/BoHu90/DRLMF

Abstract:
Research on adversarial attacks in remote sensing tasks has predominantly focused on designing perturbations or patches, which makes it challenging to balance attack success rate and adversarial stealthiness. Instead of focusing on designing adversarial examples under adversarial perturbation constraints to ensure stealthiness, this paper proposes a Style Transfer Enabled Adversarial Attack with Attention Mechanism (STEAM), which leverages style transfer to generate adversarial examples with high visual fidelity. Specifically, STEAM transfers distinctive styles from critical regions of source samples to attackable areas in target samples, effectively incorporating natural textures from source samples. To further refine this process, an attention mechanism is introduced to selectively extract style features from key regions of the source samples, mitigating redundancy from global style information. Additionally, the selective style transfer process considers semantic features across different regions in target samples, ensuring more effective attack area selection. As a result, STEAM achieves a high attack success rate by utilizing proper styles selected from groups of style samples, and preserves high visual fidelity by selectively transferring natural style features into specific attackable regions in target samples. Experimental results on the UCM and WHU-RS19 datasets demonstrate that STEAM not only enhances the visual fidelity of adversarial examples but also improves the attack success rate. Furthermore, experiments against state-of-the-art adversarial defense methods highlight the adversarial attack effectiveness and robustness of STEAM compared to other adversarial attack methods.

Abstract:
Hyperspectral image multi-class change detection (HSI-MCD) based on deep learning (DL) relies heavily on the amount of labeled data. Due to the high cost of manually labeling hyperspectral images (HSIs), obtaining a large number of labeled samples is difficult. Moreover, for multi-class change detection (MCD) tasks, complex change scenarios give rise to semantic cross-coupling of changes. To solve the above problems, a cross-domain few-shot learning method based on fractional domain information for HSI-MCD (FrCFSL) is proposed. Firstly, a spectral-spatial-fractional information extraction module is proposed, which can extract joint spectral-spatial-fractional domain features. Thus, the module can obtain more comprehensive and discriminative representations of land cover categories, alleviating the phenomenon of semantic cross-coupling between classes. Afterward, a cross-domain few-shot learning strategy is introduced, which learns task-relevant category discrimination meta-knowledge from a richly labeled dataset of paired very high-resolution optical images (VHRIs) and transfers it to the bitemporal HSI dataset. Thus, the model can achieve better MCD performance with a small number of labeled samples. Finally, to mitigate the domain distribution differences between VHRI data and HSI data, a topological structure alignment module is proposed to align the intrinsic topological relationships between land cover categories, thus narrowing the gap between the two domain distributions. Experiments on three HSI-MCD datasets and comparative analysis with six state-of-the-art methods indicate the validity and stability of the proposed method.

Abstract:
3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data. Due to data limitations, these generators cannot generate one-quarter headshot 3D portraits with head, neck, and shoulder geometry, which is crucial for applications like talking heads. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360°-Portrait-HQ (360°PHQ for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360°PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.

Abstract:
Facial expression recognition (FER) in the wild suffers from ambiguous facial expressions, occlusions, and backgrounds, leading to inter-class similarities and intra-class variations. Most existing methods introduce maximization and orthogonality loss functions into convolutional neural networks (CNNs) to restrict inter-class samples in a mini-batch. Such methods neglect intrinsic emotion correlations, resulting in over-separation and under-separation. Besides, CNNs have limitations in capturing global spatial information due to the inductive bias property. Thus, we propose a novel FER network to reveal intrinsic emotion correlations and learn global expression-related features for highly accurate FER in the wild. In this network, first, a correlation-biased orthogonality loss based on emotion correlations separates inter-class samples towards class distances. To decouple this loss from the mini-batch and alleviate the influence of ambiguous expressions, we use a conditional queue to maintain high-quality expressions. Second, implicit selective transformers are plugged into the intermediate stages of the CNN to construct long-range relationships between local regions. In particular, this module implicitly measures the importance of local regions to eliminate occlusions and backgrounds. Third, a cross-attention fusion module uses high-level features to guide the intermediate features to be class-specific and finally fuses them for emotion recognition. The experimental results on five in-the-wild datasets demonstrate the effectiveness and superiority of our network by showing clear performance improvements over other state-of-the-art FER methods. Our codes are available at https://github.com/Gabrella/R-SFT

Abstract:
With the continuous development of autonomous aerial system (AAS) visual positioning technology, dynamic target detection and feature point optimization have become difficult problems for AASs seeking high-precision visual positioning in dynamic environments. To improve AAS target detection accuracy in a dynamic environment, this paper proposes a dynamic target detection method for airborne cameras based on background prediction and semantic compensation. Firstly, to address the high false detection rate of the traditional background difference method on cameras mounted on moving carriers, this paper proposes a background compensation method based on region of interest prediction, which combines a scale-transformed Unscented Kalman filter (ST-UKF) with the Rodrigues Formula with Perspective Transformation (RFPT) to predict the background model. Then, a moving target discrimination method based on semantic confidence is proposed to solve the problem that the traditional semantic map cannot effectively discriminate the current state of an object, which leads to excessive elimination of effective feature points. In addition, a general detection framework for airborne cameras is proposed to obtain accurate and reliable target selection boxes and thereby improve the positioning accuracy of traditional visual positioning methods in dynamic environments. The feasibility and innovation of the proposed algorithm are verified through dataset simulation and experiments in an experimental environment.

Abstract:
Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field use transformer-based models to model the interaction between the question and the scene texts, which means the scene texts are treated as a natural language sentence and concatenated in reading order as part of the input. However, they ignore the fact that, unlike words in a natural language sentence, which have an inherent context relation, the context relation of scene texts in images needs to be determined. To tackle this problem, we propose a novel method named Separate, Locate and Align (SLA) that discriminates the context relation of scene texts from semantic, visual and spatial aspects. Specifically, based on the observation that scene texts with similar visual information (e.g., background color, font color, font style, etc.) have semantic contextual relations, we propose a Text Semantic Separate (TSS) module to discriminate the semantic relation between different scene texts according to their visual contextual information. Then, we introduce a Spatial Circle Position (SCP) module that helps the model discriminate the spatial relation between different scene texts. Last, we design a Visual Alignment (VA) module to help the model distinguish the visual relationships between different scene texts according to the color distribution differences. Extensive experiments show that our method outperforms existing alternatives on the TextVQA and ST-VQA datasets without pre-training tasks.

Abstract:
Facial point clouds collected in practical applications often suffer from pose variations and occlusion. Existing studies typically focus on either pose estimation or landmarks localization, neglecting to fully utilize the effective information from various facial features, thus limiting the improvement of prediction accuracy. Therefore, we propose an innovative 3D facial multi-task prediction network. The proposed network embeds the output of related tasks into feature extraction from the point level to the global level based on the physical dependencies between tasks. This facilitates explicit multi-task knowledge transfer, enabling the simultaneous prediction of facial landmarks, occlusion, and head pose. We introduce a training strategy based on posterior knowledge correction to iteratively refine and improve multi-task prediction results. Moreover, no single dataset provides annotations for all these tasks at once, so we synthesized a 3D landmarks, occlusion and pose (3D-LOP) dataset, which includes annotations for landmarks coordinates, occlusion probability, and head pose. The proposed method was compared with state-of-the-art methods on two public datasets and 3D-LOP. The landmarks localization accuracy improved by 7.1% on the two public datasets, and the pose estimation accuracy and stability on 3D-LOP improved by 28.5% and 32.7%, respectively. The performance on wild data also shows its potential in practical applications.

Abstract:
Deep Hashing (DH) based image retrieval is commonly used in facial recognition systems for its precision and effectiveness. However, this convenience is accompanied by a mounting threat to privacy. DH models are vulnerable to adversarial attacks, which can be leveraged to prevent the retrieval of private images. Current adversarial attacks on DH models commonly focus on individual images or specific categories, lacking universal perturbations for the entire hashing dataset. This paper introduces the UTAP series, the first universal, transferable, and robust adversarial perturbation against DH facial image retrieval, safeguarding all images with a single perturbation. We explore the relationships between clusters learned by different DH models and define the optimization goal of the UTAP series as moving away from the voted overall hash center. To alleviate the challenges of single-objective optimization, we randomly vote for sub-cluster centers and propose sub-task-based meta-learning to aid global optimization. Furthermore, we dissect the functional roles of key components in DH models and introduce UTAP++, a feature-hashing two-stage attack that is readily adaptable to cross-model and cross-scheme ensemble adversarial attacks. Extensive experiments conducted on renowned face datasets and DH models under varied complex scenarios, encompassing cross-image, cross-model, cross-bit, cross-algorithm, model ensemble, algorithm ensemble, and image compression, reveal that the UTAP series demonstrates remarkable universality, transferability, and robustness in preventing facial image retrieval. Compared to existing state-of-the-art methods, the UTAP series excels in white-box settings and exhibits significant transferability improvements of 10%-70% in all black-box settings, with 20% and 55% average robustness improvements in white-box and black-box settings, respectively. These findings underscore the practical value of the UTAP series in real-world settings, presenting novel and effective defense strategies against unauthorized facial image retrieval.

Abstract:
Deep learning-based low-light image enhancers have made significant progress in recent years, with a trend towards achieving satisfactory visual quality while gradually reducing the number of parameters and improving computational efficiency. In this work, we aim to delve into the limits of image enhancers in terms of both visual quality and computational efficiency, striving for better performance and faster processing at the same time. Concretely, by rethinking the task demands, we build an explicit connection: visual quality and computational efficiency correspond to model learning and structure design, respectively. Around this connection, we enlarge the parameter space by introducing re-parameterization for ample model learning of a pre-defined minimalist network (e.g., just one layer), to avoid falling into a local solution. To strengthen the structural representation, we define a hierarchical search scheme for discovering a task-oriented re-parameterized structure, which also provides powerful support for efficiency. Ultimately, this achieves efficient low-light image enhancement using only a single convolutional layer, while maintaining excellent visual quality. Experimental results show clear superiority in both quality and efficiency over recently proposed methods. In particular, our running time on various platforms (e.g., CPU, GPU, NPU, DSP) is consistently shorter than that of the existing fastest scheme. The source code will be released at https://github.com/vis-opt-group/AR-LLIE.

Abstract:
Medical images exhibit a wide range of gray scales and are frequently affected by noise and artifacts, necessitating high security and robustness during transmission. A novel medical image encryption scheme combined with compressed sensing is proposed. This scheme uses the reconstructed hyperchaotic map as a tool for generating random sequences. Multi-faceted dynamic behavior analysis verifies that the map possesses excellent unpredictability and an ultra-wide key space, and its hardware implementation lays a foundation for practical image encryption. The image is decomposed into bit-planes, and cross scrambling and ascending diffusion are applied to different bit-planes. Combined with compressed sensing, the medical image is transformed into ciphers of varying sizes to achieve secure transmission. This approach significantly enhances encryption efficiency by reducing image size. Comprehensive multi-dimensional performance tests demonstrate its superior effectiveness in medical image encryption, providing strong security performance for the application of the scheme in the Internet of Medical Things (IoMT).

Abstract:
Few-shot Semantic Segmentation (FSS) aims to accurately segment query images with guidance from only a few annotated support images. Previous methods typically rely on pixel-level feature correlations, in either a many-to-many (pixels-to-pixels) or a few-to-many (prototype-to-pixels) manner. The recent mask-proposal classification pipeline in semantic segmentation enables more efficient few-to-few (prototype-to-prototype) correlation between masks of query proposals and support reference. However, these methods still involve intermediate pixel-level feature correlation, resulting in lower efficiency. In this paper, we introduce the Proposal and Reference masks matching transFormer (PRFormer), designed to rigorously address mask matching in both spatial and semantic aspects in a thorough few-to-few manner. Following the mask-classification paradigm, PRFormer starts with a class-agnostic proposal generator to partition the query image into proposal masks. It then evaluates the features corresponding to query proposal masks and support reference masks using two strategies: semantic matching based on feature similarity across prototypes and spatial matching through mask intersection ratio. These strategies are implemented as the Prototype Contrastive Correlation (PrCC) and Prior-Proposals Intersection (PPI) modules, respectively. They enhance matching precision and efficiency while eliminating dependence on pixel-level feature correlations. Additionally, we propose the category discrimination NCE (cdNCE) loss and IoU-KLD loss to constrain the adapted prototypes and align the similarity vector with the corresponding IoU between proposals and ground truth. Given that class-agnostic proposals tend to be more accurate for training classes than for novel classes in FSS, we introduce the Weighted Proposal Refinement (WPR) to refine the most confident masks with detailed features, yielding more precise predictions. Experiments on the popular Pascal-5^i and COCO-20^i benchmarks show that our few-to-few approach, PRFormer, outperforms previous methods, achieving mIoU scores of 70.4% and 49.4%, respectively, on 1-shot segmentation. Code is available at https://github.com/ANDYZAQ/PRFormer.

Abstract:
To date, the image quality assessment (IQA) research field has mainly focused on natural images (NIs)-based IQA and screen content images (SCIs)-based IQA. Usually, these two research branches are quite independent due to the large differences between NIs and SCIs, where NIs, captured by cameras directly, contain pictorial information solely, yet, SCIs, synthesized or GPU-rendered, have pictures and textures. Moreover, the distortion types are also different, and subjective scores of different datasets assigned by participants are usually not well aligned. So, due to the above-mentioned “domain shifts” and “dataset misalignments”, our research community has widely believed that it could be very difficult to achieve joint mutual promotions between NIs- and SCIs-based IQA. In this paper, we argue that despite the “differences”, there still are some “common characteristics” — our human visual system perceives the “pictures” in both SCIs and NIs almost the same way. Thus, we can still achieve mutual performance promotion if we can appropriately use the “common characteristics” between SCIs and NIs. Our key idea is to devise a “content-aware” data switch, which, from the perspective of input’s contents (i.e., pictures or textures), aims at letting the model automatically enhance the commonness and compress the discrepancies between the two tasks. Notice that none of the existing fusion schemes can reach this goal since they are actually content-unaware, degenerating the “mutual interactions” into “mutual interferences”. This paper is the first attempt to achieve full end-to-end “mutual interactions” between NIs- and SCIs-based IQA. Using the proposed switch, we are also the first to achieve solid mutual promotions for the two tasks, reaching new SOTA results.

Affiliations: School of Information and Electronics, Beijing Institute of Technology, Beijing, China; School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China; School of Computer and Information Science, Southwest University, Chongqing, China; School of Science and Engineering and the Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong (Shenzhen), Shenzhen, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China

Abstract:
Semantic communication is an emerging paradigm with significant potential for image transmission. However, resource-efficient architecture design and resource allocation in this field have not received adequate research attention. This paper proposes a resource-efficient multi-branch semantic communication architecture based on saliency detection, aimed at optimizing computational efficiency in image transmission. The architecture leverages models with varying capacities to process regions of images with different complexities. We further address the problem of multi-user uplink semantic communication and resource allocation, focusing on minimizing the total energy consumption for communication and computation. The optimization problem, subject to user demand, computation, delay, and transmission power constraints, is non-convex due to the coupling of variables, making it challenging to solve. To tackle this, we introduce a two-level decomposition approach. The lower-level problem, given a fixed compression rate, is solved using Karush-Kuhn-Tucker (KKT) conditions to derive closed-form solutions for transmission power and computation frequency. The upper-level problem, which optimizes the compression rate, is reformulated as a monotone optimization problem for efficient solution finding. Numerical results demonstrate that the proposed architecture significantly reduces computational resource usage while maintaining image quality, and the resource allocation strategy effectively minimizes energy consumption, outperforming baseline schemes in terms of energy efficiency.

Abstract:
Monocular visual odometry (VO) is crucial for the application of various autonomous systems. However, the inherent scale ambiguity in monocular methods greatly limits their performance in pose estimation. In this paper, we propose a hybrid monocular VO system named KPDepth-VO, which solves camera pose from monocular video based on sparse keypoints. To estimate the scale-consistent relative pose, we present a novel photometric-sensitive depth uncertainty model that accounts for the depth uncertainty introduced by limitations in the photometric error constraint. We also introduce an uncertainty-aware scale recovery strategy that incorporates depth uncertainty for reliable scale alignment. Additionally, we propose a novel difference attention mechanism to construct a point filter that effectively filters out less distinctive points, ensuring high-quality matches for more accurate and efficient pose estimation in the proposed system. Experimental results on the KITTI dataset and the Oxford RobotCar dataset demonstrate that our system can predict scale-consistent trajectories from monocular videos and achieve state-of-the-art performance among similar methods. Meanwhile, the depth network within our system achieves competitive depth estimation performance on the KITTI depth benchmark.

Abstract:
Computer-aided pathology diagnosis based on whole slide images is often formulated as a weakly supervised multiple instance learning (MIL) problem. Current approaches generally employ attention mechanisms to aggregate instance-level features. However, the weakly supervised signal and the imbalanced instance distribution often lead to inaccurate attention localization, compromising the performance and generalization capability of the MIL framework. To address these problems, this paper presents a novel MIL framework called FAMIL that focuses on inaccurate attention and refines it. FAMIL adopts a dual-branch structure and incorporates two innovative online data augmentation strategies: attention-based Mixup (ABMix) and attention-based Masking (ABMask). ABMix emphasizes the significance of positive instances, generalizing Mixup to MIL scenarios, while ABMask flexibly identifies challenging positive instances to optimize the feature representation. Moreover, these two methods are plug-and-play and can be easily embedded into attention-based MIL methods. Extensive experiments on three public benchmarks demonstrate the superiority of FAMIL, which outperforms current state-of-the-art methods. The test AUC for binary tumor classification reaches 92.61% on CAMELYON16, and the AUC for cancer subtype classification reaches 93.81% and 98.41% on the TCGA-NSCLC and TCGA-RCC datasets, respectively.
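
The attention-based aggregation that the FAMIL abstract refers to can be illustrated with a minimal sketch. It assumes the common form in which each instance is scored by a small tanh network and the scores are normalized with a softmax; the abstract does not specify FAMIL's own attention module, and the function name and the parameters `v` and `w` are hypothetical.

```python
import math


def attention_mil_pool(instances, v, w):
    """Aggregate instance features into one bag feature with attention weights.

    instances: list of N feature vectors, each of length D
    v: hidden projection, H rows of length D (hypothetical learned parameters)
    w: scoring vector of length H (hypothetical learned parameters)
    Returns (bag_feature, attention_weights).
    """
    # Score each instance: score_k = w . tanh(V h_k)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(vi * hi for vi, hi in zip(row, h))) for row in v]
        scores.append(sum(wi * hid for wi, hid in zip(w, hidden)))

    # Softmax over instances, shifted by the maximum for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # The bag feature is the attention-weighted sum of the instance features
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    bag, attn = attention_mil_pool(patches, v=[[1.0, 0.0], [0.0, 1.0]], w=[1.0, 1.0])
    print(attn, bag)
```

Because only a bag-level label supervises these weights, a few instances can dominate the softmax, which is the inaccurate attention localization the abstract sets out to refine.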

Abstract:
The rapid development of vision-based 3D perception, in conjunction with the inherent vulnerability of deep neural networks to adversarial examples, motivates us to investigate realistic adversarial attacks on 3D detection models in autonomous driving scenarios. Due to the perspective transformation from 3D space to the image and object occlusion, current 2D image attacks are difficult to generalize to 3D detectors and are limited by physical feasibility. In this work, we propose a unified framework to generate physically printable adversarial patches with different attack goals: 1) instance-level hiding, where pasting the learned patches onto any target vehicle allows it to evade detection; 2) scene-level creating, where placing the adversarial patch in the scene induces the detector to perceive many fake objects. Both crafted patches are universal and take effect across a wide range of objects and scenes. To achieve the above attacks, we first introduce a differentiable image-3D rendering algorithm that makes it possible to learn a patch located in 3D space. Then, two novel designs are devised to promote effective learning of patch content: 1) a Sparse Object Sampling Strategy is proposed to ensure that the rendered patches follow the perspective criterion and avoid being occluded during training, and 2) a Patch-Oriented Adversarial Optimization is used to focus the learning process on the patch areas. Both digital and physical-world experiments are conducted and demonstrate the effectiveness of our approaches, revealing potential threats when confronted with malicious attacks. We also investigate a defense strategy using adversarial augmentation to further improve the model’s robustness.

Abstract:
Detecting foreground objects in video separation tasks is challenging in complex environments. This paper presents a deep neural network architecture that considers spectral, spatial, and temporal features at the same time. Under this framework, we design a gated mechanism for attention and incorporate it into the Transformer architecture (GMAT). This model employs a gating mechanism to dynamically control and allocate attention across different features (wavelet features and raw features), learning how to balance their importance and to emphasize or ignore certain features based on the current context. Additionally, to enhance features in video frame data and better focus on important features while ignoring unimportant ones in GMAT, we introduce an optical flow estimation method based on the wavelet transform. Because the wavelet transform is good at capturing motion features at different scales, its introduction enables a more comprehensive and detailed focus on finer features. We evaluate GMAT on various videos from the CDNet2014 dataset, and the qualitative and quantitative results demonstrate significant improvements in foreground detection over several recent video separation models.
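
A gate that balances wavelet features against raw features, as described in the GMAT abstract above, can be sketched as follows. The abstract does not give the gate's exact form, so this assumes a standard element-wise sigmoid gate computed from the concatenated inputs; `gated_fusion`, `gate_w`, and `gate_b` are hypothetical names.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(wavelet_feat, raw_feat, gate_w, gate_b):
    """Blend two feature vectors of length D with an input-dependent gate.

    gate_w: D rows of length 2*D, gate_b: length D (hypothetical learned parameters).
    Returns (fused_features, gate_values); a gate near 1 favors the wavelet
    features and a gate near 0 favors the raw features.
    """
    joint = list(wavelet_feat) + list(raw_feat)
    fused, gates = [], []
    for d, (xw, xr) in enumerate(zip(wavelet_feat, raw_feat)):
        # The gate for dimension d depends on both inputs (the current context)
        g = sigmoid(sum(w * x for w, x in zip(gate_w[d], joint)) + gate_b[d])
        gates.append(g)
        fused.append(g * xw + (1.0 - g) * xr)
    return fused, gates


if __name__ == "__main__":
    zeros = [[0.0] * 4, [0.0] * 4]
    print(gated_fusion([2.0, 4.0], [0.0, 0.0], zeros, [0.0, 0.0]))
```

With all-zero gate parameters every gate equals 0.5, so the output is the plain average of the two inputs; training would move the gates away from this neutral point.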

Abstract:
This paper addresses the rate control issue for Video-on-Demand (VoD) services in Low Earth Orbit (LEO) satellite Internet. LEO systems employ long-distance Non-Orthogonal Multiple Access (NOMA), where the transmission rate of the last hop directly determines the Quality of Experience (QoE) levels for the VoD users and the satellite’s energy consumption. Our research identifies two primary issues: (i) determining the transmission rate to ensure high user QoE while minimizing energy consumption, and (ii) ensuring fairness among users within the satellite coverage area. To address these issues, we model the multi-user VoD viewing process as a Partially Observable Markov Decision Process (POMDP) and describe the interactions among users using a cooperative coalition game framework. We propose Harmony, a distributed and dynamic improvement solution based on the Deep Deterministic Policy Gradient (DDPG) approach. Harmony intelligently determines each user’s transmission rate by combining feedback from user applications and MEC server metrics, ensuring superior QoE levels, energy efficiency, and fairness. The trained Harmony can be adapted to various Adaptive BitRate (ABR) algorithms, providing scalability and immediate applicability in existing LEO networks. It can also achieve improved performance in dynamic user environments. Simulation results demonstrate that Harmony improves energy efficiency and fairness, while maintaining high QoE levels and reducing MEC traffic overhead by 28.1% to 62.6%.

Abstract:
Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration. Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.
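
The trajectory alignment step in the abstract above, where pseudo-scale trajectories are registered to world-coordinate trajectories via timestamps, can be illustrated with a least-squares similarity fit. This is a simplified 2D sketch that assumes ground-plane points already paired by timestamp; it is not the paper's initialization or factor graph, and the function names are hypothetical.

```python
def align_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst points.

    src, dst: equal-length lists of (x, y) points, paired by timestamp.
    Returns complex numbers (a, t) such that dst ~ a * src + t, where abs(a)
    is the scale factor and the phase of a is the rotation angle.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    # Closed-form minimizer of sum |a * z_i + t - w_i|^2 over complex a and t
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    if den == 0:
        raise ValueError("source points are all identical")
    a = num / den
    t = w_mean - a * z_mean
    return a, t


def apply_similarity_2d(a, t, point):
    """Map one (x, y) point with the transform returned by align_similarity_2d."""
    z = a * complex(point[0], point[1]) + t
    return (z.real, z.imag)


if __name__ == "__main__":
    camera_track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    world_track = [(3.0, -1.0), (3.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
    a, t = align_similarity_2d(camera_track, world_track)
    print(abs(a), t)
```

Representing 2D points as complex numbers folds scale and rotation into a single multiplier, so the fit needs no matrix decomposition. A full 3D alignment would typically use an SVD-based method instead.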

Abstract:
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. Firstly, we provide 2,021 videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 32 open scenarios, including many virtual scenarios, e.g., Life Vlog, Driving, Movie, Game, etc. Thirdly, abundant text type annotations (i.e., title, caption or scene text) are provided for the different representational meanings in the video. Fourthly, BOVText provides bilingual text annotation to promote cross-cultural life and communication. Besides, we propose a real-time end-to-end video text spotter with Contrastive Learning of Semantic and Visual Representation (CoText), which has two advantages: 1) With a lightweight architecture, CoText simultaneously addresses the three tasks (i.e., text detection, tracking, recognition) in a real-time end-to-end trainable framework. 2) CoText tracks texts by comprehending them and relating them to each other with visual and semantic representations. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting $\mathrm{ID_{F1}}$ of 71.7% at 32.3 FPS on ICDAR2015video, improvements of 10.2% and 23.3 FPS over the previous best method. The dataset and the code of CoText are publicly available.

Abstract:
Deep neural networks (DNNs) have been successfully applied to remote sensing semantic segmentation. However, training DNNs requires a large number of densely labeled samples, which are laborious and time-consuming to produce. Sparsely supervised semantic segmentation (SSSS) can train deep segmentation networks using only sparse annotations. In this paper, we propose a negative class guided spatial consistency network (NCG-SCNet) for semantic segmentation with sparse annotations. Specifically, we introduce a spatial consistency enhancement module (SCEM) to enhance network features by non-linearly combining spatially similar features, which provides better representations of the boundaries and shape of the target. Additionally, a channel compression module (CCM) is proposed to reduce channel redundancy while preserving the network’s feature extraction capability. A negative class guided loss function (NCG Loss) is constructed to provide extra supervisory information, where the negative classes are defined as the classes with lower probability in the prediction. Extensive experiments on two widely used remote sensing datasets show that the proposed NCG-SCNet outperforms the comparison methods.
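
The idea of drawing extra supervision from negative classes can be sketched for a single pixel. The abstract only states that negative classes are those with lower predicted probability. This sketch therefore assumes the `num_negative` least probable classes are selected and penalized with a complementary log-likelihood term; the actual NCG Loss may differ, and all names here are hypothetical.

```python
import math


def negative_class_loss(probs, num_negative):
    """Penalize probability mass on the least likely classes of one pixel.

    probs: predicted class probabilities for one pixel (summing to 1)
    num_negative: how many low-probability classes to treat as negatives
    Returns (loss, negative_class_indices).
    """
    if not 0 < num_negative < len(probs):
        raise ValueError("num_negative must be between 1 and len(probs) - 1")
    # Assumption: negatives are the classes with the lowest predicted probability
    negatives = sorted(range(len(probs)), key=lambda c: probs[c])[:num_negative]
    eps = 1e-12  # guards against log(0)
    # Complementary term: minimizing -log(1 - p_c) pushes p_c towards zero
    loss = -sum(math.log(max(1.0 - probs[c], eps)) for c in negatives) / num_negative
    return loss, negatives


if __name__ == "__main__":
    print(negative_class_loss([0.7, 0.2, 0.05, 0.05], num_negative=2))
```

Such a term needs no ground-truth label, which is why it could add a training signal on the unlabeled pixels left by sparse annotation.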

Abstract:
Compared with traditional modification-based image steganography, coverless image steganography can resist detection by steganalysis algorithms because it makes no modification to the carriers. Previous works have made great efforts to improve the robustness against image attacks. However, the robustness against geometric attacks remains unsatisfactory. After studying the general flow of coverless image steganography, we find that the receiver always needs to generate or map the hash sequences directly from the received images, which significantly impairs the extraction of correct secret information because these received images might have been attacked. Inspired by this finding, we explore a common way to solve the problem by proposing a universal restoration framework for the attacked images. The most important module of the framework, the restoration module, contains two main parts, i.e., the classification sub-module and the attack restoration sub-module. The attacked images at the receiving end are first sent to the classification sub-module to estimate the type of the attack. Then, the corresponding attack restoration sub-module is used to repair the attacked images to improve the robustness. Experimental results show that the robustness of existing coverless image steganography methods is greatly improved by the proposed framework without introducing extra security issues.

Abstract:
Ultrasound examination has become a vital mid-term prenatal screening method due to its numerous benefits. However, identifying fetal ultrasound sections poses significant challenges and requires the expertise of experienced doctors, especially when dealing with unfavorable fetal positions and image artifacts. The shortage of skilled doctors and the complexity of these screenings highlight the necessity of artificial intelligence to support medical professionals. Most current research targets a few specific fetal ultrasound sections for particular tasks. In reality, mid-pregnancy level II ultrasound assessments involve around ten different fetal sections. Therefore, there is a critical need for algorithms that can recognize these level II fetal sections. Furthermore, the standard ultrasound planes located in the same region exhibit a high degree of similarity. Traditional networks used for fetal slice recognition, such as ResNet, generate feature maps that are sparse matrices. As a result, employing global average pooling (GAP) layers can lead to significant information loss, making it challenging to effectively differentiate between similar views. In this study, we introduce a deep-learning network designed for second-trimester anatomy ultrasound standard plane recognition. Our approach incorporates a novel spectral pooling paradigm. We introduce singular pooling to enhance information extraction from feature maps and integrate this singular pooling layer into ResNet. This methodology is tested on our dataset, which includes ten distinct standard fetal ultrasound sections from second-trimester level II examinations. Our model achieves exceptional performance, with an accuracy of 92.07% and an F1-Score of 0.919. Singular pooling significantly improves on the original ResNet model. This method can effectively aid standard plane recognition in second-trimester anatomy ultrasounds. Furthermore, we validated the efficacy of our model by applying it to two additional datasets, thereby further demonstrating its efficiency and applicability. Our code is available at https://github.com/sysll/vectore/tree/master
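
The contrast between GAP and a pooling based on singular values, as raised in the abstract above, can be made concrete with a small sketch. The abstract does not define singular pooling, so this assumes one plausible reading: each channel is summarized by its largest singular value, estimated here by power iteration. The function names are hypothetical and this is not the authors' layer.

```python
import math


def top_singular_value(fmap, iters=200):
    """Estimate the largest singular value of a 2D feature map by power iteration."""
    rows, cols = len(fmap), len(fmap[0])
    # Fixed non-uniform start vector; power iteration would fail only if the
    # start were exactly orthogonal to the dominant right singular vector.
    v = [1.0 / (j + 1) for j in range(cols)]
    for _ in range(iters):
        u = [sum(fmap[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(fmap[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(fmap[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


def singular_pool(feature_maps):
    """One value per channel: the largest singular value of that channel's map."""
    return [top_singular_value(f) for f in feature_maps]


def global_average_pool(feature_maps):
    """One value per channel: the mean activation of that channel's map."""
    return [sum(sum(row) for row in f) / (len(f) * len(f[0])) for f in feature_maps]


if __name__ == "__main__":
    channel = [[1.0, -1.0], [-1.0, 1.0]]
    print(global_average_pool([channel]), singular_pool([channel]))
```

In the example channel the activations cancel under averaging while the singular value stays non-zero, which is one kind of information that GAP discards.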

Abstract:
Recent advancements in 3D Gaussian Splatting (3D-GS) have demonstrated the potential of using 3D Gaussian primitives for high-speed, high-fidelity, and cost-efficient novel view synthesis from continuously calibrated input views. However, conventional methods require high-frame-rate dense and high-quality sharp images, which are time-consuming and inefficient to capture, especially in dynamic environments. Event cameras, with their high temporal resolution and ability to capture asynchronous brightness changes, offer a promising alternative for more reliable scene reconstruction without motion blur. In this paper, we propose SweepEvGS, a novel hardware-integrated method that leverages event cameras for robust and accurate novel view synthesis across various imaging settings from a single sweep. SweepEvGS utilizes the initial static frame with dense event streams captured during a single camera sweep to effectively reconstruct detailed scene views. We also introduce different real-world hardware imaging systems for real-world data collection and evaluation for future research. We validate the robustness and efficiency of SweepEvGS through experiments in three different imaging settings: synthetic objects, real-world macro-level, and real-world micro-level view synthesis. Our results demonstrate that SweepEvGS surpasses existing methods in visual rendering quality, rendering speed, and computational efficiency, highlighting its potential for dynamic practical applications.

Abstract:
Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle’s distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. In challenging weather conditions, existing image restoration methods fall short by not accounting for the varying impact of adverse weather on different scene regions. We develop a novel unified imaging model combined with a weather-prior-based network that directly incorporates weather-specific physical imaging processes into the restoration process. This approach not only enhances visibility in both near and distant regions affected by drops but also outperforms current state-of-the-art methods by effectively mitigating artifacts such as fog. Our contributions include a comprehensive analysis of weather-related visual factors and the development of an innovative network architecture that leverages estimated occlusion and transmission to restore scene details. Experimental results on three synthetic benchmarks, including our Weather30K dataset, along with two all-weather datasets, and a real-world benchmark with challenging mixed weather conditions, show the superiority of our method against state-of-the-art methods.

Affiliations: School of Information Science and Engineering, Dalian Polytechnic University, Dalian, China; School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, WA, Australia; Department of Computer Engineering, Faculty of Engineering, Karamanoglu Mehmetbey University, Karaman, Türkiye; Department of Electrical Electronics Engineering, Karamanoglu Mehmetbey University, Karaman, Türkiye; Department of Artificial Intelligence and Data Engineering, Engineering Faculty, Ankara University, Ankara, Türkiye; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), and Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China

Abstract:
Discrete chaotic systems based on memristors exhibit excellent dynamical properties and are more straightforward to implement in hardware, making them highly suitable for generating cryptographic keystreams. However, most existing memristor-based chaotic systems rely on a single memristor. This paper introduces a novel discrete chaotic system employing dual memristors, named the 3D memristive cubic map with dual discrete memristors (3D-MCM). The 3D-MCM system demonstrates richer and more intricate dynamical behaviors compared to its single-memristor counterparts, as verified through bifurcation diagrams, Lyapunov exponent spectra, and complexity analyses. Notably, the system exhibits coexisting attractors, substantially enhancing its dynamical complexity. Hardware implementation of the 3D-MCM attractors confirms its feasibility for industrial applications. To illustrate the system’s potential in encryption tasks, this study integrates the quaternary-based permutation and dynamic emanating diffusion (QPDED-IE) scheme with the 3D-MCM for image encryption. Experimental results demonstrate that the QPDED-IE scheme based on the 3D-MCM exhibits strong diffusion and confusion properties, effectively resisting cryptanalytic attacks.

Abstract:
Video anomaly detection methods are mainly classified into two categories based on their primary feature types: appearance-based and action-based. Appearance-based methods rely on low-level visual features like color, texture, and shape, learning patterns specific to training scenes. While effective in familiar settings, they struggle with unknown or altered scenes due to poor generalization and limited understanding of action-scene relationships. In contrast, action-based methods focus on detecting action anomalies but often overlook contextual scene associations, leading to misjudgments (e.g., running on a street being deemed normal without considering scene context). To overcome these limitations, we propose a novel decoupling-based anomaly detection architecture (DecoAD). Its core lies in the decoupling and interweaving of scenes and actions, enabling explicit modeling of their complex relationships. By reconstructing these interactions using knowledge graphs, DecoAD achieves a deeper understanding of behaviors and contexts. This design ensures strong performance in both known and unknown scenarios, significantly enhancing generalization. To evaluate its effectiveness in dynamic scenes and its ability to handle scene-related anomalies, we introduce UFSR, the first video anomaly detection dataset featuring dynamic scenes and scene-related anomalies. DecoAD supports fully-supervised, weakly-supervised, and unsupervised settings, improving AUC on UBnormal by 1.1%, 3.1%, and 2.1%, respectively, and on UFSR by 1.2% and 8.2% in the weakly-supervised and unsupervised settings, respectively. The source code and datasets are available at: https://github.com/liuxy3366/DecoAD.

Abstract:
Object detection in remote sensing images (RSIs), including optical and SAR images, has emerged as a rapidly advancing field. However, the abundance of small objects in RSIs poses a significant challenge in designing a network structure with effective receptive fields to support accurate localization and classification. In this paper, we propose a position guided dynamic receptive field network (PG-DRFNet) for small object detection in both optical and SAR images. Specifically, PG-DRFNet overcomes the problem of small objects vanishing or being submerged in features by establishing a positional guidance relationship of small objects between different feature layers. Then, we design a combination head structure that utilizes additional supervised information extracted from small objects to make the model more effective and flexible. Moreover, a dynamic perception algorithm based on feature construction is developed to dynamically optimize the perception regions and feature hierarchies of the model, while seeking the optimal tradeoff between model accuracy and inference speed. Without bells and whistles, our model is robust to two modalities of remote sensing data, and our experiments are conducted on four benchmark RSI datasets: DOTA-v2.0, VEDAI, SSDD, and HRSID. The model achieves competitive performance, with mAP of 59.01%, 84.06%, 90.06%, and 80.59%, respectively. Code and models are released at https://github.com/BJUT-AIVBD/PG-DRFNet.

Abstract:
A mask is considered an important prior for fusion, since it can selectively enhance specific regions to generate ideal fused images. However, masks used in the existing methods exhibit limitations in the precise representation of targets, and more importantly, these masks are generated from a single modality, which restricts the effective integration of multi-modal information. To address this issue, we propose a competitive mask-guidance fusion method for infrared and visible images. A multi-modal semantic-sensitive mask selection network is proposed to generate complementary-mask maps, which organically integrate advantageous target regions of different modalities by competitively comparing the qualities of masks. In this network, a pseudo-Siamese architecture is designed to obtain respective target masks, and specifically, a spatial-alignment-based feature aggregation module is devised to produce high-quality pseudo-labels, which serve as references for the generation of the complementary-mask maps. Furthermore, we propose a bidirectional-collaboration region fusion strategy, which enhances the expression of advantageous target regions from each modality in the foreground while suppressing the contribution of corresponding regions from the other modality in the background. Comparisons with existing methods on public datasets show that our method significantly enhances the description of semantic-sensitive targets in fused images, including the saliency and the integrity of structural information. Code is available at https://github.com/xbsj-cool/MSCRFusion.

Abstract:
The radiance field, emerging as a novel 3D scene representation, has found widespread application across diverse fields. Standard radiance field construction approaches rely on the ground-truth poses of key-frames, while building the field without pose prior remains a formidable challenge. Recent advancements have made strides in mitigating this challenge, albeit to a limited extent, by jointly optimizing poses and the radiance field. However, in these schemes, the consistency between the radiance field and poses is achieved completely by training. Once the poses of key-frames undergo changes, long-term training is required to readjust the field to fit them. To address such a limitation, we propose a new solution for radiance field construction without pose prior, namely I-DACS (Incremental radiance field construction with Direction-Aware Color Sampling). Diverging from most of the existing global optimization solutions, we choose to incrementally solve the poses and construct a radiance field within a sliding-window framework. The poses are unequivocally retrieved from the radiance field, devoid of any constraints and accompanying noise from other observation models, so as to achieve the consistency of poses to the field. Besides, in the radiance field, the color information is much higher-frequency and more time-consuming to learn compared with the density. To accelerate training, we isolate the color information to a distinct color field, and construct the color field based on an innovative direction-aware color sampling strategy, by which the color field can be derived directly from images without training. The color field obtained in this way is always consistent with the poses, and intricate details of training images can be retained to the utmost extent. Extensive experimental results evidently showcase both the remarkable training speed and the outstanding performance in rendering quality and localization accuracy achieved by I-DACS. 
To make our results reproducible, the source code has been released at https://cslinzhang.github.io/I-DACS-MainPage/.

Abstract:
Person re-identification (ReID) aims to search for the target person across non-overlapping surveillance cameras. Video-based clothes-changing person re-identification (VCC-ReID) has become an essential branch of ReID due to the rich spatial and temporal information in videos and its broad range of application scenarios. Appearance and gait are discriminative features in video-based ReID, but appearance information is limited when clothes change, which makes VCC-ReID challenging. To address this challenge, we propose a Framework with explicit Learning based on Appearance and Gait (FLAG), which can explicitly extract these two types of information and be combined with most existing video-based ReID methods. FLAG includes a multi-modal and multi-granularities Architecture (MGA), which is a large model, and a Cross-Modal Knowledge Distillation Scheme (CMKDS), which yields a small model. They can be applied to devices with different computing resources. The MGA is designed to simultaneously take the visible light and silhouette modalities as input to explicitly learn the appearance and gait features, respectively. The silhouette modalities are composed of several levels of granularity to model global and local gait features and independently serve as input for the MGA. The Embedding-Based parallel fusion module is proposed to fuse the appearance and multi-granularity gait features efficiently. The CMKDS is presented to distill the MGA into a small single-modal model that only uses the visible light modality as input. The Embedding-Based direct and indirect distillation strategies are designed in the CMKDS. Experimental results demonstrate that FLAG combined with existing video-based ReID methods can significantly improve their performance. In addition, when FLAG is combined with the AP3D method, the MGA exceeds state-of-the-art accuracy by 4.2%.

Abstract:
Recent state-of-the-art (SOTA) techniques have demonstrated substantial efficacy in 3D Human Pose Estimation (HPE) from videos. Despite strong progress, no prior work has used soft supervision to handle Hard-to-Estimate (H2E) samples in 3D pose estimation, which is inherently a regression task. Existing H2E-example mining solutions, based on logit distillation, progressive target refinement, and label smoothing, are confined to classification problems. Traditional regression-based problems are deprived of the benefits of these regularization techniques. A soft supervision-based H2E example mining technique is crucial for regression problems. To the best of the author’s knowledge, no soft supervision-based regularization techniques exist for regression problems. This paper proposes a novel training strategy, referred to as Progressive Soft-Supervision Training for Regression Problems (PSTR). PSTR introduces the concept of progressive soft targets to the 3D pose estimator, a regression-based task. Highly inaccurate predictions, representing the H2E poses, receive greater focus, while the representations for Easy-to-Estimate (E2E) poses are preserved. PSTR forces the network to learn an alternate optimum inductive bias in the solution space via dynamically shifting gradients. The proposed PSTR improves on SOTA performance on the large-scale Human3.6m dataset by a large margin, with average MPJPE and P-MPJPE improvements of 1.2 mm and 1.09 mm for Protocols 1 and 2, respectively. On the MPI_INF_3DHP dataset, the improvements in PCK, AUC, and MPJPE are 1.53%, 1.75%, and 1.40 mm for 2D-3D pose uplifting and 1.60%, 1.95%, and 1.60 mm for RGB to 3D pose estimation, respectively. PSTR can be effortlessly deployed on any regression-based task.
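For readers unfamiliar with the metric quoted in the abstract above, MPJPE is conventionally the mean Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch in plain Python, assuming each pose is a list of (x, y, z) tuples in millimetres; the function name `mpjpe` and the toy values are illustrative and do not come from the paper. P-MPJPE applies the same measure after a rigid Procrustes alignment, which is omitted here.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Toy two-joint pose (hypothetical values, in mm).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0)]
error_mm = mpjpe(pred, gt)  # first joint is off by 5 mm, second is exact
```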

Abstract:
Generalized few-shot object detection aims to improve detection accuracy for novel classes while maintaining high performance on base classes. Traditional fine-tuning approaches often blur feature boundaries, leading to misclassification of novel samples as base classes or background. Additionally, differences in data distributions between base and novel classes can cause the model to “forget” base knowledge. This paper proposes a novel generalized few-shot detection method that leverages memory distillation of category prototypes. The approach includes two key components: a variational prototype refinement module (VPRM) and a memory bank of category prototypes (MBCP). The variational prototype refinement module introduces a class-agnostic feature fusion mechanism based on the original variational autoencoder. First, the mean and variance of the original distribution of base class are estimated in the base class training stage. The noise variables are converted into memory prototypes with strong generalization ability through reparameterization and stored. Second, the stored memory prototypes are fused with class-agnostic features of novel classes in the fine-tuning stage, which significantly alleviates the problem of base class bias when processing novel classes. In the base class training phase, the category prototype memory bank stores the base class memory prototypes extracted by the variational prototype refinement module and selects the best memory items by dynamically updating the category confidence and intersection-over-union threshold. This memory item can be used not only to constrain features of base classes to alleviate catastrophic forgetting of base classes but also to fuse with features of novel classes, adaptively extracting class-agnostic common information to strengthen the feature representation of the novel class. 
Experiments on PASCAL VOC and MS-COCO show superior average precision in both single-round and multi-round tests, outperforming existing state-of-the-art methods.
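The reparameterization step mentioned in the abstract above can be illustrated with the standard variational autoencoder formulation, z = mu + sigma * eps with eps drawn from a unit Gaussian. This is a sketch of that general trick only, assuming a diagonal Gaussian parameterized by a mean and a log-variance per dimension; the name `reparameterize` and the sample values are hypothetical, and the authors' VPRM may differ in its details.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps per dimension, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)          # seeded generator for repeatable sampling
mu = [0.5, -1.0]                # assumed per-dimension means
log_var = [0.0, -2.0]           # assumed per-dimension log-variances
z = reparameterize(mu, log_var, rng)
```

Because the randomness enters only through eps, the sample stays a differentiable function of the mean and variance, which is what makes the estimated distribution trainable by gradient descent.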

Abstract:
Ultra-high field magnetic resonance imaging (MRI), such as 7-Tesla (7T) MRI, provides significantly enhanced tissue contrast and anatomical details compared to 3T MRI. However, 7T MRI scanners are more costly and less accessible in clinical settings than 3T scanners. In this paper, we propose a wavelet-based frequency attention network (WFANet) and a semi-supervised method named dual domain consistency learning (DDCL), and combine them to form a WFANet-DDCL framework for 7T MRI synthesis. WFANet leverages the frequency sensitivity of the proposed wavelet-based frequency attention encoder (WFAE) along with the large receptive field of dilated convolution. WFAE is proposed as an independent module to capture multi-scale frequency attention via the proposed wavelet-based frequency attention (WFA) mechanism. WFAE can be integrated into any backbone network as a plug-and-play component and improve network performance. To tackle the challenge of limited paired data for network training, DDCL is proposed to take advantage of both paired and unpaired data. Frequency domain perturbation is proposed and combined with Gaussian noise to regularize the supervised learning process in dual domains, better avoiding overfitting. Extensive experimental results demonstrate that WFANet-DDCL can achieve comparable performance to state-of-the-art supervised methods even using 66% of all paired data.

Abstract:
Optical remote sensing images are frequently contaminated by thin clouds, which poses great challenges for subsequent applications. To address this issue, numerous methods guided by cloud features have been developed. However, the cloud features utilized in these methods are generally either unlearnable or lack cloud thickness data constraints, which may further mislead the cloud removal. In this paper, a THIcknEss Fused thin cloud removal network (ThiefCloud) with a self-supervised learnable cloud prior is proposed. Firstly, in order to provide a reliable cloud prior, a self-supervised cloud prior model (SCPM) is introduced. Secondly, an adaptive feature extraction (AFE) module efficiently extracts the cloud information of the original cloud image, and a physically guided feature fusion (PGFF) module, inspired by the atmospheric scattering model, accurately restores more realistic details. Finally, to enhance the generalizability of SCPM in real scenarios, a staged training strategy is adopted. SCPM is first trained independently on the simulated thickness maps and cloud images, so that it can then guide ThiefCloud. During the training of ThiefCloud, SCPM is initially frozen and later made tunable. The frozen SCPM provides an effective cloud prior to the non-converged ThiefCloud. The tunable SCPM makes the cloud prior learnable, better aligning with real-world cloud removal. Experimental results demonstrate that, compared with 11 other methods, ThiefCloud achieves competitive results on three public datasets, namely T-CLOUD, RICE, and SateHaze1k. The implementation code and data will be available soon at: https://github.com/lixinghua5540/ThiefCloud
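As background for the atmospheric scattering model mentioned in the abstract above, a commonly used form is I = J * t + A * (1 - t), where I is the observed intensity, J the clear scene, t the transmission, and A the atmospheric light. The sketch below applies and inverts that relation for a single pixel value; it assumes this textbook form, and the names `scatter`, `recover`, and `t_min` are illustrative rather than taken from ThiefCloud.

```python
def scatter(scene, transmission, airlight):
    """Forward model: attenuated scene radiance plus scattered atmospheric light."""
    return scene * transmission + airlight * (1.0 - transmission)

def recover(observed, transmission, airlight, t_min=0.1):
    """Invert the forward model; t_min guards against division by tiny values."""
    t = max(transmission, t_min)
    return (observed - airlight * (1.0 - t)) / t

# Hypothetical single-pixel example with intensities in [0, 1].
observed = scatter(scene=0.4, transmission=0.5, airlight=1.0)
restored = recover(observed, transmission=0.5, airlight=1.0)
```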

Abstract:
To address the issues of privacy leakage and copyright infringement in screen-camera scenarios, we propose a robust image watermarking method, which incorporates Kolmogorov–Arnold network (KAN) blocks and a simulation-enhanced noise pool to resist screen-camera noise attacks. Specifically, we first replace the traditional convolutional blocks that process high-dimensional features in the U-Net-based encoder with KAN blocks. This operation enhances the ability of the encoder to model nonlinear relationships between complex features, while preserving the global structure of the original image and minimizing the damage to local details caused by watermark embedding, thereby improving the visual quality of the watermarked image. Additionally, to enhance the robustness of the proposed method against complex screen-camera noise attacks, a simulation-enhanced noise pool containing mathematical models and a deep noise simulation network, called NSim-Net, is designed. In particular, in NSim-Net, adversarial training between the simulator based on the improved U-Net and the discriminator based on PatchGAN effectively improves the ability to simulate complex noise. Experimental results demonstrate that, compared to typical screen-camera resilient watermarking methods, the watermarked image generated by the proposed method achieves a maximum peak signal-to-noise ratio (PSNR) improvement of 4.78 dB. Furthermore, based on our simulation-enhanced noise pool, the watermark extraction accuracy of the proposed method exceeds 98% under various screen-camera noise attacks.
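The PSNR figure quoted in the abstract above follows the usual definition, PSNR = 10 * log10(MAX^2 / MSE). A minimal sketch, assuming 8-bit grayscale images stored as nested lists; the function name `psnr` and the toy images are illustrative and not drawn from the paper's evaluation.

```python
import math

def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_a = [p for row in original for p in row]
    flat_b = [p for row in distorted for p in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 cover image and a watermarked copy with two pixels changed by 1.
cover = [[100, 100], [100, 100]]
marked = [[101, 99], [100, 100]]
quality_db = psnr(cover, marked)
```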

Abstract:
A fundamental challenge in remote sensing-based forest fire detection lies in accurately discerning fire characteristics on various scales against the backdrop of intricate and heterogeneous forest landscapes. In response to this challenge, we propose a dual-path network (DPMNet) with multidimensional feature interaction for real-time remote sensing forest fire detection. Initially, a dual-path backbone network is designed, integrating coarse-grained and fine-grained parallel pathways that work in tandem to capture both global visual features and nuanced local texture details. Subsequently, we develop the Multidimensional Interactive Feature Pyramid Network (MiFPN), a novel structure that merges information streams from different levels through a three-branch structure and enables deep fusion and dynamic interaction of features across multiple scales. Thereafter, the Context-Enriched Adaptive Fusion Module (CEAFM) is proposed to blend the macroscopic visual elements gathered by the coarse-grained pathway, employing a multi-faceted pathway strategy to improve the model’s overall comprehension and precision in forest fire detection. Finally, the Enhanced Contextual Pooling Bottleneck (ECPB) is put forward, which augments the model’s spatial perception and contextual awareness by incorporating dilated convolution and global pooling. Extensive experiments are conducted on the remote sensing forest fire dataset to confirm the efficacy of DPMNet. The experimental results demonstrate that DPMNet achieves satisfactory real-time speed and accuracy, and provides an effective solution for real-time UAV-based remote sensing forest fire detection.

Abstract:
With the popularization of digital information, reversible data hiding in ciphertext has become a critical research focus in privacy protection in cloud storage. A reversible data hiding method for encrypted images is proposed: Reversible Data Hiding in Encrypted Images with Adaptive Multi-directional MED and Huffman Code based on Interval-Wise Dynamic Prediction Axes (RDHEI-AHIDA). Firstly, the original image is predicted by the gradient Adaptive Multi-Directional Median Edge Detector (AM-MED) to obtain the critical gradient and the position of the Interval-wise Dynamic Prediction Axes (IDP-Axes). Then, information bits are allocated at intervals on the IDP-Axes. Combining the determined position of the IDP-Axes and the critical gradient, the prediction error values of the original image are calculated and recorded. After the image is encrypted, according to the distribution of prediction error values, an adaptive Huffman code rule is established, and pixel marking, classification and auxiliary information embedding are carried out. Finally, the secret data is embedded by the bit replacement method. Compared with the state-of-the-art RDHEI methods, experimental results show that RDHEI-AHIDA not only provides a higher pure payload while ensuring security but also exhibits certain robustness.
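To make the prediction-error step in the abstract above concrete, the sketch below uses the classical Median Edge Detector (MED) predictor, which predicts a pixel from its left, upper, and upper-left neighbours. It illustrates only that baseline predictor, not the adaptive multi-directional AM-MED or the IDP-Axes of the paper; the function names and the toy image are hypothetical.

```python
def med_predict(left, above, upper_left):
    """Classical MED prediction from three causal neighbours."""
    if upper_left >= max(left, above):
        return min(left, above)
    if upper_left <= min(left, above):
        return max(left, above)
    return left + above - upper_left

def prediction_errors(image):
    """Prediction errors for every pixel outside the first row and column."""
    errors = []
    for i in range(1, len(image)):
        row = []
        for j in range(1, len(image[i])):
            pred = med_predict(image[i][j - 1], image[i - 1][j], image[i - 1][j - 1])
            row.append(image[i][j] - pred)
        errors.append(row)
    return errors

# Hypothetical 3x3 grayscale patch.
image = [
    [10, 12, 14],
    [11, 13, 15],
    [12, 14, 17],
]
errors = prediction_errors(image)
```

In smooth regions the errors cluster near zero, which is the property that entropy codes such as Huffman coding exploit when labelling pixels.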

Affiliations: Tianjin Key Laboratory for Advanced Mechatronic System Design and Intelligent Control, School of Mechanical Engineering, and the National Demonstration Center for Experimental Mechanical and Electrical Engineering Education, Tianjin University of Technology, Tianjin, China; Systems Engineering Institute, Academy of Military Sciences, People’s Liberation Army, Tianjin, China; Tianjin Key Laboratory for Advanced Signal Processing, Civil Aviation University of China, Tianjin, China

Abstract:
Remote Photoplethysmography (rPPG) is a non-contact method for measuring heart rate (HR) through facial video, breaking the constraints of contact measurements and offering broad application prospects. However, in real monitoring scenarios, the distance of subjects and facial illumination often vary. This leads to degradation of rPPG signals due to changes in facial resolution and light intensity. This paper presents the High and Low Frequency Feature Adversarial Learning Network (HLFF-AL), which utilizes feature space interpolation and composite feature capture to perform multi-band global and local signal extraction. By adopting an adversarial learning strategy, it enhances the capability to capture semantic information of rPPG under varying resolutions and lighting, bridging the differences in rPPG signal generation due to resolution changes. While maintaining HR measurement accuracy, the network becomes more robust, enabling reliable HR measurement via rPPG. Under high-resolution conditions, the Mean Absolute Error (MAE) on three public datasets, two with unchanging lighting (COHFACE, UBFC-rPPG) and one with varying lighting (BUAA-MIHR), reached 1.10 bpm, 1.64 bpm, and 4.41 bpm, respectively, with a mere 0.29 bpm difference in MAE between high and low resolutions on the BUAA-MIHR dataset. This demonstrates that HLFF-AL can predict more robust rPPG signals in various resolutions and lighting scenarios, achieving competitive results compared to state-of-the-art methods.