TPAMI2026

Abstract:
Contrastive learning aims to learn an embedding space with sample discrimination, where similar samples are attracted to each other while dissimilar samples are repelled. However, sampling bias is likely to occur and degrade classification performance when a contrastive model is trained with leakage caused by similar samples from different classes or dissimilar samples from the same class. Out-of-distribution (OOD) detection provides a meaningful scheme to detect and mask those false negative samples for debiasing in an outlier-aware contrastive loss for high-fidelity contrastive learning. Sample debiasing can reduce the upper bound of the contrastive loss. Previous OOD detectors, however, were trained on an auxiliary collection of OOD samples, and in the real world such prior knowledge of OOD samples is commonly unavailable. This study presents new outlier-aware detection and contrast models through generation and augmentation of samples near the boundary between in-distribution (ID) and OOD. These synthesized samples are located right outside the ID region, and their Gaussian embeddings sufficiently reflect OOD behaviors. An OOD detector is learned from ID samples and synthesized OOD samples, with a learning objective targeting contrastive OOD detection and a debiased contrast model. Experiments are conducted to illustrate the merit of the proposed outlier-aware contrastive learning.
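
The masking idea described above can be illustrated with a minimal sketch. It assumes a standard InfoNCE-style loss for a single anchor and a hypothetical `keep_mask` standing in for the output of an OOD detector; the function name, similarity values, and temperature are illustrative assumptions, not the authors' formulation.

```python
import math

def info_nce(pos_sim, neg_sims, keep_mask=None, temperature=0.5):
    """InfoNCE-style loss for one anchor.

    Negatives whose keep_mask entry is False are dropped from the
    denominator (hypothetical stand-in for detector-based masking).
    """
    if keep_mask is None:
        keep_mask = [True] * len(neg_sims)
    pos = math.exp(pos_sim / temperature)
    neg = sum(
        math.exp(s / temperature)
        for s, keep in zip(neg_sims, keep_mask)
        if keep
    )
    return -math.log(pos / (pos + neg))

# Illustrative cosine similarities between the anchor and other samples.
pos_sim = 0.8
neg_sims = [0.7, 0.1, -0.3]

# Assumption: the first "negative" is flagged as a false negative
# (same class as the anchor), so it is masked out.
keep_mask = [False, True, True]

plain_loss = info_nce(pos_sim, neg_sims)
debiased_loss = info_nce(pos_sim, neg_sims, keep_mask)
```

Because a masked negative only removes a positive term from the denominator, the masked loss in this sketch can never exceed the unmasked one, which is in line with the abstract's remark that debiasing lowers the upper bound of the contrastive loss.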

Abstract:
Subspace clustering is one of the most popular clustering methods due to its effectiveness. Although subspace clustering methods have been demonstrated to achieve promising performance, they still lack interpretability, especially when handling high-dimensional complicated data. To bridge this gap, this paper focuses on the interpretability of subspace clustering and proposes a novel interpretable subspace clustering method. Our goal is to answer two key questions about the interpretability in subspace clustering: 1) when handling an individual sample, which features should work for this sample? 2) Which cluster or subspace will the features that work put this sample into? To answer these two questions, we design two new interpretability regularized terms and plug them into the subspace clustering. In this way, we show that interpretability can be used to improve the clustering performance in turn. Extensive experiments on benchmark data sets demonstrate the effectiveness of our method in terms of clustering performance and interpretability.

Abstract:
Hypergraph neural networks (HGNNs) effectively model complex high-order relationships in domains like protein interactions and social networks by connecting multiple vertices through hyperedges, enhancing modeling capabilities, and reducing information loss. Developing foundation models for hypergraphs is challenging due to their distinct data, which includes both vertex features and intricate structural information. We present Hyper-FM, a Hypergraph Foundation Model for multi-domain knowledge extraction, featuring Hierarchical High-Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation and Hierarchical Multi-Hypergraph Guided Structural Knowledge Extraction for structural information. Additionally, we curate 11 text-attributed hypergraph datasets to advance research between HGNNs and LLMs. Experiments on these datasets show that Hyper-FM outperforms baseline methods by approximately 13.4%, validating our approach. Furthermore, we propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance, unlike merely augmenting vertex and hyperedge counts. This underscores the critical role of domain diversity in scaling hypergraph models.

Abstract:
Pedestrian trajectory prediction is crucial for ensuring safe decision-making in intelligent robotic systems. While this task demands real-time performance, previous works have primarily focused on improving prediction accuracy, often neglecting efficiency. Dense predictions with time-consuming post-clustering steps and global interactions with quadratic computational complexity result in a trade-off between accuracy and speed. In this paper, we propose a novel Sparse Trajectory Prediction (STP) model that aims to achieve both high accuracy and real-time speed by following an efficient principle: leveraging sparse structures to achieve global effects. STP instantiates this principle within a transformer-style encoder-decoder framework. In the encoder, STP introduces irregular interaction, which builds sparse interactions with dynamic interactive positions, reducing computational complexity to linearithmic/linear while maintaining global interaction. In the decoder, STP applies an early-sparsity strategy to generate sparse motion modes that represent global motion behaviors. These modes are shared across all predictions, eliminating redundant computations. By harnessing the expressive power of transformers, STP maps these sparse motion modes into multimodal future trajectories, significantly improving prediction speed while ensuring accuracy. Experimental results on four commonly used datasets demonstrate that STP maximizes both accuracy and prediction speed, achieving state-of-the-art performance and significantly improving prediction speed by about 100× to 150× to satisfy the real-time demand.

Abstract:
Logo detection is crucial for trademark compliance and media monitoring, enabling companies to monitor online trademark usage and evaluate brand visibility on social media and advertisements. The use of large datasets significantly improves accuracy and generalization, emphasizing the need for high-quality datasets to optimize performance and enhance reasoning abilities in visual detection models. This drove us to create Logo4500, an unparalleled dataset featuring 4,500 logo categories and over 293,000 meticulously labeled images. To ensure the dataset’s quality, we meticulously designed the construction and annotation process, with detailed information provided in our paper. Compared to existing logo datasets, Logo4500 offers greater diversity and class imbalance, making it more reflective of real-world distribution. Leveraging this high-quality dataset, we introduce a benchmark called Frequency-Aware Learnable Dual Reweighting Network (FALDR-Net), which enhances the representation of ambiguous features and addresses class imbalance for large-scale logo detection. We conducted extensive experiments, evaluating various recent methods on this new dataset and several existing publicly available logo datasets, demonstrating its effectiveness. Additionally, we verified Logo4500’s generalization ability in several tasks. We anticipate that Logo4500 and the benchmark will inspire further exploration in the logo-related research community, facilitating the advancement of visual foundation models.

Abstract:
Continual learning (CL) tackles a fundamental challenge in machine learning, aiming to continuously learn novel data from non-stationary data streams while mitigating forgetting of previously learned data. Although existing CL algorithms have introduced various practical techniques for combating forgetting, little attention has been devoted to studying how data schedules – which dictate how the sample distribution of a data stream evolves over time – affect the CL problem. Empirically, most CL methods are susceptible to schedule changes: they exhibit markedly lower accuracy when dealing with more “difficult” schedules over the same underlying training data. In practical scenarios, data schedules are often unknown and a key challenge is thus to design CL methods that are robust to diverse schedules to ensure model reliability. In this work, we introduce the novel concept of schedule robustness for CL and propose Schedule-Robust Continual Learning (SCROLL), a strong baseline satisfying this desirable property. SCROLL trains a linear classifier on a suitably pre-trained representation, followed by model adaptation using replay data only. We connect SCROLL to a meta-learning formulation of CL with provable guarantees on schedule robustness. Empirically, the proposed method significantly outperforms existing CL methods and we provide extensive ablations to highlight its properties.

Abstract:
Recent Transformer-based language representation techniques have commonly adopted a straightforward approach to modeling textual context as a linear sequence of successive tokens. However, this sequential modeling strategy falls short in actively exploring intermediate structures present in natural languages and does not account for the rich interactive relationships between sentences. To overcome these limitations, we propose a discourse-aware framework that bridges the gap between sequential contextualization and the interactive nature of conversational reading comprehension. Concretely, we first divide the context into elementary discourse units (EDUs), ensuring that each unit contains precisely one condition. Then, we systematically explore three instantiations for modeling discourse features: sequential EDU encoding, discourse-aware masking, and discourse graph network. These techniques allow us to capture the nuanced interactions within the discourse. To assess the efficacy of our methodologies, we perform experiments on three conversational reading comprehension tasks: multi-turn response selection, conversational question answering, and conversational machine reading. Experimental results demonstrate the superiority of our proposed approach. Moreover, analysis reveals that the discourse-aware approach enables the model to effectively capture intricate relationships within the context and fosters reasoning interpretability. Additionally, our method exhibits efficacy across various backbone PLMs and diverse domains.

Abstract:
Convolutional neural networks are built from a large number of operations of different types and are highly computationally intensive. Among these operations, multiplication has higher computational complexity and usually requires more energy and longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper we introduce a generic and efficient lookup operation that can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. We demonstrate that our lookup networks benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance compared with vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
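
As a rough illustration of replacing multiplications with table lookups, the sketch below quantizes weights and activations to a few levels and reads each product from a precomputed table. It is a simplification under stated assumptions: the table here holds fixed products, whereas the abstract describes tables that are constructed differentiably and trained; the level grids and all names (`W_LEVELS`, `A_LEVELS`, `lookup_dot`) are hypothetical.

```python
def make_levels(lo, hi, n):
    """Return n evenly spaced quantization levels in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def quantize(x, levels):
    """Index of the level closest to x (comparisons only, no multiply)."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

# Assumed level grids for weights and activations.
W_LEVELS = make_levels(-1.0, 1.0, 5)   # -1.0, -0.5, 0.0, 0.5, 1.0
A_LEVELS = make_levels(0.0, 1.0, 5)    # 0.0, 0.25, 0.5, 0.75, 1.0

# The table is filled once, offline; inference only indexes into it.
TABLE = [[w * a for a in A_LEVELS] for w in W_LEVELS]

def lookup_dot(weights, activations):
    """Dot product computed by table lookups and additions."""
    total = 0.0
    for w, a in zip(weights, activations):
        total += TABLE[quantize(w, W_LEVELS)][quantize(a, A_LEVELS)]
    return total

weights = [0.5, -1.0, 0.0]
activations = [1.0, 0.25, 0.75]
result = lookup_dot(weights, activations)
exact = sum(w * a for w, a in zip(weights, activations))
```

In a deployed setting the integer indices would typically be stored directly rather than recomputed from floating-point values; the `quantize` call is kept here only to make the sketch self-contained.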

Abstract:
Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long-range dependencies often requires deeper layers, which not only increase computational costs but also lead to over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message-passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir-based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi-hop neighborhood information. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed-random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show RGC-Net achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and mitigated over-smoothing.
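
The combination of fixed random reservoir weights, a leaky integrator, and neighborhood aggregation mentioned in contribution (i) can be sketched as follows. This is a generic echo-state-style update applied over a row-normalized adjacency matrix, written under our own assumptions; the update rule, weight scales, and names such as `reservoir_graph_conv` are illustrative and are not taken from RGC-Net itself.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def random_matrix(rows, cols, scale, rng):
    """Fixed random weights drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

def normalized_adjacency(adj):
    """Add self-loops, then normalize each row to sum to 1."""
    n = len(adj)
    loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[v / sum(row) for v in row] for row in loops]

def reservoir_graph_conv(adj, features, hidden=8, steps=4, leak=0.3, seed=0):
    """Untrained reservoir update with a leaky integrator on a graph."""
    rng = random.Random(seed)
    w_in = random_matrix(len(features[0]), hidden, 0.5, rng)   # never trained
    w_res = random_matrix(hidden, hidden, 0.1, rng)            # never trained
    a_hat = normalized_adjacency(adj)
    drive = matmul(features, w_in)            # input contribution per node
    state = [[0.0] * hidden for _ in features]
    for _ in range(steps):
        # Aggregate neighbor states, then mix them through the reservoir.
        neighbor = matmul(matmul(a_hat, state), w_res)
        # Leaky integration: keep (1 - leak) of the old state.
        state = [
            [(1.0 - leak) * s + leak * math.tanh(d + m)
             for s, d, m in zip(s_row, d_row, m_row)]
            for s_row, d_row, m_row in zip(state, drive, neighbor)
        ]
    return state

# A 3-node path graph with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = reservoir_graph_conv(adj, features)
```

Each iteration propagates information one hop further, so `steps` plays the role of the multi-hop depth, while the `leak` term retains part of the previous state at every step.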

Affiliations: School of Computer Science, Beijing Institute of Technology, Beijing, China; Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Macao Polytechnic University, Macao, China; Department of Computer Science, Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, China; Guangdong Provincial Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; School of Medical Technology, Beijing Institute of Technology, Beijing, China

Abstract:
Robust Reversible Watermarking (RRW) enables perfect recovery of cover images and watermarks in lossless channels while ensuring robust watermark extraction under lossy channels. However, existing RRW methods, mostly non-deep learning-based, suffer from complex designs, high computational costs, and poor robustness limiting their practical applications. To address these issues, this paper proposes Deep Robust Reversible Watermarking (DRRW), a deep learning-based RRW scheme. DRRW introduces an Integer Invertible Watermark Network (iIWN) to achieve an invertible mapping between integer data distributions, fundamentally addressing the limitations of conventional RRW approaches. Unlike traditional RRW methods requiring task-specific designs for different distortions, DRRW adopts an encoder-noise layer-decoder framework, enabling adaptive robustness against various distortions through end-to-end training. During inference, the cover image and watermark are mapped into an overflowed stego image and latent variables. Arithmetic coding efficiently compresses these into a compact bitstream, which is embedded via reversible data hiding to ensure lossless recovery of both the image and watermark. To reduce pixel overflow, we introduce an overflow penalty loss, significantly shortening the auxiliary bitstream while improving both robustness and stego image quality. Additionally, we propose an adaptive weight adjustment strategy that eliminates the need to manually preset the watermark loss weight, ensuring improved training stability and performance. Experiments on multiple datasets demonstrate that the proposed DRRW addresses key challenges in current RRW methods and significantly advances the practical deployment of RRW.

Abstract:
Hypergraph Neural Networks (HGNNs) enhance graph-based modeling by representing complex relationships, with applications in brain network analysis, recommendation systems, and computer vision. However, conventional HGNNs often struggle with effective knowledge extraction and discriminative feature representation, leading to performance limitations. This paper presents Knowledge-Embedded Hypergraph Neural Networks (Knowledge HGNN), a framework that addresses these challenges with two complementary encoders and a multi-dimensional fusion strategy. The High-Order Incidence Encoder (HOI-Encoder) explicitly embeds structural knowledge by capturing permutation-invariant high-order incidence patterns that are typically overlooked by standard HGNNs. In contrast, the Task-Driven Rule Encoder (TDR-Encoder) focuses on feature-level knowledge, extracting task-related rules from vertex attributes through gradient boosted decision tree pre-training and encoding both rule content and positional importance. A Multi-Dimensional Knowledge Fusion module then integrates structural and rule-based embeddings, bridging semantic and dimensional gaps to form enriched vertex representations. The framework includes two implementations: Rule-Driven HGNN, which emphasizes rule-based knowledge, and Dual-Driven HGNN, which jointly leverages structural and rule-based knowledge for comprehensive feature extraction. Extensive experiments on ten datasets, together with ablation studies, demonstrate that Knowledge HGNN significantly improves performance, achieving a 7.3% gain on the Cora dataset and an average improvement of 2.5% across all datasets. These results highlight the effectiveness of explicitly differentiating and fusing structural and rule-based knowledge, setting a new standard for hypergraph applications in complex, data-driven scenarios.

Abstract:
Security and privacy concerns in real-world applications have led to the development of adversarially robust federated models. Previous works mainly target overcoming the adaptability constraints regarding communication and computation costs. However, the straightforward combination of adversarial training and federated learning might lead to undesired robust accuracy degradation emerging at later training stages. We reveal that the cause of this phenomenon is that the generated adversarial data could exacerbate the data heterogeneity among local clients, making the wrapped federated learning perform poorly. To deal with this problem, we introduce an α-slack mechanism to relax the original learning objective of federated adversarial training, and propose a novel framework called Slack Federated Adversarial Training (SFAT) to combat the intensified heterogeneity. By assigning the client-wise slack during aggregation, SFAT realizes a weighted aggregation that alleviates the optimization bias induced by the local adversarial generation. We further extend to a more general setting that permits clients trained by either standard or adversarial training in a unified framework, and propose SFAT with a hierarchical aggregation schema for this scenario. Theoretically, we analyze the convergence of our method to properly relax the learning objective. Experimentally, we verify the rationality and effectiveness of our methods on various benchmarked and real-world datasets with different adversarial training and federated optimization methods.

Abstract:
Location determination finds wide applications in daily life. Unlike existing efforts devoted to localizing tourist photos captured by perspective cameras, in this article we focus on devising person positioning solutions using fisheye cameras mounted overhead. Such solutions offer a large field of view (FOV), low cost, resistance to occlusion, and an unobtrusive work mode (persons need not carry cameras). However, related studies are quite scarce due to the paucity of data. To stimulate research in this exciting area, we present LOAF, the first large-scale overhead fisheye dataset for person detection and localization. LOAF is built with many essential features, e.g., i) the data cover abundant diversity in scenes, human pose, density, and location; ii) it contains currently the largest number of annotated pedestrians, i.e., 457K bounding boxes with ground-truth orientation and location information; iii) the body boxes are labeled as radius-aligned so as to fully address the positioning challenge. To approach localization, we build a fisheye person detection network that exploits the geometry of fisheye images from two perspectives: rotation equivariance and distortion awareness. Corresponding training strategies are devised to deliver high-accuracy, radius-aligned human box and angle predictions in an end-to-end manner. Then, the actual locations of the detected persons are calculated by a numerical solution based on the fisheye model and camera altitude data. Extensive experiments on LOAF validate the superiority of our fisheye detector w.r.t. previous methods and demonstrate that our whole fisheye positioning solution is able to locate all persons in the FOV with an accuracy of 0.5 m within 0.1 s.
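
To make the last step concrete, the sketch below converts an image point into a ground position from a fisheye model and the camera altitude. It assumes the simple equidistant projection (r = f·θ), a camera looking straight down, and a point lying on the ground plane; the abstract does not state which fisheye model or solver is used, so the model choice and every name and number here are assumptions for illustration only.

```python
import math

def locate_on_ground(u, v, cx, cy, focal_px, camera_height_m):
    """Map an image point to ground-plane coordinates (metres).

    Assumes an equidistant fisheye model r = f * theta and an overhead
    camera whose optical axis is perpendicular to the ground.
    """
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)                 # radial distance in pixels
    theta = r / focal_px                   # angle from the optical axis
    if theta >= math.pi / 2:
        raise ValueError("point is at or beyond the horizon")
    ground_dist = camera_height_m * math.tan(theta)
    azimuth = math.atan2(dy, dx)           # direction in the image plane
    return (ground_dist * math.cos(azimuth),
            ground_dist * math.sin(azimuth))

# Hypothetical camera: principal point (500, 500), f = 400 px, 3 m high.
center = locate_on_ground(500.0, 500.0, 500.0, 500.0, 400.0, 3.0)

# A point 45 degrees off-axis along the image x direction.
u_45 = 500.0 + 400.0 * math.pi / 4
off_axis = locate_on_ground(u_45, 500.0, 500.0, 500.0, 400.0, 3.0)
```

Under these assumptions a point 45° off-axis lands at a ground distance equal to the camera height, which gives a quick sanity check on calibration values.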

Abstract:
Tabular data play an important role in diverse real-world fields, such as healthcare, engineering, and finance. The recent success of deep learning has fostered many tabular learning methods based on deep networks (e.g., Transformer, ResNet). Generally, existing deep tabular machine learning methods follow two paradigms, i.e., in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representations, which nonetheless requires training multiple tasks simultaneously and makes learning more difficult, while pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning, which however needs much extra training effort with prior knowledge. In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance any trained deep tabular model’s representations without altering its parameters in a model-agnostic manner. Specifically, targeting the representation shift and representation redundancy that hinder prediction, we propose two tasks, i.e., (i) Tabular Representation Re-estimation, which trains a shift estimator to calculate the inherent shift of tabular representations and subsequently mitigate it, thereby re-estimating the representations, and (ii) Tabular Space Mapping, which transforms the re-estimated representations into a light-embedding vector space via a coordinate estimator while preserving crucial predictive information to minimize redundancy. The two tasks jointly enhance the representations of deep tabular models without touching the original models, thus enjoying high efficiency. Finally, we conduct extensive experiments on state-of-the-art deep tabular machine learning models coupled with TRC on various tabular benchmarks, which show consistent superiority.

Abstract:
We propose a novel factor model in the graph frequency domain for multivariate data residing on the vertices of a graph, referred to as a multivariate graph signal. By utilizing graph filters, our model extends the frequency-domain approach of the dynamic factor model from time series to graphs, enabling a graph-aware and multiscale interpretation of factors across graph frequencies. This latent modeling approach reduces the dimensionality of graph signals, thereby improving the understanding of their structure. It also allows the use of the extracted factors for subsequent analyses, such as clustering. We describe the estimation of factors and their loadings and investigate the consistency of the factor estimator. In addition, we propose two consistent estimators for determining the number of factors. The finite sample performance of the proposed method is demonstrated through simulation studies across various graph structures. We also compare it with classical factor analysis and examine how the choice of graph structure affects the results. The findings show that our model achieves lower reconstruction errors and successfully incorporates the graph structure. Furthermore, we illustrate the effectiveness of the proposed method by applying it to G20 economic data, water quality data from the Geum River, and passenger data from the Seoul Metropolitan subway.

Abstract:
Matrix factorization is a fundamental characterization model in machine learning and is usually solved using mathematical decomposition reconstruction loss. However, matrix factorization is a data-driven model whose results depend on data quality, making it susceptible to noise. Inspired by physics, the law of conservation of energy is used to introduce physical laws into matrix factorization, which is called Physics-informed Matrix Factorization operator (PiMF). The PiMF operator uses the heat conduction equation to construct the energy objective function for matrix factorization, thereby retaining the mathematical model’s decomposition meaning and satisfying the interpretability of physics. The PiMF follows the physical laws, thereby suppressing irregular or sudden noise signals that violate these physical principles. The solutions of the PiMF operator include more comprehensive knowledge of mathematics and physics, which improves the ability to generalize complex data, especially for noisy data. We demonstrate the consistency of the energy objective function and the mathematical model, which verifies the feasibility of matrix factorization using physical energy laws. In addition, the physical interpretability of the PiMF operator is proved from the perspective of energy decline. This study proposes two practical algorithms for PiMF in classification and clustering tasks, enhancing the practicability of matrix factorization by incorporating task-specific prior information constraints. The experimental results of PiMF for classification and clustering demonstrate the advantages of the proposed operator. The importance of physics-informed matrix factorization is verified, especially for noisy data.

Abstract:
With the increasing concerns about privacy and data regulations, federated learning (FL) has been emerging as a solution to train machine learning models collaboratively with non-exchangeable data from multiple clients. As a result of data locality, data is usually not identically or independently (non-IID) distributed across clients, and the non-IID property has long been the key challenge in FL. Furthermore, in real-world cross-silo scenarios, it is ubiquitous that clients are organizations owning private data from multiple domains internally, which exacerbates the non-IID issue. For example, in healthcare applications, each client (hospital) gathers data from patients with heterogeneous demographics. While previous works have made efforts to address the non-IID challenge across clients by assuming various relations among client-level data distributions and enabling personalized models at the client level, they ignore the internal data heterogeneity within each client or require explicit data domain indicators, which are hardly accessible in real-world data. Here, we propose Sample-Level Prototypical Federated Learning (SL-PFL) to bridge the gap. SL-PFL incorporates prototypical learning under the FL framework and provides a fine-grained personalized model for each data sample instead of learning one uniform model for all samples of each client. Meanwhile, it can be trained using data without ground-truth domain indicators. Experimental results demonstrate that our proposed method with sample-level personalized models outperforms existing FL methods with a global model or client-level personalized models on various real-world regression and classification tasks from weather, computer vision, and healthcare applications.

Abstract:
Stochastic partition processes divide a multi-dimensional space into a number of regions, such that the data within each region exhibit some form of homogeneity. Due to the nature of their partition strategies, partition processes can often create many unnecessary divisions in sparse regions when trying to describe data in dense regions. To avoid this problem we introduce a parsimonious partition model – the Rectangular Bounding Process (RBP) – to efficiently partition multi-dimensional spaces, by employing a bounding strategy to enclose data points within rectangular bounding boxes. The RBP is self-consistent and as such can be directly extended from a finite hypercube to an infinite (unbounded) space. We extend the RBP to establish a data-dependent RBP (data-RBP) to generate bounding boxes only over existing data points in a sequential manner, which can effectively reduce model complexity and enable online learning. To achieve this, we design an alternative way to generate bounding boxes and prove the distributional equivalence between the data-RBP and the RBP when empty boxes are removed. We demonstrate application of the RBP and the data-RBP in three scenarios: regression trees, relational modelling, and random feature construction for online learning. Extensive experimental results validate the performance of the RBP and the data-RBP for both accuracy and efficiency.

Abstract:
Federated multi-view clustering is an emerging machine learning paradigm that groups data with each view distributed on an isolated client while preserving privacy. Although recent studies have proposed a few feasible solutions, they are severely limited by two drawbacks. First, the clients are required to share their data representations at each iteration of model training, leading to heavy communication overhead. Second, existing studies handle large-scale data by employing matrix factorization and neural network encoding techniques, failing to sufficiently utilize the similarity information in the data. To address these issues, we propose a communication-efficient federated multi-view clustering framework that approximates the data representation with a pseudo-label matrix and a centroid matrix, where the latter two are shared in model training. Meanwhile, the framework is instantiated by incorporating a linear kernel function to account for pairwise data similarities. Note that the corresponding linear kernels do not need to be computed explicitly, so the resultant method can be optimized with complexity linear in the number of samples. The proposed method is evaluated on benchmark datasets. It not only achieves encouraging results (26.84% accuracy improvement on average, 2.9× to 2153× computation speedup, and 98.4% communication overhead reduction at most) compared with existing federated multi-view clustering methods, but also outperforms centralized multi-view clustering approaches in performance and computation efficiency.

Abstract:
Diffusion models have gained significant attention in the field of generative modeling due to their capability to produce high-quality samples. However, recent studies show that applying a uniform treatment to all distributions during the training of diffusion models is sub-optimal. In this paper, we present a comprehensive theoretical analysis of the forward process in diffusion models. Our findings indicate that distribution variations are not uniform throughout the diffusion process, with the sharpest changes occurring during the initial stages. Moreover, we observe that the initial distribution converges to a Gaussian distribution at an exponential rate, indicating that different initial distributions rapidly become quite similar during the forward diffusion process. Consequently, employing a uniform timestep sampling strategy does not effectively capture these dynamics, potentially leading to sub-optimal training outcomes for diffusion models. To remedy this, we introduce the Bidirectional Beta-Tuned Diffusion Model (BB-TDM). The BB-TDM leverages the Beta distribution to design the timestep sampling distribution and enhance the separation between different initial distributions during the diffusion process. By selecting appropriate parameters, the BB-TDM ensures that the timestep sampling distribution is aligned with the properties of the forward diffusion process and moderates the convergence speed of different initial distributions. Extensive experiments across various benchmark datasets on different diffusion models confirm the efficacy of the proposed BB-TDM.
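
The idea of replacing uniform timestep sampling with a Beta-distributed one can be sketched in a few lines. The sketch only shows how a Beta draw is mapped to a discrete timestep index; the shape parameters (0.5, 1.5), the number of steps, and the function name are illustrative assumptions and are not the settings of BB-TDM.

```python
import random

def sample_timesteps(batch_size, num_steps, alpha, beta, rng):
    """Draw timestep indices in [0, num_steps - 1] from a Beta(alpha, beta)."""
    timesteps = []
    for _ in range(batch_size):
        u = rng.betavariate(alpha, beta)          # value in [0, 1]
        timesteps.append(min(int(u * num_steps), num_steps - 1))
    return timesteps

def early_fraction(timesteps, num_steps):
    """Share of samples that fall in the first 10% of the schedule."""
    cutoff = num_steps // 10
    return sum(t < cutoff for t in timesteps) / len(timesteps)

rng = random.Random(0)
num_steps = 1000

# Beta(1, 1) is the uniform distribution, i.e. the usual baseline.
uniform_ts = sample_timesteps(10000, num_steps, 1.0, 1.0, rng)

# Assumed shape parameters that put more mass on early timesteps.
early_ts = sample_timesteps(10000, num_steps, 0.5, 1.5, rng)

uniform_share = early_fraction(uniform_ts, num_steps)
early_share = early_fraction(early_ts, num_steps)
```

With these assumed parameters a noticeably larger share of training samples falls in the early part of the schedule, where the abstract reports the sharpest distribution changes; other parameter choices shift the emphasis elsewhere.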

Abstract:
Data-Free Meta-Learning (DFML) aims to enable efficient learning of unseen few-shot tasks, by meta-learning from multiple pre-trained models without accessing their original training data. While existing DFML methods typically generate synthetic data from these models to perform meta-learning, a comprehensive analysis of DFML’s robustness—particularly its failure modes and vulnerability to potential attacks—remains notably absent. Such an analysis is crucial as algorithms often operate in complex and uncertain real-world environments. This paper fills this significant gap by systematically investigating the robustness of DFML, identifying two critical but previously overlooked vulnerabilities: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC). TDS refers to the sequential shifts in the evolving task distribution, leading to the catastrophic forgetting of previously learned meta-knowledge. TDC exposes a security flaw of DFML, revealing its susceptibility to attacks when the pre-trained model pool includes untrustworthy models that deceptively claim to be beneficial but are actually harmful. To mitigate these vulnerabilities, we propose a trustworthy DFML framework comprising three components: synthetic task reconstruction, meta-learning with task memory interpolation, and automatic model selection. Specifically, utilizing model inversion techniques, we reconstruct synthetic tasks from multiple pre-trained models to perform meta-learning. To prevent forgetting, we introduce a strategy to replay interpolated historical tasks to efficiently recall previous meta-knowledge. Furthermore, our framework seamlessly incorporates an automatic model selection mechanism to automatically filter out untrustworthy models during the meta-learning process. Extensive experiments across various datasets with two types of untrustworthy models confirm the superiority of our method in significantly enhancing the robustness of DFML.

Abstract:
3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.

Abstract:
Hypergraph neural networks (HGNNs) are widely used models for analyzing higher-order relational data. HGNNs suffer from the rapid performance degradation with increasing layers. Hypergraph dynamic system (HDS) is a potential way to deal with this challenge. However, hypergraph dynamic system is confined to a time-continuous isotropic model, lacking positional information in the structural space of the hypergraph. In contrast, anisotropic diffusion can capture structural space differences among vertices, providing a more precise representation of the information propagation process in hypergraph structures than isotropic diffusion. In this paper, we introduce HGNNv2, a stable hypergraph neural network, which is built as a hypergraph dynamic system with partial differential equation (PDE). This model incorporates a position-aware anisotropic diffusion term and an external control term. We further present the vertex-rooted subtree method to determine anisotropic diffusion intensity. HGNNv2 has properties that vertices occupying equivalent positions in the structural space share equivalent structural labels and positional features. Experiments on 6 hypergraph datasets and 3 graph datasets reveal that HGNNv2 outperforms all 12 compared methods. HGNNv2 is capable of achieving stable final representations and task accuracy even under noisy conditions. HGNNv2 achieves stable performance with fewer layers than hypergraph dynamic systems employing isotropic diffusion. We provide feature visualizations to illustrate the evolution of representations.

Abstract:
Handwritten Text Recognition (HTR) has become an essential field within pattern recognition and machine learning, with applications spanning historical document preservation to modern data entry and accessibility solutions. The complexity of HTR lies in the high variability of handwriting, which makes it challenging to develop robust recognition systems. This survey examines the evolution of HTR models, tracing their progression from early heuristic-based approaches to contemporary state-of-the-art neural models, which leverage deep learning techniques. The scope of the field has also expanded, with models initially capable of recognizing only word-level content progressing to recent end-to-end document-level approaches. Our paper categorizes existing work into two primary levels of recognition: (1) up to line-level, encompassing word and line recognition, and (2) beyond line-level, addressing paragraph- and document-level challenges. We provide a unified framework that examines research methodologies, recent advances in benchmarking, key datasets in the field, and a discussion of the results reported in the literature. Finally, we identify pressing research challenges and outline promising future directions, aiming to equip researchers and practitioners with a roadmap for advancing the field.

Abstract:
Higher-order graph neural networks (HOGNNs) and the related architectures from Topological Deep Learning are an important class of GNN models that harness polyadic relations between vertices beyond plain edges. They have been used to eliminate issues such as over-smoothing or over-squashing, to significantly enhance the accuracy of GNN predictions, to improve the expressiveness of GNN architectures, and for numerous other goals. A plethora of HOGNN models have been introduced, and they come with diverse neural architectures, and even with different notions of what the “higher-order” means. This richness makes it very challenging to appropriately analyze and compare HOGNN models, and to decide in what scenario to use specific ones. To alleviate this, we first design an in-depth taxonomy and a blueprint for HOGNNs. This facilitates designing models that maximize performance. Then, we use our taxonomy to analyze and compare the available HOGNN models. The outcomes of our analysis are synthesized in a set of insights that help to select the most beneficial GNN model in a given scenario, and a comprehensive list of challenges and opportunities for further research into more powerful HOGNNs.

Abstract:
Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. Since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we first examine the developmental history of visual grounding and provide an overview of essential background knowledge, including fundamental concepts and evaluation metrics. We systematically track and summarize the advancements, and then meticulously define and organize the various settings to standardize future research and ensure a fair comparison. In the dataset section, we compile a comprehensive list of current relevant datasets, conduct a fair comparative analysis, and provide ultimate performance prediction to inspire the development of new standard benchmarks. Additionally, we delve into numerous applications and highlight several advanced topics. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative work in each subtopic over the past decade. To the best of our knowledge, this paper represents the most comprehensive overview currently available in the field of visual grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments.

Abstract:
Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm.

Abstract:
The expansion of textual data, stemming from various sources such as online product reviews and scholarly publications on scientific discoveries, has created a significant demand for the extraction of succinct yet comprehensive information. While many methods have been proposed for automatic keyword extraction in unsupervised and fully supervised settings, effectively leveraging a partial list of known keywords, such as author-specified keywords or Twitter hashtags, remains under-explored. This work aims to enhance both the effectiveness and scalability of semi-supervised keyword extraction. We propose a novel variational Bayesian semi-supervised (VBSS) method that builds upon recent Bayesian advancement in the field, replacing computationally expensive posterior sampling with variational inference and data augmentation. This leads to closed-form updates and substantial speedups, particularly for long texts. Our numerical results show that the VBSS method not only improves performance on longer texts but also offers better control over false discovery rates compared to state-of-the-art keyword extraction techniques.

Affiliations: School of Computer Science and Technology, and the School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China; Wangxuan Institute of Computer Technology, Peking University, Beijing, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Information, Renmin University of China, Beijing, China; School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract:
In current text-to-video retrieval (T2VR), videos to be retrieved have been properly trimmed so that a correspondence between the videos and ad-hoc textual queries naturally exists. In practice, however, videos circulated on the Internet and social media platforms, while relatively short, are typically rich in content. Often, multiple scenes / actions / events are shown in a single video, leading to a more challenging T2VR setting wherein only part of the video content is relevant w.r.t. a given query. This paper presents a first study on this setting, which we term Partially Relevant Video Retrieval (PRVR). Considering that a video typically consists of multiple moments, a video is regarded as partially relevant w.r.t. a given query if it contains a query-related moment. We formulate the PRVR task as a multiple instance learning problem, and propose a Multi-Scale Similarity Learning (MS-SL++) network that jointly learns both clip-scale and frame-scale similarities to determine the partial relevance between video-query pairs. Extensive experiments on three diverse video-text datasets (TVshow Retrieval, ActivityNet-Captions and Charades-STA) demonstrate the viability of the proposed method.
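
The multiple instance learning view can be sketched as follows: a video is a bag of clips, and the video is scored against a query by its best-matching clip. Max pooling over cosine similarities is a common MIL choice used here as an assumption; it is not claimed to be the scoring rule of MS-SL++, and the name `video_query_score` is hypothetical.

```python
import numpy as np

def video_query_score(clip_embs, query_emb):
    """Score a video (bag of clip embeddings) against one query embedding.

    Returns the highest clip-query cosine similarity and the index of that clip,
    so a video counts as relevant if any single clip matches the query.
    """
    clips = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    sims = clips @ query                      # cosine similarity per clip
    return float(sims.max()), int(sims.argmax())

# Toy example: three clips in a 2-D embedding space, one aligned with the query.
clips = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.0, 2.0])
score, best_clip = video_query_score(clips, query)
```

Only the second clip points in the query direction, so it alone determines the video-level score.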

Abstract:
Existing feature matching methods are strongly coupled to their pre-defined position priors. For instance, sparse matchers are coupled to keypoints, and semi-dense matchers are coupled to grids. The coupled position prior dictates the distribution of matching points and imposes inherent limitations on the matcher. Consequently, sparse matchers suffer from a reliance on keypoint repeatability, while semi-dense matchers lack texture-based precision. Our preliminary work RCM leverages the keypoint prior in the source image and the grid prior in the target image, ensuring texture-based precision with keypoints while eliminating reliance on repeatability. However, RCM still relies heavily on keypoints in the source image, inheriting limitations such as sparsity and poor distribution in challenging scenes. To address these challenges, we introduce RCM+, which presents a novel free-form matching paradigm. By combining a position-agnostic encoder with a parameter-free decoder, we decouple the matcher from any position prior. As a result, the free-form matcher can match arbitrary input positions in a zero-shot manner, including detected keypoints, lines, edges, grids of any resolution, user-specified points, and more. This paradigm offers exceptional flexibility, allowing users to select position priors based on scene properties without retraining. Thus, RCM+ can leverage the advantages of various position priors without over-relying on any single prior, avoiding limitations in specific scenarios. To better match multiple position priors, we propose the Balancer, which reconciles all input position priors to achieve a more favorable point distribution for downstream tasks. Additionally, we enhance the view switcher and conflict-free matching layer introduced in RCM, further improving matching quality. Comprehensive experiments demonstrate the excellent performance, efficiency, and flexibility of RCM+, underscoring its promising potential for applications.

Abstract:
Abductive reasoning seeks the likeliest possible explanation for partial observations. Although being frequently employed in human daily reasoning, abduction is rarely explored in computer vision literature. In this article, we propose a new task, Visual Abductive Reasoning (VAR), that underpins the machine intelligence study of abductive reasoning in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that can best explain the observed premise. We create the first large-scale VAR dataset, which contains a total of 9K examples. We further devise a transformer-based VAR model – Reasonerv2 – for knowledge-driven, causal-and-cascaded reasoning. Reasonerv2 first adopts a contextualized directional position embedding strategy in the encoder, to capture the causal-related temporal structure of the observations, and yield discriminative representations for the premises and hypotheses. Then, Reasonerv2 extracts condensed causal knowledge from external knowledge bases, for reasoning beyond observation. Finally, Reasonerv2 cascades multiple decoders so as to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasonerv2 surpasses many famous video-language models, while still being far behind human performance.

Abstract:
Over the past five decades, automated face recognition (FR) has progressed from handcrafted geometric and statistical approaches to advanced deep learning architectures that now approach, and in many cases exceed, human performance. This paper traces the historical and technological evolution of FR, encompassing early algorithmic paradigms through to contemporary neural systems trained on extensive real and synthetically generated datasets. We examine pivotal innovations that have driven this progression, including advances in dataset construction, loss function formulation, network architecture design, and feature fusion strategies. Furthermore, we analyze the relationship between data scale, diversity, and model generalization, highlighting how dataset expansion correlates with benchmark performance gains. Recent systems have achieved near-perfect large-scale identification accuracy, with the leading algorithm in the latest NIST FRTE 1:N benchmark reporting a False Negative Identification Rate (FNIR) of 0.15 percent at False Positive Identification Rate (FPIR) of 0.001 on a gallery of over 10 million identities. Larger galleries increase false positive rates and deployments at greater scales will see higher error rates. We delineate key open problems and emerging directions, including scalable training, multi-modal fusion, synthetic data, and interpretable recognition frameworks.

Abstract:
Deep learning has profoundly impacted society, yet the inherent nature of deep neural networks hinders further application to high-reliability industries. To demystify these closed boxes, numerous works attempt to improve the explainability by observing or impacting internal variables of the models. However, existing methods rely on heuristics without rigorous theoretical foundations, often requiring intricate model modifications or redesigns. This work first formalizes two fundamental properties of explainability: alignment and invertibility, serving as theoretical pillars for rigorous interpretability analysis. Building on these, we introduce Bort, a plug-and-play optimizer that enforces Boundedness and orthogonality constraints on model parameters to improve explainability. These constraints are theoretically derived from the alignment and invertibility principles. Considering that conventional optimizers cannot leverage data features for precise attribution, we present a data-aware extension, termed DBort, which integrates an auxiliary loss term. Intriguingly, in the linear case, DBort converges to Principal Component Analysis (PCA). Our in-depth analysis of penalty term design reveals that $\ell_1$-based penalties provide a more stringent adherence to the imposed constraints compared to their $\ell_2$ counterparts. Our experiments involve reconstructing and backtracking through the optimized model representations, which reveal a marked enhancement in explainability. Furthermore, leveraging Bort, we successfully synthesize explainable adversarial examples without additional training. Notably, Bort consistently improves the classification accuracy across diverse architectures, including ResNet and DeiT, on benchmark datasets such as MNIST, CIFAR-10, and ImageNet.

Abstract:
Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep $t$ (“time indexing”) and therefore struggle to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly together with predicting the frames, we provide the network with an explicit hint on how far the object has traveled between start and end frames, a novel approach termed “distance indexing”. This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. We further observed that, even with this extra guidance, objects can still be blurry especially when they are equally far from both input frames (i.e., halfway in-between), due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly sharper outputs and superior perceptual quality in arbitrary time interpolations, using a uniform distance indexing map in the same format as time indexing without requiring extra computation. Furthermore, we demonstrate that if additional latency is acceptable, a continuous map estimator can be employed to compute a pixel-wise dense distance indexing using multiple nearby frames. Combined with efficient multi-frame refinement, this extension can further disambiguate complex motion, thus enhancing performance both qualitatively and quantitatively. Additionally, the ability to manually specify distance indexing allows for independent temporal manipulation of each object, providing a novel tool for video editing tasks such as re-timing.

Abstract:
Causality plays a pivotal role in various fields of study. Based on the framework of causal graphical models, previous works have proposed identifying whether a variable is a cause or non-cause of another variable in every Markov equivalent graph by learning only the local structure. However, the presence of prior knowledge, often represented as a partially known causal graph, is common in many causal modeling applications. Leveraging this prior knowledge enables further identification of causal relations. In this paper, we first propose a method for learning the local structure by incorporating several types of causal background knowledge, including direct causal, non-ancestral, and ancestral information. Then we introduce sufficient and necessary conditions for identifying causal relations based solely on the local structure in the presence of prior knowledge. The effectiveness and efficiency of our method are demonstrated through experiments on local structure learning, causal relation identification, and its application to fair machine learning.

Abstract:
Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
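
The Hessian-gradient product mentioned above can be approximated without ever forming the Hessian. The sketch below uses a standard central finite difference of first-order gradients; the helper name `hvp_finite_diff`, the step size `eps`, and the toy quadratic are assumptions for illustration and are not taken from the Diff-In implementation.

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v using two gradient evaluations.

    Central difference: (g(theta + eps*v) - g(theta - eps*v)) / (2*eps).
    """
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

# Toy check on f(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta
# and whose Hessian is exactly A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad_fn = lambda th: A @ th
theta = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
approx = hvp_finite_diff(grad_fn, theta, v)
exact = A @ v
```

The cost is two gradient evaluations per product, which is why such an estimate stays in the same complexity class as first-order methods.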

Abstract:
Pseudo-labeling has emerged as a popular and effective approach for utilizing unlabeled data. However, in the context of semi-supervised multi-label learning (SSMLL), conventional pseudo-labeling methods encounter difficulties when dealing with instances associated with multiple labels and an unknown label count. These limitations often result in the introduction of false positive labels or the neglect of true positive ones. To overcome these challenges, this paper proposes a novel solution called Class-distribution-Aware Pseudo-labeling (CAP) that performs pseudo-labeling in a class-aware manner. The proposed approach introduces a regularized learning framework incorporating class-aware thresholds, which effectively control the assignment of positive and negative pseudo-labels for each class. Notably, even with a small proportion of labeled examples, our observations demonstrate that the estimated class distribution serves as a reliable approximation. Motivated by this finding, we develop a class-distribution-aware thresholding (CAT) strategy to ensure the alignment of pseudo-label distribution with the true distribution. Moreover, we extend CAT into a label decision method, aiming to improve the model’s classification performance during the testing phase. The correctness of the estimated class distribution is theoretically verified, and a generalization error bound is provided for our proposed method. Extensive experiments on multiple benchmark datasets confirm the efficacy of CAP in addressing the challenges of SSMLL problems.
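
One simple way to align the pseudo-label distribution with an estimated class distribution is to pick, for each class, the score quantile that lets through the expected fraction of positives. The sketch below assumes a single threshold per class and hypothetical function names; CAP as described also controls negative pseudo-labels, which this sketch omits.

```python
import numpy as np

def class_aware_thresholds(probs, class_prior):
    """Per-class thresholds from predicted scores on unlabeled data.

    probs:       (n_samples, n_classes) predicted probabilities.
    class_prior: (n_classes,) estimated fraction of positives per class.
    The threshold for class k is the (1 - prior_k) quantile of its scores,
    so roughly prior_k of the samples fall above it.
    """
    n_classes = probs.shape[1]
    thresholds = np.empty(n_classes)
    for k in range(n_classes):
        thresholds[k] = np.quantile(probs[:, k], 1.0 - class_prior[k])
    return thresholds

def assign_pseudo_labels(probs, thresholds):
    """Mark a class positive where its score exceeds the class threshold."""
    return (probs > thresholds).astype(np.int64)

scores = np.linspace(0.05, 0.95, 10)
probs = np.column_stack([scores, scores[::-1]])
prior = np.array([0.3, 0.5])
labels = assign_pseudo_labels(probs, class_aware_thresholds(probs, prior))
```

With ten samples and priors of 0.3 and 0.5, the two classes receive three and five positive pseudo-labels respectively.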

Abstract:
Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.

Abstract:
Real-world person re-identification (Re-ID) systems are susceptible to malicious attacks, leading to the leakage of pedestrian images and the Re-ID model, posing severe threats to the privacy of both system owners and pedestrians. Existing privacy-preserving person re-identification (PPPR) methods fail to simultaneously resist data leakage, model leakage, and data & model leakage while compromising the normal functionality of Re-ID systems. In this paper, we begin with an in-depth analysis of prior methodologies and identify the gap between existing works and the ideal PPPR paradigm. Inspired by the concept of “Let the invisible perturbation become the system trigger”, we propose SHIELD, a pioneering and comprehensive two-stage privacy-preserving framework. To resist data leakage, we propose a self-supervised method for Protected Dataset Generation in the first stage, which obviates the dependence on identity labels and ensures image quality. To resist model leakage without compromising the normal retrieval accuracy, we propose Original Feature Deconstruction and Protected Feature Alignment to train the system model with paired protected and original images. Extensive experiments substantiate that SHIELD significantly outperforms existing PPPR methods, offering robust and holistic protection for Re-ID systems while maintaining decent retrieval accuracy for authorized users. The code will be released soon.

Abstract:
Understanding traffic accident scenes is a long-standing research problem for vision-based safe driving. It seeks to answer why accidents occur, how near-crash scenes develop, and what the key elements of an accident are. This research is challenging due to the scarcity and fragmentation of accident data, as well as the complex accident environments. To study this, we present a framework of Abductive Driving accident Video understanding (ADVersa), which infers a plausible visual and textual explanation for the absent near-crash scenes. ADVersa underscores three groups of tasks: 1) visual past recovery of near-crash scenes, 2) visual prediction of near-crash scenes, and 3) accident cause involved video synthesis. To support the study, we first contribute MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild driving accident videos with temporally aligned text descriptions, 2.23 million well-annotated object boxes, and 58,650 pairs of video-based accident cause texts. We then propose an Abductive CLIP model and a Contrastive Graph Video Pre-training (CGVP) model, which exploit relation-aware cross-modal semantic learning to drive spatially abductive and temporally abductive accident video diffusion. Extensive experiments verify the superiority of ADVersa over the state-of-the-art approaches on different tasks, i.e., historical near-crash video frame recovery, crashing video frame prediction, textual accident cause and category reasoning, normal-to-accident video synthesis, and accident video editing. With these efforts, we hope this research can advance the progress on multimodal accident video understanding.

Abstract:
Eliminating the semantic discrepancy between different modalities is the ultimate goal of image-text retrieval. However, most existing methods focus only on retrieving the ground-truth instance and ignore semantically similar instances that are not labeled as positives, even though such instances give rise to one-to-many correspondence. Mainstream solutions are mainly based on uncertainty learning and, despite significant progress, their exploration of one-to-many correspondence is still insufficient. Therefore, this work develops a novel Distribution-to-Points (termed D2P) matching mechanism for image-text retrieval to capture the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped as a probabilistic embedding to learn its true semantic distribution based on Mahalanobis distance. Then each candidate instance in a mini-batch is regarded as a hypergraph node with its mean semantics while a Gaussian query is modeled as a hyperedge to capture the semantic correlations beyond the pair between candidate points and the query. Moreover, an energy-based semantic modeling framework is developed to pull all similar candidates (not only the ground truth) close to their query while pushing those dissimilar ones far away. In the end, distribution-to-points matching is learned based on the similarity measurement over the Mahalanobis distance, which considers semantic variance to perform many-to-one correspondence well. Experimental results on several widely used datasets and under various evaluation metrics confirm the superiority and effectiveness of our method in improving the retrieval ability of the baseline, including ground-truth matching and semantic multiplicity for image-text retrieval.
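
To make the distribution-to-points idea concrete, the sketch below measures the Mahalanobis distance from candidate points to a Gaussian query embedding. A diagonal covariance and the name `mahalanobis_to_query` are assumptions made for brevity; the abstract does not state the covariance structure used by D2P.

```python
import numpy as np

def mahalanobis_to_query(points, mu, var):
    """Mahalanobis distance from each candidate point to a Gaussian query.

    points: (n, d) candidate embeddings.
    mu:     (d,) query mean.
    var:    (d,) per-dimension query variance (diagonal covariance, assumed).
    """
    diff = points - mu
    return np.sqrt(np.sum(diff * diff / var, axis=1))

# The query is uncertain along the first axis (variance 4) and confident
# along the second (variance 1).
mu = np.array([0.0, 0.0])
var = np.array([4.0, 1.0])
candidates = np.array([[2.0, 0.0], [0.0, 2.0]])
dists = mahalanobis_to_query(candidates, mu, var)
```

Both candidates are equally far from the mean in Euclidean terms, but the one displaced along the high-variance axis is treated as closer, which is how semantic variance lets one query match several candidates.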

Abstract:
Making personalized recommendations for cold-start users, who have only a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map interaction histories to user-specific parameters, which are then used to modulate the predictor through a certain modulation structure. These works obtain state-of-the-art performance. However, a general approach to designing the modulation structure is lacking. Instead of using a fixed modulation function and deciding the modulation position by expertise, we propose to determine a proper modulation structure, including function and position, via neural architecture search. We propose two approaches. We first design a symbolic search space which covers broad models and theoretically prove that this search space can be transformed to a much smaller space, enabling an efficient and robust one-shot search algorithm, called ColdNAS. Since recommendation systems are a special case of bipartite matching problems, the proposed methods can be generalized to a wide range of cold-start tasks, such as disease-gene association prediction for emerging diseases. However, diverse scenarios introduce new challenges in both the flexibility of the search algorithm and the search space. To address these limitations, we further propose ColdNAS_++, where we employ neural networks to model modulation functions to extend the search space and design a two-stage decoupled stochastic search algorithm to enable non-differentiable targets in continuous spaces. Extensive experimental results on benchmark datasets show that modulation structures obtained by ColdNAS and ColdNAS_++ consistently outperform hand-designed cold-start techniques for recommending items for new users and predicting associated genes for new diseases. We observe that different modulation functions lead to the best performance on different datasets or under different metrics, which validates the necessity of designing the modulation structure in a data-driven way.

Abstract:
High-speed vision tasks have long been a challenge in computer vision. Recently, the spike camera has shown great potential in these tasks due to its high temporal resolution. Unlike traditional cameras, it emits asynchronous spike signals to capture visual information. However, under low-light conditions, spike signals become highly sparse, and the sparse spike stream severely hinders the effectiveness of existing spike-based methods in high-speed scenarios. To address this challenge, we introduce SS2DS, the first deep learning framework that enhances sparse spike streams into dense spike streams. SS2DS first estimates the spike firing frequency within sparse streams. Subsequently, the spike firing frequency is enhanced by a neural network. Finally, SS2DS decodes the enhanced spike stream from the enhanced spike firing frequency sequence. SS2DS can adjust the temporal distribution of sparse spike streams and mitigate the performance degradation of existing methods in low-light and high-speed scenarios. To evaluate sparse spike stream enhancement, we construct both synthetic and real sparse spike stream datasets. In terms of reconstruction results, enhanced spike streams achieve an average improvement of +0.78 MA, −18.42 BRISQUE, and −1.42 NIQE over sparse spike streams. Moreover, the enhanced spike streams also benefit other spike-based vision tasks, such as 3D reconstruction (+1.325 dB PSNR, +0.005 SSIM, and −0.01 LPIPS) and super-resolution (+0.63 MA, −13.67 BRISQUE, and −1.28 NIQE).
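The three stages described above (estimate firing frequency, enhance it, decode a denser stream) can be sketched for a single pixel as follows. This is only a toy under stated assumptions: the trailing-window estimator, the integrate-and-fire decoder, and especially the constant gain standing in for the learned enhancement network are illustrative choices, not details taken from SS2DS.

```python
def firing_frequency(spikes, window):
    """Estimate firing frequency at each step as the spike fraction in a trailing window."""
    freq = []
    for t in range(len(spikes)):
        start = max(0, t - window + 1)
        chunk = spikes[start:t + 1]
        freq.append(sum(chunk) / len(chunk))
    return freq

def decode_spikes(freq, threshold=1.0):
    """Integrate-and-fire decoding: accumulate frequency, emit a spike at the threshold."""
    spikes, acc = [], 0.0
    for f in freq:
        acc += f
        if acc >= threshold:
            spikes.append(1)
            acc -= threshold
        else:
            spikes.append(0)
    return spikes

# Hypothetical sparse stream for one pixel: one spike every four steps.
sparse = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
freq = firing_frequency(sparse, window=4)

# Placeholder for the neural enhancement step: a fixed gain of 2, clipped to 1.
enhanced = [min(1.0, 2.0 * f) for f in freq]
dense = decode_spikes(enhanced)
```

The decoded stream has the same length as the input but fires more often, which is the sense in which a sparse stream becomes a denser one here.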

Abstract:
Interactive Segmentation (IS) segments specific objects or parts by deducing human intent from sparse input prompts. However, the sparse-to-dense mapping is ambiguous, making it challenging for users to obtain segmentations at the desired granularity and causing them to engage in trial-and-error cycles. Although existing multi-granularity IS models (e.g., SAM) alleviate the ambiguity of single-granularity methods by predicting multiple masks simultaneously, this approach has limited scalability and produces redundant results. To address this issue, we introduce a creative granularity-controllable IS paradigm that resolves ambiguity by enabling users to precisely control the segmentation granularity. Specifically, we propose a Unified Granularity Controller (UniGraCo) that supports multi-type optional granularity control signals to pursue unified control over diverse segmentation requirements, effectively overcoming the limitation of single-type control in adapting to different needs, thus boosting the system efficiency and practicality. To mitigate the excessive cost of annotating the multi-granularity masks and the corresponding granularity control signals for training UniGraCo, we construct an automated data engine capable of generating high-quality and granularity-abundant mask-granularity data pairs at low cost. To enable UniGraCo to learn unified granularity controllability in an efficient and stable manner, we further design a granularity-controllable learning strategy. This strategy leverages the generated data pairs to incrementally equip the pre-trained IS model with granularity controllability while preserving its segmentation capability. Extensive experiments on intricate scenarios at both instance and part level demonstrate that our UniGraCo has significant advantages over previous methods, highlighting its potential as a practical interactive tool.

Abstract:
Transformers have excelled in image restoration due to their advanced representational abilities. However, their reliance on a fixed local window for attention often undermines translation invariance and local relationship preservation. This limitation can reduce network stability, especially when dealing with positional changes in degradation scenarios. In this research, we present a new Bayesian Window Transformer, which innovates by employing a probability distribution for window shifts, overcoming the limitations of fixed window configurations in traditional transformers. This approach allows for more flexible coverage beyond a predetermined region. During the evaluation procedure, we further develop two approximate inference algorithms: Layer Expectation Propagation and Monte Carlo Average. These two algorithms calculate expectations derived from the introduced distribution to effectively approximate the marginalization results of the probabilistic variables. Hence, our Bayesian Window Transformer not only inherits the powerful representation ability but also maintains essential properties like translation invariance and local relationship preservation for image restoration. We also provide a theoretical guarantee, demonstrating that our method is aligned with the classic sliding window technique in terms of receptive field sizes and sliding behavior. Comprehensive experiments validate the exceptional effectiveness of our Bayesian Window Transformer across multiple image restoration tasks, including image deraining, denoising, and deblurring.
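To make the idea of averaging over a distribution of window shifts concrete, here is a 1-D toy. It assumes a uniform distribution over cyclic shifts and uses a block-mean operator as a stand-in for windowed attention; neither choice comes from the paper, and all function names are hypothetical.

```python
import random

def window_op(x, w):
    """Toy fixed-window operator: replace each non-overlapping block by its mean."""
    out = []
    for i in range(0, len(x), w):
        block = x[i:i + w]
        out.extend([sum(block) / len(block)] * len(block))
    return out

def roll(x, s):
    """Cyclically shift a list to the left by s positions (negative s shifts right)."""
    s %= len(x)
    return x[s:] + x[:s]

def shifted_window_op(x, w, s):
    """Apply the window operator on a grid offset by s, then undo the shift."""
    return roll(window_op(roll(x, s), w), -s)

def expected_output(x, w):
    """Exact expectation under a uniform distribution over the w distinct shifts."""
    outs = [shifted_window_op(x, w, s) for s in range(w)]
    return [sum(col) / w for col in zip(*outs)]

def monte_carlo_output(x, w, n_samples, rng):
    """Monte Carlo estimate of the same expectation from randomly sampled shifts."""
    outs = [shifted_window_op(x, w, rng.randrange(w)) for _ in range(n_samples)]
    return [sum(col) / n_samples for col in zip(*outs)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fixed = window_op(x, 4)           # depends on where the window grid falls
averaged = expected_output(x, 4)  # commutes with cyclic shifts of the input
```

With a fixed grid, shifting the input by one position changes which elements share a window. After averaging over all shifts, shifting the input simply shifts the output, which is the translation property the abstract refers to.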

Abstract:
This paper revisits the canonical concept of learning structured representations without label supervision by eigendecomposition. Yet, unlike prior spectral methods such as Laplacian Eigenmap which operate in a nonparametric manner, we aim to parametrically model the principal eigenfunctions of an integral operator defined by a kernel and a data distribution using a neural network for enhanced scalability and reasonable out-of-sample generalization. To achieve this goal, we first present a new series of objective functions that generalize the EigenGame (Gemp et al., 2020) to function space for learning neural eigenfunctions. We then show that, when the similarity metric is derived from positive relations in a data augmentation setup, a representation learning objective function that resembles those of popular self-supervised learning methods emerges, with an additional symmetry-breaking property for producing structured representations where features are ordered by importance. We call such a structured, adaptive-length deep representation Neural Eigenmap. We demonstrate the use of Neural Eigenmap as adaptive-length codes in image retrieval systems. By truncation according to feature importance, our method requires up to 16× shorter representation length than leading self-supervised learning methods to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.
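The adaptive-length retrieval idea can be illustrated with a small sketch: when features are ordered by importance, a code can be shortened by keeping only its leading entries. The cosine-similarity retrieval, the function names, and the toy codes below are assumptions for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(query, database, k):
    """Index of the database code most similar to the query, using only the first k features."""
    q = query[:k]
    scores = [cosine(q, code[:k]) for code in database]
    return max(range(len(database)), key=lambda i: scores[i])

# Hypothetical codes whose features are ordered from most to least important.
query = [0.9, 0.1, 0.05, 0.01]
database = [
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.7, 0.7, 0.0, 0.0],
]

full_match = retrieve(query, database, k=4)
short_match = retrieve(query, database, k=2)
```

In this toy case the half-length code retrieves the same item as the full code, which is the behavior one would hope for when the discarded trailing features carry little information.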

Abstract:
We propose SNI-SLAM++, a tightly-coupled semantic SLAM system utilizing neural implicit representation that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. Our system tightly integrates visual appearance, geometry, and semantics through five key components: (i) We introduce hierarchical semantic representation to allow multi-level semantic comprehension for top-down structured semantic mapping of the scene. (ii) To fully utilize the correlation between multiple attributes of the environment, we integrate appearance, geometry and semantic features through cross-attention for feature collaboration. This strategy enables a more multifaceted understanding of the environment, thereby allowing SNI-SLAM++ to remain robust even when a single attribute is defective. (iii) We design an internal fusion-based decoder to obtain semantic, RGB, and Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. (iv) We introduce a semantics-coupled tracking framework that tightly incorporates semantic constraints for camera pose estimation in neural implicit SLAM. This framework leverages the multi-view consistency of semantics to construct a pose graph and perform semantic loop closure optimization, enabling robust tracking. (v) We propose a feature loss to update the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss is capable of guiding the network optimization on a higher level. Our SNI-SLAM++ demonstrates superior performance over all recent visual SLAM methods in terms of mapping and tracking accuracy on the datasets of Replica, ScanNet, TUM-RGBD, and ScanNet++, while also showing excellent capabilities in accurate semantic segmentation and 3D semantic mapping.

Abstract:
Frequency domain analysis reveals fundamental image patterns difficult to observe in raw pixel values, while avoiding redundant information in original image processing. Although recent remote sensing foundation models (FMs) have made progress in leveraging spatial and spectral information, they have limitations in fully utilizing frequency characteristics that capture hidden features. Existing FMs that incorporate frequency properties often struggle to maintain connections with the original image content, creating a semantic gap that affects downstream performance. To address these challenges, we propose the All-in-One Spectral-Spatial-Frequency Awareness Foundation Model (Alliance), a framework that effectively integrates information across all three domains. Alliance introduces several key innovations: (1) a progressive frequency decoding mechanism inspired by human visual cognition that minimizes multi-domain information gaps while preserving connections between general image information and frequency characteristics, progressively reconstructing from low to mid to high frequencies to extract patterns difficult to observe in raw pixel values; (2) a triple-domain fusion attention module that separately processes amplitude, phase, and spectral-spatial relationships for comprehensive feature integration; and (3) frequency embedding with frequency-aware Cls token initialization and frequency-specific mask token initialization that achieves fine-grained modeling of different frequency band information. Additionally, to evaluate FMs generalizability, we construct the Yellow River dataset, a large-scale multi-temporal collection that introduces challenging cross-domain tasks and establishes more rigorous standards for FMs assessment. Extensive experiments across six downstream tasks demonstrate Alliance’s superior performance.

Abstract:
Image Style Transfer aims to replicate the style of a reference image based on the content from a text description or another image. With the significant advancements in image generation through diffusion models, recent studies have attempted to either fine-tune embeddings to learn a single style or utilize the pre-trained CLIP image encoder to extract style representations. However, style-tuning requires substantial computational resources, and the pre-trained CLIP image encoder is trained for semantic understanding rather than for style representation. To address these challenges, we introduce a style-aware encoder and a well-organized style dataset called StyleGallery to learn a good style representation that is crucial and sufficient for generalized style transfer without test-time tuning. With a dedicated design for style learning, this style-aware encoder is trained to extract expressive style representations from multi-level patches with a decoupling training strategy, and StyleGallery enables the generalization ability. Moreover, we employ a content extraction and content-fusion encoder to enhance image-driven style transfer. We highlight that our approach, named StyleShot, is simple yet effective in mimicking various desired styles, e.g., 3D, flat, abstract or even fine-grained styles, without test-time tuning. Rigorous experiments validate that StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art text- and image-driven methods.

Abstract:
The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work, we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.

Abstract:
We present a novel setting of active learning (AL) where multiple target models are simultaneously learned. This setting arises in real-world applications where machine learning systems require training multiple models on the same labeled dataset to accommodate diverse devices with varying computational resources. However, traditional AL methods are often limited by their model dependence and non-transferability. In this paper, we address the question of whether an effective AL method can be designed for multiple target models. We analyze the query complexity of active and passive learning in this setting and demonstrate the potential for AL to achieve improved query complexity. Based on this insight, we further propose an agnostic AL sampling strategy which selects examples located in the joint disagreement regions of different target models. Experimental evaluations on classification and regression benchmarks validate the effectiveness of our approach over traditional AL methods.

Abstract:
Offering rich contexts to Large Language Models (LLMs) has been shown to boost performance in various tasks, but the resulting longer prompts increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten prompts by using language models to generate shorter prompts or by developing computational models to select important parts of the original prompt. The generative compression methods suffer from issues like hallucination, while the selective compression methods do not involve linguistic rules and overlook the global structure of the prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure such as the dependency of sentences, paragraphs, and sections. After that, root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of the designs in PartPrompt, and additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extremely long prompt scenario.

Abstract:
Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration.

Abstract:
Mixture-of-Experts (MoE) has emerged as an effective and efficient scaling mechanism for large language models (LLMs) and vision-language models (VLMs). By expanding a single feed-forward network into multiple expert branches, MoE increases model capacity while maintaining efficiency through sparse activation. However, despite this sparsity, the need to preload all experts into memory and activate multiple experts per input introduces significant computational and memory overhead. The expert module becomes the dominant contributor to model size and inference cost, posing a major challenge for deployment. To address this, we propose MC# (Mixture-Compressor-sharp), a unified framework that combines static quantization and dynamic expert pruning by leveraging the significance of both experts and tokens to achieve aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading overhead, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which formulates adaptive bit allocation as a linear programming problem. The objective function jointly considers expert importance and quantization error, producing a Pareto-optimal trade-off between model size and performance. To reduce runtime computation, we further introduce Online Top-any Pruning (OTP), which models expert activation per token as a learnable distribution via Gumbel-Softmax sampling. During inference, OTP dynamically selects a subset of experts for each token, allowing fine-grained control over activation. By combining PMQ’s static bit-width optimization with OTP’s dynamic routing, MC# achieves extreme compression with minimal accuracy degradation. On DeepSeek-VL2, MC# achieves a 6.2× weight reduction at an average of 2.57 bits, with only a 1.7% drop across five multimodal benchmarks compared to the 16-bit baseline. Moreover, OTP further reduces expert activation by 20% with less than 1% performance loss, demonstrating strong potential for efficient deployment of MoE-based models.
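For readers unfamiliar with Gumbel-Softmax sampling, the following minimal sketch draws a relaxed categorical sample and turns it into a per-token decision. The abstract does not say how OTP parameterizes its distribution, so the choice here (a categorical distribution over how many routed experts to keep) is purely an assumption, as are the function names and the logits.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one standard Gumbel sample; the uniform draw is clamped to avoid log(0)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax of Gumbel-perturbed logits at temperature tau."""
    perturbed = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(perturbed)  # subtract the maximum for numerical stability
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]

def select_num_experts(logits, tau, rng):
    """Hypothetical decision rule: keep (argmax index + 1) experts for this token."""
    probs = gumbel_softmax(logits, tau, rng)
    return max(range(len(probs)), key=probs.__getitem__) + 1

rng = random.Random(0)
# Assumed logits over keeping 1, 2, or 3 experts; they strongly favor keeping one.
logits = [5.0, 0.0, 0.0]
probs = gumbel_softmax(logits, tau=1.0, rng=rng)
choices = [select_num_experts(logits, 1.0, rng) for _ in range(1000)]
```

Because the argmax of Gumbel-perturbed logits is distributed according to the softmax of the logits, most tokens in this toy setting keep a single expert, while the relaxed sample remains differentiable with respect to the logits during training.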

Abstract:
The rise of AI-generated images has sparked serious concerns about their potential misuse across various domains, prompting the urgent need for robust detection methods. Despite advancements, many current approaches prioritize short-term gains at the expense of long-term effectiveness. This paper critiques the overly specialized approach of fine-tuning pre-trained models for short-term gains on a single AI image dataset, while disregarding the long-term imperative of achieving generalization and knowledge retention. To address this trade-off issue, we propose a novel learning framework (PoundNet) for the generalization of AI-generated image detection on a pre-trained vision-language model. PoundNet incorporates a learnable prompt design and a balanced objective to preserve broad knowledge from upstream tasks (object classification) while enhancing generalization for downstream tasks (AI-generated image detection). We train PoundNet on a single standard AI image dataset, following common practice in the literature. We then evaluate its performance across 10 large-scale public AI-generated image detection datasets with 5 main evaluation metrics, forming the largest benchmark test set for assessing the generalization ability of AI-generated image detection models, to our knowledge. The comprehensive benchmark evaluation demonstrates that PoundNet successfully balances generalization with knowledge retention, achieving a remarkable relative improvement of 19% in AI-generated image detection performance compared to state-of-the-art methods, while maintaining a strong performance of 63% on object classification tasks.

Abstract:
The success of existing video super-resolution (VSR) methods stems mainly from exploiting spatial and temporal information, usually through temporal propagation with alignment strategies. However, inaccurate alignment usually leads to significant artifacts that accumulate during propagation and thus affect video restoration. Moreover, only propagating the same timestep features forward or backward does not handle videos with complex motion or occlusion. To address these issues, we propose a collaborative feedback discriminative (CFD) method to correct inaccurately aligned features and better model spatial and temporal information for VSR. Specifically, we first develop a discriminative alignment correction (DAC) method to reduce the influence of the artifacts caused by inaccurate alignment. Then, we propose a collaborative feedback propagation (CFP) module based on feedback and gating mechanisms to explore spatial and temporal information of different timestep features from forward and backward propagation simultaneously. Finally, we embed the proposed DAC and CFP into commonly used VSR networks to verify the effectiveness of our method. Experimental results demonstrate that our method improves the performance of existing VSR models while maintaining a lower model complexity.

Abstract:
Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

Abstract:
Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: a Local Attention Refinement Module and a Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. The Local Attention Refinement Module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model’s robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield improvements of 9.58% and 11.95%, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

Abstract:
Vision Transformer (ViT) has shown impressive performance in image restoration due to its ability to capture a large receptive field. However, its complexity grows quadratically with input resolution, limiting its applicability for high-resolution images. In contrast, Convolutional Neural Networks (CNNs) are computationally efficient but are constrained by their inherently local receptive fields, which limit their ability to capture long-range pixel relationships. To address these challenges, we propose StarIR, which possesses the efficiency of CNNs while also capturing a large receptive field, similar to Transformers. StarIR incorporates two key innovations: 1) a dual-domain representation learning framework, with one branch processing spatial details and the other focusing on mesoscale interactions in the frequency domain; and 2) a high-dimensional feature fusion mechanism, the Star operation, which fuses information from both domains through element-wise multiplication, thereby enhancing representational capacity without increasing network width and depth. Our Star operation is followed by a channel attention unit to facilitate global feature modeling and enhance channel-wise interactions. Building on our straightforward yet powerful design principles, StarIR achieves state-of-the-art performance across 21 datasets covering six single-degradation image restoration tasks. Furthermore, our model performs favorably against leading algorithms in two all-in-one settings and demonstrates robustness on two composite-degradation datasets. In addition, StarIR extends well to several domain-specific applications, including ultra-high-definition (UHD) imaging, remote sensing, medical imaging, and underwater image enhancement.
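The fusion by element-wise multiplication described above can be illustrated with a tiny sketch. The two linear maps below merely stand in for the spatial and frequency branches, and the weights and helper names are made up for the example; the point is only that multiplying two branch outputs introduces second-order terms without widening the output.

```python
def linear(x, weights):
    """Apply a linear map given as a list of weight rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def star(a, b):
    """Star operation: fuse two feature vectors by element-wise multiplication."""
    return [ai * bi for ai, bi in zip(a, b)]

# Hypothetical 2-D input and branch weights (stand-ins for the two domains).
x = [1, 2]
w_spatial = [[1, 0], [0, 1]]
w_frequency = [[0, 1], [1, 1]]

spatial = linear(x, w_spatial)      # [x0, x1]
frequency = linear(x, w_frequency)  # [x1, x0 + x1]
fused = star(spatial, frequency)    # [x0 * x1, x1 * (x0 + x1)]
```

Each fused entry is a product of two linear functions of the input, so it contains pairwise terms such as `x0 * x1` even though the output has the same width as either branch.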

Abstract:
The VQA explanation task aims to explain the decision-making process of VQA models in a way that is easily understandable to humans. Existing methods mostly use visual location or natural language explanation approaches to generate corresponding rationales. Although significant progress has been made, these frameworks are bottlenecked by the following challenges: 1) The uni-modal paradigm inevitably leads to semantic ambiguity of explanations. 2) The reasoning process cannot be faithfully reflected and suffers from logical inconsistency. 3) Human-annotated explanations are expensive and time-consuming to collect. In this paper, we introduce a new Semi-supervised VQA Multi-modal Explanation (SME) method via self-critical learning, which addresses the above challenges by leveraging both visual and textual explanations to comprehensively reveal the inference process of the model. Meanwhile, in order to improve the logical consistency between answers and rationales, we design a novel self-critical strategy to evaluate candidate explanations based on answer reward scores. More importantly, with semi-supervised learning, our method can benefit from a tremendous number of samples without human-annotated explanations. Extensive automatic measures and human evaluations all show the effectiveness of our method. Finally, the framework achieves new state-of-the-art performance on three VQA explanation datasets.

Abstract:
Open-vocabulary Video Instance Segmentation addresses the challenging task of detecting, segmenting, and tracking objects in videos, including categories not encountered during training. However, existing approaches often overlook rich temporal cues from preceding frames, limiting their ability to leverage causal context for robust open-world generalization. To bridge this gap, we propose CPOVIS, a novel framework that introduces causal prompts—dynamically propagated visual and taxonomy prompts from historical frames—to enhance temporal reasoning and semantic consistency. Built upon a Mask2Former architecture with a CLIP backbone, CPOVIS integrates three core innovations: 1) PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities; 2) a Visual Prompt Injector that propagates object-level features to maintain spatial-temporal coherence; and 3) a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize unseen category recognition. Furthermore, we introduce a contrastive learning strategy to disentangle object representations across frames and adapt the Segment Anything Model (SAM2) to boost open-vocabulary segmentation and tracking capacity in open-vocabulary video scenarios. Extensive experiments on seven challenging open- and closed-vocabulary video segmentation benchmarks demonstrate CPOVIS’s state-of-the-art performance, outperforming existing methods by significant margins. Our findings highlight the critical role of causal prompt propagation in advancing video understanding in open-world scenarios.

Abstract:
Unsupervised Domain Adaptation (UDA) focuses on transferring knowledge from a labeled source domain to an unlabeled target domain, addressing the challenge of domain shift. Significant domain shifts hinder effective knowledge transfer, leading to negative transfer and deteriorating model performance. Therefore, mitigating negative transfer is essential. This study revisits negative transfer through the lens of causally disentangled learning, emphasizing cross-domain discriminative disagreement on non-causal environmental features as a critical factor. Our theoretical analysis reveals that overreliance on non-causal environmental features as the environment evolves can cause discriminative disagreements (termed environmental disagreement), thereby resulting in negative transfer. To address this, we propose Reducing Environmental Disagreement (RED), which disentangles each sample into domain-invariant causal features and domain-specific non-causal environmental features via adversarially training domain-specific environmental feature extractors in the opposite domains. Subsequently, RED estimates and reduces environmental disagreement based on domain-specific non-causal environmental features. Experimental results confirm that RED effectively mitigates negative transfer and achieves state-of-the-art performance.

Abstract:
Ensuring the privacy of local datasets has emerged as an important concern in decentralized learning. However, the inherent privacy-utility tradeoff remains a fundamental challenge for privacy-preserving decentralized algorithms. To address this issue, we introduce the Positive-Incentive Noise Generator (PING), a novel mechanism designed to eliminate the negative impact of privacy noise on convergence while defending against powerful colluding inference attacks. PING leverages network topologies and lightweight encryption-decryption operations to generate correlated noise. Building upon PING, we propose PP-DPIN, a privacy-preserving stochastic algorithm tailored for decentralized learning. By integrating differential privacy and differential information entropy, we provide a comprehensive privacy quantification for PP-DPIN, with at least half of the nodes achieving arbitrarily strong privacy guarantees. Furthermore, the convergence rate of PP-DPIN is established under stochastic convex and nonconvex settings, which characterizes the impact of privacy noise and demonstrates a linear speedup relative to the network size. Experiments on computer vision tasks validate PP-DPIN’s superior performance and robustness against attacks compared to state-of-the-art methods.

Abstract:
Visual recognition models have achieved unprecedented success in various tasks. While researchers aim to understand the underlying mechanisms of these models, the growing demand for deployment in safety-critical areas like autonomous driving and medical diagnostics has accelerated the development of eXplainable AI (XAI). Distinct from generic XAI, visual recognition XAI is positioned at the intersection of vision and language, which represent the two most fundamental human modalities and form the cornerstones of multimodal intelligence. This paper provides a systematic survey of XAI in visual recognition by establishing a multi-dimensional taxonomy from a human-centered perspective based on intent, object, presentation, and methodology. Beyond categorization, we summarize critical evaluation desiderata and metrics, conducting an extensive qualitative assessment across different categories and demonstrating quantitative benchmarks within specific dimensions. Furthermore, we explore the interpretability of Multimodal Large Language Models and practical applications, identifying emerging trends and opportunities. By synthesizing these diverse perspectives, this survey provides an insightful roadmap to inspire future research on the interpretability of visual recognition models.

Abstract:
Multi-object navigation (MON) tasks involve sequentially locating multiple targets in an unknown environment, requiring global long-term planning under incomplete information. This requires the agent to dynamically balance immediate actions and long-term rewards while considering both local adaptability and global foresight. However, current methods focus excessively on local path optimization, which leads to slower convergence in sparse reward settings and increases the risk of deadlocks or trap states. The core challenge of MON lies in the deformation of the shared decision space, where independent optimization leads to redundant and overlapping paths. Thus, path planning requires dynamic, cross-task optimization rather than simple subtask aggregation. To minimize overall effort, the optimization process should adaptively balance task contributions through weight adjustment. We therefore propose the Goal-oriented Dynamic Weight Optimization (GDWO) algorithm. GDWO integrates target-specific value loss functions into a unified optimization framework and dynamically adjusts weights through gradient-based updates. To prevent over-optimization, weights are normalized during training according to navigation success rates, prioritizing more challenging targets. This adaptive mechanism effectively addresses the challenge of sparse rewards and improves convergence efficiency. By leveraging this mechanism, GDWO unifies multiple objectives within a single decision space, achieving efficient optimization and balancing short-term gains with long-term goals. Additionally, we introduce two auxiliary modules, prior knowledge-based navigation and frontier-aware exploration, to further enhance GDWO’s performance. Experimental results on the Gibson and Matterport3D datasets demonstrate that GDWO achieves improvements in key metrics for MON tasks. It optimizes path planning, reduces exploration costs, and enhances navigation efficiency, enabling the agent to perform tasks more effectively in complex environments.
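
As a rough illustration of the weighting idea in this abstract, the sketch below normalizes per-target loss weights from navigation success rates so that harder targets (lower success rate) receive larger weights. The softmax form, the `temperature` parameter, and the function names are assumptions made for illustration; the abstract does not specify the exact normalization or the gradient-based update.

```python
import math


def success_based_weights(success_rates, temperature=1.0):
    """Return weights that sum to 1 and grow as the success rate falls.

    Hypothetical rule: softmax over the difficulty score (1 - success rate).
    """
    scores = [(1.0 - s) / temperature for s in success_rates]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_value_loss(per_target_losses, weights):
    """Combine target-specific value losses into one scalar objective."""
    return sum(w * loss for w, loss in zip(weights, per_target_losses))


# Three targets with decreasing success rates: the last one is the hardest.
weights = success_based_weights([0.9, 0.5, 0.2])
total_loss = weighted_value_loss([0.3, 0.7, 1.1], weights)
```

A lower `temperature` would concentrate more weight on the hardest target; a higher one moves the weights toward uniform.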

Abstract:
The Vision-and-Language Navigation (VLN) task involves an agent navigating within 3D indoor environments based on provided instructions. Achieving cross-modal alignment presents one of the most critical challenges in VLN, as the predicted trajectory needs to precisely align with the given instruction. This paper focuses on addressing cross-modal alignment in VLN from a fine-grained perspective. Firstly, to address the issue of weak cross-modal alignment supervision arising from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset called Landmark-RxR. This dataset aims to offer precise, fine-grained supervision for VLN. Secondly, in order to comprehensively demonstrate the potential and advantage of the fine-grained data from Landmark-RxR, we explore the core components of the training process that depend on the characteristics of the training data. These components include data augmentation, training paradigm, reward shaping, and navigation loss design. Leveraging our fine-grained data, we carefully design methods for handling them and introduce a novel evaluation mechanism. The experimental results demonstrate that the fine-grained data can effectively improve the agent’s cross-modal alignment ability.

Abstract:
Multimodal anomaly detection (MAD) aims to exploit both texture and spatial attributes to identify deviations from normal patterns in complex scenarios. However, zero-shot (ZS) settings arising from privacy concerns or confidentiality constraints present significant challenges to existing MAD methods. To address this issue, we introduce ZUMA, a training-free, Zero-shot Unified Multimodal Anomaly detection framework that unleashes CLIP’s cross-modal potential to perform ZS MAD. To mitigate the domain gap between CLIP’s pretraining space and point clouds, we propose cross-domain calibration (CDC), which efficiently bridges the manifold misalignment through source-domain semantic transfer and establishes a hybrid semantic space, enabling a joint embedding of 2D and 3D representations. Subsequently, ZUMA performs dynamic semantic interaction (DSI) to enable structural decoupling of anomaly regions in the high-dimensional embedding space constructed by CDC, where natural languages serve as semantic anchors to help DSI establish discriminative hyperplanes within hybrid modality representations. Within this framework, ZUMA enables plug-and-play detection of 2D, 3D or multimodal anomalies, without training or fine-tuning even for cross-dataset or incomplete-modality scenarios. Additionally, to further investigate the potential of the training-free ZUMA within the training-based paradigm, we develop ZUMA-FT, a fine-tuned variant that achieves notable improvements with minimal parameter trade-off. Extensive experiments are conducted on two MAD benchmarks, MVTec 3D-AD and Eyecandies. Notably, the training-free ZUMA achieves state-of-the-art (SOTA) performance on both datasets, outperforming existing ZS MAD methods, including training-based approaches. Moreover, ZUMA-FT further extends the performance boundary of ZUMA with only 6.75 M learnable parameters.

Abstract:
Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset.

Abstract:
Secure and high-capacity transmission of secret information is an important task in image hiding research. Existing image hiding methods face critical issues: cover-based methods offer high capacity but introduce image distortion and security risks, whereas secure coverless methods have low capacity. To address these issues, this paper proposes a novel generative-based coverless multi-image hiding method called GCL-MIH, which achieves both high capacity and high security. GCL-MIH first uses a feature reverse module to compress multiple secret images into multiple feature vectors. It then normalizes them to generate a vector that conforms to a standard normal distribution. Finally, it feeds this vector into an invertible generative network (Flow-GAN) to generate a face image, enabling coverless multiple-image hiding without a predefined cover image. Experimental results demonstrate that GCL-MIH successfully hides up to four images within a single generated face image, achieving a maximum embedding rate of 32 bpp. This capacity far exceeds that of existing coverless methods. On the COCO test set, the stego images generated by GCL-MIH are highly realistic (FID score: 11.98), and the recovered secret images exhibit satisfactory fidelity (the average PSNR and SSIM of four recovered secret images are 33.18 dB and 0.9412).

Affiliations: School of Computer Science and Engineering, and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore; Institute of Intelligent Information Processing, Taizhou University, Taizhou, Zhejiang, China; School of Engineering Science, University of Chinese Academy of Sciences, Beijing, China

Abstract:
Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving relative improvements of 1.3%/2.4% (ACC_7), 1.3%/1.9% (ACC_2), and 1.9%/1.8% (F1) on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across the modality-irrelevant and modality-exclusive feature spaces.

Abstract:
The transferability of adversarial examples across different models has drawn considerable attention recently, particularly in targeted transferability. Prior research has empirically shown that optimizing adversarial perturbations at neighboring points with the highest loss value improves transferability. While effective, such a method requires multiple iterations to reach the local maxima and disregards the local minima of the input loss landscape. In this paper, we theoretically show that enhancing adversarial transferability is attainable by flattening the input loss landscape. This is accomplished through the perturbation optimization at both local maxima and minima. Moreover, we propose the Cost-efficient LandscapE Flattening (CLEF) attack to consider local maxima and minima around current inputs in a cost-efficient way to flatten the loss landscape and improve adversarial transferability. Specifically, we reuse the gradients of the previous attack step to assist current inputs in reaching local maxima, and employ probabilistic modeling to learn the distributional representations of perturbations that assist current inputs in reaching local minima. This probabilistic modeling can be pre-trained on dozens of images from other domains, enabling us to directly sample this type of perturbation from the pre-trained distribution when attacking. Experimental results demonstrate that integrating local maxima and minima into targeted transferable attacks can significantly flatten the loss landscape of the crafted adversarial examples, resulting in improved adversarial transferability.

Abstract:
Although multiview learning methods have been widely studied, they mostly focus on improving accuracy while ignoring decision uncertainty. In the real world, multiview data often encounters misalignment issues, resulting in conflictive instances and further limiting the application of these methods in safety-critical domains. Recently, some efforts have been made to improve the reliability of multiview learning methods by estimating decision uncertainty, but most methods often experience performance degradation due to their inability to handle conflictive instances. To address this issue, we propose a Robust Trusted Conflictive Multiview Collaborative Contrastive Learning (RCMCL) method, which enhances the model’s robustness and generalization ability in conflictive multiview scenarios. Specifically, RCMCL first uses an evidential deep neural network to construct view-specific opinions, and then employs dissonance-based evidence contrastive learning to enhance the consistency of these opinions across different views. Subsequently, RCMCL performs collaborative learning of consistent evidence and complementary evidence. It first introduces the vacuity degree into the complementary evidence to extract more useful information, and then employs category-level contrastive learning to separate consistent and complementary evidence. In addition, consistent and complementary evidence is combined to make a joint decision. Finally, experimental results on eight benchmark datasets verify the superiority of RCMCL over state-of-the-art methods.

Abstract:
Human Motion Prediction (HMP) aims to predict future human poses at different moments according to observed past motion sequences. Previous approaches mainly treated the prediction of different temporal moments as a single prediction task and learned the predictions of varied moments simultaneously, which would encounter a main limitation: the learning of short-term predictions (referring to “near-future” prediction) could be hindered by the predictions of long-term (referring to “far-future” prediction) motions. In this paper, we develop a novel temporal continual learning framework called Continual Prior Compensation (CPC) to progressively train HMP models, in which we divide the prediction task of motions corresponding to varied temporal moments into several subtasks and train the model in a multi-stage manner. To mitigate the prior information forgetting in the progressive training, we further introduce a learnable random variable Prior Compensation Factor (PCF) to explicitly measure the prior knowledge loss. We theoretically show that the PCF can be efficiently learned together with the model parameters by minimizing a reasonable upper bound of the objective function. The proposed CPC is further enhanced to estimate the prior information loss for each subtask and a new framework called Continual Prior Compensation++ (CPC++) with Fine-Grained Prior Compensation Factor (FGPCF) is finally developed. Our CPC and CPC++ frameworks are quite flexible and can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on three HMP benchmark datasets using multiple SOTA HMP backbones (PGBIG, siMLPe, MotionMixer, and LTD) demonstrate the effectiveness and flexibility of our frameworks.

Abstract:
As model modification techniques are increasingly employed to obtain high-performing machine learning models at reduced costs, identifying lineage relationships—i.e., whether one model is derived from another—has garnered significant research interest. However, existing approaches are largely empirical, lack theoretical grounding, and are often ineffective against high-impact modifications. Furthermore, none have addressed the measurement of lineage closeness, which quantifies the degree of modification between models. In this paper, we reformulate the model lineage determination problem as a question of whether two models’ parameters reside within the same local optimum of the loss landscape. Based on this formulation, we propose a simple yet effective method for lineage determination. We further explore the impact of various modification techniques on models’ decision boundaries using visualization techniques, and observe that changes in decision boundaries serve as an accurate metric for lineage closeness. Leveraging this insight, we propose a task-agnostic and modification-type-agnostic method to measure lineage closeness by computing the mean adversarial distance from data points to decision boundaries and the matching rate of data point predictions. To reduce computational overhead, we design an efficient sampling strategy for data point selection. Extensive experiments demonstrate that our approach achieves 100% accuracy in model lineage determination and provides precise, quantitative measurements of lineage closeness across a wide range of modification scenarios.
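
The two quantities named in this abstract can be illustrated with a small sketch. The matching rate below is simply the fraction of data points on which two models predict the same label. For the distance to the decision boundary, the sketch uses the closed form for a linear binary classifier, which is an assumption made to keep the example self-contained; for deep models this distance would have to be estimated with an adversarial attack, and the abstract does not say which one is used. All function names are hypothetical.

```python
import math


def matching_rate(preds_a, preds_b):
    """Fraction of data points on which two models give the same prediction."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equally long")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def linear_boundary_distance(w, b, x):
    """L2 distance from x to the hyperplane w.x + b = 0 (linear model only)."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(margin) / math.sqrt(sum(wi * wi for wi in w))


def mean_boundary_distance(w, b, points):
    """Average boundary distance over the sampled data points."""
    return sum(linear_boundary_distance(w, b, x) for x in points) / len(points)


rate = matching_rate([0, 1, 1, 2], [0, 1, 2, 2])
dist = linear_boundary_distance([3.0, 4.0], 0.0, [1.0, 1.0])
avg = mean_boundary_distance([3.0, 4.0], 0.0, [[1.0, 1.0], [0.0, 0.0]])
```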

Abstract:
Ordinal regression aims to predict ordered classes. Existing methods mainly focus on label distribution shapes and feature distance relationships, while the directional characteristics in the representation space remain underexplored. In this paper, we propose deep orientational representation learning (ORL), aiming to ensure the trajectory of features sequentially connected by ordinal categories approximates a geodesic. We treat the output layer weights as ordinal prototypes and introduce two constraints, the co-directional constraint and the counter-directional constraint. They operate by constraining the angles between pairs of vectors. The former minimizes the angle between vectors with matching start and end categories, while the latter maximizes the angle between vectors whose start categories are the same but whose end categories are on opposite sides. The two constraints optimize the representation from different ordinal directions. ORL is extended to a multi-prototype setting (MORL) to mitigate misalignment between features and oriented prototypes caused by large intra-class variations. Theoretical analysis links ORL to distribution unimodality and distance orderliness, highlighting its advantages. The effectiveness of ORL (MORL) is demonstrated on various tasks including facial age estimation, historical image dating, and aesthetic quality assessment.

Abstract:
In this paper, we study the problem of low-rank tensor learning, where only a few training samples are observed and the underlying tensor has a low-rank structure. Existing methods are based on the sum of nuclear norms of the unfolding matrices of a tensor, which may be suboptimal. To exploit the low-rankness of the underlying tensor effectively, we propose a nonconvex model based on the transformed tensor nuclear norm for low-rank tensor learning. Specifically, a family of nonconvex functions is applied to the singular values of all frontal slices of a tensor in the transformed domain to characterize the low-rankness of the underlying tensor. An error bound between the stationary point of the nonconvex model and the underlying tensor is established under restricted strong convexity of the loss function (such as the least squares loss and the logistic regression loss) and suitable regularity conditions on the nonconvex penalty function. By reformulating the nonconvex function as the difference of two convex functions, a proximal majorization-minimization (PMM) algorithm is designed to solve the resulting model. The global convergence and convergence rate of PMM are then established under very mild conditions. Numerical experiments on tensor completion and binary classification demonstrate the effectiveness of the proposed method over other state-of-the-art methods.
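
To make the difference-of-convex reformulation concrete, one standard example is the capped-$\ell_1$ penalty applied to a singular value $\sigma \ge 0$ with threshold $\theta > 0$. This particular penalty is an assumption chosen for illustration; the abstract does not name the family of nonconvex functions it uses. The second line shows the usual majorization step: linearizing the subtracted convex term at the current iterate $\sigma_k$ gives a convex upper bound.

```latex
\begin{align}
\phi(\sigma) &= \min(\sigma, \theta)
  = \underbrace{\sigma}_{\text{convex}}
  - \underbrace{\max(\sigma - \theta, 0)}_{h(\sigma),\ \text{convex}},
  \qquad \sigma \ge 0, \\
\phi(\sigma) &\le \sigma - h(\sigma_k) - g_k\,(\sigma - \sigma_k),
  \qquad g_k \in \partial h(\sigma_k).
\end{align}
```

The bound follows from the subgradient inequality $h(\sigma) \ge h(\sigma_k) + g_k(\sigma - \sigma_k)$ and holds with equality at $\sigma = \sigma_k$.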

Abstract:
This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given a satellite image and specified camera positions or trajectories. Our approach learns a satellite-image-conditioned neural radiance field from paired images captured from both satellite and street viewpoints. This is a challenging learning problem because of the sparse-view nature of the data and the extremely large viewpoint changes between satellite and street-view images. We tackle these challenges based on a task-specific observation that street-view-specific elements, including the sky and illumination effects, are visible only in street-view panoramas. We present a novel approach, Sat2Density++, that achieves photorealistic street-view panorama rendering by modeling these street-view-specific elements in neural networks. In the experiments, our method is evaluated on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

Abstract:
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce Low-Rank Adaptation (LoRA) modules to fine-tune the Feed-Forward Network (FFN) layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes.
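
The group sparse regularization mentioned in this abstract can be sketched with a generic group-lasso penalty: one L2 norm per group, summed over groups, whose proximal step shrinks every group and sets low-norm groups exactly to zero. This is a minimal sketch of that general technique, not the GS-LoRA implementation. Representing each LoRA group as a flat list of floats, the `threshold` value, and the function names are all assumptions for illustration.

```python
import math


def group_norm(group):
    """L2 norm over all parameters of one group (a flat list of floats here)."""
    return math.sqrt(sum(x * x for x in group))


def group_sparse_penalty(groups):
    """Group-lasso penalty: sum of per-group L2 norms."""
    return sum(group_norm(g) for g in groups)


def group_soft_threshold(groups, threshold):
    """Proximal step of the penalty: shrink each group toward zero.

    A group whose norm is at most `threshold` is zeroed out entirely,
    which is how whole groups get deselected.
    """
    shrunk = []
    for g in groups:
        norm = group_norm(g)
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * x for x in g])
    return shrunk


# One group with a large norm (5.0) and one with a small norm (0.05).
groups = [[3.0, 4.0], [0.03, -0.04]]
shrunk = group_soft_threshold(groups, threshold=0.5)
```

Here the first group is only rescaled, while the second is removed as a whole, mirroring the "automatic selection of specific LoRA groups and zeroing out the others" described in the abstract.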

Abstract:
Private data, when published online, may be collected by unauthorized parties to train deep neural networks (DNNs). To protect privacy, defensive noise can be added to original samples to degrade their learnability by DNNs. Recently, unlearnable examples (Huang et al., 2021) were proposed to minimize the training loss such that the model learns almost nothing. However, raw data are often pre-processed before being used for training, which may restore the private information of protected data. In this paper, we reveal the data privacy violation induced by data augmentation, a commonly used data pre-processing technique for improving model generalization; to the best of our knowledge, this is the first study of its kind. We demonstrate that data augmentation can significantly raise the accuracy of a model trained on unlearnable examples from 21.3% to 66.1%. To address this issue, we propose a defense framework, dubbed Armor, to protect data privacy from potential breaches caused by data augmentation. To overcome the difficulty of having no access to the model training process, we design a non-local module-assisted surrogate model that better captures the effect of data augmentation. In addition, we design a surrogate augmentation selection strategy that maximizes distribution alignment between augmented and non-augmented samples, to choose the optimal augmentation strategy for each class. We also use a dynamic step size adjustment algorithm to enhance the defensive noise generation process. Extensive experiments are conducted on 4 datasets and 5 data augmentation methods to verify the performance of Armor. Comparisons with 6 state-of-the-art defense methods demonstrate that Armor can preserve the unlearnability of protected private data under data augmentation. Armor reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines. We also show that Armor is robust to adversarial training. We will open-source our code upon publication.

Affiliations: Hefei Comprehensive National Science Center, Hefei University of Technology (HFUT), Hefei, China; Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei, China; Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, United Arab Emirates; Department of Electrical and Computer Engineering, National University of Singapore, Singapore; School of Intelligence Science and Technology, Peking University, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; OpenNLPLab, Shanghai, China

Abstract:
Mainstream research in audio-visual learning has focused on designing task-specific expert models, primarily implemented through sophisticated multimodal fusion approaches. Recently, a few efforts have aimed to develop more task-independent or universal audiovisual embedding networks, encoding advanced representations for use in various audiovisual downstream tasks. This is typically achieved by fine-tuning large pretrained transformers, such as Swin-V2-L and HTS-AT, in a parameter-efficient manner through techniques such as tuning only a few adapter layers inserted into the pretrained transformer backbone. Although these methods are parameter-efficient, they suffer from significant training memory consumption due to gradient backpropagation through the deep transformer backbones, which limits accessibility for researchers with constrained computational resources. In this paper, we present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audiovisual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.

Abstract:
In multi-dimensional classification (MDC), each instance is associated with labels from multiple potentially interdependent class dimensions. However, existing approaches often overlook the fact that different semantic dimensions may require distinct feature representations. Additionally, irrelevant and redundant features in the feature space can adversely affect model performance. To address these issues, a feature selection approach based on evolutionary multi-tasking named Fest is proposed for MDC. It treats feature selection for each class dimension as a separate subtask for evolution, ensuring the selected features effectively capture the semantics of each dimension. To effectively identify and select shared features between correlated class dimensions, Fest introduces an exploration mechanism for feature interaction that considers class dependencies. Extensive experiments are conducted on eleven benchmark datasets as well as on four state-of-the-art MDC approaches. Experimental results clearly demonstrate that selecting dimension-specific features instead of all features can significantly improve the classification performance of existing MDC approaches.

Abstract:
Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, particularly those using window-based self-attention, have become a dominant force. However, their performance is constrained by the rigid, non-overlapping window partitioning scheme, which leads to insufficient feature interaction across windows and limited receptive fields. This highlights the need for more adaptive and flexible attention mechanisms. In this paper, we propose the Deformable Sliding Window Transformer for Image Restoration (DSwinIR), which is built on a new attention mechanism, Deformable Sliding Window (DSwin) attention. This mechanism introduces a token-centric and content-aware paradigm that moves beyond the grid and fixed window partition. It comprises two complementary components. First, it replaces the rigid partitioning with a token-centric sliding window paradigm, making it effective at eliminating boundary artifacts. Second, it incorporates a content-aware deformable sampling strategy, which allows the attention mechanism to learn data-dependent offsets and actively shape its receptive field to focus on the most informative image regions. Extensive experiments show that DSwinIR achieves strong results, including state-of-the-art performance on several evaluated benchmarks. For instance, in all-in-one image restoration, DSwinIR surpasses the most recent backbone, GridFormer, by 0.53 dB on the three-task benchmark and 0.87 dB on the five-task benchmark.

Abstract:
There is a vast literature on representation learning based on principles such as coding efficiency, statistical independence, causality, controllability, or symmetry. In this paper we propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components. Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model, before being decoded to predict a future input state. The flow model is decomposed into a number of rotational (divergence-free) vector fields and a number of potential flow (curl-free) fields. Our sparsity prior encourages only a small number of these fields to be active at any instant and infers the speed with which the probability flows along these fields. Training this model is completely unsupervised using a standard variational objective and results in a new form of disentangled representations where the input is not only represented by a combination of independent factors, but also by a combination of independent transformation primitives given by the learned flow fields. When viewing the transformations as symmetries one may interpret this as learning approximately equivariant representations. Empirically we demonstrate that this model achieves state of the art in terms of both data likelihood and unsupervised approximate equivariance errors on datasets composed of sequence transformations.
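One way to write down the decomposition described above is sketched below. The notation is assumed for illustration and is not taken from the paper: $z$ is the latent state, $\phi_k$ are scalar potentials, $r_l$ are rotational fields, and $a_k(t)$, $b_l(t)$ are the instantaneous speeds on which a sparsity prior would act.

```latex
\begin{aligned}
\frac{dz}{dt} &= \sum_{k=1}^{K} a_k(t)\, \nabla \phi_k(z) \;+\; \sum_{l=1}^{L} b_l(t)\, r_l(z), \\
\nabla \times \nabla \phi_k &= 0 \quad \text{(potential flow, curl-free)}, \qquad
\nabla \cdot r_l = 0 \quad \text{(rotational, divergence-free)} .
\end{aligned}
```

Under this reading, sparsity means that most coefficients $a_k(t)$ and $b_l(t)$ are zero at any instant, and the magnitude of each nonzero coefficient plays the role of the speed of flow along its field.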

Abstract:
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find through attribution analysis that these networks can only utilize a limited spatial range of input information. This implies that the potential of the Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively.

Abstract:
Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of complex synthesis tasks into multiple tractable steps. However, this introduces a proliferation of step-specific parameters to be configured for modulating the iterative generation process (e.g., mask ratio, noise level, or temperature at each step). Existing approaches typically rely on manually-designed scheduling rules to manage this complexity, demanding expert knowledge and extensive trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each individual sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network is introduced to adaptively determine the most suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks effectively. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments across five benchmark datasets (ImageNet-256 × 256 & 512 × 512, MS-COCO, CC3M, and LAION-5B) and four distinct generative paradigms validate the superiority of AdaGen . 
For example, AdaGen achieves better performance on DiT-XL with ∼3× lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible additional computational overhead.

Abstract:
3D human pose estimation from 2D keypoint observation has been used in many human-centered computer vision applications. In this work, we tackle the task by formulating a novel grid representation learning paradigm that relies on grid convolution (GridConv), mimicking the wisdom of regular convolution operations in image space. GridConv is defined based on Semantic Grid Transformation (SGT) which leverages a binary assignment matrix to map standard skeleton 2D pose onto a regular weave-like grid pose joint by joint. We provide two ways to implement SGT: handcrafted and learnable SGT. Surprisingly, both designs turn out to achieve promising results and the learnable one is better, demonstrating the great potential of this new lifting representation learning formulation. To improve the ability of GridConv to encode contextual cues, we introduce an attention module over the convolutional kernel, making grid convolution operations input-dependent, spatial-aware and grid-specific. Besides our spatial grid lifting network for single-frame input, we also present a spatial-temporal grid lifting network for video-based input, which relies on an efficient multi-scale grid learning strategy to encode spatial and temporal joint variations. Extensive experiments demonstrate that the proposed grid lifting network outperforms existing approaches by remarkable margins on Human3.6M and MPI-INF-3DHP datasets. Our grid lifting networks also exhibit good generalization ability across three other keypoint-based tasks: 3D hand pose estimation, head pose estimation, and action recognition.
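The role of the binary assignment matrix in SGT can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 5-joint skeleton, the 2×3 grid layout, and all names are hypothetical. Each grid cell copies the 2D coordinates of exactly one joint, so the grid pose is the product of the assignment matrix and the skeleton pose.

```python
# Hypothetical 5-joint skeleton with (x, y) per joint; names and values are illustrative only.
JOINTS = ["pelvis", "spine", "head", "l_hand", "r_hand"]
pose_2d = [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (-0.4, 0.6), (0.4, 0.6)]

# Handcrafted layout: the joint index that fills each cell of a 2x3 grid.
# A joint may appear more than once.
GRID_LAYOUT = [
    [3, 2, 4],  # l_hand, head, r_hand
    [1, 0, 1],  # spine, pelvis, spine
]


def build_assignment(layout, num_joints):
    """Return a binary matrix with one row per grid cell and a single 1 per row."""
    matrix = []
    for row in layout:
        for joint_idx in row:
            one_hot = [0] * num_joints
            one_hot[joint_idx] = 1
            matrix.append(one_hot)
    return matrix


def apply_sgt(assignment, pose, rows, cols):
    """Compute assignment @ pose and reshape the result to rows x cols cells of (x, y)."""
    flat = []
    for weights in assignment:
        x = sum(w * p[0] for w, p in zip(weights, pose))
        y = sum(w * p[1] for w, p in zip(weights, pose))
        flat.append((x, y))
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


S = build_assignment(GRID_LAYOUT, len(JOINTS))
grid_pose = apply_sgt(S, pose_2d, rows=2, cols=3)
```

In a learnable variant, the entries of the assignment matrix would be optimized rather than fixed by hand; how the paper parameterizes that is not covered by this sketch.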

Abstract:
Enabling machines to solve mathematical problems is a vital endeavor in developing intelligence that emulates human-like thinking and reasoning. However, most existing approaches focus on reconstructing human comprehension of problems, which is still far from enough since they neglect the fundamental human ability to learn knowledge from experiences. In this article, we focus on empowering models with the cognitive capacity to autonomously learn knowledge from mathematical problem-solving. We first propose a Cognitive Solver (CogSolver) that contains an intelligent BRAIN-ARM framework as the cognitive structure and operates the knowledge learning process in Store-Apply-Update steps inspired by two cognitive science theories. The BRAIN system stores three basic types of mathematical knowledge, and the ARM system applies them organically in the answer reasoning process. After solving problems, the BRAIN updates its stored knowledge based on the ARM’s feedback, with knowledge filters to eliminate redundancies and foster a more rational knowledge base. Our CogSolver carries out the above three steps iteratively, emulating a more human-like behavior. Furthermore, in order to overcome knowledge forgetting during the learning process, we extend CogSolver to CogSolver+ by incorporating an essential knowledge Recall mechanism, which is inspired by another prominent cognitive theory. We first discuss and fuse three crucial factors in simulating human memory replay. Then, we propose an influence-based method with a theoretical guarantee of efficiency to consolidate the updated knowledge. Experiments on three math word problem benchmarks demonstrate the improvements of our CogSolver and CogSolver+ in answer reasoning and clearly illustrate how they acquire knowledge, leading to superior interpretability.

Abstract:
With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones between nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively.
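For readers unfamiliar with the last modeling step, the following is a generic sketch of EM for a mixture of multivariate Bernoulli distributions on toy binary vectors. It only illustrates the standard algorithm; how LGRLN maps frames and summaries onto such a mixture is not specified in the abstract, and the function name, data, and initial values are hypothetical.

```python
import math


def em_bernoulli_mixture(data, means, weights, num_iters=20):
    """Fit a mixture of multivariate Bernoulli distributions with EM.

    data:    list of binary vectors (lists of 0/1)
    means:   initial Bernoulli parameters, one list per component
    weights: initial mixing weights (should sum to 1)
    """
    eps = 1e-9
    means = [list(m) for m in means]  # work on copies, leave inputs untouched
    weights = list(weights)
    n, d, k = len(data), len(data[0]), len(weights)
    resp = []
    for _ in range(num_iters):
        # E-step: responsibility of each component for each sample (log domain).
        resp = []
        for x in data:
            log_p = []
            for c in range(k):
                ll = math.log(weights[c] + eps)
                for j in range(d):
                    mu = min(max(means[c][j], eps), 1 - eps)
                    ll += math.log(mu) if x[j] else math.log(1 - mu)
                log_p.append(ll)
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and Bernoulli means.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / n
            means[c] = [
                sum(r[c] * x[j] for r, x in zip(resp, data)) / (mass + eps)
                for j in range(d)
            ]
    return means, weights, resp


# Two clearly separated binary patterns, three samples each.
data = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
means, weights, resp = em_bernoulli_mixture(
    data,
    means=[[0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.6, 0.6]],
    weights=[0.5, 0.5],
)
```

On this toy data each component ends up responsible for one of the two patterns, and the mixing weights stay balanced.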

Abstract:
One of the main challenges of federated learning (FL) is handling non-independent and identically distributed (non-IID) client data, which may occur in practice due to unbalanced datasets and use of different data sources across clients. Knowledge sharing and model personalization are key strategies for addressing this issue. Clustered federated learning is a class of FL methods that groups clients that observe similarly distributed data into clusters, such that every client is typically associated with one data distribution and participates in training a model for that distribution alongside its cluster peers. In this paper, we present a unified Bayesian framework for clustered FL which associates clients with clusters. Then we propose several practical algorithms to handle the otherwise growing data associations in a way that trades off performance and computational complexity. This work provides insights on client-cluster associations and enables client knowledge sharing in new ways. The proposed framework circumvents the need for unique client-cluster associations, which is seen to increase the performance of the resulting models in a variety of experiments.

Abstract:
Due to the complexity of data collection in the real world, Multi-view Representation Learning (MvRL) always encounters the incomplete information challenge, typically manifested as the Sample-missing Problem (SP) and the View-unaligned Problem (VP). Although several methods have been proposed, they fail to find a good trade-off among sample restoration, view alignment, and data diversity preservation. To address this issue, we take and mathematically formulate two sociological concepts for MvRL, i.e., community commonality and community versatility, where the former refers to the identical custom shared within the same community, and the latter refers to the similar but non-identical custom within communities of the same minority. One could find that the community commonality can enhance the compactness of view-specific clusters, and the community versatility can preserve the view diversity. Moreover, combining both of them could facilitate achieving robust MvRL with incomplete information. With the formulations, we propose a novel method dubbed Community-Aware Multi-viEw RepresentAtion learning with incomplete information (CAMERA). In brief, CAMERA employs a novel dual-stream network and an elaborate objective function that theoretically and empirically embraces community commonality and versatility. Extensive experimental results on seven datasets demonstrate that CAMERA remarkably outperforms 24 competitive multi-view learning methods on clustering, classification, and human action recognition tasks.

Abstract:
Change detection is essential in Earth observation, yet current models heavily rely on large-scale annotated datasets. Generative models offer a promising alternative by synthesizing training data, but generating temporally coherent image pairs with realistic, semantically meaningful changes remains a significant challenge. Existing approaches typically simulate changes by generating pre- and post-change label maps using either heuristic rules (e.g., copy-pasting) or text prompts. However, the former offers limited change diversity, while the latter often fails to maintain spatial consistency between image pairs. We observe that the noise space of diffusion models encodes strong generative capacity and spatial controllability: localized perturbations in the noise can yield meaningful, interpretable changes in corresponding image regions. Motivated by this, we propose Noise2Change, a framework for simulating change directly in the noise domain. The key idea is to manipulate the semantic composition of the initial noise sampled from the noise domain, such that the diffusion process generates structurally consistent pre- and post-change images reflecting realistic transformations. Since the unperturbed noise is shared between both images, the resulting pairs exhibit strong temporal alignment and semantic coherence, effectively addressing the trade-off between realism and consistency. Concretely, we employ a discrete diffusion model to extract high-level semantics from the initial noise. Guided by these semantics, we introduce a change simulation strategy that optimizes the noise to encode intended changes. The modified noise is then used to drive the diffusion process, yielding pre- and post-change label maps with natural structural transitions. These maps are passed through a unified framework for image generation and label refinement, producing highly aligned image-label pairs. Our framework supports diverse change types across a wide range of scenarios. 
Extensive experiments on multiple change detection tasks demonstrate that our method achieves superior performance compared to existing generative approaches.

Abstract:
Long-term motion generation is a challenging task that requires producing coherent and realistic sequences over extended durations. Current methods primarily rely on framewise motion representations, which capture only static spatial details and overlook temporal dynamics. This approach leads to significant redundancy across the temporal dimension, complicating the generation of effective long-term motion. To overcome these limitations, we introduce the novel concept of Lagrangian Motion Fields, specifically designed for long-term motion generation. By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of “supermotions” (analogous to superpixels). This method seamlessly integrates static spatial information with interpretable temporal dynamics, transcending the limitations of existing network architectures and motion sequence content types. Our solution is versatile and lightweight, eliminating the need for neural network preprocessing. Our approach excels in tasks such as long-term music-to-dance generation and text-to-motion generation, offering enhanced efficiency, superior generation quality, and greater diversity compared to existing methods. Additionally, the adaptability of Lagrangian Motion Fields extends to applications like infinite motion looping and fine-grained controlled motion generation, highlighting its broad utility.
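The idea of treating a joint as a particle with uniform velocity over a short interval can be illustrated with a minimal sketch. It assumes a single 1D joint trajectory, a fixed interval length, and a simple (start position, velocity) encoding; the function names are hypothetical and this is not the authors' implementation.

```python
def encode_supermotions(track, interval):
    """Compress a per-frame 1D trajectory into (start_position, velocity) pairs.

    Each pair covers up to `interval` frame steps and assumes constant velocity
    between the first and last frame of that span.
    """
    if len(track) < 2:
        raise ValueError("need at least two frames to estimate a velocity")
    supermotions = []
    for start in range(0, len(track) - 1, interval):
        end = min(start + interval, len(track) - 1)
        velocity = (track[end] - track[start]) / (end - start)
        supermotions.append((track[start], velocity))
    return supermotions


def decode_supermotions(supermotions, interval, num_frames):
    """Rebuild per-frame positions by moving at constant velocity inside each span."""
    frames = []
    for t in range(num_frames):
        idx = min(t // interval, len(supermotions) - 1)
        position, velocity = supermotions[idx]
        frames.append(position + velocity * (t - idx * interval))
    return frames


# A joint that moves right for 4 steps, then left for 4 steps (9 frames).
track = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
codes = encode_supermotions(track, interval=4)
rebuilt = decode_supermotions(codes, interval=4, num_frames=len(track))
```

Here nine frames are summarized by two (position, velocity) pairs, and because the toy motion is piecewise linear the reconstruction is exact; real motion would only be approximated.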

Abstract:
Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon in which patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method.

Abstract:
Out-of-distribution (OOD) detection is essential for the reliability of ML models. Most existing methods for OOD detection learn a fixed decision criterion from a given in-distribution dataset and apply it universally to decide if a data point is OOD. Recent work by Fang et al. (2022) shows that given only in-distribution data, it is impossible to reliably detect OOD data without extra assumptions. Motivated by the theoretical result and recent exploration of test-time adaptation methods, we propose a Non-Parametric Test Time Adaptation framework for Out-Of-Distribution Detection (AdaODD). Unlike conventional methods, AdaODD utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. The framework incorporates detected OOD instances into decision-making, reducing false positive rates, particularly when ID and OOD distributions overlap significantly. We demonstrate the effectiveness of AdaODD through comprehensive experiments on multiple OOD detection benchmarks. Extensive empirical studies show that AdaODD significantly improves the performance of OOD detection over state-of-the-art methods. Specifically, AdaODD reduces the false positive rate (FPR95) by 23.23% on the CIFAR-10 benchmarks and 38% on the ImageNet-1k benchmarks compared to the advanced methods. Lastly, we theoretically verify the effectiveness of AdaODD.
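FPR95 is conventionally the false positive rate on OOD samples at the threshold that retains 95% of ID samples. The sketch below shows that computation under the assumption that a higher score means "more ID-like"; the function name and the toy scores are made up for illustration and do not come from the paper.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold that keeps 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    keep = (95 * len(ranked) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    threshold = ranked[keep - 1]           # lowest score among the retained ID samples
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores: 20 ID samples spread over (0, 1], 4 OOD samples.
id_scores = [i / 20 for i in range(1, 21)]
ood_scores = [0.02, 0.08, 0.3, 0.7]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

With these toy scores, two of the four OOD samples score above the threshold, so half of them would be wrongly accepted as ID. A lower value indicates a better detector.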

Abstract:
Social dilemmas can be considered situations where individual rationality leads to collective irrationality. The multi-agent reinforcement learning community has leveraged ideas from social science, such as social value orientations (SVO), to solve social dilemmas in complex cooperative tasks. In this paper, we first introduce the typical “division of labor or roles” mechanism in human society, and provide a promising solution for intertemporal social dilemmas (ISD) with SVOs. A novel learning framework, called Learning Roles with Emergent SVOs (RESVO), is proposed to transform the learning of roles into the social value orientation emergence, which is symmetrically solved by endowing agents with altruism to share rewards with other agents. An SVO-based role embedding space is then constructed by individual conditioning policies on roles with a novel rank regularizer and mutual information maximizer. Experiments show that RESVO achieves a stable division of labor and cooperation in ISDs with different complexity.

Abstract:
The cost of finetuning a pretrained model on downstream tasks steadily increases as such models grow larger. Parameter-efficient transfer learning (PETL) is proposed to reduce this cost by changing only a tiny subset of trainable parameters. However, the GPU memory footprint during training is not effectively reduced in PETL. This issue happens because trainable parameters from these methods are generally tightly entangled with the backbone, such that a lot of intermediate states have to be stored for back propagation. To alleviate this issue, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream recognition tasks. We further extend DTL to more difficult tasks such as object detection and semantic segmentation by employing a sparser architectural design. Extensive experiments validate the effectiveness of DTL, which not only reduces a large amount of GPU memory usage and trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy.

Abstract:
Graph Contrastive Learning (GCL) methods typically leverage augmentation techniques to generate different graph views for comparison, thereby learning corresponding representations for graph-related tasks in label-scarce scenarios. However, existing GCL methods suffer from two primary limitations: 1) they use predefined or one-time perturbations for augmentation, ignoring adaptive noise injection during forward propagation and thus leading to suboptimal model robustness; 2) their contrast mechanisms mainly focus on the agreement of inter-graph representations while neglecting the dimensional feature redundancy within intra-graph representations. To solve these issues, we propose Layer-adaptive-augmentation-based Graph Contrastive Learning with feature Decorrelation (LGCLD). First, the designed layer-wise adaptive augmentation method performs dynamic perturbations while maintaining the semantic similarity between augmented and original graphs, which can improve model robustness. Second, we introduce an Agreement-Decorrelation loss (AD loss) that simultaneously optimizes the agreement between graph-level representations and the feature correlation among different dimensions within each graph-level representation, promoting the model to learn informative and non-redundant graph-level representations. Furthermore, we analyze the reasonableness of AD loss through the graph information bottleneck principle. Experiments on various-domain graph datasets demonstrate that LGCLD achieves better or competitive performance compared with a series of state-of-the-art baselines.

Abstract:
Foreground segmentation is a fundamental problem in computer vision, which includes salient object detection, forgery detection, defocus blur detection, shadow detection, and camouflage object detection. Previous works have typically relied on domain-specific solutions to address accuracy and robustness issues in those applications. In this paper, we present a unified framework for a number of foreground segmentation tasks without any task-specific designs. We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP and propose a new visual prompting model, named Explicit Visual Prompting (EVP). Different from the previous visual prompting which is typically a dataset-level implicit embedding, our key insight is to make the tunable parameters focus on the explicit visual content from each individual image, i.e., the features from frozen patch embeddings and high-frequency components. Our method freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters. Despite introducing only a small number of tunable parameters, EVP outperforms full fine-tuning and other parameter-efficient fine-tuning methods. Experiments on fourteen datasets across five tasks show the proposed method outperforms other task-specific methods while being considerably simpler. The proposed method demonstrates scalability across different architectures, pre-trained weights, and tasks.

Abstract:
Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss between the real data and the artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated using stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results in Gaussian mean estimation by showing our method has a much more favorable loss landscape and enjoys faster convergence; (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.
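The compositional structure mentioned above can be made concrete with a short derivation. The notation is assumed for illustration and is not taken from the abstract: $E_\theta$ is the energy of an unnormalized model $p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)$, $p_{\mathrm{data}}$ is the data distribution, and $q$ is a noise distribution with full support. Importance sampling against $q$ rewrites the partition function as an expectation, so the log partition function becomes the logarithm of an expectation:

```latex
\begin{aligned}
Z(\theta) &= \int \exp\bigl(-E_\theta(x)\bigr)\,dx
           = \mathbb{E}_{\tilde{x}\sim q}\!\left[\frac{\exp\bigl(-E_\theta(\tilde{x})\bigr)}{q(\tilde{x})}\right], \\
\mathcal{L}(\theta) &= \mathbb{E}_{x\sim p_{\mathrm{data}}}\bigl[E_\theta(x)\bigr] + \log Z(\theta)
           = \mathbb{E}_{x\sim p_{\mathrm{data}}}\bigl[E_\theta(x)\bigr] + f\bigl(g(\theta)\bigr), \\
f(u) &= \log u, \qquad
g(\theta) = \mathbb{E}_{\tilde{x}\sim q}\!\left[\frac{\exp\bigl(-E_\theta(\tilde{x})\bigr)}{q(\tilde{x})}\right].
\end{aligned}
```

The inner function $g$ can be estimated from noise samples, but because $f$ is nonlinear, plugging a mini-batch estimate of $g$ into $f$ gives a biased estimate of the objective and its gradient. This is the setting that stochastic compositional optimization is designed for.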

Abstract:
Electrical Impedance Tomography (EIT) provides a non-invasive, portable imaging modality with significant potential in medical and industrial applications. Despite its advantages, EIT encounters two primary challenges: the ill-posed nature of its inverse problem and the spatially variable, location-dependent sensitivity distribution. Traditional model-based methods mitigate ill-posedness through regularization but overlook sensitivity variability, while supervised deep learning approaches require extensive training data and lack generalization. Recent developments in neural fields have introduced implicit regularization techniques for image reconstruction; however, these methods often overlook the physical principles underlying EIT, thereby limiting their effectiveness. In this study, we propose PhyNC (Physics-driven Neural Compensation), an unsupervised deep learning framework that incorporates the physical principles of EIT. PhyNC addresses both the ill-posed inverse problem and the sensitivity distribution by dynamically allocating neural representational capacity to regions with lower sensitivity, ensuring accurate and balanced conductivity reconstructions. Extensive evaluations on both simulated and experimental data demonstrate that PhyNC outperforms existing methods in terms of detail preservation and artifact resistance, particularly in low-sensitivity regions. Our approach enhances the robustness of EIT reconstructions and provides a flexible framework that can be adapted to other imaging modalities with similar challenges.

Abstract:
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.

Abstract:
Sequential Model-Based Optimization (SMBO) is a highly effective strategy for hyperparameter search in machine learning. It utilizes a surrogate model that fits previous trials and approximates the hyperparameter response surface (performance). This surrogate model primarily guides the decision-making process for selecting the next set of hyperparameters. Existing classic surrogates, such as Gaussian processes and random forests, focus solely on the current task of interest and cannot incorporate trials from historical tasks. This limitation hinders their efficacy in various applications. Inspired by the state-of-the-art convolutional neural process, this paper proposes a novel meta-learning-based surrogate model for efficient and effective hyperparameter optimization. Our surrogate is trained on the meta-knowledge from a range of historical tasks, enabling it to accurately predict the hyperparameter response surface even with a limited number of trials on a new task. We tested our approach on the hyperparameter selection problem for the well-known support vector machine (SVM), residual neural network (ResNet), and vision transformer (ViT) across hundreds of real-world classification datasets. The empirical results demonstrate its superiority over existing surrogate models, highlighting the effectiveness of meta-learning in hyperparameter optimization.

Abstract:
Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($g^n$Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $g^n$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the proposed operation, we construct a new family of generic vision backbones for various visual modalities and tasks, including HorNet and HorFPN for image recognition, Hor3D for point cloud analysis, and HorCLIP for vision-language modeling. For image recognition, we propose HorNet as a stronger visual encoder, where we conduct extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from image encoders, we also show $g^n$Conv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. For point cloud analysis, we design Hor3D, demonstrating the efficacy of high-order interactions for unstructured point cloud data through experiments on challenging 3D semantic segmentation tasks in S3DIS and ScanNet V2. In vision-language modeling, our proposed HorCLIP surpasses mainstream Vision Transformer and ConvNeXt architectures with shorter training schedules on ImageNet zero-shot classification and shows remarkably higher performance on vision-language dense representation tasks on COCO Panoptic datasets. Our results demonstrate that $g^n$Conv with high-order spatial interactions can be a new basic operation for visual modeling that effectively combines the merits of both vision Transformers and CNNs.
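
The idea of reaching higher interaction orders by repeating a gating step can be illustrated with a toy 1-D example. This is only a sketch under stated assumptions, not the HorNet implementation: the functions `conv1d_same` and `recursive_gated_conv`, the fixed kernels, and the use of a plain 1-D signal are all hypothetical choices, and the channel projections and depth-wise convolutions of a real layer are omitted.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation that keeps the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def recursive_gated_conv(x, kernels):
    """Apply one gating step per kernel: p <- p * conv(x, kernel)."""
    p = list(x)
    for kernel in kernels:
        q = conv1d_same(x, kernel)         # spatially mixed copy of the input
        p = [a * b for a, b in zip(p, q)]  # element-wise gate
    return p


x = [1.0, 2.0, 0.5, -1.0]
kernels = [[0.25, 0.5, 0.25], [0.5, 0.0, 0.5]]  # hypothetical fixed kernels
y = recursive_gated_conv(x, kernels)
y_scaled = recursive_gated_conv([2.0 * v for v in x], kernels)
# Two gating steps make the output cubic in the input,
# so doubling x multiplies every output value by 2**3 = 8.
```

Each extra kernel adds one multiplicative interaction with the input while costing only one more convolution and one element-wise product, which is the intuition behind extending the interaction order cheaply.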

Affiliations: School of Computer Science and Engineering, Southeast University, Nanjing, China; School of Software, Northwestern Polytechnical University, Xi’an, China; State Key Laboratory for Novel Software Technology, School of Intelligence Science and Technology, Nanjing University, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China

Abstract:
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic and spatial structure features. The MG-FD module collaboratively performs feature decorrelation across temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter effectively facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.

Abstract:
It is prevalent to leverage unlabeled data to train deep learning models when it is difficult to collect large-scale annotated datasets. However, for 3D gaze estimation, most existing unsupervised learning methods face challenges in distinguishing subtle gaze-relevant information from dominant gaze-irrelevant information. To address this issue, we propose an unsupervised learning framework to disentangle the gaze-relevant and the gaze-irrelevant information, by seeking the shared information of a pair of input images with the same gaze and with the same eye respectively. Specifically, given two images, the framework finds their shared information by first encoding the images into two latent features via two encoders and then switching part of the features before feeding them to the decoders for image reconstruction. We theoretically prove that the proposed framework is able to encode different information into different parts of the latent feature if we properly select the training image pairs and their shared information. Based on the framework, we derive Cross-Encoder and Cross-Encoder++ to learn gaze representation from the eye images and face images, respectively. Experiments on public gaze datasets demonstrate that the Cross-Encoder and Cross-Encoder++ outperform the competitive methods. The ablation study quantitatively and qualitatively shows that the gaze feature is successfully extracted.

Abstract:
Prompt tuning, a recently emerging paradigm, adapts vision-language pre-trained models to new tasks efficiently by learning “soft prompts” for frozen models. However, in few-shot scenarios, its effectiveness is limited by sensitivity to the initialization and the time-consuming search for optimal initialization, hindering rapid adaptation. Additionally, prompt tuning risks reducing the models’ generalizability due to overfitting on scarce training samples. To overcome these challenges, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability in a meta-learning paradigm using only the weakly labeled image-text pre-training data. This is achieved through a Cross-Modal Hierarchical Clustering algorithm that organizes extensive image-text data into a structured hierarchy, facilitating robust meta-learning across diverse domains. Rather than designing a specific prompt tuning method, our GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way and bring about consistent improvement for them. Further, we consider a more practical but challenging setting: test-time prompt tuning with only unlabeled test samples and propose an improved structure-induced gradient regulating function to leverage the structured semantics of the meta-learning data for zero-shot generalization. This novel approach exploits the hierarchically clustered meta-learning data to model relationships between test-time data and meta-learning prototypes, facilitating the transfer of invariant knowledge without explicit annotations. Meanwhile, we introduce a structure complexity-informed strategy for adaptively constructing meta-training tasks and generating prototypes, which fully considers the diverse semantics within hierarchical clusters of different complexities. 
Comprehensive experiments demonstrate the state-of-the-art few- and zero-shot generalizability of our method.

Abstract:
Sentence-level semantics plays a key role in language understanding. There exist subtle relations and dependencies among sentence-level samples, which remain to be exploited. For example, in relational triple extraction (RTE), existing models overemphasize extraction modules and ignore the sentence-level semantics and relation information, which causes two problems: (1) the semantics fed to extraction modules are relation-unaware; (2) each sample is trained individually without considering inter-sample dependency. To address these issues, we first propose the model-agnostic multi-relation detection task, which incorporates relation information into text encoding to generate relation-aware semantics. Then we propose the model-agnostic multi-relation supervised contrastive learning, which leverages the relation-derived inter-sample dependencies as a supervised signal to learn discriminative semantics by drawing together or pushing apart sentence-level semantics according to whether they share the same or similar relations. In addition, we design reverse label frequency weighting and hierarchical label embedding mechanisms to alleviate label imbalance and integrate the relation hierarchy. Our method can be applied to any RTE model, and we conduct extensive experiments on five backbones by augmenting them with our method. Experimental results on four public benchmarks show that our method brings significant and consistent improvements to various backbones, and model analysis further verifies its effectiveness.

Abstract:
3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the GoogleEarth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.

Abstract:
Designing expressive hypergraph kernels that can effectively capture high-order structural information is a fundamental challenge in hypergraph learning. In this paper, we propose a novel comparison framework based on hypergraph homomorphisms to evaluate and compare the expressive ability of existing hypergraph kernels. We revisit classical kernels such as Hypergraph Weisfeiler-Lehman (HG WL) and Hypergraph Rooted kernels, providing theoretical conditions under which they fail to distinguish non-isomorphic hypergraphs. Motivated by these insights, we introduce the Hypergraph Subtree-Cycle Kernel, which augments subtree-based features with cycle-based structural patterns to enhance expressiveness. We propose two variants: HG SCKernelv1 and HG SCKernelv2. Extensive experiments on five graph and ten hypergraph classification benchmarks demonstrate the superior performance of our methods, confirming the effectiveness of integrating homomorphism-guided design into hypergraph kernels.

Abstract:
Given a collection of points in $\mathbb{R}^3$, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) algorithms that rely on spatial partitioning and indexing techniques. However, when the query point is far from the data points or the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a speed increase of 1-10x. Furthermore, our algorithm demonstrates significant practical value in diverse applications. We validated its effectiveness through extensive testing in four key applications: Point-to-Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point-to-Segments Distance Queries. A particularly notable feature of our approach is its unique ability to efficiently identify the nearest neighbor among the first $k$ points in the point cloud, a capability that enables substantial acceleration in low-dimensional applications like Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted for farthest-point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.
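
To make the "nearest neighbor among the first $k$ points" query concrete, the following brute-force baseline shows what the query computes and why Density Peak Clustering needs it: once points are sorted by decreasing density, each point's distance to its nearest denser point is exactly such a prefix query. This is a reference sketch only, not the DAG-based method of the abstract; the function names are hypothetical, ties in density are assumed to be broken by the sort order, and the densest point is given an infinite distance here for simplicity.

```python
import math


def nearest_among_first_k(points, query, k):
    """Return (index, distance) of the point closest to `query` among points[:k]."""
    best_idx, best_dist = -1, math.inf
    for i in range(min(k, len(points))):
        d = math.dist(points[i], query)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist


def density_peak_deltas(points_by_density):
    """For points sorted by decreasing density, distance to the nearest denser point."""
    deltas = [math.inf]  # the densest point has no denser neighbor
    for i in range(1, len(points_by_density)):
        _, d = nearest_among_first_k(points_by_density, points_by_density[i], i)
        deltas.append(d)
    return deltas


# Hypothetical points, assumed to be already sorted by decreasing density.
pts = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
deltas = density_peak_deltas(pts)
```

The baseline costs quadratic time over the whole point set, which is the cost a precomputed proximity structure aims to avoid.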

Abstract:
Multi-task learning (MTL) is a standard learning paradigm in machine learning. The central idea of MTL is to capture the shared knowledge among multiple tasks to mitigate the problem of data sparsity, where the annotated samples for each task are quite limited. Recent studies indicate that graph multi-task learning (GMTL) yields promising improvements over previous MTL methods. GMTL represents tasks on a task relation graph and further leverages graph neural networks (GNNs) to learn complex task relationships. Although GMTL achieves better performance, the construction of the task relation graph heavily depends on simple heuristic tricks, which results in spurious task correlations and in missing edges between tasks with strong connections. This problem largely limits the effectiveness of GMTL. To this end, we propose the Generative Causality-driven Network (GCNet), a novel framework that progressively learns the causal structure between tasks to discover which tasks benefit from being jointly trained, improving generalization ability and model robustness. Specifically, in the feature space, GCNet first introduces a feature-level generator to generate the structure prior, which reduces learning difficulty. Afterwards, GCNet develops an output-level generator, parameterized as a new causal energy-based model (EBM), to refine the learned structure prior in the output space driven by causality. Benefiting from our proposed causal framework, we theoretically derive an intervention contrastive estimation for training this causal EBM efficiently. Experiments are conducted on multiple synthetic and real-world datasets. Extensive empirical results and model analyses demonstrate the superior performance of GCNet over several competitive MTL baselines.

Abstract:
We explore the potential of the pretrain-and-finetune paradigm for RGB-D semantic segmentation to solve the common mismatch problem in this field. Specifically, we present DFormer++, a novel RGB-D pretrain-and-finetune framework to learn transferable representations for RGB-D semantic segmentation. This paper has two vital innovations. 1) Framework perspective: Different from the existing methods that finetune an RGB-pretrained backbone on RGB-D scenes, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence the model is endowed with the capacity to encode RGB-D representations; 2) Architecture perspective: Our model comprises a sequence of RGB-D attention blocks, which are tailored for encoding both RGB and depth information through a novel attention mechanism. Our DFormer++ avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in previous works but has not been resolved. Meanwhile, the tailored architecture greatly reduces redundant parameters for encoding RGB-D data and achieves efficient and accurate perception. Experimental results show that our DFormer++ achieves new cutting-edge performance on three popular RGB-D semantic segmentation benchmarks.

Abstract:
The concept of viewing graph solvability has gained significant interest in the context of structure-from-motion. A viewing graph is a mathematical structure where nodes are associated with cameras and edges represent the epipolar geometry connecting overlapping views. Solvability studies under which conditions the cameras are uniquely determined by the graph. In this paper we propose a novel framework for analyzing solvability problems based on algebraic geometry, demonstrating its potential in understanding structure-from-motion graphs and proving a conjecture that was previously proposed.

Abstract:
Explainability in Graph Neural Networks (GNNs) has shown considerable promise in bolstering their trustworthiness, credibility and transparency. Our research delves into the assessment of explainability within GNNs, a pivotal factor for ensuring the reliability of explainability techniques in real-world applications. Existing evaluation metrics, typically involving taking explanatory subgraphs as inputs and measuring output differences, often face out-of-distribution (OOD) challenges. This issue occurs when explanatory subgraphs do not align with real-world data distributions, affecting the reliability of model explanations. With this in mind, in this work, we endeavor to confront this issue by introducing a novel evaluation metric, termed OOD-resistant Adversarial Robustness (OAR). Specifically, our approach is inspired by adversarial robustness, assessing the resilience of explanation subgraphs to attacks. Additionally, we incorporate a sophisticated OOD reweighting mechanism within the evaluation framework to ensure that assessments remain aligned with the original data distribution. Going beyond this, to accommodate a wider range of evaluation tasks, we further devise a counterfactual attack module and complement the perturbed subgraph using the conditional graph diffusion model. The refined paradigm, termed OAR+, ensures that our metric is versatile and applicable across various contexts. Furthermore, we establish a standardized framework, which serves as a benchmark for evaluating the fairness and accuracy of different metrics. We conduct extensive experiments to validate the effectiveness of the OAR and OAR+.

Abstract:
With a strong alignment between the training and test distributions, object relation as a context prior facilitates object detection. Yet, it turns into a harmful but inevitable training set bias upon test distributions that shift differently across space and time. Nevertheless, the existing detectors cannot incorporate a deployment context prior during the test phase without parameter update. Such a capability requires the model to explicitly learn disentangled representations with respect to the context prior. To achieve this, we introduce an additional graph input to the detector, where the graph represents the deployment context prior, and its edge values represent object relations. Then, the detector's behavior is trained to be bound to the graph with a modified training objective. As a result, during the test phase, any suitable deployment context prior can be injected into the detector via graph edits, hence “re-biasing” the detector towards the given prior at run-time without parameter update. Even if the deployment prior is unknown, the detector can self-rebias using a deployment prior approximated from its own predictions. Comprehensive experimental results on the COCO dataset, as well as cross-dataset testing on the Objects365 dataset, demonstrate the effectiveness of the run-time re-biasable detector.

Abstract:
Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on the problem of deepfake detection, which aims to reliably tell deepfakes apart from real videos in an objective way. On the other hand, the subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the photorealism assessment of deepfakes, which is defined as the automatic assessment of deepfake photorealism that approximates human perception of deepfakes. It is important for evaluating the quality and deceptiveness of deepfakes, which can be used for predicting the influence of deepfakes on the Internet, and it also has potential for improving the deepfake generation process by serving as a critic. This paper promotes this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake photoREalism AssessMent. It comprises a deepfake video dataset of diverse quality, a large-scale annotation that includes 140,000 photorealism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 18 representative photorealism assessment methods, including recent large vision language model based methods and a newly proposed description-aligned CLIP method. The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.

Abstract:
Nighttime hazy vision is severely limited by the presence of haze and multi-colored light sources. Different from the daytime image dehazing task which has been widely studied, less progress has been made in nighttime image dehazing. In this paper, through extensive analysis and experimentation, we find that game engine simulations offer strong real-world generalization but suffer from unrealistic brightness. To tackle this, we introduce a three-step, brightness-aware synthetic-to-real learning approach. First, we use supervised learning to train a spatial-frequency network (SFN) on synthetic data to produce pseudo-labels. With these pseudo-labels, we develop a semi-supervised dehazing model (SFN+) that minimizes domain discrepancy through a brightness consistency loss applied to local windows. Building on SFN+, we fine-tune the model for better vision using a relative brightness improvement strategy that accounts for color shifts from lighting and brightness shifts during enhancement (SFN++). Experiments on popular benchmark datasets confirm our method’s superiority over state-of-the-art approaches.

Abstract:
Vector quantization (VQ) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a code-independent manner is highly challenging, which may be a key reason for codebook collapse, i.e., some code vectors are rarely optimized and eventually die off because the relationships between codes and good codebook priors are disregarded. In this paper, inspired by pretrained language models, we find that these language models have actually pretrained a superior codebook on a large text corpus, but such information is rarely exploited in VQ. To this end, we propose a novel codebook transfer framework with vision-to-language translation, called VQCT-VLT, which aims to transfer a well-trained codebook from pretrained language models to VQ for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook with these priors to achieve codebook transfer. Finally, a novel codebook transfer network is designed to exploit the abundant semantic relationships between codes contained in pretrained codebooks for robust codebook learning. Although the above version can achieve superior image synthesis performance, we find it difficult to align the learned codebook with text semantics. To this end, we introduce image captions as auxiliary supervisory information and then design a vision-to-language translation module to further achieve vision-language-aligned codebook learning. Experimental results on various tasks show that our VQCT-VLT method achieves superior performance over previous state-of-the-art VQ methods.

Abstract:
Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. The consistent and substantial improvements observed across multiple MMDiT based text-to-image models such as SD3, SD3.5 and FLUX provide strong evidence of the general applicability of our method.

Abstract:
Federated heterogeneity refers to the disparities in data distributions, model architectures, and communication capabilities across various devices or institutional entities. In real-world scenarios, statistical heterogeneity can often lead to ineffective aggregation, severely impacting generalization performance and resulting in biased or unstable model weights. Theoretically, distributional robustness analysis indicates that the generalization performance of a learning model can be bounded with respect to any heterogeneity distribution. This insight motivates us to reconsider the aggregation strategy in federated statistical heterogeneity scenarios, and we thus propose a new weighting aggregation protocol that considers the generalization bound disagreement of each local model. Specifically, we estimate the upper and lower bounds of the second-order origin moment of the shifted distribution for the current local model, and use these bound disagreements as the aggregation proportions for weights in each communication round. Our experiments demonstrate that this proposed aggregation protocol significantly improves the performance of several representative Federated Learning algorithms on benchmark datasets.
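
The aggregation step described above can be sketched as a weighted average whose proportions come from each client's bound disagreement. This is a minimal illustration under stated assumptions: the abstract does not say how the bounds are estimated or how the disagreement is normalized, so the function name `aggregate`, the definition of disagreement as upper minus lower bound, the normalization by the sum, and all numbers below are hypothetical.

```python
def aggregate(local_weights, lower_bounds, upper_bounds):
    """Average client weight vectors, weighted by normalized bound disagreement."""
    gaps = [u - l for l, u in zip(lower_bounds, upper_bounds)]  # assumed disagreement
    total = sum(gaps)
    if total <= 0.0:
        raise ValueError("bound disagreements must sum to a positive value")
    proportions = [g / total for g in gaps]
    dim = len(local_weights[0])
    merged = [
        sum(p * w[j] for p, w in zip(proportions, local_weights))
        for j in range(dim)
    ]
    return proportions, merged


# Three hypothetical clients with two-dimensional weight vectors.
clients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lower = [0.5, 1.0, 1.0]
upper = [1.0, 2.0, 1.5]
proportions, merged = aggregate(clients, lower, upper)
```

In a real system the proportions would be recomputed in every communication round from freshly estimated bounds, as the abstract states.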

Abstract:
Skip connections are an essential ingredient for modern deep models to be deeper and more powerful. Despite their huge success in normal scenarios (state-of-the-art classification performance on natural examples), we investigate and identify an interesting property of skip connections under adversarial scenarios, namely, the use of skip connections allows easier generation of highly transferable adversarial examples. Specifically, in ResNet-like models (with skip connections), we find that biasing backpropagation to favor gradients from skip connections, while suppressing those from residual modules via a decay factor, allows one to craft adversarial examples with high transferability. Based on this insight, we propose the Skip Gradient Method (SGM). Although starting from ResNet-like models in vision domains, we further extend SGM to more advanced architectures, including Vision Transformers (ViTs), models with varying-length paths, and other domains such as natural language processing. We conduct comprehensive transfer-based attacks against diverse model families, including ResNets, Transformers, Inceptions, Neural Architecture Search-based models, and Large Language Models (LLMs). The results demonstrate that employing SGM can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, we demonstrate that SGM can still be effective under more challenging settings such as ensemble-based attacks, targeted attacks, and attacks against defense-equipped models. Finally, we provide theoretical explanations and empirical insights on how SGM works. Our findings not only motivate new adversarial research into the architectural characteristics of models but also open up further challenges for secure model architecture design.
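
The effect of the decay factor can be seen on a toy chain of scalar residual blocks $z \leftarrow z + a\tanh(z)$. The forward pass is left unchanged; only the backward pass scales the derivative of each residual branch by `gamma`, so the skip path (which contributes the constant 1) dominates as `gamma` shrinks. This is an illustrative sketch, not the authors' implementation: the scalar blocks, the `tanh` branch, and the names `forward` and `input_gradient` are assumptions.

```python
import math


def forward(x, scales):
    """Run scalar residual blocks z <- z + a * tanh(z); return output and block inputs."""
    z, inputs = x, []
    for a in scales:
        inputs.append(z)
        z = z + a * math.tanh(z)
    return z, inputs


def input_gradient(x, scales, gamma=1.0):
    """d(output)/d(input), with each residual-branch derivative scaled by gamma."""
    _, inputs = forward(x, scales)
    grad = 1.0
    for a, z in zip(scales, inputs):
        branch = a * (1.0 - math.tanh(z) ** 2)  # derivative of a * tanh(z)
        grad *= 1.0 + gamma * branch            # the skip path contributes the 1.0
    return grad


scales = [0.8, -0.5, 1.2]  # hypothetical branch strengths
x0 = 0.3
true_grad = input_gradient(x0, scales, gamma=1.0)     # ordinary backpropagation
decayed_grad = input_gradient(x0, scales, gamma=0.5)  # residual gradients halved
skip_only = input_gradient(x0, scales, gamma=0.0)     # only skip paths remain
```

With `gamma=1.0` the function reproduces the exact derivative, and with `gamma=0.0` it reduces to the pure skip-path gradient of 1; an attack would use the decayed gradient in place of the true one when updating the adversarial input.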

Abstract:
Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.

Abstract:
We present a theoretical framework analyzing the relationship between data distributions and fairness guarantees in deep learning. Our work establishes novel bounds that explicitly account for data distribution heterogeneity across demographic groups, while introducing a formal analysis framework that minimizes expected loss differences across these groups. Moreover, we derive bounds for fairness errors and convergence rates, characterizing how distributional differences between groups affect the fundamental trade-off between fairness and accuracy. Through extensive experiments on diverse datasets across various modalities (image, tabular data, and text), including FairVision (eye disease detection), CheXpert (pleural effusion detection), HAM10000 (skin lesion classification), FairFace (facial attribute recognition), ACS Income (income prediction), and CivilComments-WILDS (toxic comment detection), we validate our theoretical findings and demonstrate that differences in feature distributions across demographic groups significantly impact model fairness, with performance disparities particularly pronounced in racial categories. The theoretical bounds we derive corroborate these empirical observations, providing insights into the fundamental limits of achieving fairness in deep learning models when faced with heterogeneous data distributions. This work advances our understanding of fairness in AI and provides a theoretical foundation for developing more equitable algorithms. Motivated by these theoretical insights, particularly the link between feature distribution shifts and fairness gaps, we propose Fairness-Aware Regularization (FAR), a practical training objective that directly minimizes inter-group discrepancies in feature centroids and covariances to improve equitable performance. We validate the effectiveness of FAR across all datasets considered in this study, consistently observing improvements in overall AUC, ES-AUC, and subgroup performance.
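
A penalty on inter-group discrepancies in feature centroids and covariances could look like the sketch below. The abstract does not give the exact form of FAR, so everything here is an assumption made for illustration: the squared Euclidean distance between centroids, the squared Frobenius distance between covariance matrices, the equal weighting of the two terms, the restriction to two groups, and the name `far_penalty`.

```python
def centroid(feats):
    """Mean feature vector of a list of equally sized feature vectors."""
    n, d = len(feats), len(feats[0])
    return [sum(f[j] for f in feats) / n for j in range(d)]


def covariance(feats):
    """Biased (divide-by-n) covariance matrix of a list of feature vectors."""
    n, d = len(feats), len(feats[0])
    mu = centroid(feats)
    return [
        [sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in feats) / n for j in range(d)]
        for i in range(d)
    ]


def far_penalty(group_a, group_b):
    """Squared centroid distance plus squared Frobenius covariance distance."""
    mu_a, mu_b = centroid(group_a), centroid(group_b)
    cov_a, cov_b = covariance(group_a), covariance(group_b)
    d = len(mu_a)
    mean_term = sum((mu_a[j] - mu_b[j]) ** 2 for j in range(d))
    cov_term = sum(
        (cov_a[i][j] - cov_b[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return mean_term + cov_term


# Hypothetical 2-D features for two demographic groups.
group_a = [[0.0, 0.0], [2.0, 0.0]]
group_b = [[1.0, 1.0], [1.0, 3.0]]
penalty = far_penalty(group_a, group_b)
```

During training such a term would be added to the task loss and computed on batch features inside an automatic-differentiation framework; the pure-Python version only shows the quantity being penalized.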

Abstract:
Random features (RFs) provide an efficient approximation to kernel methods, and allow for scalable learning on large datasets by reducing computational complexity while maintaining strong theoretical guarantees. However, real-world data can often be contaminated by outliers or heavy-tailed noise, which significantly degrades the performance of standard RF algorithms. To address this issue, we propose a robust and adaptive regularized least squares method with random features (RRLS-RF) that incorporates response truncation. The truncation level adaptively balances robustness and bias based on the sample size and moment conditions. We establish the generalization properties of RRLS-RF by assuming only a bounded $(1+\delta)$-th moment for any $\delta > 0$. Specifically, our analysis shows that RRLS-RF achieves learning rates of $\mathcal{O}(|D|^{-\frac{\delta}{2\delta+2}})$ with only $\mathcal{O}(|D|^{\frac{\delta}{2\delta+2}}\log |D|)$ random features, where $|D|$ denotes the training sample size. These results converge to the optimal learning rates of $\mathcal{O}(|D|^{-\frac{1}{2}})$ as $\delta \rightarrow \infty$, covering the traditional boundedness or sub-Gaussian assumptions in the regularized least squares method with random features (RLS-RF). Furthermore, we refine our analysis and show that RRLS-RF can achieve even faster learning rates under source and capacity conditions, as well as a smaller number of RFs with data-dependent sampling strategies. The derived sharp learning rates can also cover the mis-specified settings where the true function may not precisely align with the assumed kernel space. We further establish the first minimax lower bound under the weak moment condition, which shows that the RRLS-RF estimator is optimal over a wide range of source conditions.
Our numerical experiments and real data analysis verify the theoretical results and demonstrate the superior robustness of RRLS-RF against outliers and heavy-tailed noise compared to standard methods.
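
Two ingredients of the abstract above can be illustrated in a few lines: response truncation and the stated rate exponent delta / (2 * delta + 2). This is a sketch, not the RRLS-RF estimator itself. The abstract does not give the formula for the adaptive truncation level, so `level` is passed in as an assumed parameter, and the constants hidden in the big-O bounds are ignored. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def truncate_responses(ys: Sequence[float], level: float) -> List[float]:
    """Clip every response to the interval [-level, level]."""
    if level <= 0:
        raise ValueError("truncation level must be positive")
    return [max(-level, min(level, y)) for y in ys]


def rate_exponent(delta: float) -> float:
    """Exponent delta / (2 * delta + 2) appearing in the stated learning rate."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return delta / (2.0 * delta + 2.0)


def feature_budget(n: int, delta: float) -> float:
    """n^(delta / (2 * delta + 2)) * log(n), ignoring the unspecified constant."""
    return n ** rate_exponent(delta) * math.log(n)
```

As delta grows, `rate_exponent(delta)` approaches 1/2, matching the limiting rate quoted in the abstract; for delta = 1 it equals 1/4.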

Abstract:
Machine vision systems, which can efficiently manage extensive visual perception tasks, are becoming increasingly popular in industrial production and daily life. Due to the challenge of simultaneously obtaining accurate depth and texture information with a single sensor, multimodal data captured by cameras and LiDAR is commonly used to enhance performance. Additionally, cloud-edge cooperation has emerged as a novel computing approach to improve user experience and ensure data security in machine vision systems. This paper proposes a pioneering solution to address the feature compression problem in multimodal 3D object detection. Given a sparse tensor-based object detection network at the edge device, we introduce two modes to accommodate different application requirements: Transmission-Friendly Feature Compression (T-FFC) and Accuracy-Friendly Feature Compression (A-FFC). In T-FFC mode, only the output of the last layer of the network’s backbone is transmitted from the edge device. The received feature is processed at the cloud device through a channel expansion module and two spatial upsampling modules to generate multi-scale features. In A-FFC mode, we expand upon the T-FFC mode by transmitting two additional types of features. These added features enable the cloud device to generate more accurate multi-scale features. Experimental results on the KITTI dataset using the VirConv-L detection network showed that T-FFC was able to compress the features by a factor of 4933 with less than a 3% reduction in detection performance. On the other hand, A-FFC compressed the features by a factor of about 733 with almost no degradation in detection performance. We also designed optional residual extraction and 3D object reconstruction modules to facilitate the reconstruction of detected objects. The reconstructed objects effectively reflected the shape, occlusion, and details of the original objects.

Abstract:
Practical Bayes filters often assume the state distribution of each time step to be Gaussian for computational tractability, resulting in the so-called Gaussian filters. When facing nonlinear systems, Gaussian filters such as the extended Kalman filter (EKF) or unscented Kalman filter (UKF) typically rely on certain linearization techniques, which can introduce large estimation errors. To address this issue, this paper reconstructs the prediction and update steps of Gaussian filtering as solutions to two distinct optimization problems, whose optimality conditions are found to have analytical forms from Stein’s lemma. It is observed that the stationary point for the prediction step requires calculating the first two moments of the prior distribution, which is equivalent to that step in existing moment-matching filters. In the update step, instead of linearizing the model to approximate the stationary points, we propose an iterative approach to directly minimize the update step’s objective to avoid linearization errors. For the purpose of performing the steepest descent on the Gaussian manifold, we derive its natural gradient, which leverages the Fisher information matrix to adjust the gradient direction, accounting for the curvature of the parameter space. Combining this update step with moment matching in the prediction step, we introduce a new iterative filter for nonlinear systems called the Natural Gradient Gaussian Approximation filter, or NANO filter for short. We prove that the NANO filter locally converges to the optimal Gaussian approximation at each time step. Furthermore, the estimation error is proven to be exponentially bounded for nearly linear measurement equations and low noise levels by constructing a supermartingale-like property across consecutive time steps. Real-world experiments demonstrate that, compared to popular Gaussian filters such as the EKF, UKF, iterated EKF, and posterior linearization filter, the NANO filter reduces the average root mean square error by approximately 45% while maintaining a comparable computational burden.
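
For readers unfamiliar with natural-gradient descent, the generic form of the step referred to above can be written as follows. This is the textbook definition, not the paper's derived closed form: the symbols $J$ (the update-step objective), $\eta$ (a step size), and $q_\theta$ (the Gaussian with parameters $\theta = (\mu, \Sigma)$) are notation assumed here for illustration.

```latex
% Generic natural-gradient step on the Gaussian parameters \theta = (\mu, \Sigma)
\theta_{k+1} = \theta_k - \eta \, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}_{x \sim q_\theta}\!\left[
  \nabla_\theta \log q_\theta(x) \, \nabla_\theta \log q_\theta(x)^{\top}
\right]
```

Preconditioning by the inverse Fisher information matrix $F(\theta)^{-1}$ is what distinguishes this step from an ordinary gradient step in the raw parameters.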

Abstract:
Point cloud completion aims to reconstruct the geometry of partial point clouds captured by various sensors. Traditionally, point cloud models are trained on synthetic datasets that feature limited categories and differ significantly from real-world scenarios. This gap often causes existing methods to struggle when faced with unfamiliar categories and severe incompleteness in real-world applications. In this paper, we propose PrototypeCompletion, a novel prototype-based approach for point cloud completion. The method begins by generating rough prototypes, which are then refined with additional geometric details to make the final prediction. We introduce two distinct approaches for integrating prototypes into the network: explicit prototypes and implicit prototypes. Our approach demonstrates strong generalization capabilities, allowing it to handle point cloud completion for a variety of unseen categories beyond the training data. We demonstrate that incorporating language prompts into the training of point cloud completion models significantly expands their applicability and enhances their performance in diverse point cloud completion tasks. Furthermore, we propose a new evaluation metric and a test benchmark based on ScanNet200 and KITTI, designed to assess the model’s performance in real-world scenarios and foster future research in the field. Experimental results show that our method outperforms state-of-the-art models on the existing PCN and ShapeNet34 benchmarks and also excels in various real-world settings, handling different object categories and sensor types effectively. The code will be made publicly available.

Abstract:
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Leveraging these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which fully takes advantage of the prior knowledge of multimodal large language models and the human-like reasoning across diverse modality information. The key idea of our model is to reframe the trajectory prediction task into a visual question answering format, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all the inputs into a natural language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from input images are also transformed into tokens via a modality encoder and connector. The transformed data is then formatted to be used in a language model. Next, in order to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer with the prompt data to effectively separate integer and decimal parts, allowing us to capture correlations between consecutive numbers in the language model. We then train our language model using all the visual question answering prompts. During model inference, we implement both deterministic and stochastic prediction methods through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that the language-based model can be a powerful pedestrian trajectory predictor and outperforms existing numerical-based predictors. Extensive experiments show that VLMTraj can successfully understand social relationships and accurately extrapolate the multimodal futures on public pedestrian trajectory prediction benchmarks.
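
The two text-side steps described above, turning a coordinate history into a prompt and separating the integer and decimal parts of each number, can be sketched as follows. This is an illustration only: the abstract describes a tokenizer that is optimized on prompt data, whereas the rule-based `split_number` below is a hand-written stand-in, and the prompt wording, the default `horizon`, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_number(value: float, decimals: int = 2) -> List[str]:
    """Split a number into sign, integer-part, point, and decimal-part tokens.

    Assumes decimals >= 1 so that the formatted text contains a decimal point.
    """
    text = f"{abs(value):.{decimals}f}"
    integer_part, decimal_part = text.split(".")
    tokens = ["-"] if value < 0 else []
    return tokens + [integer_part, ".", decimal_part]


def trajectory_to_prompt(
    points: Sequence[Tuple[float, float]],
    horizon: int = 12,  # assumed number of future steps to request
    decimals: int = 2,
) -> str:
    """Render a coordinate history as a question for a language model."""
    coords = ", ".join(
        f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in points
    )
    return (
        f"The pedestrian's past positions are {coords}. "
        f"Predict the next {horizon} positions."
    )
```

For example, `split_number(3.14159)` yields `["3", ".", "14"]`, so the integer and decimal parts end up in separate tokens.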

Abstract:
Cloth-Changing Person Re-Identification (CC-ReID) aims to recognize individuals across camera views despite clothing variations, a crucial task for surveillance and security systems. Existing methods typically frame it as a cross-modal alignment problem but often overlook explicit modeling of interference factors such as clothing, viewpoints, and pedestrian actions. This oversight can distort their impact, compromising the extraction of robust identity features. To address these challenges, we propose a novel framework that systematically disentangles interference factors from identity features while ensuring the robustness and discriminative power of identity representations. Our approach consists of two key components. First, a dual-stream identity feature learning framework leverages a raw image stream and a cloth-isolated stream, to extract identity representations independent of clothing textures. An adaptive cloth-irrelevant contrastive objective is introduced to mitigate identity feature variations caused by clothing differences. Second, we propose a Text-Driven Conditional Generative Adversarial Interference Disentanglement Network (T-CGAIDN), to further suppress interference factors beyond clothing textures, such as finer clothing patterns, viewpoint, background, and lighting conditions. This network incorporates a multi-granularity interference recognition branch to learn interference-related features, a conditional adversarial module for bidirectional transformation between identity and interference feature spaces, and an interference decoupling objective to eliminate interference dependencies in identity learning. Extensive experiments on public benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, highlighting its effectiveness in CC-ReID.

Abstract:
The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata—specifically exchangeable image file format (EXIF) tags—to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.
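
The pretext task above ranks pairs of photographs by ordinal or continuous EXIF tags such as focal length. The abstract does not name the ranking loss, so the sketch below assumes a logistic (RankNet-style) pairwise loss; `score_a` and `score_b` stand for hypothetical scalar outputs of the feature extractor for two images, and `tag_a` and `tag_b` for their EXIF values.

```python
import math


def pairwise_ranking_loss(
    score_a: float, score_b: float, tag_a: float, tag_b: float
) -> float:
    """Logistic loss encouraging the scores to be ordered like the EXIF tags.

    Returns 0.0 for ties, since a tied pair carries no ordering signal.
    """
    if tag_a == tag_b:
        return 0.0
    sign = 1.0 if tag_a > tag_b else -1.0
    margin = sign * (score_a - score_b)
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is small when the predicted scores agree with the tag ordering and grows roughly linearly when they disagree.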

Abstract:
Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.

Abstract:
Hair editing is a long-standing problem in computer vision that demands both fine-grained local control and intuitive user interactions across diverse modalities. Despite the remarkable progress of GANs and diffusion models, existing methods still lack a unified framework that simultaneously supports arbitrary interaction modes (e.g., text, sketch, mask, and reference image) while ensuring precise editing and faithful preservation of irrelevant attributes. In this work, we introduce a novel paradigm that reformulates hair editing as proxy-based hair transfer. Specifically, we leverage the dense and semantically disentangled latent space of StyleGAN for precise manipulation and exploit its feature space for disentangled attribute preservation, thereby decoupling the objectives of editing and preservation. Our framework unifies different modalities by converting editing conditions into distinct transfer proxies, whose features are seamlessly blended to achieve global or local edits. Beyond 2D, we extend our paradigm to 3D-aware settings by incorporating EG3D and PanoHead, where we propose a multi-view boosted hair feature localization strategy together with 3D-tailored proxy generation methods that exploit the inherent properties of 3D-aware generative models. Extensive experiments demonstrate that our method consistently outperforms prior approaches in editing effects, attribute preservation, visual naturalness, and multi-view consistency, while offering unprecedented support for multimodal and mixed-modal interactions.

Abstract:
Recent advances in deep learning have significantly propelled the development of image forgery localization. However, existing models remain highly vulnerable to adversarial attacks: imperceptible noise added to forged images can severely mislead these models. In this paper, we address this challenge with an Adversarial Noise Suppression Module (ANSM) that generates a defensive perturbation to suppress the attack effect of adversarial noise. We observe that forgery-relevant features extracted from adversarial and original forged images exhibit distinct distributions. To bridge this gap, we introduce Forgery-relevant Features Alignment (FFA) as a first-stage training strategy, which reduces distributional discrepancies by minimizing the channel-wise Kullback–Leibler divergence between these features. To further refine the defensive perturbation, we design a second-stage training strategy, termed Mask-guided Refinement (MgR), which incorporates a dual-mask constraint. MgR ensures that the defensive perturbation remains effective for both adversarial and original forged images, recovering forgery localization accuracy to its original level. Extensive experiments across various attack algorithms demonstrate that our method significantly restores the forgery localization model’s performance on adversarial images. Notably, when ANSM is applied to original forged images, the performance remains nearly unaffected. To the best of our knowledge, this is the first report of adversarial defense in image forgery localization tasks.
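
The channel-wise Kullback–Leibler alignment mentioned above can be sketched as follows. Two choices here are assumptions, not details from the abstract: each channel's activations are turned into a distribution with a softmax over spatial positions, and the divergence is taken from the original-image features to the adversarial-image features. Each feature map is represented as a list of channels, with every channel flattened to a list of floats.

```python
import math
from typing import List, Sequence


def log_softmax(values: Sequence[float]) -> List[float]:
    """Log of the softmax, computed stably by subtracting the maximum."""
    m = max(values)
    log_norm = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_norm for v in values]


def channelwise_kl(
    clean_feats: Sequence[Sequence[float]],
    adv_feats: Sequence[Sequence[float]],
) -> float:
    """Mean over channels of KL(softmax(clean_c) || softmax(adv_c))."""
    total = 0.0
    for clean_c, adv_c in zip(clean_feats, adv_feats):
        log_p = log_softmax(clean_c)
        log_q = log_softmax(adv_c)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(clean_feats)
```

Working in log space avoids division by zero when a softmax probability underflows.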

Abstract:
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data—features, samples, and objectives—and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without additional fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding tasks.

Abstract:
While many deep learning models trained on private datasets have been deployed in various practical tasks, they may pose a privacy leakage risk, as attackers could recover informative data or label knowledge from models. In this work, we present privacy-preserving model transcription, a data-free model-to-model conversion solution to facilitate model deployment with a privacy guarantee. To this end, we propose a cooperative-competitive learning approach termed differentially private synthetic distillation that learns to convert a pretrained model (teacher) into its privacy-preserving counterpart (student) via a trainable generator without access to private data. The learning involves three players in a unified framework and performs alternate optimization: i) the generator is learned to generate synthetic data, ii) the teacher and student accept the synthetic data and compute differentially private labels through flexible noisy perturbation of data or labels, and iii) the student is updated with noisy labels and the generator is updated by taking the student as a discriminator for adversarial training. We theoretically prove that our approach can guarantee differential privacy and convergence. The transcribed student has good performance and privacy protection, while the resulting generator can generate private synthetic data for downstream tasks. Extensive experiments clearly demonstrate that our approach outperforms 26 state-of-the-art methods.

Abstract:
The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High-quality datasets are crucial for the development of effective data-driven autonomous driving solutions. Next-generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotations, and diverse scene representation. To address this need, we present OmniHD-Scenes, a large-scale multimodal dataset that provides comprehensive omnidirectional high-definition data. The OmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six 4D imaging radar systems to achieve full environmental perception. The dataset comprises 1501 clips, each approximately 30-s long, totaling more than 450 K synchronized frames and more than 5.85 million synchronized sensor data points. We also propose a novel 4D annotation pipeline. To date, we have annotated 200 clips with more than 514 K precise 3D bounding boxes. These clips also include semantic segmentation annotations for static scene elements. Additionally, we introduce a novel automated pipeline for generation of the dense occupancy ground truth, which effectively leverages information from non-key frames. Alongside the proposed dataset, we establish comprehensive evaluation metrics, baseline models, and benchmarks for 3D detection and semantic occupancy prediction. These benchmarks utilize surround-view cameras and 4D imaging radar to explore cost-effective sensor solutions for autonomous driving applications. Extensive experiments demonstrate the effectiveness of our low-cost sensor configuration and its robustness under adverse conditions.

Abstract:
We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io

Abstract:
Multi-view spectral clustering (MVSC) has garnered growing interest across various real-world applications, owing to its flexibility in managing diverse data space structures. Nevertheless, the fusion of multiple $n \times n$ similarity matrices and the separate post-discretization process hinder the utilization of MVSC in large-scale tasks, where $n$ denotes the number of samples. Moreover, noise in different similarity matrices, along with the two-stage mismatch caused by the post-discretization, results in a reduction in clustering effectiveness. To overcome these challenges, we establish a novel fast multi-view discrete clustering (FMVDC) model via spectral embedding fusion, which integrates spectral embedding matrices ($n \times c$, $c \ll n$) to directly obtain discrete sample categories, where $c$ indicates the number of clusters, bypassing the need for both similarity matrix fusion and post-discretization. To further enhance clustering efficiency, we employ an anchor-based spectral embedding strategy to decrease the computational complexity of spectral analysis from cubic to linear. Since gradient descent methods cannot be applied to discrete models, we propose a fast optimization strategy based on the coordinate descent method to solve the FMVDC model efficiently. Extensive studies demonstrate that FMVDC significantly improves clustering performance compared to existing state-of-the-art methods, particularly in large-scale clustering tasks.

Abstract:
Cooperative inference in distributed sensor networks is challenged by limited communication bandwidth and the risk of node failures. This paper introduces Compressed Feature Diffusion for Decentralized Classification (CFD-DC), a novel framework that addresses these challenges. Each node performs local inference using its own features and compressed feature representations received from other nodes. Our approach relies on two key components: first, a trainable feature compressor at each node that learns compact representations, reducing communication while preserving critical discriminative information; second, an adaptive node weighting mechanism that dynamically adjusts the influence of local and remote features, providing robustness to unreliable or failed nodes. Experiments on multi-view image classification and a simulated multi-node underwater acoustic target classification task demonstrate the effectiveness of the framework. The results show competitive performance compared to centralized and state-of-the-art multi-view methods, reduced communication costs, and superior robustness in scenarios with node failures.
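
One simple way to picture the adaptive node weighting described above is a softmax over per-node relevance scores in which unreachable nodes are masked out. This is purely an illustrative assumption: the abstract does not specify how CFD-DC computes its weights, and in the actual framework they would presumably be learned. The names `node_weights`, `fuse`, `scores`, and `alive` are hypothetical.

```python
import math
from typing import List, Sequence


def node_weights(scores: Sequence[float], alive: Sequence[bool]) -> List[float]:
    """Softmax over the scores of reachable nodes; failed nodes get weight 0."""
    live_scores = [s for s, ok in zip(scores, alive) if ok]
    if not live_scores:
        raise ValueError("at least one node must be reachable")
    m = max(live_scores)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, alive)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
    alive: Sequence[bool],
) -> List[float]:
    """Weighted sum of local and received compressed feature vectors."""
    weights = node_weights(scores, alive)
    dim = len(features[0])
    return [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
```

Because a failed node receives exactly zero weight, whatever stale or placeholder vector is stored for it does not affect the fused representation.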

Abstract:
Multi-Agent Reinforcement Learning (MARL) has proven to be effective in learning cooperative policies, where agents learn decentralized policies, sharing the same network parameters, through centralized training. However, this parameter sharing can lead to similar behaviors among agents, hindering effective exploration. Existing multi-agent diversity methods that rely on the variational inference methods to differentiate agents may suffer from significant overfitting, which in turn hinders the exploration of new trajectories. To encourage multi-agent diversity and efficient exploration, we propose Active Exploration with Agent-Identity (AEAI), a novel exploration method, which maximizes the entropy over trajectories of different agents to promote sufficient exploration. Moreover, we derive a novel lower bound for the mutual information objective based on the successor features to align the directions of trajectories and agent identities in order to learn agent identity-conditioned policies. We combine these two items and integrate our method with existing MARL methods. We evaluate our proposed AEAI on challenging multi-agent tasks across various MARL benchmarks. Experimental results show that our method consistently outperforms existing state-of-the-art methods, highlighting its effectiveness in fostering diversity and improving exploration.

Abstract:
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods struggle to achieve high-fidelity and detailed identity (ID) consistency. This is mainly due to two challenges: insufficient fine-grained control over specific facial areas and the absence of a comprehensive strategy for ID preservation that accounts for both intricate facial details and the overall facial structure. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two core components: a multimodal facial prompt generator and an ID-preservation network. The facial prompt generator combines localized facial features, facial feature descriptions, and overall facial descriptions to enhance the precision of facial detail reconstruction. The ID-preservation network, optimized with a facial attention localization strategy, ensures consistent identity preservation across facial regions. Together, these components leverage fine-grained multimodal identity information to improve identity preservation accuracy significantly. To drive ConsistentID’s training, we propose a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods in the MyStyle dataset. In addition, although ConsistentID introduces more multimodal ID information, it still maintains rapid inference speed during the generation process.

Abstract:
In this paper, we address the challenging task of multimodal reasoning by incorporating the notion of “slow thinking” into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of varying complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which consists of minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates flexible CoT structures for various complex tasks but also mitigates the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. Extensive experiments demonstrate that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 × and boosts inference efficiency by 85.3%.

Abstract:
Split Learning (SL) is a distributed learning framework that has gained popularity for its privacy-preserving nature and low computational demands. However, recent studies have revealed the potential for a server adversary to carry out inference attacks, compromising the privacy of victim clients. Nevertheless, upon re-evaluating prior studies, we found that existing methods rely on overly strong assumptions to enhance their performance, resulting in a significant decline in effectiveness under more realistic scenarios. In this work, we provide new insights into the inherent vulnerabilities of SL. Specifically, we discover that both the smashed data and the server model contain the client’s representation preference, which the server adversary can exploit to build a substitute client that approximates the target client’s unique feature extraction behavior. With a well-trained substitute client, the server can perfectly steal the target client’s functionality, training data, and labels. Building on this observation, we introduce Split Leakage (SLeak), a new threat that targets multiple privacy stealing objectives against SL. Notably, SLeak does not depend on strong privacy priors and only requires partial same-domain auxiliary public data to conduct the attacks. Experimental results on diverse datasets and target models show that SLeak surpasses the state-of-the-art method across multiple metrics. Moreover, ablation studies further confirm its robustness and applicability under various scenarios and assumptions.

Abstract:
Machine unlearning enables data holders to remove the contribution of their specified samples from trained models to protect their privacy. However, it is paradoxical that most unlearning methods require the unlearning requesters to first upload their data to the server as a prerequisite for unlearning. These methods are infeasible in many privacy-preserving scenarios where servers are prohibited from accessing users’ data, such as federated learning (FL). In this paper, we explore how to implement unlearning without revealing the data to be erased to the server. We propose Blind Unlearning (BlindU), which carries out unlearning using compressed representations instead of original inputs. BlindU only involves the server and the unlearning user: the user locally generates privacy-preserving representations, and the server performs unlearning solely on these representations and their labels. For the FL model training, we employ the information bottleneck (IB) mechanism. The encoder of the IB-based FL model learns representations that maximally distort task-irrelevant information from inputs, allowing FL users to generate compressed representations locally. For effective unlearning using compressed representations, BlindU integrates two dedicated unlearning modules tailored explicitly for IB-based models and uses a multiple gradient descent algorithm to balance forgetting and utility retention. While IB compression already provides protection for task-irrelevant information of inputs, to further enhance the privacy protection, we introduce a noise-free differential privacy (DP) masking method that is applied to the raw data to be erased before compression. Theoretical analysis and extensive experimental results illustrate the superiority of BlindU in privacy protection and unlearning effectiveness compared with the best existing privacy-preserving unlearning benchmarks.

Abstract:
Image-sentence matching, which aims to understand the correspondence between vision and language, has achieved significant progress with various deep methods trained under large-scale supervision. Unlike natural images taken by a camera, diagrams in textbooks contain more graphic objects, drawings, and natural objects, and diagram-sentence matching plays an important role in textbook understanding and question answering. However, existing matching models are not suitable for the challenging task of matching diagrams and sentences, due to the more serious few-shot content and incomplete description problems. In this paper, we propose a novel local-feedback self-regulating memory framework (LFSRM) for diagram-sentence matching. On one hand, LFSRM includes an external memory to store useful multi-modal information, especially uncommon information, to overcome the few-shot content problem, where the memory is updated flexibly according to the local feedback from visual-textual alignment scores. On the other hand, LFSRM designs an attention mechanism on local-level alignment scores and a strengthening factor applied to the sentence-to-diagram matching direction to alleviate the incomplete description problem. Extensive experiments on three datasets show that LFSRM achieves satisfactory results on conventional image-sentence matching, and outperforms SOTA methods on few-shot image/diagram-sentence matching by a large margin.

Abstract:
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.

Abstract:
Transformers have demonstrated impressive capabilities across various tasks, yet their performance on compositional problems remains a subject of debate. In this study, we investigate the internal mechanisms underlying Transformers’ behavior in compositional tasks. We find that complexity control strategies—particularly the choice of parameter initialization scale and weight decay—significantly influence whether the model learns primitive-level rules that generalize out-of-distribution (reasoning-based solutions) or relies solely on memorized mappings (memory-based solutions). By applying masking strategies to the model’s information circuits and employing multiple complexity metrics, we reveal distinct internal working mechanisms associated with different solution types. Further analysis reveals that reasoning-based solutions exhibit a lower complexity bias, which aligns with the well-studied neuron condensation phenomenon. This lower complexity bias is hypothesized to be the key factor enabling these solutions to learn reasoning rules. We validate these conclusions across multiple real-world datasets, including image generation and natural language processing tasks, confirming the broad applicability of our findings.

Abstract:
Time series Semi-Supervised Classification (SSC) aims to improve model performance by utilizing abundant unlabeled data in scenarios where labeled samples are limited. Previous approaches mainly focus on exploiting temporal dependencies within the time domain for SSC. However, these temporal dependencies are susceptible to sampling noise and may not effectively capture the global periodicity of features across categories. To this end, we propose a time series SSC framework called CompleMatch, leveraging the complementary information from both temporal and frequency representations for unlabeled data learning. CompleMatch simultaneously trains two deep neural networks based on time-domain and frequency-domain views, with pseudo-labels generated via label propagation in the representation space guiding the training of the opposing view’s classifier. In this co-training paradigm, we incorporate a constraint term to harness the complementary nature of temporal-frequency representations, thereby enhancing the model’s robustness under limited labeled data. In addition, we design a temporal-frequency contrastive learning module that integrates supervised and self-supervised signals to enhance pseudo-label quality by learning more discriminative representations. Extensive experiments demonstrate that CompleMatch surpasses state-of-the-art methods. Furthermore, analyses of model behavior (i.e., ablation studies and visualization) underscore the effectiveness of our proposed approach.

Abstract:
High-dimensional and weakly supervised (HiDWS) data present significant challenges for traditional machine learning and pattern recognition. Although semi-supervised feature selection has shown effectiveness in improving the quality of HiDWS data, existing methods remain sensitive and lack robustness due to the unreliability of unlabeled data learning and the uncertainty in modeling processes. Hence, this study focuses on a model-agnostic multi-granularity zentropy modeling (Ze-MGM) framework for highly accurate and robust semi-supervised feature selection. Unlike existing methods, Ze-MGM does not rely on specific settings such as rough or fuzzy set assumptions and can effectively capture the granularity of information under high-dimensional and weakly supervised data scenarios. Specifically, we first introduce a strategic soft label (S2-Label) learning method that integrates object proximity and classification certainty to reduce uncertainty between features and labels. This method also enables the selection of compatible instances, thereby mitigating the negative impact of incompatible objects on label learning. Subsequently, a multi-granularity knowledge space and zentropy uncertainty measure are constructed by analyzing the hierarchical relationships among labels, decisions, and specific classes, which enables accurate multi-granularity knowledge representation and multi-granularity uncertainty characterization in HiDWS data modeling. Finally, two multi-granularity significance measures based on multi-granularity uncertainty are defined for feature evaluation and selection via a semi-supervised paradigm. Extensive experiments on multiple benchmark datasets demonstrate that the proposed Ze-MGM method achieves superior generalization performance and robustness compared to state-of-the-art methods.

Abstract:
Bot detection is crucial for combating misinformation and preserving the authenticity of online interactions on social media. However, the increasing sophistication of bots in mimicking genuine accounts and evading detection has created an ongoing arms race between detection systems and modeling techniques. In this paper, we propose a novel Structural Information principles-based Adversarial framework, namely SIAMD, designed to Model bot behaviors and achieve proactive Detection. This framework begins by organizing multi-relational interactions between user accounts and social messages into a unified heterogeneous structure, incorporating structural entropy to quantify the uncertainty inherent in historical activities. The high-dimensional entropy is then minimized to uncover a layered hierarchy within account communities, which facilitates activity determination and account selection in behavioral modeling for bot accounts. For each modeled bot and its selected account, SIAMD extracts historical messages and user descriptions to construct prompts and integrates large language models to generate the associated message content. By embedding synthetic message vertices and establishing multi-relational interactions within the original heterogeneous network, SIAMD achieves network evolution in both structure and content, thereby enhancing graph-based proactive detection in an adversarial manner. Extensive comparative experiments on well-established real-world datasets demonstrate that SIAMD significantly and consistently outperforms state-of-the-art detection baselines for social bots in terms of effectiveness, generalizability, robustness, and interpretability.

Abstract:
In the rapidly advancing realm of visual generation, diffusion models have revolutionized the landscape, marking a significant shift in capabilities with their impressive text-guided generative functions. However, relying solely on text for conditioning these models does not fully cater to the varied and complex requirements of different applications and scenarios. Acknowledging this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models to support novel conditions. In this survey, we undertake a thorough review of the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advancements in this domain. Our review begins with a brief introduction to the basics of denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. Additionally, we provide a detailed overview of research in this area, categorizing it from the condition perspective into three directions: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For each category, we analyze the underlying control mechanisms and review representative methods based on their core techniques.

Abstract:
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a well-known density-based clustering algorithm, has gained widespread popularity and usage due to its effectiveness in identifying clusters of arbitrary shapes and handling noisy data. However, it encounters challenges in producing satisfactory cluster results when confronted with datasets of varying density scales, a common scenario in real-world applications. In this paper, we propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. First, we model the initial dataset as a two-level encoding tree and categorize the data vertices into distinct density partitions according to the information uncertainty determined in the encoding tree. Each partition is then assigned to an agent to find the best clustering parameters without manual assistance. The allocation is density-adaptive, enabling AR-DBSCAN to effectively handle diverse density distributions within the dataset by utilizing distinct agents for different partitions. Second, a multi-agent deep reinforcement learning guided automatic parameter searching process is designed. The process of adjusting the parameter search direction by perceiving the clustering environment is modeled as a Markov decision process. Using a weakly-supervised reward training policy network, each agent adaptively learns the optimal clustering parameters by interacting with the clusters. Third, a recursive search mechanism adaptable to the data’s scale is presented, enabling efficient and controlled exploration of large parameter spaces. Extensive experiments are conducted on nine artificial datasets and a real-world dataset. The results of offline and online tasks show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) metrics, respectively, but also is capable of robustly finding dominant parameters.
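
For orientation, the two parameters that the agents described above search over can be seen in a minimal sketch of classic DBSCAN. This is only the textbook algorithm with a brute-force neighbor search, not AR-DBSCAN; the names `eps` and `min_pts` follow common usage and are assumptions, not taken from the abstract.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, so keep expanding
    return labels

# Two tight groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

A single global `eps` works here because both groups have the same density; when densities differ across the dataset, no single value fits every region, which is the situation the abstract targets.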

Abstract:
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V++, a training-efficient image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM’s static image encoder to enable spatiotemporal video perception, (ii) a memory selective associator that retrieves the most relevant past frames via similarity-driven selection and uses multiscale-enhanced cross-attention to associate selected memory features with the current frame, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves 93% of SAM 2’s performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field.

Abstract:
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.

Abstract:
Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at: github.com/lpiccinelli-eth/UniDepth.

Abstract:
Physical adversarial examples (PAEs) are regarded as “whistle-blowers” of real-world risks in deep-learning applications, thus worth further investigation. However, current PAE generation studies show limited adaptive attacking ability to diverse and varying scenes, revealing the urgent requirement of dynamic PAEs that are generated in real time and conditioned on the observation from the attacker. The key challenge in generating dynamic PAEs is learning the sparse relation between PAEs and the observation of attackers under the noisy feedback of attack training. To address the challenge, we present DynamicPAE, the first generative framework that enables scene-aware real-time physical attacks. Specifically, to address the noisy feedback problem that obfuscates the exploration of scene-related PAEs, we introduce the residual-guided adversarial pattern exploration technique. We first introduce the limited feedback information restriction to model the training degeneracy problem under noisy feedback. Then, residual-guided training, which relaxes the attack training with a reconstruction task, is proposed to enrich the feedback information, thereby achieving a more comprehensive exploration of PAEs. To address the alignment problem between the trained generator, which represents the learned relation, and the real-world scenario, we introduce the distribution-matched attack scenario alignment, consisting of the conditional-uncertainty-aligned data module and the skewness-aligned objective re-weighting module. The former aligns the training environment with the incomplete observation of the real-world attacker. The latter facilitates consistent stealth control across different attack targets by balancing the objectives with the skewness indicator. 
Extensive digital and physical evaluations demonstrate the superior attack performance of DynamicPAE, attaining a 2.07× boost (58.8% average AP drop under attack) on representative object detectors (e.g., DETR) over state-of-the-art static PAE generating methods. Overall, our work opens the door to end-to-end modeling of dynamic PAEs.

Abstract:
Multi-modal hashing aims to succinctly encode heterogeneous modalities into binary hash codes, facilitating efficient multimedia retrieval characterized by low storage demands and high retrieval speed. Despite the commendable achievements of existing methods, they still face three crucial challenges: 1) Inadequate bridging of the heterogeneous modality gap through coarse, global feature-level alignment and fusion. 2) The erosion of bit independence and consequent limitations on the semantic representation capacity of hash codes during feature-level hash code learning. 3) The insufficiency of binary label-based pairwise semantic preservation strategies in capturing intricate fine-grained semantic correlations within multi-modal data. To address these challenges, this paper introduces the Dynamic Bit-wise Semantic Transformer Hashing (DBSTH) framework. Remarkably, it treats each hash bit as a unique semantic concept, facilitating concept-level alignment of heterogeneous modalities. This safeguards bit independence and augments representation capabilities. Specifically, we devise a dynamic unit fusion strategy for the adaptive combination of local multi-modal information units, facilitating the acquisition of bit-wise semantic concepts. Subsequently, we incorporate a transformer encoder to refine these concepts by uncovering latent correlations among distinct concepts. Finally, we perform the multi-modal alignment and fusion on the fine-grained concept-level, independently encoding each concept to its corresponding hash bit. To provide enhanced guidance for concept learning, a label prototype learning mechanism is introduced, which learns prototype embeddings for all categories through the consideration of co-occurrence priors. This mechanism effectively captures fine-grained explicit semantic correlations and generates supervising hash codes. 
Additionally, to improve the robustness of the hashing model in handling noisy multi-modal data, a masked concept learning strategy is introduced, facilitating the acquisition of resilient semantic concepts. Extensive experiments conducted on three widely tested multi-modal retrieval datasets demonstrate the superiority of our method in conventional, noisy, and open-set retrieval scenarios.

Abstract:
Existing research on unconstrained in-the-wild head pose estimation suffers from the flaws of its datasets, which consist of either numerous samples by non-realistic synthesis or constrained collection, or small-scale natural images yet with plausible manual annotations. This makes fully-supervised solutions compromised due to the reliance on generous labels. To alleviate it, we propose the first semi-supervised unconstrained head pose estimation method SemiUHPE, which can leverage abundant easily available unlabeled head images. Technically, we choose semi-supervised rotation regression and adapt it to the error-sensitive and label-scarce problem of unconstrained head pose. Our method is based on the observation that the aspect-ratio invariant cropping of wild heads is superior to previous landmark-based affine alignment given that landmarks of unconstrained human heads are usually unavailable, especially for underexplored non-frontal heads. Instead of using a pre-fixed threshold to filter out pseudo labeled heads, we propose dynamic entropy based filtering to adaptively remove unlabeled outliers as training progresses by updating the threshold in multiple stages. We then revisit the design of weak-strong augmentations and improve it by devising two novel head-oriented strong augmentations, termed pose-irrelevant cut-occlusion and pose-altering rotation consistency respectively. Extensive experiments and ablation studies show that SemiUHPE outperforms its counterparts greatly on public benchmarks under both the front-range and full-range settings. Furthermore, our proposed method is also beneficial for solving other closely related problems, including generic object rotation regression and 3D head reconstruction, demonstrating good versatility and extensibility.

Abstract:
Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual Refer-Attention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods.

Abstract:
Multi-agent reinforcement learning (MARL) requires effective coordination among multiple decision-making agents to achieve joint goals. Approaches based on a global value function face the curse of dimensionality, while fully decomposed centralized training with decentralized execution (CTDE) methods often suffer from relative overgeneralization. Coordination graphs mitigate this issue but typically fail to capture dynamic collaboration patterns that evolve over time and across tasks. We propose Dynamic Deep Factor Graphs (DDFG), a value decomposition algorithm that represents the global value via factor graphs and learns graph structures on the fly through a graph-generation policy, adapting to evolving inter-agent relations. We provide a theoretical upper bound on the approximation error of high-order decompositions and reveal how the maximum order D trades off accuracy against computation, offering guidance for balancing performance and cost. Using max-sum for inference, DDFG efficiently derives joint policies. Experiments on higher-order predator–prey and SMAC show consistent gains over strong value-decomposition baselines, demonstrating improved sample efficiency and robustness in complex settings.

Abstract:
Protecting data privacy in deep learning (DL) is of crucial importance. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many existing mechanisms achieve privacy at the cost of significant utility degradation and computational overhead. In this paper, we propose a stochastic differential equation-based residual perturbation for privacy-preserving DL, which injects Gaussian noise into each residual mapping of ResNets. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap of DL. Empirically, we show that residual perturbation is computationally efficient and outperforms the state-of-the-art differentially private stochastic gradient descent (DPSGD) in utility maintenance without sacrificing membership privacy.
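
As a rough illustration of injecting Gaussian noise into a residual mapping, the sketch below perturbs a toy residual block. The additive form `x + F(x) + noise_scale * n` with `n ~ N(0, I)` is an assumption made for illustration, since the abstract does not state the exact perturbation; `perturbed_residual_block`, `noise_scale`, and the toy mapping are hypothetical names, and the code makes no claim about the privacy guarantee itself.

```python
import random

def perturbed_residual_block(x, residual_fn, noise_scale, rng):
    """Return x + F(x) + noise_scale * n, with n drawn i.i.d. from N(0, 1).

    Assumed additive form, for illustration only.
    """
    fx = residual_fn(x)
    return [xi + fi + noise_scale * rng.gauss(0.0, 1.0) for xi, fi in zip(x, fx)]

def toy_residual(v):
    # Hypothetical stand-in for a learned residual mapping F.
    return [max(0.0, 0.5 * a) for a in v]

rng = random.Random(0)
x = [1.0, -2.0, 3.0]
print(perturbed_residual_block(x, toy_residual, noise_scale=0.0, rng=rng))  # plain residual block
print(perturbed_residual_block(x, toy_residual, noise_scale=0.1, rng=rng))  # perturbed output
```

Setting `noise_scale` to zero recovers the ordinary residual update, which makes the perturbation easy to switch off when comparing utility.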

Abstract:
Generalized out-of-distribution (OOD) detection plays a key role in reliable and safety-critical applications. Existing research is mainly devoted to designing or training a powerful score function but overlooks the decision rule built on that score function. Different from previous work, this paper aims to design a decision rule with a rigorous theoretical guarantee and good empirical performance. Specifically, we provide a new insight into the OOD detection task from a hypothesis testing perspective and propose a novel generalized Benjamini Hochberg (g-BH) procedure to solve the testing problem. Theoretically, the g-BH procedure controls the false discovery rate (FDR) under a pre-specified level without any assumption on the dependence among the p-values. Furthermore, we derive an upper bound and a lower bound on the expectation of the false positive rate (FPR) for the g-BH procedure based on the tailed generalized Gaussian distribution family, indicating that the FPR of the g-BH procedure converges to zero in probability. Finally, extensive experimental results verify the superiority of the g-BH procedure over the traditional threshold-based decision rule on several generalized OOD detection benchmarks.
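
For readers unfamiliar with the baseline that g-BH generalizes, the following is a minimal sketch of the classical Benjamini-Hochberg step-up procedure. It is not the g-BH procedure of this paper. Reading a rejection as "flag the sample as OOD" (with the null hypothesis that the sample is in-distribution) is an assumption made here for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Classical BH step-up rule; returns a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold rank/m * alpha.
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True  # reject every hypothesis ranked at or below k_max
    return rejected

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))
```

Unlike a fixed score threshold, the cut-off adapts to the whole batch of p-values: a hypothesis can be rejected even when its own p-value exceeds its rank threshold, as long as a larger-ranked p-value passes.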

Abstract:
Audio classification is an active research area with a wide range of applications. Over the past decade, convolutional neural networks (CNNs) have been the de-facto standard building block for end-to-end audio classification models. Recently, neural networks based solely on self-attention mechanisms such as the Audio Spectrogram Transformer (AST) have been shown to outperform CNNs. In this paper, we find an intriguing interaction between the two very different models: CNN and AST models are good teachers for each other. When we use either of them as the teacher and train the other model as the student via knowledge distillation (KD), the performance of the student model noticeably improves, and in many cases, is better than the teacher model. In our experiments with this CNN/Transformer Cross-Model Knowledge Distillation (CMKD) method, we achieve new state-of-the-art performance on FSD50K, AudioSet, and ESC-50.
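
A generic knowledge distillation objective can be sketched as below. This is the common temperature-scaled formulation (a hard-label cross-entropy term plus a KL term between softened teacher and student distributions), shown under the assumption that it resembles what a KD setup like this would use; the abstract does not give the exact CMKD loss, and `kd_loss`, `temperature`, and `weight` are hypothetical names.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, weight=0.5):
    """weight * CE(student, label) + (1 - weight) * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy on the student's ordinary (T = 1) prediction.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0)
    return weight * ce + (1 - weight) * temperature ** 2 * kl

# Student disagrees with the teacher, so the distillation term is positive.
print(kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], label=0))
```

In the cross-model setting described above, the teacher and student logits would come from different architectures (one CNN, one AST) evaluated on the same audio clip.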

Abstract:
The Area Under the Receiver Operating Characteristics Curve (AUC) is a widely used metric for evaluating model performance across all possible decision thresholds. Existing methods for AUC optimization typically assume a predefined parametric distribution of thresholds. However, the optimal decision threshold depends on the misclassification costs, which follow a non-parametric distribution. This motivates us to introduce a variant of AUC, termed Cost-aware AUC (CAUC), where the thresholds are conditioned on an empirically determined cost distribution. Unfortunately, as a bilevel problem, it is challenging to directly optimize the CAUC: 1) The inner problem of finding the optimal thresholds is non-convex, leading to potential issues with convergence; 2) The outer problem involves the derivative of False Positive Rate (FPR) w.r.t. the threshold, which is unavailable without an explicit formulation of threshold distribution. To address challenge 1), we utilize the convex relaxation technique to reshape the inner problem into a convex one. Facing challenge 2), we propose an adaptive kernel density estimation framework. Specifically, the derivative of FPR is considered an aggregation of various kernel functions. To avoid manually crafting the aggregation function, we propose a finite-difference-based stochastic algorithm to optimize the model without explicit aggregation function. Theoretically, the proposed algorithm enjoys a convergence rate of $\mathcal{O}(\epsilon^{-4})$. Empirical studies across various datasets and cost distributions speak to the effectiveness and soundness of our framework.

Abstract:
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: (i) VFM-driven superpixel generation for detailed semantic representation, (ii) a VFM-assisted contrastive learning strategy to align multimodal features, (iii) superpoint temporal consistency to maintain stable representations across time, and (iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach achieves substantial gains over state-of-the-art methods in linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on 11 large-scale multi-sensor datasets highlight our superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.

Abstract:
Data augmentation is widely utilized as an effective technique to enhance the generalization performance of deep models. However, data augmentation may inevitably introduce distribution shifts and noises, which significantly constrain the potential and deteriorate the performance of deep networks. To this end, we propose a novel information-preserving framework, namely IPF-RDA, to enhance the robustness of data augmentations in this paper. IPF-RDA combines the proposal of (i) a new class-discriminative information estimation algorithm that identifies the points most vulnerable to data augmentation operations and corresponding importance scores; and (ii) a new information-preserving scheme that preserves the critical information in the augmented samples and ensures the diversity of augmented data adaptively. We divide data augmentation methods into three categories according to the operation types and integrate these approaches into our framework accordingly. After being integrated into our framework, the robustness of data augmentation methods can be enhanced and their full potential can be unleashed. Extensive experiments demonstrate that, despite its simplicity, IPF-RDA consistently improves the performance of numerous commonly used state-of-the-art data augmentation methods with popular deep models on a variety of datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, CUHK03, Market1501, Oxford Flower, and MNIST, where its performance and scalability are stressed.

Abstract:
In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs by the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving the probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.

Abstract:
Fine-tuning pre-trained vision-language models (VLMs) has shown substantial benefits in a wide range of downstream tasks, often achieving impressive performance with minimal labeled data. Parameter-efficient fine-tuning techniques, in particular, have demonstrated their effectiveness in enhancing downstream task performance. However, these methods frequently struggle to generalize to out-of-distribution (OOD) data due to their reliance on non-causal representations, which can introduce biases and spurious correlations that negatively impact decision-making. Such spurious factors hinder the model’s generalization ability beyond the training distribution. To address these challenges, in this paper, we propose a novel causal intervention-based prompt tuning method to adapt VLMs to few-shot OOD generalization. Specifically, we leverage the front-door adjustment technique from causal inference to mitigate the effects of spurious correlations and enhance the model’s focus on causal relationships. Built upon VLMs, our approach begins by decoupling causal and non-causal representations in the vision-language alignment process. The causal representation that captures only essential semantically relevant information can serve as a mediator variable between the input image and output label, mitigating the biases from the latent confounder. To further enrich this causal representation, we propose a novel text-based diversity augmentation technique that uses textual features to provide additional semantic context. This augmentation technique can enhance the diversity of the causal representation, making it more robust and generalizable to various OOD scenarios. Experimental results across multiple OOD datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art generalization performance.

Abstract:
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets.

Abstract:
Semi-supervised object detection (SSOD) mitigates the annotation burden in object detection by leveraging unlabeled data, providing a scalable solution for modern perception systems. Concurrently, detection transformers (DETRs) have emerged as a popular end-to-end framework, offering advantages such as non-maximum suppression (NMS)-free inference. However, existing SSOD methods are predominantly designed for conventional detectors, leaving the exploration of DETR-based SSOD largely uncharted. This paper presents a systematic study to bridge this gap. We begin by identifying two principal obstacles in semi-supervised DETR training: (1) the inherent one-to-one assignment mechanism of DETRs is highly sensitive to noisy pseudo-labels, which impedes training efficiency; and (2) the query-based decoder architecture complicates the design of an effective consistency regularization scheme, limiting further performance gains. To address these challenges, we propose Semi-DETR++, a novel framework for efficient SSOD with DETRs. Our approach introduces a stage-wise hybrid matching strategy that enhances robustness to noisy pseudo-labels by synergistically combining one-to-many and one-to-one assignments while preserving NMS-free inference. Furthermore, based on our observation of the unique layer-wise decoding behavior in DETRs, we develop a simple yet effective re-decode query consistency training method to regularize the decoder. Extensive experiments demonstrate that Semi-DETR++ enables more efficient semi-supervised learning across various DETR architectures, outperforming existing methods by significant margins. The proposed components are also flexible and versatile, showing superior generalization by readily extending to semi-supervised segmentation tasks.

Abstract:
Decentralized federated learning (DFL) is an emerging paradigm that enables edge devices to collaboratively train a learning model through device-to-device (D2D) communication without the coordination of a parameter server (PS). Aggregation weights, also known as mixing weights, are crucial in the DFL process and affect the learning efficiency and accuracy. Conventional design relies on a so-called central entity to collect all local information and conduct system optimization to obtain appropriate weights. In this paper, we develop a distributed aggregation weight optimization algorithm to align with the decentralized nature of DFL. We analyze convergence by quantitatively capturing the impact of the aggregation weights over decentralized communication networks. Based on the analysis, we then formulate a learning performance optimization problem by designing the aggregation weights to minimize the derived convergence bound. The optimization problem is further transformed into an eigenvalue optimization problem and solved by our proposed subgradient-based algorithm in a distributed fashion. In our algorithm, edge devices only need local information to obtain the optimal aggregation weights through local (D2D) communications, just like the learning itself. Therefore, the optimization, communication, and learning process can all be conducted in a distributed fashion, which leads to a genuinely distributed DFL system. Numerical results demonstrate the superiority of the proposed algorithm in practical DFL deployment.
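
To make the role of aggregation (mixing) weights concrete, the sketch below builds a mixing matrix from purely local information and runs a few rounds of D2D averaging. It uses the standard Metropolis-Hastings weighting rule as a stand-in, which is an assumption for illustration; it is not the optimized weights or the subgradient-based algorithm proposed in the abstract. The four-device path topology and the scalar "models" are likewise hypothetical.

```python
def metropolis_weights(neighbors):
    """Mixing matrix in which each device only needs its own degree and its neighbours' degrees."""
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # Self-weight absorbs the remaining mass so that every row sums to one.
        W[i][i] = 1.0 - sum(W[i])
    return W


def mix(W, x):
    """One aggregation round: each device averages the values it receives with weights W[i][j]."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]


# Hypothetical path topology 0 - 1 - 2 - 3 (adjacency lists).
neighbors = [[1], [0, 2], [1, 3], [2]]
W = metropolis_weights(neighbors)

# Scalar stand-ins for local model parameters.
x = [4.0, 0.0, 8.0, 0.0]
target = sum(x) / len(x)
for _ in range(200):
    x = mix(W, x)
print(x, target)
```

Because the resulting matrix is symmetric with rows summing to one, repeated mixing preserves the network-wide average, and how fast the devices agree depends on the eigenvalues of `W`, which is the quantity an eigenvalue-based weight design would target.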

Abstract:
Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we thoroughly conduct controlled experiments to analyze the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development.

Abstract:
Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once without cropping the subject’s face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject’s body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods, and produces comparable results with a time-consuming 3D GAN-based method while being more than 260 times faster.
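
The geometric intuition behind moving the camera farther and increasing the focal length can be checked with a simple pinhole model. The sketch below is only a toy calculation under assumed numbers (a 30 cm selfie distance, a nose tip 3 cm in front of the face plane, a 1.2 m camera pull-back); the function names are hypothetical, and the learned depth estimation, differentiable rendering, and inpainting stages from the abstract are not modelled.

```python
def project(x, z, f):
    """Pinhole projection of a lateral offset x at depth z with focal length f."""
    return f * x / z


def compensating_focal(f, z_ref, d):
    """Focal length that keeps the reference plane at the same image size after moving back by d."""
    return f * (z_ref + d) / z_ref


def relative_magnification(z_point, z_ref):
    """How much larger a point at depth z_point appears than a point on the reference plane."""
    return z_ref / z_point


# Assumed values, in metres.
f = 0.004           # original focal length
z_face = 0.30       # camera-to-face-plane distance for a selfie
nose_offset = 0.03  # nose tip sits this far in front of the face plane
d = 1.20            # how far the virtual camera is moved back

f_new = compensating_focal(f, z_face, d)

# Exaggeration of the nose relative to the face plane, before and after rectification.
before = relative_magnification(z_face - nose_offset, z_face)
after = relative_magnification(z_face + d - nose_offset, z_face + d)
print(f_new, before, after)
```

With the longer focal length, points on the face plane keep their image size, while the relative enlargement of nearer features shrinks toward one, which is the distortion the pipeline aims to reduce.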

Abstract:
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistency predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments obtained on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.

Abstract:
Stochastic Kriging (SK) is a generalized variant of Gaussian process regression, and it is developed for dealing with non-i.i.d. noise in functional responses. Although SK has achieved substantial success in various engineering applications, its intrinsic modeling strategy by focusing on the sample mean limits its flexibility and capability of predicting individual functional samples. Moreover, the performance of SK can be impaired under scarce data scenarios, which are commonly encountered in engineering applications, especially for start-up or just deployed systems. In this paper, we propose a novel transfer learning framework to address the challenges of individualization and data scarcity in traditional SK. The proposed framework features a within-process model to facilitate individualized prediction and a between-process model to leverage information from related processes for resolving the issue of data scarcity. The within- and between-process models are integrated through a tailored convolution process, which quantifies interactions within and between processes using a specially designed covariance matrix and corresponding kernel parameters. Statistical properties are investigated on the parameter estimation of the proposed framework, which provide theoretical guarantees for the performance of transfer learning. The proposed method is compared with benchmark methods through various numerical and real case studies, and the results demonstrate the superiority of the proposed method in dealing with individualized prediction of functional responses, especially when limited data are available in the process of interest. The reproducibility code is available in the supplementary materials.

Abstract:
The spectacular success of training large models on extensive datasets highlights the potential of scaling up for exceptional performance. To deploy these models on edge devices, knowledge distillation (KD) is commonly used to create a compact model from a larger, pretrained teacher model. However, as models and datasets rapidly scale up in practical applications, it is crucial to consider the applicability of existing KD approaches originally designed for limited-capacity architectures and small-scale datasets. In this paper, we revisit current KD methods and identify the presence of a small-data pitfall, where most modifications to vanilla KD prove ineffective on large-scale datasets. To guide the design of consistently effective KD methods across different data scales, we conduct a meticulous evaluation of the knowledge transfer process. Our findings reveal that incorporating more useful information is crucial for achieving consistently effective KD methods, while modifications in loss functions show relatively less significance. In light of this, we present a paradigmatic example that combines vanilla KD with deep supervision, incorporating additional information into the student during distillation. This approach surpasses almost all recent KD methods. We believe our study will offer valuable insights to guide the community in navigating beyond the small-data pitfall and toward consistently effective KD.
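
For readers unfamiliar with the vanilla KD baseline that the abstract revisits, the sketch below shows the commonly used form of its loss: a temperature-softened KL term between teacher and student plus a cross-entropy term on the ground-truth label. The temperature, the mixing coefficient `alpha`, and the toy logits are assumptions chosen for illustration; the deep-supervision variant described in the abstract is not implemented here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of softened teacher-student KL divergence and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor is the usual rescaling that keeps the soft-target gradients comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Toy three-class logits (hypothetical).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = kd_loss(student, teacher, label=0)
print(loss)
```

Setting `alpha=1.0` isolates the distillation term, which vanishes when the student reproduces the teacher's logits exactly.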

Abstract:
This paper concentrates on the Multi-modal Referring Video Segmentation task, where a well-optimized model is able to recognize and segment the target objects referred to by the given guidance signals, e.g., language descriptions. Early approaches model this task as a sequence prediction problem. The lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships. Some recent works propose to perform temporal modeling with a vanilla attention mechanism. However, the condensed visual representation tends to be messy about target information due to occlusion or motion blur. Unlimited non-local operation would spread such noise to all the sequences and interfere with the extraction of global representations. To address the above issue, we present Semantic-assisted Object Cluster network (SOC) and the improved SOC++ in this paper. Our method unifies temporally selective interaction and cross-modal alignment to achieve video-level understanding. In SOC++, a proxy-assisted multi-modal fusion module is introduced to perform preliminary bidirectional activation. Then a semantic integration module with progressive frame-to-video structure facilitates joint space learning across modalities and time steps. Considering that potential noisy visual embeddings would impair the overall representation of target objects in unconstrained inter-frame interactions, we propose to perform tendentious video aggregation through emphasizing the indicative role of the informative frames with lower entropy in this part. A multi-modal query contrastive supervision is also utilized to help construct well-aligned joint space at the video level. Moreover, to integrate the advantage of high-level video information and the low-level details of each frame, we introduce a dynamic query fusion module that performs joint updating of these embeddings. We conduct extensive experiments on popular referring video segmentation benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.

Abstract:
Multi-object tracking (MOT) is crucial for applications such as autonomous driving and robotics, yet traditional image-based methods struggle in high-speed scenarios due to motion blur and temporal gaps caused by low frame rates. Spike cameras, with their ability to continuously record spatiotemporal signals, overcome these limitations. However, existing spike-based methods often rely on intermediate image reconstruction or discrete clustering, limiting real-time performance and temporal continuity. To address this, we propose SNNTracker, the first fully spiking neural network (SNN)-based MOT algorithm tailored for spike cameras. SNNTracker integrates a dynamic neural field (DNF)-based attention mechanism for target detection and a winner-take-all (WTA)-based tracking module with online spike-timing-dependent plasticity (STDP) for adaptive learning of object trajectories. By directly processing spike streams without reconstruction, SNNTracker reduces latency, computational overhead, and dependency on image quality, making it ideal for ultra-high-speed environments. It maintains robust, continuous tracking even under occlusions, severe lighting variations, or temporary object disappearance, by leveraging SNN-estimated motion predictions and long-term online clustering. We construct three types of spike-camera MOT datasets covering dense and sparse annotations across diverse real-world scenarios, including camera ego-motion, deformable and ultra-fast motion (up to 2600 RPM), occlusion, indoor/outdoor lighting changes, and low-visibility tracking. Extensive experiments demonstrate that SNNTracker consistently outperforms state-of-the-art MOT methods—both ANN- and SNN-based—achieving MOTA scores above 96% and up to 100% in many sequences. Our results highlight the advantages of spike-driven SNNs for low-latency, high-speed, and label-free multi-object tracking, advancing neuromorphic vision for real-time perception.

Abstract:
The extraction of road networks is essential for the generation of high-definition maps since it enables the precise localization of road landmarks and their interconnections. However, generating road networks poses a significant challenge due to the conflicting underlying combination of Euclidean (e.g., road landmarks location) and non-Euclidean (e.g., road topological connectivity) structures. Existing methods struggle to merge the two types of data domains effectively, and few of them address the problem properly. Instead, our work establishes a unified representation of both types of data domain by projecting both Euclidean and non-Euclidean data into an integer series called RoadNet Sequence. Beyond modeling RoadNet Sequence with an auto-regressive sequence-to-sequence Transformer model, we decouple the dependency of RoadNet Sequence into a mixture of auto-regressive and non-autoregressive dependency. Building on this, our proposed non-autoregressive sequence-to-sequence approach leverages non-autoregressive dependencies while fixing the gap towards auto-regressive dependencies, resulting in success in both efficiency and accuracy. We further identify two main bottlenecks in the current RoadNetTransformer on a non-overfitting split of the dataset: poor landmark detection limited by the BEV Encoder and error propagation to topology reasoning. Therefore, we propose Topology-Inherited Training to inherit better topology knowledge into RoadNetTransformer. Additionally, we collect SD-Maps from open-source map datasets and use this prior information to significantly improve landmark detection and reachability. Extensive experiments on the nuScenes dataset demonstrate the superiority of RoadNet Sequence representation and the non-autoregressive approach compared to existing state-of-the-art alternatives.

Affiliations: School of Mathematics and Statistics and the Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China; School of Computer Science and Technology and the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China; School of Mathematics and Statistics, Northwestern Polytechnical University, Xi’an, China; National Laboratory of Radar Signal Processing, Xidian University, Xi’an, China; Key Laboratory of Electronic Information Countermeasure and Simulation of the Education Ministry of China, Xidian University, Xi’an, China; School of Mathematics and Statistics, Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China

Abstract:
Hyperspectral image (HSI) plays a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degenerating the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affecting the reliability and diversity of the generated images. Some studies propose to incorporate multi-modal data to enhance spatial diversity, but spectral fidelity cannot be ensured. In addition, existing HSI synthesis models are typically uncontrollable or only support single-condition control, limiting their ability to generate accurate and reliable HSIs. To alleviate these issues, we propose HSIGene, a novel HSI generation foundation model which is based on latent diffusion and supports multi-condition control, allowing for more precise and reliable HSI generation. To enhance the spatial diversity of the training data while preserving spectral fidelity, we propose a new data augmentation method based on spatial super-resolution, in which HSIs are upscaled first, and thus abundant training patches could be obtained by cropping the high-resolution HSIs. In addition, to improve the perceptual quality of the augmented data, we introduce a novel two-stage HSI super-resolution framework, which first applies RGB bands super-resolution and then utilizes our proposed Rectangular Guided Attention Network (RGAN) for guided HSI super-resolution. Experiments demonstrate that the proposed model is capable of generating a vast quantity of realistic HSIs for downstream tasks such as denoising and super-resolution.

Abstract:
High-fidelity 3D surfaces are essential for vision tasks across various domains such as medical imaging, cultural heritage preservation, quality inspection, virtual reality, and autonomous navigation. However, the intricate nature of 3D data representations poses significant challenges in restoring diverse 3D surfaces while capturing fine-grained geometric details at a low cost. This paper introduces an efficient multimodal normal-based 3D surface super-resolution (mn3DSSR) framework, designed to address the challenges of microgeometry enhancement and computational overhead. Specifically, we have constructed one of the largest normal-based multimodal datasets, ensuring superior data quality and diversity through meticulous subjective selection. Furthermore, we explore a new two-branch multimodal alignment approach along with a multimodal split fusion module to mitigate computational complexity while improving restoration performance. To address the limitations associated with normal-based multimodal learning, we develop novel normal-induced loss functions that facilitate geometric consistency and improve feature alignment. Extensive experiments conducted on seven benchmark datasets across four different 3D data representations demonstrate that mn3DSSR consistently outperforms state-of-the-art super-resolution methods in terms of restoration accuracy with high computational efficiency.

Abstract:
We propose CLIP-Actor-X, a text-driven motion generation and neural mesh stylization system for 4D human avatar generation. CLIP-Actor-X generates a detailed 3D human mesh, motion animation, and texture to conform to a given text prompt input from a user. The CLIP-Actor-X system mainly consists of two modules. First, for generating realistic human motion, we build a text-driven human motion synthesis module modeled by a retrieval-augmented generative model, powered by a text-to-motion diffusion model. Second, our novel zero-shot neural style optimization module detailizes and texturizes the sampled sequence of a neutral human mesh template, such that the resulting mesh and appearance comply with the input text prompt in a temporally-consistent and pose-agnostic manner. In contrast to the prior arts that use an artist-designed, non-animatable mesh as an input, our output representation is animatable and better aligned between an input text and the generated avatar without additional post-processes, e.g., re-alignment, retargeting, or rigging. We further propose two ways to stabilize the optimization process: spatio-temporal view augmentation and visibility-aware embedding attention, which deals with poorly rendered views. We demonstrate that CLIP-Actor-X produces perceptually plausible and human-recognizable human avatars in motion with detailed geometry and texture solely from a natural language prompt.

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge, but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for knowledge graph question answering (KGQA). Our approach consists of three key components: a Reasoner that generates human-like natural language reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a latent variable mixture model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths while maintaining computational efficiency during inference.

Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software, Hubei Key Laboratory of Multimedia and Network Communication Engineering and Institute of Artificial Intelligence, Wuhan University, Wuhan, China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; TikTok, ByteDance, Sydney, NSW, Australia; Tencent Inc., Shenzhen, China; College of Electronic and Information Engineering, Tongji University, Shanghai, China; School of Medical Information and Engineering, Southwest Medical University, Luzhou, China; College of Computing and Data Science and the Generative AI Lab, Nanyang Technological University, Singapore

Abstract:
Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks.

Abstract:
Attribution explanation is a typical approach for interpreting deep neural networks (DNNs), aiming to quantify the contribution score of individual input variables to model predictions. Despite extensive methodological development, a fundamental faithfulness problem remains unresolved: whether existing attribution methods faithfully reflect the true decision-making logic of DNNs, which significantly limits their reliability and practical adoption. These concerns largely stem from three core challenges: the lack of a unified theoretical framework, clear theoretical rationales, and principled faithfulness evaluation in the absence of ground truth. Recently, a growing body of theoretical studies has begun to address these issues, marking an important shift toward principled understanding of attribution methods. In this survey, we provide a comprehensive review of these advances, with a particular emphasis on three interconnected directions: (i) Theoretical unification, which uncovers key commonalities and differences among attribution methods; (ii) Theoretical rationale, which clarifies the mathematical and conceptual justifications underlying existing methods; (iii) Theoretical evaluation, which rigorously proves whether attribution methods satisfy established faithfulness principles. Beyond a comprehensive review, we provide practical recommendations and a case study illustrating how theoretical findings can be translated into operational decision rules for method design, selection, and usage. We conclude with a discussion of promising open problems for further work.

Abstract:
Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40× (ImageNet-1K) or 200× (ImageNet-21K) larger than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and BN supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to enhance label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to enhance augmentation-per-image diversity. Our approach reduces soft label storage by 78× on ImageNet-1K and 500× on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and other dataset distillation methods.

Abstract:
The majority of standard diffusion models employ pixel-wise degradations while neglecting the multi-scale characteristics of images. Recently, generalized diffusion models with Positive Semi-definite Degradations (PSD), such as heat dissipation and blurring, have been proposed to address this limitation. However, they suffer from low generation quality due to incomplete optimization analysis and, because their inductive biases are hand-crafted and fixed, they cannot adapt to the training process or to different data distributions. In this paper, we present a comprehensive theoretical analysis of the optimization process in the frequency domain for PSD-based generalized diffusion models. The analysis implies that the non-isotropic frequency-domain degradation of the PSD forward process implicitly acts on the inductive biases of the non-isotropic Variational Lower Bound weighting in the reverse optimization process. Based on this insight, we propose the Frequency Inductive Biases Bootstrapping Optimization (FIBBO) method, which parameterizes the forward process and learns distinct frequency degradation-generation trajectories iteratively. To tackle the problem of hand-crafted and fixed PSD inductive biases, FIBBO dynamically modifies the non-isotropic Gaussian kernel of the forward degradation process so that the introduced inductive biases can be adjusted adaptively during training. Experiments on public datasets show that FIBBO significantly improves the generation quality of PSD-based generalized diffusion models.

Abstract:
Building on the success of diffusion models in visual generation, flow-based models have reemerged as another prominent family of generative models that achieve competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor’s outputs in flow-based models become stable during sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing and inpainting. Besides, we propose a new multi-stage refinement technique that is designed to reduce the inference costs of large flow-based image generation models. Specifically, the multi-stage refinement splits the whole generation procedure across different resolutions, forming a coarse-to-fine text-to-image pipeline. We further adopt a stage-aware deployment strategy that can maximize the inference speed in terms of both latency and throughput. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%∼58.3% on class-conditional generation and 29.8%∼38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms/img and an FID of 3.93 at 38 ms/img, achieving real-time image generation and establishing a new state of the art. Equipped with the recent SD 3.5 Large, we achieve an FID of 28.05 with a speed improvement of around 50% on an NVIDIA 3090 GPU.
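
To make the multi-step sampling paradigm referred to above more concrete, the following is a minimal sketch of plain Euler sampling along a velocity field. It is not FlowTurbo itself: the function names `euler_sample` and `make_straight_velocity` and the toy velocity field are assumptions chosen for illustration, and a real model would replace the toy field with a learned velocity predictor.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so the toy field is well defined
        x = x + dt * velocity(x, t)
    return x


def make_straight_velocity(target):
    """Toy (hypothetical) velocity field whose trajectories are straight lines
    that reach `target` at t=1. It stands in for a learned velocity predictor."""
    target = np.asarray(target, dtype=float)

    def velocity(x, t):
        return (target - x) / (1.0 - t)

    return velocity


target = np.array([2.0, -1.0])
sample = euler_sample(make_straight_velocity(target), x0=np.zeros(2), num_steps=8)
```

Because the toy trajectory is perfectly straight, the Euler steps land on the target; with a learned field, the straighter the trajectory, the fewer steps are typically needed.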

Abstract:
Synthesizing novel views from sparse views has achieved impressive advances with radiance fields, yet prevailing methods suffer from high consumption or insufficient refinement capability. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian Splatting, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the remarkable advancement of recent 3D Gaussian Splatting, although it suffers from geometry degradation as the number of input views decreases. In the Gaussian radiance fields, we find that this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by depth constraints. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Although DNGaussian shows impressive performance, its patch-wise regularization obscures the inconsistency in cross-patch errors. Additionally, primitives can still be irreversibly trapped in local minima under sparse views, even if depth regularization is applied. In this paper, we propose an extended version, DNGaussian++. First, a Geometry Instance Regularizer is developed to enable depth regularization for continuous consistency by exploiting reliable instance-level depth cues. Leveraging the depth gradient guidance, we then propose a Depth-Guided Geometry Reorganization to address the aforementioned local minima problem with high representation efficiency. Extensive experiments show that DNGaussian++ exhibits state-of-the-art performance in multiple datasets and scenarios with high efficiency, and its broad applicability and effectiveness are verified on various backbones and tasks.

Abstract:
The advancement of sequencing technologies has generated an unprecedented volume of single-cell multi-omics data, providing new opportunities for biological discovery and medical research. However, due to the high heterogeneity across different omics types, effective integration of single-cell multi-omics data remains a critical challenge. Existing methods generally ignore the graph structure information among cells or resort to additional knowledge to construct the cell graphs, leading to suboptimal performance and potentially limited practical utility. In this study, we propose a novel Graph-embedded Deep Generative Clustering model (GeDGC) for single-cell multi-omics data integration. Specifically, GeDGC simultaneously learns the shared latent representations and cluster factors across multiple omics by leveraging Gaussian mixture models. Moreover, we impose the graph embedding constraint on both the latent representations and the cluster assignments to ensure the preservation of intrinsic local data structure among cells. As a result, our model captures complex correlations across omics and obtains informative shared latent embeddings for downstream tasks. Extensive experimental results with seventeen competing methods on ten datasets confirm the superiority of GeDGC in single-cell multi-omics data integration.

Abstract:
Recent years have witnessed remarkable progress in image restoration, yet achieving both high performance and efficiency remains a persistent challenge. To address this issue, we present VIVNet, a strong and efficient unified baseline designed to balance accuracy and practicality. Drawing inspiration from the high efficiency of the human visual system, VIVNet embeds a biologically inspired micro visual module into each block of a macro U-shaped vision architecture. This module mimics key perceptual processes such as retinal encoding, lateral inhibition, and high-order processing by combining lightweight depth-wise convolutions for multi-receptive-field feature extraction, a similarity-aware weighting mechanism to emphasize informative signals, and high-order interactions implemented via iterative element-wise multiplication to capture complex dependencies. This design enhances the model’s representational capacity while maintaining computational efficiency. Unlike most existing methods that are limited to narrow task settings, we evaluate VIVNet across a wide range of scenarios, including general, all-in-one, and composite degradation tasks, as well as ultra-high-definition (UHD), underwater, medical, and remote sensing datasets. Extensive experiments show that VIVNet delivers competitive performance with high efficiency.

Abstract:
Out-of-distribution (OOD) detection serves as an unknown-handling mechanism for open-world classification, enabling the identification of OOD data that diverge semantically from in-distribution (ID) data. The learning strategy known as outlier exposure (OE) enhances this process by incorporating OOD data during model training, directly making models learn to discern between ID and OOD patterns. However, in practice, the collected OOD data often contain many ID semantics, a scenario commonly referred to as wild OOD detection. This contamination can markedly compromise the reliability of models in OOD detection, yet few studies have addressed this critical issue. In this paper, we theoretically analyze wild OOD detection from the instance and distribution facets, respectively, to better comprehend its challenges and accordingly introduce two general solutions. At the instance facet, ID/OOD indicators contain errors due to the wild nature, where some data carry OOD labels yet should be assigned as ID. Hence, we introduce a general framework that can dynamically estimate the true ID/OOD indicators solely based on wild OOD data, thereby mitigating their negative impacts. At the distribution facet, the wild OOD distribution is a mixture of ID and OOD distributions, where the ID sub-distribution can mislead the model. We therefore propose a resampling scheme to remove the potential ID sub-distribution, with resampling probabilities estimated from the known ID distribution, enabling OE training to better address wild OOD detection. We provide theoretical guarantees for both solutions and develop algorithms that enhance their practical efficacy, integrating them into a unified framework that leverages their complementary strengths. Finally, we validate our approaches through comprehensive empirical evaluations across a range of wild OOD detection scenarios, clearly demonstrating the superior performance and reliability of our methods when compared to advanced counterparts.

Abstract:
The success of existing graph matching methods heavily relies on high-quality training data with complete and precise correspondences between keypoints across different graphs. However, this assumption is often violated in real-world scenarios, leading to partial correspondence and noisy correspondence challenges. In brief, partial correspondence arises from viewpoint occlusions, where certain keypoints (i.e., outliers) lack valid counterparts in the target graph, while noisy correspondence refers to both incorrectly established (i.e., false positives) and neglected (i.e., false negatives) correspondences due to annotation error. In this paper, we propose the first unified framework to address both partial and noisy correspondence challenges in graph matching. Specifically, we introduce a dual-expert cooperative framework that integrates Koopmans-Beckmann and Lawler's quadratic assignment programming formulations (KB-QAP and L-QAP) through an align-fuse-refine pipeline. In the alignment stage, the KB-QAP expert aligns keypoints and distinguishes inliers from outliers using a novel quadratic contrastive loss. In the fusion stage, the L-QAP expert employs a graph transformer on the association graph to merge the aligned graphs and incorporates a learnable outlier-rejection mechanism to handle partial correspondences. Finally, by exploiting the different noise resistances of the two experts, we identify and refine the false positive and false negative correspondences, thereby enhancing robustness against noisy correspondence. Extensive experiments on four widely-used graph matching datasets demonstrate the effectiveness of our method against 17 competitive baselines in both partial and noisy correspondence scenarios.

Abstract:
Current few-shot action recognition involves two primary sources of information for classification: (1) intra-video information, determined by frame content within a single video clip, and (2) inter-video information, measured by relationships (e.g., feature similarity) among videos. However, existing methods inadequately exploit these two information sources. In terms of intra-video information, current sampling operations for input videos may omit critical action information, reducing the utilization efficiency of video data. For the inter-video information, the action misalignment among videos makes it challenging to calculate precise relationships. Moreover, how to jointly consider both inter- and intra-video information remains under-explored for few-shot action recognition. To this end, we propose a novel framework, Video Information Maximization (VIM), for few-shot video action recognition. VIM is equipped with an adaptive spatial-temporal video sampler and a spatial-temporal action alignment model to maximize intra- and inter-video information, respectively. The video sampler adaptively selects important frames and amplifies critical spatial regions for each input video based on the task at hand. This preserves and emphasizes informative parts of video clips while eliminating interference at the data level. The alignment model performs temporal and spatial action alignment sequentially at the feature level, leading to more precise measurements of inter-video similarity. Finally, based on the mutual information measurement, we introduce a new training objective into few-shot learning, which provides explicit guidance in jointly maximizing intra- and inter-video information in our VIM. Extensive experimental results on public datasets for few-shot action recognition demonstrate the effectiveness of our framework.

Abstract:
Benefiting from the effectiveness of the self-attention mechanisms in the Transformer framework for modeling non-local features of images, significant progress has been achieved in image super-resolution. We note that existing self-attention mechanisms usually use all similarities between the query and key tokens for feature aggregation. However, using all the similarities does not effectively facilitate high-quality image reconstruction, as not all query tokens are relevant to those in the keys. We further note that self-attention mechanisms are less effective for local feature exploration, which limits structural detail restoration. To overcome these problems, we develop a simple yet effective adaptive sparse self-attention method to utilize the most useful information of tokens for image restoration. We first develop a local spatial-variant feature estimation method to build the query and key used in the self-attention so that local information can be better modeled. Then, we present a simple yet effective sparse self-attention to adaptively select the most useful similarity values from the self-attention matrix for better feature aggregation. Our analysis shows that the proposed method models both local and non-local features and thus facilitates better structural detail restoration. We further show that the proposed method can serve as an alternative to existing self-attention mechanisms for better image restoration. Experimental results show that the proposed method performs favorably against state-of-the-art ones on benchmark datasets in terms of accuracy and model complexity.
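
The idea of keeping only the most useful similarity values can be illustrated with a minimal NumPy sketch. It uses a fixed top-k rule per query, which is an assumption made for illustration only: the abstract describes an adaptive selection whose exact form is not given here. The name `topk_sparse_attention` and the parameter `keep` are hypothetical.

```python
import numpy as np


def topk_sparse_attention(q, k, v, keep):
    """Scaled dot-product attention that keeps only the `keep` largest
    similarities per query (illustrative stand-in for adaptive selection)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (n_queries, n_keys)
    # Per-row threshold: the keep-th largest score (ties may keep more entries).
    kth = np.sort(scores, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax; masked entries get exactly zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query tokens
k = rng.normal(size=(6, 8))   # 6 key tokens
v = rng.normal(size=(6, 8))
out, weights = topk_sparse_attention(q, k, v, keep=2)
```

Each output token is then an average over only two value tokens instead of all six, which is the sparsification effect the abstract argues for.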

Abstract:
We present T-Rex2++, a unified and highly practical framework for generic open-set object perception, encompassing both object detection and instance segmentation. Previous methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing these complementary strengths, we introduce a text-visual synergy mechanism that aligns both modalities within a single feature space via contrastive learning. Crucially, T-Rex2++ advances beyond the passive perception paradigm of its predecessor by introducing a novel Universal Prompt. This learnable component models generic objectness, empowering the system to autonomously discover and localize arbitrary objects without any user-provided cues, thereby closing the loop between human-guided interaction and fully automatic perception. Furthermore, we extend the synergy verification to the pixel level by integrating a zero-shot instance segmentation module, demonstrating that our contrastive alignment generalizes robustly to fine-grained masks. Comprehensive experiments demonstrate that T-Rex2++ exhibits strong zero-shot object perception capabilities across a wide spectrum of scenarios, validating T-Rex2++ as a versatile foundation for generic object perception.

Affiliations: Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China; Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China; Department of Computer Science, Brunel University of London, London, U.K.; School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China

Abstract:
Semi-supervised learning can leverage both labeled and unlabeled samples simultaneously to improve performance. However, existing methods often present the following issues: (1) The emphasis of learning is put on either the similarity structures or the regression losses of data, neglecting the interaction between them. (2) The similarity structures among boundary samples might be unreliable, which misleads label propagation and impairs the performance of models on out-of-sample data. (3) They often involve the inverses of high-order matrices, making them inefficient in computation. To overcome these issues, we propose a scalable semi-supervised learning framework with Discriminative Label Propagation and Correction (DLPC), which collaboratively exploits the regression losses and similarity structures of data. Particularly, each sample is projected onto the independent class labels associated with nonnegative adjustment vectors rather than the propagated labels, such that the distances between samples from different classes are naturally enlarged, making regression losses more effective for boundary samples. Benefiting from this, the regression losses can guide the propagation of labels in boundary areas. Thus, the label information is first propagated through dynamically optimized graph structures and then corrected by the regression losses, effectively improving the quality of labels and facilitating feature projection learning. Furthermore, an accelerated solution has been developed to reduce the computational costs of DLPC on sample scales, thereby making it scalable to relatively large-scale problems. Moreover, the proposed DLPC can not only be applied to single-view scenarios but also extended to multi-view tasks. Additionally, an optimization strategy with fast convergence has been presented for DLPC, and extensive experiments demonstrate the effectiveness and superiority of DLPC over state-of-the-art competitors.
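
For readers unfamiliar with graph-based label propagation, the sketch below shows the classic iterative scheme on a fixed similarity graph. It is a generic baseline, not DLPC: the dynamically optimized graphs, the adjustment vectors, and the regression-based correction are not modeled. The function name `propagate_labels`, the mixing weight `alpha`, and the toy graph are assumptions for illustration.

```python
import numpy as np


def propagate_labels(w, y, alpha=0.9, num_iters=200):
    """Classic iterative label propagation: F <- alpha * S @ F + (1 - alpha) * Y,
    where S is the symmetrically normalized similarity matrix."""
    d = w.sum(axis=1)
    s = w / np.sqrt(np.outer(d, d))   # D^{-1/2} W D^{-1/2}
    f = y.astype(float).copy()
    for _ in range(num_iters):        # no matrix inverse is needed
        f = alpha * (s @ f) + (1.0 - alpha) * y
    return f


# Toy graph: nodes {0, 1} and {2, 3} are tightly linked, with a weak 1-2 edge.
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
# One-hot labels for node 0 (class 0) and node 3 (class 1); nodes 1, 2 unlabeled.
y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]])
pred = propagate_labels(w, y).argmax(axis=1)
```

On this toy graph each unlabeled node inherits the label of its strongly connected neighbor; with unreliable similarities near class boundaries, this kind of propagation can be misled, which is the failure mode the abstract targets.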

Abstract:
Knowledge graph embeddings (KGE) are effective for representing factual data for numerous applications. However, real-world facts continually evolve, necessitating ongoing updates to knowledge graphs as new information emerges. Under these circumstances, existing KGE models in transductive, inductive, and continual learning settings are prone to catastrophic forgetting or require costly retraining to integrate new information. To address these challenges, we propose a novel model called the Context-aware Adaptive learning model for Knowledge Graph Embeddings (CAKGE). Our model first identifies semantic-relevant entities and uncovers latent relational paths to facilitate the acquisition of new knowledge. To ensure the paths are semantically aligned with the query, we employ a context-aware fusion module, which leverages multiple specialized expert networks to assess and integrate the relevance of these relational paths. Building on this, we introduce an adaptive message aggregation module that incorporates a knowledge replay strategy, enabling the model to integrate both new and existing knowledge efficiently, without retraining the knowledge graph. Additionally, to mitigate catastrophic forgetting, we reformulate the challenge of aligning new with existing knowledge as a graph-matching task using the Fused Gromov-Wasserstein distance, enabling the alignment of old and new knowledge from both semantic and topological perspectives. Furthermore, we provide theoretical guarantees for the expressiveness and reasoning ability of CAKGE, showing that it is the first unified framework tackling transductive, inductive, and continual settings. Extensive experiments show that CAKGE achieves state-of-the-art performance, demonstrating its effectiveness in dynamic KGE modeling.

Abstract:
Egocentric Task Verification (ETV) aims to determine if the operation flows of procedural tasks in egocentric videos align with the logic of given rules. Early works adopt the video-based verification paradigm that compares a reference video to the testing video, which limits the flexibility of model deployment. Recent research instead uses reference textual rules, which describe the operational logic in natural language, but this also raises the challenges of cross-modal heterogeneity and hierarchical misalignment between the two modalities. While previous works mainly address the cross-modal heterogeneity between vision and text modalities, they inevitably suffer from two additional key challenges: (1) Existing methods are mostly developed in synthetic domains and do not consider synthetic-to-realistic generalization in real-world applications. (2) The intricate relations between visual content and textual rules involve multiple matching correlations, indicating high-order matching interactions. To address these issues, we propose the Generalizable Egocentric Task Verification (GETV) challenge and construct a cross-domain ETV benchmark dataset, EgoCross. It features synthetic-to-real cross-domain evaluation, covering both synthetic datasets for training and realistic datasets for testing, across three different types of tasks. Furthermore, we propose a novel method for this challenge, termed Cross-modal Hybrid Hypergraph Matching (CHHM), which models the logical cross-modal matching in the GETV challenge as a heterogeneous hybrid hypergraph learning process, thus addressing intrinsic multiple matching correlations. Additionally, to tackle the problem of synthetic-to-realistic generalization, we enhance the cross-modal matching process with prototype-based graph representation alignment, which effectively mitigates the cross-domain gap. Extensive experiments on the two existing ETV benchmark datasets, i.e., EgoTV and CSV-NL, and our proposed GETV dataset EgoCross, demonstrate that our approach establishes new state-of-the-art performance on both intra-domain and cross-domain challenges.

Abstract:
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark specifically designed to evaluate hallucinations in visual relationships. R-Bench includes both image-level questions to assess the existence of relationships and instance-level questions that probe deeper into local visual comprehension. Our analysis reveals that relationship hallucinations arise from three types of co-occurrences: relationship-relationship, subject-relationship, and relationship-object, exacerbated by the long-tail distribution in visual datasets. Moreover, LVLMs often ignore visual content, over-relying on common sense from language models, particularly in spatial reasoning tasks. We further demonstrate that region-level image-text alignment helps mitigate relationship hallucinations and propose a new baseline, Region-Aware Alignment Mitigation (RA²M), that enhances model attention to relevant regions, improving alignment between generated text and images.

Abstract:
The abstract visual reasoning ability in human intelligence helps in discovering underlying rules in novel environments. Raven’s Progressive Matrix (RPM) is a classic test to realize such ability in machine intelligence by selecting from candidates. Recent studies suggest that solving RPM in an answer-generation way fosters a more in-depth understanding of rules. However, existing generative solvers cannot discover the global concept-changing rules without auxiliary supervision (e.g., rule annotations and distractors in candidate sets). To this end, we propose a deep latent variable model for Concept-changing Rule ABstraction (CRAB) by learning interpretable concepts and parsing concept-changing rules in the latent space. With the iterative learning process, CRAB can automatically abstract global rules shared across the dataset for each concept and form the learnable prior knowledge of global rules. CRAB outperforms the baselines trained without auxiliary supervision in the arbitrary-position answer generation task and achieves accuracy comparable to, and in some cases higher than, that of the compared models trained with auxiliary supervision. Finally, we conduct experiments to illustrate the interpretability of CRAB in concept learning, answer selection, and global rule abstraction.

Abstract:
Severe weather restoration models often face the simultaneous interaction of multiple degradations in real-world scenarios. Existing approaches typically handle single or composite degradations based on scene descriptors derived from text or image embeddings. However, due to the varying proportions of different degradations within an image, these scene descriptors may not accurately differentiate between degradations, leading to suboptimal restoration in practical applications. To address this issue, we propose a novel Transformer-based restoration framework, AllRestorer, for dealing with four physical severe weather impairments: low-light, haze, rain, and snow. In AllRestorer, we enable the model to adaptively consider all weather impairments, thereby avoiding errors from scene descriptor misdirection. Specifically, we introduce the All-in-One Transformer Block (AiOTB), the core innovation of which is the ability to adaptively handle multiple degradations in a single image, beyond the limitation of existing Transformers that can only handle one type of degradation at a time. To accurately address different variations potentially present within the same type of degradation and minimize ambiguity, AiOTB utilizes a Composite Scene Embedding consisting of both image and text embeddings to define the degradation. Moreover, AiOTB includes an adaptive weight for each degradation, allowing for precise control of the restoration intensity. By leveraging AiOTB, AllRestorer avoids misdirection caused by inaccurate scene descriptors, achieving a 5.00 dB increase in PSNR compared to the baseline on the CDD-11 dataset.

Abstract:
Compatibilities between the hyperedges of two hypergraphs can be represented as a sparse tensor to avoid exponentially increasing computational costs in hypergraph matching. Kd-tree-based approximate nearest neighbor (ANN) methods have been widely adopted to obtain the sparse compatibility tensor and usually need a relatively high density to guarantee greater accuracy without prior knowledge of the correspondences between a pair of feature point sets. For large-scale problems, they require exhaustive computations. This work introduces a novel cascaded second- and third-order framework for efficient hypergraph matching. Its core is a CUR decomposition-based sparse compatibility tensor generation method. A rough node assignment is calculated first by a CUR-based pairwise matching process that has a lower computational cost in the second order. Using that intermediate assignment as prior knowledge, a compatibility tensor with higher sparsity can be calculated, with a significantly decreased memory footprint, by a novel probability relaxation labeling (PRL)-based hypergraph matching algorithm. The term “reliability” is used to describe how the tensor affects the matching performance, and a new measurement, the reliability rate, is proposed to quantify the reliability of a sparse compatibility tensor. Experimental results on large-scale synthetic datasets and widely adopted benchmarks demonstrate that the proposed framework outperforms existing methods, creating a more than ten times sparser, but more reliable, compatibility tensor. The proposed CUR-based tensor generation method can be integrated into existing hypergraph matching algorithms and will significantly increase their performance with lower computational costs.
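
As background for the CUR decomposition mentioned above, the following is a minimal sketch of a plain matrix CUR approximation. It is not the paper's tensor generation method: it only shows how a matrix can be rebuilt from a few of its own columns and rows. The function name `cur_decompose` and the hand-picked indices are assumptions for illustration; practical methods choose the columns and rows more carefully.

```python
import numpy as np


def cur_decompose(a, col_idx, row_idx):
    """Return C (selected columns), R (selected rows), and the linking matrix
    U = pinv(C) @ A @ pinv(R), so that C @ U @ R approximates A."""
    c = a[:, col_idx]
    r = a[row_idx, :]
    u = np.linalg.pinv(c) @ a @ np.linalg.pinv(r)
    return c, u, r


rng = np.random.default_rng(0)
# A 30 x 20 matrix of rank 3, built as a product of two thin random factors.
a = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))
# Three columns and three rows are enough to span a rank-3 matrix (generically).
c, u, r = cur_decompose(a, col_idx=[0, 1, 2], row_idx=[0, 1, 2])
approx = c @ u @ r
```

When the selected columns and rows span the column and row spaces of the matrix, the reconstruction is exact up to rounding, which is why a low-rank structure can be captured from a small sample of entries.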

Abstract:
Causal interaction inference is prone to spurious causal interactions because of the many confounders in a biological system. While many existing methods attempt to address misidentification challenges, there remains a notable lack of effective methods to infer causal interaction under latent/unobserved confounders. In this work, we propose a method to overcome such challenges to infer dynamical causality under invisible confounders (CIC) and further reconstruct the latent confounders from time-series data by developing an orthogonal decomposition theorem in a delay embedding space. This theoretical foundation ensures causal detection for any high-dimensional system even with only two observed variables under many latent confounders, which is a long-standing problem in the field. In addition to the latent confounder problem, such a decomposition makes the coupled variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out using various real datasets, all of which demonstrate its effectiveness in reconstructing real biological networks and unobserved confounders.
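
For readers unfamiliar with delay embedding, the sketch below builds standard delay-coordinate vectors from a scalar time series. It shows only the generic embedding step; the function name, the dimension, and the lag are illustrative assumptions, and the orthogonal decomposition used by CIC is not reproduced here.

```python
def delay_embed(series, dim, lag=1):
    """Return delay vectors (x[t], x[t-lag], ..., x[t-(dim-1)*lag]) for each valid t."""
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    start = (dim - 1) * lag  # first index with a full history
    return [
        tuple(series[t - k * lag] for k in range(dim))
        for t in range(start, len(series))
    ]
```

For example, `delay_embed([1, 2, 3, 4, 5], dim=3)` yields three vectors, each pairing an observation with its two predecessors.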

Abstract:
The problem of robust matrix completion—the recovery of a low-rank matrix and a sparse matrix from a sampling of their superposition—has been addressed extensively in prior literature. Yet, much of this work has focused exclusively on the case in which the matrix sampling is done at random, as this scenario is amenable to theoretical analysis. In contrast, sampling with an arbitrary deterministic pattern is often more accommodating to hardware implementation; consequently, the problem of robust matrix completion under deterministic sampling is considered. To this end, a restricted approximate isometry property is proposed and used, along with a modified golfing scheme and a slightly strengthened incoherence condition, to prove that the latent low-rank and sparse matrices are uniquely recoverable via convex optimization with asymptotically high probability, providing the first exact-recovery theory for robust matrix completion with arbitrary deterministic sampling. A corresponding convex-optimization algorithm, driven by a traditional nuclear norm, is developed and then subsequently generalized by substituting a convolutional nuclear norm in order to cover a broader range of application scenarios. Empirical experiments on synthetic data verify the proposed theory while a battery of results on real-world images demonstrate the practical efficacy of the generalized algorithm for robust matrix recovery.
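
For orientation, the convex program commonly used for robust matrix completion in the literature can be written as follows. This is the standard formulation, stated here as an assumption; the exact program and constants analyzed in the paper may differ. Here M is the superposition of the low-rank matrix L and the sparse matrix S, Ω is the set of sampled entries, P_Ω keeps the entries in Ω and zeroes the rest, and λ > 0 is a trade-off weight.

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad
\mathcal{P}_{\Omega}(L + S) = \mathcal{P}_{\Omega}(M)
```

The nuclear norm promotes low rank in L and the entrywise ℓ1 norm promotes sparsity in S; the generalized algorithm described above would replace the nuclear norm with a convolutional nuclear norm.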

Abstract:
Federated learning (FL) has advanced semantic segmentation through decentralized training to reduce annotation costs. However, most FL-based semantic segmentation methods assume fixed foreground classes, resulting in catastrophic forgetting of old categories when local clients continually collect streaming data of new classes without storing old categories. Moreover, the irregular participation of new local clients with novel classes unseen by others may exacerbate heterogeneous forgetting across clients during global FL training. To resolve the above challenges, we propose a Hierarchical Forgetting Alleviation (HFA) model. By tackling forgetting within and across local clients, our model ensures that all local clients learn from each other as they continuously learn new categories. Specifically, to alleviate class-imbalanced forgetting within local clients induced by background shift, we develop a confidence-regularized pseudo labeling strategy to produce class-balanced soft pseudo labels for old categories that are labeled as background. Guided by soft pseudo labels, we design a graph-induced relation matching loss and a forgetting-balanced gradient propagation module to tackle ambiguous inter-class relations and class-imbalanced gradient propagation among old classes. Besides, a novel task detection module and an adaptive DBSCAN clustering are devised to address inter-client heterogeneous forgetting. They detect the arrival of new tasks to store the old global model for local pseudo labeling and distillation, while supplying global class prototypes for modeling inter-class relations and warm-starting global classifier. Experiments on multiple datasets verify our model’s superiority over other methods.

Abstract:
Data augmentation is an effective technique for tackling data sparsity in sequential recommendation (SR). Existing methods generate new data during the model training to improve the performance. However, deploying them on a backbone model requires retraining, architecture modification, or introducing additional modules and learnable parameters. These processes are time-consuming and costly for well-trained models, especially when the model and data scales become large. In this work, we explore the test-time augmentation (TTA) for SR, which augments the input sequences during the inference phase and then fuses the model’s predictions to improve final accuracy. It avoids the significant overhead associated with training-time augmentation. We first experimentally examine the potential of existing augmentation operators for TTA and find that the Substitute and Mask consistently achieve better performance. Further analysis reveals that these two operators retain the original sequential pattern while adding appropriate perturbations. Moreover, the random selection of augmentation positions creates suitable augmented samples from both semantic and temporal perspectives. Meanwhile, we find that the fixed operation ratio limits the diversity of augmented data, and the TTA may impair the model’s performance on long sequences. In addition, the two operators still face time-consuming similarity-based item selection or interference from mask tokens. Based on the analysis and limitations, we present TNoise and TMask. The former injects uniform noise into the representation, avoiding the computational overhead of item selection. The latter blocks mask tokens from participating in model calculations (TMask-B) or directly removes interactions that should have been replaced with mask tokens (TMask-R). Further, we sample the augmentation ratio from a uniform distribution to improve the data diversity. 
For short sequences, we introduce a sequence smoothing and lengthening method based on inter-item interpolation. For long sequences, we set a threshold to avoid the negative effects of TTA. Comprehensive experiments demonstrate the effectiveness, efficiency, and generalizability of our method.
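
The inference-time procedure described above can be sketched as follows, assuming a hypothetical `score_fn` that maps an item sequence to a list of item scores. The removal step imitates the TMask-R idea, the augmentation ratio is drawn from a uniform distribution, and long sequences skip augmentation; the number of views, the ratio bound, and the length threshold are illustrative values, not the authors' settings.

```python
import random

def tmask_r(sequence, ratio, rng):
    """Remove about ratio * len(sequence) random interactions, preserving order."""
    n_drop = min(len(sequence) - 1, round(ratio * len(sequence)))
    drop = set(rng.sample(range(len(sequence)), n_drop)) if n_drop > 0 else set()
    return [item for i, item in enumerate(sequence) if i not in drop]

def tta_predict(score_fn, sequence, n_views=4, max_ratio=0.5, max_len=50, seed=0):
    """Average item scores over the original sequence and its augmented views."""
    if len(sequence) > max_len:
        return score_fn(sequence)  # long sequence: skip augmentation
    rng = random.Random(seed)
    views = [sequence]
    for _ in range(n_views):
        ratio = rng.uniform(0.0, max_ratio)  # per-view augmentation ratio
        views.append(tmask_r(sequence, ratio, rng))
    all_scores = [score_fn(view) for view in views]
    # Fuse predictions by averaging each item's score across views.
    return [sum(column) / len(all_scores) for column in zip(*all_scores)]
```

Because the backbone is only called through `score_fn`, the sketch needs no retraining or architecture change, which is the property the abstract emphasizes.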

Abstract:
In the field of deep reinforcement learning (DRL), enhancing exploration capabilities and improving the accuracy of Q-value estimation remain two major challenges. Recently, double-actor DRL methods have emerged as a promising class of DRL approaches, achieving substantial advancements in both exploration and Q-value estimation. However, the actors in existing double-actor DRL methods explore the environment independently, without mutual learning or collaboration, which leads to suboptimal policies. To address this challenge, this work proposes a generic solution that can be seamlessly integrated into existing double-actor DRL methods by promoting mutual learning among the actors to develop improved policies. Specifically, we calculate the difference in actions output by the actors and minimize this difference as a loss during training to facilitate mutual imitation among the actors. Simultaneously, we also minimize the differences in Q-values output by the various critics as part of the loss, thereby avoiding significant discrepancies in value estimation for the imitated actions. We present two specific implementations of our method and extend these implementations beyond double-actor DRL methods to other DRL approaches to encourage broader adoption. Experimental results demonstrate that our method significantly improves twenty state-of-the-art (SOTA) DRL methods, including SOTA double-actor DRL methods, across eleven tasks, as measured by return and other metrics.
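
The two loss terms described above can be illustrated with a minimal sketch. It assumes a mean squared difference as the distance and two hypothetical weights; the abstract does not specify the distance measure or the weighting, so both are assumptions.

```python
def mean_squared_difference(xs, ys):
    """Mean of squared element-wise differences between two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mutual_learning_loss(actions_a, actions_b, q_a, q_b,
                         action_weight=1.0, value_weight=1.0):
    """Penalize disagreement between two actors' actions and two critics' Q-values."""
    action_term = mean_squared_difference(actions_a, actions_b)
    value_term = mean_squared_difference(q_a, q_b)
    return action_weight * action_term + value_weight * value_term
```

In practice such a term would be added to each method's usual actor and critic losses; here it is computed on plain lists to keep the example self-contained.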

Abstract:
Fully-supervised category-level pose estimation aims to determine the 6-DoF poses of unseen instances from known categories, which requires expensive manual labeling. Recently, various self-supervised category-level pose estimation methods have been proposed to reduce the need for annotated datasets. However, most methods rely on synthetic data or 3D CAD models, and they are typically limited to addressing single-object pose problems without considering multi-object tasks or shape reconstruction. To overcome these challenges and limitations, we introduce a diffusion-driven self-supervised network for multi-object shape reconstruction and categorical pose estimation, leveraging only shape priors. Specifically, to capture the SE(3)-equivariant pose features and 3D scale-invariant shape information, we present a Prior-Aware Pyramid 3D Point Transformer. This module adopts a point convolutional layer with radial kernels for pose-aware learning and a 3D scale-invariant graph convolution layer for object-level shape representation. Furthermore, we introduce a Pretrain-to-Refine Self-Supervised Training Paradigm to train our network. It enables the proposed network to capture the associations between shape priors and observations, addressing the challenge of intra-class shape variations by utilising the diffusion mechanism. Extensive experiments conducted on four public datasets and a self-built dataset demonstrate that our method significantly outperforms state-of-the-art self-supervised category-level baselines and even surpasses some fully-supervised instance-level and category-level methods. The project page is released at Self-SRPE.

Abstract:
As a fundamental problem in computer vision, multi-view stereo (MVS) aims at recovering the 3D geometry of the target from a set of 2D images. However, the reconstruction quality is significantly impacted by the presence of low-textured areas. In this paper, we propose a Hierarchical Prior Mining (HPM) framework for non-local multi-view stereo. Unlike most existing works, which focus on local information and use only a single prior, HPM captures non-local structural cues and leverages multi-source priors for geometry recovery. Based on the framework, we first propose HPM-MVS, which obtains precise initial hypotheses through non-local operations, simultaneously constructing a better planar prior model in an HPM framework to further facilitate hypothesis generation. In addition, we further propose HPM-MVS++, which excavates the structured region information of images and spatial geometric relationships of hypotheses as prior knowledge. Then, it incorporates them into probabilistic graphical models, ultimately deducing two novel multi-view matching costs. This significantly enhances the robustness to challenging situations and improves the completeness of the reconstruction. Experimental results on the ETH3D and Tanks & Temples benchmarks verify the superior performance and strong generalization capability of our approach.

Abstract:
Deep neural networks (DNNs) often struggle with out-of-distribution data, limiting their reliability in real-world visual applications. To address this issue, domain generalization methods have been developed to learn domain-invariant features from single or multiple training domains, enabling generalization to unseen testing domains. However, existing approaches usually overlook the impact of style frequency within the training set. This oversight predisposes models to capture spurious visual correlations caused by style confounding factors, rather than learning truly causal representations, thereby undermining inference reliability. In this work, we introduce Style Deconfounding Causal Learning (SDCL), a novel causal inference-based framework that explicitly addresses style as a confounding factor to enhance domain generalization in image modalities. Our approach begins by constructing a structural causal model (SCM) tailored to the domain generalization problem and applies a backdoor adjustment strategy to account for style influence. Building on this foundation, we design a style-guided expert module (SGEM) that adaptively clusters style distributions during training, capturing the global confounding style. Additionally, a backdoor causal learning module (BDCL) performs causal interventions during feature extraction, ensuring fair integration of global confounding styles into sample predictions, effectively reducing style bias. The SDCL framework is highly versatile and can be seamlessly integrated with state-of-the-art data augmentation techniques. Extensive experiments across diverse natural and medical image recognition tasks validate its efficacy, demonstrating superior performance in both multi-domain and the more challenging single-domain generalization scenarios.
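
For reference, the textbook backdoor adjustment that such a strategy builds on can be written as below, with X the input image, Y the label, and S a style confounder. The symbols are chosen here for illustration, and the way SDCL estimates or approximates each term is not given in the abstract.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{s} P\bigl(Y \mid X,\, S = s\bigr)\, P(S = s)
```

The sum weights each style by its marginal probability P(S = s) rather than by P(S = s | X), which is how the adjustment removes the bias introduced by styles that are over-represented for particular inputs.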

Abstract:
As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for improving model robustness across diverse environments. Contrastive Language–Image Pretraining (CLIP) plays a central role in these tasks, offering strong zero-shot capabilities that allow models to operate effectively in unseen domains. Yet, despite CLIP’s growing influence, no comprehensive survey has systematically examined its applications in DG and DA, underscoring the need for this review. This survey provides a unified and in-depth overview of CLIP-driven DG and DA. Before reviewing methods, we establish precise and complete scenario definitions covering source accessibility (SA vs. SF), source number (SS vs. MS), and label relations (CS, PS, OS, OPS), forming a coherent taxonomy that structures all subsequent analyses. For DG, we categorize methods into prompt optimization techniques that enhance task alignment and architectures that leverage CLIP as a backbone for transferable feature extraction. For DA, we examine both source-available approaches that rely on labeled source data and source-free approaches operating primarily on target-domain samples, emphasizing the knowledge transfer mechanisms that enable adaptation across heterogeneous settings. We further provide consolidated trend analyses for both DG and DA, revealing overarching patterns, methodological principles, and scenario-dependent behaviors. We then discuss key challenges such as realistic deployment scenarios, LLM knowledge integration, multimodal fusion, interpretability, and catastrophic forgetting, and outline future directions for developing scalable and trustworthy CLIP-based DG and DA systems. By synthesizing existing studies and highlighting critical gaps, this survey offers actionable insights for researchers and practitioners, motivating new strategies for leveraging CLIP to advance domain robustness in real-world scenarios.

Abstract:
Spatio-temporal (ST) prediction is crucial in earth sciences, including meteorological forecasting and urban computing, to name just a few. Access to ample high-quality data, combined with deep models adept at inference, is essential for attaining significant outcomes. Yet, data scarcity and the substantial costs of sensor deployment result in notable data imbalances. Overly specialized models that lack causal linkages further undermine the generalizability of inference techniques. To address these challenges, we first introduce a causal framework for ST predictions, named NuwaDynamics, aimed at pinpointing causal regions in data and providing models with the capability for causal reasoning in a dual-phase process. Initially, we employ upstream self-supervision to identify causally significant patches, equipping the model with generalizable insights and performing targeted interventions on non-essential patches to approximate potential testing distributions. This stage is known as the discovery phase. Progressing from discovery, we apply the insights to downstream tasks tailored to specific ST goals, enhancing the model’s recognition of a wider potential data distribution and augmenting its causal perceptual abilities (referred to as the update phase). Additionally, we address environmental controllability and high computational complexity by implementing channel multiplication and conditional generation methods. This process, termed NuwaDynamics+, can further be interpreted as the front-door adjustment technique in the causality domain. Through comprehensive experiments across ten real-world or simulated ST benchmarks, we demonstrate that integrating the NuwaDynamics+ concept substantially improves the performance of various models. The NuwaDynamics+ concept also significantly enhances versatility across various dynamic ST tasks, such as extreme weather forecasting and long-temporal-step super-resolution predictions.
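
For reference, the textbook front-door adjustment mentioned above reads as follows, with X the input, Y the prediction target, and M a mediator through which X influences Y. The symbols are illustrative, and which quantity plays the role of the mediator in the framework is not stated in the abstract.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid m,\, x'\bigr)\, P(x')
```

Unlike the backdoor formula, this identity does not require observing the confounder itself, only the mediator.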

Abstract:
Seeing clearly with high resolution is a foundation of Multimodal Large Language Models (MLLMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution upscaling method, where the image consists of global and local branches, with the latter being the sliced image patches but resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, and meanwhile, the dominance of local image tokens may diminish the global context. In this paper, we dive into the problems and propose a new framework as well as an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local patches, learnable query embeddings are introduced to reduce image tokens, and the tokens most relevant to the user question are further selected by a similarity-based selector. Our empirical results demonstrate a ‘less is more’ pattern, where utilizing fewer but more informative local image tokens leads to improved performance. Besides, a significant challenge lies in the training strategy, as simultaneous end-to-end training of the global mining block and local compression block does not yield optimal results. We thus advocate for an alternating training scheme, ensuring balanced learning between global and local aspects. Finally, we also introduce a challenging dataset with high requirements for image detail, enhancing the training of the local compression layer. The proposed method, termed MLLM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME), achieves leading performance across various benchmarks with only 2 million training data.
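
A similarity-based selector of the kind described above can be sketched as a top-k choice by cosine similarity between each local image token and a question embedding. The function names, the use of cosine similarity, and the fixed k are assumptions made for illustration; the abstract does not specify the selector's similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_tokens(image_tokens, question_embedding, k):
    """Indices of the k tokens most similar to the question, in original order."""
    ranked = sorted(
        range(len(image_tokens)),
        key=lambda i: cosine(image_tokens[i], question_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])  # restore positional order for the kept tokens
```

Returning indices in their original order keeps the spatial arrangement of the surviving tokens intact.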

Abstract:
This paper aims to recover object materials from posed images captured under an unknown static lighting condition. Recent methods solve this task by optimizing material parameters through differentiable physically based rendering. However, due to the coupling between object geometry, materials, and environment lighting, there is inherent ambiguity during the inverse rendering process, preventing previous methods from obtaining accurate results. To overcome this ill-posed problem, our key idea is to learn the material prior with a generative model for regularizing the optimization process. We observe that the general rendering equation can be split into diffuse and specular shading terms, and thus formulate the material prior as diffusion models of albedo and specular. Thanks to this design, our model can be trained using the existing abundant 3D object data, and naturally acts as a versatile tool to resolve the ambiguity when recovering material representations from RGB images. In addition, we develop a coarse-to-fine training strategy that leverages estimated materials to guide diffusion models to satisfy multi-view consistent constraints, leading to more stable and accurate results. Extensive experiments on real-world and synthetic datasets demonstrate that our approach achieves state-of-the-art performance on material recovery.

Abstract:
End-to-end autonomous driving has emerged as a promising paradigm for directly mapping sensor inputs to planning maneuvers using learning-based modular integrations. However, existing imitation learning (IL)-based models generalize poorly to hard cases and lack a corrective feedback loop after deployment. While reinforcement learning (RL) offers a potential solution to tackle hard cases with optimality, it is often hindered by overfitting to specific driving cases, resulting in catastrophic forgetting of generalizable knowledge and sample inefficiency. To overcome these challenges, we propose Reinforced Refinement with Self-aware Expansion (R2SE), a novel learning pipeline that continually refines hard-case domains while preserving a generalizable driving policy for model-agnostic end-to-end driving systems. Through reinforcement fine-tuning and policy expansion that facilitates continuous improvement, R2SE features three key components: 1) Generalist Pretraining with hard-case allocation trains a generalist imitation learning (IL) driving system while dynamically identifying failure-prone cases for targeted refinement; 2) Residual Reinforced Specialist Fine-tuning optimizes residual corrections using reinforcement learning (RL) to improve performance in the hard-case domain while preserving global driving knowledge; 3) Self-aware Adapter Expansion dynamically integrates specialist policies back into the generalist model, enabling continuous performance improvement. Experimental results in closed-loop simulation and real-world datasets demonstrate improvements in generalization, safety, and long-horizon policy robustness over state-of-the-art E2E systems, highlighting the effectiveness of reinforced refinement for scalable autonomous driving.

Abstract:
Multimodal emotion recognition plays a vital role in enhancing user experience in human-computer interaction. Over the past few decades, researchers have developed a range of algorithms and made remarkable progress. While each approach demonstrates certain advantages, inconsistent choices in feature extraction methods, evaluation protocols, and experimental settings have hindered fair comparisons among them. These inconsistencies significantly impede the advancement of the field. To address this issue, we introduce MERBench, a unified evaluation benchmark for multimodal emotion recognition. Our goal is to assess the contributions of several key techniques commonly used in prior studies, such as feature selection, multimodal fusion, robustness analysis, fine-tuning, and pre-training. We believe this work offers clear and comprehensive guidance for future research. Based on the evaluation results of MERBench, we further point out some promising research directions. In addition, we present a new emotion dataset, MER2023, specifically designed for the Chinese language environment. This dataset serves as a benchmark for research in multi-label learning, noise robustness, and semi-supervised learning.

Abstract:
Existing generative image transformers follow a two-stage generation paradigm, where the first stage learns a codebook to encode images into discrete codes via vector quantization, and the second stage completes the image generation based on the learned codebook. However, existing methods ignore the naturally varying information densities across different image regions and indiscriminately encode fixed-size regions into fixed-length codes, resulting in insufficient encoding in important regions and redundant encoding in unimportant ones, which degrades both the image generation quality and speed. To address this challenge, we propose a novel information-density-based variable-length image coding and generation framework. In the first stage, our Dynamic Quantization VAE++ (DQVAE++) performs information-adaptive encoding by assigning variable-length codes to image regions according to their information densities, yielding more accurate and robust code representations. In the second stage, the Dynamic Generative Image Transformer (DGiT) enables information-adaptive image generation in both autoregressive and non-autoregressive manners. Specifically, for autoregressive (AR) generation, DGiT-AR generates images autoregressively from coarse-grained regions (smooth areas with fewer codes) to fine-grained regions (detailed areas with more codes). This is accomplished through a novel stacked-transformer architecture that alternately models the position and content of image codes, and a novel heterogeneous embedding scheme to distinguish codes of different granularities. Similarly, for non-autoregressive (NAR) generation, DGiT-NAR introduces a novel information-prioritized mask scheduling mechanism, prioritizing the generation of key structural regions with higher information density. This enables more coherent modeling of global structures initially, followed by a more effective synthesis of local details subsequently. 
Comprehensive experiments on unconditional and conditional image generation validate the superiority of our proposed variable-length coding in both effectiveness and efficiency.

Abstract:
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions. By adopting a Shifted Window Attention layer, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens, and by using similarity to filter out redundant tokens and keep the significant ones, we can not only shorten the token sequence but also enhance the model’s performance. Moreover, by expanding our model’s capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It also improves scene text spotting by 10.9% and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding.

Abstract:
Hardware image signal processing (ISP) transforms RAW inputs into high-quality RGB images through a series of processing modules, each with numerous tunable parameters. Traditionally, these parameters are manually tuned by imaging experts, a time-consuming and subjective process. Recent deep learning approaches predict ISP parameters, but often treat the process as a black box and overlook the intrinsic relationships among ISP modules. To address these fundamental issues, we introduce a novel ISP parameter optimization model based on single-agent reinforcement learning (RL) (i.e., SARL-ISP), formulating the hardware ISP parameter tuning as a sequential optimization problem. During the optimization process, the agent updates ISP parameter tuning strategies for different tasks through interaction with the environment. In order to explore the influence of the sequential structure of hardware ISP modules and the coupling relationships among ISP parameters on the tuning process, we further propose a sequential ISP framework based on collaborative multi-agent RL (i.e., MARL-ISP). Specifically, the serialized parameter tuning module (SPTM) realistically simulates the process of manual prediction and module pipeline. Additionally, the feature selection module (FSM) facilitates the transmission and fusion of agent features, thereby selecting more appropriate feature inputs for downstream tasks. Extensive experiments across various tasks (e.g., object detection, instance segmentation) validate the effectiveness and efficiency of our models. Even with minimal training data, our models also outperform current state-of-the-art methods in both quantitative metrics and qualitative evaluations.

Abstract:
Matching redundancy, which refers to fine-grained feature comparison between irrelevant image areas, is a prevalent limitation in current feature matching approaches. It leads to unnecessary and error-prone computations, ultimately diminishing matching accuracy. To reduce matching redundancy, we propose MESA and DMESA, both leveraging the advanced image understanding of the Segment Anything Model (SAM) to establish semantic area matches prior to point matching. These informative area matches can then undergo effective internal feature comparison, facilitating precise inside-area point matching. Specifically, MESA adopts a sparse matching framework, while DMESA applies a dense one. Both of them first obtain candidate areas from SAM results through a novel Area Graph (AG). In MESA, matching the candidates is formulated as a graph energy minimization and solved by graphical models derived from AG. In contrast, DMESA performs area matching by generating dense matching distributions on the entire image, aiming at enhancing efficiency. The distributions are produced from off-the-shelf patch matching, modeled as a Gaussian Mixture Model, and refined via Expectation Maximization. With less repetitive computation, DMESA showcases an area matching speed improvement of nearly five times compared to MESA, while maintaining competitive accuracy. Our methods are extensively evaluated on four different tasks across six datasets, encompassing both indoor and outdoor scenes. The results suggest that our methods achieve notable accuracy improvements for nine baselines of point matching in most cases. Furthermore, our methods exhibit promising generalization and improved robustness against image resolution.

Abstract:
Recent advancements in computer vision (CV) and large language models (LLMs) have spurred significant interest in multi-modal large language models (MLLMs), which aim to integrate visual and textual modalities for enhanced understanding and generation tasks. While much of the existing research focuses on optimizing projectors and LLMs to improve MLLM performance, a critical question remains underexplored: Has the full potential of visual features in MLLMs been realized? To address this question, we identify two key limitations in current MLLM architectures and propose vMLLM, a vision-enhanced MLLM designed to fully leverage the capabilities of visual features. vMLLM introduces two novel components: the Multi-level Aggregation Module (MAM) and the Intra- and inter-modal Enhancement Module (IEM). The MAM aggregates multi-layer features from the vision encoder, capturing both high-level semantic information and low-level spatial details, thereby enriching the visual representation. The IEM enhances visual features through intra- and inter-modal interactions, effectively suppressing irrelevant information while amplifying task-relevant features, leading to more robust multimodal understanding. We conduct extensive experiments on multiple benchmarks, evaluating vMLLM across diverse settings, including different vision encoders, training dataset scales, and varying sizes of LLMs. Our results demonstrate that vMLLM consistently achieves significant performance improvements, validating its effectiveness in harnessing the potential of visual features. These findings highlight the importance of optimizing visual feature extraction and interaction mechanisms in MLLMs, paving the way for more advanced multimodal AI systems.

Abstract:
Reconstructing 3D scenes with high fidelity and efficiency remains a central pursuit in computer vision and graphics. Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic rendering with Gaussian primitives, yet the modeling process remains governed predominantly by photometric supervision. This reliance often leads to irregular spatial distribution and indiscriminate primitive adjustments that largely ignore underlying geometric context. In this work, we rethink Gaussian modeling from a geometric standpoint and introduce Mini-Splatting2, an efficient scene modeling framework that couples structure-aware distribution and region-prioritized optimization, driving 3DGS into a geometry-regulated paradigm. The structure-aware distribution enforces spatial regularity through structured reorganization and representation sparsity, ensuring balanced structural coverage for compact organization. The region-prioritized optimization improves training discrimination through geometric saliency and computational selectivity, fostering appropriate structural emergence for fast convergence. These mechanisms alleviate the long-standing tension among representation compactness, convergence acceleration, and rendering fidelity. Extensive experiments demonstrate that Mini-Splatting2 achieves up to 4× fewer Gaussians and 3× faster optimization while maintaining state-of-the-art visual quality, paving the way towards structured and efficient 3D Gaussian modeling.

Abstract:
Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. Deep networks are vulnerable to such label-corrupted samples due to the memorization effect. One major stream of previous methods concentrates on identifying clean data for training. However, these methods often neglect imbalances in label noise across different mini-batches and devote insufficient attention to out-of-distribution noisy data. To this end, we propose a noise-robust method named Jo-SNC (Joint sample selection and model regularization based on Self- and Neighbor-Consistency). Specifically, we propose to employ the Jensen-Shannon divergence to measure the “likelihood” of a sample being clean or out-of-distribution. This process factors in the nearest neighbors of each sample to reinforce the reliability of clean sample identification. We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds. While clean samples undergo conventional training, detected in-distribution and out-of-distribution noisy samples are trained following partial label learning and negative learning, respectively. Finally, we advance the model performance further by proposing a triplet consistency regularization that promotes self-prediction consistency, neighbor-prediction consistency, and feature consistency. Extensive experiments on various benchmark datasets and comprehensive ablation studies demonstrate the effectiveness and superiority of our approach over existing state-of-the-art methods.
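As a rough illustration of the divergence mentioned above, the following sketch computes the Jensen-Shannon divergence between a model prediction and a one-hot label. It is only a minimal example under assumptions: the function name `js_divergence`, the base-2 logarithm, and the toy distributions are illustrative choices, not the selection rule used by Jo-SNC.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence of two discrete distributions.

    Base-2 logarithms are used (an assumption), so the value lies in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Entries with a == 0 contribute zero by convention.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / (b[mask] + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: a label and two softmax outputs.
one_hot = np.array([0.0, 1.0, 0.0])
agree = np.array([0.05, 0.90, 0.05])     # prediction close to the label
disagree = np.array([0.90, 0.05, 0.05])  # prediction far from the label
```

A small divergence between the prediction and the given label suggests the label is consistent with the model, while a large one hints at a noisy label; how such scores are thresholded is left open here.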

Abstract:
Transformers have achieved great success in natural language processing and computer vision. The core and basic technique of transformers is the self-attention mechanism. The vanilla self-attention mechanism has quadratic complexity, which limits its applications to vision tasks. Most of the existing linear self-attention mechanisms will sacrifice performance to some extent to reduce complexity. In this paper, we propose a novel linear approximation of the vanilla self-attention mechanism named CURSA to achieve both high performance and low complexity at the same time. CURSA is based on the CUR decomposition to decompose the multiplication of large matrices into the multiplication of several small matrices to achieve almost linear complexity. Experiment results of CURSA in image classification tasks, semantic segmentation tasks, object detection tasks, and long-range arena show that it outperforms state-of-the-art self-attention mechanisms with better data efficiency, faster speed, and higher accuracy.
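The general idea of a CUR decomposition can be sketched as follows. This is a generic illustration on a low-rank matrix, not the CURSA attention mechanism itself: the function name `cur_approx`, the uniform random sampling of rows and columns, and the pseudo-inverse cutoff are all assumptions made for the example.

```python
import numpy as np

def cur_approx(A, k, rng):
    """Approximate A by C @ U @ R using k sampled columns and k sampled rows."""
    rows = rng.choice(A.shape[0], size=k, replace=False)
    cols = rng.choice(A.shape[1], size=k, replace=False)
    C = A[:, cols]             # selected columns, shape (n, k)
    R = A[rows, :]             # selected rows, shape (k, m)
    W = A[np.ix_(rows, cols)]  # intersection block, shape (k, k)
    # Pseudo-inverse of the small block; the second argument is the cutoff
    # for tiny singular values (value chosen for illustration).
    U = np.linalg.pinv(W, 1e-10)
    return C, U, R
```

Only the small factors `C`, `U`, and `R` need to be stored and multiplied, which is where the savings over working with the full matrix come from. When the sampled rows and columns span the row and column spaces of a low-rank matrix, the product `C @ U @ R` recovers it up to numerical error.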

Abstract:
We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera's photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog.
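For readers unfamiliar with the atmospheric scattering model referred to above, a minimal sketch of its common textbook form is given here: the observed intensity is I = J·t + A·(1 − t) with transmission t = exp(−β·d). The function name `add_fog`, the parameter names, and the assumption of a single homogeneous attenuation coefficient are illustrative only; the paper's locally homogeneous formulation is not reproduced.

```python
import numpy as np

def add_fog(clear, depth, beta, airlight):
    """Render fog on linear-intensity pixels with the standard scattering model.

    clear    : fog-free intensities J (array)
    depth    : scene distance d per pixel (same shape as clear)
    beta     : attenuation coefficient (assumed constant here)
    airlight : atmospheric light A
    """
    transmission = np.exp(-beta * depth)  # t = exp(-beta * d)
    return clear * transmission + airlight * (1.0 - transmission)
```

At zero depth the pixel is unchanged, and at large depth it converges to the atmospheric light. The model is stated for intensities proportional to scene radiance, which is consistent with the abstract's remark that photometric calibration is a prerequisite.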

Abstract:
Photo enhancement plays a crucial role in augmenting the visual aesthetics of a photograph. In recent years, photo enhancement methods have either focused on enhancement performance, producing powerful models that cannot be deployed on edge devices, or prioritized computational efficiency, resulting in inadequate performance for real-world applications. To this end, this paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. This approach enables fast processing of high-resolution images while also achieving excellent performance. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images, while incorporating two distinct weight fusion strategies to achieve coarse global image enhancement. To implement this strategy, we designed a spatial-frequency transformer weight predictor that effectively extracts the desired distinct weights by leveraging frequency features. Additionally, we apply local Laplacian filters to adaptively refine edge details in high-frequency components. After meticulously redesigning the network structure and transformer model, LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset, but also further reduces runtime, with 4K resolution images processed in just 13 ms on a single GPU. Extensive experimental results on two benchmark datasets further show that the proposed approach performs favorably compared to state-of-the-art methods.
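The closed-form decomposition and reconstruction property of a Laplacian pyramid can be illustrated with a small sketch. The 2×2 average pooling and nearest-neighbour upsampling used here are simplifying assumptions, and all function names are hypothetical; the filters in LLF-LUT++ may differ.

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling; assumes even height and width.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Nearest-neighbour 2x upsampling.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels):
    """Return the high-frequency bands (fine to coarse) and the low-frequency residual."""
    bands = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        bands.append(current - upsample(low))  # detail lost by downsampling
        current = low
    return bands, current

def reconstruct(bands, low):
    """Invert build_pyramid by adding the bands back, coarse to fine."""
    current = low
    for band in reversed(bands):
        current = upsample(current) + band
    return current
```

Because each band stores exactly what the downsampling step discards, reconstruction is exact. A global operator can therefore work on the small low-frequency image while local refinement is applied to the high-frequency bands.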

Abstract:
Dynamic graph neural networks, i.e., DyGNNs, have been extensively explored in literature to handle structural and temporal properties in graphs. On the one hand, there naturally exist distribution shifts in real-world scenarios relevant to dynamic graphs. On the other hand, the dynamics may further bring extra uncertainties to patterns in dynamic graphs. However, existing DyGNNs merely exploit variant patterns with respect to labels under distribution shifts, failing to accurately make predictions when there exist distribution shifts together with uncertain patterns from training data to test data. To deal with this issue, in this paper we propose to handle spatio-temporal distribution shifts in dynamic graphs via the discovery and utilization of invariant patterns, taking uncertainties in patterns into account, where the invariant patterns include structures and features whose predictive abilities are stable across distribution shifts. Nevertheless, we face the following key challenges: i) How to discover the complex invariant and variant spatio-temporal patterns involving time-varying topological structures and node-level features; ii) How to utilize the invariant and variant patterns to deal with the spatio-temporal distribution shifts in dynamic graphs; iii) How to handle the pattern uncertainties upon capturing the hidden invariance and variance with a theoretical guarantee. To tackle these challenges, we propose the Information Bottleneck guided Disentangled Dynamic Graph ATtention network (IB-D²GAT). Our proposed IB-D²GAT model is able to effectively handle spatio-temporal distribution shifts with uncertainties in dynamic graphs through discovering variant and invariant spatio-temporal patterns via information bottleneck. Specifically, we propose a disentangled spatio-temporal attention network to capture the invariant and variant patterns. Next, guided by the information bottleneck principle, we propose the distribution-based invariance optimization strategy which injects stochasticity into the invariant pattern identification so as to prevent the variant information from influencing the prediction, thus eliminating the spurious impacts of variant patterns. We further theoretically show that our proposed tailored invariance optimization strategy can lead to accurately capturing the invariant patterns with stable predictive abilities and therefore is capable of handling distribution shifts. Experiments on multiple real-world datasets and one synthetic dataset demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts.

Abstract:
Deep learning-based methods have achieved remarkable success in brain-computer interfaces (BCIs). However, their inherent assumption of independent and identically distributed (i.i.d.) data renders them vulnerable to out-of-distribution (OOD) scenarios. To address this limitation, the present study proposed a causality-driven convolutional manifold attention network (CD-CMAN) that learned invariant representations from electroencephalogram (EEG) signals to enhance OOD generalization. The framework began with a spatiotemporal convolution module to extract rich temporal and spatial features. Guided by the defined structural causal model and leveraging the strengths of Riemannian geometry and deep learning, dual latent encoders with manifold attention units were crafted to explicitly separate spatiotemporal feature maps into semantic and variation latent factors. A reconstruction module with a dedicated loss was implemented to ensure that these factors remained informative, while the Hilbert-Schmidt independence criterion (HSIC) was introduced to enforce their statistical independence. Further, a variational information bottleneck and gradient reversal layer were incorporated to compress and disentangle the semantic and variation factors. Evaluations on two public datasets under both subject-dependent and subject-independent settings demonstrated that CD-CMAN consistently outperformed comparative baselines. These findings suggest that the proposed model could provide a new solution for the practical application of BCI technology.
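The Hilbert-Schmidt independence criterion mentioned above can be estimated from samples. The sketch below implements the standard biased empirical estimator, trace(KHLH)/(n−1)², with Gaussian kernels. The function names, the fixed bandwidth `sigma`, and the use of NumPy are assumptions for illustration; this is not the loss implementation of CD-CMAN.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for samples stored as rows of x."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)  # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between paired samples x and y (rows are samples)."""
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = rbf_kernel(x, sigma)
    L = rbf_kernel(y, sigma)
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

The estimate is close to zero for independent samples and grows with dependence, so minimizing it between two groups of latent factors pushes them towards statistical independence.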

Abstract:
Multi-labeled complementary label learning (MLCLL) is a resource-efficient paradigm aimed at reducing labeling efforts in multi-label learning (MLL). While existing methods address the MLCLL problem using neural network-based models, they often overfit to noisy information, leading to sharp decision boundaries. This overfitting issue is further exacerbated when the label correlation, which could help denoise the supervision, is not fully explored in existing works. In this paper, we propose a novel framework called NMCB to alleviate the impact of noisy information in MLCLL, which makes a first attempt to explore mixup for the MLCLL problem. Specifically, a tailored version of mixup is employed to achieve a smoother decision boundary of the trained classifier, thereby reducing the sensitivity of NMCB to noisy labels and enhancing its generalization ability. Moreover, NMCB applies a model to automatically extract label correlations from non-complementary labels transformed by mixup during the learning process. These extracted correlations serve as alignment objectives for the output distribution of instance augmentations within a consistency regularization term of NMCB, further improving the model performance. Empirical studies demonstrate the effectiveness of the proposed method.
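For context, plain mixup (not the tailored variant used in NMCB, whose details are not given in the abstract) blends random pairs of inputs and their label vectors with a weight drawn from a Beta distribution. The function name `mixup` and the default `alpha` below are assumptions for this sketch.

```python
import numpy as np

def mixup(x, y, alpha=0.4, rng=None):
    """Standard mixup on a batch: convex combination of shuffled pairs.

    x : inputs, shape (batch, ...)
    y : label vectors (e.g. multi-hot), shape (batch, num_labels)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)     # mixing weight in [0, 1]
    perm = rng.permutation(len(x))   # partner index for each sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam
```

Training on such interpolated pairs encourages the classifier to behave linearly between samples, which is the usual argument for why mixup yields smoother decision boundaries.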

Affiliations: Institute of Remote Sensing and Geographic Information System, School of Earth and Space Sciences, Peking University, Beijing, China; National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Department of Geography, Environment and Society, University of Minnesota, Minneapolis, MN, USA; JD Technology & JD Intelligent Cities Research, Beijing, China; Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA

Abstract:
Human activity intensity prediction is crucial to many location-based services. Despite tremendous progress in modeling the dynamics of human activity, most existing methods overlook the physical constraints of spatial interaction, leading to uninterpretable spatial correlations and an over-smoothing phenomenon. To address these limitations, this work proposes a physics-informed deep learning framework, namely the Gravity-informed Spatiotemporal Transformer (Gravityformer), which integrates the universal law of gravitation to refine transformer attention. Specifically, it (1) estimates two spatially explicit mass parameters based on spatiotemporal embedding features, (2) models the spatial interaction in an end-to-end neural network using the proposed adaptive gravity model to learn the physical constraint, and (3) utilizes the learned spatial interaction to guide transformer attention and mitigate its over-smoothing phenomenon. Moreover, a parallel spatiotemporal graph convolution transformer is proposed to balance coupled spatial and temporal learning. Systematic experiments on six real-world large-scale activity datasets demonstrate the quantitative and qualitative superiority of our model over state-of-the-art benchmarks. Additionally, the learned gravity attention matrix can not only be disentangled and interpreted based on geographical laws, but also improves generalization in zero-shot cross-region inference. This work provides a novel insight into integrating physical laws with deep learning for spatiotemporal prediction.

Abstract:
Short-term origin-destination (OD) demand prediction is critical in managing the multimodal transportation system. The joint short-term OD demand prediction for multimodal systems faces three challenges: (1) data availability: real-time OD demand is not available for prediction; (2) sparsity and high dimensionality of OD demand: the OD demand is spatiotemporally sparse and usually high-dimensional; (3) impact of different transportation modes: the future OD demand for one mode is affected by others, yet most studies focus on a single transportation mode, overlooking the influence between different modes. To tackle these challenges, we propose a multitask learning and Partial-Differential-based model to predict the short-term Multimodal Transport Systems OD demand (PD-MTSOD), which includes (1) an OD demand learner to estimate real-time OD demand, (2) data aggregation with hypergraph attention to capture spatiotemporal features, and (3) decomposition of OD demand into self-generated increment, other-modes-generated increment, and real-time OD demand, using partial-differential-based methods to model intermodal correlations. Extensive tests on Beijing and New York City’s multimodal systems show that PD-MTSOD surpasses baseline models. In addition, we prove the benefits of jointly considering multiple transportation modes and explore the correlations between different transportation modes. This paper offers a reliable method for understanding multimodal transportation systems.

Abstract:
In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects’ physical commonsense based on both video and audio input, with the main challenge being how to imitate the reasoning ability of humans, even in the scenario of missing modalities. Most of the current methods fail to take full advantage of different characteristics in multi-modal data, and the lack of causal reasoning ability in models impedes the progress of implicit physical knowledge inference. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using the disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the incomplete modality data issue, we introduce a robust multimodal learning method to recover the missing data by decomposing the shared features and model-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline, including VLMs. In experiments, we show that our proposed method improves the reasoning accuracy and robustness of baseline methods and achieves the state-of-the-art performance.

Abstract:
Open-vocabulary semantic segmentation aims to partition an image into distinct semantic regions based on an open set of categories. Existing approaches primarily rely on image-level pre-trained vision-language models to perform this pixel-level task. In this paper, we propose SED, a simple yet effective encoder-decoder architecture for open-vocabulary semantic segmentation leveraging pre-trained vision-language models. SED consists of a hierarchical image encoder, a text encoder, and a gradual fusion decoder. The hierarchical image encoder and text encoder collaboratively generate a cost volume, which is progressively decoded by the gradual fusion decoder to produce segmentation results. In contrast to a plain encoder, the hierarchical encoder better captures image detail information while maintaining linear computational complexity with respect to input size. The gradual fusion decoder adopts a top-down structure to progressively integrate high-resolution features with the cost volume. Furthermore, a category early rejection strategy is introduced in the gradual fusion decoder to filter out non-existent categories at different layers, significantly improving inference efficiency. Based on SED, we further introduce two modules, including non-label text embedding and additional category early rejection in the encoder. Moreover, we extend our method with minimal decoder modification for open-vocabulary video semantic segmentation. Extensive experiments on multiple datasets validate the effectiveness and efficiency of our proposed method. With ConvNeXt-B, our method achieves an mIoU of 34.9% on ADE20K with 150 classes (i.e., A-150) at an inference speed of 69 ms per image on a single A6000 GPU, and has an mIoU score of 40.2% on the video segmentation dataset VSPW.

Abstract:
Deep learning has demonstrated remarkable generalization capability with independent and identically distributed (i.i.d.) training and test data; however, it often struggles with data drawn from different, albeit causally related, distributions. This problem is generally known as Out-of-Distribution (OoD) generalization. While there is a plethora of algorithms proposed for OoD generalization, the current understanding of the data commonly employed to evaluate these algorithms remains relatively naive. In this study, we identify two distinct types of distribution shifts, namely diversity shift and correlation shift, that are ubiquitous in various OoD datasets. We propose a quantifiable formal definition for the two shifts and show that the performance of OoD algorithms is upper bounded by them. To validate our theoretical insight, we evaluate a number of OoD generalization algorithms across two groups of datasets from both classification and object detection areas, each dominated by one of the shifts, exposing the strengths of the algorithms against one shift as well as their limitations against the other. We further prove that all performance degradations caused by data distribution shifts can be attributed to these two types of shifts defined in our paper. The benchmark integrates existing datasets and algorithms from different research areas that seem unrelated into a coherent picture, which may serve as a foundation for future OoD generalization research.

Abstract:
Video causal reasoning aims to achieve a high-level understanding of videos from a causal perspective. However, it exhibits limitations in its scope, primarily executed in a question-answering paradigm and focusing on brief video segments containing isolated events and basic causal relations, lacking comprehensive and structured causality analysis for videos with multiple interconnected events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD identifies the causal associations between these events to derive a comprehensive and structured event-level video causal graph explaining why and how the result event occurred. To address the challenges of MECD, we devise a novel framework inspired by the Granger Causality method, incorporating an efficient mask-based event prediction model to perform an Event Granger Test. It estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to mitigate challenges in MECD like causality confounding and illusory causality. Additionally, context chain reasoning is introduced to conduct more robust and generalized reasoning. Experiments validate the effectiveness of our framework in reasoning complete causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%, respectively. Further experiments demonstrate that causal relation graphs can also contribute to downstream video understanding tasks such as video question answering and video event prediction.

Abstract:
Graph neural networks have shown remarkable success in exploiting the spatial and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor generalization ability under distribution shifts, which is inevitable in dynamic scenarios. As dynamic graph generation progresses amid evolving latent non-stationary environments, it is imperative to explore their effects on out-of-distribution (OOD) generalization. This paper proposes a novel Evolving Graph Learning framework for OOD generalization (EvoGOOD) by environment-aware invariant pattern recognition. Specifically, we first design an environment sequential variational auto-encoder to model environment evolution and infer underlying environment distribution. Then, we introduce a mechanism for environment-aware invariant pattern recognition, tailored to address environmental diversification through inferred distributions. Finally, we conduct fine-grained causal interventions on individual nodes using a mixture of instantiated environment samples. This approach helps to distinguish spatio-temporal invariant patterns for OOD prediction, especially in non-stationary environments. Experimental results demonstrate the superiority of EvoGOOD on both real-world and synthetic dynamic datasets under distribution shifts. To the best of our knowledge, it is the first attempt to study the dynamic graph OOD generalization problem from the environment evolution perspective.

Abstract:
In recent years, intelligent vehicles operating in urban environments have demonstrated the capability to autonomously execute various tasks, such as object detection, lane detection, segmentation, etc. This advancement is facilitated by the extensive datasets accumulated by researchers, alongside advancements in intelligent algorithms, as well as significant breakthroughs in software and hardware. However, within the autonomous driving community, there is a scarcity of data regarding scenarios encountered in mining environments. This scarcity presents challenges and bottlenecks for the advancement of comprehensive autonomous driving systems and autonomous operations. Although we previously released our dataset, AutoMine, which includes over 18 hours of driving data in open-pit mines, its scope is limited to two specific tasks. This scope limitation impedes the training and validation of the majority of algorithms for different tasks in this particular scenario. To broaden the scope of autonomous driving visual tasks in mining environments, we have curated a diverse collection encompassing multiple tasks, including detection, segmentation, tracking, etc. Additionally, we have established benchmarks and set up baselines for the aforementioned multiple tasks. By comparing the performance differences of visual algorithms between mining areas and other scenarios, we demonstrate the distinctive characteristics of mining regions in an intuitive manner. We have developed a suite of tools for converting annotated data into the standardized format used in existing driving datasets. Our aspiration is to establish data and benchmark foundations, supporting research endeavors in intelligent transportation within mining environments and autonomous driving in comprehensive scenarios. Our project website can be seen in AutoMine, and the dataset can be downloaded via AutoMine-Benchmark.

Abstract:
This work studies sparse adversarial perturbations, including both unstructured and structured ones. We propose a framework based on a white-box PGD-like attack method named Sparse-PGD to effectively and efficiently generate such perturbations. Furthermore, we combine Sparse-PGD with a black-box attack to comprehensively and more reliably evaluate the models’ robustness against unstructured and structured sparse adversarial perturbations. Moreover, the efficiency of Sparse-PGD enables us to conduct adversarial training to build robust models against various sparse perturbations. Extensive experiments demonstrate that our proposed attack algorithm exhibits strong performance in different scenarios. More importantly, compared with other robust models, our adversarially trained model demonstrates state-of-the-art robustness against various sparse attacks.

Abstract:
This paper aims to tackle the problem of modeling dynamic urban streets for autonomous driving scenes. Recent methods extend NeRF by incorporating tracked vehicle poses to animate vehicles, enabling photo-realistic view synthesis of dynamic urban street scenes. However, significant limitations are their slow training and rendering speed. We introduce Street Gaussians, a new explicit scene representation that tackles these limitations. Specifically, the dynamic urban scene is represented as a set of point clouds equipped with semantic logits and Gaussian primitives, each associated with either a foreground object or the background. To model the dynamics of foreground objects, each object point cloud is optimized with optimizable tracked poses, along with a 4D spherical harmonics model for the dynamic appearance. The explicit representation allows easy composition of objects and background, which in turn allows for scene editing operations and rendering at 135 FPS (1066 × 1600 resolution) within half an hour of training. The proposed method is evaluated on multiple challenging benchmarks, including KITTI and Waymo Open datasets. Experiments show that the proposed method consistently outperforms state-of-the-art methods across all datasets.

Abstract:
Existing low-light enhancement methods typically rely on fitting data mappings (pixel-wise mappings through fully supervised methods or distribution-wise mappings through weakly supervised or self-supervised methods). However, their performance is heavily dependent on specific scenes and fails to adequately model the intrinsic prior of natural images, resulting in poor generalization. To tackle this challenge, we leverage the strengths of powerful generative diffusion models, conditioned on a thoughtfully designed prior, and propose a novel zero-reference low-light enhancement framework that gets rid of dependence on the distribution of low-light images. In detail, we address the most fundamental core by proposing an illumination-invariant prior derived from the theory of physical light transfer, bridging the gap between normal and low-light domains, and enabling zero-shot enhancement without the need for low-light-specific training. A prior-to-image restoration framework is built upon generative diffusion models, pre-trained on normal-light data. During inference, the framework extracts the illumination-invariant prior from low-light inputs and maps them back to high-quality images, naturally for low-light enhancement. Additionally, such intrinsic properties of illumination-invariant prior open up opportunities for distilling diffusion models into compact CNN-based networks. We propose a novel prior-injected distillation paradigm incorporating intensity, frequency, and gradient domain-augmented regularization comprehensively. This distillation framework not only reduces computational costs but also maintains high fidelity and perceptual quality in enhanced outputs, making it more efficient and practical for real-world applications. The approach further extends seamlessly to handle over-exposure scenarios, demonstrating its versatility in addressing complex lighting conditions. 
Extensive experiments demonstrate the superiority of our framework in various scenarios, as well as its strong interpretability, robustness, and efficiency.

Abstract:
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We propose a novel approach to narrow the gap by mining the potential of VLMs for better performance across various cross-modal tasks. It tackles the following questions: (1) How can high-resolution visual tokens improve image understanding without lengthening the token sequence? (2) How to improve reasoning and generation abilities of VLM with high-quality data? (3) How to close the gap between open-source VLMs and proprietary models on reasoning-driven generation? In particular, to enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. The proposed model supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to attain 80.6% accuracy on the MMB benchmark (+5.4 vs Gemini Pro) and 74.1% on TextVQA (+4.6 vs LLaVA-NeXT), achieving leading performance in several zero-shot benchmarks and even surpassing the developed private models. Furthermore, Mini-Gemini is proven to improve consistently with stronger LLM, visual encoder, and data in experiments.

Abstract:
Time series domain adaptation aims to transfer the complex temporal dependence from the labeled source domain to the unlabeled target domain. Recent advances leverage the stable causal mechanism over observed variables to model the domain-invariant temporal dependence. However, modeling precise causal structures in high-dimensional data, such as videos, remains challenging. Additionally, direct causal edges may not exist among observed variables (e.g., pixels). These limitations hinder the applicability of existing approaches to real-world scenarios. To address these challenges, we find that the high-dimension time series data are generated from the low-dimension latent variables, which motivates us to model the causal mechanisms of the temporal latent process. Based on this intuition, we propose a latent causal mechanism identification framework that guarantees the uniqueness of the reconstructed latent causal structures. Specifically, we first identify latent variables by utilizing sufficient changes in historical information. Moreover, by enforcing the sparsity of the relationships of latent variables, we can achieve identifiable latent causal structures. Built on the theoretical results, we develop the Latent Causality Alignment (LCA) model that leverages variational inference, which incorporates an intra-domain latent sparsity constraint for latent structure reconstruction and an inter-domain latent sparsity constraint for domain-invariant structure reconstruction. Experiment results on eight benchmarks show a general improvement in the domain-adaptive time series classification and forecasting tasks, highlighting the effectiveness of our method in real-world scenarios.

Abstract:
Graph Convolution Networks (GCNs) have achieved remarkable success in the representation of structured graph data. Traditional GCNs are generally defined on a fixed first-order neighborhood receptive field, which makes them unable to capture long-range dependencies between distant nodes and vulnerable to graph attacks and noise. To address these limitations, we revisit deformable convolution on graphs and propose a novel deformable graph convolution, termed Neighborhood-Deformable Graph Convolution (NDGC). The core of NDGC is to explicitly realize deformable convolution on graphs by introducing virtual neighbors, which encode large-range information via offsetting and interpolation functions. That is, the introduced virtual neighbors provide a larger receptive field with a deformable receptive shape for defining graph convolution. In addition, NDGC conducts message aggregation on the deformable virtual neighbors, which makes it more robust to graph attacks and noise. In particular, NDGC provides a general neighborhood-deformable scheme that integrates seamlessly with many graph convolution definitions to derive their deformable variants. Experimental results validate the effectiveness and advantages of the proposed NDGC networks on several graph learning tasks.

Abstract:
In this paper, we explore the problem of event-based meshflow estimation, a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start, we review the state-of-the-art in event-based flow estimation, highlighting two key areas for further research: i) the lack of meshflow-specific event datasets and methods, and ii) the underexplored challenge of event data density. First, we generate a large-scale High-Resolution Event Meshflow (HREM) dataset, which offers a high resolution of 1280 × 720, covers dynamic objects and complex motion patterns, and provides both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides, we propose the Efficient Event-based MeshFlow (EEMFlow) network, a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore, we upgrade the EEMFlow network to support dense event optical flow, in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30× faster) of our EEMFlow model compared to the recent state-of-the-art flow method. As an extension, we expand HREM into HREM+, a multi-density event dataset that supports a thorough study of the robustness of existing methods across data with varying densities, and propose an Adaptive Density Module (ADM) to adjust the density of input event data to a more suitable range, enhancing the model’s generalization ability. We empirically demonstrate that ADM helps to significantly improve the performance of EEMFlow and EEMFlow+ by 8% and 10%, respectively.

Abstract:
Neural Architecture Search (NAS) has attracted increasing attention in recent years because of its capability to design neural networks automatically. Among NAS methods, differentiable approaches such as DARTS have gained popularity for their search efficiency. However, they still suffer from three main issues: weak stability due to performance collapse, poor generalization ability of the searched architectures, and inferior robustness to different kinds of proxies (i.e., computationally reduced search configurations). To solve the search stability and generalization problems, a simple but effective regularization method, termed Beta-Decay, is proposed to regularize the DARTS-based NAS searching process (referred to as $\beta$-DARTS). Specifically, Beta-Decay regularization imposes constraints that keep the value and variance of activated architecture parameters from becoming too large, thereby ensuring fair competition among architecture parameters and making the supernet less sensitive to the impact of input on the operation set. In-depth theoretical analyses of how and why it works are provided, and comprehensive experiments on a variety of search spaces and datasets validate that Beta-Decay regularization helps to stabilize the searching process and makes the searched network more transferable across different datasets. To address the proxy robustness problem, we first benchmark differentiable NAS methods under a wide range of proxy data, proxy channels, proxy layers, and proxy epochs, since the robustness of NAS under different kinds of proxies has not been explored before. We then draw some interesting findings and observe that $\beta$-DARTS achieves the best result among all compared NAS methods under almost all proxy settings. We further introduce a novel flooding regularization to the weight optimization of $\beta$-DARTS (termed Bi-level regularization), and experimentally and theoretically verify its effectiveness for improving the proxy robustness of differentiable NAS.
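The two regularizers named above can be illustrated with a minimal sketch. It assumes that the Beta-Decay term takes a log-sum-exp (smooth maximum) form over the architecture parameters of one edge, and that flooding replaces a training loss L with |L - b| + b for a flood level b. Both forms, and the function names `beta_decay_penalty` and `flooded_loss`, are assumptions made for illustration; the abstract does not state the exact formulas.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def beta_decay_penalty(alphas):
    """Assumed Beta-Decay term: log-sum-exp of the architecture parameters."""
    m = max(alphas)
    return m + math.log(sum(math.exp(a - m) for a in alphas))


def beta_decay_gradient(alphas):
    """Gradient of log-sum-exp w.r.t. each parameter, which equals softmax."""
    return softmax(alphas)


def flooded_loss(loss, flood_level):
    """Assumed flooding form: the loss is reflected around the flood level."""
    return abs(loss - flood_level) + flood_level


if __name__ == "__main__":
    alphas = [2.0, 0.5, -1.0]
    print("penalty:", beta_decay_penalty(alphas))
    # The largest parameter receives the largest gradient, so a descent step
    # on the penalty shrinks dominant parameters fastest.
    print("gradient:", beta_decay_gradient(alphas))
    print("flooded:", flooded_loss(0.1, 0.3))
```

Under this assumed form, the penalty gradient equals the softmax weights, so the currently dominant operation is penalized most, which matches the stated goal of keeping activated architecture parameters from growing too large.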

Abstract:
Single object tracking aims to localize a target object with specific reference modalities (bounding box, natural language, or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal, or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. In practice, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with the above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking, and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% on the main metric across all three RGB+X video modalities.

Abstract:
A common assumption in matrix completion (MC) and tensor completion (TC) is that the missing locations are sampled randomly. However, in real-world scenarios, the unobserved elements are often not arbitrarily located, and may concentrate within entire rows or columns. We refer to this missing mechanism as structural missingness, and traditional MC and TC schemes suffer from drastic degradation under these circumstances. This work addresses the challenge of restoring structural missingness by introducing a novel framework for simultaneously reconstructing multiple matrices, called multi-matrix completion (MMC). In MMC, tri-factorization across matrices captures the correlation between matrices, and Tikhonov regularization on each matrix exploits its correlation. This design enables MMC to efficiently handle both random and structural missingness. In addition, MMC is not affected by the smoothness along matrices which makes it suitable for a wider variety of data compared to Fourier transform based TC methods. The alternating direction method of multipliers is utilized to solve the resultant optimization problem. The global convergence of the algorithm is supported by comprehensive theoretical analyses. We demonstrate the versatility of MMC through extensive experiments in image and video restoration, and showcase its superior performance in comparison to traditional MC and TC methods.

Abstract:
Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.

Abstract:
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

Affiliations: National Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai, China; Seed Group of ByteDance, Beijing, China; National Key Laboratory for Multi-modal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; University of Oulu, Oulu, Finland; National Key Laboratory for Multi-modal Artificial Intelligence Systems, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China; Department of Automation, Tsinghua University, Beijing, China

Abstract:
Partial label learning (PLL) is a typical weakly supervised learning, where each sample is associated with a set of candidate labels. Its basic assumption is that the ground-truth label must be in the candidate set, but this assumption may not be satisfied due to the unprofessional judgment of annotators. Therefore, we relax this assumption and focus on a more general task, noisy PLL, where the ground-truth label may not exist in the candidate set. To address this challenging task, we propose a novel framework called “Iterative Refinement Network (IRNet)”, aiming to purify noisy samples through two key modules (i.e., noisy sample detection and label correction). To achieve better performance, we exploit smoothness constraints to reduce prediction errors in these modules. Through theoretical analysis, we prove that IRNet is able to reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. Meanwhile, IRNet is a plug-in strategy that can be integrated with existing PLL approaches. Experimental results on multiple benchmark datasets show that IRNet outperforms state-of-the-art approaches on noisy PLL.

Abstract:
We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy.

Abstract:
This paper focuses on representation learning for dynamic graphs with temporal interactions. A fundamental issue is that both the graph structure and the nodes have their own dynamics, and their blending induces intractable complexity in the temporal evolution over graphs. Drawing inspiration from the recent progress of physical dynamic models in deep neural networks, we propose Graph Neural Controlled Differential Equations (GN-CDEs), a continuous-time framework that jointly models node embeddings and structural dynamics by incorporating a graph-enhanced neural network vector field with a time-varying graph path as the control signal. Our framework exhibits several desirable characteristics, including the ability to express dynamics on evolving graphs without piecewise integration, the capability to calibrate trajectories with subsequent data, and robustness to missing observations. Empirical evaluation on a range of dynamic graph representation learning tasks demonstrates the effectiveness of our proposed approach in capturing the complex dynamics of dynamic graphs.

Abstract:
We study distributed principal component analysis (PCA) for large-scale federated data when the sample size $n$ and dimension $d$ are both ultra-large. This type of data is currently very common, but faces numerous challenges in PCA learning, such as communication overhead and computational complexity. We develop a new algorithm $\mathsf{FedFask}$ (Fast Sketching for Federated learning) with lower communication cost $O(dr)$ and lower computational complexity $O(d(np/m+p^2+r^2))$, where $m$ is the number of workers, $r$ is the rank of the matrix, $p$ is the dimension of the sketched column space, and $r \leq p \ll d$. In $\mathsf{FedFask}$, we adopt and develop technologies such as fast sketching, alignments with orthogonal Procrustes Fixing, and matrix Stiefel manifold via Kolmogorov-Nagumo-type average. Thus, $\mathsf{FedFask}$ has a higher accuracy, lower stochastic variation, and the best representation of multiple randomly projected eigenspaces, and avoids the orthogonal ambiguity of eigenspaces. We show that $\mathsf{FedFask}$ achieves the same rate of learning $O\left(\frac{\kappa_r r}{\lambda_r}\sqrt{\frac{r}{n}}\right)$ as centralized PCA that uses all data, and tolerates more workers for parallel acceleration. We conduct extensive experiments to demonstrate the effectiveness of $\mathsf{FedFask}$.
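The orthogonal ambiguity mentioned above can be made concrete with the standard orthogonal Procrustes problem. The statement below is the textbook form, not necessarily the exact alignment step used by the authors; the symbols $\hat V_k$ (eigenspace estimate from worker $k$, of size $d \times r$), $\hat V_{\mathrm{ref}}$ (a reference estimate), and $O_k$ are introduced here for illustration only.

\[
O_k = \arg\min_{O \in \mathbb{R}^{r \times r},\; O^\top O = I_r} \bigl\| \hat V_k O - \hat V_{\mathrm{ref}} \bigr\|_F ,
\qquad
O_k = U_k W_k^\top \quad \text{where} \quad \hat V_k^\top \hat V_{\mathrm{ref}} = U_k \Sigma_k W_k^\top \ \text{is a singular value decomposition.}
\]

Because $\hat V_k$ and $\hat V_k O$ span the same column space for any orthogonal $O$, local estimates are only comparable after such a rotation; averaging the aligned matrices $\hat V_k O_k$ is then meaningful, whereas averaging the raw $\hat V_k$ is not.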

Abstract:
Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and uncrewed aerial vehicles (UAV)-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods.

Abstract:
Multi-view clustering (MVC), as an important machine learning task, aims to group data into distinct groups by leveraging complementary and consistent information across multiple views. During the last two decades, it has been widely studied, and many methods have been proposed, which has brought considerable development to this field. However, few works comprehensively summarize existing methods and point out the potential challenges in this field for the coming decades. To this end, our survey thoroughly reviews existing MVC methods according to three taxonomies, i.e., techniques, fusion strategies, and scenarios. Specifically, seven typical techniques, four fusion strategies, and five typical scenarios are included. Besides, we also collect the commonly used datasets and analyze the performance of typical MVC methods. Moreover, we summarize six application scenarios of existing MVC methods, ranging from computer vision and information retrieval tasks to medical diagnosis and bio-informatics. In particular, we point out seven interesting future directions in this field, which we hope will inform future research.

Abstract:
The scarcity of annotations poses a significant challenge in medical image analysis, which demands extensive efforts from radiologists, especially for high-dimension 3D medical images. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo implicitly encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. To assess effectiveness, we (1) introduce PreCT-160K, the largest medical image pre-training dataset to date, which comprises 160K Computed Tomography (CT) volumes covering diverse anatomic structures; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a comprehensive benchmark encompassing 51 medical tasks, including segmentation, classification, registration, and vision-language. Extensive experiments highlight the superiority of VoCo, showcasing promising transferability to unseen modalities and datasets. VoCo notably enhances performance on datasets with limited labeled cases and significantly expedites fine-tuning convergence.

Abstract:
Learning over dynamic graphs poses major challenges, including capturing the evolving relationship in the graphs. Inspired by the advantages of hyperbolic embedding in static graphs, the hyperbolic space is expected to capture complex interactions in dynamic graphs. However, due to the distortion errors in the standard tangent space mappings, hyperbolic methods become more sensitive to noise and reduce the learning capacity. To address the distortion in tangent space, we proposed HMPTGN, a temporal graph network that operates directly on the hyperbolic manifold. In this journal paper, we introduce the HMPTGN+ architecture, an extension of the original HMPTGN with major updates to learn better representations of dynamic graphs based on the hyperbolic embedding. Our framework incorporates a high-order graph neural network for extracting spatial dependencies, a dilated causal attention mechanism for modeling temporal patterns while preserving causality, and a curvature-awareness mechanism to capture dynamic structures. Extensive experiments demonstrate the effectiveness of our proposed HMPTGN+ framework over state-of-the-art baselines in both temporal link prediction and temporal new link prediction tasks.

Abstract:
Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
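For readers unfamiliar with the loss family mentioned above, the following minimal sketch shows a plain symmetric Chamfer distance between two point clouds, with an optional truncation standing in for a robust penalty. It is not the Robust Consistency-aware Chamfer Distance of the paper: the multi-frame consistency term is omitted, and the truncation penalty and the names `chamfer` and `truncate` are assumptions chosen for illustration.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from one point to its nearest neighbour in a cloud."""
    return min(math.dist(point, q) for q in cloud)


def chamfer(p_cloud, q_cloud, truncate=None):
    """Symmetric Chamfer distance; distances are capped at `truncate` if given.

    Capping limits the influence of outlier points (an assumed robust penalty).
    """
    def penalty(d):
        return d if truncate is None else min(d, truncate)

    forward = sum(penalty(nearest_distance(p, q_cloud)) for p in p_cloud) / len(p_cloud)
    backward = sum(penalty(nearest_distance(q, p_cloud)) for q in q_cloud) / len(q_cloud)
    return forward + backward


if __name__ == "__main__":
    warped = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (50.0, 0.0, 0.0)]  # last point is an outlier
    target = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)]
    print("plain:", chamfer(warped, target))
    print("truncated:", chamfer(warped, target, truncate=1.0))
```

In a self-supervised setting, the first cloud would typically be the current frame warped by the predicted motion and the second the next frame, so a small distance indicates a plausible motion estimate.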

Abstract:
High-order correlations, which capture complex interactions among multiple entities, extend beyond traditional graph representations and support a wider range of applications. However, existing neural network models for high-order correlations encounter scalability issues on large datasets due to the substantial computational complexity involved in processing large-scale structures. In addition, long-tailed distributions, which are common in real-world data, result in underrepresented categories and hinder the model’s ability to learn effective high-order interaction patterns for rare instances. To address these issues, we introduce a novel framework known as HyperGraph-based High-order Correlation analysis (HGHC) for large-scale long-tailed data classification. Firstly, to tackle the long-tailed distribution problem, HGHC generates synthetic vertices and computes their attributed high-order correlations using an oversampling module inspired by SMOTE, termed HSMOTE, to enhance the representation of tail categories. Secondly, for efficient computational scaling, we treat the data as having two modalities: the structural modality capturing high-order relationships and the feature modality representing individual attributes. We perform computations on both CPU and GPU separately and then fuse the results to achieve a lightweight vertex transformation and aggregation scheme for high-order correlation data. Additionally, we contribute the first benchmark for large-scale long-tailed datasets involving high-order correlations, known as Amazon-LT, which includes multiple datasets with varying imbalance ratios. Our experimental results demonstrate that HGHC achieves state-of-the-art performance in handling high-order correlation analysis issues for large-scale, long-tailed data.
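The oversampling idea referenced above can be sketched with the interpolation step of classic SMOTE, which creates a synthetic sample on the line segment between a minority sample and one of its nearest minority neighbours. This is only the feature-space part; how HSMOTE computes the attributed high-order correlations of a synthetic vertex is not described in the abstract and is not attempted here. The function name `smote_sample` and its parameters are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sample(minority, k, rng):
    """Create one synthetic feature vector from a list of tail-class samples.

    A base sample is drawn at random, one of its k nearest minority neighbours
    is chosen, and the result is a random convex combination of the two.
    """
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]


if __name__ == "__main__":
    tail_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    rng = random.Random(0)
    print([smote_sample(tail_class, k=2, rng=rng) for _ in range(3)])
```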

Abstract:
We propose an Event-Based Snow Removal algorithm called EBSnoR. We developed a technique to measure the dwell time of snowflakes on a pixel using event-based camera data, which is used to carry out a statistically optimal dwell time thresholding to partition the event stream into snowflake and background events. The effectiveness of the proposed EBSnoR was verified qualitatively on a new dataset called UDayton25EBSnow, comprising data from a front-facing event-based camera in a car driving through snow with manually annotated bounding boxes around surrounding vehicles, as well as quantitatively using a new snowflake event simulator called EBSnoGen. Qualitatively, EBSnoR correctly identifies events corresponding to snowflakes; quantitatively, EBSnoR achieved an accuracy of 96.19%. Additional experiments showed that snow removal improved event-based object detection performance.

Abstract:
Integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, performing well only on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections increase significantly, posing additional challenges. Leveraging the hierarchical memory structure of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory, we propose MovieChat with a training-free memory consolidation mechanism to overcome these challenges, which transfers dense frames from short-term memory into sparse tokens in long-term memory by temporally merging adjacent frames. We lift pre-trained large multi-modal models to understand long videos without additional trainable modules, employing a zero-shot approach. Additionally, in our new version, MovieChat+, we design an enhanced training-free memory consolidation mechanism based on vision-question matching to better anchor predictions to relevant visual content. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations.
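The idea of temporally merging adjacent frames can be sketched as a greedy procedure. The version below assumes that each frame is a single feature vector, that the most similar adjacent pair (by cosine similarity) is merged first, and that merging is a plain average; these choices and the name `consolidate` are illustrative assumptions rather than the exact mechanism of MovieChat.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consolidate(frames, target_len):
    """Greedily merge the most similar adjacent frames until target_len remain."""
    memory = [list(f) for f in frames]
    while len(memory) > max(target_len, 1):
        sims = [cosine(memory[i], memory[i + 1]) for i in range(len(memory) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # most redundant pair
        merged = [(a + b) / 2 for a, b in zip(memory[i], memory[i + 1])]
        memory[i:i + 2] = [merged]  # replace the pair with its average
    return memory


if __name__ == "__main__":
    short_term = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
    long_term = consolidate(short_term, target_len=3)
    print(long_term)
```

Because only neighbouring frames are merged, temporal order is preserved while near-duplicate frames collapse into a single token, which is what keeps memory bounded for long videos.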

Abstract:
This article considers the decentralized composite optimization problem. We propose a novel decentralized variance-reduction proximal-gradient algorithmic framework, called PMGT-VR, which combines several techniques, including multi-consensus, gradient tracking, and variance reduction. The proposed framework imitates centralized algorithms and algorithms under this framework achieve convergence rates similar to that of their centralized counterparts. We also describe and analyze two representative algorithms, PMGT-SAGA and PMGT-LSVRG, and compare them to existing state-of-the-art proximal algorithms. To the best of our knowledge, PMGT-VR is the first linearly convergent decentralized stochastic algorithm that can solve decentralized composite optimization problems. Numerical experiments are provided to demonstrate the effectiveness of the proposed algorithms.

Abstract:
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H2OT), for efficient transformer-based 3D human pose estimation from videos. H2OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H2OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method.
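The prune-then-recover pattern described above can be sketched with the simplest possible stand-ins: uniform sampling of representative frames for pruning, and linear interpolation for recovery. The actual TPM and TRM may use different selection and recovery strategies; the functions `prune_indices` and `recover_tokens` below are hypothetical and only illustrate the change in sequence length.

```python
def prune_indices(num_frames, num_kept):
    """Pick evenly spaced frame indices, always keeping the first and last."""
    if num_kept <= 1:
        return [0]
    return [i * (num_frames - 1) // (num_kept - 1) for i in range(num_kept)]


def recover_tokens(kept_idx, kept_tokens, num_frames):
    """Expand kept tokens back to num_frames by linear interpolation in time."""
    if len(kept_idx) == 1:
        return [list(kept_tokens[0]) for _ in range(num_frames)]
    full = []
    seg = 0  # index of the kept-frame interval that contains frame t
    for t in range(num_frames):
        while seg < len(kept_idx) - 2 and t > kept_idx[seg + 1]:
            seg += 1
        lo, hi = kept_idx[seg], kept_idx[seg + 1]
        w = min(max((t - lo) / (hi - lo), 0.0), 1.0)
        left, right = kept_tokens[seg], kept_tokens[seg + 1]
        full.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return full


if __name__ == "__main__":
    tokens = [[float(t), 2.0 * t] for t in range(9)]   # 9 frames, 2-dim tokens
    idx = prune_indices(len(tokens), num_kept=3)        # frames kept for the middle blocks
    kept = [tokens[i] for i in idx]
    restored = recover_tokens(idx, kept, len(tokens))   # back to full length
    print(idx, len(kept), len(restored))
```

In this sketch the intermediate blocks would operate on 3 tokens instead of 9, and the full-length output is restored only at the end, which is where the efficiency gain of such a design comes from.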

Abstract:
The rise of autonomous vehicles has significantly increased the demand for robust 3D object detection systems. While cameras and LiDAR sensors each offer unique advantages—cameras provide rich texture information and LiDAR offers precise 3D spatial data—relying on a single modality often leads to performance limitations. This paper introduces MV2DFusion, a multi-modal detection framework that integrates the strengths of both worlds through an advanced query-based fusion mechanism. By introducing an image query generator to align with image-specific attributes and a point cloud query generator, MV2DFusion effectively combines modality-specific object semantics without biasing toward one single modality. Then the sparse fusion process can be accomplished based on the valuable object semantics, ensuring efficient and accurate object detection across various scenarios. Our framework’s flexibility allows it to integrate with any image and point cloud-based detectors, showcasing its adaptability and potential for future advancements. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that MV2DFusion achieves state-of-the-art performance, particularly excelling in long-range detection scenarios.

Abstract:
Due to the inherent imbalance in real-world datasets, naïve Empirical Risk Minimization (ERM) tends to bias the learning process towards the majority classes, hindering generalization to minority classes. To rebalance the learning process, one straightforward yet effective approach is to modify the loss function via class-dependent terms, such as re-weighting and logit-adjustment. However, existing analysis of these loss-oriented methods remains coarse-grained and fragmented, failing to explain some empirical results. After reviewing prior work, we find that the properties used through their analysis are typically global, i.e., defined over the whole dataset. Hence, these properties fail to effectively capture how class-dependent terms influence the learning process. To bridge this gap, we turn to explore the localized versions of such properties i.e., defined within each class. Specifically, we employ localized calibration to provide consistency validation across a broader range of losses and localized Lipschitz continuity to provide a fine-grained generalization bound. In this way, we reach a unified perspective for improving and adjusting loss-oriented methods. Finally, a principled learning algorithm is developed based on these insights. Empirical results on both traditional ResNets and foundation models validate our theoretical analyses and demonstrate the effectiveness of the proposed method.
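The two class-dependent modifications named above, re-weighting and logit adjustment, can be illustrated in their commonly used textbook forms. The sketch below is not the principled algorithm proposed in the paper: it assumes logit adjustment adds tau * log(prior) to each logit before the softmax, and that re-weighting scales the cross-entropy by 1 / (K * prior) for K classes. The function names and the `tau` parameter are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, label):
    """Plain cross-entropy for a single sample."""
    return -log_softmax(logits)[label]


def logit_adjusted_ce(logits, label, priors, tau=1.0):
    """Cross-entropy after adding tau * log(class prior) to every logit."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    return cross_entropy(adjusted, label)


def reweighted_ce(logits, label, priors):
    """Cross-entropy scaled by the inverse class prior (normalized by K)."""
    return cross_entropy(logits, label) / (priors[label] * len(priors))


if __name__ == "__main__":
    priors = [0.9, 0.1]          # class 1 is the minority class
    logits = [1.0, 1.0]          # the model is undecided
    print("plain:", cross_entropy(logits, 1))
    print("logit-adjusted:", logit_adjusted_ce(logits, 1, priors))
    print("re-weighted:", reweighted_ce(logits, 1, priors))
```

Both variants raise the loss on minority-class samples relative to plain ERM, but they do so through different class-dependent terms, which is why a per-class (localized) analysis of their effect is informative.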

Abstract:
Transferable adversarial images raise critical security concerns for computer vision systems in real-world, black-box attack scenarios. Although many transfer attacks have been proposed, existing research lacks a systematic and comprehensive evaluation. In this paper, we systemize transfer attacks into five categories around the general machine learning pipeline and provide the first comprehensive evaluation, with 23 representative attacks against 11 representative defenses, including the recent, transfer-oriented defense and the real-world Google Cloud Vision. In particular, we identify two main problems of existing evaluations: (1) for attack transferability, lack of intra-category analyses with fair hyperparameter settings, and (2) for attack stealthiness, lack of diverse measures. Our evaluation results validate that these problems have indeed caused misleading conclusions and missing points, and addressing them leads to new, consensus-challenging insights, such as (1) an early attack, DI, even outperforms all similar follow-up ones, (2) the state-of-the-art (white-box) defense, DiffPure, is even vulnerable to (black-box) transfer attacks, and (3) even under the same $L_p$ constraint, different attacks yield dramatically different stealthiness results regarding diverse imperceptibility metrics, finer-grained measures, and a user study. We hope that our analyses will serve as guidance on properly evaluating transferable adversarial images and advance the design of attacks and defenses.

Abstract:
Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusion model networks, which potentially limits their ability and application scope. Motivated by the observation that artifact outliers can be detected by a significant decline in the density from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which can significantly improve the quality of the generated image by suppressing the generation of low-quality samples. The biggest difference from existing guidance is that SG only relies on the sampling score function of the original diffusion or flow model at different noise levels, with no need for any tricky and expensive guidance-specific training. This makes SG highly flexible to be used in a plug-and-play manner by any diffusion or flow models. We also introduce an efficient variant of SG, named SG-prev, which reuses the output from the immediately previous diffusion step to avoid additional forward passes of the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-sourced diffusion models such as Stable Diffusion 3.5 and FLUX, SG exceeds existing algorithms on multiple metrics, including both FID and Human Preference Score. SG-prev also achieves strong results over both the baseline and the SG, with 50 percent more efficiency. Moreover, we find that SG and SG-prev both have a surprisingly positive effect on the generation of physiologically correct human body structures such as hands, faces, and arms, showing their ability to eliminate human body artifacts with minimal efforts.

Abstract:
Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts (an approximate 36.5% increase in costs). Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images usually have arbitrary orientations, small scales, and dense distribution, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32) mAP under 10%, 20%, and 30% labeled data settings, respectively, with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. Moreover, our method demonstrates stable generalization ability across different oriented detectors, even for multi-view oriented 3D object detectors.

Abstract:
Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, part of the existing VOS benchmarks mainly focuses on short-term videos, where objects remain visible most of the time. However, these benchmarks may not fully capture challenges encountered in practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average. Each video includes various attributes, especially challenges encountered in the wild, such as long-term reappearing and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects VOS models’ performance in real scenarios. Based on LVOS, we evaluate 15 existing VOS models under 3 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that one of the significant factors contributing to accuracy decline is the increased video length, interacting with complex challenges such as long-term reappearance, cross-temporal confusion, and occlusion, which emphasize LVOS’s crucial role. We hope our LVOS can advance development of VOS in real scenes.

Abstract:
Personalized federated learning (PFL) plays a pivotal role in ensuring efficient privacy preservation and secure collaborative learning. However, PFL faces significant challenges due to data heterogeneity and device diversity. To enhance personalization and robustness in PFL, we propose a novel model called FedNODE, which leverages hierarchical embeddings. FedNODE incorporates personalized, pseudo-generic, and fusion embeddings to facilitate hierarchical information representation. We utilize a hypernetwork based on neural ordinary differential equations (ODEs) within the server to generate backbone parameters for different clients, enabling the creation of personalized embeddings. Additionally, we introduce a pseudo-generic embedding based on a learnable vector to balance personalized and generic information. A neural ODE-based network follows the backbone module for each client, integrating personalized and pseudo-generic embeddings. To validate the efficacy of FedNODE, we conduct extensive evaluations across various classification datasets, encompassing diverse statistically heterogeneous settings and noisy scenarios. The results demonstrate that FedNODE achieves state-of-the-art performance.

Abstract:
Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.

Abstract:
Spatial transcriptomics has revolutionized the ability to investigate transcriptional patterns within tissue morphology. However, many ST clustering pipelines operate on a single preselected gene set, typically prioritizing either highly variable genes (HVGs) or spatially variable genes (SVGs), and therefore may not directly model how genes with different levels of global variability provide complementary cues for spatial domain identification. Although non-HVG signals can be partially captured through SVG selection and spatial graph modeling, a dedicated two-view formulation that disentangles high-variance and low-variance gene subsets and fuses them under a unified objective remains underexplored. To this end, we propose a Spatial Transcriptomics clustering framework for Cross-view information Fusion, termed STCF, which casts HVGs and low-variability genes (LVGs) as two gene-expression views and integrates them via a plug-and-play cross-view fusion strategy. Specifically, STCF introduces a cross-view fusion mechanism that employs reverse-scaled cosine error loss (R-SCE) to balance alignment and separation of gene embeddings, ensuring robust representation learning while preserving spatial coherence, which enhances the model’s ability to resolve fine-grained spatial structures. Extensive experiments on three benchmark datasets (DLPFC, HBC, and MBA) demonstrate the superiority, effectiveness, and transferability of STCF. Case studies further validate its ability to uncover latent spatial patterns and improve clustering precision.

Abstract:
Polarization, as an intrinsic property of light alongside amplitude and phase, has demonstrated great potential in a variety of downstream applications by providing valuable physical cues encoded in the degree of polarization (DoP) and the angle of polarization (AoP). Polarimetric imaging aims to acquire these polarimetric parameters by capturing polarized snapshots. However, compared to conventional imaging, it faces greater difficulties due to the presence of polarizers, which attenuate light intensity in a spatially variant manner. Such attenuation complicates exposure control: a short exposure leads to low signal-to-noise ratio and color distortion, whereas a relatively long exposure increases the risk of motion blur and saturation. To address these challenges, this work proposes PolFusion+, a unified framework that robustly produces clean and sharp polarized snapshots by complementarily fusing a degraded pair of short-exposed noisy and long-exposed blurry inputs. Building upon a polarization-aware three-phase fusion scheme, PolFusion+ introduces two key advancements. First, to handle saturation in the blurry snapshot, the irradiance restoration phase extracts and rectifies color information from both inputs, effectively mitigating saturation-induced degradation. Second, to ensure physically faithful polarization reconstruction, the framework explicitly models the individual characteristics and interdependencies of the DoP and AoP, enabling their joint restoration. These improvements are supported by a degradation-oriented neural network tailored to the fusion scheme. Experimental results demonstrate that PolFusion+ achieves state-of-the-art performance, effectively benefiting downstream applications.

Abstract:
Recently, the integration of the efficient feed-forward scheme into 3D Gaussian Splatting (3DGS) has been actively explored. However, most existing methods focus on sparse view reconstruction of small regions and cannot produce satisfactory whole-scene reconstruction results in terms of either quality or efficiency. In this paper, we propose FreeSplat++, which focuses on extending the generalizable 3DGS to become an alternative approach to large-scale indoor whole-scene reconstruction, which has the potential of significantly accelerating the reconstruction speed and improving the geometric accuracy. To facilitate whole-scene reconstruction, we initially propose the Low-cost Cross-View Aggregation framework to efficiently process extremely long input sequences. Subsequently, we introduce a carefully designed pixel-wise triplet fusion method to incrementally aggregate the overlapping 3D Gaussian primitives from multiple views, adaptively reducing their redundancy. Furthermore, given the fused 3DGS primitives with accumulated weights after the fusion step, we propose a weighted floater removal strategy that can effectively reduce floaters, which serves as an explicit depth fusion approach that is tailored for generalizable 3DGS methods and becomes crucial in whole-scene reconstruction. After the feed-forward reconstruction of 3DGS primitives, we investigate a depth-regularized per-scene fine-tuning process. Leveraging the dense, multi-view consistent depth maps obtained during the feed-forward prediction phase for an extra constraint, we refine the entire scene’s 3DGS primitives to enhance rendering quality while preserving geometric accuracy. Extensive experiments confirm that our FreeSplat++ significantly outperforms existing generalizable 3DGS methods, especially in whole-scene reconstruction. Compared to conventional per-scene optimized 3DGS approaches, our method with depth-regularized per-scene fine-tuning demonstrates substantial improvements in reconstruction accuracy and a notable reduction in training time.

Abstract:
Representing signals with coordinate networks has recently come to dominate the area of inverse problems and is widely applied in various scientific computing tasks. Still, coordinate networks suffer from spectral bias, which limits their capacity to learn high-frequency components. This problem is caused by the pathological distribution of the eigenvalues of the coordinate networks’ neural tangent kernel (NTK). We find that this pathological distribution can be improved using classical normalization techniques (batch normalization and layer normalization), which are commonly used in convolutional neural networks but rarely used in coordinate networks. We prove that normalization techniques greatly reduce the maximum and variance of the NTK’s eigenvalues while only slightly modifying the mean value. Considering that the maximum eigenvalue is much larger than most of the others, this variance change results in a shift of the eigenvalues’ distribution from a lower one to a higher one, so the spectral bias can be alleviated (see Fig. 1). Furthermore, we propose two new normalization techniques by combining these two techniques in different ways. The efficacy of these normalization techniques is substantiated by the significant improvements and new state-of-the-art results achieved by applying normalization-based coordinate networks to various tasks, including image compression, computed tomography reconstruction, shape representation, magnetic resonance imaging, novel view synthesis, and multi-view stereo reconstruction.

Abstract:
The rapid development of generative AI techniques enables the synthesis of highly realistic facial images, posing significant challenges for the accurate detection of face forgeries. In contrast to solely elevating detector awareness, proactively reducing the intrinsic difficulty of forgery detection can streamline detector complexity while improving both generalization and robustness. This insight motivates our defense strategy to make face forgery clues more evident. Specifically, a novel proactive approach dubbed Self-Steganographic Detection (SSD) is proposed to imperceptibly embed facial images into themselves as a form of detection evidence. The recovery process is designed to remain robust under normal manipulations while exhibiting deliberate degradation under malicious manipulations, thereby clearly revealing potential forgeries. Unlike embedding bit-level vectors, pixel-level images are informative to ensure the generalization of our approach. Due to the similarity between the protected and embedded images, SSD performs detection without storing any embedded information in advance. To support practical deployment, our approach incorporates a dual detection scheme that aims to identify unprotected images and determine the authenticity of protected images. Extensive experiments using 8 face forgery techniques demonstrate the effectiveness of our approach compared to state-of-the-art methods.

Abstract:
Interpreting Convolutional Neural Networks (CNNs) is critical for safety-sensitive applications such as healthcare and autonomous systems. Popular visual explanation methods like Grad-CAM use a single convolutional layer, potentially missing multi-scale cues and producing unstable saliency maps. We introduce Winsor-CAM, a single-pass gradient-based method that aggregates Grad-CAM maps from all convolutional layers and applies percentile-based Winsorization to attenuate outlier contributions. A user-controllable percentile parameter $p$ enables semantic-level tuning from low-level textures to high-level object patterns. We evaluate Winsor-CAM on six CNN architectures using PASCAL VOC 2012 and PolypGen, comparing localization (IoU, center-of-mass distance) and fidelity (insertion/deletion AUC) against seven baselines including Grad-CAM, Grad-CAM++, LayerCAM, ScoreCAM, AblationCAM, ShapleyCAM, and FullGrad. On DenseNet121 with a subset of PASCAL VOC 2012, Winsor-CAM achieves 46.8% IoU and 0.059 CoM distance versus 39.0% and 0.074 for Grad-CAM, with improved insertion AUC (0.656 vs. 0.623) and deletion AUC (0.197 vs. 0.242). Notably, even the worst-performing fixed $p$-value configuration outperforms FullGrad across all metrics. An ablation study confirms that incorporating earlier layers improves localization. Similar evaluation on PolypGen polyp segmentation further validates Winsor-CAM’s effectiveness in medical imaging contexts. Winsor-CAM provides an efficient, robust, and human-tunable explanation tool for expert-in-the-loop analysis.
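
As an illustration of the percentile-based Winsorization idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the per-layer saliency maps have already been resized to a common resolution and that each layer has a scalar importance weight; the function name `winsorized_layer_fusion`, the weight definition, and the final min-max normalization are hypothetical choices made for this example.

```python
import numpy as np

def winsorized_layer_fusion(layer_maps, layer_weights, p=90.0):
    """Fuse per-layer saliency maps after capping outlier layer weights.

    layer_maps:    array-like of shape (L, H, W), one map per layer
                   (assumed already resized to a common resolution).
    layer_weights: array-like of shape (L,), one importance score per layer
                   (hypothetical; how the scores are obtained is not specified here).
    p:             percentile in [0, 100] used as the upper cap.
    """
    maps = np.asarray(layer_maps, dtype=np.float64)
    w = np.asarray(layer_weights, dtype=np.float64)

    # Winsorize: weights above the p-th percentile are clipped to that value.
    cap = np.percentile(w, p)
    w_clipped = np.minimum(w, cap)

    # Normalize the clipped weights and take the weighted sum over layers.
    w_norm = w_clipped / (w_clipped.sum() + 1e-12)
    fused = np.tensordot(w_norm, maps, axes=1)  # shape (H, W)

    # Keep positive evidence only and rescale to [0, 1] for display.
    fused = np.maximum(fused, 0.0)
    value_range = fused.max() - fused.min()
    if value_range > 0:
        return (fused - fused.min()) / value_range
    return np.zeros_like(fused)
```

In this sketch a lower `p` caps more layers at the same weight, so no single layer dominates the fused map, while `p=100` leaves the weights unchanged.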

Abstract:
Single-point annotation is increasingly prominent in visual tasks for labeling cost reduction. However, it challenges tasks requiring high precision, such as the point-prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single-point prompts to train a segmentation network. Due to the constraints of point annotations, granularity ambiguity and boundary uncertainty arise, i.e., the difficulty of distinguishing between different levels of detail (e.g., whole object vs. parts) and the challenge of precisely delineating object boundaries. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and Box Mining Strategy to tackle group and local issues caused by the point’s granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection, termed S-MIL. The Multi-level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating the point prompt’s granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.

Abstract:
Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in a limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifact removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
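
To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of cross-attention between N input tokens and a small dictionary of M tokens. It is an illustrative simplification under stated assumptions, not the ATD architecture: it omits the learned query/key/value projections, uses a single head, and the function name `token_dictionary_cross_attention`, the residual connection, and the argmax-based grouping are hypothetical.

```python
import numpy as np

def token_dictionary_cross_attention(features, dictionary):
    """Cross-attend N feature tokens to M dictionary tokens.

    features:   (N, d) input tokens, e.g. flattened image features.
    dictionary: (M, d) dictionary tokens (learned in practice; fixed here).

    The attention matrix has shape (N, M), so for a fixed dictionary size M
    the cost grows linearly with N rather than quadratically.
    """
    d = features.shape[1]

    # Scaled dot-product similarity between each token and each dictionary entry.
    logits = features @ dictionary.T / np.sqrt(d)

    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)

    # Enhance each token with its attention-weighted dictionary summary.
    enhanced = features + attn @ dictionary

    # Assumed grouping rule: assign each token to its most-attended entry.
    groups = attn.argmax(axis=1)
    return enhanced, attn, groups
```

For example, with N = 4096 tokens and M = 64 dictionary entries the attention matrix has 4096 x 64 entries, compared with 4096 x 4096 for full self-attention over the same tokens.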

Abstract:
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data, together with the real semantic segmentation labels, leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to provide more training data to extend existing datasets, reducing the data annotation effort.

Abstract:
The Chamfer Distance (CD) is a cornerstone objective function for point cloud completion, yet its inherent symmetric weighting mechanism limits the quality of the generated results. By penalizing local detail deviations and global coverage deficiencies equally, standard CD often causes structural defects such as point aggregation and incomplete spatial structures. We introduce the Flexible-weighted Chamfer Distance (FCD), which decouples CD into local precision and global completeness sub-objectives. FCD employs an asymmetric weighting strategy that prioritizes global structural integrity, steering the optimization away from sub-optimal solutions. As a plug-and-play module with negligible overhead, extensive experiments on state-of-the-art networks demonstrate that FCD significantly enhances global distribution metrics while preserving local precision. Specifically, on the ShapeNet55 benchmark using AdaPoinTr, FCD reduces the Density-aware Chamfer Distance (DCD) by approximately 12.4% (from 0.613 to 0.537), effectively mitigating point clustering. Similarly, on the PCN dataset, the proposed method reduces the Earth Mover’s Distance (EMD) from 23.79 to 21.40, demonstrating superior global uniformity compared to the standard CD baseline. Furthermore, FCD demonstrates excellent generalization. When applied to diverse tasks and datasets, including real-world scans (KITTI), industrial components (ABC), and point cloud upsampling (PU-GAN), it yields significant quantitative gains and produces visually more uniform and structurally complete point clouds. These results underscore FCD’s potential as a versatile objective function for the broader point cloud generation domain.
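
The decoupling described above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' FCD: the mapping of the prediction-to-target term to "local precision" and the target-to-prediction term to "global completeness", the squared Euclidean distance, and the fixed weights `w_local` and `w_global` are all hypothetical choices for this example.

```python
import numpy as np

def weighted_chamfer(pred, gt, w_local=1.0, w_global=1.0):
    """Chamfer Distance split into two separately weighted directional terms.

    pred: (P, 3) predicted points.
    gt:   (G, 3) ground-truth points.

    Returns (total, local_term, global_term). With w_local == w_global == 1
    this reduces to the standard symmetric (squared) Chamfer Distance.
    """
    # Pairwise squared Euclidean distances, shape (P, G).
    diff = pred[:, None, :] - gt[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)

    # pred -> gt: how close each predicted point is to the target surface
    # (treated here as the "local precision" term).
    local_term = d2.min(axis=1).mean()

    # gt -> pred: how well the prediction covers every target point
    # (treated here as the "global completeness" term).
    global_term = d2.min(axis=0).mean()

    total = w_local * local_term + w_global * global_term
    return total, local_term, global_term
```

Choosing `w_global > w_local` penalizes uncovered target regions more heavily than small local deviations, which is one simple way to express an asymmetric weighting that favors global coverage.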

Abstract:
In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data for supervised small-baseline and large-baseline homography learning and yield a state-of-the-art homography estimation network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content refinement diffusion model. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method outperforms existing competitors and previous supervised methods can also be improved based on the generated dataset.

Abstract:
Cloud-based third-party multimedia services have become increasingly popular in the last decade; however, they pose serious threats to users’ privacy. To address this issue, in this paper, we propose a novel Adaptive Image Restoration network with Privacy protection, namely AIRPNet, which is the first attempt to perform image restoration in the steganographic domain. Compared with existing methods, our method has significant advantages in invisibility, security and flexibility. Specifically, we first propose a wavelet lifting-based Adaptive Invertible Hiding (AIH) module to conceal the low-quality (LQ) secret image in a stego image. Then, instead of performing a single type of restoration on the secret image, an adaptive secure restoration (ASR) module is developed to deal with multiple image degradations on the stego image. Finally, a high-quality (HQ) secret image can be extracted from the restored stego image. Here, since the secret image remains hidden throughout the whole image restoration process, the privacy of users can be greatly protected. The framework can be flexibly extended to multiple image restoration, which can restore multiple secret images from the same stego image. Experimental results on various datasets demonstrate that our AIRPNet outperforms existing methods in terms of restoration accuracy, invisibility and security on different image restoration tasks.

Abstract:
Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown promise in adapting pre-trained models to sequential tasks while mitigating the catastrophic forgetting problem. However, understanding the mechanisms that dictate continual performance in this paradigm remains elusive. To unravel this mystery, we undertake a rigorous analysis of PEFT-CL dynamics to derive relevant metrics for continual scenarios using Neural Tangent Kernel (NTK) theory. With the aid of NTK as a mathematical analysis tool, we recast the challenge of test-time forgetting into the quantifiable generalization gaps during training, identifying three key factors that influence these gaps and the performance of PEFT-CL: training sample size, task-level feature orthogonality, and regularization. To address these challenges, we introduce NTK-CL, a novel framework that eliminates task-specific parameter storage while adaptively generating task-relevant features. Aligning with theoretical guidance, NTK-CL triples the feature representation of each sample, theoretically and empirically reducing the magnitude of both task-interplay and task-specific generalization gaps. Grounded in NTK analysis, our framework imposes an adaptive exponential moving average mechanism and constraints on task-level feature orthogonality, maintaining intra-task NTK forms while attenuating inter-task NTK forms. Ultimately, by fine-tuning optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks. This work provides a theoretical foundation for understanding and improving PEFT-CL models, offering insights into the interplay between feature representation, task orthogonality, and generalization, and contributing to the development of more efficient continual learning systems.

Abstract:
Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.

Abstract:
LiDAR-based human motion capture holds great promise for large-scale, unconstrained environments. However, existing approaches often rely on clean, pre-segmented point clouds and struggle with noisy or dynamic scenes, limiting their practical applicability. We propose OptimalCap, a robust and efficient LiDAR-based framework that integrates hierarchical skeletal modeling and kinematic-aware temporal optimization to enable accurate, coherent, and real-time multi-human motion capture. To support training and evaluation under realistic disturbances, we also introduce NoiseMotion, a large-scale synthetic dataset simulating human-object interactions in noisy environments. Extensive experiments on public and synthetic benchmarks demonstrate that OptimalCap achieves state-of-the-art accuracy, robustness, and temporal consistency, while supporting over 20 individuals, at 60 FPS and up to 100 meters, setting a new standard for scalable, real-world LiDAR-based motion capture.

Abstract:
The human cognitive mechanism depends on a sophisticated information processing framework, including perception, attention, memory, language, reasoning, problem solving and decision-making. However, current research focuses only on isolated processes rather than systematically simulating the human cognitive mechanism. Meanwhile, with the rapid development of large language models, related works have predominantly centered on language-level exploration, while in-depth mining of visual information remains insufficient. Here, to deeply activate the multi-modal understanding ability, a Systematic Human-like Cognitive (SHC) method is proposed for visual question answering, where the seven processes mentioned above are systematically modeled as three core modules: hierarchical perception, semantic refinement and dynamic reasoning. The Hierarchical Perception Module (HPM) extracts hierarchical features from different levels to simulate the incremental integration mode of the biological neural system. Based on the selective attention theory, a Semantic Refinement Module (SRM) is designed as a key-value accumulation optimization mechanism that enhances high-level semantics from low-level features via a multi-level cascaded attention structure. Finally, the Dynamic Reasoning Module (DRM), following the utility maximization decision theory, employs a dual weighting mechanism to dynamically fuse high-level semantic features and low-level fine-grained features, forming a unified high-quality visual representation that is then fed into the large language model for reasoning together with the text input. Experimental results demonstrate that SHC achieves competitive performance on multiple visual question answering benchmarks, including VQA-v2, Text-VQA, GQA, and ScienceQA, as well as multimodal evaluation benchmarks such as POPE, MMB, MME, and MM-Vet.
Comparative experiments with multiple models of the same scale validate the capacity of SHC to promote the performance of multi-modal understanding tasks and its superiority in fine-grained visual information perception; SHC even surpasses larger-scale multimodal models on certain tasks.

Abstract:
Recently, enhancing the generative capability of text-to-image (T2I) models has become a promising direction in both academia and industry. Prior studies often focused on either improving generative quality or reducing inference latency, but typically failed to improve both quality and speed simultaneously. Moreover, existing inference-enhancement methods do not achieve significant improvements simultaneously across both diffusion models (DMs) and autoregressive models (ARMs). In this paper, we introduce a general tuning-based inference-enhancement framework, named CoRe^2, which is, to the best of our knowledge, the first to simultaneously achieve significantly better generative quality and reduced inference overhead across DMs and ARMs. CoRe^2 comprises three stages: Collect, Reflect, and Refine. During the Collect stage, classifier-free guidance (CFG) trajectories are collected and subsequently used in the Reflect stage to train a weak model capable of reflecting the “easy-to-learn” content. Finally, during the Refine stage, CoRe^2 can utilize the trained weak model to achieve speedup and performance gain in inference. Specifically, in the early sampling steps, CoRe^2 employs weak-to-strong guidance to refine the “difficult-to-learn” and realistic content, thereby improving generative quality. In the later sampling steps, CoRe^2 can use the weak model to generate “easy-to-learn” content instead of CFG, dramatically reducing inference time. Experimental results substantiate that CoRe^2 achieves significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench across SDXL, SD3.5, FLUX and LlamaGen. Notably, for SD3.5, CoRe^2 can be seamlessly integrated with the state-of-the-art inference-enhancement algorithm Z-Sampling, outperforming it even with less time.
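For readers unfamiliar with classifier-free guidance, the sketch below shows the standard CFG combination rule that the collected trajectories rely on. It is not the weak-to-strong guidance of the framework itself, whose exact form the abstract does not give. The function name and toy vectors are hypothetical, and plain lists stand in for model outputs.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy predictions for a two-dimensional output (hypothetical values).
pred_uncond = [0.0, 1.0]
pred_cond = [1.0, 3.0]
guided = cfg_combine(pred_uncond, pred_cond, guidance_scale=2.0)
print(guided)
```

Standard CFG needs both a conditional and an unconditional prediction at every sampling step, which is consistent with the abstract's point that replacing CFG with a weak model in later steps reduces inference time.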

Abstract:
The remarkable success of GNNs has brought the challenge of high computational and memory overhead when training with large-scale graphs. As a promising solution, graph condensation aims to construct synthetic graphs of significantly smaller size that preserve the essential characteristics of the original ones. During this process, a core problem is how to accurately portray and align the data distribution structures between the original graph space and the synthetic graph space. A mainstream idea in existing research is to match the class distributions between the two spaces. Unfortunately, existing methods generally overlook two key issues: 1) heterophilic nodes in original graphs may make the class distribution patterns chaotic; 2) coarse-grained matching of the overall class centroid between original and synthetic spaces is insufficient for data with complex subcategory distributions. In this paper, we propose a novel Graph Condensation method via homophily node Refinement and fine-grained class Distribution matching (GCRD). Given the original large-scale graph, we first divide the nodes into advantageous homophilic nodes and detrimental heterophilic nodes, and then adaptively assign node weights to refine the generated class distribution patterns of the original graphs. Furthermore, with the refined class distribution patterns, we propose a fine-grained distribution matching objective to align the local distribution structure of subclasses within each class more precisely. Rigorous theoretical analysis confirms the effectiveness of our proposal in precisely learning the class information. Extensive experiments demonstrate our state-of-the-art classification and cross-architecture generalization performance against various baselines.
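One common way to tell homophilic from heterophilic nodes is the node-level homophily ratio, i.e. the fraction of a node's neighbours that share its label. The abstract does not state the criterion or weighting scheme GCRD actually uses, so the sketch below is only an assumption-laden illustration; the function name and toy graph are hypothetical.

```python
def node_homophily(adjacency, labels):
    """Fraction of each node's neighbours that carry the same label.

    `adjacency` maps a node to its neighbour list and `labels` maps a node
    to its class. Isolated nodes get a ratio of 0.0 by convention here.
    """
    ratios = {}
    for node, neighbours in adjacency.items():
        if not neighbours:
            ratios[node] = 0.0
            continue
        same = sum(1 for nb in neighbours if labels[nb] == labels[node])
        ratios[node] = same / len(neighbours)
    return ratios


# Toy path-like graph with two classes (hypothetical data).
adjacency = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
homophily = node_homophily(adjacency, labels)
print(homophily)
```

A ratio near 1 marks a node whose neighbourhood agrees with its class, while a low ratio marks a heterophilic node; such a score could then feed a node-weighting step.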

Abstract:
Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, and position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs of constructing datasets. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a broad variety of semantic segmentation models, spanning from conventional closed-set models to recent open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) traditional data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our generation pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

Abstract:
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in visual–language reasoning, yet long-video understanding remains a formidable challenge due to the need for coherent reasoning over ultra-long spatiotemporal dependencies. Existing methods struggle with the vast candidate space for relevant information in long videos, often failing to distinguish meaningful events from redundant content. We identify two critical and previously under-explored issues: absolute redundancy, where static visual content inflates token counts without adding narrative value, and relative redundancy, where task-irrelevant segments introduce noise that impairs reasoning. Compounding these issues is the weak spatiotemporal modeling in current MLLMs, which limits their ability to capture complex event dynamics. To address these multifaceted challenges, we introduce SELongVLM, a dynamically lenient-to-stringent selection long video language model. SELongVLM integrates two coordinated branches: a Residual Token Pruner (RTP) that removes repetitive background tokens via inter-frame residual modeling thus mitigating absolute redundancy while preserving motion cues, and a Semantic-aware Self-Correction Selector (SCSelector) that progressively refines query-relevant clip selection without frame-level annotations to reduce relative redundancy, guided by a stringent-to-lenient self-correcting mechanism during optimization. To ensure causal continuity and bolster spatiotemporal reasoning across disjoint clips, the framework further incorporates an action-aware operation for intra-clip dynamics and a temporal memory for cross-clip context, enabling robust spatiotemporal inference on long videos. Extensive experiments across eight benchmarks demonstrate that SELongVLM markedly outperforms existing models on both general and specialized long-video tasks. 
Specifically, it achieves 65.5% on VideoMME and 69.8% on MLVU for general benchmarks, and delivers strong performance on four specialized benchmarks – for example, 39.2% on TOMATO for fine-grained temporal reasoning and 69.2% on EventBench for event-level understanding.

Abstract:
Recently, Multi-View Graph Clustering (MVGC) methods have achieved significant progress, leading to their wide adoption in various applications. However, most MVGC methods merely pursue consistent information by simply fusing multi-view graphs, ignoring the cross-view interactions among them, which limits the ceiling of their performance. To make up for this deficiency, we design a credible cross-view graph enhancement module to explore the credible topological structure, while accomplishing cross-view interactions, to boost clustering performance in multi-view graph scenarios. Besides, we reconsider the graph clustering task from the perspective of graph signal processing. From this novel perspective, we adapt the high-order Graph Trend Filter to reveal the inhomogeneities in graph smoothness levels and further consider the brand-new local preference in MVGC, which provides theoretical guidance for graph clustering. Building on these insights, we propose the Enhanced Graph Trend Filter Clustering (EGTFC) method and present an effective algorithm accompanied by corresponding theoretical analyses to tackle the optimization problem inherent in EGTFC. Finally, substantial experimental results on twelve benchmark datasets demonstrate the effectiveness of our proposals and the superiority over thirteen state-of-the-art MVGC methods.

Abstract:
Remote sensing images exhibit intrinsic domain complexity arising from multi-source sensor variances, a heterogeneity that fundamentally challenges conventional cross-domain few-shot methods that assume simple distribution shifts. To address this, we propose a first-order Cross-Domain Meta Learning (CDML) method for few-shot remote sensing object classification. CDML implements a dual-stage domain adaptation task as the fundamental meta-learning unit, and includes a cross-domain meta-train phase (CDMTrain) and a cross-domain meta-test phase (CDMTest). In CDMTrain, we propose an inner-loop multi-domain few-shot task sampling, which enables a teacher model to encapsulate both cross-category discriminative features and authentic inter-domain distributional divergence. This alternating cyclic learning paradigm captures genuine domain shifts, with each update direction progressively guiding the model toward parameters that balance multi-domain performance. In CDMTest, we evaluate a domain diversity enhancement by transferring teacher parameters to the student model for cross-domain capability assessment on the reserved pseudo-unseen domain. The task-level design progressively improves domain generalization through iterative domain adaptive task learning. Meanwhile, to mitigate the conflicts and inadequacies caused by multi-domain scenarios, we propose a learnable affine transformation model. It adaptively learns affine transformation parameters through intermediate layer features to fine-tune the update direction. Extensive experiments on five remote sensing classification benchmarks demonstrate the superior performance of the proposed method compared with state-of-the-art methods.

Abstract:
Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from fixed-number iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

Abstract:
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). As a significant extension, NAP-Tuning first establishes a comprehensive multi-modal (text and visual) and multi-layer prompting framework. The core of this framework is a targeted structural augmentation for feature-level purification, implemented through our Neural Augmentor approach. This framework implements feature purification by incorporating TokenRefiners—lightweight neural modules that learn to reconstruct purified features via residual connections—to directly address distortions in the feature space. This structural intervention is what enables the multi-modal and multi-layer system to effectively perform modality-specific and layer-specific feature rectification. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 32.3% on ViT-B16 and 31.3% on ViT-B32 architectures while maintaining competitive clean accuracy. 
This work highlights the efficacy of internal feature-level intervention in prompt tuning for adversarial robustness, moving beyond input-side alignment approaches to create an adaptive defense mechanism that can identify and rectify adversarial perturbations across embedding spaces.

Abstract:
Large language models demonstrate impressive performance on downstream tasks, yet fully fine-tuning all parameters requires extensive resources. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs), which are critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks. Additionally, based on our exploration of TSDs, we focus on an important issue in PEFT: the initialization of LoRA. While some works have pointed out the significance of initialization for LoRA’s performance and proposed various strategies, these methods are often empirical and not task-specific. To address this issue, we propose LoRA-Init. Starting from TSDs, we identify the directions that require the most adjustment during fine-tuning for downstream tasks. By initializing the matrices in LoRA with these directions, LoRA-Init significantly enhances LoRA’s performance. Moreover, we can combine LoRA-Dash and LoRA-Init to create the final version of LoRA based on TSDs, which we refer to as LoRA-TSD. Extensive experiments demonstrate the effectiveness of these methods, and in-depth analyses further reveal their underlying mechanisms.
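As background, the sketch below shows the standard LoRA reparameterization that these methods build on: a frozen weight matrix plus a scaled low-rank update B·A. It does not implement LoRA-Dash or LoRA-Init, whose selection of task-specific directions is not described in the abstract. The helper names and toy matrices are hypothetical, and nested lists stand in for tensors.

```python
def matmul(a, b):
    """Plain nested-list matrix product of an (m x k) and a (k x n) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w0, b, a, alpha, rank):
    """Merged LoRA weight: W0 + (alpha / rank) * B @ A, with W0 kept frozen."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (2 x 2)
b = [[1.0], [2.0]]             # trainable B (2 x rank), rank = 1
a = [[0.5, 0.0]]               # trainable A (rank x 2)
merged = lora_weight(w0, b, a, alpha=2.0, rank=1)
print(merged)
```

In standard LoRA, B starts at zero so the merged weight initially equals the pretrained one; an initialization scheme such as the one the abstract proposes would change that starting point.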

Abstract:
Reversible image conversion (RIC) suffers from ill-posedness because its forward conversion process is an underdetermined system. Despite employing invertible neural networks (INN), existing RIC methods intrinsically remain ill-posed, as they inevitably introduce uncertainty by incorporating randomly sampled variables. To tackle the ill-posedness dilemma, we focus on developing a reliable approximate left inverse for the underdetermined system by constructing an overdetermined system with a non-zero Gram determinant, thus ensuring a well-posed solution. Based on this principle, we propose a well-posed invertible 1×1 convolution (WIC), which eliminates the reliance on random variable sampling and enables the development of well-posed invertible networks. Furthermore, we design two innovative networks, WIN-Naïve and WIN, with the latter incorporating advanced skip-connections to enhance long-term memory. Our methods are evaluated across diverse RIC tasks, including reversible image hiding, image rescaling, and image decolorization, consistently achieving state-of-the-art performance. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to overcome the bottlenecks of existing RIC solutions and setting a new benchmark in the field.
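The role of the non-zero Gram determinant can be seen from a generic linear-algebra fact, stated here as background rather than as the WIC construction itself (the abstract does not detail how the overdetermined system is built). The symbols A, x, y, m and n are introduced only for this illustration.

```latex
% Overdetermined linear system: A \in \mathbb{R}^{m \times n}, m \ge n, y = A x.
% A non-zero Gram determinant means A has full column rank, so a left inverse exists.
\[
\det\!\left(A^{\top} A\right) \neq 0
\;\Longrightarrow\;
A^{+} = \left(A^{\top} A\right)^{-1} A^{\top},
\qquad
A^{+} A = I_n,
\qquad
x = A^{+} y .
\]
```

Under this condition the input x is recovered uniquely from y, with no randomly sampled variable needed to fill in missing information.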

Abstract:
Evidential deep learning (EDL) models, based on Subjective Logic, introduce a principled and computationally efficient way to make deterministic neural networks uncertainty-aware. The resulting evidential models can quantify fine-grained uncertainty using learned evidence. However, the Subjective-Logic framework constrains evidence to be non-negative, requiring specific activation functions whose geometric properties can induce activation-dependent learning-freeze behavior, a regime where gradients become extremely small for samples mapped into low-evidence regions. We theoretically characterize this behavior and analyze how different evidential activations influence learning dynamics. Building on this analysis, we design a general family of activation functions and corresponding evidential regularizers that provide an alternative pathway for consistent evidence updates across activation regimes. Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet), two few-shot classification problems, and a blind face restoration problem empirically validate the developed theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.
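The sketch below illustrates the usual Subjective-Logic bookkeeping in evidential classifiers: non-negative evidence, Dirichlet parameters alpha = evidence + 1, belief masses, and an uncertainty mass of K divided by the Dirichlet strength. ReLU is used as one common evidence activation and is an assumption here; the activation family proposed in the paper is not specified in the abstract. Function names and logits are hypothetical.

```python
def relu(x):
    """Non-negative evidence activation (one common choice, assumed here)."""
    return max(0.0, x)


def evidential_output(logits):
    """Map logits to Subjective-Logic belief masses and an uncertainty mass."""
    evidence = [relu(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = len(logits) / strength     # u = K / S
    return belief, uncertainty


# A sample whose logits are all negative falls in the zero-evidence region.
belief, uncertainty = evidential_output([-1.0, -2.0, -0.5])
print(belief, uncertainty)
```

With ReLU, every negative logit yields zero evidence and a zero gradient through the activation, which gives a simple picture of how samples in low-evidence regions can stop receiving useful updates.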

Abstract:
Large-size very-high-resolution (VHR) remote sensing imagery has emerged as a critical data source for high-precision vector mapping of multi-scale geographical elements such as buildings, water, and roads. When dealing with a large-size image, deep learning-based vector mapping methods often employ a sliding block strategy because of limited GPU memory. This inevitably degrades performance, because the vector mapping results of the sliding blocks are difficult to stitch together. Therefore, it is necessary to conduct full-scope vector mapping by mining the consistent cues in large-size remote sensing imagery. To this end, this paper presents a novel global context-aware local point optimization method. To leverage the global context, this paper proposes a novel pyramid fusion network (PFNet) to conduct semantic segmentation of the large-size image in an end-to-end manner. Under the constraint of the global semantic segmentation result, a new inflection-point perception network (IPNet) is proposed to generate a set of stable points to depict the boundary of each element. Extensive experiments on building, water and road datasets, where each image has over 100 million pixels, show that our method clearly outperforms the existing methods.

Abstract:
Few-shot learning seeks to recognize novel classes from limited examples. Model-agnostic meta-learning (MAML), known for its simplicity and flexibility, learns an effective initialization for fast adaptation in data-scarce settings. However, MAML-based methods face challenges when there is a significant distributional shift between training and testing tasks, leading to inefficient learning and poor generalization across domains. In this work, we identify the core issues: inflexible weight update rules and limited adaptive learning capabilities. Instead of focusing solely on better initialization, we aim to enhance the adaptation process. Consequently, we propose a novel Layer-Adaptive Proportional-Integral-Derivative (LA-PID) optimizer integrated into a meta-learning framework. This design incorporates classical control theory, utilizing PID control to dynamically adjust task-specific gains at each network layer. Additionally, the theoretical conditions for optimal hyperparameter initialization and global model convergence are addressed from both control and optimization perspectives. Experiments on benchmark datasets show that LA-PID achieves state-of-the-art performance in few-shot classification, cross-domain, and regression tasks, while requiring fewer training steps.
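To make the control-theoretic idea concrete, the sketch below applies a PID-style rule to a single parameter, treating the gradient as the error signal: the proportional term uses the current gradient, the integral term its running sum, and the derivative term its change since the last step. This is a generic illustration with fixed, hand-picked gains; how LA-PID generates task-specific gains per layer is not described in the abstract, and all names and values here are hypothetical.

```python
class PIDUpdater:
    """PID-style update for one parameter, with the gradient as error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0   # running sum of gradients (I term state)
        self.prev_grad = 0.0  # last gradient seen (D term state)

    def step(self, param, grad):
        self.integral += grad
        derivative = grad - self.prev_grad
        self.prev_grad = grad
        update = self.kp * grad + self.ki * self.integral + self.kd * derivative
        return param - update


def grad_fn(theta):
    """Gradient of the toy objective f(theta) = theta ** 2."""
    return 2.0 * theta


updater = PIDUpdater(kp=0.1, ki=0.01, kd=0.05)
theta_after_one = updater.step(1.0, grad_fn(1.0))
print(theta_after_one)
```

Setting `ki` and `kd` to zero recovers plain gradient descent with learning rate `kp`, which shows that the PID rule generalizes the usual update.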

Abstract:
Traditional cameras face limitations in maintaining focus across dynamic scenes, especially during rapid motion, due to the constraints of their lenses. Post-capture refocusing techniques, including deep learning-based methods and light field cameras, have been explored to mitigate these challenges. However, these approaches frequently struggle with temporal consistency or experience a trade-off in spatial resolution. In this paper, we introduce the coded event focal stack, a novel approach that captures both motion and depth information through event streams recorded during a modulated focal sweep. Our coded event focal stack enables the generation of full-time intermediate frames refocused at arbitrary focal distances. Extensive experiments on both synthetic and real-world datasets demonstrate the superior refocusing capability of our method over state-of-the-art techniques, particularly in dynamic scenes with complex motion and depth variations.

Abstract:
We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods, which primarily rely on a large amount of human annotations for training neural networks, we propose GrowSP++, an unsupervised method to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels. Our method is composed of three major components: 1) a feature extractor incorporating 2D-3D feature distillation, 2) a superpoint constructor featuring progressively growing superpoints, and 3) a semantic primitive constructor with an additional growing strategy. The key to our method is the superpoint constructor together with the progressive growing strategy on both superpoints and semantic primitives, driving the feature extractor to progressively learn similar features for 3D points belonging to the same semantic class. We extensively evaluate our method on five challenging indoor and outdoor datasets, demonstrating state-of-the-art performance over all unsupervised baselines. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning.

Abstract:
With the advancement of deep learning, deep recommendation models have achieved remarkable improvements in recommendation accuracy. However, due to the large number of candidate items in practice and the high cost of preference computation, these methods still suffer from low recommendation efficiency. The recently proposed tree-based deep recommendation models alleviate the problem by directly learning tree structure and representations under the guidance of recommendation objectives. To guarantee the effectiveness of beam search for recommendation accuracy, these models strive to ensure that the tree adheres to the max-heap assumption, where a parent node’s preference should be the maximum among its children’s preferences. However, they employ a one-versus-all strategy, framing the training task as a series of independent binary classification objectives for each node, which limits their ability to fully satisfy the max-heap assumption. To this end, we propose a Deep Tree-based Retriever (DTR for short) for efficient recommendation. DTR frames the training task as a softmax-based multi-class classification over tree nodes at the same level, enabling explicit horizontal competition and more discriminative top-k selection among them, which mimics the beam search behavior during training. To mitigate the suboptimality induced by the labeling of non-leaf nodes, we propose a rectification method for the loss function, which further aligns with the max-heap assumption in expectation. As the number of tree nodes grows exponentially with the levels, we employ sampled softmax to approximate optimization and thereby enhance efficiency. Furthermore, we propose a tree-based sampling method to reduce the bias inherent in sampled softmax. Theoretical results reveal DTR’s generalization capability, and both the rectification method and tree-based sampling contribute to improved generalization. 
The experiments are conducted on four real-world datasets, validating the effectiveness of the proposed method.
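The link between the max-heap assumption and beam search can be illustrated with a toy tree. In the sketch below a lookup table stands in for the learned preference model, parent scores equal the maximum of their children's scores, and a level-wise beam search keeps the top-k nodes per level. Node names, scores and the function name are hypothetical, and this is not the DTR training procedure.

```python
def beam_search(children, score, root, beam_size):
    """Level-wise beam search: expand the frontier, keep the top-k by score."""
    frontier = [root]
    while any(children.get(node) for node in frontier):
        candidates = []
        for node in frontier:
            kids = children.get(node)
            # Leaves reached early are carried forward unchanged.
            candidates.extend(kids if kids else [node])
        candidates.sort(key=lambda n: score[n], reverse=True)
        frontier = candidates[:beam_size]
    return frontier


children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
# Leaf preferences, with each parent set to the max over its children.
score = {"a1": 0.9, "a2": 0.2, "b1": 0.6, "b2": 0.1, "A": 0.9, "B": 0.6}
top_items = beam_search(children, score, "root", beam_size=2)
print(top_items)
```

Because the tree satisfies the max-heap property, the beam returns the two highest-scoring leaves while scoring only a few nodes per level. If the property were violated, a good leaf under a poorly scored parent could be pruned.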

Abstract:
With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field reconstruction from sparse view inputs. Existing methods can be classified into offline techniques, which can generate high-quality novel views but at the cost of long inference/training time, and online methods, which either lack generalizability or produce unsatisfactory results. However, we have observed that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field reconstruction while maintaining rendering quality. Based on this insight, we introduce RealLiFe, a novel light field optimization method, which leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse input images in real time. Technically, the coarse MPI of a scene is first generated using a 3D CNN, and it is further optimized leveraging only the scene content aligned sparse MPI gradients in a few iterations. Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods and delivers better performance (about 2 dB higher in PSNR) compared to other online approaches.

Abstract:
Online learning of deep neural networks faces challenges such as delayed non-incremental updating, increasing consumption, retrospective retraining, and catastrophic forgetting. To alleviate these drawbacks and achieve progressive immediate decision-making, we propose a novel Incremental Online Learning (IOL) framework of Randomized Neural Networks (Randomized NN), facilitating continuous improvements and analytics to Randomized NN performance in online scenarios. Within the framework, we further formulate IOL with ridge regularization (-R) and IOL with forward regularization (-F), both avoiding retrospective retraining and catastrophic forgetting. Moreover, the incremental algorithms for -R/-F on non-stationary batch stream are derived, featuring recursive weight updates and variable learning rates. Compared to -R, we recommend -F which improves learning performance using future unlabeled observations while further reducing online regrets to offline global experts. Additionally, we conduct a detailed analysis and theoretically derive relative cumulative regret bounds of the Randomized NN learners for -R/-F under adversarial assumptions via a novel methodology and present several corollaries, from which we observed the superiority in online learning acceleration and declined regret bounds of employing -F in IOL. Finally, our proposed methods were rigorously examined across diverse tasks, from simulation, regression, and classification tasks, to long-term time-series forecasting (LTSF) and continual learning (CL) fields, which distinctly validated the efficacy of the IOL frameworks and the advantages of forward regularization.

Abstract:
Recent works have shown the potential of diffusion models in computer vision and natural language processing. Apart from the classical supervised learning fields, diffusion models have also shown strong competitiveness in reinforcement learning (RL) by formulating decision-making as sequential generation. However, incorporating temporal information of sequential data and utilizing it to guide diffusion models to perform better generation is still an open challenge. In this paper, we take one step forward to investigate controllable generation with temporal conditions that are refined from temporal information. We observe the importance of temporal conditions in sequential generation in sufficient scenarios and provide a comprehensive discussion and comparison of different temporal conditions. Based on the observations, we propose an effective temporally-conditional diffusion model coined Temporally-Composable Diffuser (TCD), which extracts temporal information from interaction sequences and explicitly guides generation with temporal conditions. Specifically, we separate the sequences into three parts according to time expansion and identify historical, immediate, and prospective conditions accordingly. Each condition preserves non-overlapping temporal information of sequences, enabling more controllable generation when we jointly use them to guide the diffuser. Finally, we conduct extensive experiments and analysis to reveal the favorable applicability of TCD in offline RL tasks, where our method reaches or matches the best performance compared with prior SOTA baselines.

Abstract:
Short-Term object-interaction Anticipation (STA) consists in detecting the location of the next-active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand the user’s goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1) We propose STAformer and STAformer++, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2) We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gains of up to +23% on Ego4D and +31% on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens at https://github.com/lmur98/AFFttention to encourage future research in this area.

Abstract:
This article aims to speed up the training of large neural networks with the Mixture-of-Experts (MoE) structure. Training MoE often needs a lot of computing resources due to its large scale. Traditional acceleration methods either degrade prediction performance or rely on dedicated hardware with additional resources, but the resources are usually limited in real applications. One solution is to resort to new optimization strategies, such as learning from easy to hard by multiple stages. However, existing strategies are designed mainly for networks with a serial structure, but MoE has multiple expert networks working in parallel. They employ an identical learning plan for all experts, ignoring that each expert’s learning domain and speed differ, resulting in some experts being over-learned while others are under-learned. This mismatch will make it hard for experts to train together, harming training efficiency. To address this problem, we propose a new training acceleration framework. It can customize an effective learning plan for each expert by considering their training progress, avoiding blindly searching in a huge parameter space. In detail, we first design a multi-stage planner that starts with optimizing a subpart of the network and then scales it up to retrain until it expands to an entire network. It uses the density function to assess the knowledge gained by the expert in each stage, giving priority to the experts who learn faster to increase the training scale, so as to boost convergence. Afterward, we exploit the growth operator to add the expert training scale of the next stage. In each stage, the network would converge to some locally optimal values. That can provide a better initialization to train the next stage more easily, since the time and data required for training from scratch are greatly reduced. To alleviate the gradient vanishing problem caused by network growth, we develop a scheduler to dynamically adjust the learning rate. Extensive experiments are conducted to validate the effectiveness of our method. The results show that we can obtain more than 25% training acceleration on average.

Abstract:
Lifelong learning, also known as continual or incremental learning, is a crucial component for advancing Artificial General Intelligence (AGI) by enabling systems to continuously adapt in dynamic environments. While large language models (LLMs) have demonstrated impressive capabilities in natural language processing, existing LLM agents are typically designed for static systems and lack the ability to adapt over time in response to new challenges. This survey is the first to systematically summarize the potential techniques for incorporating lifelong learning into LLM-based agents. We categorize the core components of these agents into three modules: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for grounded interactions with the dynamic environment. We highlight how these pillars collectively enable continuous adaptation, mitigate catastrophic forgetting, and improve long-term performance. This survey provides a roadmap for researchers and practitioners working to develop lifelong learning capabilities in LLM agents, offering insights into emerging trends, evaluation metrics, and application scenarios.

Abstract:
End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains challenging. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape representation based on low-rank approximation for precise detection and a triple assignment detection head for fast inference. Specifically, unlike current data-irrelevant shape representation methods, we exploit shape correlations among labeled text boundaries to construct a robust low-rank subspace. By minimizing an ℓ1-norm objective, we extract orthogonal vectors that capture the intrinsic text shape from noisy annotations, enabling precise reconstruction via the linear combination of only a few basis vectors. Next, the triple assignment scheme decouples training complexity from inference speed. It utilizes a deep sparse branch to guide an ultra-lightweight inference branch, while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods.

Abstract:
Traditional imitation learning focuses on modeling the behavioral mechanisms of experts, which requires a large amount of interaction history generated by some fixed expert. However, in many streaming applications, such as streaming recommender systems, online decision-makers typically engage in online learning during the decision-making process, meaning that the interaction history generated by online decision-makers includes their behavioral evolution from novice expert to experienced expert. This poses a new challenge for existing imitation learning approaches that can only utilize data from experienced experts. To address this issue, this paper proposes an inverse batched contextual bandit (IBCB) framework that can efficiently perform estimations of environment reward parameters and learned policy based on the expert’s behavioral evolution history. Specifically, IBCB formulates the inverse problem into a simple quadratic programming problem by utilizing the behavioral evolution history of the batched contextual bandit with inaccessible rewards, and it can be extended to fairness-aware expert limitation. We demonstrate that IBCB is a unified framework for both deterministic and randomized bandit policies. The experimental results indicate that IBCB outperforms several existing imitation learning algorithms on synthetic and real-world data and significantly reduces running time. Additionally, empirical analyses reveal that IBCB exhibits better imitation ability for fairness-aware experts and better out-of-distribution generalization, and is highly effective in learning the bandit policy from the interaction history of novice experts. The code is publicly available.

Abstract:
We propose a novel post-processing approach for the local optimization of Locally Optimized RANdom SAmple Consensus (LO-RANSAC), called the Multi-Estimation-based Parameter Centroid (MEPC) decision. It is observed that the optimal thresholds for hypothesis generation and evaluation differ in local optimization with the inner RANSAC. Instead of binary labeling for inliers and outliers, a new ternary labeling for inliers, midliers, and outliers is introduced, using two thresholds. Our experimental results show that the highest-scoring model measured by the ternary method is closer to the real model than that measured by the existing binary method. However, it should be noted that the highest score still does not correspond to the best model due to inaccurate evaluation by data noise. We introduce a new linear model centroid decision method to compensate for the highest-scoring model distorted by noise. In this process, an efficient method for measuring the similarity between two hypotheses is introduced, and candidates close to the real model are found by comparing their similarity with the highest-scoring model. Our approach determines a representative model of the multiple candidate hypotheses, which is defined as the geometric centroid of hyperplanes. We test on various datasets for homography, fundamental, and essential matrices, demonstrating that applying MEPC to existing RANSAC algorithms achieves more accurate and stable model estimation. Moreover, additional experiments on vanishing point detection show the potential of our approach for various model estimation applications.
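
The ternary labeling can be sketched as follows. The abstract only states that two thresholds separate inliers, midliers, and outliers; the assumption here is that both thresholds apply to a non-negative residual, with the tighter one bounding inliers and the looser one bounding midliers. The function name and the threshold values in the usage note are hypothetical, and the scoring and centroid steps of MEPC are not reproduced.

```python
def ternary_labels(residuals, inlier_thr, outlier_thr):
    """Label each residual as 'inlier', 'midlier', or 'outlier'.

    Assumes 0 <= inlier_thr <= outlier_thr and non-negative residuals.
    """
    if not 0 <= inlier_thr <= outlier_thr:
        raise ValueError("expected 0 <= inlier_thr <= outlier_thr")
    labels = []
    for r in residuals:
        if r <= inlier_thr:
            labels.append("inlier")
        elif r <= outlier_thr:
            labels.append("midlier")
        else:
            labels.append("outlier")
    return labels
```

For instance, ternary_labels([0.1, 1.5, 4.0], 1.0, 3.0) returns ['inlier', 'midlier', 'outlier'].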

Abstract:
Single-photon imaging uses single-photon–sensitive picosecond-resolution sensors to capture 3D structure and supports diverse applications, but success remains mostly limited to simple scenes. In complex scenarios, traditional methods degrade and deep learning methods lack flexibility and generalization. Here, we propose a physics-informed deep neural network (PIDNN) framework that effectively addresses both aspects, adapting to complex and variable sensing environments by embedding imaging physics into the deep neural network for unsupervised learning. Within this framework, by tailoring the number of U-Net skip connections, we impose multi-scale spatiotemporal priors that improve photon-utilization efficiency, laying the foundation for addressing the inherent low-signal-to-background ratio (SBR) problem in subsequent complex scenarios. Additionally, we introduce volume rendering into the PIDNN framework and design a dual-branch structure, further extending its applicability to multiple-depth and fog occlusion. We validated the performance of this method in various complex environments through numerical simulations and real-world experiments. The results of photon-efficient imaging with multiple returns show robust performance under low SBR and large fields of view. The method attains lower root mean-squared error than traditional methods and exhibits stronger generalization than supervised approaches. Further multiple depths and fog interference experiments confirm that its reconstruction quality surpasses existing techniques, demonstrating its flexibility and scalability. Both simulation and experimental results validate its exceptional reconstruction performance and flexibility.

Abstract:
Deep learning-based feature matching has showcased great superiority for point cloud registration. While coarse-to-fine matching architectures are prevalent, they typically perform sparse and geometrically inconsistent coarse matching. This forces the subsequent fine matching to rely on computationally expensive optimal transport and hypothesis-and-selection procedures to resolve inconsistencies, leading to inefficiency and poor scalability for large-scale real-time applications. In this paper, we design a consistency-aware spot-guided Transformer (CAST) to enhance the coarse matching by explicitly utilizing geometric consistency via two key sparse attention mechanisms. First, our consistency-aware self-attention selectively computes intra-point-cloud attention to a sparse subset of points with globally consistent correspondences, enabling other points to derive discriminative features through their relationships with these anchors while propagating global consistency for robust correspondence reasoning. Second, our spot-guided cross-attention restricts cross-point-cloud attention to dynamically defined “spots”—the union of correspondence neighborhoods of a query’s neighbors in the other point cloud, which are most likely to cover the true correspondence of the query ensured by local consistency, eliminating interference from similar but irrelevant regions. Furthermore, we design a lightweight local attention-based fine matching module to precisely predict dense correspondences and estimate the transformation. Extensive experiments on both outdoor LiDAR datasets and indoor RGB-D camera datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness. Besides, our method showcases superior generalization ability on our newly constructed challenging relocalization and loop closing benchmarks in unseen domains.

Abstract:
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces additional distortions. In this work, we propose an oriented-distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a latitude-aware distortion representation to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-aware features. Considering the orientation sensitivity of the Gabor transform, we further introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse.

Abstract:
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-aware prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a Sparse Prompt Module (SPM) that efficiently captures degradation-specific features while minimizing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration. The code is available at https://github.com/Aitical/CPLIR.

Abstract:
Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an ℓ0-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, for solving the ℓ0-regularized CSC problem, we design a learnable ℓ0-regularized sparse coding (LZSC) block in a principled manner through deep unfolding. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block and then these features are combined to generate the final fused image. Additionally, we propose an ℓ0-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet’s training. Extensive experiments show that FNet achieves high-quality fusion results across eight different MMIF datasets. Furthermore, we show that FNet enhances downstream object detection and semantic segmentation in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrates the good interpretability of our network.

Abstract:
Intelligent systems typically need to continually learn from streaming data subject to distribution shift, where a key requirement is that they cannot catastrophically forget the historical knowledge learned from previous data. More seriously, streaming data often contain substantial label noise, which can exacerbate catastrophic forgetting and lead to performance degradation on forthcoming data. To address these problems, Continual Noisy Label Learning (CNLL) has been proposed. However, existing CNLL methods still fall short in addressing catastrophic forgetting because they adopt heuristic strategies for handling label noise and do not explicitly characterize the distributional shift across time, which hinders effective knowledge transfer from historical data to new data. To tackle these challenges, we theoretically analyze the problem of learning from streaming noisy data with distribution shift and propose a unified framework called Continual Noisy Label Learning on Drifting Data Streams (CNLDD). Specifically, we theoretically explore, for the first time, the upper bound of cumulative generalization error for the CNLL problem, which reveals three factors leading to forgetting, namely selection bias of buffered data, distribution shift, and label noise. To alleviate the selection bias of buffered data, we design a two-step buffer update strategy to narrow the distribution gap between the original historical data and the selected representative data in the buffer. To address distribution shift, our CNLDD explicitly characterizes the distribution discrepancies between buffered data and incoming data, prioritizing historical data with minimal discrepancies to enhance knowledge transfer. To tackle noisy labels, CNLDD estimates the importance weight of each example with the instance-dependent noise transition matrix, thereby avoiding the data bias and knowledge forgetting arising from noisy labels. Empirically, due to the unified modeling of the aforementioned issues, our CNLDD achieves superior classification performance when compared with state-of-the-art CNLL methods on both synthetic and real-world datasets.

Abstract:
Hypergraph Neural Networks (HGNNs) are crucial in modeling complex high-order correlations in diverse domains, utilizing hyperedges that connect multiple vertices. However, their susceptibility to structural attacks and irrational connections can disrupt message propagation and degrade performance. To address these issues, we introduce the HGNN Shield, a defense framework incorporating two key modules: Hyperedge-Dependent Estimation (HDE) and High-Order Shield (HOS). The HDE module prioritizes vertex dependencies within hyperedges and adapts traditional connectivity measures to hypergraphs, facilitating precise structural modifications. This adaptation allows for a nuanced assessment of vertex relationships within hyperedges, contributing theoretically by extending classical graph-based connection dependency measures to hypergraphs. Following HDE, the HOS module, positioned before convolutional layers, consists of three submodules: Hyperpath Cut, Hyperpath Link, and Hyperpath Refine. These components collectively detect, disconnect, and refine adversarial connections, ensuring robust message propagation. The theoretical contribution of the HOS module lies in maintaining hyperpath integrity and learning trajectory under adversarial conditions, providing a certifiable defense mechanism against high-order structural attacks. Experiments on six hypergraph datasets indicate that HGNN Shield significantly enhances robustness and maintains data integrity against targeted attacks, outperforming existing methods (an average performance improvement of 9.33% over other methods). Our framework not only improves HGNN reliability but also advances security in hypergraph-based applications.

Abstract:
In recent years, there has been a surge of machine learning applications developed with hierarchical structure, which can be approached from Bi-Level Optimization (BLO) perspective. However, most existing gradient-based methods overlook the interdependence between hyper-gradient calculation and Lower-Level (LL) iterative trajectory, focusing solely on the former. Consequently, convergence theory is constructed with restrictive LL assumptions, which are often challenging to satisfy in real-world scenarios. In this work, we thoroughly analyze the constructed iterative trajectory, and highlight two deficiencies, including empirically chosen initialization and default use of entire trajectory for hyper-gradient calculation. To address these issues, we introduce two augmentation techniques including Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), and investigate various extension strategies such as prior regularization, different iterative mapping schemes and acceleration dynamics to construct Augmented Iterative Trajectory (AIT) for corresponding BLO scenarios (e.g., LL convexity and LL non-convexity). Theoretically, we provide convergence analysis for AIT and its variations under different LL assumptions, and establish the convergence analysis for BLOs with non-convex LL subproblem. Finally, we demonstrate the effectiveness of AIT through three numerical examples, typical learning and vision applications (e.g., data hyper-cleaning and few-shot learning) and more challenging tasks such as neural architecture search.

Abstract:
Deep Reinforcement Learning (DRL) methods have shown remarkable success in many applications, yet their high energy consumption limits their practicability. Recent studies incorporated energy-efficient Spiking Neural Networks (SNNs) to build Spiking DRL methods and lower energy consumption by setting a shorter simulation duration for SNNs to compute fewer gradients. However, these existing Spiking DRL methods fail to sample sufficient high-quality samples within a fixed-size replay buffer and perform poorly when the simulation duration is small, introducing the challenging tradeoff between energy consumption and model performance. Motivated by such observations, we develop a generic resilient experience replay method that can be seamlessly integrated into existing spiking DRL methods to effectively address the above tradeoff. Specifically, we allow the replay buffer to dynamically expand as the number of training samples increases, thereby accommodating more potentially valuable candidate samples for policy training. Meanwhile, we introduce an adaptive approach to manage the buffer size by determining when to shrink the replay buffer and removing redundant samples automatically. This strategy prevents the buffer from expanding unnecessarily, thereby mitigating the potential negative impact on model performance. Extensive experimental results demonstrate that our approach significantly enhances the performance of five state-of-the-art (SOTA) spiking DRL methods across various simulation durations in sixteen tasks, in terms of return, without compromising their energy efficiency.

Abstract:
This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused on a particular neighbor. Traditional methods predominantly use a Mixture-of-Experts (MoE) approach, targeting a few fixed test label distributions that exhibit substantial global variations. However, the local variations are left unconsidered. To address this issue, we propose a new MoE strategy, DirMixE, which assigns experts to different Dirichlet meta-distributions of the label distribution, each targeting a specific aspect of local variations. Additionally, the diversity among these Dirichlet meta-distributions inherently captures global variations. This dual-level approach also leads to a more stable objective function, allowing us to sample different test distributions better to quantify the mean and variance of performance outcomes. Building on this idea, we develop a general Latent Skill Finetuning (LSF) framework for parameter-efficient finetuning of foundation models. We provide implementations based on LoRA and Adapter. Theoretically, we derive upper bounds on the generalization error for both standard learning and PEFT. Under mild assumptions, we show that the variance-based regularization helps tighten these bounds. Furthermore, we prove that the covering number of the PEFT hypothesis class scales with the number of trainable parameters. Finally, extensive experiments on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist validate the effectiveness of DirMixE.
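
The idea of sampling test label distributions from a Dirichlet meta-distribution to quantify the mean and variance of performance can be sketched as below. This is a simplified illustration, not the DirMixE objective: it assumes fixed per-class accuracies that do not change with the label distribution, so accuracy under a sampled distribution is just the prior-weighted average. The function names and inputs are hypothetical.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw one label distribution from Dirichlet(alpha) via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def performance_stats(per_class_acc, alpha, n_samples=1000, seed=0):
    """Mean and variance of accuracy over label distributions drawn from Dirichlet(alpha)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        prior = sample_dirichlet(alpha, rng)
        # Accuracy under this label distribution: class accuracies weighted by class prior.
        scores.append(sum(p * a for p, a in zip(prior, per_class_acc)))
    return statistics.mean(scores), statistics.pvariance(scores)
```

A concentration vector such as [10, 1, 1] puts most mass near head-heavy label distributions, while [1, 1, 10] favours tail-heavy ones; comparing the returned mean and variance across several such vectors gives a rough picture of how sensitive a classifier is to label shift.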

Abstract:
As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and correctly aligned data pairs are required for model training. However, unlike unimodal datasets, multimodal datasets are much harder to collect and annotate precisely. As an alternative, the co-occurred data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in the area. Unfortunately, the cheaply collected dataset unavoidably contains many mismatched data pairs, which have been proven to be harmful to the model’s performance. To address this, we propose BiCro++ (Improved Bidirectional Cross-modal Similarity Consistency). This module can be integrated into existing cross-modal matching models, enhancing their robustness against noisy data through self-adaptive soft labels that dynamically reflect the true correspondence of data pairs. The basic idea of BiCro++ is motivated by the observation that, taking image-text matching as an example, similar images should have similar textual descriptions and vice versa. This bidirectional similarity consistency can be directly translated into soft labels as a self-supervision signal to train the matching model. To further refine soft label quality, BiCro++ first introduces a Diagonal-Dominance Purification process to identify reliable anchor points from the noisy dataset as the reference for soft label estimation. Then it employs a Hybrid-level Codebook Alignment mechanism that establishes enhanced consistency in bidirectional cross-modal similarity. The experiments on three popular cross-modal matching datasets show that our method significantly improves the noise-robustness of various matching models, and surpasses the state-of-the-art method by an average of 5.3%, 3.1% and 6.4% in terms of recall, respectively.

Abstract:
In the past few decades, autonomous driving algorithms have made significant progress in perception, planning, and control. However, evaluating individual components does not fully reflect the performance of entire systems, highlighting the need for more holistic assessment methods. This motivates the development of HUGSIM, a closed-loop, photo-realistic, and real-time simulator for evaluating autonomous driving algorithms. We achieve this by lifting captured 2D RGB images into the 3D space via 3D Gaussian Splatting, improving the rendering quality for closed-loop scenarios, and building the closed-loop environment. In terms of rendering, we tackle challenges of novel view synthesis in closed-loop scenarios, including viewpoint extrapolation and 360-degree vehicle rendering. Beyond novel view synthesis, HUGSIM further enables the full closed simulation loop, dynamically updating the ego and actor states and observations based on control commands. Moreover, HUGSIM offers a comprehensive benchmark across more than 70 sequences from KITTI-360, Waymo, nuScenes, and PandaSet, along with over 400 varying scenarios, providing a fair and realistic evaluation platform for existing autonomous driving algorithms. HUGSIM not only serves as an intuitive evaluation benchmark but also unlocks the potential for fine-tuning autonomous driving algorithms in a photorealistic closed-loop setting.

Abstract:
Quality enhancement methods have been widely integrated into visual communication pipelines to mitigate artifacts in compressed images. Ideally, these quality enhancement methods should perform robustly when applied to images that have already undergone prior enhancement during transmission. We refer to this scenario as multi-enhancement, which generalizes the well-known multi-generation scenario of image compression. Unfortunately, current quality enhancement methods suffer from severe degradation when applied in multi-enhancement. To address this challenge, we propose a novel adaptation method that transforms existing quality enhancement models into domain-consistent ones. Specifically, our method enhances a low-quality compressed image into a high-quality image within the natural domain during the first enhancement, and ensures that subsequent enhancements preserve this quality without further degradation. Extensive experiments validate the effectiveness of our method and show that various existing models can be successfully adapted to maintain both fidelity and perceptual quality in multi-enhancement scenarios.

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Electronic and Computer Engineering, Peking University, Shenzhen, China; Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract:
Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, which is a key task of spike camera. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network Spike2Flow++ to estimate optical flow for spike camera. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation to construct an information bank for all motions and adaptively extract contexts from the bank for each iteration during recurrent decoding of each motion. To train and evaluate our network, we construct a real scene with spikes and flow++ (RSSF++) based on real-world scenes. Experiments demonstrate that our Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.

Abstract:
Hyperspectral camouflaged object tracking remains a significant challenge due to the high similarity between objects and replicas in texture and color. Despite recent progress, the bias present in the tracker and the embedding token hinders the model training. Specifically, most methods rely on false-color three-channel images to fine-tune RGB-based trackers. However, this introduces a confounding effect within the RGB domain, potentially leading to harmful biases that misguide the model toward spurious correlations while neglecting the critical spectral discrimination inherent in hyperspectral images. Furthermore, current token-type embedding methods overlook the key correlations between templates and searches, ultimately confusing correlation and impairing tracking performance. To address these challenges, this paper proposes a new unbiased tracking framework named Causal HyperPrompter. It first introduces a structural causal model to disentangle and control exclusive causal factors during tracking, and incorporates a counterfactual intervention strategy to eliminate confounding variables and mitigate the bias inherited from RGB-based models. In addition, we present a novel token-type embedding module that integrates local spectral angle modeling to enhance the semantic link between template and search tokens, thereby improving the model’s sensitivity to object localization. Lastly, to overcome the difficulty of manually initializing the bounding box and to address data scarcity, we introduce a large-scale hyperspectral camouflaged object detection and tracking dataset, BihoT-130k, consisting of 130,750 annotated frames across various camouflage scenes. Extensive experiments on multiple large-scale datasets illustrate the effectiveness of our proposed methods.
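
The abstract mentions local spectral angle modeling without giving details. As a minimal sketch of the underlying quantity only, the spectral angle between two spectra is the arccosine of their cosine similarity. The function name and the list-of-floats representation are assumptions for illustration, not the paper's embedding module.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra (equal-length sequences of band values)."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)
```

Because the angle ignores overall brightness, two pixels whose spectra differ only by a scale factor have an angle close to zero, while spectrally distinct materials give a larger angle.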

Abstract:
Stochastic optimization is the workhorse behind the success of many machine learning algorithms. The existing theoretical analysis of stochastic optimization mainly focuses on the behavior on the training dataset or requires a convexity assumption. In this paper, we provide a comprehensive analysis on the generalization behavior of stochastic optimization with nonconvex problems. We first present both upper and lower bounds on the uniform convergence of gradients. Our analysis outperforms existing results by incorporating the 2nd moment of the gradient at a single model into the upper bound. Based on this uniform convergence, we provide a high-probability bound on the gradient norm of population risks for stochastic gradient descent (SGD), which significantly improves the existing results. We show that better bounds can be achieved under further assumptions such as quasi-convexity or Polyak-Łojasiewicz condition. Our analysis shows the computation cost can be further decreased by taking the variance-reduction trick. Finally, we study the utility guarantee of SGD under a privacy constraint. Our results show a linear speed up with respect to the batch size, which shows the benefit of computing gradients in a distributed manner.
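
For reference, the Polyak-Łojasiewicz condition named above is usually stated as follows. The symbols ($F$ for the objective, $F^{*}$ for its infimum, $\mu > 0$ for the PL constant) are notation assumed here; the abstract does not fix any.

```latex
\frac{1}{2}\,\bigl\|\nabla F(w)\bigr\|^{2} \;\ge\; \mu\,\bigl(F(w) - F^{*}\bigr)
\quad \text{for all } w,
\qquad\text{equivalently}\qquad
F(w) - F^{*} \;\le\; \frac{1}{2\mu}\,\bigl\|\nabla F(w)\bigr\|^{2}.
```

Under this condition any bound on the gradient norm translates directly into a bound on the excess objective value, which is why such assumptions can yield sharper guarantees than gradient-norm bounds alone.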

Abstract:
Reliable uncertainty estimation has become a crucial requirement for the industrial deployment of deep learning algorithms, particularly in high-risk applications such as autonomous driving and medical diagnosis. However, uncertainty estimation methods relying on deep ensembling or Bayesian neural networks typically entail significant computational overhead. To address this challenge, a novel paradigm called Evidential Deep Learning (EDL) has emerged, providing high-quality uncertainty estimation with minimal additional computation in a single forward pass. This survey provides a comprehensive overview of the current research on EDL, designed to offer readers a broad introduction to the field without assuming prior knowledge. Specifically, we first delve into the theoretical foundation of EDL, the subjective logic theory, and discuss its distinctions from other uncertainty estimation frameworks. We further present existing theoretical advancements in EDL from four perspectives: reformulating the evidence collection process, improving uncertainty estimation via OOD samples, delving into various training strategies, and evidential regression networks. Thereafter, we elaborate on its extensive applications across various machine learning paradigms and downstream tasks. In the end, an outlook on future directions for better performances and broader adoption of EDL is provided, highlighting potential research avenues.
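
As a minimal sketch of the subjective-logic view commonly used in evidential classification, the function below turns non-negative per-class evidence into belief masses and a single uncertainty mass. The function name is hypothetical and the mapping (Dirichlet parameters equal to evidence plus one) is one widely used formulation, assumed here rather than taken from the survey.

```python
def evidential_opinion(evidence):
    """Return (belief, uncertainty, expected_prob) from non-negative class evidence."""
    if not evidence:
        raise ValueError("need at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    # Dirichlet concentration parameters: one pseudo-count per class plus evidence.
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    # Uncertainty mass shrinks as total evidence grows; belief and uncertainty sum to 1.
    uncertainty = num_classes / strength
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob
```

With zero evidence for every class the uncertainty mass is 1 and the expected probabilities are uniform; all of this is computed from a single forward pass that outputs the evidence vector.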

Abstract:
Embedding graphs in continuous spaces is a key factor for automatic information extraction in diverse tasks (e.g., learning, inferring, predicting). The reliability of graph embeddings directly depends on how much the geometry of the manifold in continuous space matches the graph structure. State-of-the-art manifold-based graph embedding algorithms assume that the projection on a tangential space of each point in the manifold (corresponding to a node in the graph) would locally resemble a Euclidean space. Although this condition helps in achieving efficient analytical solutions to the embedding problem, it is not an adequate set-up for modern real-life graphs, which are characterized by weighted connections across nodes often computed over sparse datasets with missing records. In this work, we introduce a new class of manifolds, named soft manifolds, that can handle this situation. Soft manifolds are mathematical structures with spherical symmetry where the tangent spaces to each point are hypocycloids whose shape is defined according to the velocity of information propagation across the data points. Experimental results on reconstruction tasks on synthetic and real datasets show how the proposed approach enables more accurate and reliable characterization of graphs in continuous spaces with respect to the state-of-the-art.

Abstract:
Robust rigid point cloud registration is effective for accurate positioning and measurement of complex components. The existing registration algorithms, however, fail to overcome the matching distortion caused by structural deviation, unknown abnormal allowance, and various measurement inherent defects. Although the recently proposed VMM and WPMAVM algorithms can inhibit the matching distortion to some extent, they still fail in the presence of numerous abnormal points. In this study, we present a progressive and adaptive variance minimization (PAVM) algorithm to address these issues. A progressive de-pseudo weight is established to ensure the involvement of all point pairs in optimization at the initial registration stage. Then, an approximately truncated weight function is employed to mitigate the influence of abnormal points on registration results. Furthermore, a novel adaptive coordination distance function is established by improving the symmetric point-to-plane distance metric and combining the first-order approximate point-to-point distance metric, which enhances the algorithm speed and stability. The analysis investigates the anti-abnormal interference ability and quadratic convergence, validating the feasibility of the PAVM algorithm. Experiments are undertaken to illustrate the notable benefits of our algorithm in convergence stability, matching speed, and universality. These attributes render the algorithm well-suited for registration tasks involving diverse complex components.

Abstract:
Effectively estimating the uncertainty attached to neural network predictions has become essential to improving robustness, reliability, and trustworthiness. This paper provides an overview of various methodologies for representing, quantifying, and distinguishing two major types of uncertainties (namely, ‘aleatoric’ and ‘epistemic’ uncertainty) in neural networks. The review covers classical probabilistic techniques such as Bayesian neural networks and deep ensembles, as well as methods from generalized probability that leverage uncertainty representations such as Dirichlet distributions, belief functions, random sets, probability intervals, and credal sets, among others. Interval-based approaches employing interval models are also examined. We discuss the strengths and limitations of various methodologies and identify promising research directions for potential future exploration.

Abstract:
Deep neural networks have achieved great success in the last decade. When designing neural networks to handle the ubiquitous geometric data such as point clouds and graphs, it is critical that the model can maintain invariance towards various transformations such as translation, rotation, and scaling. Most existing graph neural network (GNN) approaches can only maintain permutation-invariance, failing to guarantee invariance with respect to other transformations. Besides GNNs, other works design sophisticated transformation-invariant layers, which are computationally expensive and difficult to be extended. In this paper, we revisit why general neural networks cannot maintain transformation invariance. Our findings show that transformation-invariant and distance-preserving initial point representations are sufficient to achieve transformation invariance rather than needing sophisticated neural layer designs. Motivated by these findings, we propose Transformation Invariant Neural Networks (TinvNet), a straightforward and general plug-in for geometric data. Specifically, we realize transformation invariant and distance-preserving initial point representations by modifying multi-dimensional scaling and feed the representations into existing neural networks. We prove that TinvNet can strictly guarantee transformation invariance, being general and flexible enough to be combined with the existing neural networks. Extensive experimental results on point cloud analysis and combinatorial optimization demonstrate the effectiveness and general applicability of our method. We also extend our method into equivariance cases. Based on the results, we advocate that TinvNet should be considered as an essential baseline for further studies of transformation-invariant geometric deep learning.
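
The premise that a distance-based representation is invariant can be checked directly: the pairwise distance matrix, which is the input to multi-dimensional scaling, does not change under rotation or translation. The sketch below verifies this for a rotation about the z-axis. It is not TinvNet's modified multi-dimensional scaling, and the helper names are hypothetical.

```python
import math


def pairwise_distances(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate_z_then_translate(points, angle, shift):
    """Rotate 3D points about the z-axis by `angle` radians, then add `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 3.0)]
moved = rotate_z_then_translate(points, 0.7, (5.0, -2.0, 1.0))
before = pairwise_distances(points)
after = pairwise_distances(moved)
max_change = max(
    abs(a - b) for row_a, row_b in zip(before, after) for a, b in zip(row_a, row_b)
)
```

Here `max_change` is at the level of floating-point rounding. Uniform scaling multiplies every distance by the same factor, so handling it requires an extra normalization step that this sketch does not include.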

Abstract:
This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The endeavors of this paper comprise: (1) A point-matching mask propagation strategy that leverages the feature maps from the DDIM inversion. This streamlines the model’s architecture by obviating the need for tracking models. (2) A training-free AdaIN-guided localized video stylization mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in stylized video. Our proposed UniVST has been validated to be superior to existing methods in quantitative and qualitative metrics. It adeptly addresses the challenges of preserving the primary object’s style while ensuring temporal consistency and detail preservation.
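
For readers unfamiliar with AdaIN, the basic operation re-normalizes each content channel so that its mean and standard deviation match those of the corresponding style channel. The sketch below shows only that standard operation on plain lists of floats; how UniVST applies it at the latent and attention levels is not described in the abstract, and the function names and `eps` default are assumptions.

```python
import math


def _mean_std(values, eps):
    """Mean and (eps-stabilized) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain_channel(content, style, eps=1e-5):
    """Shift and scale `content` so its statistics match those of `style`."""
    c_mean, c_std = _mean_std(content, eps)
    s_mean, s_std = _mean_std(style, eps)
    return [s_std * (v - c_mean) / c_std + s_mean for v in content]


def adain(content_channels, style_channels, eps=1e-5):
    """Apply AdaIN independently to each pair of channels."""
    if len(content_channels) != len(style_channels):
        raise ValueError("content and style must have the same number of channels")
    return [adain_channel(c, s, eps) for c, s in zip(content_channels, style_channels)]
```

The spatial arrangement of the content values is preserved; only the per-channel statistics are replaced, which is what makes the operation usable without training.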

Abstract:
3D scene flow represents the dense per-point motion field in dynamic scenes, playing a crucial role in various downstream tasks, including motion segmentation, dynamic scene reconstruction, 4D content generation, etc. However, previous regression-based works commonly suffer from unreliable correlations caused by locally constrained search ranges and struggle with the absence of timely feedback regarding the flow estimation uncertainty during training. To address these challenges, we propose a novel uncertainty-aware network for scene flow estimation, termed DifFlow3D, based on the conditional probabilistic diffusion model. Hierarchical diffusion-based flow estimation blocks are designed to enhance the correlation robustness and resilience to challenging cases, e.g., dynamics, noisy inputs, repetitive patterns, etc. To mitigate the generation diversity, three key flow-related features are leveraged as conditions in our diffusion model. Furthermore, we develop an uncertainty estimation module within diffusion to assess the reliability of estimated scene flow dynamically. A Hidden State Denoising strategy (HSD) is also introduced to further boost the stability of the reverse denoising process. Extensive experiments conducted on four scene flow datasets, including both synthetic and real-world datasets (FlyingThings3D, KITTI 2015, Argoverse, and Waymo Open), demonstrate the superiority of our proposed DifFlow3D. Compared to prior state-of-the-art methods, DifFlow3D has 26.0%, 36.4%, 35.3%, and 17.7% EPE3D reduction respectively across four datasets. Only trained on the synthetic FlyingThings3D dataset, our method achieves an unprecedented millimeter-level accuracy (0.0070 m EPE3D) on the real-scene KITTI dataset, highlighting its exceptional generalization capability. Additionally, our diffusion-based refinement paradigm can be seamlessly integrated as a plug-and-play module into existing scene flow networks, significantly enhancing their estimation accuracy. We also introduce our pre-trained scene flow estimator as explicit motion priors into the novel dynamic LiDAR view synthesis task, which validates its great potential for improving the 4D LiDAR reconstruction performance.
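
The EPE3D figures quoted above are 3D end-point errors. As a minimal sketch, assuming the usual definition (mean Euclidean distance between predicted and ground-truth flow vectors, in the units of the data), the metric can be computed as follows. The function name is hypothetical.

```python
import math


def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if not pred_flow or len(pred_flow) != len(gt_flow):
        raise ValueError("need two non-empty flow fields of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow))
    return total / len(pred_flow)
```

For example, if one of two points is predicted exactly and the other is off by 0.01 m, the result is 0.005 m.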

Abstract:
Learning similarity between scene graphs and images aims to estimate a similarity score given a scene graph and an image. There is currently no research dedicated to this task, although it is critical for scene graph generation and downstream applications. Scene graph generation is conventionally evaluated by Recall@K and mean Recall@K, which measure the ratio of predicted triplets that appear in the human-labeled triplet set. However, such triplet-oriented metrics fail to demonstrate the overall semantic difference between a scene graph and an image and are sensitive to annotation bias and noise. Using generated scene graphs in the downstream applications is therefore limited. To address this issue, for the first time, we propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images. Our novel framework consists of a graph Transformer and an image Transformer to align scene graphs and their corresponding images in the shared latent space. We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings. Based on our framework, we propose R-Precision measuring image retrieval accuracy as a new evaluation metric for scene graph generation. We establish new benchmarks on the Visual Genome and Open Images datasets. Extensive experiments are conducted to verify the effectiveness of SPAN, which shows great potential as a scene graph encoder.
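
To make the triplet-oriented metric concrete, here is a simplified sketch of Recall@K for a single image, assuming triplets are (subject, predicate, object) tuples and predictions are already sorted by confidence. Real evaluations also match bounding boxes, which this sketch omits, and the function name is hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of labeled triplets recovered among the top-k predicted triplets."""
    if k <= 0:
        raise ValueError("k must be positive")
    labeled = set(ground_truth)
    if not labeled:
        raise ValueError("ground truth must contain at least one triplet")
    top_k = set(ranked_predictions[:k])
    return len(top_k & labeled) / len(labeled)


predictions = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
]
labels = [("man", "riding", "horse"), ("horse", "on", "grass")]
scores = [recall_at_k(predictions, labels, k) for k in (1, 2, 3)]
```

In this example the prediction ("man", "wearing", "hat") may be perfectly correct for the image yet earns no credit because it was not annotated, which illustrates the sensitivity to annotation bias noted above.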

Abstract:
Optimization of deep neural networks (DNNs) has been driving modern advancements in artificial intelligence. With DNNs characterized by a prolonged sequence of nonlinear propagation, determining their optimal parameters given an objective naturally fits within Optimal Control Programming. Such an interpretation of DNNs as dynamical systems has proven crucial in offering principled analysis from numerical equations to physics. In parallel to these theoretical pursuits, this paper focuses on an algorithmic perspective. Our motivating observation is the striking algorithmic resemblance between the Backpropagation algorithm for computing gradients in DNNs and the optimality conditions for dynamical systems, expressed through another backward process known as dynamic programming. Consolidating this connection, in which Backpropagation admits a variational structure and solves an approximate dynamic program up to the first-order expansion, leads to a new class of optimization methods exploring higher-order expansions of the Bellman equation. The resulting optimizer, Optimal Control Theoretic Neural Optimizer (OCNOpt), enables rich algorithmic opportunities, including layer-wise feedback policies, game-theoretic applications, and higher-order training of continuous-time models such as Neural ODEs. Extensive experiments demonstrate that OCNOpt improves upon existing methods in robustness and efficiency while maintaining manageable computational complexity, paving new avenues for principled algorithmic design grounded in dynamical systems and optimal control theory.

Abstract:
The field of neural rendering has seen remarkable progress, driven by advancements in generative models and differentiable rendering techniques. While 2D diffusion has achieved notable success, the development of a unified 3D diffusion pipeline remains an open challenge. This paper presents a novel framework, LN3Diff++, designed to bridge this gap and facilitate fast, high-quality, and versatile conditional 3D generation. Our method leverages a 3D-aware architecture and a variational autoencoder (VAE) to encode input image(s) into a structured, compact 3D latent space. The latent representation is then decoded by a transformer-based decoder into a high-capacity 3D neural field. By training a diffusion model on this 3D-aware latent space, our method achieves superior performance for category-specific 3D generation on ShapeNet and FFHQ, as well as category-free image/text-conditioned 3D generation over Objaverse. Moreover, it surpasses existing 3D diffusion methods in inference speed, requiring no per-instance optimization.

Abstract:
Holistic Visual Understanding (HVU), encompassing tasks like intention recognition, emotion analysis, scene understanding, and content moderation, necessitates integrating low-level visual perception (‘sight’) with high-level semantic reasoning (‘semantics’). While large Vision-Language Models (VLMs) like CLIP offer powerful representations, their inherent ‘sight’ bias limits their direct application to these semantically rich tasks. Our prior work, IntCLIP, addressed Multi-label Intention Understanding (MIU) using a dual-branch architecture but faced challenges with label generation instability (Hierarchical Class Integration - HCI) and limited feature interaction (unidirectional Sight-assisted Aggregation). This paper introduces an enhanced framework that significantly extends IntCLIP to tackle the broader HVU challenge. We propose Semantic Label Refinement (SLR), an iterative, metric-guided process leveraging Large Language Models (LLMs) and quantitative evaluation within the CLIP embedding space to generate stable, optimized semantic labels. We also introduce a novel bidirectional attention mechanism (Symmetric Aggregation) that enables balanced, mutual refinement between sight and semantic feature maps. By evaluating on a comprehensive benchmark spanning MIU, Image Emotion Recognition, Indoor Scene Recognition, and Visual Content Moderation, we demonstrate that our framework not only advances the state-of-the-art in MIU but also achieves superior performance across diverse HVU tasks. This framework provides a unified and robust solution for synergizing sight and semantics, pushing towards more human-like visual intelligence. Code is available at https://github.com/yan9qu/PAMI25-HVU.

Abstract:
Multi-output deep neural networks (MONs) contain multiple output branches of various tasks, and these tasks typically share partial network filters, resulting in entangled inference routes between different tasks within the networks. Due to the divergent optimization objectives, the task gradients during training usually interfere with each other along the shared routes, which decreases the overall model performance. To address this issue, we propose a novel gradient de-conflict algorithm named DR-MGF (Dynamic Routes and Meta-weighted Gradient Fusion). Different from existing de-conflict methods, DR-MGF achieves gradient de-conflict in MONs by learning task-preferred inference routes. The proposed method is motivated by our experimental findings that the shared filters are not equally important for different tasks. By designing learnable task-specific importance variables, DR-MGF evaluates the importance of filters for different tasks. Through making the dominance of tasks over filters proportional to the task-specific importance of filters, DR-MGF can effectively reduce inter-task interference. These task-specific importance variables ultimately determine task-preferred inference routes at the end of training iterations. Extensive experimental results on CIFAR, ImageNet, and NYUv2 demonstrate that DR-MGF outperforms existing de-conflict methods. Furthermore, DR-MGF can be extended to general MONs without modifying the overall network structures.

Abstract:
We present nonlinear formulations to Shape-from-Template (SfT) and Non-Rigid Structure-from-Motion (NRSfM) faithfully exploiting the isometric, conformal and equiareal deformation models. Existing work uses relaxations such as inextensibility or requires knowing the optic flow field around the correspondences, an impractical assumption. In contrast, the proposed formulations only require point correspondences and resolve all ambiguities using the notions of maximal depth and maximal isometry heuristics. We propose solution methods using Semi-Definite Programming (SDP) for all formulations. We show that straightforward SDP models conflict with the usual maximal depth heuristic and propose an adapted opposite-depth parameterisation demonstrating a lesser relaxation gap. Experimental results on many real-world benchmark datasets demonstrate superior accuracy over existing methods.

Abstract:
Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench++, a comprehensive benchmark suite that dissects “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench++ has several appealing properties: 1) Comprehensive Dimensions: VBench++ comprises 16 dimensions in text-to-video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship, etc). The evaluation metrics with fine-grained levels reveal individual models’ strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks’ alignment with human perception, for each evaluation dimension respectively. 3) Valuable Insights: We look into current models’ ability across various evaluation dimensions, and various content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ is designed to evaluate a wide range of video generation tasks, including text-to-video and image-to-video. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++, including all prompts, the Image Suite, evaluation methods, generated videos, and human preference annotations.

Abstract:
Based on the message-passing paradigm, a large body of research has proposed diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less focus has been put on feature transformation, another major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not completely free up the power of the inherent feature transformation operation. Motivated by this observation, we propose the Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction by this model provides topology-agnostic knowledge feedback that can further instruct the learning of the GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model but also lets each promote the other by injecting its knowledge into the other. In addition, a theoretical analysis is further provided to demonstrate that BiKT improves the generalization bound of the GNNs from the perspective of domain adaptation. An extensive group of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings performance gains of 0.5% to 4% over the original GNN, which means a boosted GNN is obtained. Meanwhile, the derived model also shows a powerful performance to compete with or even surpass the original GNN, enabling us to flexibly apply it independently to some other specific downstream tasks.

Abstract:
Recent studies have revealed that text-to-image diffusion models are vulnerable to backdoor attacks, where attackers implant stealthy textual triggers to manipulate model outputs. Previous backdoor detection methods primarily focus on the static features of backdoor samples. However, a vital property of diffusion models is their inherent dynamism. This study introduces a novel backdoor detection perspective named Dynamic Attention Analysis (DAA), showing that these dynamic characteristics serve as better indicators for backdoor detection. Specifically, by examining the dynamic evolution of cross-attention maps, we observe that backdoor samples exhibit distinct feature evolution patterns at the < > token compared to benign samples. To quantify these dynamic anomalies, we first introduce DAA-I, which treats the tokens’ attention maps as spatially independent and measures dynamic feature using the Frobenius norm. Furthermore, to better capture the interactions between attention maps and refine the feature, we propose a dynamical system-based approach, referred to as DAA-S. This model formulates the spatial correlations among attention maps using a graph-based state equation and we theoretically analyze the global asymptotic stability of this method. Extensive experiments across six representative backdoor attack scenarios demonstrate that our approach significantly surpasses existing detection methods, achieving an average F1 Score of 79.27% and an AUC of 86.27%.
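
The abstract does not give the exact definition of DAA-I, so the following is only an illustrative sketch of one way to measure how an attention map evolves: the Frobenius norm of the change between consecutive maps of a single token across denoising steps. The function names and the nested-list representation are assumptions.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2D list."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def attention_dynamics(maps):
    """Frobenius norm of the difference between each pair of consecutive maps."""
    changes = []
    for prev, curr in zip(maps, maps[1:]):
        diff = [[c - p for p, c in zip(p_row, c_row)] for p_row, c_row in zip(prev, curr)]
        changes.append(frobenius_norm(diff))
    return changes


maps = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 0.0], [0.0, 4.0]],
    [[3.0, 0.0], [0.0, 4.0]],
]
changes = attention_dynamics(maps)
```

A sequence of such values summarizes the trajectory of one token's attention over time, which is the kind of dynamic feature a detector could threshold or classify.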

Abstract:
This paper introduces Test-time Correction (TTC), an online 3D detection system designed to rectify test-time errors using various auxiliary feedback, aiming to enhance the safety of deployed autonomous driving systems. Unlike conventional offline 3D detectors that remain fixed during inference, TTC enables immediate online error correction without retraining, allowing autonomous vehicles to adapt to new scenarios and reduce deployment risks. To achieve this, we equip existing 3D detectors with an Online Adapter (OA) module—a prompt-driven query generator for real-time correction. At the core of OA module are visual prompts: image-based descriptions of objects of interest derived from auxiliary feedback such as mismatches with 2D detections, road descriptions, or user clicks. These visual prompts, collected from risky objects during inference, are maintained in a visual prompt buffer to enable continuous correction in future frames. By leveraging this mechanism, TTC consistently detects risky objects, achieving reliable, adaptive, and versatile driving autonomy. Extensive experiments show that TTC significantly improves instant error rectification over frozen 3D detectors, even under limited labels, zero-shot settings, and adverse conditions. We hope this work inspires future research on post-deployment online rectification systems for autonomous driving.

Abstract:
Deep neural networks (DNNs) can be manipulated to exhibit specific behaviors when exposed to specific trigger patterns, without affecting their performance on benign samples, a threat known as a backdoor attack. Currently, implementing backdoor attacks in physical scenarios still faces significant challenges. Physical attacks are labor-intensive and time-consuming, and the triggers are selected in a manual and heuristic way. Moreover, expanding digital attacks to physical scenarios faces many challenges due to their sensitivity to visual distortions and the absence of counterparts in the real world. To address these challenges, we define a novel trigger called the Visible, Semantic, Sample-specific, and Compatible (VSSC) trigger, which achieves effectiveness, stealthiness, and robustness simultaneously and can also be effectively deployed in the physical scenario using corresponding objects. To implement the VSSC trigger, we propose an automated pipeline comprising three modules: a trigger selection module that systematically identifies suitable triggers leveraging large language models, a trigger insertion module that employs generative models to seamlessly integrate triggers into images, and a quality assessment module that ensures the natural and successful insertion of triggers through vision-language models. Extensive experimental results and analysis validate the effectiveness, stealthiness, and robustness of the VSSC trigger. It not only maintains robustness under visual distortions but also demonstrates strong practicality in the physical scenario. By providing the first automated pipeline, VSSC transforms physical backdoor attacks from a labor-intensive craft into a systematic and realistic threat to real-world AI systems. We hope the proposed VSSC trigger and implementation approach could inspire future studies on designing more practical triggers in backdoor attacks.

Abstract:
Despite the scalable performance of vision transformers (ViTs), their dense computational costs undermine their position in industrial applications. Post-training quantization (PTQ), which tunes ViTs with a tiny dataset and runs them in a low-bit format, addresses the cost issue well but suffers larger performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) Quantization inefficiency in the prevalent log2 quantizer for post-Softmax activations; (2) Rugged and magnified loss landscape in coarse-grained quantization granularity for post-LayerNorm activations. Then, I&S-ViT addresses these issues by introducing: (1) A novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) A three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT’s superiority over existing PTQ methods for ViTs, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of W3A3 ViT-B by an impressive 50.68%.
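
The contrast between a plain log2 quantizer and a shift-then-uniform variant can be sketched as follows. This is only one plausible reading of the abstract, not the authors' implementation: the function names, the shift value `eta`, and the way the uniform step size is chosen are all assumptions made for illustration.

```python
import math


def log2_quantize(x, bits):
    """Plain log2 quantizer for x in (0, 1]: integer code q, dequantized as 2**-q.

    x must be strictly positive, because log2(0) is undefined.
    """
    levels = 2 ** bits - 1
    q = min(max(round(-math.log2(x)), 0), levels)
    return 2.0 ** -q


def shift_uniform_log2_quantize(x, bits, eta=1 / 64):
    """Hypothetical shift-uniform-log2 sketch for x in [0, 1].

    Step 1 (shift): add eta so that x = 0 becomes representable.
    Step 2 (uniform): quantize -log2(x + eta) on a uniform grid.
    The grid bounds below are an assumption chosen so that x = 0 and x = 1
    are both reconstructed exactly.
    """
    levels = 2 ** bits - 1
    top = -math.log2(eta)            # value of -log2(x + eta) at x = 0
    bottom = -math.log2(1.0 + eta)   # value of -log2(x + eta) at x = 1
    step = (top - bottom) / levels
    code = round((-math.log2(x + eta) - bottom) / step)
    code = min(max(code, 0), levels)
    # Dequantize: invert the log2 and undo the shift.
    return 2.0 ** -(bottom + code * step) - eta


if __name__ == "__main__":
    for value in (0.0, 0.1, 0.5, 1.0):
        print(value, shift_uniform_log2_quantize(value, bits=3))
```

In this sketch the shift is what lets the input 0 be encoded at all, while the uniform grid over the log-domain value decouples the number of levels from the integer exponents used by the plain quantizer.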

Abstract:
Obtaining highly consistent correspondences between point clouds is crucial for computer vision tasks such as 3D registration and recognition. Due to nuisances such as limited overlap and noise, initial correspondences often contain a large number of outliers, posing a great challenge to downstream tasks. In this paper, we present a novel single voter spreading (SVOS) method for efficient 3D correspondence grouping and 3D registration. Our core insight is to leverage only low-order graph constraints in a single voter spreading voting scheme to achieve constraining ability comparable to that of complex constraints without searching for them. First, a simple first-order graph is constructed for the initial correspondence set. Second, a two-stage voting method is proposed, including single voter voting and spread voters voting. Each voting stage involves both local and global voting via edge constraints only. This promises good selectivity while making the voting process time- and storage-efficient. Finally, top-scored correspondences are selected for robust transformation estimation. Experiments on U3M, 3DMatch/3DLoMatch, ETH, and KITTI-LC datasets verify that SVOS achieves new state-of-the-art correspondence grouping and registration performance, while being lightweight and robust to graph construction parameters.

Abstract:
Existing cross-domain few-shot learning (CDFSL) methods, which develop training strategies in the source domain to enhance model transferability, face challenges when applied to large-scale pre-trained models (LMs), as their source domains and training strategies are not accessible. Besides, fine-tuning LMs specifically for CDFSL requires substantial computational resources, which limits their practicality. Therefore, this paper investigates the source-free CDFSL (SF-CDFSL) problem to solve the few-shot learning (FSL) task in the target domain using only a pre-trained model and a few target samples, without requiring source data or training strategies. However, the inaccessibility of source data prevents explicitly reducing the domain gaps between the source and target. To tackle this challenge, this paper proposes a novel approach, Step-wise Distribution-aligned Style Prompt Tuning (StepSPT), to implicitly narrow the domain gaps from the perspective of prediction distribution optimization. StepSPT first proposes a style prompt that adjusts the target samples to mirror the expected distribution. Furthermore, StepSPT tunes the style prompt and classifier by exploring a dual-phase optimization process (external and internal processes). In the external process, a step-wise distribution alignment strategy is introduced to tune the proposed style prompt by factorizing the prediction distribution optimization problem into a multi-step distribution alignment problem. In the internal process, the classifier is updated via standard cross-entropy loss. Evaluation on 5 datasets illustrates the superiority of StepSPT over existing prompt tuning-based methods and state-of-the-art methods (SOTAs). Furthermore, ablation studies and performance analyses highlight the efficacy of StepSPT.

Abstract:
The human brain is a highly efficient processing unit, and understanding how it works can inspire new algorithms and architectures in machine learning. In this work, we introduce a novel framework named Brain Activation Network (BRACTIVE), a transformer-based approach to studying the human visual brain. The primary objective of BRACTIVE is to align the visual features of subjects with their corresponding brain representations using functional Magnetic Resonance Imaging (fMRI) signals. This enables us to identify the brain’s Regions of Interest (ROIs) in the subjects. Unlike previous brain research methods, which can only identify ROIs for one subject at a time and are limited by the number of subjects, BRACTIVE automatically extends this identification to multiple subjects and ROIs. Our experiments demonstrate that BRACTIVE effectively identifies person-specific regions of interest, such as face and body-selective areas, aligning with neuroscience findings and indicating potential applicability to various object categories. More importantly, we found that leveraging human visual brain activity to guide deep neural networks enhances performance across various benchmarks. This underscores the potential of BRACTIVE in both neuroscience and machine intelligence studies.

Abstract:
Semi-supervised learning (SSL) provides a practical framework for leveraging massive unlabeled samples, especially when labels are expensive for facial expression recognition (FER). Typical SSL methods like FixMatch select unlabeled samples with confidence scores above a fixed threshold for training. However, these methods face two primary limitations: failing to consider the varying confidence across facial expression categories and failing to utilize unlabeled facial expression samples efficiently. To address these challenges, we propose an Enhanced Adaptive Confidence Margin (EACM), consisting of dynamic thresholds for different categories, to fully exploit unlabeled samples. Specifically, we employ the predictions on labeled samples at each training iteration to learn an EACM. It then partitions unlabeled samples into two subsets: (1) subset I, including samples whose confidence scores are no less than the margin; (2) subset II, including samples whose confidence scores are less than the margin. For samples in subset I, we constrain their predictions on strongly-augmented versions to match the pseudo-labels derived from the predictions on weakly-augmented versions. Meanwhile, we introduce a feature-level contrastive objective to enhance the similarity between two weakly-augmented features of a sample in subset II. We extensively evaluate EACM on image-based and video-based facial expression datasets, showing that our method achieves superior performance, significantly surpassing fully-supervised baselines in a semi-supervised manner. Additionally, EACM shows promise in leveraging cross-dataset unlabeled samples for practical training to boost fully-supervised performance.
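
The partition into subset I and subset II described above can be illustrated with a small sketch. The abstract does not say how the per-category margin is computed from labeled predictions, so `estimate_margins` below uses an assumed rule (the mean confidence of correctly classified labeled samples per class); both function names and the `default` fallback are hypothetical.

```python
def estimate_margins(labeled_probs, labels, num_classes, default=0.8):
    """Assumed margin rule: mean confidence of correctly predicted labeled samples.

    Classes with no correct prediction fall back to `default` (an assumption).
    """
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for probs, label in zip(labeled_probs, labels):
        confidence = max(probs)
        if probs.index(confidence) == label:
            sums[label] += confidence
            counts[label] += 1
    return [sums[c] / counts[c] if counts[c] else default
            for c in range(num_classes)]


def partition_by_margin(unlabeled_probs, margins):
    """Split unlabeled samples using the margin of their predicted class.

    Subset I: (index, pseudo_label) pairs with confidence >= margin.
    Subset II: indices with confidence < margin.
    """
    subset_one, subset_two = [], []
    for index, probs in enumerate(unlabeled_probs):
        confidence = max(probs)
        pseudo_label = probs.index(confidence)
        if confidence >= margins[pseudo_label]:
            subset_one.append((index, pseudo_label))
        else:
            subset_two.append(index)
    return subset_one, subset_two


if __name__ == "__main__":
    margins = estimate_margins([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]], [0, 0, 1], 2)
    print(partition_by_margin(
        [[0.85, 0.15], [0.75, 0.25], [0.45, 0.55], [0.3, 0.7]], margins))
```

Because each class has its own margin, a sample predicted with confidence 0.7 can land in subset I for a class the model finds hard while a 0.75 prediction for an easier class lands in subset II, which is the behavior a single fixed threshold cannot express.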

Abstract:
3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering that different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we incorporate insights from causal graph theory into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.

Abstract:
Adversarial attacks are a major obstacle to the deployment of deep neural networks (DNNs) for security-sensitive applications. To address these adversarial perturbations, various adversarial defense strategies have been developed, with Adversarial Training (AT) being one of the most effective methods to protect neural networks from adversarial attacks. However, existing AT methods struggle against training-agnostic attacks due to their limited generalizability. This suggests that the AT models lack a unified perspective for various attacks to conduct universal defense. This paper sheds light on a generalizable prior under various attacks: consistent class confusion (3C), i.e., an AT classifier often confuses the predictions between correct and ambiguous classes in a highly similar pattern among diverse attacks. Relying on this latent prior as a bridge between seen and agnostic attacks, we propose a more generalized AT model by mitigating consistent class confusion (M3C) to resist training-agnostic attacks. Specifically, we optimize an Adversarial Confusion Loss (ACL), which is weighted by uncertainty, to distinguish the most confused classes and encourage the AT model to focus on these confused samples. To suppress malignant features affecting correct predictions and producing significant class confusion, we propose a Gradient-Aware Attention (GAA) mechanism to enhance the classification confidence of correct classes and eliminate class confusion. Experiments on multiple benchmarks and network frameworks demonstrate that our M3C model significantly improves the generalization of AT robustness against agnostic attacks. The finding of the 3C prior reveals the potential for defending against a wide range of attacks, and provides a new perspective to overcome this challenge in the field.

Abstract:
Variational autoencoders (VAEs) have been widely used for node clustering, with existing methods mainly focusing on enhancing the expressiveness of their latent space. Recently, the integration of diffusion models with VAEs has provided new opportunities to achieve this objective. However, the mechanism by which the diffusion model improves performance remains unclear. To bridge this gap, we conduct an empirical analysis from the perspective of graph spectral theory, revealing that the signal modulation induced by diffusion models closely aligns with the low-frequency spectral characteristics of VAEs, which in turn explains their effectiveness. Nevertheless, further experiments highlight that diffusion models exhibit limitations in modulating high-frequency signals, which diverge from the spectral characteristics of VAEs. Moreover, existing diffusion methods fail to enable the latent space to adequately capture and reflect cluster-specific characteristics. To address these challenges, we propose a novel plug-and-play method, FVD, to improve the performance of VAE-based methods in node clustering tasks. Specifically, we incorporate the graph wavelet transform as a secondary signal modulator, enabling independent adjustments of specific frequency bands to better align with the spectral characteristics of VAEs. Additionally, we introduce the Student’s t-distribution as a conditional constraint in the reverse process of FVD, deriving a more compact variational lower bound. This enhancement preserves fine-grained node information while focusing on clustering details, effectively mitigating the cluster collapse phenomenon. Comprehensive experimental results demonstrate that integrating FVD with existing methods achieves competitive performance improvements in most cases.

Abstract:
We propose a deep mixture of multimodal hierarchical variational auto-encoders called MMHVAE that synthesizes missing images from observed images in different modalities. MMHVAE’s design focuses on tackling four challenges: (i) creating a complex latent representation of multimodal data to generate high-resolution images; (ii) encouraging the variational distributions to estimate the missing information needed for cross-modal image synthesis; (iii) learning to fuse multimodal information in the context of missing data; (iv) leveraging dataset-level information to handle incomplete data sets at training time. Extensive experiments are performed on the challenging problem of pre-operative brain multi-parametric magnetic resonance and intra-operative ultrasound imaging.

Abstract:
Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theoretical exploration of the stability and generalization of deep GCNs remains limited. In this paper, we bridge this gap by delving into the stability and generalization properties of deep GCNs, aiming to provide valuable insights by characterizing rigorously the associated upper bounds. Our theoretical results reveal that the stability and generalization of deep GCNs are influenced by certain key factors, such as the maximum absolute eigenvalue of the graph filter operators and the depth of the network. Our theoretical studies contribute to a deeper understanding of the stability and generalization properties of deep GCNs, potentially paving the way for developing more reliable and well-performing models.
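
One of the key factors named above, the maximum absolute eigenvalue of a graph filter operator, is easy to compute for a small graph. The sketch below is illustrative only: it assumes the common symmetric-normalized adjacency with self-loops as the filter (the abstract does not fix a specific operator), and it uses power iteration, which presumes that the dominant eigenvalue is unique in magnitude. Function names are hypothetical.

```python
import math


def normalized_filter(adjacency):
    """Assumed filter: D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def max_abs_eigenvalue(matrix, iterations=200):
    """Power iteration on a symmetric matrix; returns |dominant eigenvalue|."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    # Rayleigh quotient of the (unit-norm) converged vector.
    return abs(sum(vector[i] * product[i] for i in range(n)))


if __name__ == "__main__":
    path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(max_abs_eigenvalue(normalized_filter(path_graph)))
```

For this particular normalized filter the value is 1 on any graph, so under the stated assumption the depth of the network is the factor that varies; an unnormalized filter would give a graph-dependent value.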

Abstract:
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we observe diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and large models, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation.

Abstract:
Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions and faces the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.

Abstract:
Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between the subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask, and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++’s superior performance, which improves controllability by 7.48%, similarity by 3.04%, and quality by 76.43% simultaneously. Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multi-subject customization.

Abstract:
This paper introduces SparseTSF, a novel and extremely lightweight method for Long-term Time Series Forecasting (LTSF), designed to address the challenges of modeling complex temporal dependencies over extended horizons with minimal computational resources. At the heart of SparseTSF lies the Cross-Period Sparse Forecasting technique, which simplifies the forecasting task by downsampling the original sequences to focus on cross-period trend prediction. This technique not only significantly reduces model complexity and the number of parameters but also serves as an implicit regularization mechanism that enhances the model’s robustness, achieving an optimal balance between performance and efficiency. Based on this technique, SparseTSF uses fewer than 1,000 parameters to achieve competitive performance compared to state-of-the-art methods, with evident advantages under longer look-back windows (e.g., 720) that allow the model to better exploit inherent periodicity and trend information. Furthermore, SparseTSF showcases remarkable generalization capabilities, making it well-suited for scenarios with limited computational resources, small samples, or low-quality data.
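
The cross-period idea can be sketched in a few lines: split the look-back window into one subsequence per phase of the period, apply a single shared linear map to each subsequence, and interleave the results. This is a minimal illustration under stated assumptions, not the authors' model: the function names are hypothetical, the period of 24 in the parameter count is an assumed example value, and anything else the full model may contain is ignored.

```python
def sparse_forecast(history, period, weights):
    """Forecast by applying one shared linear map to each downsampled subsequence.

    history: look-back window whose length is a multiple of `period`.
    weights: m rows by n columns, where n = len(history) // period and
             m * period is the forecast horizon.
    """
    n = len(history) // period
    m = len(weights)
    forecast = [0.0] * (m * period)
    for phase in range(period):
        subsequence = history[phase::period]  # one value per past period
        for k in range(m):
            forecast[k * period + phase] = sum(
                weights[k][j] * subsequence[j] for j in range(n))
    return forecast


def parameter_count(lookback, horizon, period):
    """Size of the shared linear map in this sketch."""
    return (lookback // period) * (horizon // period)


if __name__ == "__main__":
    series = [0, 1, 2, 3] * 3                 # three periods of length 4
    copy_last = [[0, 0, 1], [0, 0, 1]]        # each output repeats the last period
    print(sparse_forecast(series, 4, copy_last))
    print(parameter_count(720, 720, 24))
```

Because the same small matrix is reused for every phase, the number of weights depends on the number of periods rather than on the raw sequence length, which is consistent with the sub-1,000 parameter figure quoted in the abstract for a look-back of 720 under the assumed period.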

Abstract:
Currently, an increasing number of researchers are focusing on partial multiview incomplete multilabel learning. However, many methods generally integrate features from multiple views via an average weighting strategy, which overlooks the potential mismatch between the contribution of each view and their assigned fusion weights and thus generates unreliable fused features. To address this issue, we propose a novel uncertainty-driven reliable dynamic fusion framework for partial multiview incomplete multilabel learning. Unlike existing methods, the proposed uncertainty-driven reliable sample-level dynamic fusion module operates on the principle that samples exhibiting greater uncertainty possess fewer reliable features. This module evaluates the uncertainty of each sample and, in turn, estimates the reliability of features with the uncertainty of sample judgement, thereby obtaining reliable weights to guide the information fusion of multiple views. Furthermore, many existing approaches for handling incomplete multilabel scenarios typically concentrate on the information from annotated labels, neglecting the potential information of unknown tags. To bridge this gap, we incorporate an innovative pseudolabelling strategy that effectively identifies trustworthy pseudolabels that correspond to those unannotated uncertain labels, thereby adding additional supervisory information to assist model training. Moreover, we also devise a feature masking strategy to further augment the encoder’s representation learning capabilities. The experimental results across five datasets demonstrate that our method outperforms current state-of-the-art methods.

Abstract:
The troublesome model size and quadratic computational complexity associated with token quantity pose significant deployment challenges for Vision Transformers (ViTs) in practical applications. Although recent advances in model pruning and token reduction speed up ViT inference, these approaches either adopt a fixed sparsity ratio or overlook the meaningful interplay between architectural optimization and token selection. Consequently, such static, single-dimension compression often leads to pronounced accuracy degradation under aggressive compression rates, as it fails to fully explore redundancies across these two orthogonal dimensions. Therefore, we introduce PRANCE, a framework that jointly optimizes activated channels and tokens on a per-sample basis, aiming to accelerate ViTs’ inference process from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. First, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the model structure and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around 10^14, making supervised learning infeasible. To this end, we design a lightweight selector employing the Proximal Policy Optimization (PPO) algorithm for efficient decision-making. Furthermore, we introduce a novel “Result-to-Go” training mechanism that models ViTs’ inference process as a Markov decision process, significantly reducing the action space and mitigating delayed-reward issues during training. Additionally, our framework simultaneously supports different kinds of token optimization methods such as pruning, merging, and sequential pruning-merging strategies. Extensive experiments demonstrate the effectiveness of PRANCE in reducing FLOPs by approximately 50%, retaining only about 10% of tokens while achieving lossless Top-1 accuracy.

Abstract:
The absence of ground truth (GT) in most fusion tasks poses significant challenges for model optimization, evaluation, and generalization. Existing fusion methods achieving complementary context aggregation predominantly rely on hand-crafted fusion rules and sophisticated loss functions, which introduce subjectivity and often fail to adapt to complex real-world scenarios. To address this challenge, we propose Mask-DiFuser, a novel fusion paradigm that ingeniously transforms the unsupervised image fusion task into a dual masked image reconstruction task by incorporating masked image modeling with a diffusion model, overcoming various issues arising from the absence of GT. In particular, we devise a dual masking scheme to simulate complementary information and employ a diffusion model to restore source images from two masked inputs, thereby aggregating complementary contexts. A content encoder with an attention parallel feature mixer is deployed to extract and integrate complementary features, offering local content guidance. Moreover, a semantic encoder is developed to supply global context which is integrated into the diffusion model via a cross-attention mechanism. During inference, Mask-DiFuser begins with a Gaussian distribution and iteratively denoises it conditioned on multi-source images to directly generate fused images. The masked diffusion model, learning priors from high-quality natural images, ensures that fusion results align more closely with human visual perception. Extensive experiments on several fusion tasks, including infrared-visible, medical, multi-exposure, and multi-focus image fusion, demonstrate that Mask-DiFuser significantly outshines SOTA fusion alternatives.

Abstract:
We propose UAD, an end-to-end framework with Unsupervised pretext task for vision-based Autonomous Driving, achieving the best open-loop evaluation performance on nuScenes while showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current end-to-end autonomous driving (E2EAD) models still mimic the modular architecture in typical driving stacks, with carefully designed supervised perception and prediction subtasks to provide environment information for oriented planning. Although achieving groundbreaking progress, such a design has certain drawbacks: 1) preceding subtasks require massive high-quality 3D annotations as supervision, posing a significant impediment to scaling the training data; and 2) each submodule entails substantial computation overhead in both training and inference. To this end, we propose UAD, an E2EAD framework with an unsupervised proxy to address all these issues. Firstly, we design a novel Angular Perception Pretext to eliminate the annotation requirement. The pretext perceives the driving scene by predicting the angular-wise spatial objectness and temporal dynamics, without manual annotation. Secondly, a self-supervised training strategy, which learns the consistency of the predicted trajectories under different augmented views, is proposed to enhance the planning robustness in steering scenarios. Our UAD achieves a 38.7% relative improvement over UniAD on the average collision rate of nuScenes open-loop evaluation and obtains a route completion score of 98.5% in closed-loop evaluation on CARLA’s Town05 Long benchmark, which outperforms the recent work VADv2. Moreover, the proposed method consumes only 44.3% of the training resources of UniAD and runs 3.4× faster in inference when employing the same backbone network. Our innovative design not only demonstrates, for the first time, unarguable performance advantages over supervised counterparts, but also enjoys unprecedented efficiency in data, training, and inference.

Abstract:
Multispectral filter array (MSFA) cameras are increasingly used due to their compact size and fast capturing speed. However, because of their narrow-band property, they often suffer from the light-deficient problem, and the captured images are easily overwhelmed by noise. As a commonly used type of denoising method, neural networks have shown their power to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is capable of matching the real noise distribution and synthesizing realistic noisy images. In our noise model, the different types of noise are divided into a SimpleDist component and a ComplexDist component. The former contains all the types of noise that can be described using a simple probability distribution such as a Gaussian or Poisson distribution, and the latter contains the complicated color bias noise that cannot be modeled using a simple probability distribution. Besides, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, according to the non-uniformity of color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to indicate the position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments show that the network trained using synthetic data generated by our noise model performs as well as one trained using paired real data, and that our noise-decoupled network outperforms other state-of-the-art denoising methods.

Abstract:
In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in the form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space jointly aware of the hand’s local fine details and global articulated context, to facilitate predicting their 3D offsets toward hand joints with linear weighted aggregation for joint localization. Our intuition is that local fine details help to estimate accurate offsets but may suffer from issues including serious occlusion, confusing similar patterns, and overfitting risk. On the other hand, the hand’s global articulated context can provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a two-stage manner. In the first stage, owing to the property of the input modality, anchor points are distributed more densely on the X-Y plane, which leads to lower prediction accuracy along the Z direction than along the X and Y directions. To alleviate this, in the second stage anchor points are set evenly along the X, Y, and Z directions near the joints yielded by the first stage. This treatment brings two main advantages: (1) balancing the prediction accuracy along the X, Y, and Z directions, and (2) ensuring that the anchor-joint offsets are small values that are relatively easy to estimate. Wide-range experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2 and RHP) and three depth hand datasets (NYU, ICVL and HANDS 2017) verify A2J-Transformer+’s superiority and generalization ability for different modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based methods. The test on the ITOP dataset reveals that A2J-Transformer+ can also be applied to the 3D human pose estimation task.
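
The linear weighted aggregation of anchor votes mentioned above can be sketched for a single joint. Each anchor votes for the joint position as its own location plus a predicted 3D offset, and the votes are averaged with normalized weights. This is a hedged illustration: the function name is hypothetical, and normalizing the per-anchor responses with a softmax is an assumption rather than something the abstract states.

```python
import math


def aggregate_joint(anchors, offsets, responses):
    """Estimate one joint as a weighted sum of anchor votes.

    anchors:   list of (x, y, z) anchor positions.
    offsets:   list of (dx, dy, dz) predicted offsets, one per anchor.
    responses: list of raw per-anchor scores, turned into weights by a softmax
               (assumed normalization).
    """
    peak = max(responses)
    exponentials = [math.exp(r - peak) for r in responses]  # numerically stable
    total = sum(exponentials)
    weights = [e / total for e in exponentials]
    return tuple(
        sum(w * (anchor[d] + offset[d])
            for w, anchor, offset in zip(weights, anchors, offsets))
        for d in range(3)
    )


if __name__ == "__main__":
    anchors = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    offsets = [(1.0, 1.0, 1.0), (-1.0, 1.0, 1.0)]
    # Both anchors vote for the same point, so the weights do not matter here.
    print(aggregate_joint(anchors, offsets, [0.3, 2.0]))
```

Placing anchors close to the joint, as the second stage does, keeps each offset small, so an error in any single anchor's regression moves the weighted average only slightly.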

Abstract:
In this paper, we propose SS-NeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering with sparse inputs. We reformulate classical spectral rendering as two main steps: 1) the generation of a series of spectrum maps spanning different wavelengths, and 2) the combination of these spectrum maps into the RGB output. The proposed architecture follows these two steps through a multi-layer perceptron (MLP)-based architecture (SpectralMLP) and a spectrum attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images under white-light illumination. Applying NeRF to spectral rendering is more physically based from the perspective of ray tracing. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Previous baselines, such as SpectralNeRF, outperform recent methods in synthesizing novel views but require relatively dense viewpoints for accurate scene reconstruction. To tackle this, we propose SS-NeRF to enhance the detail of scene representation with sparse inputs. In SS-NeRF, we first design the depth-aware continuity to optimize the reconstruction based on single-view depth predictions. Then, the geometric-projected consistency is introduced to optimize the multi-view geometry alignment. Additionally, we introduce a superpixel-aligned consistency to ensure that the average color within each superpixel region remains consistent. Comprehensive experimental results demonstrate that the proposed method is superior to recent state-of-the-art methods when synthesizing new views on both synthetic and real-world datasets.

Abstract:
Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap has multiple causes, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique for acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, our framework utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.

Abstract:
As a prominent research topic, multi-view multi-label classification (MvMlC) aims to assign multiple labels to samples by integrating information from various perspectives. However, in real-world scenarios, MvMlC frequently faces the learning challenge of data with missing views and labels, typically resulting from sensor malfunctions, or the costly and time-consuming process of manual annotation. In addition, learning robust representations that are both consistent across views and specific to individual views remains a challenge. To address these issues, we propose a novel double incomplete multi-view multi-label classification framework based on Disentangling Consistent and Specific Information (DCSI). Specifically, we employ a dual-channel encoder with identical architecture but distinct objectives to extract cross-view consistent information and view-specific unique information from all views, respectively. Meanwhile, a view discriminator is constructed to decouple these two types of information, facilitating the extraction of pure consistent and specific information. Moreover, we meticulously design fusion strategies tailored to each representation type. Regarding consistent representations, we propose a dynamic-confidence-aware fusion mechanism that assesses the reliability of each view’s representations in relation to the classification task, enabling the model to prioritize information from trustworthy representations. For specific representations, in light of their complementary rather than redundant property, we suggest treating such representations from each view equally to ensure fairness. Through experimental validation on five datasets, the results demonstrate that our method outperforms existing state-of-the-art methods.

Abstract:
As automated classification systems become increasingly prevalent, concerns have emerged over their potential to reinforce and amplify existing societal biases. In light of this issue, many methods have been proposed to enhance the fairness guarantees of classifiers. Most of the existing interventions assume access to group information for all instances, a requirement rarely met in practice. Fairness without access to demographic information has often been approached through robust optimization techniques, which target worst-case outcomes over a set of plausible distributions known as the uncertainty set. However, their effectiveness is strongly influenced by the chosen uncertainty set. In fact, existing approaches often overemphasize outliers or overly pessimistic scenarios, compromising both overall performance and fairness. To overcome these limitations, we introduce SPECTRE, a minimax-fair method that adjusts the spectrum of a simple Fourier feature mapping and constrains the extent to which the worst-case distribution can deviate from the empirical distribution. We perform extensive experiments on the American Community Survey datasets involving 20 states. SPECTRE proves to be a safe choice: it provides the highest average fairness guarantees together with the smallest interquartile range in comparison to state-of-the-art approaches, including those with access to demographic group information. In addition, we provide a theoretical analysis that derives computable bounds on the worst-case error for both individual groups and the overall population, and characterizes the worst-case distributions responsible for these extremal performances.

Abstract:
Multi-Layer Perceptron (MLP) models are the foundation of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength and limit the application of these models. In this article, we develop a two-stage abstraction and refinement (ABS-REF) view for modular feature extraction in point cloud processing. This view elucidates that whereas the early models focused on ABS stages, the more recent techniques devise sophisticated REF stages to attain performance advantages. Then, we propose a High-dimensional Positional Encoding (HPE) module to explicitly utilize intrinsic positional information, extending the “positional encoding” concept from Transformer literature. HPE can be readily deployed in MLP-based architectures and is compatible with transformer-based methods. Within our ABS-REF view, we rethink local aggregation in MLP-based methods and propose replacing time-consuming local MLP operations, which are used to capture local relationships among neighbors. Instead, we use non-local MLPs for efficient non-local information updates, combined with the proposed HPE for effective local information representation. We leverage our modules to develop HPENets, a suite of MLP networks that follow the ABS-REF paradigm, incorporating a scalable HPE-based REF stage. Extensive experiments on seven public datasets across four different tasks show that HPENets deliver a strong balance between efficiency and effectiveness. Notably, HPENet surpasses PointNeXt, a strong MLP-based counterpart, by 1.1% mAcc, 4.0% mIoU, 1.8% mIoU and 0.2% Cls. mIoU, with only 50.0%, 21.5%, 23.1%, 44.4% of FLOPs on ScanObjectNN, S3DIS, ScanNet, and ShapeNetPart, respectively.

Abstract:
Graph neural networks (GNNs) have emerged as a powerful framework for a wide range of node-level graph learning tasks. However, their performance typically depends on random or minimally informed initial feature representations, where poor initialization can lead to slower convergence and increased training instability. In this paper, we address this limitation by leveraging a statistically grounded one-hot graph encoder embedding (GEE) as a high-quality, structure-aware initialization for node features. Integrating GEE into standard GNNs yields the GEE-powered GNN (GG) framework. Across extensive simulations and real-world benchmarks, GG provides consistent and substantial performance gains in both unsupervised and supervised settings. For node classification, we further introduce GG-C, which concatenates the outputs of GG and GEE and outperforms competing methods, achieving roughly 10–50% accuracy improvements across most datasets. These results demonstrate the importance of principled, structure-aware initialization for improving the efficiency, stability, and overall performance of graph neural network architecture, enabling models to better exploit graph topology from the outset.

Abstract:
Charts are common in literature across various scientific fields, conveying rich information easily accessible to readers. Current chart-related tasks focus on either chart perception, which extracts information from visual charts, or chart reasoning over the extracted data, e.g., in tabular form. In this paper, we introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach to chart perception and reasoning tasks, which is generally applicable to different downstream tasks, beyond the question-answering task specifically studied in peer works. Specifically, StructChart first reformulates the chart data from the tabular form (linearized CSV) to STR, which helps reduce the task gap between chart perception and reasoning. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate the chart perception task performance. To augment the training, we further explore the potential of Large Language Models (LLMs) to enhance the diversity in both chart visual style and statistical information. Extensive experiments on various chart-related tasks demonstrate the effectiveness and potential of a unified chart perception-reasoning paradigm to push the frontier of chart understanding.

Abstract:
One-shot Federated Learning (OFL) has emerged as a promising paradigm, enabling global model training with minimal communication overhead. In OFL, the server model is usually distilled from an ensemble of pre-trained client models, while the ensemble also facilitates synthetic data generation for the knowledge distillation process. Prior works show that the performance of the final model is fundamentally tied to both the quality of the synthetic data and the ensemble. However, existing methods often optimize these two components separately, overlooking their interaction. To address this coupled optimization problem and provide a unified solution to the dual challenges of data and model heterogeneity inherent in OFL, we introduce Co-Boosting++, a novel OFL framework where synthetic data generation and ensemble construction mutually enhance each other in an iterative fashion. First, we fix the ensemble and generate hard samples in an adversarial manner. These samples are crucial for enhancing the robustness of knowledge transfer, as they challenge the model to generalize better, thereby improving the quality of the synthetic data and the subsequent distillation process. Second, leveraging these hard samples, we enhance the ensemble via a Mixture of Experts (MoE) mechanism. MoE allows dynamic adjustment of ensemble weights based on the generated hard samples, which enables the ensemble to better capture diverse and heterogeneous knowledge from client models. Furthermore, we extend Co-Boosting++ to support the simultaneous generation of multiple heterogeneous target models, enabling efficient adaptation to diverse device constraints. Extensive experiments on benchmark datasets demonstrate that Co-Boosting++ consistently outperforms state-of-the-art methods due to its coupled optimization of data and ensemble quality. Additionally, Co-Boosting++ is highly practical in real-world model market scenarios, requiring no local training modifications, additional transmissions, or restrictions on client model architectures.

Abstract:
In recent years, affine correspondences (ACs) have emerged as a widely adopted alternative to point correspondences (PCs) in geometric problems in computer vision. An AC is composed of a PC across two different views plus an affine transformation between the small patches around this PC. Prior studies have shown that a single AC generally yields three independent constraints for estimating relative pose. This work addresses relative pose estimation in multi-perspective camera systems, a relevant problem given their prevalence in modern technologies such as autonomous vehicles and augmented reality. More specifically, we introduce the first comprehensive suite of minimal solvers for 6DoF relative pose estimation across multiple cameras using only two ACs, which is notably valuable for robust model fitting scenarios. We analyze all possible configurations of two ACs in two views, and present minimal solvers covering all identified minimal cases. We make use of the hidden variable technique to eliminate the translation parameters, and represent rotation using either Cayley parameters or quaternions. We furthermore introduce novel constraints on the generalized relative pose problem that are beneficial in deriving more compact solvers with fewer solutions. Comprehensive experiments on synthetic and real-world data show that the proposed AC-based solvers are highly effective and computationally efficient.

Abstract:
Light field (LF) cameras capture the light rays of a 3D scene from multiple views simultaneously, and thus provide a more immersive experience of the real world as compared to traditional cameras. Although significant progress has been made in various LF image processing tasks, it remains challenging to effectively model the non-local spatial-angular correlations inherent in LF images, particularly when dealing with complex disparity variations. In this paper, we focus on orthogonal epipolar geometry of LF images and propose a generic Epipolar Transformer mechanism that incorporates geometrically meaningful correlations along the epipolar lines. Our Epipolar Transformer mechanism enjoys the following benefits: learning effective and diverse LF feature representations, delivering satisfactory results without redundant architectural designs, and enabling flexible extension to various LF-related tasks with simple adaptations. For LF spatial and angular super-resolution, our methods not only achieve state-of-the-art performance on benchmark datasets, but also demonstrate superior and robust performance on large disparity variations. For disparity estimation, we explore the use of geometry information encoded in our Epipolar Transformer to directly regress the disparity results, effectively avoiding the limitation of a fixed maximum disparity.

Abstract:
Data augmentation is crucial for addressing insufficient training data, especially for augmenting positive samples. However, existing methods mostly rely on neural network-based feedback for data augmentation and often overlook the optimization of feature distribution. In this study, we present a practical, distribution-preserving data augmentation pipeline that augments positive samples by optimizing a feature indicator (e.g., two-dimensional entropy), aiming to maintain alignment with the original data distribution. Inspired by the manifold hypothesis, we propose a Manifold Heuristic Optimization Algorithm (MHOA), which augments positive samples by exploring the low-dimensional Euclidean space around object contour pixels instead of the entire decision space. Guided by a “distribution-preservation-first” perspective, our approach explicitly optimizes fidelity to the original data manifold and only retains augmented samples whose feature statistics (e.g., mean, variance) align with the source class. It significantly improves image classification accuracy across neural networks, outperforming state-of-the-art data augmentation methods—especially when the dataset’s feature indicator follows a Gaussian distribution. The algorithm’s search space, focused on neighborhoods of key feature pixels, is the core driver of its superior performance.

Abstract:
Graph invariant learning (GIL) seeks invariant relations between graphs and labels under distribution shifts. Recent works try to extract an invariant subgraph to improve out-of-distribution (OOD) generalization, yet existing approaches either lack explicit control over compactness or rely on hard top-$k$ selection that shrinks the solution space and is only partially differentiable. In this paper, we provide an in-depth analysis of the drawbacks of some existing works and propose a few general principles for invariant subgraph extraction: 1) separability, as encouraged by our sparsity-driven mechanism, to filter out the irrelevant common features; 2) softness, for a broader solution space; and 3) differentiability, for a sound end-to-end optimization pipeline. Specifically, building on optimal transport, we propose Graph Sinkhorn Attention (GSINA), a fully differentiable, cardinality-constrained attention mechanism that assigns sparse-yet-soft edge weights via Sinkhorn iterations and induces node attention. GSINA provides explicit controls for separability and softness, and uses a Gumbel reparameterization to stabilize training. Its convergence behavior is also theoretically studied. Extensive experimental results on both synthetic and real-world datasets validate its superiority.
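
To make the idea of cardinality-constrained, sparse-yet-soft weights via Sinkhorn iterations more concrete, here is a minimal sketch of a generic entropic optimal-transport "soft top-k". It is not the GSINA implementation: the two-bin formulation (a "selected" bin of mass k and a "rest" bin of mass n - k), the cost definition, the temperature `tau`, and the function name `sinkhorn_soft_topk` are all assumptions made for illustration, and the Gumbel reparameterization mentioned in the abstract is omitted.

```python
import math

def sinkhorn_soft_topk(scores, k, tau=0.1, n_iters=200):
    """Soft top-k weights from entropic optimal transport (illustrative sketch).

    Each of the n items carries unit mass that is split between a "selected"
    bin with total mass k and a "rest" bin with total mass n - k.
    Returns the mass each item sends to the "selected" bin.
    """
    n = len(scores)
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    norm = [(s - lo) / span for s in scores]          # rescale scores to [0, 1]

    # Gibbs kernel: high scores are cheap to select, low scores cheap to leave.
    K = [[math.exp(-(1.0 - x) / tau), math.exp(-x / tau)] for x in norm]
    col_mass = [float(k), float(n - k)]

    v = [1.0, 1.0]
    for _ in range(n_iters):
        # Row scaling: each item should carry unit mass.
        u = [1.0 / (K[i][0] * v[0] + K[i][1] * v[1]) for i in range(n)]
        # Column scaling: the bins should receive k and n - k mass.
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(2)]

    return [u[i] * K[i][0] * v[0] for i in range(n)]

edge_scores = [0.1, 0.9, 0.5, 0.8]
weights = sinkhorn_soft_topk(edge_scores, k=2)
```

The returned weights sum to k because the last step of the loop normalizes the columns, and they preserve the ordering of the input scores. Lowering `tau` pushes the weights toward a hard top-k selection, while raising it makes them softer; every operation is differentiable, which is the property the abstract emphasizes.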

Abstract:
Referring remote sensing interpretation holds significant application value in various scenarios such as ecological protection, resource exploration, and emergency management. However, referring remote sensing expression comprehension and segmentation (RRSECS) faces critical challenges, including micro-target localization drift caused by insufficient extraction of boundary features in existing paradigms. Moreover, when transferred to remote sensing domains, polygon-based methods encounter issues such as contour-boundary misalignment and multi-task co-optimization conflicts. In this paper, we propose SeeFormer, a novel contour autoregressive paradigm specifically designed for RRSECS, which accurately locates and segments micro, irregular targets in remote sensing imagery. We first introduce a brain-inspired feature refocus learning (BIFRL) module that progressively attends to effective object features via a coarse-to-fine scheme, significantly boosting small-object localization and segmentation. Next, we present a language-contour enhancer (LCE) that injects shape-aware contour priors, and a corner-based contour sampler (CBCS) to improve mask–polygon reconstruction fidelity. Finally, we develop an autoregressive dual-decoder paradigm (ARDDP) that preserves sequence consistency while alleviating multi-task optimization conflicts. Extensive experiments on the RefDIOR, RRSISD, and OPTRSVG datasets under varying scenarios, scales, and task paradigms demonstrate substantial performance gains: compared to the baseline PolyFormer, our proposed SeeFormer improves oIoU and mIoU by 27.58% and 39.37% for referring image segmentation and by 18.94% and 28.90% for visual grounding on the RefDIOR dataset.

Abstract:
Large Language Models (LLMs) exhibit remarkable proficiency in understanding and managing text-based tasks. Many works try to transfer these capabilities to the video domain, producing models referred to as Video-LLMs. However, current Video-LLMs can only grasp coarse-grained semantics and are unable to efficiently handle tasks involving the comprehension or localization of specific video segments. To address these challenges, we propose Momentor, a Video-LLM designed to perform fine-grained temporal understanding tasks. To facilitate the training of Momentor, we develop an automatic data generation engine to build Moment-10M, a large-scale video instruction dataset with segment-level instruction data. Building upon the previously published Momentor and the Moment-10M dataset, we further extend this work by introducing a Spatio-Temporal Token Consolidation (STTC) method, which merges redundant visual tokens spatio-temporally in a parameter-free manner, thereby significantly improving computational efficiency while preserving fine-grained visual details. We integrate STTC with Momentor to develop Momentor++ and validate its performance on various benchmarks. Momentor demonstrates robust capabilities in fine-grained temporal understanding and localization. Further, Momentor++ excels in efficiently processing and analyzing extended videos with complex events, showcasing marked advancements in handling extensive temporal contexts.

Abstract:
Wasserstein distances provide a powerful framework for comparing data distributions. They can be used to analyze processes over time or to detect inhomogeneities within data. However, simply calculating the Wasserstein distance or analyzing the corresponding transport plan (or coupling) may not be sufficient for understanding what factors contribute to a high or low Wasserstein distance. In this work, we propose a novel solution based on Explainable AI that allows us to efficiently and accurately attribute Wasserstein distances to various data components, including data subgroups, input features, or interpretable subspaces. Our method achieves high accuracy across diverse datasets and Wasserstein distance specifications, and its practical utility is demonstrated in three use cases.

Abstract:
Spike cameras generate binary spikes in response to light intensity changes, enabling high-speed visual perception with unprecedented temporal resolution. However, the unique characteristics of spike stream present significant challenges for reconstructing dense 3D scene representations, particularly in dynamic environments and under non-ideal lighting conditions. In this paper, we introduce DSNeRF, the first method to derive a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF’s multi-view consistency to establish robust self-supervision, effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. We propose a novel mapping from pixel rays to the spike domain, integrating the spike generation process directly into NeRF training. Specifically, DSNeRF introduces an integrate-and-fire neuron layer that models non-idealities to capture intrinsic camera noise, including both random and fixed-pattern spike noise, thereby enhancing scene fidelity. Additionally, we propose a motion-guided spiking neuron layer and a long-term rendering photometric loss to better align dynamic spike streams, ensuring accurate scene geometry. Our method optimizes neural radiance fields to render photorealistic novel views from continuous spike streams, demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations on both real and simulated sequences validate the effectiveness of our approach.

Abstract:
As speech translation (ST) systems become increasingly prevalent, understanding their vulnerabilities is crucial for ensuring robust and reliable communication. However, limited work has explored this issue in depth. This paper explores methods of compromising these systems through imperceptible audio manipulations. Specifically, we present two approaches: (1) adapting perturbation-based techniques used for automatic speech recognition (ASR) attacks to the ST context, making our work the first to apply this approach to ST, and (2) proposing a novel music generation-based method to guide targeted translation, while also conducting more practical over-the-air attacks in the physical world. Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly, exploiting the natural imperceptibility of music. These attacks have proven effective across multiple languages and translation models, highlighting a systemic vulnerability in current ST architectures. Beyond immediate security concerns, our findings highlight broader challenges in the robustness and interpretability of neural speech systems.

Abstract:
Adversarially robust knowledge distillation aims to compress a large-scale robust teacher model into a lightweight student counterpart while preserving adversarial robustness and natural performance. Previous methods primarily focused on aligning knowledge (e.g., predictions) between teacher and student models to transfer robustness. However, potentially incorrect predictions from the teacher can misguide the student, negatively impacting robustness transfer. To circumvent this, we propose a novel adversarially robust knowledge distillation scheme that promotes alignment towards more benign predictions rather than incorrect ones by refining inputs into so-called “inverse adversarial examples” via simply reversing the sign of adversarial perturbation. Through a comprehensive investigation of the properties of inverse adversaries, we provide new theoretical insights showing how mimicking the behavior of the teacher model on inverse adversaries facilitates reliable robustness transfer built upon the implicit connection between robustness and the input gradient information. We thus design a gradient matching mechanism between teacher and student models utilizing inverse adversaries to facilitate robust knowledge alignment. Furthermore, inspired by our analysis of the correlation between robustness and adversarial transferability, we propose a weight-space disruption strategy that jointly interacts with both teacher and student models to find a shared direction for better robustness transfer. Empirical evaluations across various datasets demonstrate that our method achieves state-of-the-art robustness and natural performance. Notably, on ImageNet, our approach outperforms prior methods by approximately 3.8% in both clean and robust accuracy. Moreover, we show that incorporating auxiliary generated data into distillation further boosts robustness. Our method can also be generalized to multimodal architectures.
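
The notion of an "inverse adversarial example", obtained by reversing the sign of the adversarial perturbation, can be illustrated on a toy model. The sketch below assumes a single-step, sign-of-gradient perturbation on a two-feature logistic classifier; the classifier, the step size `eps`, and all function names are hypothetical stand-ins, since the abstract does not specify how the perturbation is computed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_grad(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def perturb(x, grad, eps, direction):
    """direction=+1: adversarial step (raises the loss);
    direction=-1: inverse adversarial step (sign reversed, lowers the loss)."""
    return [xi + direction * eps * sign(g) for xi, g in zip(x, grad)]

# Toy model and example (illustrative values only).
w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.5], 1
g = input_grad(w, b, x, y)

x_adv = perturb(x, g, eps=0.1, direction=+1)
x_inv = perturb(x, g, eps=0.1, direction=-1)

loss_clean = loss(w, b, x, y)
loss_adv = loss(w, b, x_adv, y)
loss_inv = loss(w, b, x_inv, y)
```

For this linear model the ordering `loss_inv < loss_clean < loss_adv` holds exactly, which matches the intuition in the abstract that inverse adversaries steer predictions toward more benign outputs.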

Affiliations: Institute of Big Data Science and Industry and the School of Artificial Intelligence, Shanxi University, Taiyuan, China; School of Information Science and Technology, Northwest University, Xi’an, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China; State Key Laboratory of Electromechanical Integrated Manufacturing of High-Performance Electronic Equipment, Xidian University, Xi’an, China

Abstract:
Bipartite graph-based co-clustering is efficient in modeling cluster manifold structures. However, existing methods decouple bipartite graph construction from the learning of pseudo-labels for samples and anchors, often leading to suboptimal clustering performance. Moreover, neglecting local manifold relationships among anchors yields inferior anchor pseudo-labels, which further degrades the quality of sample pseudo-labels. To overcome these limitations, we propose a novel model termed Fast Co-Clustering (FC$^2$), which jointly captures both local and global correlations between samples and anchors. Specifically, to model the coupling between the one-hot pseudo-labels of samples and anchors, we construct a bipartite graph with adaptively updated weights during the clustering process. To prevent severely imbalanced cluster assignments, we prove the equivalence between maximizing pseudo-label covariance and balancing cluster proportions, and incorporate a balanced regularization term to enhance the rationality of the resulting clusters. Furthermore, the local smoothness of anchor pseudo-labels is preserved via a low-rank decomposition of a compact anchor similarity graph. These two components jointly ensure that spatially adjacent anchors tend to share similar cluster identities, and that samples and anchors in close proximity are also assigned to similar clusters. We develop an efficient iterative optimization algorithm to update all model variables. Extensive experiments on benchmark and synthetic datasets validate the superior performance and efficiency of the proposed method compared with state-of-the-art approaches.

Abstract:
Markov chains are simple yet powerful mathematical structures to model temporally dependent processes. They generally assume stationary data, i.e., fixed transition probabilities between observations/states. However, live, real-world processes, such as those in activity tracking, biological time series, or industrial monitoring, often switch behavior over time. Such behavior switches can be modeled as transitions between higher-level modes (e.g., running, walking, etc.). Yet the modes are usually not all known in advance, often exhibit vastly differing transition probabilities, and can switch unpredictably. Thus, to track behavior changes of live, real-world processes, this study proposes an online and efficient method to construct Evolving Markov chains (EMCs). EMCs adaptively track transition probabilities, automatically discover modes, and detect mode switches in an online manner. In contrast to previous work, EMCs are of arbitrary order, and the proposed update scheme does not rely on tracking windows, only updates the relevant region of the probability tensor, and enjoys geometric convergence of the expected estimates. Our evaluation on synthetic data and real-world applications in human activity recognition, electric motor condition monitoring, and eye-state recognition from electroencephalography (EEG) measurements illustrates the versatility of the approach and points to the potential of EMCs to efficiently track, model, and understand live, real-world processes.
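
As a rough illustration of window-free online tracking of transition probabilities, the sketch below applies an exponential-forgetting update to a first-order transition matrix, touching only the row of the state that was just left. The update rule, the learning rate `lr`, and the function names are assumptions chosen for illustration; the abstract does not spell out the EMC update scheme, and mode discovery, switch detection, and higher orders are not covered here.

```python
def make_uniform_chain(n_states):
    """Transition matrix with uniform rows as an uninformed starting point."""
    return [[1.0 / n_states] * n_states for _ in range(n_states)]

def online_update(P, state, next_state, lr=0.5):
    """Move row `state` of P a fraction `lr` toward the one-hot observation.

    Only the row of the observed source state changes, and each row keeps
    summing to 1, so no window of past observations has to be stored.
    """
    row = P[state]
    for j in range(len(row)):
        target = 1.0 if j == next_state else 0.0
        row[j] += lr * (target - row[j])
    return P

# Observe the transition 0 -> 1 three times in a row.
P = make_uniform_chain(2)
for _ in range(3):
    online_update(P, state=0, next_state=1, lr=0.5)
```

With this rule the gap between an estimate and a fixed true probability shrinks by a factor of `1 - lr` in expectation at every update of that row, which is one simple way to obtain the kind of geometric convergence the abstract refers to.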

Abstract:
Domain Generalization (DG) seeks to develop models that perform well on unseen target domains by learning domain-invariant representations. Recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have shown strong potential for enhancing DG through prompt tuning. However, existing VFM-based prompt tuning methods often focus on task-specific adaptation rather than disentangling domain-invariant features, leaving cross-domain generalization insufficiently explored. In this paper, we address this challenge by fully leveraging the controllable and flexible language prompt in VFMs. Observing that the text modality is inherently rich in semantics and easier to disentangle, we propose a novel framework termed Prompt Disentanglement via Language Guidance and Representation Alignment (PADG). PADG first employs a large language model (LLM) to disentangle textual prompts into domain-invariant and domain-specific components, which then guide the learning of domain-invariant visual representations. To complement the limitations of text-only guidance, we further introduce the Worst Explicit Representation Alignment (WERA) module, which enhances visual invariance by simulating bounded domain shifts through learnable stylization prompts and aligning representations between original and perturbed samples. Extensive experiments on mainstream DG benchmarks, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that PADG consistently outperforms existing state-of-the-art methods, validating its effectiveness in robust domain-invariant representation learning.

Abstract:
Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs’ safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate refusal to produce harmful responses. By adjusting the distribution of safety in LLM outputs through adversarial demonstrations, our proposed in-context attack and defense facilitate effective manipulation of their alignment. We first provide theoretical insights to illustrate how minimal in-context demonstrations can efficiently alter safety alignment. Empirically, we validate ICA and ICD across multiple models, datasets, and attack baselines, showing their efficacy and scalability for red-teaming evaluations and robust safeguards for real-world deployment. Overall, our work unveils the pivotal yet understudied role of ICL in LLM safety, opening new avenues for understanding and improving them.

Abstract:
Top-$k$ feature selection in sparse learning is a fundamental problem in machine learning. It is difficult to solve because of the rigid $\ell_{2,0}$-norm constraint. Existing literature mostly relaxes the constraint and seeks an approximation of the selection matrix, which degrades the original models and misses the genuine solutions. This research tackles the original top-$k$ feature selection model in sparse learning. For generality, we investigate both supervised and semi-supervised models of top-$k$ feature selection in sparse learning. By decomposing the feature selection matrix, we show that the two different objectives can be unified into one general ratio-trace problem, which is a non-convex optimization problem. An accelerated coordinate descent method is proposed to efficiently solve the non-convex objective, through which a locally optimal solution for the top-$k$ feature indices is obtained at a competitive time cost. To verify the proposed algorithm, we design toy experiments that visualize the advantages of the selected features. Meanwhile, experimental results on nine normal datasets and the large-scale ImageNet dataset comprehensively show the superiority of our methods compared to representative and state-of-the-art supervised and semi-supervised algorithms.
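
To make the constraint concrete: the ℓ2,0-norm of a weight matrix counts its nonzero rows, so bounding it by k means at most k features are used. The sketch below computes that count and applies a naive row-norm ranking to keep k rows. The ranking is a simple heuristic chosen for illustration and is not the accelerated coordinate descent method of the paper; `l20_norm`, `top_k_rows`, and `select` are hypothetical names.

```python
import math


def l20_norm(W):
    """Number of rows of W that contain at least one nonzero entry."""
    return sum(1 for row in W if any(x != 0 for x in row))


def top_k_rows(W, k):
    """Indices of the k rows with the largest Euclidean norm (naive heuristic)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in W]
    order = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    return sorted(order[:k])


def select(W, k):
    """Zero out every row except the k selected ones."""
    keep = set(top_k_rows(W, k))
    return [row[:] if i in keep else [0.0] * len(row) for i, row in enumerate(W)]


# Four features (rows), two outputs (columns).
W = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.0], [0.5, -3.0]]
W_k = select(W, 2)
```

After `select(W, 2)`, the matrix `W_k` satisfies the constraint exactly, with an ℓ2,0-norm of 2, rather than approximately as in relaxed formulations.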

Abstract:
We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight (Phongthawee et al. 2024), which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results show that our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios.

Abstract:
Personalized federated learning for multilingual sentiment analysis poses significant challenges arising from linguistic heterogeneity, non-IID data distributions, and strict privacy requirements. This paper proposes FedPerX, a federated transformer framework that integrates residual adapter-based personalization with adaptive multi-granular differential privacy. The architecture leverages a frozen multilingual backbone (XLM-R) while enabling each client to train lightweight, client-specific adapters. Privacy is enforced through dynamic noise injection at both the feature and adapter levels, calibrated using gradient sensitivity. FedPerX is evaluated on two multilingual benchmarks, MARC and TSMD, spanning structured reviews and informal social media content across more than ten languages. Experimental results demonstrate consistent improvements over seven state-of-the-art baselines, with up to +4.3% gains in macro-F1, a 70% reduction in communication overhead, and the lowest variance in client-level performance. Comprehensive analyses, including fairness, personalization gap, privacy-utility trade-off, and ablation studies, validate the framework’s robustness and adaptability. FedPerX advances the design of scalable, personalized, and privacy-preserving models for federated multilingual sentiment analysis.

Abstract:
The use of imaging and genetic data for biomarker detection and disease diagnosis can deepen the understanding of disease pathogenesis and assist in clinical diagnosis. However, current methods face two major challenges: 1) the significant heterogeneity between multimodal data hampers modality fusion and 2) effectively exploring consistency and variability information from similar diseases for enhancing model performance is difficult. In this paper, we propose a novel unified framework, termed dual adaptive disentangled representation learning (DADRL), to simultaneously achieve disease-shared and disease-specific biomarker detection as well as disease diagnosis. Our DADRL comprises three components: 1) a biology information constraints-based modality fusion strategy is applied to adaptively explore inter- and intra-modal correlations, thereby effectively fusing multimodal data; 2) a unified framework that integrates modality fusion and disease diagnosis is proposed to mine disease-related information for simultaneously accomplishing disease-related biomarker detection and disease diagnosis; and 3) disentangled representation learning and several adaptive metric constraints are incorporated into the unified framework to adaptively separate disease-specific information from disease-shared feature representations for effectively identifying disease-shared and disease-specific biomarkers, thereby deepening the understanding of disease pathogenesis. Extensive experiments on multiple real datasets and simulated data demonstrate that our method significantly improves performance of biomarker detection and disease diagnosis.

Abstract:
Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting – yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method – Knowledge Fusion followed by Knowledge Distillation – outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.

Abstract:
Category-Agnostic Pose Estimation (CAPE) aims to detect keypoints of unseen object categories in a few-shot setting, where the scarcity of labeled data poses significant challenges to generalization. In this work, we propose Prompt Pose Matching (PPM), a novel framework that unleashes the power of off-the-shelf text-to-image diffusion models for CAPE. PPM learns pseudo prompts from few-shot examples via the text-to-image diffusion model. These learned pseudo prompts capture semantic information of keypoints, which can then be used to locate the same type of keypoints from images. To provide prompts with representative initialization, we introduce a category-agnostic pre-training strategy to capture the foreground prior shared across categories and keypoints. To support the reliable prompt pre-training, we propose a Foreground-Aware Region Aggregation (FARA) module to provide robust and consistent supervision signal. Based on the foreground prior, a Foreground-Guided Attention Refinement (FGAR) module is further proposed to reinforce cross-attention responses for accurate keypoint localization. For efficiency, a Prompt Ensemble Inference (PEI) scheme enables joint keypoint prediction. Unlike previous methods that highly rely on base-category annotated data, our PPM framework can operate in a base-category-free setting while retaining strong performance. Code will be available at: https://github.com/DuoPeng-CVer/Prompt-Pose-Matching.

Abstract:
Transformer architecture has shown significant potential in various visual tasks, including point cloud registration. Positional encoding, as an order-aware module, plays a crucial role in Transformer framework. In this paper, we propose OIF-PCR++, a conditional positional encoding (CPE) method for point cloud registration. The core CPE module utilizes length and vector encoding at different stages, conditioned on the relative pose states between the point clouds to be registered. As a result, it progressively alleviates feature ambiguity through the incorporation of geometric cues. Building upon CPE, we introduce an iterative positional encoding optimization pipeline comprising two stages: 1) We find one correspondence via a differentiable optimal transport layer, and use it to encode length information into point cloud features, enhancing spatial consistency across different reference frames. 2) We apply a progressive direction alignment strategy to achieve rough alignment between paired point clouds, and then gradually incorporate direction information with the aid of this alignment, further enhancing feature distinctiveness and reducing feature ambiguity. Through this iterative optimization process, length and direction information are effectively integrated to achieve consistent and distinctive positional encoding, enabling the learning of discriminative point cloud features. Additionally, we present an inlier propagation mechanism that harmoniously integrates consistent geometric information for positional encoding. The proposed method is highly efficient, introducing marginal computational overhead while significantly improving feature distinguishability. Extensive experiments demonstrate superior performance over state-of-the-art methods on indoor, outdoor, object-level, and multi-way benchmarks, as well as strong generalization to complex real-world scenarios.

Abstract:
In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method to address the unsolved problem of visually explaining the attention evolution dynamics inside CNNs when making their classification decisions. A novel cascading neuron abandoning back-propagation algorithm is designed to precisely exclude the abandoned neurons on all intermediate layers inside a CNN model for the first time. Firstly, a Neuron Abandoning Back-Propagation module is proposed to generate Back-Propagation Feature Maps (BPFM) by using inverse function of the intermediate layers of CNN models, on which the neurons not used for decision-making are removed. Meanwhile, the cascading NA-BP modules calculate the tensors of importance coefficients which are linearly combined with the tensors of BPFMs to form the NAFlow. Secondly, to be able to visualize attention flow for similarity metric-based CNN models, a new channel contribution weights module is proposed to calculate the importance coefficients via Jacobian Matrix. Extensive evaluations demonstrate the effectiveness of the proposed NAFlow across eleven widely-used CNN models for various tasks of general image classification, contrastive learning classification, few-shot image classification, and image retrieval.

Abstract:
Video Question-Answering (VideoQA) enables machines to interpret and respond to complex video content, advancing human-computer interaction. However, existing multimodal large language models (MLLMs) often provide incomplete or opaque explanations, and existing benchmarks mainly focus on the correctness of final answers, limiting insight into their reasoning processes and hindering both transparency and verifiability. To address this gap, we propose the Question Parsing, Video Alignment and Answer Aggregation framework (QPVA$^3$), which leverages a compositional graph to drive visual and logical reasoning in VideoQA. Specifically, QPVA$^3$ consists of three core components (the planner, executor, and reasoner) that generate the compositional graph and conduct graph-driven reasoning. For the original question, the planner parses it into the compositional graph, capturing the underlying reasoning logic and structuring it into a series of interconnected questions. For each question in the compositional graph, the executor aligns the video by selecting relevant video clips and generates answers, ensuring accurate, context-specific responses. For each question with its first-order descendants, the reasoner aggregates answers by integrating reasoning logic with visual evidence, resolving conflicts to produce a coherent and accurate response. Moreover, to assess the performance of existing MLLMs in the reasoning processes of VideoQA, we introduce novel compositional consistency metrics and construct a VideoQA benchmark (QPVA$^3$ Bench) with 3,492 question-video tuples, each annotated with detailed compositional graphs and fine-grained answers. We evaluate the QPVA$^3$ framework on QPVA$^3$ Bench and 5 other VideoQA benchmarks. Experimental results demonstrate that our framework improves both consistency and accuracy compared to baselines, leading to a more transparent and verifiable VideoQA system. This approach has the potential to advance the field, as supported by our comprehensive evaluation and benchmarking efforts.

Abstract:
High-dimensional and incomplete (HDI) data are ubiquitous in various Big Data-related industrial applications, such as drug innovation and recommender systems. Hash learning is the most efficient representation learning approach to extract hidden information from HDI data owing to its fast reasoning and low storage. However, existing hash learning approaches commonly employ gradient-based optimization techniques to address the discrete objective caused by the binary nature of hash factors, where the quantization loss (i.e., the loss from quantizing real values to binary codes) is inevitable, resulting in accuracy loss when representing HDI data. Motivated by these critical issues, this paper proposes a non-gradient hash factor (NGHF) model with three-fold ideas: a) innovating a discrete differential evolution (DDE) algorithm able to simulate continuous optimization via disabling bits of binary codes based on the projected Hamming dissimilarity, thus enabling an effective discrete optimizer, b) applying the proposed DDE algorithm to directly optimize the discrete learning objective of NGHF defined on HDI data, thereby facilitating its efficient and precise training without any quantization loss, and c) theoretically proving the convergence of NGHF. As such, NGHF possesses high representation learning ability comparable to that of a real-valued model, making it able to achieve a precise binary representation of HDI data. Extensive experimental results on nine real-world datasets demonstrate that NGHF significantly outperforms eight state-of-the-art hash learning models. Moreover, its accuracy is comparable to that of a real-valued model for HDI data representation learning.

Abstract:
Image fusion aims to blend complementary information from diverse sensing modalities, yet most current methods lack robustness in complex fusion scenarios and cannot flexibly accommodate user intent. We present DiTFuse, the first Diffusion-Transformer (DiT) framework for instruction-driven, dynamic fusion control. Guided by natural-language instructions, DiTFuse flexibly blends multimodal content to enable hierarchical and fine-grained control over fusion dynamics. The training phase employs a multi-degrade-mask-image-modeling (M3) strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ideal reference images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.

Abstract:
Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.

Abstract:
Continual learning (CL) enables AI models to adapt to evolving environments while mitigating catastrophic forgetting, which is a critical capability for dynamic real-world applications. With the growing popularity of pre-trained Vision Transformer (ViT) models and visual prompt tuning (VPT) technique in CL, this work explores a CL method on top of the ViT-based foundation model, through VPT mechanism with theoretical guarantees. Inspired by the orthogonal projection method, we aim to leverage this approach for VPT to enhance CL performance, particularly in long-term scenarios. However, since the orthogonal projection is originally designed for linear operations in CNNs, applying it to ViTs poses challenges induced by the non-linear self-attention mechanism and the distribution drift within LayerNorm. To address these issues, we deduced two orthogonality conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of maintaining stability. Considering the strict orthogonal constraints can diminish model capacity and reduce plasticity, we further propose an importance-aware orthogonal regularization framework. By applying varying degrees of orthogonal constraints to different parameters based on their importance to old and new tasks, the framework adaptively enhances model capacity and thereby promotes long-sequence CL while improving the stability-plasticity trade-off. To implement the proposed approach, a null-space-based approximation solution is employed to efficiently achieve the prompt gradient orthogonal projection. Extensive experiments on various class-incremental learning benchmarks demonstrate that our method achieves state-of-the-art performance across diverse CL scenarios.

Abstract:
The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: 1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; 2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; 3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.

Abstract:
Graph-level anomaly detection (GLAD) aims to identify graphs that significantly deviate from the norm. Despite remarkable advancements in recent years, existing GLAD approaches struggle with the scarcity of labeled anomalies. Although some semi-supervised approaches leverage a small fraction of anomalous graphs during training, the limited diversity of these anomalies poses challenges in learning robust decision boundaries. Additionally, the detection of multi-task graph anomalies, a prevalent challenge in real-world scenarios, remains largely unexplored. To bridge these gaps, we propose MoEGAD, a novel framework leveraging a mixture of experts (MoE) architecture for GLAD. MoEGAD introduces an iterative anomalous graph generation module to produce pseudo-anomalous graphs, which facilitates the subsequent decision boundary learning. An early stopping mechanism is incorporated to ensure that the generated anomalies preserve sufficient dissimilarity from normal graphs. More importantly, we also propose a latent MoE module comprising multiple expert networks alongside a specialized gating network, which promotes cross-task adaptability for diverse GLAD problems. To the best of our knowledge, this is the first work exploring the potential of MoE architecture in the context of GLAD. Extensive experiments across single-task, large-scale, and multi-task scenarios demonstrate that MoEGAD significantly outperforms state-of-the-art GLAD baselines.

Abstract:
Graph Neural Networks (GNNs) have made significant strides in the analysis and modeling of complex network data, particularly excelling in graph and node classification tasks. However, the “closed box” nature of GNNs impedes user understanding and trust, thereby restricting their broader application. This challenge has spurred a growing focus on demystifying GNNs to make their decision-making processes more transparent. Traditional methods for explaining GNNs often rely on selecting subgraphs and employing combinatorial optimization to generate understandable outputs. However, these methods are closely linked to the inherent complexity of GNNs, leading to higher explanation costs. To address this issue, we introduce a lower-complexity proxy model to explain GNNs. Our approach leverages knowledge distillation with inter-layer alignment, specifically targeting the challenge of over-smoothing and its detrimental impact on model explanation. Initially, we distill critical insights from complex GNN models into a more manageable proxy model. We then apply an inter-layer alignment-based distillation technique to ensure alignment between the proxy and the original model, facilitating the extraction of node or edge-level explanations within the proxy framework. We theoretically prove that the explanations derived from the proxy model are faithful to both the proxy and the original model. Additionally, we show that the upper bound of unfaithfulness between the proxy and the original model remains consistent when the distillation error is infinitesimal. This inter-layer alignment knowledge distillation technique enables the proxy model to retain the knowledge learning and topological representation capabilities of the original model to the greatest extent. Experimental evaluations on numerous real-world datasets confirm the effectiveness of our method, demonstrating robust performance.

Abstract:
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance across all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as “re-indexing”. However, historical gallery data typically cannot be saved directly because of data privacy issues, and re-indexing large-scale gallery images is costly. As a result, retrieval becomes incompatible between query features extracted by the updated model and gallery features extracted by the models before the update, greatly impairing re-identification performance. To tackle this issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: it requires continuous learning, balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C$^2$R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously update the old gallery features to the new feature space after each update. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between the new and old knowledge in dual feature spaces. Thirdly, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify our proposed Bi-C$^2$R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

Abstract:
In-context segmentation, also known as one-shot segmentation, aims to segment objects based on a single labeled example. While the Segment Anything Model (SAM) excels in interactive segmentation, it is not inherently designed for in-context tasks. To bridge this gap, we propose a new Dual Consistency SAM (DC-SAM), a prompt-tuning framework that adapts SAM and SAM2 for both image and video in-context segmentation. Instead of relying solely on pre-trained backbones, DC-SAM enhances the prompt encoder by generating high-quality visual prompts through feature fusion. Furthermore, we introduce a novel cycle-consistent cross-attention mechanism to enforce alignment between fused features and visual prompts, complemented by a dual-branch design incorporating discriminative positive and negative prompts. Additionally, we extend DC-SAM to the video domain via a novel mask-tube training strategy. To facilitate research, we curate the first In-Context Video Object Segmentation (IC-VOS) benchmark. Extensive experiments demonstrate that DC-SAM achieves state-of-the-art performance, yielding 55.5 (+1.4) mIoU on COCO-20$^i$, 73.0 (+1.1) mIoU on PASCAL-5$^i$, and a $\mathcal{J}\&\mathcal{F}$ score of 71.52 on IC-VOS.

Abstract:
We address a new problem of multi-view multi-human tracking in the bird’s eye view (BEV). Different from previous works, we require neither calibration among the multi-view cameras nor an actually captured BEV video. This makes the studied problem closer to real-world applications, but also more challenging. To this end, we propose a novel BEVTrack scheme. Specifically, given multi-view videos, we first use a virtual BEV transform module to obtain the BEV for each view. Then, we propose a unified BEV alignment module to fuse the separately generated BEVs, in which we design self-supervised losses that consider both spatial consistency and temporal continuity. During inference, we design a camera-subject collaborative registration and tracking strategy that exploits the mutual dependence between the multi-view cameras and the multiple targets to achieve the desired BEV tracking. We also build a new benchmark for training and evaluation. The experimental results on this benchmark verify the rationality of the problem and the effectiveness of our method.

Abstract:
Multi-task learning (MTL) presents greater optimization challenges than single-task learning (STL) due to conflicting gradients across tasks. While parameter sharing promotes cooperation among related tasks, many tasks require specialized representations. To balance cooperation and specialization, we propose Mod-Squad (Chen et al. 2023), a modular transformer-based model composed of a “squad” of experts. Each task activates a sparse subset of experts through a differentiable matching process, guided by a novel mutual information-based loss. This modular structure avoids full backbone sharing and scales effectively with the number of tasks and dataset size. In this extended version, we generalize Mod-Squad to support multi-dataset pre-training, enabling joint learning across disjoint, single-task datasets (e.g., ImageNet, COCO, ADE20K). This is achieved via a new formulation of the mutual information loss that unifies learning across heterogeneous sources. More importantly, while most prior work in large models has focused on efficiency, few have explored adjustable efficiency. In this study, we further evaluate the model’s generalization to downstream tasks and introduce a set of efficient adaptation techniques that leverage Mod-Squad’s modularity for flexible fine-tuning, enabling dynamic adjustment of model size, parameter count, and computational cost. Additionally, we present a hybrid adaptation scheme that combines these techniques to achieve favorable performance–efficiency trade-offs. In summary, Mod-Squad provides a robust foundation for sparse modular models that can learn from diverse supervision and datasets. Its emergent modularity enables strong generalization, decomposition into high-performing components, and rapid, resource-efficient adaptation for downstream applications.

Abstract:
Model fusion aims to integrate several deep neural network (DNN) models’ knowledge into one by fusing parameters, and it has promising applications, such as improving the generalization of foundation models and parameter averaging in federated learning. However, models under different settings (data, hyperparameter, etc.) have diverse neuron permutations; in other words, from the perspective of loss landscape, they reside in different loss basins, thus hindering model fusion performances. To alleviate this issue, previous studies highlighted the role of permutation invariance and have developed methods to find correct network permutations for neuron alignment after training. Orthogonal to previous attempts, this paper studies training-time neuron alignment, improving model fusion without the need for post-matching. Training-time alignment is cheaper than post-alignment and is applicable in various model fusion scenarios. Starting from fundamental hypotheses and theorems, a simple yet lossless algorithm called TNA-PFN is introduced. TNA-PFN utilizes partially fixed neuron weights as anchors to reduce the potential of training-time permutations, and it is empirically validated in reducing the barriers of linear mode connectivity and multi-model fusion. It is also validated that TNA-PFN can improve the fusion of pretrained models under the setting of model soup (vision transformers) and ColD fusion (pretrained language models). Based on TNA-PFN, two federated learning methods, FedPFN and FedPNU, are proposed, showing the prospects of training-time neuron alignment. FedPFN and FedPNU reach state-of-the-art performances in federated learning under heterogeneous settings and can be compatible with the server-side algorithm.

Abstract:
While text-to-image diffusion models exhibit outstanding results, they struggle to faithfully generate key subjects with their corresponding attributes in prompts, challenges known as catastrophic neglect and attribute binding. Previous works typically use attention adjustments to solve these problems, but we observe that they may still generate unfaithful images. In this paper, we carefully analyze the text-to-image process and pinpoint three pivotal bottlenecks that hinder faithful image generation: (1) unequal responses of neglected subjects in text embedding, (2) competition and entanglement between subjects’ attention, and (3) suboptimal quality of intermediate features from U-Net. Based on these observations, we propose a Refine, Control, and Distill (RCD) framework built upon the stable diffusion model to alleviate the negative effects of the three bottlenecks, respectively. Specifically, we achieve these goals through a text embedding refinement module, three region-level attention control losses, and self-distillation of intermediate semantic features in the denoising process. Our approach exhibits promising capability in generating faithful and high-quality images and outperforms state-of-the-art methods in extensive quantitative and qualitative evaluations on recent advanced base diffusion models.

Abstract:
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interferences. To address this challenge, in this paper, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after fine-tuning. Then, our WEMoE statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and implementing shared routing across multiple MoE modules, thereby significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the merged model produced by WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
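The static merging that the abstract above contrasts against can be sketched as plain task arithmetic. This is a minimal illustration under stated assumptions, not WEMoE or E-WEMoE: parameters are represented as dictionaries of float lists, and the function name `task_arithmetic_merge` and the `scale` coefficient are hypothetical choices made for this example.

```python
def task_arithmetic_merge(pretrained, finetuned_models, scale=0.5):
    """Statically merge models: pretrained + scale * sum of task vectors.

    A task vector is (fine-tuned weights - pretrained weights).
    All names here are illustrative assumptions, not the paper's API.
    """
    merged = {}
    for name, base in pretrained.items():
        # Sum the per-task differences element by element.
        delta = [
            sum(model[name][i] - base[i] for model in finetuned_models)
            for i in range(len(base))
        ]
        merged[name] = [b + scale * d for b, d in zip(base, delta)]
    return merged


# Toy example: one parameter tensor "w", two fine-tuned models.
pretrained = {"w": [1.0, 2.0]}
task_a = {"w": [2.0, 2.0]}  # task A changed the first weight
task_b = {"w": [1.0, 4.0]}  # task B changed the second weight

merged = task_arithmetic_merge(pretrained, [task_a, task_b], scale=0.5)
print(merged["w"])  # [1.5, 3.0]
```

The merged weights are fixed once computed, whatever the input. That is the static behavior the abstract describes replacing with input-dependent merging of expert modules.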

Abstract:
Shapley value is a widely used tool in explainable artificial intelligence (XAI), as it provides a principled way to attribute contributions of input features to model outputs. However, estimation of Shapley value requires capturing conditional dependencies among all feature combinations, which poses significant challenges in complex data environments. In this article, EmSHAP (Energy-based model for Shapley value estimation), an accurate Shapley value estimation method, is proposed to estimate the expectation of Shapley contribution function under the arbitrary subset of features given the rest. By utilizing the ability of energy-based model (EBM) to model complex distributions, EmSHAP provides an effective solution for estimating the required conditional probabilities. To further improve estimation accuracy, a GRU (Gated Recurrent Unit)-coupled partition function estimation method is introduced. The GRU network captures long-term dependencies with a lightweight parameterization and maps input features into a latent space to mitigate the influence of feature ordering. Additionally, a dynamic masking mechanism is incorporated to further enhance the robustness and accuracy by progressively increasing the masking rate. Theoretical analysis on the error bound as well as application to four case studies verified the higher accuracy and better scalability of EmSHAP in contrast to competitive methods.

Abstract:
Recent studies show that visual place recognition (VPR) methods using pre-trained visual foundation models can achieve promising performance. In our previous work, we proposed a novel method to realize seamless adaptation of foundation models to VPR (SelaVPR). This method can produce both global and local features that focus on discriminative landmarks to recognize places for two-stage VPR by a parameter-efficient adaptation approach. Although SelaVPR has achieved competitive results, we argue that the previous adaptation is inefficient in training time and GPU memory usage, and the re-ranking paradigm is also costly in retrieval latency and storage usage. In pursuit of higher efficiency and better performance, we propose an extension of SelaVPR, called SelaVPR++. Concretely, we first design a parameter-, time-, and memory-efficient adaptation method that uses lightweight multi-scale convolution (MultiConv) adapters to refine intermediate features from the frozen foundation backbone. This adaptation method does not back-propagate gradients through the backbone during training, and the MultiConv adapter facilitates feature interactions along the spatial axes and introduces proper local priors, thus achieving higher efficiency and better performance. Moreover, we propose an innovative re-ranking paradigm for more efficient VPR. Instead of relying on local features for re-ranking, which incurs huge overhead in latency and storage, we employ compact binary features for initial retrieval and robust floating-point (global) features for re-ranking. To obtain such binary features, we propose a similarity-constrained deep hashing method, which can be easily integrated into the VPR pipeline. Finally, we improve our training strategy and unify the training protocol of several common training datasets to merge them for better training of VPR models. Extensive experiments show that SelaVPR++ is highly efficient in training time, GPU memory usage, and retrieval latency (6000× faster than TransVPR), and that it outperforms the state-of-the-art methods by a large margin (ranking 1st on the MSLS challenge leaderboard).

Abstract:
We propose Cross-Attention in Audio, Space, and Time (C$\text{A}^2$ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, Kinetics-400, ActivityNet, and HD-EPIC, to show balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, EPIC-SOUNDS, and HD-EPIC-SOUNDS. CAVA shows favorable performance on these datasets, demonstrating the effective information exchange among multiple experts within the B-CA module. In addition, C$\text{A}^2$ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.

Abstract:
Detecting oriented tiny objects, which are limited in appearance information yet prevalent in real-world applications, remains an intricate and under-explored problem. To address this, we systematically introduce a new dataset, a benchmark, and a dynamic coarse-to-fine learning scheme in this study. Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets. Based on AI-TOD-R, we present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches. Through investigation, we identify a learning bias present across various learning pipelines: confident objects become increasingly confident, while vulnerable oriented tiny objects are further marginalized, hindering their detection performance. To mitigate this issue, we propose a Dynamic Coarse-to-Fine Learning (DCFL) scheme towards unbiased learning. DCFL dynamically updates prior positions to better align with the limited areas of oriented tiny objects, and it assigns samples in a way that balances both quantity and quality across different object shapes, thus mitigating biases in prior settings and sample selection. Extensive experiments across 10 challenging object detection datasets demonstrate that DCFL achieves state-of-the-art accuracy, high efficiency, and remarkable versatility.

Abstract:
The state-of-the-art zero-shot cross-lingual spoken language understanding (SLU) model utilizes cross-lingual unsupervised contrastive learning to achieve multilingual semantics alignment. While existing methods have achieved promising results, they still have two issues limiting cross-lingual knowledge transfer: (1) dual-task correlative knowledge is not explicitly modeled and transferred to target languages; (2) the semantics differences among samples are ignored, and the contrastive semantics knowledge is not transferred to target languages. In this paper, we propose a dual-task cross-lingual alignment network (DXA-Net), which makes the first attempt to tackle zero-shot cross-lingual SLU based on the prompt-tuning paradigm. To solve the first issue, we propose the co-guiding prompt, which allows the model to conditionally generate one task’s label based on another one’s. To solve the second issue, we propose the intent/slot contrastive prompt to teach the model to discriminate whether a pair of samples have the same or similar labels. Additionally, we propose multilingual semantics contrastive prompt to enhance multilingual semantics alignment. Experiments on the benchmark show that our model achieves new state-of-the-art performance on nine languages.

Abstract:
This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach.
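The size sensitivity described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's SIEva definition: it assumes the image is flattened into a list of pixels, that each region is given as a hypothetical list of pixel indices, and it uses mean absolute error (MAE) as the example metric.

```python
def global_mae(pred, gt):
    """MAE over all pixels: every pixel counts equally, so large regions dominate."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def region_averaged_mae(pred, gt, regions):
    """Evaluate each region separately, then average, so every region counts equally.

    `regions` is a list of pixel-index lists (an assumption made for this sketch).
    """
    per_region = [
        sum(abs(pred[i] - gt[i]) for i in idx) / len(idx) for idx in regions
    ]
    return sum(per_region) / len(per_region)


# Toy image of 100 pixels: a 64-pixel object, a 4-pixel object, and background.
large = list(range(0, 64))
small = list(range(64, 68))
gt = [1.0] * 68 + [0.0] * 32

# The prediction gets the large object right but misses the small one entirely.
pred = list(gt)
for i in small:
    pred[i] = 0.0

print(global_mae(pred, gt))                           # 0.04
print(region_averaged_mae(pred, gt, [large, small]))  # 0.5
```

The pixel-averaged score barely registers the missed small object, while the per-region average exposes it. That is the imbalance a size-invariant protocol is meant to correct.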

Abstract:
Deep model training on extensive datasets is increasingly cost-prohibitive, prompting adoption of deep model fusion to leverage knowledge from pre-existing models. From weight averaging to more sophisticated methods, fusion effectively improves model performance and accelerates new model development. However, parameter interference between models and the lack of interpretability remain challenges. Existing methods address interference by evaluating parameter attributes, such as magnitude or sign, or by pruning. We begin by examining the fine-tuning of linear layers through the lens of subspace analysis and define parameter interference as an optimization problem. Subsequently, we introduce an innovative approach called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which upscales source models into an MoE model without extra data or training. Our approach relies on the observation that fine-tuning mostly keeps the important parts from the pre-training, but it uses less significant or unused areas to adapt to new tasks. Additionally, the issue of parameter interference, which is intrinsically challenging in the original parameter space, can be managed by expanding the dimensions. We conduct extensive experiments across both image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to LLMs, highlighting the adaptability and scalability of SMILE. For fully fine-tuned models, about 50% additional parameters can achieve around 98%-99% of the performance of eight individual fine-tuned ViT models, while for LoRA fine-tuned Flan-T5 models, 99% of the performance is maintained with only 2% extra parameters. Code is available at https://github.com/tanganke/fusion_bench.

Abstract:
Variance reduction has been shown to improve the performance of Stochastic Gradient Descent (SGD) in centralized machine learning. However, when it is extended to federated learning systems, many issues may arise, including (i) mega-batch size settings; (ii) additional noise introduced by the gradient difference between the current iteration and the snapshot point; and (iii) gradient (statistical) heterogeneity. In this paper, we propose a lightweight algorithm termed federated adaptive batch size time evolving variance reduction (FedATEVR) to tackle these issues, consisting of an adaptive batch size setting scheme and a time-evolving variance reduction gradient estimator. In particular, we use the historical gradient information to set an appropriate mega-batch size for each client, which can steadily accelerate the local SGD process and reduce the computation cost. The historical information involves both global and local gradients, which mitigates the unstable variation in mega-batch size introduced by gradient heterogeneity among the clients. For each client, the gradient difference between the current iteration and the snapshot point is used to tune the time-evolving weight of the variance reduction term in the gradient estimator. This can avoid meaningless variance reduction caused by the out-of-date snapshot point gradient. We theoretically prove that our algorithm can achieve a linear speedup of $\mathcal{O}(\frac{1}{\sqrt{SKT}})$ for non-convex objective functions under partial client participation. Extensive experiments demonstrate that our proposed method can achieve higher test accuracy than the baselines and decrease communication rounds greatly.

Abstract:
The structured illumination microscopy (SIM) technique, when applied under low photon efficiency, provides an effective solution for rapid live-cell imaging, thereby enabling the investigation of dynamic cellular processes. However, noise interference during the acquisition process significantly hinders the reconstruction of SIM images, leading to substantial artifacts. To address this challenge, we propose a zero-shot learning-based SIM image denoising method (ZS-SIM). This approach relies solely on a single acquisition of noisy SIM data and achieves accurate denoising through neural network training. The original SIM image stack is downsampled and interpolated to complete the resampling process, while the traditional Wiener-SIM reconstruction method is integrated to ensure physical fidelity. We introduce a symmetric reconstruction loss and a mutual constraint SSIM loss that jointly enhance training stability and accelerate convergence, as demonstrated by our convergence analysis. ZS-SIM further achieves a favorable balance between denoising quality and computational efficiency, with low model complexity and fast inference speed, making it well-suited for practical deployment in microscopy workflows. Experimental results demonstrate that ZS-SIM efficiently and rapidly achieves artifact-free, high-fidelity denoising reconstruction, making it particularly well-suited for low-photon efficiency live-cell imaging and scenarios with limited computational resources. Furthermore, by extending the method to scanning electron microscopy (SEM) data, we validate the effectiveness of ZS-SIM for SEM data denoising, significantly enhancing the performance of downstream segmentation tasks. We anticipate that ZS-SIM will play a pivotal role in low-photon efficiency imaging, driving advancements in this field and providing crucial support for rapid validation in biomedical research, thereby overcoming the challenges posed by acquisition noise.

Abstract:
Infrared and visible images belong to different domains, which hinders the fusion process and leads to a loss of texture details. Besides, low-level fusion and the subsequent high-level segmentation exhibit a cross-task feature gap that impedes their mutual promotion, causing blurred object edges. To address these issues, this paper proposes a novel infrared and visible image fusion method that simultaneously crosses domains and tasks. First, a swap image translation strategy is built to transfer the features of visible and infrared images into an adaptive domain. Meanwhile, a global-local constraint is introduced to achieve overall domain space transfer and shorten their feature distance. Second, a task interaction & query module is designed to explore the cross-task feature interactive relationship, which is then used as a bridge to realize the gradient backpropagation. Thus, a fine-grained mapping from the segmentation feature to the fusion feature is obtained. Extensive experiments demonstrate that the proposed method achieves fusion and segmentation performance superior to that of state-of-the-art methods.

Abstract:
The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions.

Abstract:
Specific emitter identification (SEI) refers to the technique of identifying different individuals from the signals emitted by wireless devices. Recent studies have focused mainly on deep learning (DL) models that automatically learn valid inherent features from raw time-domain signals. However, current studies rarely consider real open-world scenarios, where new classes may emerge during the inference phase, and the utilized model must evolve as new classes incrementally appear. An incremental open-world learning (IOWL) framework is proposed in this paper, and we show how IOWL can continually recognize and learn new classes. The proposed method is based on a novel exemplar selection and generalization mechanism. First, by applying edge pattern detection (EPD) and shifting edge samples along the adversarial direction, a high-quality pseudo unknown dataset is generated to improve the open-set recognition (OSR) process. Second, a hybrid class-incremental learning method is proposed to maintain the previous identification capabilities through boundary exemplar generation, which not only benefits each individual paradigm but also highlights their synergies in a common framework. We provide a theoretical analysis of the obtained generalization error bounds to prove the benefits of the proposed method. Numerical results on real collected data indicate that IOWL consistently outperforms the other baseline algorithms.

Abstract:
In real-world scenarios, training and test data are often collected in diverse settings, leading to domain shifts arising from evolving environments and selection bias. While causality-inspired methods have shown promising results in tackling the out-of-distribution (OOD) generalization issue, prior methods treat the discovered differences across domains as confounding variables. While effective in handling domain differences (i.e., unseen environmental features in test data), they may fail when confronted with intricate spurious correlations in real-world datasets. In this study, we first attribute this limitation to inadequate modeling of causal intervention and derive the OOD generalization bound to explain the challenges it introduces. To address this problem, we propose a modified causal intervention approach to mitigate various types of confounders. Motivated by the mathematical formulation of our modified causal intervention, we introduce the Causal Feature Selection Module (CFSM) to suppress model weights on both domain-difference features and spurious correlation features. Integrated within the Base Feature Extraction Module, In-Sample Module, and Cross-Sample Module (B-I-C architecture), CFSM collectively neutralizes the confounding effects arising from both domain discrepancies and correlation distinctions, thereby achieving causal feature selection. Under mild assumptions, we prove that the proposed CFSM method can achieve strictly lower OOD errors. Further experiments conducted on various benchmark datasets demonstrate the effectiveness of the proposed method. Compared to previous deconfounding methods, our method mitigates the effect of not only domain-difference features but also the hard-to-identify spurious correlation features, achieving significant improvements in two-dimensional OOD generalization.

Abstract:
In this work, we introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in a coarse-to-fine manner in only about 3 minutes. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works.

Abstract:
The misuse of deep learning-based facial manipulation poses a serious threat to civil rights. To prevent such fraud at its source, proactive defense methods have been proposed that embed invisible adversarial perturbations into images, disrupting the manipulation process and rendering the forged output unconvincing to observers. However, non-targeted disruption of the output may leave identifiable facial features intact, potentially leading to the stigmatization of individuals. In this work, we propose a universal framework for combating facial manipulation, termed ID-Guard. The framework employs a single forward pass of an encoder–decoder network to generate cross-model transferable adversarial perturbations. We introduce a novel Identity Destruction Module (IDM) to suppress identifiable features in manipulated faces. The perturbation generation is optimized by formulating the disruption of various manipulation types as a multi-task learning problem, with a dynamic weighting strategy designed to enhance cross-model performance. Experimental results show that ID-Guard effectively defends against diverse facial manipulation models while degrading identifiable regions in manipulated images. It also enables disrupted images to evade facial inpainting and facial recognition systems. Moreover, ID-Guard can be seamlessly integrated as a plug-and-play component into other tasks, such as adversarial training.

Affiliations: School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; School of Psychological and Cognitive Sciences and Beijing Key Laboratory of Behavior and Mental Health, IDG/McGovern Institute for Brain Research, Peking-Tsinghua Center for Life Sciences, and Key Laboratory of Machine Perception, MOE, Peking University, Beijing, China; Brain Health Institute, National Center for Mental Disorders, Shanghai Mental Health Center, School of Medicine and School of Psychology, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Humans exhibit remarkable abilities in recognizing relationships and performing complex reasoning. In contrast, deep neural networks have long been critiqued for their limitations in abstract visual reasoning (AVR), a key challenge in achieving artificial general intelligence. Drawing on the well-known concept of prediction errors from neuroscience, we propose that prediction errors can serve as a unified mechanism for both supervised and self-supervised learning in AVR. In our novel supervised learning model, AVR is framed as a prediction-and-matching process, where the central component is the discrepancy (i.e., prediction error) between a predicted feature based on abstract rules and candidate features within a reasoning context. In the self-supervised model, prediction errors as a key component unify the learning and inference processes. Both supervised and self-supervised prediction-based models achieve state-of-the-art performance on a broad range of AVR datasets and task conditions. Most notably, hierarchical prediction errors in the supervised model automatically decrease during training, an emergent phenomenon closely resembling the decrease of dopamine signals observed in biological learning. These findings underscore the critical role of prediction errors in AVR and highlight the potential of leveraging neuroscience theories to advance computational models for high-level cognition in artificial intelligence.

Abstract:
Few-shot object detection (FSOD) poses a significant challenge due to the difficulty of learning robust and discriminative object representations under limited supervision. A widely adopted solution is the two-stage fine-tuning framework, wherein knowledge acquired from a large-scale base dataset is transferred to a novel dataset containing only a small number of labeled instances. However, this framework is prone to systematically misclassifying novel objects as background, primarily due to incorrect background labels caused by the domain gap between base and novel datasets, an issue exacerbated by the sparse representation of novel categories. In this work, we show that this inherent weakness can be exploited by explicitly redefining the category structure and transferring the representations learned during the base training stage. Building on this insight, we propose a simple yet effective framework grounded in the Product of Experts (PoE) formulation, which estimates the joint distribution over background and novel categories by combining the unnormalized logits from independently trained classifiers. Notably, it does not require modifications of the base model or repetition of the base training phase. Furthermore, we introduce a strategy for identifying additional novel-category instances within the base dataset, effectively augmenting the training set for fine-tuning. The resulting method is architecture-agnostic, imposes negligible overhead, and integrates seamlessly with existing two-stage fine-tuning pipelines. Extensive experiments on PASCAL VOC and COCO demonstrate that the proposed method yields consistent improvements across different baselines, achieving significant gains over state-of-the-art FSOD approaches.

Abstract:
Optimizing the performance of deep neural networks (DNNs) remains a significant challenge due to the sensitivity of models to both hyperparameter selection and weight initialization. Existing approaches typically address these two factors independently, which often limits adaptability and overall effectiveness. In this paper, we present a novel meta-learning framework that jointly recommends hyperparameters and initial weights by leveraging dataset similarity. Our method begins by extracting meta-features from a collection of historical datasets. For a given query dataset, similarity is computed based on distances in the meta-feature space, and the most similar historical datasets are used to recommend the underlying parameter configurations. To capture the diverse characteristics of image datasets, we introduce two complementary types of meta-features. The first, referred to as shallow or visible meta-features, comprises five groups of statistical measures that summarize color and texture information. The second, termed deep or invisible meta-features, consists of 512 descriptors extracted from a convolutional neural network pre-trained on ImageNet. We evaluated our framework on 105 real-world image classification tasks, using 75 datasets for historical modeling and 30 for querying. Experimental results with both vision transformers and convolutional neural networks demonstrate that our approach consistently outperforms state-of-the-art baselines, underscoring the effectiveness of dataset-driven parameter recommendation in deep learning.
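The similarity-based recommendation step described above can be sketched as a nearest-neighbor lookup in meta-feature space. This is a minimal illustration under stated assumptions, not the paper's implementation: Euclidean distance, the two-dimensional toy features, the record layout, and the names `recommend`, `dataset_a`, and `dataset_b` are all hypothetical.

```python
import math


def recommend(query_features, history, k=1):
    """Return the configurations of the k historical datasets nearest to the query.

    Distance is Euclidean in meta-feature space (an assumption for this sketch).
    """
    ranked = sorted(
        history,
        key=lambda record: math.dist(query_features, record["features"]),
    )
    return [record["config"] for record in ranked[:k]]


# Hypothetical historical datasets with toy 2-D meta-features and stored configs.
history = [
    {"name": "dataset_a", "features": [0.1, 0.9], "config": {"lr": 1e-3}},
    {"name": "dataset_b", "features": [0.8, 0.2], "config": {"lr": 1e-4}},
]

# The query lies much closer to dataset_b, so its configuration is recommended.
print(recommend([0.75, 0.25], history))  # [{'lr': 0.0001}]
```

In the setting the abstract describes, the feature vectors would instead hold the statistical and CNN-derived meta-features, and each stored configuration would include initial weights as well as hyperparameters.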

Abstract:
In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models’ robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (e.g., ImageNet-1K) and sentiment analysis for text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.
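
The two MJP steps named in this abstract (shuffle tokens, then mask the position embeddings of the shuffled tokens with a shared unk embedding) can be sketched on plain Python lists. The `ratio` parameter, the choice to shuffle only a random subset, and the function name are assumptions made for illustration; in the paper the unk embedding is learnable, whereas here it is just a placeholder value.

```python
import random

def masked_jigsaw(tokens, pos_emb, unk_emb, ratio, rng):
    # Pick a random subset of positions to shuffle (subset size is an assumption).
    n = len(tokens)
    chosen = sorted(rng.sample(range(n), int(n * ratio)))
    permuted = chosen[:]
    rng.shuffle(permuted)
    out_tokens, out_pos = list(tokens), list(pos_emb)
    for dst, src in zip(chosen, permuted):
        out_tokens[dst] = tokens[src]   # token order is broken within the subset
        out_pos[dst] = unk_emb          # its position embedding is masked out
    return out_tokens, out_pos, chosen
```

Tokens outside the chosen subset keep both their order and their original position embeddings, so only part of the spatial information is hidden.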

Abstract:
A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. By this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery-likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, directly obtained from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.

Abstract:
Graph Neural Networks (GNNs) have achieved remarkable success in machine learning tasks by learning the features of graph data. However, experiments show that vanilla GNNs fail to achieve good classification performance in the field of graph anomaly detection. To address this issue, we propose and theoretically prove that the high-Class Homophily Variance (CHV) characteristic is the reason behind the suboptimal performance of GNN models in anomaly detection tasks. Statistical analysis shows that in most standard node classification datasets, homophily levels are similar across all classes, so CHV is low. In contrast, graph anomaly detection datasets have high CHV, as benign nodes are highly homophilic while anomalies are not, leading to a clear separation. To mitigate its impact, we propose a novel GNN model named Homophily Edge Augment Graph Neural Network (HEAug). Different from previous work, our method emphasizes generating new edges with low CHV value, using the original edges as an auxiliary. HEAug samples homophily adjacency matrices from scratch using a self-attention mechanism, and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to punish the generation of unnecessary heterophilic edges by the model. Extensive comparison experiments demonstrate that HEAug achieved the best performance across eight benchmark datasets, including anomaly detection, edgeless node classification and adversarial attack. We also defined a heterophily attack to increase the CHV value in other graphs, demonstrating the effectiveness of our theory and model in various scenarios.
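
To make the notion of Class Homophily Variance concrete, the sketch below computes a per-class homophily score and its variance on a small undirected graph. The definition used here (for each class, the fraction of edge endpoints of that class whose neighbor shares the class) is an assumption chosen for illustration; the abstract does not give the paper's exact formula.

```python
from statistics import pvariance

def class_homophily(edges, labels):
    # edges: list of undirected (u, v) pairs; labels: class label per node index.
    same, total = {}, {}
    for u, v in edges:
        for a, b in ((u, v), (v, u)):  # count each edge from both endpoints
            c = labels[a]
            total[c] = total.get(c, 0) + 1
            same[c] = same.get(c, 0) + (labels[a] == labels[b])
    return {c: same[c] / total[c] for c in total}

def class_homophily_variance(edges, labels):
    # Population variance of the per-class homophily scores (assumed definition).
    return pvariance(list(class_homophily(edges, labels).values()))
```

Under this definition, a graph whose benign nodes connect mostly to each other while anomalies connect mostly to benign nodes yields a large variance, matching the qualitative picture in the abstract.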

Abstract:
Online Active Learning (OAL) is a powerful tool for classifying evolving data streams using limited annotations from a human operator who is a domain expert. The objective of the OAL learning paradigm is to minimize jointly the classification error rate and the annotation cost across the data stream by posing periodic Active Learning (AL) queries. In this paper, this objective is extended to include identification of classifier errors by the expert during the typical workflow. To this end, Corrective Feedback (CF) is introduced as a second channel of interaction between the expert and the learning algorithm, complementary to the AL channel, that allows the algorithm to obtain additional training labels without disrupting the expert’s workflow. Online Active Learning with Corrective Feedback (OAL-CF) is formally defined as a paradigm, and its efficacy is proven through experimental application to two binary classification tasks, Spoken Language Verification and Voice-Type Discrimination. Finally, the effects of adding CF to the OAL paradigm are analyzed in terms of classification performance, annotation cost, trends over time, and class balance of the collected training data. Overall, the addition of CF results in a 53% relative reduction in cost compared to OAL without CF.

Abstract:
With the popularity of personal devices, there are abundant valuable face image datasets in the industry, which provide opportunities for the development of visual models. However, privacy concerns related to identity-sensitive information hinder the sharing of face datasets. Although existing works are dedicated to removing identity-sensitive information from images, they either lack provable privacy guarantees or compromise crucial face dataset utilities, e.g., identity correlation and image naturalness. To overcome these weaknesses, we propose a novel face dataset publication scheme that protects face images by obfuscating face features. The obfuscated features still retain a certain level of correlation, allowing the protected dataset to be used for training. In the process of obfuscating the features, we design a novel metric differential privacy mechanism, which can enhance the correlation between features while ensuring privacy. Furthermore, we construct a latent diffusion model with identity and attribute as inputs to improve the naturalness of generated images. Extensive experimental results and theoretical analysis demonstrate that our scheme significantly outperforms existing works in providing privacy protection while maintaining high dataset utility for downstream tasks.

Abstract:
Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining (training a single model to simultaneously handle diverse tasks and datasets) remains under-explored. To this end, our novel framework, MuseGraph, seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs’ generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.

Abstract:
Layout plays a crucial role in graphic design and poster generation. Recently, the application of deep learning models for layout generation has gained significant attention. This paper focuses on using a GAN-based model conditioned on images to generate advertising poster graphic layouts, requiring a dataset of paired product images and layouts. To address this task, we introduce the Content-aware Graphic Layout Dataset (CGL-Dataset), consisting of 60,548 paired inpainted posters with annotations and 121,000 clean product images. The inpainting artifacts introduce a domain gap between the inpainted posters and clean images. To bridge this gap, we design two GAN-based models. The first model, CGL-GAN, uses Gaussian blur on the inpainted regions to generate layouts. The second model combines unsupervised domain adaptation by introducing a GAN with a pixel-level discriminator (PD), abbreviated as PDA-GAN, to generate image-aware layouts based on the visual texture of input images. The PD is connected to shallow-level feature maps and computes the GAN loss for each input-image pixel. Additionally, we propose three novel content-aware metrics to assess the model’s ability to capture the intricate relationships between graphic elements and image content. Quantitative and qualitative evaluations demonstrate that PDA-GAN achieves state-of-the-art performance and generates high-quality image-aware layouts.

Abstract:
Pooling and unpooling are indispensable in constructing hierarchical spherical convolutional neural networks (HS-CNNs). Most existing models employ simple downsampling-based pooling, which ignores the sampling theorem and cannot adapt to different spherical signals (with different spectra) and tasks (dependent on different frequency components), thus suffering a significant information loss. Besides, signals reconstructed by the widely adopted padding-based unpooling may also undesirably alter the spectra of the original signals. To address these issues, we propose a novel framework of HS-CNNs with lifting structures to learn adaptive spherical wavelets for pooling and unpooling, named LiftHS-CNNs. Specifically, we learn spherical wavelets with a lifting structure to adaptively partition the input signal into low- and high-frequency sub-bands, with the down-scaled representations for pooling generated to preserve more information in the low-frequency sub-band. The lifting structure consists of learnable update and predict operators parameterized with graph attention to jointly consider the signal’s characteristics and underlying geometries. We then propose an unpooling operation invertible to the lifting-based pooling for restoring the up-scaled representations, which can well preserve spectral characteristics of the original signal. Particular properties (i.e., spatial locality, vanishing moments, and stability) of the learned wavelets and the information-preserving ability of the proposed pooling and unpooling are further studied. Experiments on benchmark spherical datasets for a wide range of tasks verify the superiority of our LiftHS-CNNs.
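
The lifting structure referred to in this abstract follows the classic split, predict, update pattern, and its inverse simply undoes the steps in reverse order. The sketch below shows this on a 1D even-length signal with fixed Haar-style operators. This is a deliberate simplification: the paper learns its predict and update operators with graph attention on spherical signals, which is not reproduced here.

```python
def lift_forward(x):
    # Split into even and odd samples (assumes len(x) is even).
    even, odd = x[0::2], x[1::2]
    # Predict: high-frequency detail is what the even sample fails to predict.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: low-frequency part becomes the pairwise average.
    smooth = [e + d / 2 for e, d in zip(even, detail)]
    return smooth, detail

def lift_inverse(smooth, detail):
    # Undo update, then undo predict, then interleave the samples.
    even = [s - d / 2 for s, d in zip(smooth, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out
```

Pooling would keep `smooth` as the down-scaled representation, and unpooling corresponds to `lift_inverse`, which reconstructs the input whenever the same predict and update operators are reused.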

Abstract:
Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.

Abstract:
Recent Neural Radiance Field (NeRF) methods on large-scale scenes have demonstrated promising results and underlined the importance of scene decomposition for scalable NeRFs. Although these methods achieved reasonable scalability, there are several critical problems remaining unexplored in the existing large-scale NeRF modeling methods, i.e., learnable decomposition, modeling scene heterogeneity, and modeling efficiency. In this paper, we introduce Switch-NeRF++, a Heterogeneous Mixture of Hash Experts (HMoHE) network that addresses these challenges within a unified framework. Our framework is a highly scalable NeRF that learns heterogeneous decomposition and heterogeneous Neural Radiance Fields efficiently for large-scale scenes in an end-to-end manner. In our framework, a gating network learns to decompose scenes into partitions and allocates 3D points to specialized NeRF experts. This gating network is co-optimized with the experts by our proposed Sparsely Gated Mixture of Experts (MoE) NeRF framework. Our network architecture incorporates a hash-based gating network and distinct heterogeneous hash experts. The hash-based gating efficiently learns the decomposition of the large-scale scene. The distinct heterogeneous hash experts consist of hash grids of different resolution ranges. This enables effective learning of the heterogeneous representation of different decomposed scene parts within large-scale complex scenes. These design choices make our framework an end-to-end and highly scalable NeRF solution for real-world large-scale scene modeling to achieve both quality and efficiency. We evaluate our accuracy and scalability on existing large-scale NeRF datasets. We also introduce a new dataset with very large-scale scenes (> 6.5 km²) from UrbanBIS. Extensive experiments demonstrate that our approach can be easily scaled to various large-scale scenes and achieve state-of-the-art scene rendering accuracy. Furthermore, our method exhibits significant efficiency gains, with an 8x acceleration in training and a 16x acceleration in rendering compared to the best-performing competitor Switch-NeRF.

Abstract:
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking task, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm and detect objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association task. Combining the strengths of both paradigms, we introduce ADA-Track++, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. We also propose an auxiliary token in this attention-based association module, which helps mitigate disproportionately high attention to incorrect association targets caused by attention normalization. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association task alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms.
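
The role of an auxiliary token in attention-based association can be illustrated with a small softmax example. Because softmax weights must sum to one, a query with no good match still spreads all of its attention over the available candidates; adding one extra logit gives that mass somewhere else to go. The fixed `aux_logit` of zero and the function name are assumptions for illustration, and the abstract does not describe how the paper parameterizes its auxiliary token.

```python
import math

def attention_with_aux(scores, aux_logit=0.0):
    # Append one auxiliary logit so the softmax is not forced to distribute
    # all attention over the real association candidates.
    logits = list(scores) + [aux_logit]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return weights[:-1], weights[-1]  # (candidate weights, auxiliary weight)
```

With uniformly poor scores such as `[-5.0, -6.0]`, a plain softmax would still give most of the attention to the first candidate, while the version above routes almost all of it to the auxiliary slot.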

Abstract:
Human beings have the ability to continuously analyze a video and immediately extract the motion components. We want to adopt this paradigm to provide a coherent and stable motion segmentation over the video sequence. In this perspective, we propose a novel long-term spatio-temporal model operating in a totally unsupervised way. It takes as input the volume of consecutive optical flow (OF) fields, and delivers a volume of segments of coherent motion over the video. More specifically, we have designed a transformer-based network, where we leverage a mathematically well-founded framework, the Evidence Lower Bound (ELBO), to derive the loss function. The loss function combines a flow reconstruction term involving spatio-temporal parametric motion models combining, in a novel way, polynomial (quadratic) motion models for the spatial dimensions and B-splines for the time dimension of the video sequence, and a regularization term enforcing temporal consistency on the segments. We report experiments on four VOS benchmarks, demonstrating competitive quantitative results while performing motion segmentation on a sequence in one go. We also highlight through visual results the key contributions on temporal consistency brought by our method.

Abstract:
Modern diffusion-based image generative models have made significant progress and become promising for enriching training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves mAP@.50:.95 by up to 25.3% with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. We also design an evaluation protocol based on COCO-2014 to validate the synthetic data of ODGEN in general domains and observe an advantage of up to 5.6% in mAP@.50:.95 against existing methods. In addition, we employ a series of large-scale object detection datasets to train a general model named Stable Box Diffusion, which covers thousands of object categories in most common scenes.

Abstract:
Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird’s-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D object detection method, with multi-guided global interaction and LiDAR-guided adaptive fusion, named MGAF. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth and spatial information. The designed semantic segmentation network captures category and orientation prior information for raw point clouds. Next, an Adaptive Fusion Dual Transformer (AFDT) is developed to adaptively enhance the interaction of different modal BEV features from both global and bidirectional perspectives. Meanwhile, additional downsampling with sparse height compression and a multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. Notably, the proposed AFDT is general and also shows superior performance on other models. Our framework has undergone extensive experimentation on the large-scale nuScenes dataset, Waymo Open Dataset, and long-range Argoverse2 dataset, consistently demonstrating state-of-the-art performance.

Abstract:
We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background with incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity.

Abstract:
Multi-view data encompasses various data types, including multi-feature, multi-sequence, and multi-modal data. Multi-view multi-label classification aims to leverage the rich semantic information contained in multiple views to achieve enhanced multi-label classification performance. In practical applications, the absence of views and labels poses a significant challenge to multi-view multi-label classification tasks. Premised on the assumption that shared semantic information across multiple views is sufficient to support the downstream task, we propose CTRL, a novel incomplete multi-view multi-label classification framework that addresses multi-view learning on data with partially missing views and missing labels. The core mechanism of CTRL lies in learning a high-purity, low-redundancy condensed representation that adequately captures the essential information of the original data. Specifically, we design a new objective loss to enhance the semantic information shared across views during joint representation learning while simultaneously suppressing intra-view redundant information that is irrelevant to the downstream task. This enables CTRL to extract task-relevant representations even when views are incomplete. Furthermore, we employ the Beta Evidential Neural Network to model the label distribution. This network is then integrated with Dempster-Shafer theory, enabling our model to perform label-level classification uncertainty estimation. This also allows us to use the estimated uncertainty and belief mass to create high-reliability pseudo-labels, resulting in further gains in model performance. Experimental results on multiple benchmark datasets demonstrate the superior performance of our proposed model in terms of accuracy, robustness, and reliability.
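
The link between Beta evidence, belief mass, and uncertainty mentioned in this abstract can be sketched with the standard binomial-opinion mapping from subjective logic: positive and negative evidence define a Beta distribution, and the total evidence determines how much mass is left as uncertainty. The prior weight of 2, the `max_uncertainty` threshold, and the function names are assumptions for illustration; the abstract does not state the paper's exact formulation or its pseudo-label rule.

```python
def beta_opinion(e_pos, e_neg):
    # Non-negative evidence for and against a single label.
    alpha, beta = e_pos + 1.0, e_neg + 1.0   # Beta parameters with a uniform prior
    strength = alpha + beta                  # equals e_pos + e_neg + 2
    belief = e_pos / strength
    disbelief = e_neg / strength
    uncertainty = 2.0 / strength             # belief + disbelief + uncertainty == 1
    expected_prob = alpha / strength
    return belief, disbelief, uncertainty, expected_prob

def pseudo_label(e_pos, e_neg, max_uncertainty=0.3):
    # Hypothetical rule: only emit a pseudo-label when uncertainty is low.
    belief, disbelief, uncertainty, _ = beta_opinion(e_pos, e_neg)
    if uncertainty > max_uncertainty:
        return None
    return 1 if belief > disbelief else 0
```

Little evidence in either direction produces high uncertainty and no pseudo-label, whereas strong one-sided evidence yields a confident label.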

Abstract:
Meta learning is a promising paradigm in the era of large models, and task distributional robustness has become an indispensable consideration in real-world scenarios. Recent advances have examined the effectiveness of tail task risk minimization in fast adaptation robustness improvement. This work contributes to more theoretical investigations and practical enhancements in the field. Specifically, we reduce the distributionally robust strategy to a max-min optimization problem, constitute the Stackelberg equilibrium as the solution concept, and estimate the convergence rate. Under certain scenarios, we incorporate the diversity regularizer into the acquisition criteria design during active subset selection and further improve meta learners’ comprehensive generalization under tail risk minimization. In the presence of tail risk, we further derive the generalization bound, establish connections with estimated quantiles, systematically analyze the diversity regularizer’s impacts, and practically improve the studied strategy. Accordingly, extensive evaluations on tasks such as few-shot sinusoid regression, system identification, image classification, and meta reinforcement learning, along with experiments on multimodal large models, demonstrate the significance, robustness and scalability of our proposal.
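
Tail task risk is commonly formalized as the average loss over the worst-performing fraction of tasks, in the style of conditional value-at-risk. The sketch below computes this empirical quantity from a list of per-task losses; treating the paper's tail risk as this particular estimator is an assumption, as is the parameter name `alpha` for the tail fraction.

```python
import math

def tail_risk(task_losses, alpha):
    # Mean loss over the worst alpha-fraction of tasks (empirical CVaR-style
    # estimate, assumed here). alpha = 1.0 recovers the ordinary average loss.
    k = max(1, math.ceil(alpha * len(task_losses)))
    worst = sorted(task_losses, reverse=True)[:k]
    return sum(worst) / k
```

Minimizing this quantity instead of the plain mean focuses adaptation on the hardest tasks in a batch, which is the intuition behind distributionally robust meta learning.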

Abstract:
Recent advances in self-supervised image denoising have highlighted the potential of Blind-Spot Networks (BSNs). However, existing methods suffer from three major limitations: (1) Their effectiveness in real-world scenarios is limited by strong assumptions, such as noise independence, which rarely hold in practice. (2) While sampling-based strategies can partially improve performance, BSNs inherently suffer from information loss caused by centroid masking, and removing the blind spot leads to noise overfitting, both of which hinder denoising performance. (3) Sampling-based methods often introduce checkerboard artifacts, yet existing studies typically overlook the fundamental differences between these artifacts and real noise. To address these issues, we propose a novel self-supervised denoising framework, Dual Double-Sampling with Random Sub-samples Generation (D2S-RSG-SSD). To address Limitation 1, we introduce a sampling-based framework that breaks noise dependence by combining Random Sub-samples Generation (RSG) with a cross-paired loss $\mathcal{L}_{\mathrm{RSG}}$. RSG generates diverse sub-samples with inherent variance, referred to as sampling differences, which serve as natural perturbations to augment training data and disrupt spatial noise correlations. The proposed loss function ensures full utilization of these sub-samples while stabilizing optimization. To address Limitation 2, we propose a Dual Double-Sampling (D2S) strategy with fixed sampling patterns and a dual-branch architecture. This design reduces reliance on pixel-level information and leverages complementary features to mitigate both noise overfitting and information loss. A key advantage is its compatibility with various advanced denoising networks, lifting the constraint of using BSNs in self-supervised settings. Additionally, we introduce a fixed sub-image sampling strategy to prevent pattern collapse during inference and ensure stability. To address Limitation 3, we explicitly differentiate checkerboard artifacts from real noise and develop a dedicated artifact remover to correct pixel discontinuities caused by sampling-based operations. This design preserves fine image details while reducing over-smoothing. Experiments on benchmark real-noise datasets and self-captured noisy images demonstrate the robustness and generalizability of our framework, achieving better performance than existing methods.

Abstract:
Blind Face Restoration (BFR) aims to reconstruct high-quality face images from low-quality inputs without any prior knowledge of degradation types or levels. Recent advances, particularly through GAN- and diffusion-based approaches, have greatly improved perceptual realism and reconstruction fidelity. However, existing approaches typically rely solely on visual cues from degraded images. This often results in inaccurate reconstruction of facial details and noticeable identity distortion, particularly under severe or complex degradations. To address these limitations, we incorporate auxiliary textual information into BFR to facilitate the recovery of subtle facial attributes, such as wrinkles and moles, which are often overlooked by conventional visual priors. To support this idea, we first construct a large-scale dataset containing 30,000 detailed textual descriptions paired with CelebA-HQ images to capture fine-grained facial semantics. To bridge the gap between visual data and natural language, we further propose FaceCLIP, a fine-tuned vision-language model specifically tailored to human faces, enabling more accurate image-text alignment by capturing nuanced semantic cues critical for faithful reconstruction. Built upon these foundations, we propose Text-guided Blind Face Restoration (TBFR), a diffusion-based framework that explicitly integrates textual guidance into the restoration process. Within TBFR, a text-guided hybrid attention block fuses visual and textual features, and a text-aware loss enforces semantic consistency. Extensive experiments demonstrate that TBFR outperforms state-of-the-art BFR methods in both quantitative metrics and perceptual quality, establishing a new benchmark for BFR tasks.

Abstract:
Ensuring the safety of environmental exploration is a critical problem in reinforcement learning (RL). While limiting exploration to a feasible zone has become widely accepted as a way to ensure safety, key questions remain unresolved: what is the maximum feasible zone achievable through exploration, and how can it be identified? This paper, for the first time, answers these questions by revealing that the goal of safe exploration is to find the equilibrium between the feasible zone and the environment model. This conclusion is based on the understanding that these two components are interdependent: a larger feasible zone leads to a more accurate environment model, and a more accurate model, in turn, enables exploring a larger zone. We propose the first equilibrium-oriented safe exploration framework called safe equilibrium exploration (SEE), which alternates between finding the maximum feasible zone and the least uncertain model. Using a graph formulation of the uncertain model, we prove that the uncertain model obtained by SEE is monotonically refined, the feasible zones monotonically expand, and both converge to the equilibrium of safe exploration. Experiments on classic control tasks show that our algorithm successfully expands the feasible zones with zero constraint violation, and achieves the equilibrium of safe exploration within a few iterations.

Abstract:
In human pose estimation, a comprehensive evaluation of state-of-the-art frameworks is necessary to advance both research and practical applications. This paper presents a thorough review of state-of-the-art 2D and 3D human pose estimation frameworks, analyzing 118 papers and four GitHub repositories, with a focus on frameworks released since 2019. The following frameworks are chosen based on predefined inclusion criteria: AlphaPose, Detectron2, MediaPipe, MeTRAbs, MHFormer, MMPose, MoveNet, OpenPifPaf, OpenPifPaf-vita, OpenPose, PoseFormerV2, rtmlib, StridedTransformer-Pose3D, ultralytics (YOLOv8), ViTPose, and YOLOv7. This paper evaluates these 16 frameworks on an existing, unpublished dataset consisting of exercise videos recorded with a monocular RGB camera and synchronized gold-standard motion capture data. The dataset includes videos of nine individuals performing eight exercises, recorded from two camera views with different planar angles. The analysis evaluates joint angle performance of the frameworks using weighted mean absolute error and weighted intraclass correlation coefficient as quantitative metrics. MeTRAbs emerged as the best overall framework, while AlphaPose, rtmlib, and YOLOv7 were the top 2D performers. The code used is available as open source.

Abstract:
Most machine learning methods assume fixed probability distributions, limiting their applicability in nonstationary real-world scenarios. While continual learning methods address this issue, current approaches often rely on closed-box models or require extensive user intervention for interpretability. We propose SyMPLER (Systems Modeling through Piecewise Linear Evolving Regression), an explainable model for time series forecasting in nonstationary environments based on dynamic piecewise-linear approximations. Unlike other locally linear models, SyMPLER uses generalization bounds from Statistical Learning Theory to automatically determine when to add new local models based on prediction errors, eliminating the need for explicit clustering of the data. Experiments show that SyMPLER can achieve comparable performance to both closed-box and existing explainable models while maintaining a human-interpretable structure that reveals insights about the system’s behavior. In this sense, our approach conciliates accuracy and interpretability, offering a transparent and adaptive solution for forecasting nonstationary time series.

Abstract:
With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With the augmented domain, we introduce two constraints for generalizable representation learning: generalized masked sparsity invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols utilizing four real-world datasets and implement several baseline methods. Extensive experiments demonstrate our approach outperforms both UDA and DG baselines.

Abstract:
With the continuous growth in the number of parameters of the Transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter-Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, extensive experiments are conducted using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs.

Abstract:
Large Vision-Language Models (LVLMs) suffer from “multimodal distractibility,” where plausible but irrelevant visual or textual inputs cause significant drops in reasoning consistency and lead to unreliable outputs. This paper introduces a comprehensive framework to systematically diagnose, evaluate, and mitigate this critical challenge. We present three core components: the large-scale IR-VQA benchmark to surface these vulnerabilities across four paradigms; novel diagnostic metrics, Positive Consistency (PC) and Negative Consistency (NC), which move beyond standard accuracy to rigorously measure a model’s reasoning stability; and the Relevance-Gated Multimodal Routing (RGMR) mechanism, a novel, lightweight module that proactively and dynamically filters distractions at inference time. Our experiments reveal that state-of-the-art models exhibit significant drops in consistency on IR-VQA. We demonstrate that finetuning on IR-VQA and deploying RGMR substantially improve model robustness where standard prompting fails. Our comprehensive analysis of model behaviors under different types of distractions and the underlying reasoning failures provides a clear path forward for developing more reliable multimodal systems.

Abstract:
Adversarial imitation learning (AIL), a prominent approach in imitation learning, has achieved significant practical success powered by neural network approximation. However, existing theoretical analyses of AIL are primarily confined to simplified settings—such as tabular and linear function approximation—and involve complex algorithmic designs that impede practical implementation. This creates a substantial gap between theory and practice. This paper bridges this gap by exploring the theoretical underpinnings of online AIL with general function approximation. We introduce a novel framework called optimization-based AIL (OPT-AIL), which performs online optimization for reward learning coupled with optimism-regularized optimization for policy learning. Within this framework, we develop two concrete methods: model-free OPT-AIL and model-based OPT-AIL. Our theoretical analysis demonstrates that both variants achieve polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, they represent the first provably efficient AIL methods under general function approximation. From a practical standpoint, OPT-AIL requires only the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods across several challenging tasks.

Abstract:
Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require a predefined number of clusters K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. These limitations motivate us to study a challenging yet practical problem: deep graph clustering without K, considering the imbalance found in real graphs. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a differentiable structural information, generalizing the discrete formalism to the continuous realm, which allows us to design a hyperbolic deep model (LSEnet) to learn the neural partitioning tree in the Lorentz model of hyperbolic space. Theoretically, we demonstrate its capability in clustering without requiring K and identifying minority clusters in imbalanced graphs. Second, we refine hyperbolic representations of the partitioning tree, enhancing graph semantics, for better clustering. Contrastive learning for tree structures is non-trivial and costs quadratic complexity. Instead, we further advance our theory by discovering an interesting fact that structural entropy indeed bounds the tree contrastive loss. Finally, with an efficient reformulation, we approach graph clustering through a novel augmented structural information learning (ASIL), which offers a simple yet effective objective of augmented structural entropy that seamlessly integrates hyperbolic partitioning tree construction and contrastive learning. With a provable improvement in graph conductance, ASIL achieves effective debiased graph clustering in linear complexity with respect to the graph size.
Extensive experiments show that ASIL outperforms 20 strong baselines by an average of +12.42% in NMI on the Citeseer dataset.
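
For readers unfamiliar with graph conductance, the quantity mentioned in the abstract above, the following is a minimal sketch of its standard definition for an unweighted, undirected graph: the number of edges leaving a cluster divided by the smaller of the two volumes (sums of degrees). It illustrates the textbook measure only and is not the authors' ASIL objective; the function name and the edge-list input format are assumptions.

```python
def conductance(edges, cluster):
    """Conductance of `cluster` in an undirected, unweighted graph.

    edges   : iterable of (u, v) pairs, each undirected edge listed once
    cluster : collection of node ids forming one side of the cut
    """
    cluster = set(cluster)
    cut = 0       # edges with exactly one endpoint inside the cluster
    vol_in = 0    # sum of degrees of nodes inside the cluster
    vol_out = 0   # sum of degrees of nodes outside the cluster
    for u, v in edges:
        u_in, v_in = u in cluster, v in cluster
        vol_in += int(u_in) + int(v_in)
        vol_out += int(not u_in) + int(not v_in)
        if u_in != v_in:
            cut += 1
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
phi = conductance(edges, {0, 1, 2})  # 1 cut edge / volume 7
```

A lower value indicates a better-separated cluster; here the single bridge edge gives a conductance of 1/7.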

Abstract:
Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and material science. Though transformer-based language models have shown great potential in accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal the molecular structure-property relationships. In this work, we develop a framework for explainable molecular property prediction based on language models, dubbed as Lamole, which can provide chemical concepts-aligned explanations. We take a string-based molecular representation — Group SELFIES — as input tokens to pre-train and fine-tune our Lamole, as it provides chemically meaningful semantics. By disentangling the information flows of Lamole, we propose considering both self-attention weights and gradients for better quantification of each chemically meaningful substructure's impact on the model's output. To make the explanations more faithful to the structure-property relationship, we then carefully craft a marginal loss to explicitly optimize the explanations to align with the chemists’ annotations. We bridge the manifold hypothesis with the elaborated marginal loss to prove that the loss can align the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results over eight datasets demonstrate Lamole can achieve comparable prediction accuracy and boost the explanation accuracy by up to 14.3%, being the state-of-the-art in explainable molecular property prediction. To further illustrate the actionable utility of the explanations derived from Lamole, we integrated the framework with an evolutionary algorithm. This integration established an interpretable optimization pipeline for molecular editing, demonstrating that Lamole functions beyond simple post-hoc analysis but serves as a practical guide for molecule discovery.

Abstract:
Claude Monet’s late paintings of Water Lilies exhibit stylistic transformations that are often characterized by art historians as increasingly abstract and gesturally expressive. However, it remains challenging to define and systematically identify this stylistic shift. Here, we introduce a machine learning framework for analyzing Monet’s evolving brushwork using streamline curves: computational representations that capture the dynamic movement patterns inherent in brushstrokes. From 554 image patches sampled from 47 paintings spanning early (pre-1913) and later (post-1913) periods of Monet’s output, we extract streamlines and compute geometric features for each, including smoothness of curvature and directional variability. Each image is represented as a set of streamline feature vectors, a data type referred to as distributional. A new deep neural network architecture named Composition to Attribute (C2A) is designed for classifying distributional data. We hypothesize that Monet’s so-called ‘abstract’ style does not uniformly characterize all late-period Water Lilies, and that non-abstract flowers, regardless of period, share similar brushwork qualities. Under these assumptions, building on C2A, we propose a novel learning paradigm named Discover Embedded Group with Asymmetry (DEGA) which enforces a shared distribution of DNN-extracted features for non-abstract flower patches across both periods while distinguishing the abstract ones. DEGA reveals a meaningful two-dimensional feature space, where one dimension differentiates abstract from mimetic Water Lilies, while the other separates abstract flowers from close-up flowers of the early period. Our findings suggest that the so-called ‘abstract’ qualities of Monet’s late style retain certain visual affinities with his earlier approach to depicting close-up floral motifs. 
When this brushwork is used in more expansive scenes, the depiction of flowers shifts away from realistic renderings of individual petals toward a looser, more allusive expression, conveying a sense of floral presence rather than botanical detail. This study highlights the value of computational analysis for a more accurate understanding of an artist’s stylistic development.

Abstract:
In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical–practical gap, we defined “good representation learning” as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.

Affiliations: School of Atmospheric Sciences, Sun Yat-sen University, Guangdong, China; School of Engineering Science, University of Science and Technology of China, Hefei, China; School of Computer Science, Wuhan University, Wuhan, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; University of Tokyo, Tokyo, Japan; KU Leuven, Leuven, Belgium; School of Electronics and Communication Engineering, Sun Yat-Sen University, Guangzhou, China; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China

Abstract:
Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.

Abstract:
Vision-Language Transformers (VLTs) have achieved remarkable success, yet their high computational costs remain challenging due to numerous input tokens and large model parameters. Existing VLT compression methods primarily rely on single-modality-based token pruning or coarse-grained weight pruning techniques. However, these methods face significant obstacles, such as ignoring the critical alignment of different modalities and lacking layer-wise dynamic token pruning flexibility, exhibiting inevitable performance degradation due to coarse-grained weight pruning, and struggling with the simultaneous compression of both input tokens and model parameters. To address those limitations, we propose MADTP++, a novel approach that integrates custom-made token and weight pruning processes into a unified framework, achieving superior compression in both parameter counts and computational costs. Specifically, for the token pruning process, we introduce the Multi-modality Alignment Guidance (MAG) module and the Dynamic Token Pruning (DTP) module to align semantic features across different modalities and guide the dynamic elimination of redundant tokens based on different input instances. For the weight pruning process, we propose a Hardware-aware Weight Pruning (HWP) module that leverages the Sparse Tensor Cores across diverse hardware setups to enable fine-grained parameter pruning within VLTs. To further unify token and weight pruning, we also propose a Cooperative Optimization Training Strategy that automatically allocates GFLOPs and parameter reductions per branch before pruning and employs Knowledge Distillation Constraints to facilitate joint optimization of both pruning dimensions. Extensive experiments conducted on various VLT models and datasets demonstrate that MADTP++ can significantly reduce model parameters and computational costs while maintaining competitive performance.

Abstract:
Domain adaptation and generalization are crucial for real-world applications, such as autonomous driving and medical imaging where the model must operate reliably across environments with distinct data distributions. However, these tasks are challenging because the model needs to overcome various domain gaps caused by variations in, for example, lighting, weather, sensor configurations, and so on. Addressing domain gaps simultaneously in different modalities, known as multimodal domain adaptation and generalization, is even more challenging due to unique challenges in different modalities. Over the past few years, significant progress has been made in these areas, with applications ranging from action recognition to semantic segmentation, and more. Recently, the emergence of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired numerous research studies, which leverage these models to enhance downstream adaptation and generalization. This survey summarizes recent advances in multimodal adaptation and generalization, particularly how these areas evolve from traditional approaches to foundation models. Specifically, this survey covers (1) multimodal domain adaptation, (2) multimodal test-time adaptation, (3) multimodal domain generalization, (4) domain adaptation and generalization with the help of multimodal foundation models, and (5) adaptation of multimodal foundation models. For each topic, we formally define the problem and give a thorough review of existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions.

Abstract:
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet.

Abstract:
Recently, the mainstream practice for training low-light raw image denoising methods has shifted towards employing synthetic data. Noise modeling, which focuses on characterizing the noise distribution of real-world sensors, profoundly influences the effectiveness and practicality of synthetic data. Currently, physics-based noise modeling struggles to characterize the entire real noise distribution, while learning-based noise modeling impractically depends on paired real data. In this paper, we propose a novel strategy: learning the noise model from dark frames instead of paired real data, to break down the data dependency. Based on this strategy, we introduce an efficient physics-informed noise neural proxy (PNNP) to approximate the real-world sensor noise model. Specifically, we integrate physical priors into neural proxies and introduce three efficient techniques: physics-guided noise decoupling (PND), physics-aware proxy model (PPM), and differentiable distribution loss (DDL). PND decouples the dark frame into different components and handles different levels of noise flexibly, which reduces the complexity of noise modeling. PPM incorporates physical priors to constrain the synthetic noise, which promotes the accuracy of noise modeling. DDL provides explicit and reliable supervision for noise distribution, which promotes the precision of noise modeling. PNNP exhibits powerful potential in characterizing the real noise distribution. Extensive experiments on public datasets demonstrate superior performance in practical low-light raw image denoising. The source code will be publicly available at https://fenghansen.github.io/publication/PNNP.

Abstract:
Event cameras are dynamic vision sensors inspired by the biological retina, offering high dynamic range, high temporal resolution, and low power consumption. These qualities allow them to perceive 3D environments even in extreme conditions. Event data is continuously recorded over time, capturing pixel movements in detail. To leverage this temporal density, we introduce a temporal event stereo framework that continuously uses past information. The event stereo matching network is jointly trained with stereoscopic flow, which tracks pixel movements from stereo cameras. Instead of relying on optical flow ground truth, our method trains motion flows using disparity maps. The temporal aggregation of information via stereoscopic flow boosts stereo matching performance, achieving state-of-the-art results on MVSEC, DSEC, M3ED, and EVIMO2 datasets. Our method also demonstrates computational efficiency by stacking past data in a cascading manner.

Abstract:
Pedestrian behavior exhibits inherent multi-modality, necessitating predictions that balance accuracy and diversity to adapt effectively to various complex scenarios. However, conventional noise addition in diffusion models is often aimless and unguided, leading to redundant noise reduction steps and the generation of uncontrollable samples. To address these issues, we propose a Prior Condition-Guided Diffusion Model (CGD-TraP) for multi-modal pedestrian trajectory prediction. Instead of directly adding Gaussian noise to trajectories at each timestep during the forward process, our approach leverages internal intention and external interaction to guide noise estimation. Specifically, we design two specialized modules to extract and aggregate intention and interaction features. These features are then adaptively fused through a spatial-temporal fusion based on selective state space, which estimates a controllable noisy trajectory distribution. By optimizing the noise addition process in a more controlled and efficient manner, our method ensures that the denoising process is effectively guided, resulting in predictions that are both accurate and diverse. Extensive experiments on the ETH-UCY, SDD, and NBA datasets demonstrate that CGD-TraP surpasses state-of-the-art diffusion-based and other generative methods, achieving superior efficiency, accuracy, and diversity.

Abstract:
Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to the flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object’s canonical space – including unobserved regions in camera space – significantly boosting object pose estimation performance in challenging scenarios like highly occluded objects and novel shapes. Despite progress, predicting canonical coordinates for unobserved camera-space regions remains challenging due to the lack of direct observational signals. This necessitates heavy reliance on the model’s generalization ability, resulting in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network’s accuracy and training efficiency. The PIPS strategy is implemented with a PIPS estimation network which generates sparse sample points with distinctive features capable of determining all object pose DoFs with high certainty. To collect the training data of the PIPS estimation network, we propose to automatically generate the pseudo ground-truth with a teacher model. Our method outperforms the state-of-the-art on three pose estimation datasets. 
It achieves 0.63 in the 5°2 cm metric on NOCS-REAL275, 0.62 in the 5°5 cm metric on ShapeNet-C, and 77.3 in the AR metric on LineMOD-O. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.
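
The degree-and-centimetre pose metrics quoted above count a prediction as correct when its rotation error and its translation error both fall under the stated thresholds. The sketch below shows one common way to compute this, assuming rotations are 3x3 matrices given as nested lists and translations are in metres. The function names are illustrative, and benchmark implementations typically treat symmetric objects specially, which this sketch omits.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres, assuming inputs are in metres."""
    return 100.0 * math.dist(t_pred, t_gt)


def pose_accuracy(preds, gts, max_deg=5.0, max_cm=2.0):
    """Fraction of (R, t) predictions within both thresholds."""
    hits = 0
    for (R_p, t_p), (R_g, t_g) in zip(preds, gts):
        if (rotation_error_deg(R_p, R_g) <= max_deg
                and translation_error_cm(t_p, t_g) <= max_cm):
            hits += 1
    return hits / len(preds)
```

Calling `pose_accuracy(preds, gts, 5.0, 2.0)` corresponds to a 5°2 cm score, and `pose_accuracy(preds, gts, 5.0, 5.0)` to a 5°5 cm score, under the assumptions stated above.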

Abstract:
As two mainstream frameworks in federated learning (FL), both centralized and decentralized approaches have shown great application value in practical scenarios. However, existing studies do not provide the FL community with sufficient evidence or clear guidance on which performs better. Although decentralized methods have been proven to achieve convergence comparable to centralized methods with less communication, their test performance always falls short of expectations in empirical studies. To comprehensively and fairly compare their efficiency gaps in FL, in this paper, we explore their stability and generalization efficiency. Specifically, we prove that on the general smooth non-convex objectives, 1) centralized FL (CFL) always generalizes better than decentralized FL (DFL); 2) CFL achieves the best performance via adopting partial participation instead of full participation; and, 3) there is a necessary requirement for the topology in DFL to avoid performance collapse as the training scale increases. We also conduct extensive experiments on several common setups in FL to validate that our theoretical analysis is consistent with experimental phenomena and contextually valid in several general and practical scenarios.

Abstract:
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, an NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the “same instance, similar representation” assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).

Abstract:
Recent advances in Neural Architecture Search (NAS) are essentially attributed to Performance Estimation (PE), i.e., a method that aims to effectively estimate the performance of an architecture. Meanwhile, Kendall’s τ is well recognized as the principled evaluation criterion for PE strategies in the literature. We argue that Kendall’s τ is not the optimal solution. Through extensive experiments and theoretical analysis, we take the initiative to reveal the problem behind Kendall’s τ and propose a novel criterion named Minimum Keeping Ratio (MKR), which is closely connected to the final performance of NAS. It allows us to compare different PE approaches from a unified perspective, and to use effective ablation studies to verify common beliefs and key differences of PE strategies. Based on the findings from MKR, we are able to derive a simple NAS method by integrating different PE strategies with random sampling. Such a method shows very strong performance in efficiency and effectiveness through extensive experiments on different challenging benchmarks. In particular, our simple random sampling NAS finds the optimal architecture in NASbenchMacro, NASbench201, and NASbench301. It also generalizes well to different search spaces (MobileNet) and tasks (semantic segmentation), finding an architecture that surpasses the previous state-of-the-art architectures by 4.25 mIoU under 600M FLOPs on ADE20K. Code is available at https://anonymous.4open.science/r/Anonymization11264.
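
As background for the argument above, the sketch below computes Kendall's rank correlation (the tau-a variant, which does not correct for ties) between estimated and true architecture scores, and shows a toy case in which two estimators obtain the same correlation although only one of them ranks the best architecture first. The scores are made up for illustration, and this is not the MKR criterion, whose definition the abstract does not give.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical true scores of four architectures (higher is better).
true_scores = [1, 2, 3, 4]
est_a = [1, 2, 4, 3]  # swaps the two best architectures
est_b = [2, 1, 3, 4]  # swaps the two worst architectures

tau_a = kendall_tau(est_a, true_scores)
tau_b = kendall_tau(est_b, true_scores)


def picks_best(estimate, truth):
    """True if the estimator's top-ranked architecture is the truly best one."""
    return estimate.index(max(estimate)) == truth.index(max(truth))
```

Both estimators have one discordant pair out of six and therefore the same correlation, yet only `est_b` selects the truly best architecture, which is one way to see why a global rank correlation need not track the final search outcome.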

Abstract:
Contemporary deep learning approaches for optical flow estimation continue to face persistent challenges in model interpretability, generalization capacity, and deployment efficiency, significantly constraining their practical implementation. This limitation becomes particularly critical in applications such as visual odometry (VO), where precise sparse point tracking supersedes the conventional emphasis on dense optical flow accuracy. Moreover, the lack of a joint framework combining keypoint detection and optical flow estimation limits sparse optical flow performance. To address these fundamental issues, we propose a novel dual-task imperative learning framework that synergistically optimizes sparse optical flow estimation (iFLOW) with adaptive keypoint detection (iPOINT). Our methodology implements an Expectation-Maximization (EM) paradigm where iFLOW and iPOINT undergo alternating optimization through a Gauss-Newton reasoning engine. This innovative architecture leverages convolutional feature advantages under the generalized feature invariance principle. The resulting imperative learning mechanism imbues our framework with enhanced interpretability and cross-domain adaptability while maintaining computational efficiency. Through comparative evaluations against classical and learning-based baselines, our ultra-compact models (0.05M parameters for iFLOW, 0.09M for iPOINT) demonstrate remarkable performance across multiple metrics (End-point Error, F1-all, VO trajectory accuracy) despite requiring only 200 training image pairs.

Abstract:
In this paper, we introduce a general framework for analyzing the numerical conditioning of minimal problems in multiple view geometry, using tools from computational algebra and Riemannian geometry. Special motivation comes from the fact that relative pose estimation, based on standard 5-point or 7-point Random Sample Consensus (RANSAC) algorithms, can fail even when no outliers are present and there is enough data to support a hypothesis. We argue that these cases arise due to the intrinsic instability of the 5- and 7-point minimal problems. We apply our framework to characterize the instabilities, both in terms of the world scenes that lead to infinite condition number, and directly in terms of ill-conditioned image data. The approach produces computational tests for assessing the condition number before solving the minimal problem. Lastly, synthetic and real data experiments suggest that RANSAC serves not only to remove outliers, but in practice it also selects for well-conditioned image data, which is consistent with our theory.

Abstract:
When applying Reinforcement Learning (RL) algorithms to vision-based tasks, the significant variations between training and actual working environments pose a challenge to their generalization capability. While previous methods can enhance the generalization of the base RL algorithm, they are often limited to cases with minor changes between training and working environments. In this paper, we propose an effective auxiliary task called Jacobian Matrix Meets Masked Contrastive Learning (J-Mac), which aims to enhance the base RL algorithm’s generalization capability even when there are significant changes between training and working environments. Specifically, we learn the correlations between visual states via transition dynamic learning. Meanwhile, on this basis, we eliminate task-irrelevant features from the representation of the visual state via masked contrastive learning. Extensive experiments demonstrate that our approach significantly improves the generalization of various base RL algorithms, outperforming other state-of-the-art methods across different vision-based benchmarks.

Abstract:
Rotation-invariant recognition of shapes is a common challenge in computer vision. Recent approaches have significantly improved the accuracy of rotation-invariant recognition by encoding the rotational invariance of shapes as hand-crafted image features and introducing deep neural networks. However, pixel-based methods carry too much redundant information, and the critical geometric information is prone to early leakage, resulting in weak rotation-invariant recognition of fine-grained shapes. In this paper, we reconsider the shape recognition problem from the perspective of contour points rather than pixels. We propose an anti-noise rotation-invariant convolution module based on contour geometric awareness for fine-grained shape recognition. The module divides the shape contour into multiple local geometric regions (LGA), where we implement finer-grained rotation-invariant coding in terms of point topological relations. We provide a deep network composed of five such cascaded modules for classification and retrieval experiments. The results show that our method exhibits excellent performance in rotation-invariant recognition of fine-grained shapes. In addition, we demonstrate that our method is robust to contour noise and the rotation centers.

Abstract:
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer’s egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our new evaluation metrics demonstrate that MADiff predicts hand trajectories that are comparably reasonable to those of the state-of-the-art baselines.

Abstract:
Scene Graph Generation (SGG) is a critical cross-modal task for scene understanding, which aims to detect visual relations in an image. Most SGG methods are significantly affected by highly skewed long-tailed bias, and prefer predicates with sufficient samples regardless of the semantic accuracy. Current unbiased SGG methods focus on compensating for the imbalanced long-tailed distribution, but they are fragile to dataset changes. The fundamental cause of this problem is the limited generalization ability; thus, the diversity of classes needs to be modeled explicitly. By imitating human cognition, a Grounded Cognition Method (GCM) for unbiased scene graph generation is proposed here, where the simulation, bodily states, and situated action are modeled. For simulations, an Out Domain Knowledge Injection module is proposed to expand the model’s visual perception by reducing the reliance on an isolated class. Meanwhile, a Semantic Group Aware Synthesizer is proposed for linguistic perception modeling by categorizing specific predicate classes into a high-level semantic group. For bodily states, the modalities are erased separately to imitate the limited state of physical senses, which forces the model to rely on the remaining modality to compensate for the understanding of the whole scene. For situated actions, a Shapley Enhanced Multimodal Counterfactual module is proposed to model the dynamic interaction with the environment and cope with diverse contexts. Experiments on Visual Genome, GQA, and Open Images V6 demonstrate the effectiveness of our GCM, which outperforms state-of-the-art methods and achieves a better trade-off.

Abstract:
Recent years have witnessed the rapid progress and broad application of diffusion probabilistic models (DPMs). Sampling from DPMs can be viewed as solving an ordinary differential equation (ODE). Despite the promising performance, the generation of DPMs usually consumes much time due to the large number of function evaluations (NFE). Though recent works have accelerated the sampling to around 20 steps with high-order solvers, the sample quality with fewer than 10 NFE can still be improved. In this paper, we propose a unified sampling framework (USF++) to study the optional strategies for solvers. Under this framework, we further reveal that taking different solving strategies at different timesteps may help further decrease the truncation error, and a carefully designed solver schedule has the potential to improve the sample quality by a large margin. Therefore, we propose a new sampling framework based on the exponential integral formulation that allows free choices of solver strategy at each step and design specific decisions for the framework. Moreover, we apply evolutionary search to find outstanding solver schedules which outperform the state-of-the-art sampling methods on the CIFAR-10, ImageNet, and LSUN-Bedroom datasets. Specifically, we achieve 3.89 FID with 5 NFE on the CIFAR-10 dataset and 8.62 FID with 3 NFE on the LSUN-Bedroom dataset, outperforming the SOTA method significantly. We further apply the search to the Stable-Diffusion model and obtain an acceleration ratio of 2×, showing the feasibility of sampling in very few steps without retraining the neural network.

Abstract:
Unpaired image restoration (UIR) is a significant task due to the difficulty of acquiring paired degraded/clear images with identical backgrounds. In this paper, we propose a novel UIR method based on the assumption that an image contains both degradation-related features, which affect the level of degradation, and degradation-unrelated features, such as texture and semantic information. Our method aims to ensure that the degradation-related features of the restoration result closely resemble those of the clear image, while the degradation-unrelated features align with the input degraded image. Specifically, we introduce a Feature Orthogonalization Module optimized on Stiefel manifold to decouple image features, ensuring feature uncorrelation. A task-driven Depth-wise Feature Classifier is proposed to assign weights to uncorrelated features based on their relevance to degradation prediction. To avoid the dependence of the training process on the quality of the clear image in a single pair of input data, we propose to maintain several degradation-related proxies describing the degradation level of clear images to enhance the model’s robustness. Finally, a weighted PatchNCE loss is introduced to pull degradation-related features in the output image toward those of clear images, while bringing degradation-unrelated features close to those of the degraded input.

Abstract:
In this paper, we introduce a novel framework for creating multimodal interactive digital twin characters from dialogue videos of TV shows. Specifically, these digital twin characters are capable of responding to user inputs with harmonious textual, vocal, and visual content. They not only replicate the external characteristics, such as appearance and tone, but also capture internal attributes, including personality and habitual behaviors. To support this ambitious task, we collect the Multimodal Character-Centric Conversation Dataset, named MCCCD, which includes character-specific and high-quality multimodal dialogue data with detailed annotations, featuring 6.8k utterances and 4.6 hours of audio/video per character. Notably, the MCCCD dataset is approximately ten times larger than existing datasets in terms of per-character data volume, facilitating the detailed modeling of complex character-centric traits. Further, we propose a baseline framework to create digital twin characters, which consists of dialogue generation through large language models, voice generation via speech synthesis models, and visual representation with 3D talking head models. Experimental results demonstrate that our approach significantly outperforms existing methods in generating consistent and character-specific responses, setting a new benchmark for digital character creation. Our collected dataset and proposed baseline have paved the way for the creation of highly interactive and natural digital avatars, opening the door to extensive and practical applications of digital humans.

Abstract:
Neural implicit functions, including signed distance functions (SDFs) and unsigned distance functions (UDFs), have shown powerful ability in fitting shape geometry. However, inferring continuous distance fields from discrete unoriented point clouds still remains a challenge. The neural network typically fits the shape with a rough surface and omits fine-grained geometric details such as shape edges and corners. In this paper, we propose a novel non-linear implicit filter to smooth the implicit field while preserving high-frequency geometry details. Our novelty lies in that we can filter the surface (zero level set) by the neighbor input points with gradients of the signed distance field. By moving the input raw point clouds along the gradient, our proposed implicit filtering can be extended to non-zero level sets to keep consistency between different level sets, which consequently results in a better regularization of the zero level set. Since the unsigned distance function is non-differentiable at the zero level set and lacks a stable gradient field, we further propose a gradient immutable training schema to migrate the filter to the unsigned distance function learned from point clouds. By leveraging the UDF training schema, we also improve sparse-view reconstruction results. We conduct comprehensive experiments in surface reconstruction from objects, complex scene point clouds, and multi-view images, and we further extend to the point normal estimation and point cloud upsampling tasks. The numerical and visual comparisons demonstrate our improvements over the state-of-the-art methods under the widely used benchmarks.

Abstract:
Semantic segmentation of remote sensing imagery (RSI) is a fundamental task that aims at assigning a category label to each pixel. To pursue precise segmentation with one or more fine-grained categories, semantic segmentation often requires holistic segmentation of whole-scene RSI (WRI), which is normally characterized by a large size. However, conventional deep learning methods struggle to handle holistic segmentation of WRI due to the memory limitations of the graphics processing unit (GPU), thus requiring suboptimal strategies such as cropping or fusion, which result in performance degradation. Here, we introduce the Robust End-to-end semantic Segmentation architecture for whole-scene remoTe sensing imagery (REST). REST is the first intrinsically end-to-end framework for truly holistic segmentation of WRI, supporting a wide range of encoders and decoders in a plug-and-play fashion. It enables seamless integration with mainstream semantic segmentation methods, and even more advanced foundation models. Specifically, we propose a novel spatial parallel interaction mechanism (SPIM) within REST to overcome GPU memory constraints and achieve global context awareness. Unlike traditional parallel methods, SPIM enables REST to process a WRI effectively and efficiently by combining parallel computation with a divide-and-conquer strategy. Both theoretical analysis and experiments demonstrate that REST attains near-linear throughput scalability as additional GPUs are employed. Extensive experiments demonstrate that REST consistently outperforms existing cropping-based and fusion-based methods across a variety of scenarios, ranging from single-class to multi-class segmentation, from multispectral to hyperspectral imagery, and from satellite to drone platforms. The robustness and versatility of REST are expected to offer a promising solution for the holistic segmentation of WRI, with the potential for further extension to large-size medical imagery segmentation.

Abstract:
This article presents a general Bayesian learning framework for multi-modal groupwise image registration. The method builds on probabilistic modelling of the image generative process, where the underlying common anatomy and geometric variations of the observed images are explicitly disentangled as latent variables. Therefore, groupwise image registration is achieved via hierarchical Bayesian inference. We propose a novel hierarchical variational auto-encoding architecture to realise the inference procedure of the latent variables, where the registration parameters can be explicitly estimated in a mathematically interpretable fashion. Remarkably, this new paradigm learns groupwise image registration in an unsupervised closed-loop self-reconstruction process, sparing the burden of designing complex image-based similarity measures. The computationally efficient disentangled network architecture is also inherently scalable and flexible, allowing for groupwise registration on large-scale image groups with variable sizes. Furthermore, the inferred structural representations from multi-modal images via disentanglement learning are capable of capturing the latent anatomy of the observations with visual semantics. Extensive experiments were conducted to validate the proposed framework, including four different datasets from cardiac, brain, and abdominal medical images. The results have demonstrated the superiority of our method over conventional similarity-based approaches in terms of accuracy, efficiency, scalability, and interpretability.

Abstract:
Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M³D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.

Abstract:
Adversarial phenomena have been widely observed in machine learning (ML) systems, especially those using deep neural networks. These phenomena describe situations where ML systems may produce predictions that are inconsistent and incomprehensible to humans in certain specific cases. Such behavior poses a serious security threat to the practical application of ML systems. To exploit this vulnerability, several advanced attack paradigms have been developed, mainly including backdoor attacks, weight attacks, and adversarial examples. For each individual attack paradigm, various defense mechanisms have been proposed to enhance the robustness of models against the corresponding attacks. However, due to the independence and diversity of these defense paradigms, it is challenging to assess the overall robustness of an ML system against different attack paradigms. This survey aims to provide a systematic review of all existing defense paradigms from a unified lifecycle perspective. Specifically, we decompose a complete ML system into five stages: pre-training, training, post-training, deployment, and inference. We then present a clear taxonomy to categorize representative defense methods at each stage. The unified perspective and taxonomy not only help us analyze defense mechanisms but also enable us to understand the connections and differences among different defense paradigms. It inspires future research to develop more advanced and comprehensive defense strategies.

Abstract:
The challenging task of 3D planar reconstruction from images involves several sub-tasks including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, further exclusively designed modules relying on external plane correspondence labeling are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework, and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for the initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, achieving a new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.

Abstract:
Multi-view clustering (MVC) has gained widespread recognition as a valuable technique for enhancing clustering performance by harnessing diverse data sources. Nonetheless, current methods mainly concentrate on obtaining consistent information, often ignoring the risk of redundant information across different views. In this study, we propose a novel methodology, called Sufficient Multi-View Clustering (STMVC), which evaluates the multi-view clustering framework through an information-theoretic lens, intending to learn inter-view consistency information while removing redundant information among views. Specifically, we first utilize variational analysis to extract inter-view consistency information, and to further enhance the consistency information and minimize the redundant information between different views, we propose a sufficient representation lower bound. Furthermore, in order to improve the adaptability and generalizability of our proposed approach, we expand the application of STMVC to single-view scenarios and incomplete multi-view scenarios. The STMVC method provides a promising solution to the challenge of multi-view clustering and introduces a fresh perspective for analyzing multi-view data. To validate our model, we conducted a theoretical analysis based on the Bayesian error rate, and experiments on several multi-view datasets and single-view datasets show the outstanding performance of STMVC.

Abstract:
Enhancing perception performance via multi-agent collaboration has gained increasing attention in the field of autonomous driving. However, as the number of agents grows, the manual annotation required for training collaborative detectors increases significantly. To tackle this problem, we introduce an unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, named DOtA, without using external labels. DOtA first generates preliminary labels with an initial detector, which is trained on internally shared information of collaborative agents. DOtA then optimizes these preliminary labels by utilizing the physical rule constraints derived from the surrounding area of the object. Building on DOtA, we further propose DOtA++, an enhanced version that improves performance by leveraging composite prior constraints. Beyond physical rule constraints, DOtA++ further uses image data as an auxiliary modality to introduce multi-agent observation consistency constraints, boosting object classification, while also incorporating point cloud geometric distribution constraints to improve structural description. Extensive experiments on widely-used benchmarks demonstrate that DOtA and DOtA++ effectively perceive potential objects in the scene without manual annotations. In particular, DOtA++ shows a 10.7% mAP improvement over traditional unsupervised methods on the V2X-R dataset.

Abstract:
Face recognition models are vulnerable to spoofing by adversarial patches in the physical world. Attackers can cause face recognition models to make false identity judgments by simply pasting a sticker with a special pattern on the face. However, existing attacks lack the ability to transfer to black-box models, and the improvement of transferability is mainly focused on adversarial perturbations based on the p-norm. To further improve the attack performance and transferability, a highly transferable method for generating face recognition adversarial patches, named AdvDiffusion, is proposed. It first determines the region for adversarial patch generation based on facial gradient maps, and then an image is reconstructed to generate an adversarial patch by adding noise and denoising it with a pre-trained diffusion model. In the denoising, an adversarial loss is used to fine-tune the model and control the image to generate an adversarial patch with spoofing capability. Experiments and analysis show that the adversarial patches generated by the proposed method have good adversarial attack capability on black-box face recognition models in both digital and physical domains, and also have better robustness under the changes of a complex physical environment compared with some state-of-the-art methods. It has great potential application for black-box attacks in the physical domain.

Abstract:
Recent years have witnessed an explosive increase of face content, which drives a distinct shift from static images to dynamic video formats. The shift of formats inherently alters the characteristics within face videos, whereby pixel-wise artifacts are intertwined with motion-related impairments. Addressing these emerging distortions, which in practice now always appear in pairs, is however challenging and non-trivial, due to the distinct characteristics in addressing spatial-temporal frequencies in videos. In this paper, we propose a novel Unified recurrent network for joint Face video quality Enhancement and Stabilization (UniFES), as the first successful attempt at both quality enhancement and motion stabilization. Correspondingly, our UniFES method proposes to effectively aggregate the mutual information in the pixel and motion domains. For the quality enhancement, our UniFES method decomposes the shaking temporal alignment problem into progressive feature alignment with explicit physical information, which includes the global dynamics from the motion domain, i.e., from the stabilization task. Regarding the video stabilization, we integrate the mixed dynamics from the enhancement task (i.e., from the pixel domain) to take into account both pixel-wise and motion-related characteristics, ensuring robust trajectory estimation and motion stabilization. Subsequently, we refine the warping masks to achieve high-quality full-frame rendering. We further establish a synthetic dataset for training and evaluation regarding this emerging task. Comprehensive experiments have illustrated the superior performance of our UniFES method over 32 compared baselines on both newly established synthetic and real-world datasets.

Abstract:
This paper addresses the important and challenging task of large-scale unsupervised semantic segmentation (LUSS). We present the first attempt to unleash the power of foundation models (FMs) for the challenging dense prediction task LUSS, and our main objective is to present simple, effective yet efficient solutions for LUSS, namely Prompting foundation models for LUSS (PLUSS). First, we propose a cascade framework, PLUSS_α, by effectively marrying CLIPS, Grounding DINO, and SAM in a zero-shot manner. This cascade architecture automatically generates semantic and spatial prompts for SAM, establishing a strong baseline that significantly outperforms previous state-of-the-art methods. Building upon this foundation, we propose PLUSS_β, which addresses the critical bottleneck of prompt quality through two novel tuner modules: a semantic tuner that enhances fine-grained category discrimination via visual prompt tuning, and a box tuner that improves object localization through cross-modal feature fusion. Both tuners are optimized by capitalizing on the knowledge already present within the foundation models themselves, deriving self-supervised signals from internal model consistency. This approach requires no external supervision or updates to the foundation models’ parameters. Extensive experiments on ImageNet-S benchmarks demonstrate that PLUSS_β achieves remarkable performance improvements, surpassing the previous best method by 39.6%, 27.3%, and 22.6% in mIoU for 50, 300, and 919 categories respectively. Our approach exhibits robust category-shape representation across varying object sizes and dataset scales, while maintaining strong generalization capabilities for open-vocabulary tasks. The proposed framework provides a solid baseline for adapting foundation models to downstream vision tasks.
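
The mIoU figures above refer to mean intersection-over-union. A minimal sketch of the metric on flattened label maps follows; the function name and toy labels are hypothetical, and benchmark-specific conventions (for example how ImageNet-S handles ignored pixels or absent classes) are not modeled here.

```python
def mean_iou(predicted, target, num_classes):
    """Average per-class IoU over classes present in either label list."""
    if len(predicted) != len(target):
        raise ValueError("label lists must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(
            1 for p, t in zip(predicted, target) if p == cls and t == cls
        )
        union = sum(
            1 for p, t in zip(predicted, target) if p == cls or t == cls
        )
        if union > 0:  # skip classes absent from both lists (an assumption)
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) appears in the labels")
    return sum(ious) / len(ious)

# Hypothetical six-pixel example with three classes.
pred_labels = [0, 0, 1, 1, 2, 2]
true_labels = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred_labels, true_labels, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5.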

Affiliations: School of Computing and Artificial Intelligence, and Engineering Research Center of Intelligent Finance, Ministry of Education, Southwestern University of Finance and Economics, Chengdu, Sichuan, China; School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; Yingcai Honors College and School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; Institute of Methodologies for Environmental Analysis, CNR-IMAA, Tito Scalo, Italy

Abstract:
The goal of a deep learning-based general image fusion method is to solve multiple image fusion tasks with a single model, thereby facilitating the deployment of models in practical applications. However, existing methods fail to provide an efficient and comprehensive solution from both model training and network design perspectives. Regarding model training, current approaches cannot effectively leverage complementary information across different tasks. In terms of network design, they rely on experience-based network designs. To address these issues, we propose a comprehensive framework for general image fusion using the newly proposed gradient transfer learning and fusion rule unfolding. To leverage complementary information across different tasks during training, we propose a sequential gradient-transfer framework based on the idea that different image fusion tasks often exhibit complementary structural details and that image gradients effectively capture these details. To move beyond heuristic-based network design, we evolved a fundamental image fusion rule and integrated it into a deep equilibrium model, resulting in a more efficient and versatile image fusion network capable of uniformly handling various fusion tasks. Considering three different image fusion tasks, i.e., multi-focus image fusion, multi-exposure image fusion, and infrared and visible image fusion, our method not only produces images with richer structural information but also achieves highly competitive objective metrics. Furthermore, the results of generalization experiments on previously unseen image fusion tasks, i.e., medical image fusion, demonstrate that our method significantly outperforms competing approaches.

Abstract:
Thermal infrared imaging has attracted widespread attention in many fields due to the advantages of all-weather imaging and strong penetration. However, existing methods for thermal infrared novel-view synthesis often produce results with coarse details and floating artifacts, primarily caused by physical factors such as atmospheric transmission effects and thermal conduction. These challenges hinder accurate reconstruction of intricate structures and temperature distributions in thermal scenes, limiting the practical utility of previous approaches. To address these limitations, this paper introduces a physics-induced 3D Gaussian splatting method named Thermal3D-GS, the first novel-view synthesis method that relies exclusively on thermal infrared images. Thermal3D-GS begins by modeling atmospheric transmission effects and thermal conduction in three-dimensional media using neural networks. Additionally, considering the sparse features of infrared images, sparse feature priors are designed to improve the reconstruction accuracy of thermal infrared images. Furthermore, to validate the effectiveness of our method, the first large-scale benchmark dataset named Thermal Infrared Novel-view Synthesis Dataset (TI-NSD) is created. This dataset comprises 50 authentic thermal infrared video scenes, covering indoor, outdoor, traffic and uncrewed aerial vehicle (UAV) scenarios, with a total of 15,213 frames of thermal infrared image data. In addition, an expanded validation thermal infrared dataset, which includes three high-resolution scenes and five special scenes under varying atmospheric conditions and complex propagation media, is constructed to assess the generalization performance of the proposed method. Based on this dataset, this paper experimentally verifies the effectiveness of Thermal3D-GS. The results indicate that our method outperforms the baseline method with a 3.19 dB improvement in PSNR and significantly addresses the issues of floaters and indistinct edge features present in the baseline method.

Abstract:
Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by approximately 90% while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.

Abstract:
We propose Next Bit Prediction (NBP), a unified framework that simultaneously addresses lossless compression and lossy reconstruction of 3D point cloud geometry through a next-bit probability estimation paradigm. Our key insight is that both lossless compression and lossy reconstruction fundamentally rely on accurate probability estimation of geometric symbols, though targeting different metrics. Lossless compression minimizes bitrate via precise symbol distribution prediction, while lossy reconstruction enhances reconstruction fidelity through probability-guided geometry refinement. Recognizing that point clouds become sparser with increasing bit depth, NBP introduces two key technical innovations. For more significant bits, where the point density is higher, we develop a multi-stage Occupancy Probability Estimation (OPE) mechanism to estimate the probability distribution of occupancy status across multiple iteration stages, with each stage supporting either lossless or lossy mode. For less significant bits that focus on point placement, a Disentangled Probability Estimation (DPE) module is proposed to handle density information and binary residuals, simultaneously enabling lossless compression and facilitating probability-driven coordinate refinement for high-quality lossy reconstruction. Extensive experiments demonstrate the advantages of NBP, including low complexity, progressive coding, and superior coding efficiency, achieving state-of-the-art results both quantitatively and qualitatively.

Abstract:
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment; the two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled, separate geometric and temporal alignment, in which the two branches leverage depth confidence and camera pose, respectively, as priors for relevant feature matching; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset.

Abstract:
Procedure planning in instructional videos entails predicting an action sequence that transitions a given start state to a desired goal state. This task is particularly challenging due to two key sources of uncertainty: limited visual observations and an enormous decision space. The former results in multiple plausible plan variations due to missing intermediate visual states, while the latter complicates prediction by requiring selection from a large set of potential actions. Unlike prior work that addresses these issues implicitly, we propose an explicit solution. To mitigate the first challenge, we employ image generation models to synthesize diverse intermediate visual states using various text prompts, followed by a prompt selection module integrated within a diffusion model. To tackle the second challenge, we introduce a task-selective diffusion model that applies a task-specific mask to constrain the action space. As the effectiveness of this mask depends on accurate task classification, we further enhance visual representation by leveraging pre-trained vision-language models to generate action-aware, text-enriched multimodal embeddings. Extensive experiments on three benchmark datasets validate the superior performance of our proposed approach.

Abstract:
Weakly-supervised object detection (WSOD) learns detectors with only image-level classification annotations. Without precise instance-level labels, most previous WSOD methods in remote sensing images (RSIs) select the highest-scoring proposals as the final detection results, which are confronted by two major challenges: (1) instances with small scale or rare poses are easily neglected; (2) optimizing network by the top-scoring region inevitably overlooks many valuable candidate proposals. To mitigate the above-mentioned challenges, we propose a data-driven bidirectional spatial-adaptive network (BSANet). It contains a forward-reverse spatial dropout (FRSD) module to reduce instance ambiguity induced from extreme scales and poses, as well as crowded scene, and to better excavate the entire instances. From attention learning perspective, the proposed FRSD is conceptually similar to a data-driven hard attention mechanism, which adaptively samples and reconstructs the spatially related regions for mining more latent feature responses. Meanwhile, our FRSD effectively alleviates the inherent problem that non-parametric hard attention learning fashion cannot adapt to different datasets. In addition, we build a soft attention branch to simultaneously model soft pixel-level and hard region-level attention information for exploring the complementary benefit between soft and hard attention learning. We evaluate our BSANet on the challenging NWPU VHR-10.v2 and DIOR datasets. Experimental results demonstrate that our method sets a new state-of-the-art.

Affiliations: School of Computing and Data Science, The University of Hong Kong, Hong Kong, China; Department of Mathematics, The University of Hong Kong, Hong Kong, China; Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA; Department of Computer Science and Engineering, University of California, Santa Cruz, CA, USA; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China

Abstract:
Federated Learning (FL) has emerged as a promising privacy-preserving collaborative model training paradigm without sharing raw data. However, recent studies have revealed that private information can still be leaked through shared gradient information and attacked by Gradient Inversion Attacks (GIA). While many GIA methods have been proposed, a detailed analysis, evaluation, and summary of these methods are still lacking. Although various survey papers summarize existing privacy attacks in FL, few studies have conducted extensive experiments to unveil the effectiveness of GIA and their associated limiting factors in this context. To fill this gap, we first undertake a systematic review of GIA and categorize existing methods into three types, i.e., optimization-based GIA (OP-GIA), generation-based GIA (GEN-GIA), and analytics-based GIA (ANA-GIA). Then, we comprehensively analyze and evaluate the three types of GIA in FL, providing insights into the factors that influence their performance, practicality, and potential threats. Our findings indicate that OP-GIA is the most practical attack setting despite its unsatisfactory performance, while GEN-GIA has many dependencies and ANA-GIA is easily detectable, making them both impractical. Finally, we offer a three-stage defense pipeline to users when designing FL frameworks and protocols for better privacy protection and share some future research directions from the perspectives of attackers and defenders that we believe should be pursued. We hope that our study can help researchers design more robust FL frameworks to defend against these attacks.

Abstract:
Multimodal neuroimages, such as diffusion tensor imaging (DTI) and resting-state functional MRI (fMRI), offer complementary perspectives on brain activities by capturing structural or functional interactions among brain regions. While existing studies suggest that fusing these multimodal data helps detect abnormal brain activity caused by neurocognitive decline, they are generally implemented in Euclidean space and cannot effectively capture the intrinsic hierarchical organization of structural/functional brain networks. This paper presents a hyperbolic kernel graph fusion (HKGF) framework for neurocognitive decline analysis with multimodal neuroimages. It consists of a multimodal graph construction module, a graph representation learning module that encodes brain graphs in hyperbolic space through a family of hyperbolic kernel graph neural networks (HKGNNs), a cross-modality coupling module that enables effective multimodal data fusion, and a hyperbolic neural network for downstream predictions. Notably, HKGNNs represent graphs in hyperbolic space to capture both local and global dependencies among brain regions while preserving the hierarchical structure of brain networks. Extensive experiments involving over 4,000 subjects with DTI and/or fMRI data demonstrate the superiority of HKGF over state-of-the-art methods in two neurocognitive decline prediction tasks. The proposed HKGF is a general framework for multimodal data analysis, facilitating objective quantification of brain structural or functional connectivity changes associated with neurocognitive decline.

Abstract:
Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate further research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.

Abstract:
Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that models even infer in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model’s robustness against blur.

Abstract:
The physical world is composed of graphs, such as the protein structures in life science, the patient relations in medical diagnosis, the user connections in social media, etc. Graphs help both build the world itself and understand the semantics behind the data for humans. However, how such graph structures work toward semantic representation is still unclear, as existing attempts focus on employing the graphs for special tasks. In this work, we first introduce two measures to evaluate graph quality, namely structural complexity and homophily. Structural complexity describes the quantity of graph structural information representing the graph structure's symmetry, and homophily describes the percentage of intra-class edges to quantify edge consistency. Using these two measures, we then discover the relationship between the graph quality and the corresponding performance for general tasks: the performance correlates positively with structural complexity and follows a “J”-shaped relationship with homophily, both of which are proved mathematically. Based on these, we design a graph augmentation tool Graph^++. Graph^++ can enhance the natural graph structure and accordingly improve the general tasks. Empirical validation on tasks including Alzheimer's diagnosis and breast cancer subtype identification shows Graph^++'s ability to improve both graph structure and task performance, revealing the underlying data semantics.
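
As a minimal sketch of the homophily measure described in the abstract above (the percentage of intra-class edges), the following Python snippet computes the edge homophily ratio of a small labeled graph. The function name `edge_homophily`, the edge-list representation, and the toy graph are illustrative assumptions, not the paper's implementation; structural complexity is not covered here.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same class label.

    edges  -- iterable of (u, v) node pairs (hypothetical representation)
    labels -- mapping from node to class label
    """
    edges = list(edges)
    if not edges:
        raise ValueError("homophily is undefined for a graph without edges")
    # Count intra-class edges, i.e. edges joining two nodes of the same class.
    intra_class = sum(1 for u, v in edges if labels[u] == labels[v])
    return intra_class / len(edges)


# Toy 4-cycle with two classes: two edges are intra-class, two are inter-class.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio = edge_homophily(toy_edges, toy_labels)
print(ratio)
```

A ratio near 1 means most edges connect nodes of the same class; a ratio near 0 means most edges cross class boundaries.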

Abstract:
Predicting trajectories is essential for interpreting human behavior, yet it remains a challenging task when relying solely on observed motion patterns. Despite substantial progress, most existing methods assume fully observed trajectories and fail to account for missing data caused by occlusion, limited field of view, or sensor failures. This limitation substantially compromises the reliability of trajectory prediction, particularly in real-world deployment where observations are often incomplete. In light of this issue, our work presents the Gaussian Mixture Conditional Variational Recurrent Neural Network (GMC-VRNN), which unifies trajectory imputation and prediction within a single framework. Our GMC-VRNN framework couples a Multi-Space Graph Neural Network (MS-GNN) with a Gaussian Mixture Conditional VRNN, further augmented by a Bidirectional Temporal Decay (BTD) module, to achieve robust spatio-temporal representation learning under incomplete observations. To verify its effectiveness, we conduct extensive evaluations on two sports datasets covering multiple scenarios, jointly tackling trajectory imputation and prediction. Our experiments confirm that GMC-VRNN surpasses recent state-of-the-art approaches, offering enhanced precision and stronger robustness under diverse conditions.

Abstract:
Level-5 driving automation requires a robust visual perception system that can parse input images under any condition. However, existing driving datasets for dense semantic perception are either dominated by images captured under normal conditions or are small in scale. To address this, we introduce ACDC, the Adverse Conditions Dataset with Correspondences for training and testing methods for diverse semantic perception tasks on adverse visual conditions. ACDC consists of a large set of 8012 images, half of which (4006) are equally distributed between four common adverse conditions: fog, nighttime, rain, and snow. Each adverse-condition image comes with a high-quality pixel-level panoptic annotation, a corresponding image of the same scene under normal conditions, and a binary mask that distinguishes between intra-image regions of clear and uncertain semantic content. 1503 of the corresponding normal-condition images feature panoptic annotations, raising the total annotated images to 5509. ACDC supports the standard tasks of semantic segmentation, object detection, instance segmentation, and panoptic segmentation, as well as the newly introduced uncertainty-aware semantic segmentation. A detailed empirical study demonstrates the challenges that the adverse domains of ACDC pose to state-of-the-art supervised and unsupervised approaches and indicates the value of our dataset in steering future progress in the field.

Abstract:
In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds (about 10 k images with pixel-aligned SMAL labels) and CtrlAVES3D (about 7 k images with pixel-aligned AVES labels). To note, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3 k mammalian and 12.4 k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.

Abstract:
Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI’s o1/o3 and DeepSeek’s R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, trace the evolution of various reasoning models, and examine the core methods that enable advanced reasoning behind them. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.

Abstract:
Contrastive learning methods enforce label distance relationships in feature space to improve representation capability for regression models. However, these methods highly depend on label information to correctly recover ordinal relationships of features, limiting their applications to semi-supervised regression. In this work, we extend contrastive regression methods to allow unlabeled data to be used in the semi-supervised setting, thereby reducing the dependence on costly annotations. Particularly we construct the feature similarity matrix with both labeled and unlabeled samples in a mini-batch to reflect inter-sample relationships, and an accurate ordinal ranking of involved unlabeled samples can be recovered through spectral seriation algorithms if the level of error is within certain bounds. The introduction of labeled samples above provides regularization of the ordinal ranking with guidance from the ground-truth label information, making the ranking more reliable. To reduce feature perturbations, we further utilize the dynamic programming algorithm to select robust features for the matrix construction. The recovered ordinal relationship is then used for contrastive learning on unlabeled samples, and we thus allow more data to be used for feature representation learning, thereby achieving more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, serving as an additional training signal. We provide theoretical guarantees and empirical verification through experiments on various datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods.
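
The spectral seriation step mentioned in the abstract above can be illustrated with a small, self-contained Python sketch. It recovers an ordering of samples (up to reversal) from a pairwise similarity matrix by sorting the entries of an approximate Fiedler vector of the graph Laplacian. Everything here is an assumption made for illustration: the function name `fiedler_order`, the noise-free similarity `exp(-|a - b|)`, the hidden values, and the shifted power iteration used in place of a full eigen-solver. It is not the paper's algorithm and omits its feature selection and labeled-sample regularization.

```python
import math


def fiedler_order(sim, iters=5000):
    """Order items by an approximate Fiedler vector of the Laplacian L = D - S."""
    n = len(sim)
    # Node degrees, ignoring self-similarity on the diagonal.
    deg = [sum(row) - row[i] for i, row in enumerate(sim)]
    lap = [[deg[i] if i == j else -sim[i][j] for j in range(n)] for i in range(n)]
    # Gershgorin bound: every eigenvalue of L is at most 2 * max degree,
    # so (shift * I - L) is positive semidefinite.
    shift = 2.0 * max(deg)
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        # Remove the constant component (the trivial eigenvector of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # One power-iteration step on (shift * I - L).
        w = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda i: v[i])


# Hidden values play the role of unknown regression targets, in shuffled order.
hidden = [3.0, 0.0, 6.0, 1.0, 4.7, 2.5]
similarity = [[math.exp(-abs(a - b)) for b in hidden] for a in hidden]
order = fiedler_order(similarity)
print([hidden[i] for i in order])
```

Because an eigenvector is only defined up to sign, the recovered ranking may come out ascending or descending; anchoring the direction requires extra information, such as a few labeled samples.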

Abstract:
As a variant of the Area Under the ROC Curve (AUC), the partial AUC (PAUC) focuses on a specific range of false positive rate (FPR) and/or true positive rate (TPR) in the ROC curve. It is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. However, selecting instances within these constrained intervals during its calculation is NP-hard, and thus typically requires approximation techniques for practical resolution. Despite the progress made in PAUC optimization over the last few years, most existing methods still suffer from uncontrollable approximation errors or a limited scalability when optimizing the approximate PAUC objectives. In this paper, we close the approximation gap of PAUC optimization by presenting two simple instance-wise minimax reformulations: one with an asymptotically vanishing gap, the other with the unbiasedness at the cost of more variables. Our key idea is to first establish an equivalent instance-wise problem to lower the time complexity, simplify the complicated sample selection procedure by threshold learning, and then apply different smoothing techniques. Equipped with an efficient solver, the resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(\epsilon^{-1/3})$ for typical one-way and two-way PAUCs. Moreover, we provide a tight generalization bound of our minimax reformulations. The result explicitly demonstrates the impact of the TPR/FPR constraints $\alpha$/$\beta$ on the generalization and exhibits a sharp order of $\tilde{O}(\alpha^{-1} n_+^{-1} + \beta^{-1} n_-^{-1})$. Finally, extensive experiments on several benchmark datasets validate the strength of our proposed methods.
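
To make the one-way PAUC in the abstract above concrete, here is a minimal Python sketch of a common normalized empirical estimator over the FPR range [0, beta]. It evaluates the metric on fixed scores by ranking every positive against only the highest-scoring negatives that fall in the FPR range. The function name `one_way_partial_auc`, the tie handling, and the toy scores are assumptions for illustration; this is a plain evaluation routine, not the minimax reformulation or the solver proposed in the paper.

```python
import math


def one_way_partial_auc(pos_scores, neg_scores, beta):
    """Normalized empirical one-way partial AUC for FPR in [0, beta]."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    # Number of negatives inside the FPR range.
    k = math.floor(len(neg_scores) * beta)
    if k == 0 or not pos_scores:
        raise ValueError("need at least one positive and one selected negative")
    # The FPR constraint keeps only the k hardest (highest-scoring) negatives.
    hardest_negatives = sorted(neg_scores, reverse=True)[:k]
    # Count correctly ranked (positive, negative) pairs; ties count as half.
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in hardest_negatives
    )
    return correct / (len(pos_scores) * k)


pos = [0.9, 0.6, 0.4]
neg = [0.8, 0.5, 0.3, 0.1]
pauc_half = one_way_partial_auc(pos, neg, beta=0.5)  # uses the 2 hardest negatives
auc_full = one_way_partial_auc(pos, neg, beta=1.0)   # reduces to the ordinary AUC
print(pauc_half, auc_full)
```

Setting `beta=1.0` recovers the standard AUC, which shows how the partial metric concentrates on the hardest negatives as beta shrinks.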

Abstract:
High-speed imaging, which captures the fleeting dynamics of moving objects at extreme frame rates, has become an indispensable tool across a wide range of scientific disciplines. Yet, the pursuit of high temporal resolution often comes at the cost of significant image degradation, due to the inherent limitations of imaging sensors and the extreme conditions of ultra-short exposure and massive data throughput. As a result, high-speed cameras often produce images marred by strong noise and severe color distortions. In this work, we propose a deep image signal processing (ISP) paradigm that enables high-speed cameras to maintain extremely high frame rates while achieving image quality comparable to that of digital single-lens reflex (DSLR) cameras. To this end, we make two key contributions: 1) constructing RHID, the first large-scale real-world high-speed imaging ISP dataset, comprising 282,912 RAW images captured by high-speed cameras and corresponding sRGB images captured by DSLRs, featuring complex degradations intrinsic to high-speed acquisition; and 2) proposing a misalignment-robust ISP learning framework (MisISP), equipped with a prior mapper-guided image alignment module (PMIA) and a spectrum-guided weakly-aligned image supervisory loss, which effectively addresses inherent pixel misalignments caused by heterogeneous sensor characteristics. Extensive experiments demonstrate that our paradigm substantially advances the performance of existing deep ISP models for high-speed imaging, achieving remarkable improvements in noise suppression, brightness enhancement, and color preservation.

Abstract:
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks recently. However, their heavy computation costs remain daunting for resource-limited devices. To address this, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. However, existing approaches generally sparsely drop redundant image tokens by token pruning or brutally remove channels by channel pruning, leading to a sub-optimal balance between model performance and inference speed. Moreover, they struggle when transferring compressed models to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose CAIT, a joint compression method for ViTs that achieves a harmonious blend of high accuracy, fast inference speed, and favorable transferability to downstream tasks. Specifically, we introduce an asymmetric token merging (ATME) strategy to effectively integrate neighboring tokens. It can successfully compress redundant token information while preserving the spatial structure of images. On top of it, we further design a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in multi-head self-attention modules of ViTs can be pruned uniformly, significantly enhancing the model compression. Extensive experiments on multiple benchmark datasets show that our proposed method can achieve state-of-the-art performance across various ViTs.

Affiliations: Department of Computer Science and Institute of Artificial Intelligence, University of Central Florida, Orlando, FL, USA; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; Department of Computer Science, College of Computing, Grand Valley State University, Allendale, MI, USA; Northeastern University, Boston, MA, USA; The University of Sydney, Camperdown, NSW, Australia; University at Buffalo, Getzville, NY, USA; College of Computing and Data Science, and Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore

Abstract:
Continual learning (CL) focuses on learning non-stationary data distribution without forgetting previous knowledge. The most widely used memory-replay approaches are often prone to memory overfitting due to the limited memory diversity and hardness. Existing work mitigating memory overfitting either lacks data diversity or hardness or is hard to train. To address the above limitations and release the memory buffer potential, we view the memory buffer transformation from a new dynamic system perspective and propose a continuous and reversible memory transformation method. We introduce an adversarial optimization objective that jointly learns the CL model and memory transformer. Specifically, we present a deterministic continuous memory transformer (DCMT) to generate diverse memory data. Furthermore, we inject uncertainty into the transformation function and develop a stochastic continuous memory transformer (SCMT), which substantially enhances the diversity of the transformed memory buffer. The presented neural transformation approaches have significant advantages over existing ones: (1) they significantly increase the memory buffer diversity and hardness to overfit; (2) they are memory efficient without needing to make a replica of the memory data. Extensive experiments show a significant improvement with our approach compared to strong baselines.

Abstract:
LiDAR perception for autonomous driving applications offers highly accurate scene depiction in three-dimensional (3D) spaces, whose most representative task is LiDAR panoptic segmentation (LPS), as it provides both instance- and semantic-level segmentation in a holistic manner. Although previous approaches have achieved mature performance, no research has explored temporal information for enhancing LPS performance. As multi-frame processing can assist in better predictions in terms of feature representation and recursive forecasting, which has been proven in other LiDAR perception challenges, this study proposes an effective and temporal-aware panoptic segmentation method for LiDAR point clouds. Specifically, we introduce two modules: convolution-based cross-frame fusion attention (CFFA) and adjacent shifted feature encoder (ASFE) modules. The CFFA module can fuse multi-frame features on the basis of the idea of convolution-based attention, whereas the ASFE module leverages adjacent model outputs and serves as an intermediate guide for final segmentation predictions. Extensive experiments confirm the effectiveness of both modules for LPS. The proposed LPS model achieves impressive panoptic-quality metric scores that are evaluated on different popular benchmarks (63.36% under SemanticKITTI and 78.54% under Panoptic nuScenes), outperforming previous state-of-the-art methods by a significant margin. Further quantitative and qualitative analyses provide evidence of the advantages of multi-frame processing for the LPS together with demonstrations of its particular behavior under different settings.

Abstract:
Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%.

Abstract:
Today, convolutional neural network (CNN) pruning techniques often rely on manually crafted importance criteria and pruning structures. Due to their heuristic nature, these methods may lack generality, and their performance is not guaranteed. In this paper, we propose a theoretical framework to address this challenge by leveraging the concept of $\gamma$-weak submodularity, based on a new efficient importance function. By deriving an upper bound on the absolute error in the layer subsequent to the pruned layer, we formulate the importance function as a $\gamma$-weakly submodular function. This formulation enables the development of an easy-to-implement, low-complexity, and data-free oblivious algorithm for selecting filters to be removed from a convolutional layer. Extensive experiments show that our method outperforms state-of-the-art benchmark networks across various datasets, with a computational cost comparable to the simplest pruning techniques, such as $\ell_2$-norm pruning. Notably, the proposed method achieves an accuracy of 76.52%, compared to 75.15% for the overall best baseline, with a 25.5% reduction in network parameters. According to our proposed resource-efficiency metric for pruning methods, the ACLI approach demonstrates orders-of-magnitude higher efficiency than the other baselines, while maintaining competitive accuracy.
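For orientation, the norm-based baseline that the abstract names as one of the simplest pruning techniques can be sketched as follows. This illustrates only that baseline, not the proposed submodular selection algorithm; the flat-list filter layout and the helper name `l2_prune` are assumptions made for the example.

```python
import math

def l2_prune(filters, keep_ratio):
    """Return the sorted indices of the filters to keep, ranked by l2 norm.

    filters: list of filters, each a flat list of weights (hypothetical layout).
    keep_ratio: fraction of filters to retain, in (0, 1].
    """
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    n_keep = max(1, round(keep_ratio * len(filters)))
    # Rank filter indices from largest to smallest norm and keep the top ones.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])

# Four toy filters; the two with the largest norms survive.
filters = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [1.0, -1.0]]
kept = l2_prune(filters, keep_ratio=0.5)
```

The criterion is data-free in the same sense as the abstract's algorithm: it looks only at the weights, never at activations.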

Abstract:
Unsigned distance functions (UDFs) have emerged as a powerful representation for modeling and reconstructing geometries with open surfaces. However, the development of 3D generative models for UDFs remains largely unexplored, limiting current methods from generating diverse open-surface 3D content. Moreover, mainstream 3D datasets predominantly consist of watertight meshes, revealing a critical challenge: the absence of standardized datasets and benchmarks specifically tailored for open-surface generation and reconstruction. In this paper, we begin by introducing UDiFF, a novel diffusion-based 3D generative model specifically designed for UDFs. UDiFF supports both conditional and unconditional generation of textured 3D shapes with open surfaces. At its core, UDiFF generates UDFs in the spatial-frequency domain using a learnable wavelet transform. Instead of relying on manually selected wavelet transforms, which are labor-intensive and prone to information loss, we introduce a data-driven approach that learns the optimal wavelet transformation from UDF datasets. Beyond UDiFF, we present the UWings dataset, comprising 1,509 high-quality 3D open-surface models of winged creatures. Using UWings, we establish comprehensive benchmarks for evaluating both generative and reconstruction methods based on UDFs.

Abstract:
No-Reference Image Quality Assessment (NR-IQA) models play an important role in various real-world applications. Recently, adversarial attacks against NR-IQA models have attracted increasing attention, as they provide valuable insights for revealing model vulnerabilities and guiding robust system design. Some effective attacks have been proposed against NR-IQA models in white-box settings, where the attacker has full access to the target model. However, these attacks often suffer from poor transferability to unknown target models in more realistic black-box scenarios, where the target model is inaccessible. This work makes the first attempt to address the challenge of low transferability in attacking NR-IQA models by proposing a transferable Signed Ensemble Gaussian black-box Attack (SEGA). The main idea is to approximate the gradient of the target model by applying Gaussian smoothing to source models and ensembling their smoothed gradients. To ensure the imperceptibility of adversarial perturbations, SEGA further removes inappropriate perturbations using a specially designed perturbation filter mask. Experimental results demonstrate the superior transferability of SEGA, validating its effectiveness in enabling successful transfer-based black-box attacks against NR-IQA models.
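The main idea stated above (smooth each source model's gradient with Gaussian noise, ensemble the smoothed gradients, and step along the sign) can be sketched on toy inputs. This is a minimal illustration under stated assumptions, not the SEGA implementation: the gradient callables, the step direction, and all parameter names (`eps`, `sigma`, `n_samples`) are hypothetical, and the perturbation filter mask is omitted.

```python
import random

def smoothed_grad(grad_fn, x, sigma, n_samples, rng):
    """Average grad_fn over Gaussian-perturbed copies of x."""
    acc = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = grad_fn(noisy)
        acc = [a + gi for a, gi in zip(acc, g)]
    return [a / n_samples for a in acc]

def signed_ensemble_step(grad_fns, x, eps, sigma=0.1, n_samples=16, seed=0):
    """Sum the smoothed gradients of all source models, then step by eps * sign."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for grad_fn in grad_fns:
        g = smoothed_grad(grad_fn, x, sigma, n_samples, rng)
        total = [t + gi for t, gi in zip(total, g)]
    sign = [(v > 0) - (v < 0) for v in total]
    return [xi + eps * s for xi, s in zip(x, sign)]

# Two toy "source models": linear scorers whose gradients are constant vectors.
w1 = [1.0, -2.0, 0.5]
w2 = [0.5, -1.0, -0.25]
grad_fns = [lambda x: w1, lambda x: w2]
adv = signed_ensemble_step(grad_fns, [0.0, 0.0, 0.0], eps=0.01)
```

With real NR-IQA source models, `grad_fn` would return the gradient of the predicted quality score with respect to the image pixels.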

Abstract:
The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, largely due to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a number of small datasets, mainly including targets like ships, airplanes, buildings, etc. There is only one vehicle dataset MSTAR collected in the 1990s, which has been a valuable source for SAR ATR. To fill this gap, this paper introduces a large-scale, new dataset named ATRNet-STAR with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advancement in dataset scale and diversity, comprising over 190,000 well-annotated samples, 10× larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and the data collection scheme will be detailed. Second, we illustrate the value of ATRNet-STAR via extensively evaluating the performance of 15 representative methods with 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.

Abstract:
With the functions of egocentric observation and multimodal perception equipped in augmented reality (AR) devices, the next generation of smart assistants has the potential to reduce human labor and enhance execution efficiency in assembly tasks. Among diverse assembly activity understanding tasks, anticipating the near future activities is crucial yet challenging, which can assist humans or agents to actively plan and engage in interactions with the environment. However, the existing egocentric activity anticipation methods still struggle to achieve a decent trade-off between accuracy and computational efficiency, hindering their deployment in practical applications. To address this dilemma, in this paper, we propose a goal-guided prompting framework with adaptive modality selection (GP-AMS), for assembly activity anticipation in egocentric videos. For bridging the semantic gap between the historical observations and unobserved future activities, we inject the inferred high-level goal clues into the constructed prompts, which are further utilized to guide a pre-trained vision-language (V-L) model to compensate relevant semantics of unseen future. Moreover, a mask-and-predict strategy is adopted with two imposed constraints, i.e., causal masking and probabilistic token-dropping, to mine the intrinsic associations between the assembly activities within a specific procedure. For maintaining the benefits of exploiting multimodal information while avoiding extensively increasing the computational burdens, an adaptive modality selection strategy is designed to train a policy network, which learns to dynamically decide which modalities should be sampled for processing by the anticipation model on a per observation time-step basis. By allocating major computation to the selected indicative modalities on-the-fly, the efficiency of the overall model can be improved, thus paving the way for feasibility on real-world devices. Extensive experimental results on two public data sets validate that the proposed method yields not only consistent improvements in anticipation accuracy, but also significant savings in computation budgets.

Abstract:
The widespread proliferation of fake news on the Internet, especially in multi-modal formats, poses a substantial threat to society. Most deep learning-based approaches for fake news detection yield accurate predictions but lack explainability. Existing models focusing on explainability visualize key components from results or generate surface causes via Large Language Models. However, they can hardly provide the deep rationale behind the fabrication of fake news, which is indispensable for misinformation mitigation. Thus, we approach explainability from a different perspective, focusing on explaining how fake news is fabricated, which we term deceptive patterns, at its very source. First, four types of deceptive patterns are pre-established, namely Image Manipulation, Cross-modal Inconsistency, Image Repurposing and Others. Based on this, we propose GE-NSLM, a General Explainable Neuro-Symbolic Latent Model that integrates the power of Large Vision Language Models, which not only provides accurate judgments but also offers insights on deceptive patterns. Specifically, each deceptive pattern is represented as a binary learnable latent variable, obtained through amortized variational inference and weak supervision guided by logical rules. Experiments show GE-NSLM achieves competitive performance. More importantly, it provides interpretable insights into the underlying reasons why specific news items are fake.

Abstract:
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in an image. In real open environments, intelligent systems often encounter complex and challenging scenes, such as low-light, rain, snow, etc., which we call constrained conditions. These real situations pose more severe challenges to existing SOD models. However, there is no comprehensive and in-depth exploration of this field at both the data and model levels, and most of them focus on ideal situations or a single condition. To bridge this gap, we launch a new task, Condition-Constrained Salient Object Detection (CSOD), aimed at robustly and accurately locating salient objects in constrained environments. On the one hand, to compensate for the lack of datasets, we construct the first large-scale condition-constrained salient object detection dataset CSOD10K, comprising 10,000 pixel-level annotated images and over 100 categories of salient objects. This dataset is oriented towards the real environment and includes 8 real-world constrained scenes under 3 main constraint types, making it extremely challenging. On the other hand, we abandon the paradigm of “restoration before detection” and instead introduce a unified end-to-end framework CSSAM that fully explores scene attributes, eliminating the need for additional ground-truth restored images and reducing computational overhead. Specifically, we design a Scene Prior-Guided Adapter (SPGA), which injects scene priors to enable the foundation model to better adapt to downstream constrained scenes. To automatically decode salient objects, we propose a Hybrid Prompt Decoding Strategy (HPDS), which can effectively integrate multiple types of prompts to achieve adaptation to the SOD task. Extensive experiments show that our model significantly outperforms state-of-the-art methods on both the CSOD10K dataset and existing standard SOD benchmarks.

Abstract:
4D head capture aims to generate dynamic facial meshes in the same topology with corresponding UV maps, which requires temporal correspondence between 3D head models. Existing pipelines either involve manual processing of artists or employ constraints such as landmark tracking and optical flow, failing to achieve a trade-off between accuracy and efficiency. To enhance this process, we propose Topo4D++, a novel framework for automatic geometry and texture reconstruction that optimizes densely aligned 4D heads and 8K BRDF maps directly from calibrated multi-view videos. Our key insight is to represent facial models as a set of dynamic 3D Gaussians with fixed topology, where the Gaussian centers are bound to the mesh vertices. This enables tracking all vertices rather than sparse vertices on the face accurately by leveraging the inverse rendering capabilities of 3D Gaussian Splatting (3DGS), while also enabling ultra-high-resolution texture generation. To maintain face structure during dynamic 3DGS optimization, we propose to optimize geometry and texture alternately under physical and topological constraints frame-by-frame and employ blendshape-based expression priors to address extreme expressions. Then, we propose to extract dynamic facial meshes in a regular wiring arrangement and high-fidelity textures with pore-level details from the learned Gaussians. Finally, we train a diffusion-based model to generate BRDF texture maps to achieve physically based rendering. Given the absence of a universal benchmark, we construct JHead, a novel benchmark for the comprehensive evaluation of 4D head capture methods. Extensive experiments on different datasets demonstrate that our method is generalized to different capture systems, identities, and expressions, outperforming current state-of-the-art head reconstruction methods in both mesh and texture qualitatively and quantitatively.

Abstract:
Image Quality Assessment (IQA) with references plays an important role in optimizing and evaluating computer vision tasks. Traditional methods assume that all pixels of the reference and test images are fully aligned. Such Aligned-Reference IQA (AR-IQA) approaches fail to address many real-world problems with various geometric deformations between the two images. Although significant effort has been made to attack the Geometrically-Disparate-Reference IQA (GDR-IQA) problem, it has been addressed in a task-dependent fashion, for example, by dedicated designs for image super-resolution and retargeting, or by assuming the geometric distortions to be small enough to be countered by translation-robust filters or by explicit image registrations. Here we rethink this problem and propose a unified, non-training-based Deep Structural Similarity (DeepSSIM) approach to address the above problems in a single framework, which assesses structural similarity of deep features in a simple but efficient way and uses an attention calibration strategy to alleviate attention deviation. The proposed method, without application-specific design, achieves state-of-the-art performance on AR-IQA datasets and meanwhile shows strong robustness to various GDR-IQA test cases. Interestingly, our test also shows the effectiveness of DeepSSIM as an optimization tool for training image super-resolution, enhancement and restoration, implying an even wider generalizability.

Abstract:
Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Typically, most existing FR models are built upon Convolutional Neural Networks (CNN) and take RGB face images as the model’s input. In this work, we take a closer look at existing FR paradigms from high-efficiency, security, and precision perspectives, and identify the following three problems: (i) CNN frameworks are vulnerable in capturing global facial features and modeling the correlations between local facial features. (ii) Selecting RGB face images as the model’s input greatly degrades the model’s inference efficiency, increasing the extra computation costs. (iii) In the real-world FR system that operates on RGB face images, the integrity of user privacy may be compromised if hackers successfully penetrate and gain access to the input of this model. To solve these three issues, we propose two novel FR frameworks, i.e., TransFace and TransFace++, which successfully explore the feasibility of applying ViTs and image bytes to FR tasks, respectively. Firstly, as revealed from our observations, we find that ViTs perform vulnerably when applied to FR scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approaches and hard sample mining strategies are incompatible with ViTs-based FR backbone due to the lack of tailored consideration on preserving face structural information and leveraging each local token information. To remedy these problems, we first propose a superior FR model called TransFace, which contains a patch-level data augmentation strategy named Dominant Patch Amplitude Perturbation (DPAP) and a hard sample mining strategy named Entropy-guided Hard Sample Mining (EHSM). Furthermore, to improve inference efficiency and user privacy protection, we investigate the intrinsic property of image bytes and propose a superior FR model termed TransFace++. The proposed model is trained directly on image bytes, presenting a novel approach to address the aforementioned issues. Specifically, considering the importance of local correlations in bytes, an image bytes compression strategy named Topology-based Image Bytes Compression (TIBC) is introduced to extract prominent features from the raw bytes and integrate these features with byte embeddings, effectively mitigating information loss during the bytes mapping process. Moreover, to strengthen the model’s perception on geometric information encoded in image bytes, a novel cross-attention module named Structure Information-guided Cross-Attention (SICA) is designed to inject structure information into byte tokens for information interaction, significantly improving the model’s generalization ability. Experiments on popular face benchmarks demonstrate the superiority of our TransFace and TransFace++.

Abstract:
Tensor train (TT) representation has achieved tremendous success in visual data completion tasks, especially when it is combined with tensor folding. However, folding an image or video tensor breaks the original data structure, leading to local information loss as nearby pixels may be assigned into different dimensions and become far away from each other. In this paper, to fully preserve the local information of the original visual data, we explore not folding the data tensor, and at the same time adopt graph information to regularize local similarity between nearby entries. To overcome the high computational complexity introduced by the graph-based regularization in the TT completion problem, we propose to break the original problem into multiple sub-problems with respect to each TT core fiber, instead of each TT core as in traditional methods. Furthermore, to avoid heavy parameter tuning, a sparsity-promoting probabilistic model is built based on the generalized inverse Gaussian (GIG) prior, and an inference algorithm is derived under the mean-field approximation. Experiments on both synthetic data and real-world visual data show the superiority of the proposed methods.

Abstract:
Neural Ordinary Differential Equations (NODEs) serve as continuous-time analogs of residual networks. They provide a system-theoretic perspective on neural network architecture design and offer natural solutions for time series modeling, forecasting, and applications where invertible neural networks are essential. However, these models suffer from slow performance due to heavy numerical solver overhead. For instance, a popular solution for training and inference of NODEs consists in using adaptive step size solvers such as the popular Dormand–Prince 5(4) (DOPRI). These solvers dynamically adjust the Number of Function Evaluations (NFE) as the equation fits the training data and becomes more complex. However, this comes at the cost of an increased number of function evaluations, which reduces computational efficiency. In this work, we propose a novel approach: making the parameters of the numerical integration scheme trainable. By doing so, the numerical scheme dynamically adapts to the dynamics of the NODE, resulting in a model that operates with a fixed NFE. We compare the proposed trainable solvers with state-of-the-art approaches, including DOPRI, for different benchmarks, including classification, density estimation, and dynamical system modeling. Overall, we report a state-of-the-art performance for all benchmarks in terms of accuracy metrics, while enhancing the computational efficiency through trainable fixed-step-size solvers. This work opens up new possibilities for practical and efficient modeling applications with NODEs.
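To make the idea of a parameterized fixed-step scheme concrete, here is a minimal sketch. The abstract does not say how the solver is parameterized; this example assumes the classical one-parameter family of two-stage, second-order explicit Runge-Kutta methods, with the scalar `a` standing in for a trainable solver parameter. The function names are hypothetical.

```python
def rk2_step(f, t, y, h, a):
    """One step of the one-parameter family of two-stage second-order RK schemes.

    a = 0.5 gives the midpoint method, a = 1.0 gives Heun's method.
    In a trainable solver, a would be optimized jointly with the model.
    """
    k1 = f(t, y)
    k2 = f(t + a * h, y + a * h * k1)
    b2 = 1.0 / (2.0 * a)
    return y + h * ((1.0 - b2) * k1 + b2 * k2)

def integrate(f, y0, t0, t1, n_steps, a):
    """Integrate with a fixed number of steps, so NFE is always 2 * n_steps."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = rk2_step(f, t, y, h, a)
        t += h
    return y

# Toy dynamics y' = -y^2 with y(0) = 1, whose exact solution is y(t) = 1 / (1 + t).
f = lambda t, y: -y * y
y_midpoint = integrate(f, 1.0, 0.0, 1.0, n_steps=10, a=0.5)
y_heun = integrate(f, 1.0, 0.0, 1.0, n_steps=10, a=1.0)
```

Every member of this family has the same cost per step. Changing `a` changes only the error for a given dynamics, which is what makes it a candidate for learning while the NFE stays fixed.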

Abstract:
Out-of-distribution (OoD) inputs pose a persistent challenge to deep learning models, often triggering overconfident predictions on non-target objects. While prior work has primarily focused on refining scoring functions and adjusting test-time thresholds, such algorithmic improvements offer only incremental gains. We argue that a rethinking of the entire development lifecycle is needed to mitigate these risks effectively. This work addresses two overlooked dimensions of OoD detection in object detection. First, we reveal fundamental flaws in widely used evaluation benchmarks: contrary to their design intent, up to 13% of objects in the OoD test sets actually belong to in-distribution classes, and vice versa. These quality issues severely distort the reported performance of existing methods and contribute to their high false positive rates. Second, we introduce a novel training-time mitigation paradigm that operates independently of external OoD detectors. Instead of relying solely on post-hoc scoring, we fine-tune the detector using a carefully synthesized OoD dataset that semantically resembles in-distribution objects. This process shapes a defensive decision boundary by suppressing objectness on OoD objects, leading to a 91% reduction in hallucination error of a YOLO model on BDD-100K. Our methodology generalizes across detection paradigms such as YOLO, Faster R-CNN, and RT-DETR, and supports few-shot adaptation. Together, these contributions offer a principled and effective way to reduce OoD-induced hallucination in object detectors.

Abstract:
Unsigned distance functions (UDFs) have been a vital representation for open surfaces. With different differentiable renderers, current methods are able to train neural networks to infer a UDF by minimizing the rendering errors with the UDF to the multi-view ground truth. However, these differentiable renderers are mainly handcrafted, which makes them either biased on ray-surface intersections, or sensitive to unsigned distance outliers, or not scalable to large scenes. To resolve these issues, we present a novel differentiable renderer to infer UDFs more accurately. Instead of using handcrafted equations, our differentiable renderer is a neural network which is pre-trained in a data-driven manner. It learns how to render unsigned distances into depth images, leading to a prior knowledge, dubbed volume rendering priors. To infer a UDF for an unseen scene from multiple RGB images, we generalize the learned volume rendering priors to map inferred unsigned distances in alpha blending for RGB image rendering. To reduce the bias of sampling in UDF inference, we utilize an auxiliary point sampling prior as an indicator of ray-surface intersection, and propose novel schemes towards more accurate and uniform sampling near the zero-level sets. We also propose a new strategy that leverages our pretrained volume rendering prior to serve as a general surface refiner, which can be integrated with various Gaussian reconstruction methods to optimize the Gaussian distributions and refine geometric details. Our results show that the learned volume rendering prior is unbiased, robust, scalable, 3D aware, and more importantly, easy to learn. Further experiments show that the volume rendering prior is also a general strategy to enhance other neural implicit representations such as signed distance function and occupancy. We evaluate our method on both widely used benchmarks and real scenes, and report superior performance over the state-of-the-art methods.

Abstract:
Speech editing has garnered more and more attention due to its diverse applications. However, existing systems often require substantial manual effort or have limited capabilities in attribute editing, imposing significant constraints. In this work, we present SpeechPalette, a comprehensive high-quality speech editing method that allows users to easily modify various attributes of the selected speech segment according to their preferences. Specifically, the proposed model approaches speech editing from a decoupling perspective, disentangling critical information such as text, pitch, duration and more from the input speech. Then, reconstruction is achieved through a mask and prediction mechanism. Furthermore, we leverage a diffusion model to predict the residuals between the real and predicted speech, further enhancing synthesis quality. The proposed method not only excels at text-based speech editing but also handles tasks involving pitch and speed rate adjustments. Moreover, it also demonstrates remarkable performance in one-shot text-to-speech scenarios. While recent large-scale models achieve impressive synthesis quality through massive computational resources, SpeechPalette offers a balanced approach with explicit fine-grained control over speech attributes, practical deployment requirements, and competitive performance relative to similarly-sized systems. Experimental results across a range of tasks consistently demonstrate the superior performance of our method compared to baseline systems. Additionally, comprehensive ablation studies validate the effectiveness of our proposed approach.

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; College of Mathematics, Sichuan University, Chengdu, China; School of Information Technology & Management, University of International Business and Economics, Beijing, China; Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA; School of Computer Science, Peking University, Beijing, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; School of Computer Science and Technology, Central South University, Changsha, China; College of Computer, National University of Defense Technology, Changsha, China; Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

Abstract:
Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.

Abstract:
Salient object detection (SOD) and camouflaged object detection (COD) are related but distinct binary mapping tasks, each involving multiple modalities that share commonalities while maintaining unique characteristics. Existing approaches often rely on complex, task-specific architectures, leading to redundancy and limited generalization. Our previous work, VSCode, introduced a generalist model that effectively handles four SOD tasks and two COD tasks. VSCode leveraged VST as its foundation model and incorporated 2D prompts within an encoder-decoder framework to capture domain and task-specific knowledge, utilizing a prompt discrimination loss to optimize the model. Building upon the proven effectiveness of our previous work VSCode, we identify opportunities to further strengthen generalization capabilities through focused modifications in model design and optimization strategy. To unlock this potential, we propose VSCode-v2, an extension that introduces a Mixture of Prompt Experts (MoPE) layer to generate adaptive prompts. We also redesign the training process into a two-stage approach: first learning shared features across tasks, then capturing specific characteristics. To preserve knowledge during this process, we incorporate distillation from our conference version model. Furthermore, we propose a contrastive learning mechanism with data augmentation to strengthen the relationships between prompts and feature representations. VSCode-v2 demonstrates balanced performance improvements across six SOD and COD tasks. Moreover, VSCode-v2 effectively handles various multimodal inputs and exhibits zero-shot generalization capability to novel tasks, such as RGB-D Video SOD.

Abstract:
Integrating pathological images and gene expression data for cancer survival analysis can achieve better results than using a single modality. However, existing multimodal learning methods ignore fine-grained interactions between the two modalities, especially the interactions between biological pathways and pathological image patches. In this article, we propose a novel Pathway-Aware Multimodal Transformer (PAMT) framework for interpretable cancer survival analysis. Specifically, PAMT learns fine-grained modality interaction through three stages: (1) In the intra-modal pathway-pathway / patch-patch interaction stage, we use the Transformer model to perform intra-modal information interaction; (2) In the inter-modal pathway-patch alignment stage, we introduce a novel label-free contrastive loss to align semantic information between the modalities so that the features of the two modalities are mapped to the same semantic space; and (3) In the inter-modal pathway-patch fusion stage, to model the medical prior knowledge that “genotype determines phenotype”, we propose a pathway-to-patch cross fusion module to perform inter-modal information interaction under the guidance of the pathway prior. In addition, the inter-modal cross fusion module gives PAMT good interpretability, helping a pathologist to screen which pathways play a key role, to locate which regions of the whole slide image (WSI) are affected by a pathway, and to mine prognosis-relevant pathology image patterns. Experimental results on three datasets (bladder urothelial carcinoma, lung squamous cell carcinoma, and lung adenocarcinoma) demonstrate that the proposed framework significantly outperforms state-of-the-art methods.

Abstract:
Deep neural networks possess remarkable learning capabilities but are vulnerable to overfitting in the presence of mislabeled data. A well-known memorization effect causes networks to first fit clean samples and later memorize noisy labels. Although early stopping can partially alleviate this issue, it cannot prevent the accumulation of incorrect knowledge or recover information lost due to mislabeled inputs. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.
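
The masking step described in this abstract can be illustrated with a small NumPy sketch. It is only one plausible reading, not the authors' loss: it assumes the distillation term is a KL divergence between teacher and student softmax distributions computed over the non-true classes only. The function names and the temperature parameter are hypothetical.

```python
import numpy as np

def masked_log_softmax(logits, labels):
    """Log-softmax over the non-true classes only (true-class logit removed)."""
    n, c = logits.shape
    keep = np.ones((n, c), dtype=bool)
    keep[np.arange(n), labels] = False           # drop the true class per sample
    rest = logits[keep].reshape(n, c - 1)        # remaining C-1 logits per row
    rest = rest - rest.max(axis=1, keepdims=True)  # numerical stability
    return rest - np.log(np.exp(rest).sum(axis=1, keepdims=True))

def not_true_distillation_loss(student_logits, teacher_logits, labels, temperature=1.0):
    """Assumed form: mean KL(teacher || student) over non-true classes."""
    log_p_t = masked_log_softmax(teacher_logits / temperature, labels)
    log_p_s = masked_log_softmax(student_logits / temperature, labels)
    p_t = np.exp(log_p_t)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=1)
    return float(kl.mean())
```

Because the true-class logit is removed before normalisation, this term is unaffected by how confident either network is about the (possibly noisy) given label; it only compares how the two networks rank the remaining classes.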

Abstract:
In the era of deep learning, video saliency prediction remains a major challenge due to catastrophic forgetting during feature learning. Most prior works employ generative replay strategies to generate pseudo-samples from previous tasks, enabling them to recall the data distribution. However, scaling up generative replay to accommodate class-incremental and task-incremental settings poses challenges, as low-quality generated data can severely deteriorate performance. Additionally, existing advances mainly focus on preserving memory stability to alleviate catastrophic forgetting, but they struggle to adapt flexibly to incremental changes in dynamic scenes. To achieve a better balance between memory stability and learning plasticity, we propose a novel biologically-inspired continual learning (BICL) model tailored to effectively predict human attention in dynamic scenes while mitigating catastrophic forgetting. In particular, inspired by the function of the hippocampus in the human neural system, we design a visual saliency memory bank module to explicitly store and retrieve representative features from previous tasks. Furthermore, drawing inspiration from the Drosophila γMB system, we propose an active forgetting strategy equipped with multiple parallel adaptive learner modules, which can appropriately attenuate old memories in the parameter distribution to enhance learning plasticity for new tasks while ensuring compatibility among the multiple learners. Notably, without compromising the performance on old tasks, our proposed model achieves a better trade-off between memory stability and learning plasticity. Through extensive experiments on several benchmark datasets, our model not only enhances performance in task-incremental settings, but also potentially provides deep insights into neurological adaptive mechanisms.

Abstract:
Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This dataset was curated using a robotic arm that traces a consistent non-linear trajectory, achieving spatial alignment precision under 0.03 mm and temporal alignment with errors under 0.01 s for 90% of the dataset. Based on the dataset, we propose EvLight++, a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios. First, we design a multi-scale holistic fusion branch to integrate structural and textural information from both images and events. To counteract variations in regional illumination and noise, we introduce Signal-to-Noise Ratio (SNR)-guided regional feature selection, enhancing features from high-SNR regions and augmenting those from low-SNR regions by extracting structural information from events. To incorporate temporal information and ensure temporal coherence, we further introduce a recurrent module and a temporal loss in the whole pipeline. Extensive experiments on our dataset and the synthetic SDSD dataset demonstrate that EvLight++ significantly outperforms both single image- and video-based methods by 1.37 dB and 3.71 dB, respectively. To further explore its potential in downstream tasks like semantic segmentation and monocular depth estimation, we extend our datasets by adding pseudo segmentation and depth labels via meticulous annotation efforts with foundation models. Experiments under diverse low-light scenes show that the enhanced results achieve a 15.97% improvement in mIoU for semantic segmentation.

Abstract:
Traffic state prediction based on spatiotemporal data has become a prominent focus in data-driven AI research. While significant progress has been made, most mainstream approaches assume uniform spatial and temporal correlations across conditions and use shared parameters for all scenarios. This simplification overlooks the complexity and heterogeneity inherent in human mobility patterns, often leading to suboptimal predictions. Recently, methods adopting the “decompose, then predict” (DTP) paradigm have gained traction. These methods break down data into smaller, manageable subcomponents, each predicted using dedicated parameters. Although effective in practice, DTP methods face unresolved theoretical questions: What type of decomposition truly makes subcomponents more manageable than the original data? To address this, we present an information theory-based analysis that derives sufficient conditions for a decomposition algorithm to reduce data-induced prediction errors. These conditions suggest that an effective algorithm should ensure decomposed components are as independent as possible, a principle we term the Component Independence Principle. Guided by this principle, we introduce the Theory-guided Graph Decomposition Learning (TGDL) framework, which decomposes graph-based multivariate time series data into approximately independent subgraph components that are easier to predict than the original data. Moreover, TGDL is a portable framework that can be integrated into any graph-based traffic prediction model to improve its predictive performance. Extensive experiments on four public datasets demonstrate the effectiveness of our approach. With a solid theoretical foundation, our TGDL enhances the performance of diverse traffic prediction models, yielding an average improvement of 19.37% across experiments.

Abstract:
Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information related to boundary components). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model’s ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework’s capability of consistent boundary refinement for coarse results from diverse discriminative architectures.

Abstract:
Autonomous vehicles, open-world robots, and other automated systems rely on accurate, efficient perception modules for real-time object detection. Although high-precision models improve reliability, their processing time and computational overhead can hinder real-time performance and raise safety concerns. This paper introduces an Edge-based Mixture-of-Experts Optimal Sensing (EMOS) System that addresses the challenge of jointly achieving accuracy, low latency, and scene adaptivity, as demonstrated in open-world autonomous driving scenarios. Algorithmically, EMOS fuses multimodal sensor streams via an Adaptive Multimodal Data Bridge and uses a scenario-aware MoE switch to activate only a complementary set of specialized experts as needed. The proposed hierarchical backpropagation and a multiscale pooling layer let model capacity scale with real-world demand complexity. System-wise, an edge-optimized runtime with accelerator-aware scheduling (e.g., ONNX/TensorRT), zero-copy buffering, and overlapped I/O–compute enforces explicit latency/accuracy budgets across diverse driving conditions. Experimental results establish EMOS as the new state of the art: on KITTI, it increases average AP by 3.17% while running 2.6× faster on Nvidia Jetson. On nuScenes, it improves accuracy by 0.2% mAP and 0.5% NDS, with 34% fewer parameters and a 15.35× speedup on Nvidia Jetson. Leveraging multimodal data and intelligent expert cooperation, EMOS delivers an accurate, efficient, and edge-adaptive perception system for autonomous vehicles, thereby ensuring robust, timely responses in real-world scenarios.

Abstract:
Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well prealigned, and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
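
The remark that Euclidean distances "lack commutativity" can be made concrete with a toy example. The sketch below is not the IDEM metric (which the abstract does not specify); it only shows that a nearest-neighbour RMSE, of the kind used in ICP-style objectives, depends on which cloud is treated as fixed. The helper name `nn_rmse` and the toy clouds are illustrative assumptions.

```python
import numpy as np

def nn_rmse(source, target):
    """RMSE of the distance from each source point to its nearest target point."""
    diff = source[:, None, :] - target[None, :, :]   # pairwise differences
    d2 = (diff ** 2).sum(axis=2)                     # squared distances, shape (n_src, n_tgt)
    return float(np.sqrt(d2.min(axis=1).mean()))

# Cloud b is cloud a plus one extra, far-away point (e.g. noise or non-overlap).
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.vstack([a, [[5.0, 5.0, 5.0]]])

forward = nn_rmse(a, b)    # every point of a has an exact match in b
backward = nn_rmse(b, a)   # the extra point in b has no close match in a
print(forward, backward)
```

Swapping the roles of the two clouds changes the score even though their relative pose is identical, which is the dependence on a fixed cloud that the abstract says the proposed metric avoids.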

Abstract:
In this article, we explore the capability of both the Adjacency Spectral Embedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded pseudo-clique structure in the random dot product graph setting. In both theory and experiments, we demonstrate that, in the absence of additional clean (i.e., without the implanted pseudo-clique) network data, this pairing of model and methods can yield worse results than the best existing spectral clique detection methods. However, these methods can be used to asymptotically localize the pseudo-cliques if additional clean, independent network data is provided. This demonstrates at once the methods’ potential ability/inability to capture modestly sized pseudo-cliques and the methods’ robustness to the model contamination giving rise to the pseudo-clique structure. To further enrich our analysis, we also consider the Variational Graph Auto-Encoder (VGAE) model in our simulation and real data experiments.
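
For readers unfamiliar with the ASE, it is commonly defined as the top-d eigenvectors of the adjacency matrix scaled by the square roots of the corresponding eigenvalue magnitudes. The NumPy sketch below follows that standard definition; the function name and the toy rank-1 matrix are illustrative and are not taken from the paper.

```python
import numpy as np

def adjacency_spectral_embedding(adj, dim):
    """Embed nodes using the dim largest-magnitude eigenpairs of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(adj)                   # eigen-decomposition (symmetric input)
    order = np.argsort(np.abs(vals))[::-1][:dim]       # indices of the top-|eigenvalue| pairs
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

# Sanity check on a noise-free rank-1 "expected adjacency" P = x x^T:
# the one-dimensional embedding recovers the latent positions x up to sign.
x = np.array([0.3, 0.5, 0.7, 0.9])
p = np.outer(x, x)
emb = adjacency_spectral_embedding(p, 1)
print(np.abs(emb[:, 0]))
```

In the random dot product graph setting, the same function would be applied to an observed 0/1 adjacency matrix, in which case the embedding only estimates the latent positions, and only up to an orthogonal transformation.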