TNNLS2026

Abstract:
We address the problem of machine unlearning in neural information retrieval (IR), introducing a novel task termed neural machine unranking (NuMuR). This problem is motivated by growing demands for data privacy compliance and selective information removal in neural IR systems. Existing task-agnostic or model-agnostic unlearning approaches, primarily designed for classification tasks, are suboptimal for NuMuR due to two core challenges: 1) neural rankers output unnormalised relevance scores rather than probability distributions, limiting the effectiveness of traditional teacher–student distillation frameworks and 2) entangled data scenarios, where queries and documents appear simultaneously across both forget and retain sets, may degrade retention performance in existing methods. To address these issues, we propose contrastive and consistent loss (CoCoL), a dual-objective framework. CoCoL comprises 1) a contrastive loss that reduces relevance scores on forget sets while maintaining performance on entangled samples and 2) a consistent loss that preserves accuracy on the retain set. Extensive experiments on two datasets, across four neural IR models, demonstrate that CoCoL achieves substantial forgetting with minimal retention and generalization performance loss. CoCoL facilitates more effective and controllable data removal than existing techniques.

Abstract:
Compute-in-memory (CIM) systems implemented with resistive random access memory (RRAM) crossbars are a promising approach for accelerating deep neural network (DNN) computations. However, RRAM-based CIM systems are susceptible to computational errors: unlike digital computation, analog computing carries the risk of errors accumulating throughout the computation. Among the techniques proposed to deal with errors in CIM systems, training methods that produce noise-tolerant CIM-based DNN (CIM-DNN) models insensitive to weight variations are the most promising because of their simplicity and low implementation cost. Although variation-aware training (VAT) has produced promising empirical results, yielding DNN models with high tolerance to device nonidealities, there remains a significant gap in understanding the noise tolerance properties of VAT-trained CIM-DNNs and how to improve VAT based on that understanding. This work explores the fundamental properties of noise tolerance in DNNs for CIM systems. Our contributions are threefold. First, we identify factors that influence DNNs’ performance under noise through a series of training experiments. Second, we offer both theoretical insights and practical demonstrations of how VAT yields solutions with heightened resistance to noise during training. Finally, leveraging these insights, we provide guidelines for implementing VAT to obtain optimal noise tolerance in CIM-DNNs. Our objective is to establish a theoretical foundation for VAT and, building on it, to offer general and straightforward guidelines for DNN training, covering factors such as optimizer hyperparameter choices and weight clamping.
Ultimately, we aim to contribute to general and practical solutions for the development of reliable CIM systems. Our studies focus on analyzing how noise injection and different optimizers affect the convergence dynamics during training so that VAT reaches a more noise-tolerant solution. Future studies could incorporate advanced regularizers reflecting the flatness of the solution into the cost function, which may be necessary for models beyond the DNNs studied here. Together, these techniques can potentially lead to practical solutions for the development of reliable CIM systems.
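To make the two ingredients named above (noise injection and weight clamping) concrete, the following is a minimal sketch rather than the authors' procedure. It assumes a toy linear model, zero-mean Gaussian weight noise scaled by the largest weight magnitude, and plain SGD; the function names `clamp_weights`, `inject_noise`, and `vat_step` are hypothetical.

```python
import random


def clamp_weights(weights, limit):
    """Clip every weight into [-limit, limit] (weight clamping)."""
    return [max(-limit, min(limit, w)) for w in weights]


def inject_noise(weights, rel_sigma, rng):
    """Return a noisy copy of the weights for one forward pass.

    Assumption: device variation is modelled as zero-mean Gaussian noise
    whose standard deviation is rel_sigma times the largest |weight|.
    """
    scale = rel_sigma * max(abs(w) for w in weights)
    return [w + rng.gauss(0.0, scale) for w in weights]


def vat_step(weights, x, y, lr, rel_sigma, limit, rng):
    """One SGD step on a linear model, with noise injected in the forward pass."""
    noisy = inject_noise(weights, rel_sigma, rng)
    pred = sum(w * xi for w, xi in zip(noisy, x))
    err = pred - y
    # Gradient of 0.5 * err**2 evaluated at the noisy weights,
    # applied to the clean (stored) weights.
    updated = [w - lr * err * xi for w, xi in zip(weights, x)]
    return clamp_weights(updated, limit)
```

With `rel_sigma=0.0` the step reduces to ordinary clamped SGD, which gives a convenient baseline when comparing noise levels.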

Abstract:
Audio–visual event localization (AVEL) aims to recognize events in videos by associating audio–visual information. However, the events in existing AVEL tasks are usually coarse-grained. Finer-grained events sometimes need to be distinguished, especially in expert-level applications or rich-content-generation studies, but they are more difficult to detect and distinguish than coarse-grained events. To address this problem, we study a new setting of fine-grained AVEL, from dataset to method. First, we construct the first fine-grained audio–visual event dataset, called IT-AVE, built from videos of musical instrument playing and containing 13k video clips and over 52k audio–visual events. All events are labeled by professional music practitioners, and the event categories are all derived from playing techniques, which are fine-grained with little interclass variation. Next, we design a new fine-grained event localization method, the spatial–temporal video event detector (SVED), which targets the challenge that fine-grained events are harder to perceive and more easily disturbed. Finally, we conduct extensive experiments on the proposed IT-AVE dataset and on fine-grained versions of two existing related datasets: UnAV-22, derived from UnAV-100, and FineAction-AV, derived from FineAction. Experimental results demonstrate the effectiveness of our method. We hope that this work will contribute to the exploration of an integrated understanding of audio–visual videos.

Abstract:
Policy optimization methods are promising for tackling high-complexity reinforcement learning (RL) tasks with multiple agents. In this article, we derive a general trust region for policy optimization methods by considering the effect of subpolicy combinations among agents in multiagent environments. Based on this trust region, we propose an inductive objective to train the policy function, which ensures that agents learn monotonically improving policies. Furthermore, we observe that the policy always updates very weakly before falling into a local optimum. To address this, we introduce a policy-distance cost into the inductive objective to strengthen the agents' motivation to explore new policies. This approach strikes a balance during training: the policy update step size remains within the trust region, which prevents excessive updates, while the policy avoids getting stuck in local optima. Simulations on wind farm (WF) control tasks and two multiagent benchmarks demonstrate the high performance of the proposed multiagent inductive policy optimization (MAIPO) method.

Abstract:
Deep model fusion/merging is an emerging technique that integrates parameters or predictions from multiple deep learning (DL) models into a unified framework. It combines the abilities of different models to compensate for the biases and errors of an individual model, improving overall performance. However, deep model fusion, especially on large-scale DL models such as large language models (LLMs) and foundation models, faces several challenges, including high computational cost and interference between heterogeneous models. To better understand the field, we present a comprehensive survey summarizing recent progress. We group existing model fusion methods into four categories: 1) “weight average” (WA) averages the parameters of multiple models to obtain results closer to the optimal solution; 2) because direct averaging of models often yields suboptimal results, “mode connectivity” connects networks via paths of nonincreasing loss in weight space before fusion, and along these paths the initial models are transformed into forms with consistent functions and better fusion effects; 3) similarly, for models with poor direct fusion results, “alignment” matches the corresponding units of the models before merging them, thus fully exploiting the correspondences between the models; and 4) beyond these parameter fusion methods, “ensemble learning” fuses the outputs of multiple models at the inference stage to improve the accuracy and robustness of networks. In addition, we analyze the challenges of deep model fusion and highlight possible future research directions.
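The first category, weight averaging, is simple enough to illustrate directly. The sketch below is a generic illustration and not taken from the survey: it assumes every model has the same architecture and that parameters are stored as flat lists of floats keyed by name; `average_weights` and its `coeffs` argument are hypothetical names.

```python
def average_weights(state_dicts, coeffs=None):
    """Average the parameters of models that share one architecture.

    state_dicts: list of dicts mapping parameter name -> list of floats.
    coeffs: optional per-model mixing weights; defaults to a uniform average.
    """
    n = len(state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(c * sd[name][i] for c, sd in zip(coeffs, state_dicts))
            for i in range(len(reference))
        ]
    return merged


model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_weights([model_a, model_b])
```

Averaging like this presupposes that the models' units already correspond to one another, which is the gap the mode connectivity and alignment categories aim to close.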

Abstract:
A universal kernel is constructed whose sections approximate any causal and time-invariant filter in the fading memory category with inputs and outputs in a finite-dimensional Euclidean space. This kernel is built using the reservoir functional associated with a state-space representation of the Volterra series expansion available for any analytic fading memory filter, and it is hence called the Volterra reservoir kernel. Even though the state-space representation and the corresponding reservoir feature map are defined on an infinite-dimensional tensor algebra space, the kernel map is characterized by explicit recursions that are readily computable for specific datasets when employed in estimation problems using the representer theorem. The empirical performance of the Volterra reservoir kernel is showcased and compared to other standard static and sequential kernels in a multidimensional and highly nonlinear learning task for the conditional covariances of financial asset returns.

Abstract:
Representation learning is a key area in machine learning and deep learning, focusing on extracting meaningful features to support downstream tasks such as classification and clustering. Current mainstream representation learning methods primarily rely on nonlinear data mining techniques such as kernel methods and deep neural networks (DNNs) to extract abstract knowledge from complex datasets. However, most of them are “black-box” methods whose learning process lacks transparency and interpretability, which constrains their practical utility. To this end, this article introduces a novel representation learning method called fuzzy rule-based differentiable representation learning (FRDRL), which is grounded in an interpretable fuzzy rule-based model. Specifically, it is built upon the Takagi–Sugeno–Kang fuzzy system (TSK-FS) to map input data to a high-dimensional fuzzy feature space through the antecedent part of the TSK-FS. Subsequently, a novel differentiable optimization method is proposed for learning in the consequent part, which preserves interpretability and transparency while effectively capturing nonlinear relationships in the data. By retaining the essence of traditional optimization and parameterizing key components as differentiable modules, the method improves performance without sacrificing interpretability. Moreover, a second-order geometry preservation strategy is incorporated to further improve robustness. Extensive evaluations conducted on various benchmark datasets validate the superiority of the proposed method. The source codes are available at https://github.com/BBKing49/FEDRL

Abstract:
Patents are crucial for protecting technological innovations and fostering competitive advancements in industry. Patent prediction, a novel task in the field of patent mining, aims to forecast future technological trends, providing valuable insights for strategic planning and innovation in the industry. However, the complexity of patent data and the diversity of technological fields make effective patent prediction a significant challenge. Existing methods for predicting scientific research trends struggle to effectively model patent structures and capture dependencies between patents, resulting in suboptimal patent trend predictions. In this article, we propose a novel method, patent prediction with prompt learning (P3L), to achieve effective and accurate prediction of future patent developments based on a pretrained language model (PLM). P3L includes a patent similarity path extraction module to extract multiple patent development paths from extensive datasets. Following this, we design a patent prompt learning approach that integrates patent development paths, keywords, and patent similarities into the prompts. To mitigate potential noise introduced by this integration, we introduce an attention mask matrix for prompt denoising. Finally, we introduce three patent datasets with rich structures, and conduct extensive experiments on these datasets as well as a public dataset, demonstrating the superiority of the proposed method. The dataset and code have been made publicly available at https://github.com/AllminerLab/P3L

Affiliations: School of Big Data and Software Engineering, Chongqing University, Chongqing, China; School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong; Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; School of Information Technology, Halmstad University, Halmstad, Sweden; School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong

Abstract:
With large-scale language models demonstrating superior capabilities in a wide range of downstream natural language processing tasks, the future trajectory of research in the field of text categorization faces increasing uncertainty. In this evolving paradigm of open-ended language modeling, where task delimitations are increasingly blurred, a pressing question arises: to what extent has text classification advanced under the full potential of large language models (LLMs)? To address this question, we introduce recurrent generative pre-trained transformer (RGPT), an adaptive boosting framework designed to craft a dedicated LLM for text classification. RGPT constructs a sequence of base learners by dynamically modulating the training data distribution and iteratively fine-tuning LLMs. These base learners are then progressively integrated, leveraging historical prediction trajectories to form a highly specialized text classification model. Extensive empirical evaluations demonstrate that RGPT surpasses eight state-of-the-art pretrained language models and seven cutting-edge LLMs across four benchmark datasets, achieving an average performance gain of 2.90%.

Abstract:
Proximal policy optimization (PPO) is one of the most popular state-of-the-art on-policy algorithms and has become a standard baseline in modern reinforcement learning, with applications in numerous fields. Though it delivers stable performance with theoretical policy improvement guarantees, high variance and high sample complexity remain critical challenges in on-policy algorithms. To alleviate these issues, we propose a hybrid-policy PPO (HP3O), which utilizes a trajectory replay buffer to make efficient use of trajectories generated by recent policies. In particular, the buffer applies the “first in, first out” (FIFO) strategy so as to keep only the recent trajectories and thereby attenuate the data distribution drift. A batch consisting of the trajectory with the best return and other trajectories randomly sampled from the buffer is used for updating the policy networks. This strategy helps the agent improve on top of its most recent best performance and, in turn, reduces variance empirically. We theoretically construct the policy improvement guarantees for the proposed algorithm. HP3O is validated and compared against several baseline algorithms using multiple continuous control environments. Our code is available at https://anonymous.4open.science/r/HP30-EB61/HP3O_train.py
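The buffer behaviour described above (FIFO eviction, plus a batch made of the best-return trajectory and random others) can be sketched as follows. This is an illustrative reading of the abstract, not the released implementation; the class name `TrajectoryBuffer`, its methods, and the choice to store `(trajectory, return)` pairs are assumptions.

```python
import random
from collections import deque


class TrajectoryBuffer:
    """FIFO buffer of (trajectory, return) pairs from recent policies."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry once capacity is reached
        self._buf = deque(maxlen=capacity)

    def __len__(self):
        return len(self._buf)

    def add(self, trajectory, ret):
        self._buf.append((trajectory, ret))

    def sample(self, batch_size, rng=None):
        """Return the best-return trajectory plus random others from the buffer."""
        if not self._buf:
            raise ValueError("cannot sample from an empty buffer")
        rng = rng or random
        items = list(self._buf)
        best = max(range(len(items)), key=lambda i: items[i][1])
        others = [i for i in range(len(items)) if i != best]
        k = min(batch_size - 1, len(others))
        chosen = rng.sample(others, k)
        return [items[best]] + [items[i] for i in chosen]


buf = TrajectoryBuffer(capacity=3)
for name, ret in [("t0", 1.0), ("t1", 5.0), ("t2", 2.0), ("t3", 3.0)]:
    buf.add(name, ret)  # "t0" is evicted when "t3" arrives
batch = buf.sample(2, rng=random.Random(0))
```

Because the best trajectory is itself subject to FIFO eviction, the batch always anchors on the best of the recent policies rather than on a stale all-time best.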

Abstract:
Weakly supervised video anomaly detection (WSVAD) aims at predicting frame-level anomaly scores by modeling training videos with video-level annotations. The category names of abnormal events contain high-level knowledge abstracted by humans about abnormalities, which is of great help in identifying abnormal events. To utilize the knowledge implicit in category names, based on the visual-language pretraining model, we introduce a learnable abnormal prompt from three aspects: learnable domain prompt, learnable category prompt, and nonlearnable category definition prompt. Based on the learnable abnormal prompt, we propose a novel fine-grained WSVAD method: PromptVAD, which exploits a learnable abnormal prompt to reduce the semantic gap between visual images and anomaly categories. Through a similarity measure and our proposed coarse-grained two-class prompt module, our PromptVAD jointly learns coarse-grained and fine-grained VAD. Extensive experimental results on the ShanghaiTech, University of Central Florida (UCF)-Crime, and XD-Violence datasets show that our method achieves state-of-the-art performance. Specifically, our method achieves an area under the curve (AUC) of 88.62% on the UCF-Crime dataset.

Abstract:
Multimodal image registration aims to spatially align images from different modalities at the pixel level. However, due to the nonlinear relationship of radiation intensities caused by different imaging modalities, achieving high accuracy in multimodal image registration presents a significant challenge. Additionally, the presence of both global transformations (i.e., large-scale rigid affine transformations) and local distortions (i.e., small-scale nonrigid deformations) between paired images further complicates the registration process. This article addresses the challenge resulting from modality differences through modality distillation. Specifically, a teacher (i.e., a homomodal image registration model) is trained to guide the student (i.e., a multimodal image registration model). In addition, this article simultaneously aligns large-scale rigid and small-scale nonrigid deformations by predicting deformation flow from both global and local features, thereby achieving high-precision registration. Furthermore, the proposed method incorporates a deformation mask during training to mitigate the negative impact of black edges in the obtained registration results on model performance. Experimental results demonstrate that the proposed method delivers state-of-the-art registration accuracy across various multimodal datasets, with ablation studies confirming the effectiveness of each component. The codes will be available at https://github.com/2351056918/Multimodality-Image-Registration-with-Modailty-Distillation

Abstract:
Neural collapse (NC) is a simple and symmetric phenomenon for deep neural networks (DNNs) at the terminal phase of training, where the last-layer features collapse to their class means and form a simplex equiangular tight frame (ETF) aligning with the classifier vectors. However, the relationship of the last-layer features to the data and intermediate layers during training remains unexplored. To this end, we characterize the geometry of intermediate layers of residual neural network (ResNet) and propose a novel conjecture, progressive feedforward collapse (PFC), claiming the degree of collapse increases during the forward propagation of DNNs. We derive a transparent model for the well-trained ResNet based on the principle that ResNet with weight decay approximates the geodesic curve in the Wasserstein space at the terminal phase. The metrics of PFC indeed monotonically decrease across depth on various datasets. We propose a new surrogate model, multilayer unconstrained feature model (MUFM), connecting intermediate layers by an optimal transport regularizer. The optimal solution of MUFM is inconsistent with NC but is more concentrated relative to the input data. Overall, this study extends NC to PFC to model the collapse phenomenon of intermediate layers and its dependence on the input data, shedding light on the theoretical understanding of ResNet in classification problems.

Abstract:
Compared with frequent pattern mining, sequential pattern mining emphasizes the temporal aspect and finds broad applications across various fields. However, numerous studies treat temporal events as single time points, neglecting their durations. Time-interval-related pattern (TIRP) mining was introduced to address this issue and has been applied to healthcare analytics, stock prediction, etc. Typically, mining all patterns is not only computationally challenging for accurate forecasting but also resource-intensive in terms of time and memory. Targeting the extraction of TIRPs based on specific criteria can improve data analysis efficiency and better align with customer preferences. Therefore, this article proposes a novel algorithm called TaTIRP to discover targeted TIRPs. In addition, we develop multiple pruning strategies to eliminate redundant extension operations, thereby enhancing performance on large-scale datasets. Finally, we conduct experiments on various real-world and synthetic datasets to validate the accuracy and efficiency of the proposed algorithm.

Affiliations: College of Electronic and Information Engineering, Tongji University, Shanghai, China; School of Future Science and Engineering, Soochow University, Suzhou, China; DataLab: Data Science and Informatics, University of California at Davis, Davis, CA, USA; College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China; Department of Computer Science, The University of Hong Kong, Hong Kong, China; School of Data Science and Engineering, East China Normal University, Shanghai, China; College of Future Information Technology, Fudan University, Shanghai, China

Abstract:
Video anomaly detection (VAD) aims to automatically analyze spatiotemporal patterns in surveillance videos collected from open spaces to detect anomalous events that may cause harm, such as fighting, stealing, and car accidents. However, vision-based surveillance systems such as closed-circuit television (CCTV) often capture personally identifiable information. The lack of transparency and interpretability in video transmission and usage raises public concerns about privacy and ethics, limiting the real-world application of VAD. Recently, researchers have focused on privacy concerns in VAD by conducting systematic studies from various perspectives, including data, features, and systems, making privacy-preserving VAD (P2VAD) a hotspot in the AI community. However, the current research in P2VAD is fragmented, and prior reviews have mostly focused on methods using RGB sequences, overlooking privacy leakage and appearance bias considerations. To address this gap, this article is the first to systematically review the progress of P2VAD, defining its scope and providing an intuitive taxonomy. We outline the basic assumptions, learning frameworks, and optimization objectives of various approaches, analyzing their strengths, weaknesses, and potential correlations. In addition, we provide open access to research resources such as benchmark datasets and available code. Finally, we discuss key challenges and future opportunities from the perspectives of AI development and P2VAD deployment, aiming to guide future work in the field.

Abstract:
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone–Weierstrass theorem, we propose expandable residual approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher’s representation through a divide-and-conquer approach. Specifically, ERA employs a multibranched residual network (MBRNet) to implement this residual knowledge decomposition. Additionally, a teacher weight integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher’s head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, and achieves leading performance across computer vision tasks.

Abstract:
Matrix factorization (MF) is a fundamental problem in machine learning and is commonly used as a feature learning method in various fields. For complex data involving spatiotemporal interactions, MF that only handles 2-D data disrupts spatial dependence or temporal dynamics, failing to effectively couple spatial information with temporal factors. According to the Markov chain principle, the spatial information at the present time is related to the spatial state at the previous time. We propose a spatial–temporal diffusion model for MF (STDMF), which uses graph diffusion to couple spatial–temporal information. MF is then used to learn the joint features of the data and the spatial–temporal diffusion graph. Specifically, STDMF utilizes graph diffusion with physical laws to generate spatial–temporal structure information. It obtains the underlying core structure of complex systems from a global perspective, which enhances the generalization ability of MF on noisy time-series data. To learn the lowest rank subspace of MF in time-series data, STDMF uses structural learning to constrain the rank of the learned features. Finally, STDMF is applied to clustering and anomaly detection on dynamic graphs. The effectiveness of this method is verified by extensive experiments, especially on noisy data.

Abstract:
This article advances the theoretical foundations of stochastic configuration networks (SCNs) by rigorously analyzing their convergence properties, approximation guarantees, and the limitations of nonadaptive randomized methods. We introduce a principled objective function that aligns incremental training with orthogonal projection, ensuring maximal residual reduction at each iteration without recomputing output weights. Under this formulation, we derive a novel necessary and sufficient condition for strong convergence in Hilbert spaces and establish sufficient conditions for uniform geometric convergence, offering the first theoretical justification of the SCN residual constraint. To assess the feasibility of unguided random initialization, we present a probabilistic analysis showing that even small support shifts markedly reduce the likelihood of sampling effective nodes in high-dimensional settings, thereby highlighting the necessity of adaptive refinement in the sampling distribution. Motivated by these insights, we propose greedy SCNs (GSCNs) and two optimized variants—Newton–Raphson GSCN (NR-GSCN) and particle swarm optimization GSCN (PSO-GSCN)—that incorporate Newton–Raphson refinement and particle swarm-based exploration to improve node selection. Empirical results on synthetic and real-world datasets demonstrate that the proposed methods achieve faster convergence, better approximation accuracy, and more compact architectures compared to existing SCN training schemes. Collectively, this work establishes a rigorous theoretical and algorithmic framework for SCNs, laying out a principled foundation for subsequent developments in the field of randomized neural network (NN) training.

Abstract:
Computing matrix gradients has become a key aspect of modern signal processing and machine learning, as the recent use of matrix neural networks requires matrix backpropagation. In this field, two main methods exist to calculate the gradient of matrix functions for symmetric positive definite (SPD) matrices, namely, the Daleckiǐ–Kreǐn/Bhatia formula and the Ionescu method. However, the literature on these formulas appears to contain a few errors. This brief aims to demonstrate each of these formulas in a self-contained and unified framework, to prove theoretically their equivalence, and to clarify inaccurate results in the literature. A numerical comparison of both methods is also provided in terms of computational speed and numerical stability to show the superiority of the Daleckiǐ–Kreǐn/Bhatia approach. We also extend the matrix gradient to the general case of diagonalizable matrices. Convincing results with the two backpropagation methods are shown on the EEG-based BCI competition dataset with the implementation of an SPDNet, yielding around 80% accuracy for one subject. The Daleckiǐ–Kreǐn/Bhatia formula achieves an 8% time gain during training and handles degenerate cases.
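For orientation, the Daleckiǐ–Kreǐn formula in its standard textbook form is reproduced below; the notation ($U$, $\Lambda$, $L$, $H$, $\ell$) is chosen here for illustration and may differ from the brief's. The second case of the Loewner matrix is what covers repeated (degenerate) eigenvalues.

```latex
% Spectral decomposition of the SPD input and the induced matrix function
X = U \Lambda U^{\top}, \qquad
f(X) = U f(\Lambda) U^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)

% Loewner matrix of first divided differences
L_{ij} =
\begin{cases}
\dfrac{f(\lambda_i) - f(\lambda_j)}{\lambda_i - \lambda_j}, & \lambda_i \neq \lambda_j, \\[2ex]
f'(\lambda_i), & \lambda_i = \lambda_j
\end{cases}

% Directional derivative along a symmetric perturbation H
Df(X)[H] = U \left( L \odot \left( U^{\top} H U \right) \right) U^{\top}

% Backpropagation through Y = f(X) for a symmetric upstream gradient
\frac{\partial \ell}{\partial X}
  = U \left( L \odot \left( U^{\top} \frac{\partial \ell}{\partial Y} U \right) \right) U^{\top}
```

Here $\odot$ denotes the elementwise (Hadamard) product. The backpropagation line follows because $L$ is symmetric, so the derivative map is self-adjoint with respect to the trace inner product.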

Abstract:
Semantic image editing involves inpainting pixels guided by a semantic map. This is a challenging task, as the inpainted regions must both align harmoniously with the surrounding context and strictly adhere to the semantic constraints. Most prior methods approach this by attempting to encode all necessary information from the erased regions alone. However, when adding new objects—such as a car—to a scene, their style often cannot be inferred solely from the surrounding context. On the other hand, the models that can output diverse generations struggle to output images that have seamless boundaries between the generated and unerased parts. In this work, we propose a framework that can encode visible and partially visible objects with a novel mechanism to achieve consistency in the style encoding and final generations. We extensively compare with previous conditional image generation and semantic image editing algorithms. Our extensive experiments show that our method significantly improves over the state-of-the-art. Our method not only achieves better quantitative results but also provides diverse results. Demo and code will be released.

Abstract:
Stochastic games form the foundational mathematical framework for describing multiagent interactions and underpin the theoretical foundations of multiagent reinforcement learning (MARL) and optimal decision making. However, previous research has typically focused on either two-agent settings or large-scale well-mixed agent populations, where the considered interaction scenarios were far from realistic. In this article, we consider structured populations where agents can interact with immediate neighbors. By using the pair-approximation method, we develop a new dynamical model to describe the Q-learning dynamics in stochastic games on regular graphs. Through comparisons with agent-based simulation results, we validate the accuracy of our dynamical model across various stochastic games, population structures, and algorithm parameters. Our research thus provides both qualitative and quantitative insights into the effects of state transition rules and graph topologies on population dynamics. In particular, we show that, under certain conditions, state transitions can significantly promote the evolution of cooperation in social dilemmas. We also explore the effects of agent degree on cooperation, and unlike previous findings, we show that it can have either positive or negative implications for cooperation depending on the transition rules.

Abstract:
Deep models, characterized by complex structures and end-to-end optimization, have proved effective in providing decision support based on real-world data. However, the lack of transparency in their decision-making process and the difficulty in interpreting the role of individual neurons limit their practical applicability in many critical and sensitive domains. Inspired by the parallels between neural networks and ensemble models, where performance is achieved through the collaboration of multiple weak learners, this article presents a novel perspective that reframes neural networks as hierarchical ensembles. We propose the hierarchical backpropagated ensemble (HBE) model, wherein each neuron functions both as a base learner and as part of an ensemble of preceding neurons. This framework applies ensemble learning techniques to neural networks, allowing each neuron to focus on specific subtasks while progressively constructing a network that meets global objectives. Experimental results on real-world data show that this hierarchical structure enhances the effectiveness of traditional ensemble models, and the ensemble-based explanations offer improved initialization and dynamically adjustable network structures, leading to more efficient training.

Abstract:
Model merging has become a popular approach for combining individual models into a single model that inherits their capabilities and achieves improved performance. However, its success has not yet been transferred to semantic segmentation tasks due to two major challenges: 1) current model merging methods predominantly employ static merging strategies with fixed coefficients, limiting their ability to incorporate task-specific prior knowledge and 2) semantic segmentation faces large distribution shifts across multiple domains, causing negative transfer in the merged model. In this article, we propose an effective model merging approach for semantic segmentation, named M2Seg. To dynamically integrate relevant priors based on the input data, we propose a novel SVD-structured MoE module for adaptive merging. To address the severe distribution shifts, we further introduce a test-time dynamic calibration function designed to minimize discrepancies between training and test statistics. Additionally, historical information is leveraged to refine activation statistics during inference. Recognizing that unreliable data can negatively impact update directions, we develop a pixel-efficient entropy minimization mechanism to filter unstable pixels, thus stabilizing the merging process and enhancing segmentation performance. Extensive experiments on both seen and unseen semantic segmentation tasks demonstrate the superior effectiveness and generalization capability of our proposed method. The source code and pretrained checkpoints are available at https://github.com/cht619/MMSeg

Abstract:
Imitation learning offers a flexible approach for robot skill acquisition, enabling robots to learn complex tasks directly from demonstrations. However, most existing methods require a large number of demonstrations, whereas humans typically only need one or a few demonstrations. This discrepancy results in significant time consumption for data collection. Furthermore, these methods often assume that test scenarios will always be identical to the demonstration, which can lead to substantial performance degradation when facing novel scenarios, such as manipulating objects from the same category but with different shapes and sizes, or encountering object collisions during manipulation. To address these challenges, we propose a generalized multistage manipulation network for category-level robot assembly tasks. This network allows a robot to learn a multistage screw–nut assembly task from a single demonstration and generalize to new object instances with varying shapes and sizes. Specifically, the network uses category-level pose estimation to extract manipulation trajectories from the demonstration and applies manipulation-pose generalization to transfer these trajectories to novel instances. In addition, real-time action correction adjusts the trajectory based on real-time force feedback, enabling the robot to adapt to unexpected collisions during execution. We validate our method through experiments in both simulation and real-world environments, verifying its effectiveness and flexibility.

Affiliations: Chongqing Key Laboratory of Bio-Perception and Multimodal Intelligent Information Processing and the School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China; School of Data Science, Lingnan University, Hong Kong, China; Chongqing Key Laboratory of Computational Intelligence, Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing, China; National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing, China

Abstract:
The state space model (SSM) is a mathematical model used to describe and analyze the behavior of dynamic systems. This model has witnessed numerous applications in several fields, including control theory, signal processing, economics, and machine learning. In the field of deep learning, SSMs are used to process sequence data in tasks such as time series analysis, natural language processing (NLP), and video understanding. By mapping sequence data to state space, long-term dependencies in the data can be better captured. Modern SSMs have shown strong representational capabilities in NLP, especially in long sequence modeling, while maintaining linear time complexity. Building on the latest SSMs, Mamba merges time-varying parameters into SSMs for efficient training and inference. Given its impressive efficiency and strong long-range dependency modeling capability, Mamba is expected to become a new AI architecture that may be capable of surpassing Transformer. Recently, a number of works have attempted to study the potential of Mamba in various fields, such as general vision, multimodal learning, medical image analysis, and remote sensing image analysis, by extending Mamba from the natural language domain to the visual domain. To fully understand Mamba in the visual domain, we conduct a comprehensive survey and present a taxonomy study. This survey focuses on Mamba’s application to a variety of visual tasks and data types, and discusses its predecessors, recent advances, and far-reaching impact on a wide range of domains.

Abstract:
In this article, we introduce an approach called coupled filters decomposition, which builds on the key observation that redundancy exists among filters in a convolutional layer, meaning that similar filters can produce partially overlapping outputs. Leveraging this insight, we propose a joint decomposition of filters using coupled tensor decompositions, specifically coupled canonical polyadic decomposition (CPD), which enables the sharing of a common factor matrix across similar filters. This joint factorization not only reduces the number of parameters but also lowers computational complexity by eliminating redundant computations. To further improve efficiency, we first cluster the filters before decomposition. The grouping relies on a custom metric based on the subspace spanned by the shared-mode factor. Within each group, the coupling constraint is less restrictive. Extensive experiments across various architectures, datasets, and tasks validate the effectiveness of our method, demonstrating its competitive performance compared to state-of-the-art model compression techniques. Our code is available for research purposes at https://codec-ai.github.io/

Abstract:
The reliability of a learning model is key to the successful deployment of machine learning in various applications. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. It has been shown that adversarial training can improve the robustness of the hypothesis. However, this improvement usually comes at the cost of decreased performance on natural samples. Hence, it has been suggested that robustness and accuracy of a hypothesis are at odds with each other. In this article, we put forth the alternative proposal that it is the continuity of a hypothesis that is incompatible with its robustness and accuracy in many of these scenarios. In other words, a continuous function cannot effectively learn the optimal robust hypothesis. We introduce a framework for a rigorous study of harmonic and holomorphic hypotheses in learning theory terms and provide empirical evidence that continuous hypotheses do not perform as well as discontinuous hypotheses in some common machine learning tasks. From a practical point of view, our results suggest that a robust and accurate learning rule would train different continuous hypotheses for different regions of the domain. From a theoretical perspective, our analysis explains the adversarial examples phenomenon in these situations as a conflict between the continuity of a sequence of functions and its uniform convergence to a discontinuous function. Given that many of the contemporary machine learning models are continuous functions, it is important to theoretically study the continuity of robust and accurate classifiers as it is consequential in their construction, analysis, and evaluation. It is important in their construction as discontinuities can lead to artifacts in their approximations. It is necessary to analyze these classifiers as they carry topological information about their domain. It is critical in their evaluation because it reveals the ways that performance metrics, such as the accuracy score, can fail in assessing these classifiers.

Abstract:
This article presents a novel reflection removal algorithm that integrates a flash-based optical cue into a diffusion model to control the recovery of the transmission image. The algorithm accepts a pair of ambient and flash images as inputs, and a flash-only image, which corresponds to the one captured with flash as the sole illumination source, is derived from the inputs. In light of the reflection-free nature of the flash-only image, we use it to guide the diffusion model to reconstruct the structures of the transmission image. A feature distillation scheme is designed to infer the chromatic attributes of the transmission image from the ambient image, and the features are used to modulate the generative priors learned by the diffusion model. We use time-aware strategies to ensure the synchronization between feature distillation and the dynamic image generation process of the diffusion model. The performance of the proposed algorithm is sequentially optimized in latent and pixel spaces. We also develop a plug-and-play fidelity-enhancing module (FEM) and integrate it into the proposed model to enable the faithful reconstruction of fine-granular visual characteristics of the target scene and reduce artifacts. Comparative experiments demonstrate that the proposed algorithm shows superior quantitative and qualitative performance over state-of-the-art methods in real-world scenarios. By leveraging the optical cue and the generative capability of the diffusion model, the algorithm can accurately restore the visual details of the transmission image even in the presence of strong reflections, and it also exhibits satisfactory robustness against nonlinear image representation and misalignment.

Abstract:
Unsupervised reinforcement learning (RL) aims to discover diverse behaviors that can accelerate the learning of downstream tasks. Previous methods typically focus on entropy-based exploration or empowerment-driven skill learning. However, entropy-based exploration struggles in large-scale state spaces (e.g., images), and empowerment-based methods with mutual information (MI) estimations have limitations in state exploration. To address these challenges, we propose a novel skill discovery objective that maximizes the deviation of the state density of one skill from the explored regions of other skills, encouraging inter-skill state diversity similar to the initial MI objective. For state-density estimation, we construct a novel conditional autoencoder with soft modularization for different skill policies in high-dimensional space. Meanwhile, to incentivize intra-skill exploration, we formulate an intrinsic reward based on the learned autoencoder that resembles count-based exploration in a compact latent space. Through extensive experiments in challenging state and image-based tasks, we find our method learns meaningful skills and achieves superior performance in various downstream tasks.

Abstract:
Class incremental learning (CIL) is the key to achieving adaptive vision intelligence, and one of the main streams for CIL is network expansion (NE). However, state-of-the-art (SOTA) methods usually suffer from feature diffusion, growing parameters, feature confusion, and classifier bias. In view of this, a novel dynamic structure dubbed recurrent NE (RNE) is proposed by establishing connections among task experts. Specifically, the previous task experts transfer features sequentially through a shared module and the new task expert makes adjustments based on received features rather than reextracted ones, thereby focusing more on the key area and avoiding feature diffusion. Furthermore, the RNE is compressed by replacing additional task experts with lightened ones, in order to significantly reduce the number of parameters while keeping the performance almost unaltered. In addition, feature confusion is alleviated by a decoupled classifier and classifier bias is corrected by pseudo-feature generation. Extensive experiments on four widely adopted benchmark datasets, i.e., CIFAR-100, ImageNet-100, Food-101, and ImageNet-1K, have demonstrated that RNE achieves SOTA performance in both ordinary and challenging CIL settings.

Abstract:
Graph anomaly detection (GAD) refers to identifying abnormal graph nodes or edges that heavily deviate from normal observations. Existing approaches inevitably suffer from the influence of imbalanced data and privacy protection. This shortcoming poses challenges in optimizing node embeddings and detecting multitype anomalies simultaneously, resulting in decreased accuracy of existing GAD models. To address this shortcoming, we introduce a new federated learning model for graph anomaly detection (FedGAD). FedGAD enables collaborative unsupervised learning among decentralized data centers without requiring direct access to the distributed subgraphs. Specifically, FedGAD masks and reconstructs the neighborhood features to enhance the knowledge of node representations. Considering the data diversity across distributed clients, we also design a cross-client node representation module that enables nodes to reconstruct neighbors by leveraging information from other clients. Furthermore, we use a multiscale contrastive learning function, which includes both structure-level and contextual-level learning functions, to detect graph anomalies under the condition that subgraphs located at different clients show imbalanced data distributions. Experimental results on seven benchmark datasets demonstrate the superior performance of FedGAD compared with baseline methods, verifying its capability of improving GAD performance.

Abstract:
In essence, reinforcement learning (RL) solves optimal control problems (OCPs) by employing a neural network (NN) to fit the optimal policy from state to action. The accuracy of policy approximation is often very low in complex control tasks, leading to unsatisfactory control performance compared with online optimal controllers. A primary reason is that the landscape of the value function is often not only rugged in most areas but also flat at the bottom, which damages the convergence to the minimum point. To address this issue, we develop a bicriteria policy optimization (BPO) algorithm, which leverages a few optimal demonstration trajectories to guide the policy search at the gradient level. Different from the conventional problem definition, BPO seeks to solve a bicriteria OCP, which has two homomorphic objectives: one is from the standard reward signals and the other is to align the demonstration trajectories. We introduce two co-state variables, one for each objective, and formulate two Hamiltonians for this bicriteria OCP. The resulting new optimality condition preserves the minimum values of both Hamiltonians. Furthermore, we find that gradient conflict is a key obstacle to simultaneously descending both Hamiltonians, and its impact is negatively proportional to the inner product between the ideal and actual gradients. A minimax optimization problem is built at each RL iteration to minimize conflicts between the two homomorphic objectives, whose solution for policy updating is referred to as the harmonic gradient. By converting its inner optimization loop into a linear program with a convex trust-region constraint, we simplify this problem into a single-loop maximization problem with much increased computational efficiency. Experimental tests on both linear and nonlinear control tasks validate the effectiveness of our BPO algorithm in improving the accuracy of the policy network.

Abstract:
Fine-grained visual categorization (FGVC) in open-world settings frequently encounters heavy occlusion (HO) samples that compromise discriminative features. However, effectively addressing heavy occlusion remains a challenge. Existing methods often either discard the occluded parts or utilize them through additional techniques such as image inpainting or multimodel strategies, each with its own set of advantages and limitations. In this article, we propose a novel approach inspired by human self-regulated learning (SRL) behavior: cyclical attention that leverages occluded regions through the attention recalibration in the feedback loop. In particular, we introduce a new multi-instance model where occluded parts are essential due to a special feedback structure at the basis of a cooperative game mechanism. This mimics SRL to re-evaluate the previous attention-based image patch selection strategy. We then embed the proposed multi-instance model into a transformer architecture, creating an SRL-FGVC transformer. The key innovation of this design is the cyclical attention, with the forward and feedback self-attention formulating a cooperative union to mitigate attention bias. Extensive experiments on six public datasets and an additional dataset we established demonstrate that the SRL-FGVC transformer consistently outperforms existing approaches in HO scenarios. This work presents a promising new direction for robust FGVC in challenging real-world conditions.

Abstract:
Model-based offline reinforcement learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, rolling out conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor (DF) is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.

Abstract:
Large language models (LLMs) have been gaining prominence due to their exceptional abilities in conducting various tasks. Although extensive LLM evaluation has been explored on natural language understanding tasks like text classification and sentiment analysis, evaluating LLMs on named entity recognition (NER) remains underexplored. To fill this gap, we evaluate twenty-eight representative LLMs, with parameter counts ranging from 3 billion to 175 billion, on thirteen datasets across five domains from four perspectives, that is, supervised fine-tuning (SFT), parameter scales, hallucinations, and prompt designs. We propose an LLM-based NER framework (LLM-NER) for the evaluation, which consists of a Recognition phase and a Check phase. Specifically, the Check phase guides LLMs to examine the correctness of recognized entities, which is designed to mitigate hallucinations in the NER scenario. Qualitative and quantitative evaluation analyses demonstrate that in the NER scenario: 1) SFT empowers LLMs to understand and follow human instructions; 2) LLMs’ ability generally improves as their parameter scales consistently increase; 3) hallucinations exist in all evaluated LLMs, and guiding LLMs to check their outputs is a feasible way to alleviate hallucinations; and 4) all evaluated LLMs are sensitive to prompt designs. Based on the analyses, we highlight a number of promising directions for future study. Moreover, our evaluation shows high consistency with two LLM evaluation leaderboards, which evaluate LLMs on other tasks, demonstrating the rationality of our evaluation design.

Abstract:
Neural architecture search (NAS) has achieved significant success in automating neural network design, particularly through evolutionary NAS. To address the critical need for efficient architecture discovery across diverse scenarios, such as computer vision and natural language processing, multitask NAS (MT-NAS) methods have emerged. Nevertheless, existing MT-NAS approaches still face critical challenges, including redundant search arising from insufficient exploitation of population historical information across generations and negative transfer caused by unguided interactions between tasks. To address these limitations, a population historical information-driven evolutionary multitask neural architecture search (HIMT-NAS) algorithm is proposed. For each generation, the population historical information is recorded, which includes the operation information and the topology information. In the search process, population historical information is systematically utilized to guide evolutionary search directions, preventing redundant search. Furthermore, the proposed method adjusts cross-task knowledge transfer probability by measuring task similarity through patterns in population historical information, and then updates transfer probabilities when the information proves useful across multiple tasks. Extensive experiments on MedMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate consistent advantages of the proposed method over both single-task NAS methods and recent MT-NAS methods.

Abstract:
In the history of knowledge distillation (KD), the focus has shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of decoupled KD (DKD), which reemphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the generalized DKD (GDKD) loss, which offers a more versatile method for decoupling logits. Then, we pay particular attention to the teacher model’s predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: 1) the partitioning by the top logit considerably improves the interrelationship of nontop logits and 2) amplifying the focus on the distillation loss of nontop logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models’ predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD’s superior performance over both the original DKD and other leading KD methods. The code is available at https://github.com/ZaberKo/GDKD
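For readers unfamiliar with the decoupling that DKD performs, the following is a minimal sketch of the original target/non-target split, not of the GDKD loss proposed here. It assumes the commonly cited form in which the KD divergence separates into a binary target-class term and a term over the remaining classes. The function names, the `alpha`/`beta` weights, and the omission of temperature scaling are illustrative assumptions rather than details taken from the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decoupled_kd(teacher_logits, student_logits, target, alpha=1.0, beta=1.0):
    """Split the KD divergence into a target-class term and a non-target term.

    alpha and beta are hypothetical weights; setting them independently is
    what "decoupling" refers to in this sketch.
    """
    pt = softmax(teacher_logits)
    ps = softmax(student_logits)
    # Binary distributions: target class versus everything else.
    tckd = kl([pt[target], 1 - pt[target]], [ps[target], 1 - ps[target]])
    # Distributions renormalized over the non-target classes only.
    nt = [p / (1 - pt[target]) for i, p in enumerate(pt) if i != target]
    ns = [p / (1 - ps[target]) for i, p in enumerate(ps) if i != target]
    nckd = kl(nt, ns)
    return tckd, nckd, alpha * tckd + beta * nckd
```

In this form the classical KD divergence equals `tckd + (1 - pt[target]) * nckd`, so the non-target term is implicitly down-weighted whenever the teacher is confident; weighting the two terms separately removes that coupling.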

Abstract:
Vision transformer (ViT) has recently demonstrated remarkable performance in fine-grained visual classification (FGVC). However, most existing ViT-based methods overlook the varied focus of different attention heads, in which heads that attend to nondiscriminative regions would dilute the discriminative signal crucial for FGVC. To address such issues, we propose a novel adaptive attention quantization transformer (A2QTrans) for FGVC to select the key discriminative features by analyzing the heads’ attention, which comprises three key modules: the adaptive quantization selection (AQS) module, the background elimination (BE) module, and the dynamic hybrid optimization (DHO) module. Specifically, the AQS module dynamically selects the most discriminative features in a data-driven manner by quantizing the attention scores across multiple attention heads with a global, learnable threshold. This process effectively filters out generally irrelevant information from nondiscriminative tokens, thus concentrating attention on important regions. To address the nondifferentiability inherent in updating this threshold during binarization, our AQS module employs a straight-through estimator (STE) for discrete optimization, enabling end-to-end gradient backpropagation. In addition, we utilize the prior that background regions usually do not contain meaningful information, and design the BE module to further calibrate the focus of the attention heads to the main objects in images. Finally, the DHO module adaptively optimizes and integrates the attentive results of the AQS and BE modules to achieve optimal classification performance. Extensive experiments conducted on four challenging FGVC benchmark datasets and three ViT variants demonstrate A2QTrans’s superior performance, achieving state-of-the-art (SOTA) results. The source code is available at https://github.com/Lishixian0817/A2QTrans
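The idea of a straight-through estimator for a learnable binarization threshold can be illustrated with a hand-written forward and backward pass. This is only a sketch under stated assumptions: the function names are hypothetical, the backward pass uses the simplest identity surrogate (treating the mask as `score - threshold`), and it is not the authors' AQS implementation, which may use a different surrogate.

```python
def binarize_forward(scores, threshold):
    """Hard, non-differentiable mask: 1.0 where a score exceeds the threshold."""
    return [1.0 if s > threshold else 0.0 for s in scores]

def binarize_backward_ste(grad_mask):
    """Straight-through backward pass (assumed identity surrogate).

    The step function has zero gradient almost everywhere, so the backward
    pass pretends the forward pass was mask = score - threshold:
    d(mask)/d(score) = 1 and d(mask)/d(threshold) = -1.
    """
    grad_scores = list(grad_mask)        # gradient passes straight through
    grad_threshold = -sum(grad_mask)     # one shared threshold accumulates all terms
    return grad_scores, grad_threshold

# Example: three attention scores and one shared threshold.
mask = binarize_forward([0.1, 0.6, 0.9], threshold=0.5)
grad_scores, grad_threshold = binarize_backward_ste([0.2, -0.1, 0.3])
```

Because the threshold is shared, its gradient is the negated sum of the upstream gradients, which is what lets a single global threshold be updated end to end in this sketch.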

Abstract:
Transformer-based methods have recently shown remarkable success in hyperspectral image classification (HSIC). However, their applications, in practice, still face two significant challenges. First, although the multihead mechanism in self-attention improves model robustness during training, it may overlook the continuity of spectral bands. Second, existing methods often struggle to effectively balance global and local information during multiscale feature extraction, limiting further improvements in classification performance. To address these issues, we propose a novel spectral-guided multiscale feature-aware Transformer (SMFAT) framework for HSIC. Specifically, a global low-rank spectral learning (GLSL) module is introduced to project hyperspectral image patches into a low-rank subspace, reducing spectral redundancy and capturing global spectral correlations. Furthermore, we introduce the multiscale feature-aware self-attention (MFASA) mechanism, which dynamically integrates fine- and coarse-grained features to enhance multiscale feature modeling. Finally, a spectral-guided fusion (SGF) module leverages the global spectral information extracted by the GLSL module to guide MFASA in more effectively capturing interspectral correlations and spectral continuity. This approach facilitates a more effective integration of spectral and spatial features in HSIs. Experiments on three well-known HSI datasets verify that the proposed SMFAT method significantly outperforms several state-of-the-art approaches in real-world HSIC tasks. The source code for this work is available at https://github.com/stellaZ77/SMFAT

Abstract:
Remarkable progress has been achieved in baseline detection and segmentation; however, for high-level visual tasks in complex scenes (e.g., density, occlusion, scale diversity, and high background noise), existing frameworks often fail to provide satisfactory performance. To further improve object recognition ability, this article introduces a leader-based multiexpert mechanism into the detection and segmentation tasks. We first design a leader-based attention learning layer to fully integrate multilevel features from the backbone network, which can effectively obtain global semantics and assign instructions to detection experts. Then, we propose multiple feature pyramids with dual fusion paths to replace the traditional single pipeline using semantic and spatial allocators. With this strategy, we can further establish deep supervision for multiple experts during training and sufficiently utilize the multiexpert detection results from leaders’ assignments during inference, thereby comprehensively improving the performance of the model in complex scenarios. In the experiments, we conduct ablation studies and performance comparisons on COCO 2017 detection and segmentation tasks. Finally, we demonstrate the model’s performance in three complex application scenarios (remote sensing, autonomous driving, and industrial fields), and the results show its advantages.

Abstract:
Graph pooling is crucial for enlarging the receptive field and reducing computational costs in deep graph representation learning. In this work, we propose a simple but effective graph probabilistic pooling (GP-Pool) framework to facilitate graph feature learning. Instead of either deterministic selection or random dropping, we design a probabilistic subgraph sampling to reach an expected distribution by deducing a variational bound. Accordingly, a Bernoulli graph pooling (BernPool) is first derived to sample nodes together with the local structures, for which a learnable reference set is introduced to encode nodes into a latent expressive probability space. Hereby, the resultant BernPool captures salient graph substructures while possessing much diversity on sampled nodes due to its nondeterministic manner. For more controllable pooling, we derive the Poisson-distributed version (aka PoissonPool) from BernPool to explicitly cut the node quantity with fewer variables in variational learning. Furthermore, considering the complementarity of node sampling and clustering, we propose a hybrid graph pooling (HGP) paradigm to combine a compact subgraph (via BernPool/PoissonPool) and a coarsening graph (via clustering), to retain both representative substructures and global topology. Extensive experiments on multiple public graph classification datasets demonstrate that our GP-Pool is superior to various graph pooling methods and achieves state-of-the-art performance.
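The basic operation behind Bernoulli node sampling, keeping each node independently with its own probability and retaining the induced subgraph, can be sketched as follows. The keep probabilities are supplied directly here as an assumption; the learnable reference set and the variational bound described above are not reproduced, and the function name and data layout are hypothetical.

```python
import random

def bernoulli_pool(keep_probs, edges, rng=None):
    """Sample nodes independently and return the induced subgraph.

    keep_probs: dict mapping node id -> keep probability in [0, 1]
    edges: iterable of (u, v) pairs
    rng: optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    # One Bernoulli draw per node.
    kept = {node for node, p in keep_probs.items() if rng.random() < p}
    # Keep only edges whose two endpoints both survived.
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, kept_edges

# Example: a 4-node path graph where interior nodes are more likely to be kept.
probs = {0: 0.2, 1: 0.9, 2: 0.9, 3: 0.2}
nodes, sub_edges = bernoulli_pool(probs, [(0, 1), (1, 2), (2, 3)], random.Random(0))
```

Repeated calls give different subgraphs, which is the source of the sampling diversity the abstract refers to; a deterministic top-k selection would return the same nodes every time.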

Abstract:
Deep learning (DL) supervised techniques have been extensively employed in magnetic resonance imaging (MRI) reconstruction, delivering notable performance enhancements over traditional non-DL methods. Nonetheless, these models have vulnerabilities during testing such as their susceptibility to worst-case or noise-based measurement perturbations, variations in training/testing settings like acceleration factors, contrast, k-space sampling locations, and distribution shifts stemming from unseen lesions and different anatomies. This article addresses these robustness challenges by leveraging diffusion models (DMs). In particular, we present a robustification strategy that improves the resilience of DL-based MRI reconstruction methods by utilizing pretrained DMs as purifiers. We dub our method robust DL-based MRI with diffusion purification (RODIO). In contrast to conventional robustification methods for DL-based MRI reconstruction, such as adversarial training (AT), our proposed approach eliminates the need to tackle a minimax optimization problem. It only necessitates efficient fine-tuning on purified examples. Our experimental results underscore the effectiveness of our approach in addressing the mentioned instabilities, outperforming standalone diffusion-based MRI reconstructors and leading robustification methods for deep supervised MRI reconstruction, including AT and randomized smoothing (RS). Our experiments demonstrate: 1) the adaptability of our approach across multiple DL-based supervised MRI reconstruction models; 2) compatibility with accelerated diffusion-based samplers; 3) robustness to data with unseen lesions; and 4) effectiveness when applied to unsupervised single-shot generative reconstructors.

Affiliations: University of Electronic Science and Technology of China, Chengdu, China; Department of Computer and Information Sciences, Faculty of Engineering and Environment, Northumbria University, Newcastle upon Tyne, U.K.; Department of Electrical and Computer Engineering, National University of Singapore, Queenstown, Singapore; Department of Data Science and Artificial Intelligence and the Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong; Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China

Abstract:
The Fourier transform (FT) stands as a fundamental tool in modern signal processing with widespread applications across various scientific and engineering fields. Therefore, there remains a need for continued research efforts to devise energy-efficient implementations of the FT. Due to their inherent energy efficiency, biologically plausible spiking neural networks (SNNs) emerge as a promising alternative solution. However, current SNN implementations of the FT suffer from two key shortcomings, namely, high latency and reduced accuracy. In this article, we analyze the underlying causes of these limitations and highlight deficiencies in the existing spike-based encoding mechanisms and spiking neuron models. We then propose a new SNN-based FT (SNN-FT) based on a logarithmically polarized time-to-first-spike (TTFS) encoding method (called LP-TTFS) along with a novel piecewise spiking neuron model based on ternary spikes (referred to as PTSN). The resulting SNN-FT is mathematically equivalent to the conventional FT and demonstrates superior performance in accuracy as well as reduced latency. We assess the performance of the proposed SNN-FT alternative through extensive experiments on FT-based applications, such as radar and audio signal processing, and the obtained results demonstrate the efficacy of SNN-FT and its superiority over the existing approaches. This study unveils a novel energy-efficient neuromorphic computing technique with great potential for FT applications across diverse scientific and engineering domains.
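The discrete FT is a fixed linear map, which is what allows a network with fixed weights to reproduce it. The sketch below is not the spiking implementation and does not model LP-TTFS or PTSN. It only shows the DFT of a real sequence computed with real-valued cosine and sine weights and compared against the direct complex definition; the function names are illustrative.

```python
import cmath
import math

def dft_real_weights(x):
    """DFT of a real sequence using only real multiply-accumulates:
    Re X[k] =  sum_n x[n] * cos(2*pi*k*n/N)
    Im X[k] = -sum_n x[n] * sin(2*pi*k*n/N)"""
    n_pts = len(x)
    re = [sum(x[n] * math.cos(2 * math.pi * k * n / n_pts) for n in range(n_pts))
          for k in range(n_pts)]
    im = [-sum(x[n] * math.sin(2 * math.pi * k * n / n_pts) for n in range(n_pts))
          for k in range(n_pts)]
    return re, im

def dft_complex(x):
    """Reference DFT straight from the complex-exponential definition."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

signal = [1.0, 2.0, 3.0, 4.0]
re, im = dft_real_weights(signal)
reference = dft_complex(signal)
```

For this four-point input the two computations agree to floating-point precision, with X[0] = 10 and X[1] = -2 + 2j.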

Affiliations: Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, School of Computer and Data Science, Minjiang University, Fuzhou, China; College of Computer and Data Science, Fuzhou University, Fuzhou, China; Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China; Fujian Provincial Key Laboratory of Big Data Mining and Applications, School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore

Abstract:
Semi-supervised learning (SSL) offers a promising solution to the challenge of learning from limited labeled data by leveraging the potential of unlabeled data, thus circumventing the need for costly labeling efforts. However, common SSL methods often encounter domain shifts in many real-world scenarios, where class distribution is imbalanced. In order to make machine learning more robust to imbalanced datasets, it is imperative to ensure that consistent representations are learned for each class, regardless of the amount of data available. Therefore, we propose a straightforward yet effective kernel function mapping strategy to align the representations of each class in an infinite-dimensional space. Specifically, we employ a Gaussian kernel function to map the representations of unlabeled data to the centroids of labeled data, enabling similarity comparisons in the infinite-dimensional space. In this way, we are able to refine the predicted pseudo-labels at the representation level. To better handle class imbalance, we note that it is common to obtain a high recall but low precision for the majority classes and a high precision but low recall for the minority classes. A selective strategy is adopted for predictions corrected for the majority classes while maintaining confidence in the pseudo-labels assigned to the minority classes. Extensive evaluations on various benchmarks and training settings validate the superior performance of the proposed method compared to the existing relevant state-of-the-art approaches.
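The centroid comparison described above can be sketched as follows, assuming class centroids are the means of labeled representations and that a pseudo-label is refined by picking the class with the largest Gaussian-kernel similarity. The bandwidth `sigma`, the toy features, and the function names are assumptions for illustration, and the selective majority/minority strategy is not modelled.

```python
import math

def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def refine_pseudo_label(feature, centroids, sigma=1.0):
    """Return the class whose centroid is most similar under the kernel."""
    sims = {label: gaussian_kernel(feature, mu, sigma) for label, mu in centroids.items()}
    return max(sims, key=sims.get), sims

# Hypothetical labeled representations for two classes.
labeled = {0: [[0.0, 0.0], [0.2, -0.2]], 1: [[3.0, 3.0], [2.8, 3.2]]}
centroids = {label: centroid(vs) for label, vs in labeled.items()}
label, sims = refine_pseudo_label([2.5, 2.9], centroids)
```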

Abstract:
Surrogate-assisted evolutionary algorithms (SAEAs) have garnered significant attention for addressing expensive multiobjective optimization problems. Most existing SAEAs, however, still rely on conventional genetic operators in reproduction, which is inefficient in generating promising candidate solutions. To address the above issue, this article presents a learning-based generative model that replaces crossover and mutation and learns to conduct multiobjective search for expensive multiobjective optimization problems. The key idea is to design an attention-enhanced convolutional residual network with the assistance of surrogate model for offspring generations. The proposed framework employs a generative model to produce promising solutions for each decomposed subproblem based on the Tchebycheff metric, while a surrogate model assists in optimizing the generative model’s hyperparameters through an online learning process. We demonstrate the efficacy of our learning-based multiobjective generative model (LMOGM) on DTLZ, ZDT, and WFG benchmark function suites, varying in dimensions from 30 to 200, as well as through a practical application involving the geothermal energy extraction design optimization. Experimental results highlight the superior performance of the proposed approach when compared to traditional evolutionary algorithms and state-of-the-art surrogate-assisted multiobjective evolutionary algorithms (MOEAs).
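For reference, the Tchebycheff scalarization used in decomposition-based multiobjective search scores an objective vector f against a weight vector w and an ideal point z* as max_i w_i |f_i - z*_i|. The short sketch below shows how each weight vector defines a subproblem with its own preferred candidate; the candidate values and weights are made up for illustration, and the generative model and surrogate are not modelled.

```python
def tchebycheff(objectives, weights, ideal):
    """Weighted Tchebycheff value: max_i w_i * |f_i - z*_i| (lower is better)."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

def best_for_subproblem(candidates, weights, ideal):
    """Pick the candidate objective vector that minimizes the scalarization."""
    return min(candidates, key=lambda f: tchebycheff(f, weights, ideal))

ideal = [0.0, 0.0]
candidates = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]  # hypothetical objective vectors
balanced = best_for_subproblem(candidates, [0.5, 0.5], ideal)
first_heavy = best_for_subproblem(candidates, [0.9, 0.1], ideal)
```

A balanced weight vector selects the middle candidate, whereas a weight vector that emphasizes the first objective selects the candidate with the smallest first objective.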

Abstract:
In robotic perception, cross-granularity object detection is essential for identifying and localizing targets at varying levels of detail. Traditional detection methods often struggle to bridge the gap between coarse object detection and fine-grained component localization, limiting their ability to associate parts, such as a cup and its handle. Vision–language models (VLMs), while effective in spatial reasoning, face challenges in fine-grained detection due to the scarcity of annotated datasets. To address these issues, we first propose the chain-of-detection (CoD) framework, which focuses on guiding detection in a step-by-step manner from coarse recognition to fine-grained localization. During this process, we observe that existing detectors still lack sufficient capability in recognizing fine-grained components. To overcome this limitation, we further combine the CoD framework with Monte Carlo tree search (MCTS) to automatically generate fine-grained datasets, eliminating the need for manual labeling and significantly improving detector performance. Experiments show that our approach achieves an average improvement of 17.31% in robotic manipulation success rates for common objects, 51.39% for larger object operations, and about 50% in simulated environments. These results demonstrate the effectiveness of CoD in advancing cross-granularity detection and enhancing precise robotic manipulation. The implementation is publicly available at https://github.com/tinnel123666888/CoD and the CoD dataset is released at https://huggingface.co/datasets/tinnel123/CoD_dataset

Abstract:
The 2-D smoothing (TDS) algorithm is a powerful tool for smoothing and filtering 2-D sequences, serving a crucial role in tasks such as image processing and filtering. Despite its significance, the theoretical properties of the TDS algorithm have not been thoroughly explored in the current literature. In this article, we present a comprehensive analysis of the TDS algorithm, elucidating its mathematical properties and proposing innovative models for its application in image processing. First, we provide an equivalent description of the TDS algorithm and demonstrate that the trend sequence makes the loss function reach the global minimum. Regarding the convergence of the TDS algorithm, we demonstrate the convergence of both trend and fluctuation sequences. Specifically, as the global smoothing parameter tends to infinity, both sequences converge to a deterministic sequence that is independent of the global smoothing parameter. In this case, the TDS algorithm becomes equivalent to computing the trend sequence via the least squares method (LSM). Subsequently, we reveal the smoothing mechanism of the TDS algorithm, which attenuates the energy of the original sequence in the transform domain. Furthermore, we show that the forward transform kernel of the TDS algorithm is a separable orthogonal transform. In addition, we explore the intrinsic relationship between the trend and fluctuation sequences. Notably, an insightful result is that the fluctuation sequence is the trend sequence of the sequence obtained by applying the characteristic lag operator polynomial to the original sequence. Building on these insights, we propose several application scenarios and models for the TDS algorithm in image processing, such as image smoothing, high-frequency extraction, edge detection, and enhancement. 
To validate the effectiveness of the TDS algorithm, we present numerical simulations and image processing experiments, which demonstrate the correctness of the proposed theoretical framework and the superior performance of TDS in 2-D filtering tasks. This work lays a solid theoretical foundation for the practical application of the TDS algorithm, offering novel methodologies and insights for image processing, 2-D filtering, wireless communication, and computer vision.

Abstract:
Representation learning techniques effectively unveil latent patterns within raw data. However, the learning process is often marred by uncertainties, such as variations in data quality and heterogeneous scenarios, which greatly affect the reliability of representation learning. In this article, we introduce a reliable representation learning framework to establish a connection between data attributes and modeling strategies, namely the interpretable attribute-oriented representation learning framework. First, by focusing on the inherent knowledge embedded in the data, we decouple it into four principal attributes: fidelity, topology, invariance, and discriminability. To explicitly address these attributes, we incorporate them into an optimization-derived framework using corresponding general loss terms. Furthermore, by treating the iterative solution process as a bridge, each derived network module possesses traceable interpretability, thus laying a reliable foundation. Ultimately, we extend the proposed framework to multisource heterogeneous scenarios, enabling it to adapt to complex environments while maintaining reliability. In essence, our work aims to seamlessly integrate deep representations with prior knowledge during the learning process, thereby creating a solid basis for dependable modeling. Networks derived from the proposed framework achieve promising results, particularly in complex multisource heterogeneous environments, demonstrating both their effectiveness and reliability. The code is available at https://github.com/ZihanFang11/2025_AORLNet_TNNLS.

Abstract:
Incomplete multiview clustering (IMVC) faces significant challenges due to missing data and inherent view discrepancies. While deep neural networks offer powerful representation learning capabilities for IMVC, existing methods often overlook view diversity and force representations across views to be identical, leading to 1) biased representations with distorted topologies and 2) inaccurate imputation for missing data, ultimately degrading clustering performance. To address these issues, we propose prototype-graph transformer (PGFormer), a novel IMVC framework that integrates prototype assignments, rather than direct representations, to enhance clustering performance. PGFormer leverages view-specific encoders to extract features from available samples in each view, employs a PGFormer designed to refine node embeddings, and reconstructs available samples using these refined embeddings. For each view, PGFormer utilizes a graph convolutional network (GCN) to model node-to-node topologies and generate semantic prototypes from the node embeddings. These view-specific prototypes and embeddings are then refined through dual attention mechanisms: prototype-to-prototype (P2P) self-attention and prototype-to-node (P2N) cross-attention, enabling a thorough exploration of multilevel topological relationships within each view. To address missing data, the cross-prototype imputation (CPI) module leverages the weighted prototype assignments from different views to impute missing samples using refined intraview prototypes. Building on this, the cross-view alignment module calibrates prototype assignments to ensure consistent predictions across views. Extensive experiments demonstrate that PGFormer can achieve superior performance compared with the baselines.

Abstract:
Spiking neural networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional artificial neural networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: 1) the high computational overhead during training caused by multitimestep spike firing and 2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome the issues, we introduce the input-aware multilevel spikeformer (IML-Spikeformer), a spiking transformer architecture specifically designed for large-scale speech processing. Central to our design is the input-aware multilevel spike (IMLS) mechanism, which simulates multitimestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a reparameterized spiking self-attention (RepSSA) module with a hierarchical decay mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multiscale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates (WERs) of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively. IML-Spikeformer marks an advance of scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. Our source code and model checkpoints are publicly available at github.com/Pooookeman/IML-Spikeformer

Abstract:
Existing point cloud completion methods rely on extracting latent codes from a partial point cloud to reconstruct a complete structure. However, the complexity of partial point clouds makes the completion results of such methods less satisfactory, especially in long-distance areas (far from the partial point cloud). To tackle this challenge, we propose a point cloud completion network via heuristic structure growing (HSG-Net), which progressively completes the close-distance structure through an iterative heuristic structure growth strategy. Particularly, a novel data preprocessing (DP) method is proposed to obtain ground truth (GT) with specific structural integrity, guiding the network to learn close-distance structural information. In addition, the proposed consistency constraint displacement module (CCDM) is employed to fulfill structure growth, and a feature memory module (FMM) further enhances the quality of the grown structure. Furthermore, a proposed local information generator is used to further refine the structure-grown point cloud, yielding the final result. Extensive quantitative and qualitative results demonstrate that our HSG-Net outperforms the state-of-the-art methods.

Abstract:
Standard integral neural networks (INNs) employ continuous integration layer representation across the kernel and channel dimensions. However, they neglect the reparameterization problem of continuous integration layers, making it difficult to deploy INNs on resource-constrained mobile devices by decoupling their training-time and inference-time structure. We propose a continuous reparameterization strategy that reparameterizes the train-time multiple integration layers into a feed-forward structure at inference time to address this issue. Then, we extend the vision transformer (ViT)–like MetaFormer structure to the continuous integration layer design and leverage an overparameterization integral branch to improve the representation capacity of INNs. Finally, exploiting the above techniques, we establish a family of lightweight reparameterizable INNs (RINNs) to achieve strong performance on resource-constrained mobile devices. The results of extensive experiments show the superior performance of RINNs to state-of-the-art lightweight ViTs and favorable zero-shot transfer performance in downstream tasks. On the ImageNet dataset, our RINNs achieve over 79.1% top-1 accuracy with 0.87-ms latency on mobile devices. Moreover, our RINNs maintain the same performance at structural pruning rates of up to 50%, without fine-tuning, compared with the 25%–50% accuracy loss of the state-of-the-art discrete models. The codes are publicly available here: https://github.com/ljh3832-ccut/RINN
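The general idea behind structural reparameterization can be shown with a discrete analogue: parallel linear branches used at training time are folded into a single weight and bias for inference, because the sum of linear maps is itself linear. The sketch below uses ordinary matrices and hypothetical helper names. It does not model the continuous integration layers or the authors' specific strategy.

```python
def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def train_time_forward(branches, x):
    """Sum the outputs of parallel linear branches, each a (weight, bias) pair."""
    out = [0.0] * len(branches[0][1])
    for w, b in branches:
        y = matvec(w, x)
        out = [o + yi + bi for o, yi, bi in zip(out, y, b)]
    return out

def merge_branches(branches):
    """Fold all branches into one (weight, bias) pair for inference."""
    rows, cols = len(branches[0][0]), len(branches[0][0][0])
    w = [[sum(br[0][i][j] for br in branches) for j in range(cols)] for i in range(rows)]
    b = [sum(br[1][i] for br in branches) for i in range(rows)]
    return w, b

branches = [
    ([[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]),
    ([[0.0, 1.0], [1.0, 0.0]], [1.0, 1.0]),
]
x = [2.0, 3.0]
multi_branch = train_time_forward(branches, x)
w_merged, b_merged = merge_branches(branches)
single_branch = [y + b for y, b in zip(matvec(w_merged, x), b_merged)]
```

Both paths produce the same output, so the merged layer can replace the multi-branch block at inference time with no change in predictions.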

Abstract:
The inherent compliant nature of soft robots can offer remarkable advantages over their rigid counterparts in terms of safety to human users and adaptability in unstructured environments. However, this feature also magnifies the complexity of their bodies, rendering their proprioception, and hence their control, extremely challenging. Given this intricacy, machine learning is a potent candidate for extracting proprioceptive insights from sensor data due to its proven capabilities in tackling analogous issues in computer vision (CV) and natural language processing (NLP). Recently, key aspects of soft robot proprioception have been addressed via learning-based techniques, but most of these are rooted in the supervised learning (SL) paradigm. This typically requires collecting a large number of costly annotated training samples, thereby constraining its widespread and speedy adoption in real-world applications. To mitigate this limitation, we propose a self-SL framework for soft robot proprioception. Our method utilizes vast unannotated data for network pretraining by self-SL. Then, the pretrained model is fine-tuned with a limited set of annotated samples by SL. We validate the proposed method’s efficacy on a high-resolution 3-D morphological reconstruction task using a publicly available dataset. Remarkably, our approach is shown to necessitate only about 1/20 of annotated samples to achieve better performance than the fully supervised method.

Abstract:
Radial basis function neural networks (RBFNNs) are widely applied due to their rapid modeling capabilities and efficient learning performance. However, when dealing with high-dimensional data, RBFNNs encounter two critical limitations: the hidden layer responses using Gaussian kernels suffer from ineffective activation and numeric underflow; and the estimation of output layer weights typically involves tedious parameter tuning and inefficient loading of high-dimensional feature matrices. To overcome these challenges, we first propose a dimensionality-adaptive Gaussian kernel function (DAGKF) equipped with a novel width adjustment mechanism that flexibly mitigates the numerical difficulties inherent in high-dimensional spaces. Moreover, to avoid processing entire feature matrices simultaneously, we introduce a multioutput coordinate descent (MOCD) algorithm that enables parallel computation across multioutput systems. Building upon MOCD, we further develop the joint residual MOCD (JRMOCD) algorithm, which incorporates a joint residual criterion for more effective weight estimation. The convergence of the JRMOCD algorithm is rigorously proven. Extensive experiments demonstrate the superior performance of the proposed methods, particularly in high-dimensional settings.
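The underflow problem is easy to reproduce. In the sketch below, a Gaussian hidden unit with a fixed width returns exactly zero for a point at unit distance per coordinate in 10,000 dimensions, while a width scaled with the square root of the dimension keeps the response in a usable range. The square-root scaling is a generic heuristic chosen for illustration. It is not the DAGKF width adjustment mechanism, whose details the abstract does not give.

```python
import math

def gaussian_response(x, center, width):
    """Gaussian RBF activation exp(-||x - c||^2 / (2 * width^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2 * width ** 2))

dim = 10_000
x = [1.0] * dim
center = [0.0] * dim

# Fixed width: the exponent is -dim / 2 = -5000, far below what a double can represent.
fixed = gaussian_response(x, center, width=1.0)

# Width grown with sqrt(dim) (illustrative heuristic): the exponent stays at -0.5.
scaled = gaussian_response(x, center, width=math.sqrt(dim))
```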

Abstract:
Traffic prediction is a cornerstone of intelligent transportation systems (ITSs). The effectiveness of existing spatiotemporal graph neural networks (STGNNs) heavily relies on the independent and identically distributed (i.i.d.) assumption of traffic data, which is frequently violated in practice because of distribution shifts owing to exogenous factors. While learning features that remain stable across all environments is promising for modeling robust frameworks, the fundamental challenge involves the decomposition of invariant features from the dynamic nature of spatiotemporal dependencies. In this article, we propose the disentangled spatiotemporal (DIST) graph neural networks, a novel framework for robust traffic forecasting considering distribution shifts. In DIST, latent invariant variables are explicitly decoupled from dynamically evolving spatiotemporal dependencies, enabling the learning of topology-agnostic representations resilient to distribution shifts. Specifically, we formulate a causality-driven learning objective that guides the separation of invariant variables from various exogenous factors. We then propose a spatiotemporal graph modeling module that can adaptively capture spatiotemporal dependencies in evolving traffic systems. Furthermore, we present a graph perturbation module to simulate topology variations during training, thereby encouraging the model to identify perturbation-sensitive dependencies and infer invariant and variant features for prediction and intervention tasks. The prediction risk and its variance on multiple interventional distributions are minimized in our learning strategy, allowing the model to identify invariant features, thus improving its robustness. The results of comprehensive real-world experiments demonstrate the superiority of our approach. The source code is available at https://github.com/tingwang25/DIST
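Minimizing the prediction risk together with its variance across interventional distributions can be sketched as a mean-plus-variance penalty over per-environment risks. The trade-off weight `lam` and the exact form of the objective are assumptions here, since the abstract does not spell them out.

```python
def mean_variance_risk(env_risks, lam=1.0):
    """Mean risk across environments plus lam times the (population) variance.
    Penalizing the variance favors predictors that perform evenly under
    every intervention rather than well on average only."""
    mean = sum(env_risks) / len(env_risks)
    var = sum((r - mean) ** 2 for r in env_risks) / len(env_risks)
    return mean + lam * var

even = mean_variance_risk([1.0, 1.0, 1.0])  # same risk in every environment
uneven = mean_variance_risk([0.5, 1.5])     # same mean risk, but unstable
```

Both cases have a mean risk of 1.0, but the uneven one is charged an additional 0.25 for its variance.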

Affiliations: Chongqing Key Laboratory of Image Cognition, the Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, and the Key Laboratory of Big Data Intelligent Computing, Chongqing University of Posts and Telecommunications, Chongqing, China; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China; Rocket Force University of Engineering, Xi’an, China; Chongqing Key Laboratory of Image Cognition and the Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract:
Collaborative camouflaged object segmentation (CoCOS) is a challenging task, focusing on identifying objects that blend closely with their backgrounds by jointly processing intraclass images. Existing methods fail to fully leverage the shared features (e.g., shape, texture, and contour) from these intraclass images, which leads to poor segmentation performance in relatively complex scenarios. To address this issue, we propose a novel mutually guided fusion refinement network (MFRNet), which improves the model performance by more effectively collaborating and optimizing the shared information. Specifically, it includes feature encoding, single-image branch feature enhancement, multiimage branch feature enhancement, and mutual guidance. After the feature encoding step, we design the graph convolution self-attention (GCS) and spatial context exploration (SCE) modules to enhance multilevel features of the single-image and multiimage branches, respectively. Moreover, we propose a mutual guidance fusion (MGF) module to utilize cross-scene image information for mutual guidance and progressive refinement, enhancing intraclass collaboration for improving target feature distinction. Extensive experimental results demonstrate that our MFRNet significantly outperforms existing CoCOS methods, achieving a mean E-measure score of 0.846 on the CoCOD8K dataset. Our code will be published at https://github.com/another-u/MFRNet

Abstract:
Fractional derivatives generalize integer-order derivatives, making them relevant for studying their convergence in descent-based optimization algorithms. However, existing convergence analysis of fractional gradient descent (FGD) is limited in both methods and settings. This article bridges these gaps by establishing convergence guarantees for FGD on a broader class of non-convex functions, known as matrix-smooth functions. We leverage the matrix smoothness properties of the function to prove convergence and accelerate the FGD iterates. We propose two novel stochastic fractional descent algorithms, named compressed FGD (CFGD), incorporating a matrix-valued stepsize to minimize matrix-smooth non-convex objectives. Our theoretical analysis covers both single-node and distributed settings and shows that matrix stepsizes better capture the structure of the objective, leading to faster convergence than scalar stepsizes. In addition, we highlight the importance of matrix stepsizes to leverage model structure effectively. To the best of our knowledge, this is the first work to introduce FGD in a federated/distributed setting.
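The advantage of a matrix stepsize over a scalar one can be seen on a badly scaled quadratic. The sketch below uses ordinary (integer-order) gradients on f(x) = ½ xᵀAx with diagonal A, so it illustrates only the stepsize idea and not fractional derivatives or compression. All names and values are illustrative.

```python
def grad(x, diag_a):
    """Gradient of f(x) = 0.5 * sum_i a_i * x_i^2."""
    return [a * xi for a, xi in zip(diag_a, x)]

def step_scalar(x, diag_a, gamma):
    """x <- x - gamma * grad f(x), one stepsize for every coordinate."""
    return [xi - gamma * gi for xi, gi in zip(x, grad(x, diag_a))]

def step_matrix(x, diag_a, diag_d):
    """x <- x - D * grad f(x), with a diagonal matrix stepsize D."""
    return [xi - d * gi for xi, d, gi in zip(x, diag_d, grad(x, diag_a))]

diag_a = [1.0, 100.0]  # curvature differs by 100x between coordinates
x0 = [1.0, 1.0]

# The scalar stepsize is limited by the largest curvature (1/100 here).
after_scalar = step_scalar(x0, diag_a, gamma=0.01)

# A matrix stepsize matched to each curvature reaches the minimizer in one step.
after_matrix = step_matrix(x0, diag_a, diag_d=[1.0, 0.01])
```

After one step the scalar update has barely moved the low-curvature coordinate (it is still 0.99), whereas the matrix update lands on the minimizer at the origin.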

Abstract:
Feature upsampling is a fundamental and indispensable ingredient of almost all current network structures for dense prediction tasks. Very recently, a popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution (HR) feature as guidance to help upsample the low-resolution (LR) deep feature based on their local similarity. Albeit achieving promising performance, this pipeline has specific limitations in methodological designs: 1) HR query and LR key features are not well aligned in a controllable manner; 2) the similarity between query–key features is computed based on the fixed inner product form, lacking flexibility; and 3) neighbor selection is coarsely operated on LR features, resulting in mosaic artifacts. These shortcomings make the existing methods along this pipeline primarily applicable to hierarchical network architectures with iterative features as guidance, and they are not readily extended to a broader range of structures, especially for a direct high-ratio upsampling. To address these issues, we thoroughly refresh this pipeline and meticulously optimize every methodological design. Specifically, we first propose an explicitly controllable query–key feature alignment from both semantic-aware and detail-aware perspectives and then construct a parameterized paired central difference convolution block for flexibly calculating the similarity between the well-aligned query–key features. Besides, we develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts. Based on these careful designs, we systematically construct a refreshed similarity-based feature upsampling framework named ReSFU.
Based on 13 types of network backbones, comprehensive experiments substantiate that only in a simple and direct high-ratio upsampling manner, our ReSFU consistently achieves satisfactory performance on six tasks, including semantic segmentation, medical image segmentation, instance segmentation, panoptic segmentation, object detection, and monocular depth estimation, showing superior generality and ease of deployment beyond the existing upsamplers. Codes are available at https://github.com/zmhhmz/ReSFU

Abstract:
Thermal imaging offers valuable properties, but suffers from inherently low spatial resolution, which can be enhanced using a high-resolution (HR) visible image as guidance. However, the substantial modality differences between thermal and visible images, coupled with significant resolution gaps, pose challenges to existing guided super-resolution (SR) approaches. In this article, we present dual-conditional diffusion (DuaDiff), an innovative diffusion model featuring a dual-conditioning mechanism to enhance guided thermal image SR. Unlike typical conditional diffusion models, DuaDiff integrates a learnable Laplacian pyramid to extract high-frequency details from the visible image, serving as one of the conditioning inputs. By capturing multiscale high-frequency components, DuaDiff effectively focuses on intricate textures and edges in the HR visible images, significantly enhancing thermal image fidelity. Furthermore, we project both thermal and visible images into a semantic latent space, constructing another conditioning input. Leveraging these complementary conditions, DuaDiff employs a multimodal latent feature cross-attention module to facilitate effective interaction between noise, thermal, and visible latent representations. Extensive experiments on the FLIR-ADAS and CATS datasets for 4× and 8× guided SR demonstrate that combining learnable Laplacian conditioning with semantic latent conditioning enables DuaDiff to surpass state-of-the-art methods in both visual quality and metric evaluation, particularly in scenarios with a large resolution gap. Besides, the applications to downstream tasks further confirm the capability of DuaDiff to recover high-fidelity semantic information. The code will be released.
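A Laplacian pyramid separates a signal into multiscale high-frequency residuals plus a coarse remainder. The sketch below is a fixed, 1-D version with pair averaging for downsampling and sample repetition for upsampling. These filters are assumptions chosen for brevity. The learnable 2-D pyramid in DuaDiff is not reproduced here.

```python
def downsample(x):
    """Halve the length by averaging adjacent pairs (length must be even)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def upsample(x):
    """Double the length by repeating each sample."""
    return [v for v in x for _ in range(2)]

def laplacian_pyramid(x, levels):
    """Return per-level high-frequency residuals and the coarsest signal."""
    details, cur = [], list(x)
    for _ in range(levels):
        low = downsample(cur)
        details.append([a - b for a, b in zip(cur, upsample(low))])
        cur = low
    return details, cur

def reconstruct(details, coarse):
    """Invert the pyramid: upsample and add back each residual."""
    cur = coarse
    for d in reversed(details):
        cur = [u + h for u, h in zip(upsample(cur), d)]
    return cur

signal = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0]
details, coarse = laplacian_pyramid(signal, levels=2)
restored = reconstruct(details, coarse)
```

The residuals are zero where the signal is locally flat and large at sharp changes, which is why they serve as an edge and texture cue. Adding them back reconstructs the input exactly.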

Abstract:
Hybrid architectures that combine convolutional neural networks (CNNs) with Transformers have emerged as a promising approach for medical image segmentation. However, existing networks based on this hybrid architecture often encounter two challenges. First, while the CNN branch effectively captures local image features through convolution operations, vanilla convolution lacks the ability to achieve adaptive feature extraction. Second, although the Transformer branch can model global image information, conventional self-attention (SA) primarily focuses on spatial relationships, neglecting channel and cross-dimensional attention, leading to suboptimal segmentation results, particularly for medical images with complex backgrounds. To address these limitations, we propose a dual-branch cross-fusion Transformer–CNN architecture for medical image segmentation (DCTC-Net). Our network provides two key advantages. First, a dynamic deformable convolution (DDConv) is integrated into the CNN branch to overcome the limitations of adaptive feature extraction with fixed-size convolution kernels and also eliminate the issue of shared convolution kernel parameters across different inputs, significantly enhancing the feature expression capabilities of the CNN branch. Second, a (shifted)-window adaptive complementary attention module ((S)W-ACAM) and compact convolutional projection are incorporated into the Transformer branch, enabling the network to comprehensively learn cross-dimensional long-range dependencies in medical images. Experimental results demonstrate that the proposed DCTC-Net achieves superior medical image segmentation performance compared to state-of-the-art (SOTA) methods, including CNN and Transformer networks. In addition, our DCTC-Net requires fewer parameters and lower computational costs and does not rely on pretraining.

Abstract:
The latent factor analysis (LFA) model is an effective tool for extracting valuable information from high-dimensional and sparse (HiDS) matrices. However, traditional LFA usually suffers from low accuracy due to the limitations of the stochastic gradient descent (SGD) algorithm used in its model training. First, the learning rate of SGD is adjusted manually, which greatly affects the training efficiency. A recent solution is adjusting this hyperparameter by the particle swarm optimization (PSO) algorithm. However, PSO cannot adapt well to the dynamic decision space of this problem due to its strong convergence. Second, SGD relies solely on the gradient information to perform the optimization, which decreases the training accuracy. To address the above two issues, this article proposes a novel LFA model called genetic algorithm-based two-step LFA (GA-TSLFA), which employs the GA to facilitate the model training. Compared to PSO, the GA has better flexibility, which can be employed to tune the hyperparameter of LFA in dynamic decision spaces and refine the model in high-dimensional and complex decision spaces by designing suitable evolutionary operators. The training of the proposed GA-TSLFA consists of two steps. In the first step, the model is pretrained by SGD whose learning rate is adaptively adjusted by a proposed GA. In the second step, the LF matrices generated by SGD are further refined using a proposed GA-based framework. This framework operates by optimizing a subset of partial vectors, which are selected through a dedicated strategy. In this way, the model’s accuracy can be further enhanced. Empirical studies on benchmark datasets show that the GA-TSLFA surpasses state-of-the-art LFA models in prediction accuracy and has a competitive efficiency.

Abstract:
Offline reinforcement learning (RL) provides a promising solution to learning an agent fully relying on a data-driven paradigm. However, constrained by the limited quality of the offline dataset, its performance is often suboptimal. Therefore, it is desirable to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL is difficult because of two main challenges: constrained exploratory behavior and state–action distribution shift. In view of this, we propose a simple unified uncertainty-guided (SUNG) framework, which naturally unifies the solution to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a variational autoencoder (VAE)-based state–action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy to select informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method by applying conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples to smoothly bridge offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark. The code is publicly available at https://github.com/guosyjlu/ACBR
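
The two uncertainty-guided rules described in this abstract can be illustrated with a small sketch. It assumes that value estimates and uncertainty scores are already available as plain numbers (in SUNG the uncertainty comes from a VAE-based density estimator, which is not modeled here). The two-stage selection (shortlist by value, then pick the most uncertain) and the hard uncertainty threshold are assumptions made for illustration, as are all function and parameter names.

```python
def optimistic_action(candidates, k=3):
    """Pick an action with both high value and high uncertainty.

    candidates: list of (action, q_value, uncertainty) tuples.
    Assumed rule: shortlist the k highest-value actions, then return
    the most uncertain action on that shortlist.
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max(shortlist, key=lambda c: c[2])[0]


def adaptive_loss(online_loss, conservative_loss, uncertainty, threshold):
    """Use the conservative offline objective for high-uncertainty samples
    and the standard online objective otherwise (hard threshold assumed)."""
    return conservative_loss if uncertainty > threshold else online_loss


candidates = [
    ("a", 1.0, 0.1),
    ("b", 0.9, 0.2),
    ("c", 0.8, 0.7),
    ("d", 0.1, 5.0),  # very uncertain but low value, so never shortlisted
]
chosen = optimistic_action(candidates, k=3)
```

With these toy numbers, action "d" is excluded despite its high uncertainty because its value is too low to reach the shortlist.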

Abstract:
The complex interweaving of authentic and fake news on social networks poses formidable challenges to modern information management. Existing fake news detection methods heavily rely on extensive labeled data and auxiliary information, such as propagation structures, facing critical issues of data scarcity and semantic integrity destruction in practical applications. Traditional data augmentation methods, due to their fixed rule-based transformation patterns, fail to effectively simulate the complexity and irregularity of information propagation in real social networks. To address these fundamental limitations, this article proposes a contrastive learning-driven fake news detection (CLFD) framework that breaks through existing technical bottlenecks through innovative distortion-reversion dual-view manipulation mechanisms and distortion-aware contrastive learning methods. The core innovation of CLFD lies in employing learnable neural networks to precisely simulate nonlinear information transformation processes, generating stylistically diverse contrastive views while preserving semantic core integrity, fundamentally solving the critical problem of semantic destruction in traditional methods. More importantly, our method achieves efficient detection using only textual content, requiring no additional information such as propagation structures or social network topologies, demonstrating outstanding universality and portability. Through dynamic view generation and multiobjective joint optimization strategies, CLFD significantly enhances the model’s capability to capture deceptive features in fake news. Extensive experiments on multiple benchmark datasets demonstrate that our framework significantly outperforms existing state-of-the-art methods in detection accuracy, robustness, and generalization capability.

Abstract:
Anchor-based strategies have been widely used to accelerate spectral clustering, yet their effectiveness is directly affected by the quality of the selected anchors. Random sampling has become one of the most important anchor determination methods due to its efficiency. However, the anchors obtained by a single random sampling often fail to adequately capture the topological structure of the original data, making it difficult for the constructed anchor graph to achieve satisfactory clustering performance. To solve this problem, we propose a novel spectral embedding representation model based on random anchor graph aggregation (RAGA), in which an aggregated anchor graph can be produced to obtain enhanced sample representation capability. Specifically, we perform multiple random samplings to make the distribution of the selected anchors approximate the original data within a reasonable sampling time. Subsequently, adaptive weighted learning is performed on the contribution of the constructed multiple anchor graphs, and then an aggregated anchor graph can be formed, which can portray the topological structure of the original samples more precisely. In addition, spectral embedding and spectral rotation are integrated into a joint learning framework to reduce the model learning error accumulation caused by the traditional two-stage framework. Notably, we propose a rigorous theorem for analyzing the approximation of samples by the selected anchors in multiple random samplings. Our proposed RAGA maintains the speed advantage of random sampling while obtaining a high-quality aggregated anchor graph, enabling it to handle large-scale data scenarios. Experimental results on several benchmark datasets show that the RAGA model outperforms other state-of-the-art (SOTA) anchor graph-based clustering methods.

Abstract:
This article focuses on a decentralized online optimization problem over multiagent systems, where the interactions are modeled by a strongly connected directed graph. The objective of each agent is to minimize the global loss function accumulated from all agents’ local loss functions, which are time-varying and known only to the agents themselves. To address the communication bottleneck caused by high-dimensional data and large-scale networks, we design a decentralized online algorithm with compressed communication, decentralized online gradient push-sum with compressed communication (CC-DOGPS). For strongly convex functions, a sublinear regret bound $\mathcal{O}((\ln T)^2)$ of our designed algorithm is obtained, where $T$ is the time horizon. Finally, two numerical simulations are given to validate the theoretical results and illustrate the efficiency of our designed algorithm.
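
The push-sum primitive named in this abstract can be sketched in isolation. The example below implements only classical push-sum averaging over a strongly connected directed graph; it deliberately omits the online gradient step and the communication compression that CC-DOGPS adds on top. The graph, the initial values, and the function name are illustrative assumptions.

```python
def push_sum(values, out_neighbors, steps=100):
    """Classical push-sum averaging on a directed graph.

    values: initial scalar held by each agent.
    out_neighbors: out_neighbors[j] lists the agents that j sends to.
    Each agent splits its mass equally among its out-neighbors and itself,
    which makes the mixing matrix column-stochastic.
    """
    n = len(values)
    x = list(values)   # running numerators
    w = [1.0] * n      # running push-sum weights
    for _ in range(steps):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for j in range(n):
            targets = out_neighbors[j] + [j]  # include a self-loop
            share = 1.0 / len(targets)
            for i in targets:
                new_x[i] += share * x[j]
                new_w[i] += share * w[j]
        x, w = new_x, new_w
    # The ratio corrects the bias caused by unbalanced directed links.
    return [xi / wi for xi, wi in zip(x, w)]


# Strongly connected directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
graph = [[1, 2], [2], [0]]
estimates = push_sum([3.0, 6.0, 12.0], graph)
```

Because the links are directed and unbalanced, the numerators alone do not converge to the average; dividing by the push-sum weights is what lets every agent recover the network-wide mean.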

Abstract:
The field of low-tubal-rank tensor recovery, especially with subspace prior information, has recently garnered significant attention. However, existing methods encounter limitations when dealing with tensor data affected by simultaneous damage and loss. Moreover, they frequently necessitate clean (with no outliers) data to generate subspace prior information, which presents practical challenges. Addressing these issues, this article proposes a generalized subspace coupling (GSC) scheme, equipped with a novel tool to quantify the accuracy of the prior subspace. Building upon this foundation, we delve into the robust low-tubal-rank tensor completion problem, aiming to recover a low-tubal-rank tensor from partially observed data corrupted by sparse noise. Importantly, we theoretically demonstrate that the proposed method achieves exact tensor recovery under significantly weaker incoherence conditions compared to those previously suggested. Additionally, to optimize the proposed model, we design a symmetric Gauss–Seidel-based alternating direction method of multipliers (sGS-ADMM) with guaranteed convergence. Experiments conducted on various datasets, including facial images, medical scans, and video sequences, validate the superiority of our model over existing competitors in both qualitative and quantitative assessments.

Abstract:
Multimodal aspect-based sentiment analysis (MABSA) is a challenging task that predicts sentiment polarity for specific aspect terms based on inputs across modalities. Existing approaches typically employ advanced visual and textual encoders to extract multimodal features and align them for MABSA prediction, yet they still face challenges in handling complex connections between multiple modalities. The recent bloom of large language models (LLMs), as well as their multimodal counterparts, has shown significant promise in various tasks and offers a promising solution for MABSA, albeit with potential limitations such as semantic mismatch between images and texts and the high computational cost of fine-tuning for specific tasks. To address these limitations, in this article, we propose a novel plugin-based approach for MABSA, which uses plugins to encode key knowledge instances, such as salient objects in images and word relationships in texts, with an attentive graph convolutional network (A-GCN). We further utilize a memory-based hub to integrate the encoded multimodal knowledge and align the knowledge representations with the LLM, guiding it to better understand the intricate connections between modalities. We evaluate our approach on two benchmark MABSA datasets, where it outperforms baselines and achieves state-of-the-art performance over existing studies. Further analysis shows that our approach enables efficient and scalable adaptation of multimodal LLMs to specific tasks, making it a promising solution for related tasks. The code is available at https://github.com/synlp/MABSA-LLMPlug

Abstract:
Due to the complex physical properties of granular materials, research on robot learning for manipulating such materials predominantly either disregards the consideration of their physical characteristics or uses surrogate models to approximate their physical properties. Learning to manipulate granular materials based on physical information obtained through precise modeling remains an unsolved problem. In this article, we propose to address this challenge by constructing a differentiable physics-based simulator for granular materials using the Taichi programming language and developing a learning framework accelerated by demonstrations generated through gradient-based optimization on nongranular materials within our simulator, eliminating the costly data collection and model training of prior methods. Experimental results show that our method, with its flexible design, trains robust policies that are capable of executing the task of transporting granular materials in both simulated and real-world environments, beyond the capabilities of standard reinforcement learning (RL), imitation learning (IL), and prior task-specific granular manipulation methods.

Abstract:
Pedestrian detection is crucial in practical applications, such as autonomous driving and video surveillance. However, existing research mainly focuses on improving detection accuracy, with relatively little attention paid to model complexity and operational efficiency. In scenarios with high real-time requirements, the practical deployment of pedestrian detectors still faces many difficulties. To this end, we propose a lightweight and efficient pedestrian detection network (LEPD-Net). First, we design a PoolFormer-based detection head (PDH) to reduce the model computation and inference time. Second, to compensate for the deficiency of PDH in global context modeling, we design a triple-branch joint attention module (TJAM). TJAM uses only a small number of parameters and strengthens the model’s contextual representation by capturing spatial location dependencies and global semantic information between channels. Finally, after incorporating PDH and TJAM into the backbone network, a lightweight and efficient pedestrian detector is constructed. We benchmark the model on the mainstream pedestrian datasets Caltech and CityPersons. The results show that our model achieves state-of-the-art performance. In addition, our model reduces inference time by 25% while maintaining accuracy.

Abstract:
The field of knowledge graph representation learning (KGRL) has been rapidly expanding. To effectively apply KGRL models to large real-world knowledge graphs (KGs), anchor-based methods have been proposed. These methods aim to reduce computational costs and parameter requirements by encoding entities using a small set of entity anchors. However, existing anchor selection approaches are often rudimentary and sometimes yield suboptimal results. In this article, we propose a scalable anchor-based KGRL method called SKIP. By leveraging prototype information, our method selects representative entities as anchors. The SKIP method consists of two main steps. First, pretraining models are employed to encode entities by utilizing the topological structure and textual information in KGs. Second, the prototype learning module (PLM) extracts entity prototypes, which are then used to sample entity anchors that contain valuable prototype information. These settings enable SKIP to identify representative and reasonable entity anchors, leading to improved performance while requiring fewer computational resources. Extensive experiments conducted on various downstream tasks using KGs of different scales demonstrate the superiority and effectiveness of SKIP. Particularly, on the large OGB WikiKG 2 dataset, our method achieves comparable performance while reducing running time by approximately 21.28% and requiring 21.43% fewer model parameters compared to the baseline. This indicates the superior scalability of SKIP.

Abstract:
Multiagent reinforcement learning (MARL) has been widely investigated, ranging from theoretical analysis to real-life applications. However, the utilization of existing non-transparent neural network architectures has resulted in opaque decision-making processes, making it difficult for humans to understand and trust the models being used. Fundamentally, all data is a topological structure, which provides reliable transparency for MARL tasks due to its powerful relational expression capability, scalability, and explicit structural relationships. In this article, we propose a novel approach of graph cooperation modeling (GCM), explicitly capturing and comprehending the complex dynamics of collaborative relationships among agents with the graph structure. GCM learns a metric function to discern beneficial interactions among agents, integrating it into the agent aggregation strategy of a graph neural network (GNN) capable of modeling arbitrary-order interactions. Furthermore, GCM utilizes identity semantics together with global state and individual value functions to estimate the credit of each agent, enhancing each agent’s distinct focus on task-related regions. Extensive experiments on a range of challenging MARL benchmarks demonstrate that GCM not only delivers up to 28.75% relative performance gains on super-hard maps but also offers clear interpretability that provides insights into the underlying cooperative patterns.

Abstract:
In real-world decision-making tasks, it is critical for reinforcement learning (RL) methods to be both stable and robust. Maximum entropy RL methods typically generate a robust policy with an entropy-augmented reward. While incorporating entropy into the reward offers the benefit of exploration, it presents limited universal applicability and persistent convergence difficulties, such as suboptimal policy stabilization and unstable Q-value updates. From an optimization perspective, we define these two issues as tremulous policy and spiky Q-function and investigate their underlying causes and relationships. Our analysis shows that the maximum entropy principle leads to a spiky Q-function update, which ultimately results in a tremulous policy. We thus introduce a beta-symmetric Kullback–Leibler (KL) divergence objective to mitigate such issues under the maximum entropy framework. With this objective function, the tremulous nature of the policy can be controlled with a large beta value. The spiky Q-function can be avoided by annealing the entropy in the target Q value, as the beta-symmetric KL divergence is an upper bound of the original reverse KL divergence. Theoretically, we prove that minimizing our new objective function results in a new policy that presents an improvement in the Q value. Guaranteed by these results, we ultimately derive the optimal policy by iteratively updating the Q value and policy, and we call this method max-entropy stable optimization (MeSO). Experimental results on the Mujoco and Roboschool platforms demonstrate that our algorithm maintains stability while offering better flexibility and overall performance.

Abstract:
Efficient single-image super-resolution (ESISR) primarily aims to enhance super-resolution (SR) performance while keeping model complexity low, making it more suitable for deployment on edge devices. However, the limited receptive field caused by conventional convolution’s locality restricts the backbone networks and attention modules from effectively capturing nonlocal features, leading to suboptimal SR performance. Additionally, insufficient interaction between high- and low-frequency feature information results in incomplete feature representation. To overcome these limitations, we propose a novel ESISR network called the lightweight dual-kernel information aggregation network (LDIAN). First, we design a dual-kernel convolution (DKC) that combines depth-wise 1-D convolution and dilated convolution to efficiently extract richer image features in an expanded receptive field while minimizing model complexity. Building upon DKC, we further develop a dual-kernel enhanced convolution (DEConv) and a dual-kernel enhanced distillation block (DEDB). Additionally, we propose a lightweight dual-kernel attention (DKA) mechanism to focus on more representative features for SR reconstruction. Second, we design an innovative feature fusion structure named the information aggregation block (IAB) to integrate spatial features and strengthen the interaction between high- and low-frequency information, thereby improving feature representation. Extensive quantitative and qualitative experiments demonstrate that the LDIAN achieves state-of-the-art performance with an optimal balance between model performance and complexity. Notably, compared to SRFormer-light, LDIAN-L delivers superior performance across five standard datasets while requiring only about 50% of the model’s FLOPs.

Abstract:
Domain generalization (DG) methods traditionally rely on multiple source domains to achieve robust performance across unseen target domains. However, single-DG (SDG) presents a more practical paradigm by learning from a single source domain, addressing scenarios where access to multiple domains is limited. While existing SDG approaches primarily focus on data augmentation and style transfer techniques to enhance model robustness, these methods often incur substantial computational overhead and may inadequately capture the complexity of real-world domain shifts. In this article, we propose path flatness-aware optimization (PFO), an optimization framework that addresses the fundamental challenges of SDG. Unlike conventional approaches that rely on synthetic data generation, PFO identifies and exploits regions of flat minima within the optimization landscape of deep neural networks. The framework employs an iterative optimization strategy to construct a path through the parameter space along which an ensemble of candidate models achieves minimal empirical risk. The optimization path is initialized by interconnecting model instances, each originating from carefully selected anchor points that are determined computationally through a systematic analysis of classification decision manifolds. This optimization path serves as a mechanism for implicit distribution alignment between source and target domains within the loss landscape, consequently enhancing the model’s capacity for cross-DG. Empirical evaluation on multiple benchmark datasets demonstrates significant performance improvements in cross-DG, validating the efficacy of our approach.

Abstract:
Learning trustworthy and reliable offline policies presents significant challenges due to the inherent uncertainty in pre-collected datasets. In this article, we propose a novel offline reinforcement learning (RL) method to tackle this issue. Inspired by the concepts of Lyapunov stability and control-invariant sets from control theory, the central idea is to introduce a restricted state space for the agent to operate within, which allows the learned models to exhibit reduced Bellman uncertainty and make reliable decisions. To achieve this, we regulate the expected Bellman uncertainty associated with the new policy, ensuring that its growth trend in subsequent states remains within acceptable limits. The resulting method, termed Lyapunov uncertainty control (LUC), is shown to guarantee that the agent remains within a low-uncertainty state enclosure throughout its entire trajectory. Furthermore, we perform extensive theoretical and experimental analysis to showcase the effectiveness and feasibility of the proposed LUC.

Abstract:
Generative models (GMs), particularly large language models (LLMs), have garnered significant attention in machine learning and artificial intelligence for their ability to generate new data by learning the statistical properties of training data and creating data that resemble the original data. This capability offers a wide range of applications across various domains. However, the complex structures and numerous model parameters of GMs obscure the input–output processes and complicate the understanding and control of the outputs. Moreover, the purely data-driven learning mechanism limits GMs’ abilities to acquire broader knowledge. There remains substantial potential for enhancing the robustness and generalization capabilities of GMs. In this work, we leverage fuzzy system, a classical modeling method, to combine both data-driven and knowledge-driven mechanisms for generative tasks. We propose a novel generative fuzzy system framework, named GenFS, which integrates the deep learning capabilities of GMs with the term-based interpretability and dual-driven mechanisms of fuzzy systems. Specifically, we propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S. A series of test studies were conducted on 12 datasets, covering three distinct categories of generative tasks: machine translation, code generation, and summary generation. The results demonstrate that FuzzyS2S outperforms the transformer in terms of accuracy and fluency. Furthermore, it exhibits better performance than state-of-the-art models T5 and CodeT5 for some application scenarios.

Abstract:
Recent advances in counterfactual fairness have shifted focus from flawed group fairness metrics to ensuring individual-level fairness through counterfactual reasoning. However, most existing approaches remain limited to in-processing strategies—injecting fairness constraints into predictive models—while largely overlooking the potential of data preprocessing to mitigate inherent biases. Moreover, few methods address the critical challenge of real-world distribution shift, which can compromise the generalizability of fair models across domains. In this article, we propose the counterfactual reflux variational autoencoder (CRVAE), a novel framework for generating counterfactual samples and learning fair representations. To the best of our knowledge, this is the first work to explicitly consider counterfactual fair representation learning under covariate shift, enabling both single-domain and covariate shift prediction tasks. For fairness, we introduce a Reflux technique that enforces consistency between factual and counterfactual representations. For transferability, we incorporate a domain discriminator to align fair representations across domains. Experimental results show that our approach improves fairness with minimal performance loss and maintains generalization across domains. Furthermore, CRVAE can be flexibly combined with existing in-processing fairness methods. Future work may explore extending this framework to settings with limited causal graph knowledge.

Abstract:
Link prediction (LP) is fundamental to graph-based applications, yet existing graph autoencoders (GAEs) and variational GAEs (VGAEs) often struggle with intrinsic graph properties, particularly the presence of negative eigenvalues in adjacency matrices, which limits their adaptability and predictive performance. To address this limitation, we propose Hyperspherical Kolmogorov–Arnold Networks for LP (HKANLP), a novel framework that combines multiple graph neural network (GNN)-based representation learning strategies with Kolmogorov–Arnold networks (KANs) in a hyperspherical embedding space. Specifically, our model leverages the von Mises–Fisher (vMF) distribution to impose geometric consistency in the latent space and employs KANs as universal function approximators to reconstruct adjacency matrices, thereby mitigating the impact of negative eigenvalues and enhancing spectral diversity. Extensive experiments on homophilous, heterophilous, and large-scale graph datasets demonstrate that HKANLP achieves superior LP performance and robustness compared to state-of-the-art baselines. Furthermore, visualization analyses illustrate the model’s effectiveness in capturing complex structural patterns. The source code of our model is publicly available at https://github.com/zxj8806/HKANLP/

Abstract:
In this article, we first tackle a more realistic domain adaptation (DA) setting: source-free blending-target DA (SF-BTDA), where we cannot access source-domain data while facing multiple mixed target domains without any prior domain labels. Compared to existing DA scenarios, SF-BTDA generally faces the coexistence of different label shifts in different targets, along with noisy target pseudolabels generated from the source model. We propose a new method called evidential graph contrastive alignment (EGCA) to decouple the blending-target domain and alleviate the effect of noisy target pseudolabels. First, to improve the quality of pseudo target labels, we propose a calibrated evidential learning (CEL) module to iteratively improve both the accuracy and certainty of the resulting model and adaptively generate high-quality pseudo target labels. Second, we design a graph contrastive learning scheme with a domain distance matrix and a confidence-uncertainty criterion to minimize the distribution gap between samples of the same class in the blending-target domain, which alleviates the coexistence of different label shifts in blended targets. We construct a new benchmark based on three standard DA datasets, on which EGCA outperforms other methods with considerable gains and achieves results comparable to those of methods that have prior access to domain labels or source data.

Abstract:
Despite the strong performance of transformers, the quadratic computational complexity of self-attention presents challenges in applying them to vision tasks. Linear attention reduces this complexity from quadratic to linear, offering a strong computation–performance tradeoff. To further optimize this tradeoff, automatic pruning is an effective method to find a structure that maximizes performance within a target resource budget through training, without any heuristic approaches. However, directly applying it to multihead attention is not straightforward due to channel mismatch. In this article, we propose an automatic pruning method to deal with this problem. Different from existing methods that rely solely on training without any prior knowledge, we integrate channel similarity-based weights into the pruning indicator to preserve the more informative channels within each head. Then, we adjust the pruning indicator to enforce that channels are removed evenly across all heads, thereby avoiding any channel mismatch. We incorporate a reweight module to mitigate information loss due to channel removal and introduce an effective pruning indicator initialization for linear attention, based on the attention differences between the original structure and each channel. By applying our pruning method to the FLattenTransformer on ImageNet-1K, which incorporates original and linear attention mechanisms, we achieve a 30% reduction of FLOPs in a near lossless manner. Our method also achieves a 1.96% accuracy gain over the DeiT-B model while reducing FLOPs by 37%, and a 1.05% accuracy increase over the Swin-B model with a 10% reduction in FLOPs. The proposed method outperforms previous state-of-the-art efficient models and recent pruning methods.
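
The constraint that channels are removed evenly across heads can be illustrated with a short sketch. It assumes that a per-channel importance score is already available for every head as a plain number; in the paper the pruning indicator is learned during training and weighted by channel similarity, which is not modeled here. Selecting the top-scoring channels independently in each head is an assumption used only to show why every head ends up with the same width; the function name and scores are hypothetical.

```python
def prune_evenly(scores_per_head, keep_per_head):
    """Keep the same number of channels in every attention head.

    scores_per_head: one list of per-channel importance scores per head.
    Returns, for each head, the sorted indices of the channels that are kept.
    """
    kept = []
    for scores in scores_per_head:
        # Rank this head's channels from most to least important.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c],
                        reverse=True)
        kept.append(sorted(ranked[:keep_per_head]))
    return kept


# Two heads with four channels each (toy importance scores).
scores = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.7, 0.1],
]
kept = prune_evenly(scores, keep_per_head=2)
```

Because every head retains exactly `keep_per_head` channels, the per-head query, key, and value widths stay equal after pruning, which is the property the abstract describes as avoiding channel mismatch.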

Abstract:
Imitation learning (IL) aims to learn a policy from expert demonstrations and has been applied to various applications. By learning from the expert policy, IL methods do not require environmental interactions or reward signals. However, most existing IL algorithms assume perfect expert demonstrations, but expert demonstrations often contain imperfections caused by errors from human experts or sensor/control system inaccuracies. To address the above problems, this work proposes a filter-and-restore framework to best leverage expert demonstrations with inherent noise. Our proposed method first filters clean samples from the demonstrations and then learns conditional diffusion models to recover the noisy ones. We evaluate our proposed framework and existing methods in various domains, including robot arm manipulation, dexterous manipulation, and locomotion. The experiment results show that our proposed framework consistently outperforms existing methods across all the tasks. Ablation studies further validate the effectiveness of each component and demonstrate the framework’s robustness to different noise types and levels. These results confirm the practical applicability of our framework to noisy offline demonstration data.

Abstract:
Recently, domain alignment and metric-based few-shot learning (FSL) have been introduced into hyperspectral image classification (HSIC) to solve the issues of uneven data distribution and scarcity of annotated data faced in practical applications. However, existing cross-domain few-shot methods ignore pivotal frequency priors of the complex field, which contribute to better category discrimination and knowledge transfer. To address this issue, we propose a novel physics-guided time-interactive-frequency network (PTFNet) for cross-domain few-shot HSIC, enabling the extraction of both frequency priors and spatial features (termed “time domain” following Fourier convention) simultaneously through a lightweight time-interactive-frequency module (TiF-Module) as a pioneering effort. Meanwhile, a spectral Fourier-based augmentation module (SFA-Module) is designed to decouple the frequency priors and enhance the diversity of distribution of physical attributes to imitate the domain shift. Then, the physics consistency loss is introduced to regularize the diverse embeddings to approximate the center of each category’s physical attributes, guiding the network to excavate more transferable knowledge of source domain (SD). Furthermore, to fully exploit the discriminant time–frequency information and further improve the accuracy of boundary pixels, a set of multiorientation homogeneous prototypes is adopted to represent each class comprehensively, and an intuitive and flexible uncertainty-rectified bidirectional random walk strategy is applied to replace the Euclidean metric for more reliable classification. The experimental results on four public datasets demonstrate the prominent performance of the proposed PTFNet.

Abstract:
Knowledge graphs (KGs) have attracted increasing attention in recent years. Currently, in some practical scenarios, KG embedding (KGE) models are expected to reduce their spatial complexity without losing much performance to address the challenges of storage limitations and knowledge reasoning efficiency. To achieve this, existing works use one or more large and high-performance teacher models to improve the performance of a lightweight student model via knowledge distillation (KD), thus meeting the requirements of some complicated practical applications. However, in resource-constrained scenarios, obtaining high-performance teacher models is challenging due to high training costs and significant storage requirements. Thus, enhancing the student model’s performance without large teacher models is crucial. To address this issue, we propose Dual-View Mutual Distillation Framework for Knowledge Graph Embeddings (DMutDE), a distillation framework leveraging mutual learning for peer-to-peer distillation between two KGE models with different architectures. In KGE models, we notice that the way of modeling relational directed edges determines the model view through which a KGE model learns KG data. Thus, integrating the model views from two different KGE models by KD into a student KGE model can improve its generalization and thereby its performance. To identify an effective dual-view fusion method, we design two modules in the DMutDE framework. Specifically, we design a novel soft-label fusion (SLF) module for noise filtering and response knowledge transfer. Then, we propose an entity embedding distillation (EED) module that lets the two models distill structural features from each other. Finally, we conduct comprehensive experiments on standard open-source benchmarks to demonstrate that our framework achieves state-of-the-art results. The code is available at https://github.com/RuizhouLiu/DMutDE

Abstract:
Optimal transport (OT) has gained significant attention in deep learning as a powerful mathematical tool for transforming distributions. Specifically, in deep generative models, the incorporation of OT helps address issues such as training instability, vanishing gradients, and mode collapse. However, in these models, most of the OT mappings learned by neural networks are typically implicit, making it difficult to explicitly model the relationship between the source and target domains. This limitation reduces the interpretability of the model and hinders its applicability in conditional generation tasks. To address this issue, we introduce Nesterov’s smoothing technique to smooth the Brenier potential, enabling the derivation of an explicit OT mapping that serves as the foundation for constructing an advanced generative model. The proposed model offers the following advantages. First, it explicitly captures the mapping between the source and target domains, thereby enhancing the interpretability of the generative process and enabling a novel pathway for conditional sample generation based on a smoothed approximation of OT mapping. Second, the model can generate new samples directly through an explicit OT mapping, eliminating the need for interpolation and rejection sampling commonly seen in traditional methods, thereby improving generation efficiency. Moreover, extensive experiments show that our proposed model achieves superior performance in both unconditional and conditional generation tasks.

Abstract:
Source-free domain adaptation (SFDA) is a challenging, yet valuable task within unsupervised domain adaptation (UDA), which adapts pretrained models to diverse unlabeled target domains while safeguarding the data security of the source domain. However, existing SFDA methods primarily focus on computer vision applications, often overlooking the unique characteristics of time series, such as temporal dependencies and sequential nature. Moreover, the fine-tuning paradigm of current SFDA methods is typically limited to posterior adaptation, focusing solely on constraining the statistical properties of model outputs. We argue that this black-box paradigm lacks semantic interpretability and risks aligning with spurious contextual noise, leading to negative transfer. This necessitates a paradigm evolution from blind statistical adaptation to interpretable adaptation. To this end, we introduce model salience as a quantifiable proxy of semantic interpretability, representing the importance weights a trained model assigns to specific temporal fragments. Accordingly, we propose a novel fine-tuning paradigm for time-series SFDA, termed PrEPoA, which integrates Prior Evaluation of model salience with Posterior Adaptation. In the prior evaluation stage, a key pattern reconstruction (KPR) module based on a sensitive masking mechanism is designed to quantify the model salience, while a novel interpattern triplet loss is introduced to calibrate it. In the posterior adaptation stage, robust prototype clustering (RPC) generates trustworthy reference labels as pseudo-ground truth for adaptation. Comprehensive experiments on the wireless sensor data mining (WISDM), human activity recognition (HAR), heterogeneity HAR (HHAR), machine fault diagnosis (MFD), and sleep stage classification (SSC) datasets demonstrate the superiority of our PrEPoA framework compared to nine UDA and seven SFDA methods. 
Furthermore, we experimentally validate that PrEPoA serves as a plug-and-play module that can be effectively incorporated into other SFDA methods.

Abstract:
Segmenting medical images accurately is crucial for disease prevention and treatment. Despite the significant progress of deep learning techniques in semi-supervised segmentation, they still struggle to identify and utilize ambiguous regions with high predictive volatility in practical applications. Considering that ambiguous regions in unlabeled data contain more informative complementary cues, this article proposes an innovative ambiguous focusing and correction (AFoCo) framework. AFoCo consists of two parallel and complementary networks: the ambiguous focus network and the ambiguous correction network. The ambiguous focus network combines historical change prediction and instantaneous information entropy to compute ambiguity indices and accurately capture ambiguous regions. Meanwhile, the ambiguous correction network utilizes the identified deterministic information to redistribute the pixel labels of the ambiguous region through the weight-weighted similarity strategy, thus effectively alleviating prediction volatility in ambiguous areas. Furthermore, we propose a task-aware asymmetric cross-supervision constraint, which assigns differentiated cross-pseudo supervision signals based on the task-specific characteristics of the two networks. By leveraging a consistency constraint, it enhances global prediction stability, ensuring precise ambiguous region focusing and high-quality feature rectification. The experimental results show that AFoCo performs better than other SOTA techniques on four medical image datasets, significantly improving the segmentation accuracy and effectively reducing the proportion of ambiguous regions.

Abstract:
Local spectral features and global spatial context are essential for hyperspectral image (HSI) classification. However, existing methods based on convolutional neural networks (CNNs), graph convolutional networks (GCNs), and Transformers often rely on multibranch structures to separately extract and fuse local and global features, resulting in high computational complexity and redundant information that can negatively affect classification performance. To address these issues, we propose a two-stage graph convolutional mamba network (TGMN) that enables efficient modeling of local and global features through sequential intrasubgraph local feature extraction and intersubgraph global information learning. Specifically, in the first stage, we partition the HSI into superpixel regions and treat each superpixel as a subgraph, where a GCN is applied to aggregate spectral–spatial features within each subgraph. We further design a downsampled subgraph feature reconstruction (DSFR) module that dynamically selects key nodes to reduce redundancy, highlight critical features, and enhance model representation capability. In the second stage, the Mamba network models the global dependencies between subgraphs and introduces a region-relation aware absolute positional encoding (RAPE) module. This module encodes spatial positional information into embedded vectors by integrating the relative distance and direction between the geometric center of each superpixel and the image center, which are then deeply fused with the feature matrix to improve spatial relationship comprehension. The two-stage sequential structure ensures effective local and global feature extraction, avoiding the high computational complexity and redundancy issues commonly associated with multibranch models. Experiments on three benchmark datasets demonstrate its superiority, achieving classification accuracies of 98.54%, 98.30%, and 96.94% on the Indian Pines, Dioni, and Honghu datasets, respectively. 
Compared to state-of-the-art methods, TGMN achieves higher classification accuracy with significantly lower computational cost, demonstrating its efficiency and effectiveness for HSI classification.

Abstract:
Neural networks (NNs) have gained significant popularity for modeling complex, nonlinear systems due to their powerful approximation capabilities. However, designing an appropriate network structure and tuning parameters remains challenging, especially for nonlinear dynamic systems where offline training data are unavailable and poor approximations from badly tuned NNs can cause instability. This article presents a novel adaptive frequency-based constructive wavelet NN (AFBCWNN) for tracking reference trajectories for a class of unknown nonlinear dynamic systems. Using online measurements, the AFBCWNN integrates adaptive weight updating, adjustable network structures, and rigorous stability analysis using Lyapunov techniques. Unlike conventional methods, the proposed AFBCWNN leverages frequency-domain analysis to estimate the energy distribution of the unknown nonlinear mapping from measured data. This frequency-based approach provides a uniform design guideline for network initialization, enabling the network to dynamically add wavelet bases when the desired accuracy is not achieved and prune nonenergy-active (low-energy) bases, reducing computational cost without compromising accuracy. Rigorous stability analysis establishes conditions for uniformly bounded trajectories, and simulation results confirm the AFBCWNN’s superior performance in capturing complex, nonlinear dynamics compared to the existing adaptive methods.

Abstract:
This article develops a scheme to tackle the safe optimal formation tracking issue for multiple fixed-wing uncrewed aerial vehicles (UAVs) with external disturbances and asymmetric control constraints. To ensure safety constraints in collision avoidance, a safe set is first constructed by a super level set of a continuously differentiable function, following a novel control barrier function (CBF) to characterize the safety. Subsequently, we transform the safe optimal formation tracking control into a constrained zero-sum (ZS) differential game to mitigate the destabilizing effects of the disturbances, where the cost function is constructed in a nonquadratic form to cope with asymmetric input constraints. Particularly, the designed CBF is integrated into the cost function to penalize unsafe behavior, and a damping coefficient is included to balance optimality and safety. Afterward, a critic-only reinforcement learning (RL) strategy is developed to learn the robust safe Nash policy, where the critic weights are updated by applying experience replay technology, thus avoiding the requirement for the persistence of excitation condition. Moreover, the stability and forward invariance of the safe set of the presented scheme are also verified. Finally, simulation examples are provided to substantiate the validity of the control scheme.

Abstract:
Knowledge graph reasoning (KGR) is an important task in data mining. It aims to mine logical rules from existing facts and further infer new facts, which makes the graph more complete and accurate. With the development of large language models (LLMs), LLMs are now widely integrated with different baseline models for better performance. However, few works have been proposed on LLM-enhanced KGR models, which leaves many issues to be addressed. Inspired by the efficiency and accuracy of LLMs in generating textual semantic information, this article proposes a KGR method based on LLM information enhancement and subgraph alignment (LSA). LSA first utilizes an LLM to generate textual descriptions corresponding to graph entities, relationships, and subgraphs. Then, it utilizes the generated textual attributes in both explicit and implicit ways: 1) explicit utilization, treating LLM-generated text features as the initial features for an existing KGR model; and 2) implicit utilization, aligning the structural and textual information of key subgraphs via a learning mechanism. Finally, LSA is evaluated on three typical datasets. The promising performance demonstrates that LSA leverages the LLM to enrich the KG with more information and empowers the representation learning model with better expressive ability.

Abstract:
Specific emitter identification (SEI) is a crucial task in various applications such as wireless communications and radar systems. The low-pass nature of vanilla Transformers hinders the extraction of high-frequency fingerprint features, resulting in poor SEI performance. Moreover, the introduction of additional high-frequency sensing structures can further increase the computational cost of the already computationally intensive Transformer. To address these issues, we propose a high-frequency enhanced and low-complexity Transformer named HET. The framework integrates a multihead low-complexity self-attention (MLSA) module, a high-frequency enhanced connection, and a multihead high-frequency enhanced low-complexity self-attention (MESA) module. The MLSA module reduces the computational complexity by key and value mapping. The MESA module and the high-frequency enhanced connection capture high-frequency information by reconstructing the low-frequency and high-frequency components of the features. We construct three HET variants, namely, $\text{HET}_n$, $\text{HET}_u$, and $\text{HET}_m$, based on different enhancement methods and positions using $\text{MESA}_n$, $\text{MESA}_u$, and $\text{MESA}_m$, respectively. Extensive experiments are conducted on the XSRP, ADS-B, and Wi-Fi datasets to evaluate the proposed models, demonstrating their competitive accuracy and faster throughput compared with popular methods. Theoretical proofs of high-frequency suppression and frequency response results confirm that the proposed framework has more gain for high-frequency information in SEI. Code is available at: https://github.com/zhailei-zl/HETmodel

Abstract:
Existing time-series forecasting methods often struggle to adapt to dynamic scenarios and lack flexibility in prediction. They typically require retraining the model when the prediction length or position changes. Moreover, these methods still face challenges in effectively capturing and utilizing time-position embeddings (PEs). To address these limitations, this article proposes a novel model called D2Vformer. Unlike conventional prediction methods that rely on fixed-length predictors, D2Vformer can directly handle scenarios with arbitrary prediction lengths. In addition, it significantly reduces training resource consumption and proves highly effective in real-world dynamic environments. In D2Vformer, the Date2Vec (D2V) module is devised to leverage timestamp information and feature sequences to generate time PEs. Subsequently, D2Vformer introduces an innovative fusion module that leverages an attention mechanism to capture the mapping between input and target time PEs, thereby enabling flexible prediction. Extensive experiments on six datasets demonstrate that D2V outperforms other time-PE methods, while D2Vformer surpasses state-of-the-art approaches in both fixed-length and arbitrary-length prediction tasks. The code for D2Vformer is available at: https://github.com/TeamofHaoWang/D2Vformer

Abstract:
Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain’s ability to continually absorb and generalize from ongoing experiences, our approach borrows the mechanism of the complementary learning system. Specifically, we first deploy generative adversarial networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus’s role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex’s activity patterns triggered by hippocampal replays and the preexisting neocortical knowledge. This comprehensive framework empowers the de-raining network to accumulate knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework’s effectiveness. It not only facilitates continual knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios. Our code is available at https://github.com/wangkunyu241/CLGID

Abstract:
Graph neural network (GNN)-based approaches have achieved remarkable success in temporal knowledge graph (TKG) reasoning. Despite these advances, two critical challenges remain: 1) inadequate modeling of local contextual dynamics, which limits the adaptability of entity and relation representations to specific queries and 2) inadequate mechanisms for handling emerging patterns, that is, novel interactions absent from historical data, which reduces predictive performance in dynamic environments. To address these limitations, we propose TCDR–PD, a temporal and contextual dynamic representation network with pattern decomposition. TCDR–PD introduces a temporal and contextual dynamic representation learning (TCDR) module to capture both global temporal trends and query-specific contextual dynamics, enabling more precise embeddings. Additionally, the pattern decomposition (PD) prediction module explicitly disentangles the prediction of recurring and emerging patterns, enabling tailored strategies to improve reasoning performance. Experiments on four benchmark datasets demonstrate that TCDR–PD outperforms state-of-the-art methods, effectively supporting stable reasoning over evolving TKGs.

Abstract:
Causal discovery plays a pivotal role in scientific inquiry and subsequent applications in prediction or decision-making. While many methods have been proposed, many of them rely on independence tests. However, these tests are difficult to implement and computationally intensive. In this article, we aim to propose a direct and computationally efficient method to determine the causal relationship between two observed variables in the linear non-Gaussian case. Building on the insight that cumulants provide information about the shape of a probability distribution, we show that interestingly, the (in)dependence between two observed variables can be directly inferred from the difference in the product of certain joint cumulants of these variables. This concept is named the cause difference criterion. Based on this criterion, we introduce two practical methods, high-order cumulant (HC) and HC-linear non-Gaussian acyclic model (LiNGAM), for causal discovery in the high-dimensional case. Theoretical analyses ensure the identifiability of the proposed criteria and methods. Experimental results indicate that our methods outperform most existing methods.

Abstract:
People’s social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This brings unique challenges to recognizing social relationships, requiring understanding and capturing the essence of these contexts from visual appearances. However, current methods of social relationship understanding rely on the basic classification paradigm of detected persons and objects, which fails to understand the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes contextual social relationships (ConSoRs) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter upon the frozen contrastive language-image pretraining (CLIP) to learn social concepts via our novel multimodal side adapter tuning mechanism. Furthermore, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, and emotions) with social relationships for each image, and then compel ConSoR to concentrate more intensively on the decisive visual social factors via visual–linguistic contrasting. Impressively, ConSoR outperforms previous methods with a 7.6% gain on the people-in-social-context (PISC) dataset and a 9.8% increase on the people-in-photo-album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships. Code is available at https://github.com/starsholic/ConSoR

Abstract:
Gradient descent, computed through backpropagation (BP), has been widely used to train spiking neural networks (SNNs). However, the approach has several limitations. It requires manual intervention to tune the network architecture, is prone to catastrophic forgetting of previously learned information when exposed to data containing new information, and is computationally demanding. To address these issues, we propose brain-mimetic developmental spiking neural networks (BDNNs), which emulate the postnatal development of biological neural circuits. We evaluated BDNNs using a neuromorphic tactile system with the task of classifying objects through grasping. Our findings show that BDNNs grow dynamically in response to input data by incrementally recruiting hidden neurons, leading to steadily increasing classification accuracy without the need for manual architecture tuning. The growth process adapts autonomously to the complexity of incoming data. BDNNs also exhibit strong knowledge transfer capabilities, which effectively leverage previously learned knowledge about grasping objects to incrementally learn about new objects. Furthermore, in comparative experiments using the same dataset and hardware, BDNNs achieved classification performance comparable to the standard BP-based method and its variants, while learning one to three orders of magnitude faster. Moreover, BDNNs outperform existing continual learning algorithms in both performance and speed. These results highlight BDNNs as a promising approach for continual learning and real-time edge computing applications. The source code of our work is publicly available at https://github.com/1jiaqixing/BDNNversion1

Abstract:
Segmenting unknown domains using a model trained in the source domain remains challenging. Although some approaches have tried to resolve the problem through various data generation schemes and network architecture designs, they cannot achieve segmentation results comparable to single-domain segmentation with a consistent data distribution. Therefore, we propose a data augmentation method based on amplitude perturbation to expand the distribution of data types, thereby covering target data. A feature suppression strategy is proposed to reduce the network’s over-reliance on important features of the source domain data and thus improve generalization performance. In addition, we design a luminance contrast consistency (LCC) learning module to harmonize the data styles between different domains and a multiscale convolutional attention (MSCA) module to enhance the network’s perception of small target objects, which further improves segmentation performance. Our method achieves state-of-the-art (SOTA) results on two public datasets, ATLAS2.0 and Prostate. The code is available at https://github.com/butterflyGN/DGSFTAFS
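
The abstract does not specify how the amplitude is perturbed, so the following is only a minimal sketch of one common form of Fourier amplitude augmentation: the amplitude spectrum of a single-channel 2-D image is scaled by random factors while the phase is kept. The function name `amplitude_perturb`, the multiplicative Gaussian noise, and the `strength` parameter are assumptions for illustration, not the authors' method.

```python
import numpy as np

def amplitude_perturb(image, strength=0.1, rng=None):
    """Randomly rescale the Fourier amplitude of a 2-D image, keeping its phase.

    `strength` is the standard deviation of the multiplicative noise
    (an assumed parameterization); strength=0 returns the input unchanged
    up to floating-point error.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Multiplicative noise around 1, clipped so amplitudes stay non-negative.
    scale = np.clip(1.0 + strength * rng.standard_normal(amplitude.shape), 0.0, None)
    perturbed = amplitude * scale * np.exp(1j * phase)
    # The random scaling breaks conjugate symmetry, so keep only the real part.
    return np.real(np.fft.ifft2(perturbed))

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # stand-in for a source-domain slice
augmented = amplitude_perturb(image, strength=0.2, rng=rng)
```

Because the phase largely carries structural content while the amplitude carries style-like statistics such as intensity and contrast, perturbing only the amplitude is a plausible way to vary appearance without moving anatomical boundaries.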

Abstract:
Remote sensing image (RSI) denoising is an important and fundamental task in RSI processing. Existing denoising methods usually assume that RSI lies in a single matrix or tensor subspace. However, due to wavelength differences and/or temporal variability, the assumption of a single subspace may not be suitable for RSI. To address this, we propose a tensor multi-subspace representation (TenMSR) for RSI mixed noise removal. Specifically, TenMSR finely characterizes the intrinsic tensor multi-subspace structure of RSI. Compared with single matrix/tensor subspace-based methods, the proposed method can not only precisely describe the wavelength differences and/or temporal variability of RSI but also produce a more compact image distribution in the tensor multi-subspace. To mine and preserve the multi-subspace structure, we introduce a nonlinear transform-based 3-D tensor nuclear norm to characterize the tensor low rankness of the multi-subspace representation coefficient. An effective algorithm based on the proximal alternating minimization (PAM) framework is developed to solve the proposed model with theoretical convergence analysis. Extensive experiments show the effectiveness and superiority of the proposed method over existing state-of-the-art single matrix/tensor subspace RSI denoising methods.

Abstract:
Training-based algorithms significantly outperform training-free methods in terms of recognition performance for steady-state visual-evoked potential (SSVEP)-based brain–computer interfaces (BCIs). However, collecting training data requires calibration experiments that are effort-intensive and often costly. These calibration demands limit the practicality of BCIs, as users (and even system operators) may experience fatigue or lose interest in continued use. Transfer learning (TL) offers an effective solution, but it typically relies on either a certain amount of target domain data or extensive source domain data. To address this limitation, we introduce the concept of cross-dataset TL in SSVEP for the first time to extract transfer knowledge from other datasets. During this process, we identified a data mismatch problem that severely compromises the generalizability of transfer knowledge. To overcome this challenge, we propose a TL-SSVEP decoding algorithm calibrated with single-trial data (TL-CSTD). Specifically, we use 2 s of 8 Hz single-trial calibration data from the target domain to obtain matched transfer templates from the source domain. These templates are then corrected to extract holistic and single-period transfer knowledge, which are subsequently employed to construct an efficient TL-SSVEP decoding model for the target subject. Experimental results on three large SSVEP datasets demonstrate that TL-CSTD effectively addresses the data mismatch problem and achieves excellent SSVEP recognition performance using only 2 s of single-trial calibration data, showing its significant application potential and practicality.

Abstract:
The operation and control of active distribution networks (ADNs) are becoming increasingly important due to the high penetration of renewable energy (RE). The inherent uncertainty of RE can affect the stability and efficiency of ADN operations. To mitigate the inherent uncertainty and rapid variability of high RE penetration in ADNs, this article uses an online ADN reconfiguration (ADNR) approach to ensure swift responses to RE fluctuations. Traditional deep reinforcement learning (DRL)-based methods typically model ADNR as a Markov decision process (MDP) and rely on historical ADN data to train the DRL agent. This may lead to a mismatch between the MDP’s characteristics and the actual ADNR, and it poses challenges in handling scenarios that do not exist in the training data. To address this issue, this article proposes an online–offline DRL framework for online ADNR. Initially, during the offline stage, ADNR is formulated as a state-driven Markov decision process, which incorporates the operational characteristics of the ADN. Following this, a state-driven proximal policy optimization (SD-PPO) algorithm is proposed to enhance the generalization capability of DRL. In the subsequent step, we present the optimized action proximal policy optimization (OA-PPO) algorithm, which performs personalized training based on SD-PPO to further improve DRL performance in the online stage. The proposed approach is applied to three IEEE ADN systems. Numerical results demonstrate the effectiveness of our approach in reducing power loss and enhancing RE accommodation. Furthermore, detailed comparisons with other DRL and traditional ADNR algorithms confirm the superior computational performance of our proposed method.

Abstract:
Automated lesion segmentation through breast ultrasound (BUS) images is an essential prerequisite in computer-aided diagnosis. However, the task of breast segmentation remains challenging, due to the time-consuming and labor-intensive process of acquiring precise labeled data, as well as severely ambiguous lesion boundaries and low contrast in BUS images. In this article, we propose a novel semi-supervised breast segmentation framework based on confidence-ranked features and bi-level prototypes (CoBiNet) to alleviate these issues. Our outputs are derived from two branches: classifier and projector. In the projector branch, we first rank the features by multilevel sampling to obtain multiple feature sets with different confidence levels. Then, these sets are progressed in two directions. One is to acquire local prototypes at each level by local sampling and perform trans-confidence level (TCL) contrastive learning. This encourages the low-confidence features to converge to the high-confidence features, which enhances the model’s ability to recognize ambiguous regions. The other process is to generate more representative global prototypes by global sampling, followed by generating more reliable predictions and performing cross-guidance (CG) consistency learning with the classifier output predictions, facilitating knowledge transfer between the structure-aware projector and the category-discriminative classifier branches. Extensive experiments on two well-known public datasets, BUSI and UDIAT, demonstrate the superiority of our method over state-of-the-art approaches. Codes will be released upon publication.

Abstract:
Offline reinforcement learning (RL) aims to learn effective agents from previously collected datasets, facilitating the safety and efficiency of RL by avoiding real-time interaction. However, in practical applications, the approximation error of out-of-distribution (OOD) state–actions can cause considerable overestimation due to error exacerbation during training, ultimately degrading performance. In contrast to prior works that merely addressed the OOD state–actions, we discover that all data introduces estimation error whose magnitude is directly related to data sparsity. Consequently, the impact of data sparsity is inevitable and must be accounted for when inhibiting error exacerbation. In this article, we propose an offline RL approach to inhibit error exacerbation with data sparsity (IEEDS), which includes a novel value estimation method that considers the impact of data sparsity on the training of agents. Specifically, the value estimation phase includes two innovations: 1) replacing the Q-net with a V-net, since the smaller and denser state space makes data more concentrated, contributing to more accurate value estimation and 2) introducing state sparsity into training by designing a state-aware-sparsity Markov decision process (MDP), further lessening the impact of sparse states. We theoretically prove the convergence of IEEDS under the state-aware-sparsity MDP. Extensive experiments on offline RL benchmarks demonstrate the superior performance of IEEDS.

Abstract:
The success of multiview subspace clustering (MVSC) lies in the efficient integration of consensus and complementary information from subspace structures. However, existing MVSC algorithms often treat these two types of information separately, overlooking the obvious correlations—both positive and negative—that exist between them. To address this issue, we propose a novel method called contrastive-driven diversity and consistency exploration in tensorized MVSC (CD-TMSC). Specifically, our method begins by segmenting self-representations into a consensus representation and a set of specific representations to accurately model both consensus and complementary information. Drawing inspiration from contrastive learning, we introduce a novel fractional regularization term to harness both the positive and negative correlations inherent in consensus and complementary information, where the numerator, employing the Hilbert–Schmidt independence criterion (HSIC), quantifies the negative correlation between the consensus and view-specific representations, as well as among the view-specific representations themselves. Conversely, the denominator, also leveraging HSIC, measures the positive correlation between the original data matrices and their self-representations. Minimizing this term has a dual effect: it reduces the numerator, thereby amplifying the negative correlation, a move that might be seen as counterproductive but, in our innovative approach, it promotes the diversity within representation matrices. Simultaneously, it increases the denominator, reinforcing the positive correlation and bolstering the consistency of the information. Additionally, we incorporate a graph regularization term for the consensus matrix to capture more consistent manifold information. 
Finally, utilizing the high-quality consensus and view-specific representations derived from our constraints, we reconstruct self-representation matrices and construct a third-order tensor with a low-rank constraint to explore higher order correlations within the self-representations. Our method integrates contrastive-driven regularization, manifold learning, and low-rank tensor learning into a cohesive framework, optimized using an alternating direction minimization strategy. Experimental results on multiple benchmark datasets show that our approach outperforms several state-of-the-art methods.
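
The HSIC terms used in the fractional regularizer above can be illustrated with the standard empirical estimator, tr(KHLH)/(n-1)^2, where H is the centering matrix. This is only a minimal sketch: the linear kernel, the toy data, and the function names (`gram_linear`, `center`, `hsic`) are assumptions made for illustration and are not taken from CD-TMSC.

```python
def gram_linear(rows):
    """Gram matrix of a linear kernel; each row is one sample."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def center(k):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [k[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x_rows, y_rows):
    """Empirical HSIC between two sets of paired samples (linear kernels assumed)."""
    n = len(x_rows)
    kc = center(gram_linear(x_rows))
    lc = center(gram_linear(y_rows))
    # tr(Kc Lc) equals tr(K H L H) because H is idempotent.
    trace = sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Hypothetical toy data: one feature per sample.
x = [[1.0], [2.0], [3.0], [4.0]]
dependent = hsic(x, x)                               # strongly dependent pair
independent = hsic(x, [[5.0], [5.0], [5.0], [5.0]])  # constant second variable
```

A larger value indicates stronger statistical dependence, so a constant second variable yields zero. In practice a nonlinear kernel such as a Gaussian kernel is often substituted for the linear one.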

Abstract:
The effective integration and classification of hyperspectral images (HSIs) and light detection and ranging (LiDAR) data is of great significance in Earth observation missions, which are confronted with challenges such as insufficient information utilization and feature heterogeneity. This article proposes a multimodal quaternion representation network (MMQRN) for multisource remote sensing (RS) data classification. Specifically, we first propose the multimodal quaternion representation (MMQR), which employs the orthogonal imaginary components of quaternions to model the complex nonlinear interactions among complementary features, thereby enabling their comprehensive fusion and utilization. Subsequently, we design a multimodal feature cross-fusion (MFCF) framework to integrate multisource, multimodal, and multilevel features adequately. Finally, we leverage the ability to capture long-term dependencies of transformers to design a quaternion convolutional transformer network (QCTN) for modeling global and local spatial–spectral information, respectively. Experiments conducted on three multisource RS datasets demonstrate the superior performance of the proposed MMQRN relative to other state-of-the-art classification methods.

Abstract:
Few-shot segmentation (FSS) has garnered significant attention. Many recent approaches attempt to introduce the segment anything model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of SAM, such a solution shows great potential in FSS. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an unbiased semantic decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the contrastive language-image pretraining (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and a local guidance at the pixel level, which provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, a learnable visual–text target prompt generator (VTPG) is proposed, in which target text embeddings interact with CLIP visual features. Without requiring retraining of the vision foundation models, the features with semantic discrimination draw attention to the target region through the guidance of prompts with rich target information. Experiments on both PASCAL-$5^i$ and COCO-$20^i$ show that our proposed method outperforms the existing approaches by a clear margin and achieves new state-of-the-art performance.

Abstract:
Federated learning (FL) has become a mainstream decentralized learning paradigm due to its privacy-preserving features. However, the heterogeneity of data in FL can reduce predictive accuracy and complicate the analysis of the generalization properties of FL methods. In this article, we propose efficient federated kernel learning (FedK) algorithms and study their generalization properties. We first devise FedK with random features (FedK-RF), which acquires global information through sharing RF of local data subsets, enhancing predictive capability while protecting privacy. We then propose federated Nyström approximation with RF (FedNK-RF) that reduces errors resulting from RF. Furthermore, using integral operator theory, we derive the excess risk bounds with minimax optimal rates, which illustrate the impacts from data heterogeneity and shared information. Finally, we conduct several experiments that demonstrate the superiority of the proposed FedNK-RF.
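
To make the random-features idea concrete, the sketch below builds random Fourier features that approximate a Gaussian kernel. It assumes the RF in question are of the random Fourier type, which the abstract does not state. The bandwidth, feature count, and function names are likewise hypothetical, and the federated sharing step itself is not modeled.

```python
import math
import random


def make_random_features(dim, num_features, bandwidth, seed=0):
    """Return a map z(x) whose inner products approximate a Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, 1/bandwidth^2); phases drawn from U[0, 2*pi].
    weights = [
        [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
        for _ in range(num_features)
    ]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [
            scale * math.cos(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, offsets)
        ]

    return feature_map


def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


feature_map = make_random_features(dim=2, num_features=2000, bandwidth=1.0)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
exact = gaussian_kernel(x, y, bandwidth=1.0)
```

Because every client can regenerate the same feature map from a shared seed, only the low-dimensional features would need to be exchanged rather than raw data. Whether FedK-RF works this way is not stated in the abstract.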

Abstract:
Soft-thresholding (ST) has been widely used in deep neural networks. Its fundamental network structure is a deep soft-thresholding fully connected network (ST-FCN). However, training deep ST-FCN to achieve convergence remains time-consuming or even encounters gradient explosion, in part because the convergence behavior is not fully understood. To address this issue, this article proves the relationship between the convergence of deep ST-FCN and the values of network weights and biases. Theoretical analysis shows that, as the number of network layers approaches infinity, deep ST-FCN converges when the network weights tend to an identity matrix, while the biases tend to zero. Following this guidance, we initialize the network weights as the identity matrix, compare it with other representative initialization methods (Gaussian, He, LeCun, Xavier, and Uniform), and quantify their effects on network convergence. Extensive results on a synthetic spectrum dataset and real-world datasets (MNIST and CIFAR-10) demonstrate that initializing the weights to the identity matrix and the bias to zero leads to fast and stable convergence. These conclusions are further supported by additional experiments and statistical analysis on deeper ST networks (with more than ten layers) and other representative architectures (DenseNet-161, ResNet-152, and VGG-19), and more challenging benchmarks (CIFAR-100, STL-10, and Tiny ImageNet). This work provides a theoretical foundation for understanding the convergence of ST neural networks. Furthermore, convergence theory analysis for deep recurrent neural networks (RNNs) with ST is deduced.
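
The initialization studied above can be sketched as follows: each layer applies the soft-thresholding operator to an affine map whose weights start as the identity matrix and whose biases start at zero. The threshold value, the depth, and the function names are hypothetical choices made for illustration, and the exact layer definition in the paper may differ.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become zero."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def st_layer(x, weights, bias, tau):
    """One ST-FCN layer: soft-thresholding applied elementwise to W x + b."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [soft_threshold(p, tau) for p in pre]


def st_fcn(x, depth, tau):
    """Forward pass through `depth` layers with identity weights and zero biases."""
    n = len(x)
    for _ in range(depth):
        x = st_layer(x, identity(n), [0.0] * n, tau)
    return x


signal = [2.0, -0.05, 0.5, -1.5]           # hypothetical input
output = st_fcn(signal, depth=3, tau=0.1)  # hypothetical depth and threshold
```

With this initialization each layer only shrinks its input by the threshold, so the forward signal neither grows nor mixes across coordinates. This is consistent with the stable convergence reported in the abstract, although the sketch does not train the network.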

Abstract:
This article presents an accelerated distributed optimization algorithm for online optimization problems over large-scale networks. Each iteration of the proposed algorithm relies only on local computation and communication. To adapt effectively to dynamic changes and achieve a fast convergence rate while maintaining good convergence performance, we design a new algorithm called NGTAdam. This algorithm combines the Nesterov acceleration technique with an adaptive moment estimation method. The convergence of NGTAdam is assessed by analyzing its dynamic regret through the use of a linear system inequality. For online convex optimization problems, we provide an upper bound on the dynamic regret of NGTAdam, which depends on the initial conditions and the time-varying nature of the optimization problem. Moreover, we show that if the time-varying part of this upper bound is sublinear with time, the dynamic regret is also sublinear. Through a variety of numerical experiments, we demonstrate that NGTAdam outperforms state-of-the-art distributed online optimization algorithms.

Abstract:
In industrial scenarios, semantic segmentation of surface defects is vital for identifying, localizing, and delineating defects. However, new defect types constantly emerge with product iterations or process updates. Existing defect segmentation models lack incremental learning capabilities, and direct fine-tuning (FT) often leads to catastrophic forgetting. Furthermore, low contrast between defects and background, as well as among defect classes, exacerbates this issue. To address these challenges, we introduce a plug-and-play Transformer-based semantic complement module (TSCM). With only a few added parameters, it injects global contextual features from multi-head self-attention into shallow convolutional neural network (CNN) feature maps, compensating for convolutional receptive-field limits and fusing global and local information for better segmentation. For incremental updates, we propose multi-scale spatial pooling distillation (MSPD), which uses pseudo-labeling and multi-scale pooling to preserve both short- and long-range spatial relations and provides smooth feature alignment between teacher and student. Additionally, we adopt an adaptive weight fusion (AWF) strategy with a dynamic threshold that assigns higher weights to parameters with larger updates, achieving an optimal balance between stability and plasticity. The experimental results on two industrial surface defect datasets demonstrate that our method outperforms existing approaches in various incremental segmentation scenarios.

Abstract:
Graph unlearning has emerged as a pivotal method to delete information from an already trained graph neural network (GNN). One may delete nodes, a class of nodes, edges, or a class of edges. An unlearning method enables the GNN model to comply with data protection regulations (i.e., the right to be forgotten), adapt to evolving data distributions, and reduce the GPU-hours carbon footprint by avoiding repetitive retraining. Removing specific graph elements from graph data is challenging due to the inherent intricate relationships and neighborhood dependencies. Existing partitioning and aggregation-based methods have limitations due to their poor handling of local graph dependencies and additional overhead costs. Our work takes a novel approach to these challenges in graph unlearning through knowledge distillation, termed distill to delete in GNN (D2DGN). It is an efficient model-agnostic distillation framework where the complete graph knowledge is divided and marked for retention and deletion. It performs distillation with response-based soft targets and feature-based node embeddings while minimizing KL-divergence. The unlearned model effectively removes the influence of the deleted graph elements while preserving knowledge about the retained graph elements. D2DGN surpasses the performance of existing methods when evaluated on various real-world graph datasets by up to 43.1% (AUC) in edge and node unlearning tasks. Other notable advantages include better efficiency, better performance in removing target elements, preservation of performance for the retained elements, and zero overhead costs. Source code: https://github.com/MachineUnlearn/D2DGN
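
The response-based part of such a distillation objective can be sketched as a KL-divergence between temperature-softened teacher and student outputs. This is a generic illustration, not the D2DGN implementation: the temperature, the logits, and the function names are assumptions, and the repository linked above should be consulted for the actual loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL-divergence between softened teacher and student predictions."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    return kl_divergence(soft_teacher, soft_student)


# Hypothetical logits for a single node with three classes.
teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, teacher)             # student agrees
mismatched = distillation_loss(teacher, [0.0, 0.0, 0.0])  # student is uniform
```

The loss is zero when the student reproduces the teacher and grows as their predictions diverge. A retain/delete scheme could apply it against different teachers for the two partitions, but that split is not modeled here.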

Abstract:
The sketch-less facial image retrieval (SLFIR) framework efficiently retrieves target images with minimal strokes through human–computer interaction, thus overcoming the traditional model’s reliance on high-quality sketch images. However, the variability in sketching styles and the randomness of stroke placement during the drawing process pose challenges in matching target images. To address this issue, we propose a feature-driven foundation model for sketch-less facial image retrieval (FDSRM), which is designed to be independent of the sketch style and comprises two core components: the feature observer and the adaptive fusion adapter (AFA). First, to address the diversity of sketch styles, we design the feature observer module (FOM). It employs multiple experts focused on extracting key features and semantic information common to various sketch styles and the target image. This helps the model to precisely identify crucial features for effective matching in stylistically diverse sketches. Second, to address the randomness of stroke placement, we introduce prior knowledge of sketching and, in conjunction with the AFA component, dynamically learn and adjust the fusion strategy of sketches and text based on the current state of sketch strokes. This enables more accurate and targeted feature fusion throughout the sketching process. Furthermore, we train a facial image–text alignment pretraining (FAIP) model on a large-scale facial dataset and use it as the backbone of FDSRM, which significantly improves the model’s robustness to unknown facial features. Extensive experiments demonstrate that our method exhibits significant advantages in terms of accuracy in early retrieval and system generalization capabilities. Even without additional auxiliary information, it outperforms state-of-the-art methods in both qualitative and quantitative measures in multistyle application scenarios.

Abstract:
We propose a novel point cloud U-Net diffusion architecture for 3-D generative modeling capable of generating high-quality and diverse 3-D shapes while maintaining fast generation times. Our network employs a dual-branch architecture, combining the high-resolution representations of points with the computational efficiency of sparse voxels. Our fastest variant outperforms all nondiffusion generative approaches on unconditional shape generation, the most popular benchmark for evaluating point cloud generative models, while our largest model achieves state-of-the-art results among diffusion methods, with a runtime approximately 70% of that of the previous state-of-the-art point-voxel diffusion (PVD), measured on the same hardware setting. Beyond unconditional generation, we perform extensive evaluations, including conditional generation on all categories of ShapeNet, demonstrating the scalability of our model to larger datasets, and implicit generation, which allows our network to produce high-quality point clouds in fewer timesteps, further decreasing the generation time. Finally, we evaluate the architecture’s performance in point cloud completion and super-resolution. Our model excels in all tasks, establishing it as a state-of-the-art diffusion U-Net for point cloud generative modeling. The code is publicly available at https://github.com/JohnRomanelis/SPVD

Abstract:
Spiking federated learning (FL) is an emerging distributed learning paradigm that allows resource-constrained devices to train collaboratively at low power consumption without exchanging local data. It takes advantage of both the privacy computation property in FL and the energy efficiency in spiking neural networks (SNNs). However, existing spiking FL methods employ a random selection approach for client aggregation, assuming unbiased client participation. This neglect of statistical heterogeneity significantly affects the convergence and precision of the global model. In this work, we propose a credit assignment-based active client selection strategy for spiking federated learning, the SFedCA, to judiciously aggregate clients that contribute to the balance of the global sample distribution. Specifically, the client credits are assigned by the firing intensity state before and after local model training, which reflects the difference in local data distribution from the global model. Comprehensive experiments are conducted on various non-independent and identically distributed (non-IID) scenarios. The experimental results demonstrate that the SFedCA outperforms the existing state-of-the-art spiking FL methods and requires fewer communication rounds.

Abstract:
In this article, we prove that a general family of spurious local minima exist in the loss landscape of deep convolutional neural networks (CNNs) with strictly convex loss functions and ReLU activations. Our construction of spurious local minima is general and applies to CNNs with arbitrary architectures. We construct a local minimum $\theta$ at first, and then construct another point $\theta'$ in parameter space with the same empirical risk as $\theta$. Data samples are split into some groups such that each group behaves differently under the perturbation around $\theta'$ to produce a lower empirical risk. We tackle the challenges caused by convolutional layers in the construction. We show that a differentiation of data samples is always possible somewhere in the feature maps, and despite network parameters being tied in each feature map, our perturbation scheme only affects the output of a single or a few neurons for a group of data samples. We then give an example of nontrivial spurious local minimum in which multiple activation patterns are explicitly constructed. Finally, based on our construction of spurious local minima, we design a deterministic optimization method to escape local minima that is applicable to CNNs, ResNets, MLPs, and transformers. Experimental results on CIFAR-10, CIFAR-100, and ImageNet-1k datasets verify our theoretical findings and show that our optimization method outperforms SGD or Adam in accuracy (by 0.27% on average) consistently on all these architectures and datasets.

Abstract:
Current single-object tracking algorithms depend on the information supplied by the template to identify and locate the object within the search area. However, environmental complexities and unknown factors can alter the object’s state, causing mismatches in template information. The existing works using the template update mechanism (TUM) and multiple template feature fusion have the following problems: 1) TUM is affected by input superposition, making it hard to eliminate noise; 2) they suffer a temporal lag in their responsiveness to changes that occur in the object during the tracking process; 3) it is insufficient to rely solely on visual features within the search area of the current frame to improve the template; and 4) the prior knowledge regarding the input is not fully leveraged to learn the time-variant state of the object. We observe that in complex tracking scenarios, humans subconsciously analyze the evolutionary patterns of the object and its surroundings and integrate this information with the object’s initial impression, thereby maintaining an awareness of the object’s temporal state. Motivated by this, we propose a novel solution to the above problem, named MtvTrack, which can model the time-variant state of the object through the dynamic evolution pattern and static initial impression. Simultaneously, we propose a method for predicting the evolution pattern of scenes by utilizing past, present, and future (PPF) states. This approach effectively eliminates the information redundancy between consecutive frames and addresses the issue of delayed predictions of the target state in relation to changes within the search area. We establish a joint probability generative model and fully utilize prior knowledge to learn the time-variant state of the object. In addition, we develop a vector quantized-PPF (VQPPF) module for predicting the object’s time-variant state. Experimental results on public benchmarks confirm the superior performance of our method. 
Source code is available at: https://github.com/long-wa/MtvTrack-main

Abstract:
Employing graph neural networks (GNNs) for graph clustering has shown promising results in deep graph clustering (DGC). However, existing methods disregard the reciprocal relationship between representation learning and structure augmentation: the more homogeneous the graph, the more cohesive the node representations; the more cohesive the node representations, the more reliable the structure augmentation becomes. Moreover, the generalization ability of existing GNN-based models on the low homophily graph is relatively poor. To this end, we propose a graph clustering framework named synergistic deep graph clustering network (SynC). SynC employs a transform input graph autoencoder (TIGAE) to obtain high-quality embeddings via mitigating the representation collapse issue of GAE for guiding structure augmentation. Then, we recapture neighborhood representations on the refined graph to obtain clustering-friendly embeddings and conduct self-supervised clustering. Notably, these two stages share weights, resulting in synergistic boosting while significantly reducing the number of model parameters. Additionally, we introduce a structure fine-tuning (SF) strategy to improve the model’s generalization on the low homophily graph. Extensive experiments on benchmark datasets demonstrate the superiority of SynC. The code is released at https://github.com/Marigoldwu/SynC.

Abstract:
Traffic flow prediction is fundamental to traffic information services, control, and guidance. The challenge to accurately model the traffic flow is to comprehensively capture dynamic, global spatial, and local temporal correlations. To address challenges in spatiotemporal dependencies, global similarity, and local dynamics, we propose a multilayer spatiotemporal correlation-aware graph attention network (MSTC-GAT) for traffic flow prediction. Our model contains a multilayer spatial structure-aware module [spatial graph attention network (S-GAT)] using a spatial GAT with hierarchical attention masks and a path-based node correlation matrix to effectively capture local and global spatial dependencies. The temporal structure-aware module [temporal graph attention networks (T-GATs)] constructs a short-term similarity matrix of nodes for the temporal GAT to capture local dynamic temporal dependencies. Finally, a spatiotemporal Transformer (ST-Transformer) fuses weighted spatiotemporal node embeddings to capture global dynamic dependencies for accurate prediction. We conduct extensive experiments on four public benchmark datasets compared with 10 state-of-the-art models. The experimental results demonstrate that the MSTC-GAT outperforms all comparisons for short- and long-term predictions.

Abstract:
Multirate industrial processes pose significant challenges for accurate forecasting due to varying sampling frequencies and missing data. This article proposes a novel hybrid deep learning framework that effectively addresses these issues. Our approach uses a combination of time series decomposition, inverted transformer (iTransformer)-based feature extraction, and a modified minimal gated unit (MGU) network. To handle missing quality variables, we introduce a robust adaptive parameter update algorithm based on dead-zone Kalman filtering. In extensive experiments on real-world industrial datasets, our method achieves a mean absolute error (MAE) reduction of 61.42%, a root-mean-square error (RMSE) reduction of 64.11%, and a high qualification rate improvement of 14.73% compared to the average performance of state-of-the-art techniques, indicating gains in both forecasting accuracy and robustness.

Abstract:
We show how a pair of points can uniquely represent a left-infinite sequence obtained from observations of an underlying dynamical system through a phenomenon called causal embedding. A driven dynamical system creates such pairs, and a function can be learned on them that can reconstruct the underlying dynamics as in Takens delay embedding. Unlike Takens delay embedding, the approach assures embedding stability. It also assures learnability, which can be absent from the current reservoir computing framework even when stability is present. The approach accurately models underlying systems where recent methods such as next-generation reservoir computing fail. We demonstrate results and compare with other methods, including SINDY-PI.

Abstract:
State estimation is an important property in the operation of logical dynamic systems, and how to analyze and synthesize this property when the estimation conditions are not satisfied is the focus of the current research. This work addresses the synthesis problem of detectability with probability 1 for probabilistic Boolean networks (PBNs) via flip control and optimal-flip-based segmented reinforcement learning (OFSRL). First, in the framework of the semitensor product (STP), the PBN is transformed into an algebraic form to serve as the structure matrix for OFSRL. Depending on the attractor condition in the state pair, the analysis of detectability synthesis is divided into four different cases, and in each case, flip control is applied to the structure matrix and output matrix as actions in the OFSRL framework. Second, the detectability synthesis problem of the PBN is transformed into a set stabilization problem, and to reduce computational complexity, necessary and sufficient conditions based on the reachable set are introduced as criteria for implementing OFSRL. Furthermore, based on the above structure matrix, actions, and criteria, OFSRL is proposed to address the detectability synthesis problem of PBNs and obtain the optimal flipping sequence. Finally, two numerical simulations are conducted to verify the reliability of the proposed conclusions, and comparisons are made between OFSRL, traditional theoretical methods, and conventional reinforcement learning (RL) algorithms to highlight the advantages of OFSRL.

Abstract:
Industrial multiprocess collaborative optimization presents significant challenges due to the intricate spatiotemporal dependencies inherent in modern process industries. Traditional optimization and reinforcement learning often treat subprocesses as independent entities, neglecting the fine-grained interdependencies among operational variables across different subprocesses. To address this limitation, we introduce a novel spatiotemporal topology-informed multiprocess collaborative optimization (STI-MCO) framework, which models interdependencies at the action level through a spatiotemporal graph architecture. Rather than treating subprocesses as monolithic entities, STI-MCO operates at the operational variable level, enabling precise representation of both interprocess relationships and intraprocess dependencies through a hierarchical two-stage decision framework. This approach enables more precise coordination through fine-grained variable interactions, better temporal consistency via dynamic graph structures, and enhanced scalability compared with conventional agent-level methods. Extensive simulations and experiments across three benchmark environments with progressively complex topologies demonstrate that STI-MCO consistently outperforms baseline methods, achieving up to 38.9% improvement over centralized methods and 171.9% improvement over existing multiagent strategies. In addition, STI-MCO exhibits superior convergence efficiency, requiring significantly fewer training steps to achieve high performance. Its practical applicability is further validated through deployment in a real-world Salt Lake chemical process.
By fundamentally shifting the optimization paradigm from holistic subprocess control to fine-grained variable-level collaboration, this work establishes a new framework for more effective optimization in complex industrial processes, particularly those with strong interunit coupling.

Abstract:
The complex interactions between flexible needles and tissues present significant challenges in predicting the needle shape during the puncture procedure. In particular, the accurate prediction of flexible needle shape during insertion into complex multilayer tissues, especially when measurement feedback involves non-Gaussian noise, remains an open problem. In this article, we develop a novel reinforcement learning-based active modeling scheme to predict the deflection of the robotic flexible needle. First, the active modeling scheme is constructed by deriving an extended Kalman filter under the maximum correntropy criterion to enhance insensitivity to non-Gaussian noise. Subsequently, based on this scheme, the reinforcement active modeling (RAM) framework is built by incorporating reinforcement learning to compensate for the modeling residuals. Specifically, the theoretical convergence of the proposed scheme is proved by using the Banach fixed-point theorem, thereby ensuring the reliability of needle shape prediction. Finally, a series of comparative experiments is carried out on a self-built robotic flexible needle. The experimental results demonstrate the superior performance of the proposed deflection predictor. Under non-Gaussian noise conditions, the proposed RAM scheme achieves a generalization prediction error reduction of 46.4% in RMSE and over 76.1% in Var during insertion into unknown multilayer tissue.

Abstract:
Various Riemannian optimization tasks, such as Riemannian metaoptimization (RMO) and Riemannian metalearning, can be formulated as Riemannian bilevel optimization problems (i.e., the inner-level and outer-level optimization). Implicit differentiation has shown effectiveness in solving RMO, which decouples the computation of outer gradients from the inner-level process, avoiding huge computational burdens. However, extending implicit differentiation to other Riemannian bilevel optimization tasks is nontrivial because it requires much expert involvement for case-by-case derivations. In this article, we propose a Riemannian implicit differentiation method that provides a unified expression for outer gradients, leading to flexible application to other tasks with less expert involvement. Specifically, we formulate the inner-level optimization as a root-finding process of a fixed-point equation, through which the inner-level optimization among different tasks is formulated in a unified way. By differentiating the fixed-point equation, we derive a unified expression for outer gradients, circumventing the case-by-case derivations for different tasks. Then, we present convergence analysis and approximation error analysis, which guarantee the effectiveness of our method in various Riemannian optimization tasks. We further conduct experiments on multiple Riemannian optimization tasks, and the experimental results confirm the effectiveness.

Abstract:
Effectively solving multimodal multiobjective optimization problems (MMOPs) requires maintaining an optimal balance between the diversity and the convergence. Traditional algorithms often struggle with environmental selection adaptability, leading to suboptimal performance across diverse MMOPs. This article innovatively integrates the actor–critic reinforcement learning (RL) with the evolutionary algorithm, significantly improving environmental selection adaptability through synergistic online learning between its actor and critic components. An RL process that dynamically optimizes the niche size is established, which critically determines the trade-off between the diversity and convergence preferences. The process is formulated by defining: 1) convergence and diversity measures as the state; 2) niche size adjustment as the continuous action; and 3) state improvement as the reward. Two specialized fully connected neural networks are employed as the actor network for action generation and the critic network for value estimation, which collaboratively adapt to population states through real-time online learning. This adaptive niching technology, integrated with local convergence quality assessment, enables comprehensive evaluation of both diversity and potential convergence. Extensive experimental validation demonstrates the superior performance of the proposed algorithm against ten state-of-the-art algorithms across 48 benchmark problems and a real-world application. The results consistently show significant improvements in balance maintenance and overall optimization effectiveness compared to existing algorithms.

Abstract:
Sparse-view computed tomography (SVCT) is an advancement in computed tomography (CT) technology that aims to reduce the radiation dose during imaging. Reconstructing high-quality images from sparse-view (SV) projections is an ill-posed inverse problem. Recently, implicit neural representations (INRs) as a self-supervised paradigm for solving underdetermined inverse problems have demonstrated excellent performance in SVCT reconstruction. However, since INR-based approaches rely on subject-specific training, they require a significant investment of time to optimize from scratch. Consequently, previous INR methods have not been able to meet the requisite timeliness of reconstruction. In our work, we propose RTSyner, an algorithm-hardware collaboration framework that facilitates the real-time efficiency of CT reconstruction. On the algorithmic side, we introduce an efficient coordinate-based feature module that exploits the local latent features as a positional external condition, leveraging the limited structural information of corrupted images derived from the sensory domain. By fusing latent features and coordinate information, the model learns a neural representation of the final tomographic image. On the hardware side, we design a dedicated hardware architecture with a customized algorithm flow to improve reconstruction speed and reduce power consumption. Furthermore, we improve the efficiency of model inference through model quantization, which also facilitates the subsequent deployment of hardware. Our extensive experimental results demonstrate that the RTSyner based on neural representation has achieved real-time SVCT reconstruction through the synergistic acceleration of the algorithm and hardware. We further explore its application potential via volume reconstructions under more complex acquisition geometries.

Abstract:
This article addresses the challenge of domain adaptation on graphs, a specialized form of graph transfer learning (GTL), which involves adapting a graph model trained on source graphs to unlabeled target graphs that significantly differ in distribution. Traditional methods often rely heavily on the source graph to transfer learned task knowledge, but certain situations may render the source graph unavailable or restricted due to privacy or security concerns, thus impeding the usability and flexibility of graph model adaptation. Therefore, this article studies the problem of source-free domain adaptation (SFDA) in graph transfer learning (GTL). Our objective is to adapt a pretrained model to effectively operate on the target graph without the need to access the source graph. To achieve this, we first incorporate a weighted information maximization loss to enhance the model’s discriminative ability on the target graph, where we introduce the concept of posterior integrities of target nodes to assess their optimization confidence. Then, we estimate the distributions of the source graph and generate synthesized source nodes. We propose a reconstruction decoder to enhance the authenticity of the synthesized nodes and use adversarial learning to align the distributions between graphs, leading to improved adaptation of the model. Finally, extensive experimental results on a range of publicly accessible datasets demonstrate the superior performance of our method over the state of the art.

Abstract:
Deinterleaving radar signals is a significant task in contemporary electronic warfare reconnaissance. However, as adaptive waveforms and multifunctional radar technologies evolve, the dynamic range of signal parameters broadens and modulation styles diversify, resulting in critical performance bottlenecks for single-feature deinterleaving algorithms. Therefore, this article explores a novel intelligent deinterleaving paradigm that fuses multidimensional signal properties. A radar signal deinterleaving method based on the multiscale attention mechanism, multiscale attention deinterleaving (MSAD), is proposed to address three issues: how to depict the multidomain coupling characteristics of radar signals, how to model the contribution differences of features at different scales, and how to improve the generalization ability of algorithms for complex modulated signals. The approach first expands the dimensionality of the pulse description word (PDW) data, which is then converted using a Gramian angular field (GAF) into a pulse description graph (PDG). This allows for the joint graphical description of multidimensional characteristics that include time, frequency, space, and energy. Next, a Laplacian pyramid multiscale feature extraction framework is built, and deep convolutional networks (DCNs) are employed to hierarchically capture signal patterns at various granularities. Last, a physically interpretable deinterleaving decision is created by dynamically fusing the feature weights of each scale using the attention mechanism (AM). Experiments show that the MSAD method outperforms current approaches [bidirectional long short-term memory (BLSTM), bidirectional gated recurrent unit (BGRU), DCN, sequential difference histogram (SDIF), and pulse repetition interval transform (PRI-Tran)] in deinterleaving by taking advantage of the enhancement effect of multiscale image representations on electromagnetic signal features and the dynamic weight assignment by the AM. In addition, in multifunctional radar and jittered PRI deinterleaving, the MSAD approach demonstrates competitive performance gains.
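
The GAF step mentioned in this abstract can be illustrated with a minimal sketch. It shows only the standard Gramian angular summation field for a single one-dimensional feature sequence; the function name `gramian_angular_field` and the sample values are assumptions for illustration, and the abstract does not specify how the expanded PDW dimensions are combined into a PDG.

```python
import numpy as np

def gramian_angular_field(series):
    """Standard Gramian angular summation field of a 1-D sequence.

    Illustrative only: the paper's exact PDW-to-PDG conversion is not
    described in the abstract.
    """
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant sequence: map every sample to 0 to avoid division by zero.
        scaled = np.zeros_like(x)
    else:
        # Rescale to [-1, 1] so that arccos is defined.
        scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    scaled = np.clip(scaled, -1.0, 1.0)
    phi = np.arccos(scaled)  # polar-angle encoding of each sample
    # Entry (i, j) is cos(phi_i + phi_j).
    return np.cos(phi[:, None] + phi[None, :])

# Hypothetical pulse-interval values, not data from the paper.
pri = [1.0, 1.2, 0.9, 1.5, 1.1]
gaf = gramian_angular_field(pri)
print(gaf.shape)
```

The result is a symmetric image whose size equals the sequence length squared, which is what makes a convolutional feature extractor applicable to pulse sequences.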

Abstract:
Because it increases the gap between the optimal action and its competitors, the advantage learning (AL) operator is more robust to estimation errors in the approximated Q-functions than the Bellman optimality operator in reinforcement learning (RL). However, our analysis reveals that its robustness and larger action gaps come at the cost of a worse performance loss bound, leading to slower convergence of value functions. To address this issue, we present a novel method, named Occam’s Razor-based AL (ORAL), which follows Occam’s Razor principle and takes necessity into consideration when increasing the action gap. Specifically, ORAL can adaptively increase the action gap for different state–action pairs, depending on the proximity of their Q values to the optimal ones. We first propose a naive implementation of ORAL, employing a nonsmooth clipping function to realize the above idea, and then introduce a smooth version of ORAL aimed at achieving more stable learning. Furthermore, our methods can be easily plugged into other AL-based operators and extended to more complex continuous-control tasks. Theoretical analysis supports the feasibility of our approaches, demonstrating their ability to balance gap increasing with fast convergence. Empirical results further validate their effectiveness, showing significant performance improvements across multiple benchmarks.
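
The gap-increasing effect of the standard AL operator can be sketched on a tiny tabular problem. The sketch below uses the commonly cited form, which subtracts `alpha * (max_a' Q(s, a') - Q(s, a))` from the Bellman optimality backup; it does not implement ORAL's clipping or smooth variants, which the abstract does not specify. The two-state MDP and all numbers are assumptions for illustration.

```python
import numpy as np

def bellman_optimality_backup(Q, R, P, gamma):
    """Q, R have shape (S, A); P has shape (S, A, S)."""
    V = Q.max(axis=1)            # greedy state values
    return R + gamma * (P @ V)   # expected next-state value per (s, a)

def advantage_learning_backup(Q, R, P, gamma, alpha):
    """Standard AL backup: Bellman backup minus alpha times the advantage gap."""
    V = Q.max(axis=1, keepdims=True)
    return bellman_optimality_backup(Q, R, P, gamma) - alpha * (V - Q)

def action_gap(Q):
    """Difference between the best and second-best value in each state."""
    s = np.sort(Q, axis=1)
    return s[:, -1] - s[:, -2]

# Hypothetical 2-state, 2-action deterministic MDP: action a leads to state a.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[1.0, 0.5], [0.2, 1.0]])

tb = bellman_optimality_backup(Q, R, P, gamma=0.9)
al = advantage_learning_backup(Q, R, P, gamma=0.9, alpha=0.5)
print(action_gap(tb), action_gap(al))
```

The AL backup leaves the value of the greedy action unchanged and lowers the others, so the action gap after one backup is at least as large as under the Bellman optimality operator.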

Abstract:
This article develops inverse reinforcement learning (IRL) control algorithms for nonlinear networked control systems (NCSs) to mimic trajectories of a target system governed by an unknown optimal cost function, despite the presence of random data dropouts and external disturbances. Data dropouts occur during: 1) reception of target trajectory data by the controller; 2) reception of state feedback data by the controller; and 3) reception of control input data by the actuator. By organically integrating H_\infty control to account for disturbances and dropout-induced uncertainty, a model-based IRL algorithm is first developed. Building on this, a neural-network-based data-driven IRL algorithm is developed to infer the cost function and optimal control policy using available data while reducing dependence on system models. The proposed methods enable effective trajectory imitation under partial model knowledge, data dropouts, and disturbances, as demonstrated through simulation studies.

Abstract:
The recent rise of semantic-style communications has fostered the development of goal-oriented communications (GO-COMs), facilitating remarkably efficient multimedia information transmissions. The concept of GO-COMs leverages advanced artificial intelligence (AI) tools to address the rising demand for bandwidth efficiency in applications, such as edge computing and the Internet of Things (IoT). Unlike traditional communication systems focusing on source data accuracy, GO-COMs provide intelligent message delivery catering to the special needs critical to accomplishing downstream tasks at the receiver. In this work, we present a novel GO-COM framework, namely LaMI-GO, that utilizes emerging generative AI for better quality of service (QoS) with ultrahigh communication efficiency. Specifically, we design our LaMI-GO system backbone based on a latent diffusion model followed by a vector-quantized generative adversarial network (VQGAN) for efficient latent embedding and information representation. The system trains a common-feature codebook for the receiver side. Our experimental results demonstrate substantial improvement in perceptual quality, accuracy of downstream tasks, and bandwidth consumption over the state-of-the-art GO-COM systems and establish the power of our proposed LaMI-GO communication framework.

Abstract:
Accurate and fast detection of traffic signs is critical for autonomous driving, particularly in complex environments with diverse sign scales and varying detection distances. Existing approaches, incorporating attention modules or modifying detection heads, frequently encounter high rates of false positives and omissions due to the increased sampling depth. To address these limitations, we propose MDSF-you only look once (YOLO), a novel detection framework that integrates multiscale sequence fusion (MSF) for synergistic feature integration across granularities, enhancing the precision of both localization and semantic information fusion. Additionally, our dilated-wise residual (DWR) module leverages dilated convolutions and channel-wise reparameterization to improve fine-grained feature extraction. The architecture further introduces a P_2 detection head for shallow features and fully decouples all detection heads, optimizing target localization and category identification. Extensive experiments on the TT100K and CCTSDB2021 datasets demonstrate the superiority of MDSF-YOLO over benchmark models, including YOLOv11s, with significant improvements in mAP by 8.8% and 2.4% on respective datasets while substantially reducing false positives and leakage rate. Besides, the marked improvement of MDSF-YOLO on the VisDrone2019 dataset verifies its enhanced capability to address drone-based object detection. These advances underscore the efficiency and robustness of the proposed model, providing a promising solution for autonomous driving and similar object detection scenarios.

Abstract:
Multitask learning with a pretext task has excelled in time-series classification tasks that lack labeled data. The key to multitask learning is to build a pretext task and learn the most representative features from the raw time series. In this article, we propose trend and order features for semi-supervised time-series classification via multitask learning (TOFL). Specifically, we propose a simple but effective pretext task, self-sequence order prediction (SOP), to discover the order relation. In addition, we design a gradual trend fusion (GTF) block that concatenates different trend features and serves as the basic element of the shared backbone network, yielding high-quality trend features for the SOP task. Finally, we not only theoretically analyze the uniform stability and generalization error of TOFL but also compare its results with state-of-the-art (SOTA) supervised and semi-supervised methods on the 128 UCR datasets and three real-world datasets. TOFL is highly competitive and, in most cases, closely matches or even surpasses SOTA methods in terms of accuracy. The source code and data of TOFL are freely available at: https://github.com/Sample-design-alt/TOFL

Abstract:
Source-free unsupervised domain adaptation (SFUDA) aims to improve performance in unlabeled target domain data without accessing source domain data. This is crucial in scenarios with data-sharing restrictions due to privacy or compliance constraints. Existing SFUDA approaches often rely on pseudo-labeling techniques based on entropy or confidence metrics. These often overlook fine-grained data features, resulting in noisy pseudo-labels that degrade model performance. To overcome this limitation, we develop a new method called fine-grained pseudo-labeling and feature alignment (FGPLFA) to enhance SFUDA’s performance. FGPLFA starts with a gradient-based metric that integrates insights from both model knowledge and data features, creating a more reliable sample metric. To enhance fine granularity, the fine-grained pseudo-labeling (FGPL) module was introduced. This module clusters data based on the magnitude and direction of gradients, allowing for dataset partitioning into subsets at the sample level. The subsets are pseudo-labeled with category-specificity and domain specificity, establishing a multilevel granularity structure that reduces noisy pseudo-labels. Subsequently, the mean-covariance adjustment feature alignment (MCAFA) method was introduced. Features from the subsets are aligned in a specified sequence, enhancing model adaptability in the target domain. Extensive experiments conducted across multiple datasets validate the superiority of FGPLFA.

Abstract:
Few-shot learning has garnered increasing attention in hyperspectral image classification (HSIC) due to its potential to reduce dependency on labor-intensive and costly labeled data. However, most existing methods are constrained to feature extraction using a single image patch of fixed size, and typically neglect the pivotal role of the central pixel in feature fusion, leading to inefficient information utilization. In addition, the correlations among sample features have not been fully explored, thereby weakening feature expressiveness and hindering cross-domain knowledge transfer. To address these issues, we propose a novel few-shot HSIC framework incorporating dynamic fusion and hierarchical enhancement. Specifically, we first introduce a robust feature extraction module, which effectively combines the content concentration of small patches with the noise robustness of large patches, and further captures local spatial correlations through a central-pixel-guided dynamic pooling strategy. Such patch-to-pixel dynamic fusion enables a more comprehensive and robust extraction of ground object information. Then, we develop a support–query hierarchical enhancement module that integrates intraclass self-attention and interclass cross-attention mechanisms. This process not only enhances support-level and query-level feature representation but also facilitates the learning of more informative prior knowledge from the abundantly labeled source domain. Moreover, to further increase feature discriminability, we design an intraclass consistency loss and an interclass orthogonality loss, which collaboratively encourage intraclass samples to be closer together and interclass samples to be more separable in the metric space. Experimental results on four benchmark datasets demonstrate that our method substantially improves classification accuracy and consistently outperforms competing approaches. Code is available at https://github.com/guoying918/DFHE2025

Abstract:
In the era of information explosion, clustering analysis of graph-structured data and empty graph-structured data is of great significance for extracting the intrinsic value of data. From the perspective of spatial information, empty graph-structured data and graph-structured data are essentially the same type of data, both containing rich spatial information. However, there is currently no general clustering method that can handle both types of data, and the clustering methods applicable to empty graph-structured data pay little attention to the spatial information they contain. Meanwhile, graph convolutional neural networks (GCNs) have made significant progress in processing graph-structured data, but applying them to empty graph-structured data still faces challenges because the latter lacks an explicit topological structure. To address these problems, this study proposes a multigranularity deep GCN node clustering method leveraging spatial information (CMDGCN). It converts empty graph-structured data into graph-structured data using the k-nearest neighbor (k-NN) algorithm and constructs multigranularity graph structures based on feature segmentation to extend the network depth to deep layers, thereby addressing the issue of shallow network layers in traditional GCN models. In addition, this study improves the self-expressiveness principle, ensuring that the learned similarity matrix not only depends on the node embedding representation but also incorporates the original structural information of the graph, resulting in a high-quality and interpretable similarity matrix. Furthermore, in experiments on multiple graph-structured datasets and empty graph-structured datasets, our method outperforms existing methods on several key indicators, demonstrating its effectiveness and robustness. This achievement not only provides new methods and perspectives for graph node clustering but also offers new effective tools for processing empty graph-structured data.
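
The conversion of data without an explicit topology into a graph via k-NN, as described in this abstract, can be sketched as follows. This is a generic symmetric k-NN adjacency under Euclidean distance; the function name `knn_adjacency`, the distance metric, and the symmetrization rule are assumptions, since the abstract does not state which variant CMDGCN uses.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric k-NN adjacency matrix from feature rows X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)           # exclude self-matches
    idx = np.argsort(d2, axis=1)[:, :k]    # k nearest neighbours per node
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    # Keep an edge if either endpoint selected the other.
    return np.maximum(A, A.T)

# Hypothetical 2-D points forming two well-separated pairs.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
A = knn_adjacency(points, k=1)
print(A)
```

The resulting adjacency matrix can then be fed to any graph convolution layer in place of a given topology.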

Abstract:
Deep neural networks have achieved promising progress in signal modulation classification (SMC), playing an essential role in a variety of applications such as cognitive radio networks, cyber defense, and electronic surveillance. However, most existing SMC methods still follow the traditional machine learning paradigm that trains on static closed datasets, lacking the ability to cope with the challenge of continuous data distribution shifts in real communication scenarios. Directly applying the model to a new environment may lead to severe degradation of classification performance on previous scenarios, i.e., catastrophic forgetting. To address this, this article proposes the first domain-incremental learning (DIL) paradigm for SMC and designs a parameter-efficient isolation DIL (PID) method, which enables SMC models to rapidly adjust to new scenarios by extending only a few parameters, while significantly retaining classification capabilities on previous scenarios. Specifically, we first propose a parameter space decomposition-based classifier (PSD), separating the model parameters into a set of bases and corresponding coefficients. By freezing the bases and fine-tuning the low-dimensional coefficients, the catastrophic forgetting problem can be efficiently eliminated. Furthermore, we design a scene-aware domain controller (SDC) to select the most suitable domain-specific coefficients for each sample, thereby maintaining the SMC model’s classification capabilities across all domains. The extensive experimental results show the superiority of the proposed PID, which achieves state-of-the-art (SOTA) overall performance. The code will be available at: https://github.com/SMC-IL/PID
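
One plausible reading of the bases-and-coefficients decomposition in this abstract is a weight matrix formed as a linear combination of shared, frozen basis matrices, with one small coefficient vector per domain. The sketch below illustrates only that reading; the linear-combination form, the names `compose_weight`, `bases`, and `coeffs`, and all shapes are assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_bases, d_out, d_in = 4, 3, 5

# Shared bases: learned once, then frozen (assumed form).
bases = rng.standard_normal((num_bases, d_out, d_in))

# One low-dimensional coefficient vector per domain (hypothetical domain names).
coeffs = {"domain_a": rng.standard_normal(num_bases)}

def compose_weight(bases, c):
    """Weight matrix as sum_k c[k] * bases[k]."""
    return np.tensordot(c, bases, axes=1)

W_a = compose_weight(bases, coeffs["domain_a"])

# Adapting to a new domain adds only num_bases parameters and leaves the
# coefficients, and hence the weights, of earlier domains untouched.
coeffs["domain_b"] = rng.standard_normal(num_bases)
W_b = compose_weight(bases, coeffs["domain_b"])

print(W_a.shape, coeffs["domain_b"].size, W_b.size)
```

Under this reading, forgetting is avoided by construction because earlier domains keep their own coefficient vectors, and a separate selector (the SDC in the abstract) would choose which vector to use for each sample.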

Abstract:
Dynamic behaviors of the classical Kuramoto models have been widely studied. The dynamics of the all-to-all connected oscillator Ising machines (OIMs) is similar to that of the classical Kuramoto models, with the main difference being that there is an additional term in OIMs, called the second harmonic term. However, the dynamic behavior of an all-to-all connected OIM is significantly different and its intricate properties are largely unexplored. In this article, we study in detail the properties of the all-to-all connected OIMs and explore their application as associative memory. The number of patterns such an OIM can store increases exponentially with respect to the number of oscillators. To improve the performance of the OIMs for associative memory, we propose a new harmonic term so that the resulting OIM achieves pattern retrieval with high accuracy in the presence of a high level of noise.

Abstract:
Variational graph auto-encoders (VGAEs) are a key tool for node clustering, but existing models face several significant challenges. These challenges include a mismatch between inference and generative models after incorporating the clustering inductive bias, as well as posterior collapse (PC), where latent representations become overly influenced by the prior distribution. In addition, in existing VGAEs, noisy clustering assignments lead to the feature randomness (FR) challenge, while the strong tradeoff between clustering accuracy and reconstruction quality results in the feature drift (FD) problem. To address these issues, we propose a multiscale contrastive VGAE (MCVGAE). This multiscale model combines cluster-level and graph-level contrastive learning with proximity-level and cluster-level self-supervised methods. MCVGAE improves the alignment between the hidden space and the data distribution and prevents PC. Moreover, it reduces FR and FD more effectively than existing techniques. Achieving impressive accuracy scores of 79.09% on Cora, 90.04% on ACM, 75.12% on Pubmed, 72.7% on Citeseer, 74.11% on DBLP, and 59.79% on Wiki clearly demonstrates the superiority of MCVGAE over 30 state-of-the-art methods.

Abstract:
Deep learning has been widely applied in various domains. Widely used optimizers, such as SGD, Adam, and their variants, are designed on the assumption that the gradient noise generated during model training follows a Gaussian distribution. However, recent empirical studies have found that the gradient noise often does not follow a Gaussian distribution. Instead, the noise exhibits heavy-tailed characteristics consistent with an \alpha-stable distribution, casting doubt on the performance and robustness of optimizers designed under the assumption of Gaussian noise. Inspired by the least mean p-power (LMP) algorithm from the field of adaptive filtering, we propose a novel optimizer called Ape for deep learning. Ape integrates a p-power adjustment mechanism to compress large gradients and amplify small ones, mitigating the impact of heavy-tailed gradient distributions. It also employs an approach for estimating second moments tailored to \alpha-stable distributions. Extensive experiments on benchmark datasets demonstrate Ape’s effectiveness in improving both accuracy and training speed compared to existing optimizers. The Ape optimizer showcases the potential of cross-disciplinary approaches in advancing deep learning optimization techniques and lays the groundwork for future innovations in this domain.
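
The idea of compressing large gradients and amplifying small ones with a p-power can be illustrated by a sign-preserving transform with an exponent between 0 and 1. This is only a sketch of that general idea: the function name `p_power`, the choice p = 0.5, and the sample values are assumptions, and the actual Ape update rule and its second-moment estimate are not given in the abstract.

```python
import numpy as np

def p_power(grad, p=0.5):
    """Sign-preserving power transform sign(g) * |g|**p.

    For 0 < p < 1, magnitudes above 1 shrink and magnitudes below 1 grow.
    Illustrative only; not the Ape optimizer's actual update.
    """
    g = np.asarray(grad, dtype=float)
    return np.sign(g) * np.power(np.abs(g), p)

# Hypothetical gradient with heavy-tailed outliers.
g = np.array([-100.0, -0.01, 0.0, 0.01, 100.0])
adjusted = p_power(g, p=0.5)
print(adjusted)
```

With p = 0.5 the outliers of magnitude 100 are reduced to magnitude 10 while the entries of magnitude 0.01 grow to 0.1, which narrows the spread that a heavy-tailed noise distribution would otherwise impose on the step size.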

Abstract:
Multiobjective optimization problems (MOPs) arise in numerous real-world scenarios, yet finding their solutions with optimal trade-offs can be a formidable challenge. This article studies the continuous optimization problem involving large-scale variables, many objectives, and intricate constraints, which is rarely comprehensively discussed in existing works, due to the coexisting difficulties posed by the curse of dimensionality, selection pressure, and feasibility restrictions. To address these problems, this work pioneers a novel optimization framework, optimization pattern learning, embedded with machine learning (ML) techniques. Within this framework, the concept of measurable order and its corresponding learning mechanism are proposed to extract valuable knowledge from solutions. This measurable order is a general form of those orders used explicitly or implicitly in the existing studies, providing a more flexible means to evaluate solutions for efficient optimization adaptively. By substituting original solutions with their measurable orders, this framework effectively avoids the selection pressure from many objectives and the feasibility restrictions from intricate constraints. Furthermore, two novel ML models based on measurable orders are developed to progressively learn effective optimization patterns from iterative data in high-dimensional search spaces. Leveraging these learned patterns, this framework successfully addresses the curse of dimensionality from large-scale variables and thus achieves efficient optimization. Owing to the strong adaptability and search capabilities of this framework, it also demonstrates excellent scalability as the number of variables, objectives, and constraints increases. Extensive simulations validate the effectiveness of the framework and underscore its competitiveness relative to state-of-the-art algorithms in this field.

Abstract:
We propose a novel feature enhancement module designed for fine-grained visual classification tasks, which can be seamlessly integrated into various backbone architectures, including both convolutional neural network (CNN)-based and Transformer-based networks. The plug-and-play module outputs pixel-level feature maps and performs a weighted fusion of filtered features to enhance fine-grained feature representation. We introduce a class-centric loss function that optimizes the alignment of samples with their target class centers by pulling them toward the center of the target class while simultaneously pushing them away from the center of the most visually similar nontarget classes. Soft labels are employed to mitigate overfitting, ensuring the model generalizes well to unseen examples. Our approach consistently delivers significant improvements in accuracy across various mainstream backbone architectures, underscoring its versatility and robustness. Furthermore, we achieved the highest accuracy on the NABirds (NAB) and our proprietary lock cylinder datasets. We have released our source code and pretrained model on GitHub: https://github.com/Richard5413/FEM-CC.git

Abstract:
Atomic electron tomography (AET) is essential for characterizing the atomic structure of functional materials. However, raw 3-D tomograms often exhibit severe artifacts caused by geometric constraints and low radiation doses. Although point-attention-based ensemble augmentation methods effectively remove artifacts in simulated datasets with varying structure factors, they struggle with the complexity of real tomograms that demand multidomain feature learning. Moreover, existing models degrade in multidomain scenarios and incur high parameter counts introduced by point-attention mechanisms, which increase hardware demands. To address these challenges, we propose a sparse mixture of Mambas (MoMambas), a novel 3-D augmentation method that enhances domain generalization. MoMambas decouple domain-specific parameters by integrating a sparse mixture-of-experts (MoE) framework with Mamba-based experts, resolve positional ambiguity in sparse input sequences through positional information enhancement, and boost MoE accuracy via a multihead routing algorithm. Our approach achieves a 22% accuracy improvement over state-of-the-art AET augmentation methods in multidomain learning, reduces the parameter count to just 2.9% of the original, and lowers computational cost by 6%. Codes and data are publicly available at https://github.com/yuy38457/MoMambas

Abstract:
Recent advances have shown great promise in mining multimodal protein knowledge for better protein–protein interaction (PPI) prediction by enriching the representation of proteins. However, existing solutions lack a comprehensive consideration of both local patterns and global dependencies in proteins, hindering the full exploitation of modal information. Additionally, the inherent disparities between modalities are often disregarded, which may lead to inferior modality complementarity effects. To address these issues, we propose a cross-modality enhanced PPI prediction method from the perspectives of protein sequence and structure modalities, namely SSPPI. In this framework, our main contribution is that we integrate both sequence and structural modalities of proteins and employ an alignment and fusion method between modalities to further generate more comprehensive protein representations for PPI prediction. Specifically, we design two modal representation modules (Convformer and Graphormer) tailored for protein sequence and structure modalities, respectively, to enhance the quality of modal representation. Subsequently, we introduce a Cross-modality enhancer module to achieve alignment and fusion between modalities, thereby generating more informative modal joint representations. Finally, we devise a cross-protein fusion (CPF) module to model residue interaction processes between proteins, thereby enriching the joint representation of protein pairs. Extensive experimentation on four benchmark datasets demonstrates that our proposed model surpasses all current state-of-the-art (SOTA) methods. The source codes are publicly available at the following link https://github.com/bixiangpeng/SSPPI/

Abstract:
In this work, a novel PID-type adaptive iterative learning control (AILC) method is proposed for a class of nonlinear systems with unspecified control gain matrices and bounded iteration-varying uncertainties. Unlike traditional iterative learning control (ILC), which accumulates control information across iterations, the new PID-type AILC maintains convergence based on error information and confines iteration to parameter estimation, making it suitable for amplitude- or frequency-limited controllers. Different from the existing approaches of P-type AILC, this work extends ILC advances to PID-type AILC for nonlinear square or nonsquare systems with unknown control gain matrices, enhancing robustness through simultaneous convergence of integral and proportional error terms over a larger range. The analysis diverges from traditional approaches relying on contraction mappings or asymptotic stability theorems; error convergence is analyzed using inequalities of a composite energy function (CEF). The effectiveness of the proposed method has been validated through two illustrative examples. The results show that compared with P-type AILC, the convergence speed can be increased by approximately two to three times.

Abstract:
In the evolving field of machine learning, ensuring group fairness has become a critical concern, prompting the development of algorithms designed to mitigate bias in decision-making processes. Group fairness refers to the principle that a model’s decisions should be equitable across different groups defined by sensitive attributes such as gender or race, ensuring that individuals from privileged groups and unprivileged groups are treated fairly and receive similar outcomes. However, achieving fairness in the presence of group-specific concept drift remains an unexplored frontier, and our research represents pioneering efforts in this regard. Group-specific concept drift refers to situations where one group experiences concept drift over time, while another does not, leading to a decrease in fairness even if accuracy (ACC) remains fairly stable. Within the framework of federated learning (FL), where clients collaboratively train models, its distributed nature further amplifies these challenges since each client can experience group-specific concept drift independently while still sharing the same underlying concept, creating a complex and dynamic environment for maintaining fairness. The most significant contribution of our research is the formalization and introduction of the problem of group-specific concept drift and its distributed counterpart, shedding light on its critical importance in the field of fairness. In addition, leveraging insights from prior research, we adapt an existing distributed concept drift adaptation algorithm to tackle group-specific distributed concept drift, which uses a multimodel approach, a local group-specific drift detection mechanism, and continuous clustering of models over time. The findings from our experiments highlight the importance of addressing group-specific concept drift and its distributed counterpart to advance fairness in machine learning.

Abstract:
Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the action-region tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling similar actions to be distinguished effectively. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are then organized into action tracklets, which characterize the region-based action dynamics by linking related responses across different video frames in a coherent sequence. The text-constrained queries are designed to expressly encode nuanced semantic representations derived from the textual descriptions of action labels, as extracted by the language branches within vision language models (VLMs). To optimize generated action tracklets, we design a multilevel tracklet contrastive constraint among multiple region responses at spatial and temporal levels, which can effectively distinguish individual region responses in each video frame (spatial level) and establish the correlation of similar region responses between adjacent video frames (temporal level). In addition, we implement a task-specific fine-tuning mechanism to refine textual semantics during training. This ensures that the semantic representations encoded by VLMs are not only preserved but also optimized for specific task preferences. Comprehensive experiments on several widely used action recognition benchmarks, i.e., FineGym, Diving48, NTURGB-D, Kinetics, and Something-Something, clearly demonstrate the superiority of ART over previous state-of-the-art baselines.

Abstract:
This article proposes a novel model-based planning framework for freeway ramp metering (RM), denoted as Koopman-driven linearized model-based offline planning (KLMOP). This framework integrates the model predictive control (MPC) and offline reinforcement learning (RL) under assumptions of a linear Markov decision process (MDP) with the Koopman operator. KLMOP introduces a fully linearized control framework by learning and modeling the dynamics, reward function, and value function in a latent space through a Koopman-based latent dynamical model (KLDM) and a pessimistic value iteration (PEVI) algorithm. This formulation builds upon the connection between Koopman operator theory and linear MDP. Contrastive learning is employed to ensure the expressiveness and structural conditions of the latent representation in linear MDP, enabling accurate reward prediction and efficient policy optimization. The MPC-based planning policy, then, leverages these components to solve a linear MPC problem efficiently in the latent space. Extensive simulation studies demonstrate that KLMOP significantly improves computational efficiency and control performance as compared with existing baseline methods for RM control. This framework provides a theoretically grounded and computationally efficient approach to linearizing nonlinear control problems, and its learning-based design makes it adaptable to broader applications.

Abstract:
Advanced cognition can be measured from the human brain using brain–computer interfaces (BCIs). Integrating these interfaces with computer vision techniques, which possess efficient feature extraction capabilities, can achieve more robust and accurate detection of dim targets in aerial images. However, existing target detection methods primarily concentrate on homogeneous data, lacking efficient and versatile processing capabilities for heterogeneous multimodal data. In this article, we first build a brain–eye–computer-based object detection system for aerial images under few-shot conditions. This system detects suspicious targets using region proposal networks (RPNs), evokes the event-related potential (ERP) signal in electroencephalogram (EEG) through the eye-tracking-based slow serial visual presentation (ESSVP) paradigm, and constructs the EEG–image data pairs with eye movement data. Then, an adaptive modality balanced online knowledge distillation (AMBOKD) method is proposed to recognize dim objects with the EEG–image data. AMBOKD fuses EEG and image features using a multihead attention module, establishing a new modality with comprehensive features. To enhance the performance and robustness of the fusion modality, simultaneous training and mutual learning between modalities are enabled by end-to-end online KD (OKD). During the learning process, an adaptive modality balancing module is proposed to ensure multimodal equilibrium by dynamically adjusting the importance weights and the training gradients across the modalities. The effectiveness and superiority of our method are demonstrated by comparing it with existing state-of-the-art methods. Additionally, experiments conducted on public datasets and real-world scenarios demonstrate the reliability and practicality of the proposed system and the designed method. The dataset and the source code can be found at: https://github.com/lizixing23/AMBOKD

Abstract:
Clustering complex-shaped clusters is still challenging for most existing clustering algorithms. Herein, the peak-padding clustering algorithm (PeakPad), which clusters by padding density peaks with the minimum padding cost, is proposed. PeakPad executes clustering on the density surface and views complex-shaped clusters as combinations of highly associated single-peak clusters. The minimum padding cost, which fully considers the surrounding context of a density peak, is proposed to reflect a density peak’s center potential, enabling PeakPad to have robust center detection performance. Unlike mean-shift (MSC), which detects centers based on their attributes in a complex-shaped density surface embedded in the high-dimensional space of density and features, PeakPad detects centers in a standard-shaped surface embedded in the 2-D density-change (DC) density space (composed of density and DC feature). Such standardization allows PeakPad to have fast and robust cluster center detection performance on complex-shaped clusters based on the minimum padding cost. Besides, PeakPad can provide a reasonable evaluation of the association between single-peak clusters by using the minimum padding cost. As a result, PeakPad can quickly capture complex-shaped clusters, achieve robust center detection performance, and scale to large datasets. Benchmark test results on both synthetic and real datasets demonstrate the effectiveness of PeakPad.

Abstract:
Transfer reinforcement learning (TRL) aims to boost the efficiency of reinforcement learning (RL) agents by leveraging knowledge from related tasks. Prior research primarily focuses on intradomain transfer, overlooking the complexities of transferring knowledge across tasks with differing state and action spaces. Recent efforts in cross-domain TRL aim to bridge this gap by establishing mappings between disparate source and target spaces, thereby enabling knowledge transfer across RL tasks with varied state and action configurations. However, existing studies often rely on strict prior assumptions about the relationships between state spaces, which limits their practical generality. In this article, we propose a novel approach to cross-domain TRL based on seeded graph matching, which enables alignment between source and target tasks regardless of differences in their state–action spaces. In particular, we model RL tasks as directed graphs, identify seed node pairs based on common RL properties, and devise a graph matching algorithm to align the source and target tasks by leveraging their structural characteristics. Building on this alignment, we introduce a policy-based transfer algorithm that improves the performance of the target RL task as its RL process progresses. Finally, we conduct comprehensive empirical studies on both discrete and continuous tasks with diverse state–action spaces. The experimental results validate the effectiveness of the proposed algorithm.

Abstract:
In recent years, contrastive learning (CL) frameworks have been widely applied to multivariate time series classification (MTSC) tasks. However, existing methods lack task-specific guidance, leading to limitations in fully capturing the complex dynamics and invariant representations in time series data. Motivated by the auxiliary tasks in multitask learning (MTL) and to fully utilize the rich frequency-domain information of time series data, we propose a novel time series classification framework, uncertainty-based time–frequency supervised CL (U-TFSCL). This framework uses SCL in the time and frequency domains and time–frequency consistency as auxiliary tasks to improve the primary task of using only instance-level labels for time series classification. Furthermore, inspired by the homogeneous uncertainty in MTL, we derive a novel uncertainty loss function, which automatically adjusts the weights according to the degree of uncertainty of different tasks to optimize the learning and prediction process of the model. The proposed framework is evaluated on MTSC tasks, including human activity recognition (HAR), air writing, and gesture recognition. In addition, we create a human–drone interaction (HDI) dataset consisting of 20 subjects and conduct real-world experiments to evaluate the proposed framework. The extensive experiments conducted in various settings verify the effectiveness of the proposed framework.
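
To make the idea of weighting tasks by their uncertainty concrete, the sketch below implements a widely used uncertainty-weighting form from the multitask learning literature, in which each task loss is scaled by a learned inverse variance and a log-variance penalty discourages trivially large variances. This is not the novel loss derived in the article, whose exact form the abstract does not give. The function name and the log-variance parameterization are assumptions.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses as sum_i [ 0.5 * exp(-s_i) * L_i + 0.5 * s_i ].

    s_i = log(sigma_i^2) is a per-task parameter that would be learned jointly
    with the model. A large s_i down-weights an uncertain task but pays a
    penalty of 0.5 * s_i. This generic form is an illustrative assumption,
    not the loss proposed in the article.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance per task loss is required")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# With all log-variances at 0 the tasks are weighted equally (weight 0.5 each).
print(uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0]))  # 1.5
```

In a training loop the log-variances would be trainable parameters updated by the same optimizer as the network weights, so the balance between the primary and auxiliary tasks adapts during training.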

Abstract:
Neural operators, such as graph neural operators (GNOs) and Fourier neural operators (FNOs), directly learn the mapping from any functional parametric dependence to the solution and have achieved remarkable progress in solving partial differential equations (PDEs). GNOs exhibit excellent interpretability, as they construct graph models of physical fields to mine the mutual relationships between different nodes. However, existing methods are unable to mine the deep-level graph node features, which leads to insufficient information from neighboring nodes during the graph information aggregation process, ultimately resulting in a decline in solution accuracy. This article proposes a space–frequency cross-attention (CA) node feature optimization GNO (NFO-GNO) to address this issue. Considering the multiscale nature of the PDEs, we first construct a multiscale graph building module to obtain the PDEs information at different scales by processing graphs of different scales. After obtaining the multiscale graph models, we use a node feature optimization network (NFON) to extract and optimize the node features of the graphs in the spatial and frequency domains, and utilize CA to fuse them, thereby obtaining deep-level graph node features. Finally, we use a GNO to solve the optimized node features. NFO-GNO achieves superior solving performance compared to the baselines on four standard benchmarks covering both solid mechanics and fluid dynamics simulations. Notably, NFO-GNO maintains better performance with limited training samples and low-resolution training data, reducing data requirements and making it more adaptable to scenarios where high-quality datasets are difficult to obtain.

Abstract:
With the rapid advancement of large language models (LLMs) in both academia and industry, their growing size and complexity have introduced significant challenges in terms of computational cost and deployment efficiency. To address these issues, a wide range of inference optimization techniques—including but not limited to model compression—have been proposed to accelerate LLM inference while preserving model performance. This survey provides a comprehensive overview of LLM inference acceleration strategies, analyzing them from multiple perspectives, including foundational principles, algorithmic techniques, real-world applications, and open research challenges. We begin by introducing core concepts underlying inference optimization and propose a new taxonomy that categorizes existing approaches, including quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods. Following the lifecycle of LLM development and deployment, we examine how these techniques interact with model training, fine-tuning, and serving. Furthermore, we highlight key applications of efficient LLMs and discuss emerging trends and unresolved issues in the field. By synthesizing recent advances, this survey aims to provide actionable insights and practical guidance for researchers and practitioners working with scalable and efficient LLM systems.

Abstract:
Human action recognition (HAR), which aims to recognize and understand individual actions and intentions, has rapidly become a research hotspot in computer vision. Compared with other data modalities, skeleton data offers more efficient node semantics and more coherent spatio-temporal motion patterns, effectively reducing the impact of lighting and background changes. In recent years, many researchers have focused on skeleton-based action recognition methods and have made significant progress. However, we believe that the current skeleton-based action recognition methods still face three major challenges: 1) reducing reliance on expensive labeled data while maintaining model performance; 2) enabling the model to understand and recognize new behavior classes with a limited number of samples; and 3) addressing the challenges posed by the lack of skeleton information in single-modality spatio-temporal motion representation learning. Based on these challenges, we conduct a comprehensive review of the existing skeleton-based action recognition methods. Additionally, we provide an extensive review and analysis of publicly available action recognition datasets. This review aims to offer researchers a comprehensive perspective, stimulate more innovative ideas, and promote the application and breakthrough of skeleton action recognition in a wider range of computer vision tasks.

Abstract:
Broad learning system (BLS), as an innovative type of neural network, has demonstrated exceptional performance in regression tasks. Nonetheless, the majority of BLS methods, which rely on the least squares criterion, are highly sensitive to outliers and noisy data, resulting in reduced prediction accuracy. To improve the robustness of broad networks, a sparse Bayesian BLS via adaptive Lasso priors (AL-SBBLS) is proposed in this article to handle regression tasks with data contaminated by outliers and noise. Specifically, adaptive Lasso constraints are first applied to enhance the adaptive sparsity of output weights, which facilitates the automatic selection of highly correlated features. Subsequently, a multilayer Bayesian framework is constructed to provide an adaptive Lasso prior to the output weights, allowing the model to adaptively learn regularization factors and estimate probability distributions for output values, while further sparsifying the network. By selecting highly correlated features and estimating the probability distributions of output values, the impact of outliers and noise can be effectively mitigated. To effectively train the networks, corresponding optimization algorithms are designed for AL-SBLS and AL-SBBLS using the alternating direction method of multipliers (ADMM) and variational Bayesian inference methods, respectively. The effectiveness and robustness of the proposed methods are validated through robust regression experiments on 14 real-world datasets and complex nonlinear data. Quantitative results demonstrate that the proposed AL-SBBLS achieves the best performance on most datasets, attaining the lowest average ranking of 1.44 in Friedman tests compared with 11 state-of-the-art BLS variants, which confirms its superior predictive accuracy and robustness. The source code of AL-SBBLS proposed in this article is available at: https://github.com/taocheny/AL-SBBLS

Abstract:
This brief presents the adaptive optimal prescribed performance tracking solutions for the multiplayer nonlinear systems based on the adaptive critic learning scheme, where the tracking errors are constrained to a predefined bounded set. First, the general optimal tracking solutions of multiplayer nonlinear systems are presented. Every optimal tracking solution of multiple players consists of the steady-state part and the adaptive feedback part. The steady-state part can be obtained directly according to the tracking signal and system dynamics. Then, the adaptive feedback part can be studied with the prescribed performance constraints and adaptive critic learning such that multiple value functions achieve a Nash equilibrium with error constraints. Moreover, the convergence of the critic network weight is analyzed by the Lyapunov algorithm. Finally, simulation results and experiments are presented to demonstrate the satisfactory performance of the proposed method.

Abstract:
Data scarcity is a long-standing challenge in the vision-language navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often require extensive labor to remove the noise. In this article, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce object-enriched observation rewriting, where we combine vision-language models (VLMs) and large language models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via text-to-image generation models (T2IMs). Then, we propose observation-contrast instruction rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.

Abstract:
Neural architecture search (NAS) has gained significant traction in automating the design of neural networks. To reduce search time, differentiable architecture search (DAS) reframes the traditional paradigm of discrete candidate sampling and evaluation into a differentiable optimization over a super-net, followed by discretization. However, most existing DAS methods primarily focus on optimizing the coarse-grained operation-level topology, while neglecting finer-grained structures such as filter-level and weight-level patterns. This limits their ability to balance model performance with model size. In addition, many methods compromise search quality to save memory during the search process. To tackle these issues, we propose Multigranularity DAS (MG-DARTS), a unified framework that aims to discover both effective and efficient architectures from scratch by comprehensively yet memory-efficiently exploring a multigranularity search space. Specifically, we improve the existing DAS methods in two aspects. First, we adaptively adjust the retention ratios of searchable units across different granularity levels through adaptive pruning, which is achieved by learning granularity-specific discretization functions along with the evolving architecture. Second, we decompose the super-net optimization and discretization into multiple stages, each operating on a subnet, and introduce progressive re-evaluation to enable repruning and regrowth of previous units, thereby mitigating potential bias. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that MG-DARTS outperforms other state-of-the-art methods in achieving a better tradeoff between model accuracy and parameter efficiency. Codes are available at: https://github.com/lxy12357/MG_DARTS

Abstract:
While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes—including appearance, motion patterns, and associated risks—LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/

Abstract:
This addresses typesetting errors in [1]. The typesetting errors and their corrections are listed as follows.
1) On page 277, below (2): “where E denote s...” Corrected to: “where E denotes...”
2) On page 277, below (2): “The input T is projected...” Corrected to: “The input E is projected...”
3) On page 278, (10): \begin{equation} \Psi_{\mathrm{strip}}\left(\mathrm{x}\right) = \left[ \Phi_{\mathrm{strip}}^{1}\left(\mathrm{x}\right) \Vert \ldots \Vert \Phi_{\mathrm{strip}}^{\mathrm{S}}\left(\mathrm{x}\right) \right] \end{equation} Corrected to: \begin{equation} \Psi_{\mathrm{strip}}\left(\mathrm{x}\right) = \left[ \Psi_{\mathrm{strip}}^{1}\left(\mathrm{x}\right) \Vert \ldots \Vert \Psi_{\mathrm{strip}}^{\mathrm{S}}\left(\mathrm{x}\right) \right]. \end{equation}
4) On page 278, below (15): $((N_{\mathrm{new}} / N_{\mathrm{new}} + N_{\mathrm{old}}))^{1/2}$ Corrected to: $(N_{\mathrm{new}} / (N_{\mathrm{new}} + N_{\mathrm{old}}))^{1/2}$.
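
The fourth correction changes the value of the expression, not just its appearance. The short sketch below evaluates both versions under the usual operator precedence, reading N_new and N_old as positive counts. That reading, the function names, and the sample values are assumptions for illustration only.

```python
import math


def misprinted(n_new, n_old):
    """Literal reading of the misprint: division binds before addition,
    so the expression is sqrt(n_new / n_new + n_old) = sqrt(1 + n_old)."""
    return math.sqrt(n_new / n_new + n_old)


def corrected(n_new, n_old):
    """Corrected expression: square root of the fraction of new items."""
    return math.sqrt(n_new / (n_new + n_old))


# Hypothetical counts: 25 new and 75 old.
print(corrected(25, 75))   # sqrt(25 / 100) = 0.5
print(misprinted(25, 75))  # sqrt(1 + 75) = sqrt(76), well above 1
```

The corrected form always lies between 0 and 1, whereas the misprinted form grows with N_old.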

Abstract:
The spatiotemporal dynamics of traffic forecasting make it a challenging task. In recent years, by adapting to the topology of traffic networks where road segments serve as nodes, graph convolutional networks (GCNs) have been able to capture spatiotemporal dependencies, thereby improving traffic forecasting performance. However, there are two shortcomings of GCN-based methods: 1) existing methods treat the delays between nodes in the traffic network as equally important and fail to extract critical information effectively, leading to information redundancy, the introduction of irrelevant noise, and increased computational costs and 2) most methods overlook the issue that spatiotemporal correlations between nodes are inconsistent across different timescales. This article designs a new dynamic delay-aware multiscale spatiotemporal graph convolutional network (DDAMGCN) for traffic forecasting. Specifically, a dynamic delay-aware module is designed to identify key nodes and model the important delays from key nodes, so that the model focuses on key information and reduces computational cost. Additionally, a novel multiscale spatiotemporal graph convolution module is designed to achieve fine-grained modeling of the spatiotemporal correlation of different nodes at different timescales. Experiments on eight real traffic datasets verify the superiority of the proposed method compared to several state-of-the-art baselines.

Abstract:
The anchor-based clustering method is currently a predominant technique for handling large-scale data. However, in multiview data, existing anchor-based methods face a key challenge: balancing individual anchor graph distinctiveness with final consistency. To address this challenge, we propose a large-scale multiview clustering (MVC) method via joint learning of anchor representation and multigraph alignment (ARMGA). Specifically, ARMGA introduces a unified framework that facilitates the concurrent learning of single-view anchor representations and virtual graph-based multigraph alignment. The approach aims to preserve the adaptability of anchor learning across different views, while ensuring the ultimate consistency of the merged anchor graph. Furthermore, ARMGA employs the Schatten-$p$ norm on the tensor formed by the adaptive anchor representation, originating from multigraph alignment, to reinforce cross-view consistency. This technique effectively leverages complementary information preserved across views to bolster the overall structure and consensus information. Ultimately, to attenuate the noise impact on the anchor representation matrix, ARMGA capitalizes on the cosine angle information from the low-rank representation as coefficients within the relationship matrix and efficiently reduces computational complexity through deductions. On nine datasets, ARMGA has exhibited a notable improvement in clustering performance indicators by 2%–10% over other algorithms, while also maintaining lower time complexity.

Abstract:
Few-shot knowledge graph completion (FKGC) aims to infer missing triples for long-tail relationships using a small set of references. Existing FKGC models focus mainly on entity representation aggregation, heavily relying on interactions between central entities and their neighbors. However, real-world knowledge graphs contain relations with multiple semantics, and existing models struggle to capture the diverse semantic information of the relations in different contexts. To address this issue, we propose a novel FKGC model, context-aware relational learning and multidimensional matching (CRL-MM). First, CRL-MM enhances the representation of task relations by obtaining semantic information in different scenarios based on the semantic similarity between task relations and background relations. Second, unlike previous models, which rely mainly on neighborhood relations to capture relation information, CRL-MM considers the entity pair and its neighborhood as a unified contextual whole, aggregating neighborhood information through adaptive task relations and paired entity awareness to improve entity encoding. In addition, during the matching phase, we design a matching network from multiple dimensions, which includes not only the similarity score of the entity pairs but also the triple rationality score to further improve the generalizability of the model. Extensive experiments on public benchmark datasets show that CRL-MM outperforms state-of-the-art methods, and the ablation experiments also demonstrate the effectiveness of each module of the proposed CRL-MM.

Abstract:
In recent years, federated learning (FL) has received widespread attention for its ability to enable collaborative training across multiple clients while protecting user privacy, especially demonstrating significant value in scenarios such as medical data analysis, where strict privacy protection is required. However, most existing FL frameworks mainly focus on data heterogeneity without fully addressing the challenge of heterogeneous model aggregation among clients. To address this problem, this article proposes a novel FL framework called FedMKD. This framework introduces proxy models as a medium for knowledge sharing between clients, ensuring efficient and secure interactions while effectively utilizing the knowledge in each client’s data. In order to improve the efficiency of asymmetric knowledge transfer between proxy models and private models, a hybrid feature-guided multilayer fusion knowledge distillation (MKD) learning method is proposed, which eliminates the dependence on public data. Extensive experiments were conducted using a combination of multiple heterogeneous models under diverse data distributions. The results demonstrate that FedMKD efficiently aggregates model knowledge.
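
For readers unfamiliar with knowledge distillation, the sketch below shows the generic logit-based form: the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. It is only background for the idea of transferring knowledge between a proxy model and a private model. It is not the hybrid feature-guided multilayer fusion method of FedMKD, which the abstract does not specify. The function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic logit distillation shown for illustration, not the loss used by FedMKD.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return temperature ** 2 * kl


# Identical logits give zero loss, and disagreement gives a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0)  # True
```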

Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Fowler College of Business (FCB) and the Center for Human Dynamics in the Mobile Age (HDMA), San Diego State University (SDSU), San Diego, CA, USA; Data Science and Artificial Intelligence Lab, Indiana University Bloomington, Bloomington, IN, USA; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; State Key Laboratory of Management and Control for Complex Systems, Institute of Automation Chinese Academy of Sciences, Beijing, China

Abstract:
Over the past two decades, deep learning (DL) has achieved unprecedented breakthroughs across diverse application domains spanning computer vision (CV) to natural language processing (NLP). However, despite significant advances in computational resources and algorithmic frameworks, the training of deep neural networks continues to present formidable challenges due to persistent issues of training inefficiency and inherent data distribution biases. Recent years have witnessed the emergence of hard sample mining (HSM) as a promising paradigm to mitigate training inefficiencies and enhance model robustness through representative sample selection. Although HSM is reshaping contemporary AI research, its critical role in enabling efficient and robust model training has not yet been systematically explored. This article presents a comprehensive survey of HSM methodologies by: 1) establishing unified definitions of hard samples through rigorous sample complexity quantification criteria; 2) proposing a systematic taxonomy of HSM approaches with in-depth technical analysis; and 3) identifying pivotal research frontiers in this evolving field. This survey not only consolidates the foundations of HSM but also provides a roadmap for advancing efficient, robust, and generalizable deep learning models.
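
One simple instance of hard sample mining is to rank the samples in a batch by their current loss and keep only the hardest fraction for the update. The sketch below illustrates that loss-based selection. Using the per-sample loss as the hardness criterion, along with the function name and the `keep_ratio` parameter, is an assumption made for illustration. The survey's own definitions of sample hardness are not given in the abstract.

```python
import math


def select_hard_samples(losses, keep_ratio=0.5):
    """Return the indices of the highest-loss samples, in ascending index order.

    Treats per-sample loss as the hardness score (an illustrative assumption)
    and keeps ceil(len(losses) * keep_ratio) samples, with a minimum of one.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-sample losses for a batch of four.
print(select_hard_samples([0.1, 2.3, 0.7, 1.5], keep_ratio=0.5))  # [1, 3]
```

Only the selected indices would then contribute to the gradient step, which concentrates computation on the samples the model currently handles worst.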

Abstract:
This article pioneers the study of boundary-optimized fault-tolerant tracking control for flexible manipulators in a switching digraph with a heterogeneous linear leader. Compared with existing research, the proposed methods have several features. First, a distributed observer is designed to observe the leader’s information in a general switching graph where communication can be interrupted. Second, a new partial differential equation (PDE)-based fault observer (FO) is designed to estimate unknown faults using only a few boundary states. Third, a novel long-term integral cost function is formulated to minimize angle-tracking errors, vibration deflections, and control energy in flexible manipulators. The ideal boundary optimal control laws are, then, derived and approximated using actor–critic neural networks (NNs) based on reinforcement learning (RL). Under the proposed fully distributed optimized fault-tolerant controllers, the closed-loop flexible manipulator’s error states are proven uniformly ultimately bounded (UUB). Finally, the effectiveness of the proposed method is demonstrated through numerical simulation results.

Abstract:
Predicting power consumption for the Mars Express (MEX) mission is essential for optimizing its operational lifespan and mission assignments. However, the complexity of the Martian environment and the extended solar cycle obscure the periodicity of power consumption, making it difficult for existing methods to capture both intraperiodic and interperiodic features. This study introduces the bionic hierarchical learning network (BHL-Net) to enhance power consumption predictions. Leveraging 2-D frequency preprocessing and brain visual modeling techniques, BHL-Net mimics natural image encoding in the prefrontal cortex (PFC) to improve predictive performance. It incorporates a temporal oscillation activation module and a stripe intensity attention module to focus on local features, while a multihead attention adaptive aggregation module identifies key global features. Comparative experiments show that BHL-Net outperforms existing transformer-based models for MEX power consumption prediction. Ablation studies further validate the effectiveness of the FFT-based 2-D transformation and bionic attention framework. By emulating human brain response coding mechanisms, BHL-Net captures variations within and between complex cycles, providing a competitive solution for time series prediction in industrial applications.

Abstract:
In autonomous driving, accurate 3-D multiobject tracking (MOT) plays a key role in ensuring vehicle safety. However, due to the complexity of the environment, existing methods still face many challenges when dealing with long-distance objects, partial occlusions, and interference from similar categories. To tackle these challenges, we propose a 3-D MOT framework based on a voxel masking encoder (VME) and a deep hashing paradigm (DHP). We introduce a masking strategy that processes voxel features from near to far while maintaining feature sparsity, effectively capturing global contextual information between spatial features. Simultaneously, DHP is utilized to generate image hash codes and compute their Hamming distance from the category hash codes. This process effectively distinguishes between object categories and thus avoids cross-category object dissociation. In addition, we propose a distance optimization matching (DOM) method that uses geometric dimensions and spatial distances to build a cost matrix, achieving more efficient and precise object associations. Results from experiments conducted on the KITTI dataset reveal that our framework delivers outstanding tracking performance, surpassing other advanced methods in tracking accuracy. The code is released at https://github.com/lsy-collab/VD-MOT
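
The hash-code comparison described above can be illustrated with a minimal sketch. It only shows how a Hamming distance separates categories; the 8-bit codes, the category names, and the helper names `hamming_distance` and `nearest_category` are assumptions made for illustration and are not taken from the released code.

```python
def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def nearest_category(image_code, category_codes):
    """Return the category whose hash code is closest in Hamming distance."""
    return min(
        category_codes,
        key=lambda name: hamming_distance(image_code, category_codes[name]),
    )


# Hypothetical 8-bit codes, chosen only for illustration.
category_codes = {
    "car": [1, 0, 1, 1, 0, 0, 1, 0],
    "pedestrian": [0, 1, 0, 0, 1, 1, 0, 1],
}
image_code = [1, 0, 1, 0, 0, 0, 1, 0]

print(nearest_category(image_code, category_codes))
```

In this toy case the image code differs from the "car" code in one bit and from the "pedestrian" code in seven, so a detection would only be associated with tracks of the "car" category.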

Abstract:
Accurate imputation of missing data is crucial in the Industrial Internet-of-Things (IIoT), where operations are often compromised by noisy samples from harsh environments. Traditional imputation methods struggle with such noise due to their black-box nature or lack of adaptability. To address this issue, we recast data imputation as a distribution alignment problem, exploiting the flexibility of optimal transport (OT) to handle noisy samples. Specifically, we first introduce the Proximal Optimal Transport (POT) problem, in which the transportation cost is obtained by the network simplex approach with a selective matching mechanism, making it capable of matching distributions that contain noisy samples. Subsequently, we propose the POT-I framework, whose objective is to minimize the transport cost of POT. The resulting gradient is used to refine the imputed values, achieving missing data imputation (MDI) that is robust to noisy samples. Experiments on real-world IIoT datasets demonstrate the superiority of POT-I over state-of-the-art imputation methods.

Abstract:
Multiobjective reinforcement learning (MORL) aims to seek a complete Pareto front (PF) with different compromise policies in multiobjective Markov decision processes (MOMDPs). However, most MORL algorithms currently have a limitation in handling the MOMDPs with nonconvex PFs. In this article, we propose a nonlinear MORL algorithm based on decomposition and variance reduction (MORL/D-VR) to overcome this limitation. MORL/D-VR adopts the Tchebycheff approach to transform a given MOMDP into a set of single-objective Markov decision processes (MDPs) and subsequently applies an improved policy gradient algorithm, called expected utility policy gradient (EUPG), to solve each single-objective MDP efficiently. We analyze the Pareto optimality of employing the Tchebycheff approach and policy gradient methods that use the full return to update policy for solving MOMDPs. The analysis shows that such a case can identify any Pareto optimal policy regardless of the shape of PFs theoretically. This can provide a theoretical guarantee for applying the Tchebycheff approach and EUPG in MORL/D-VR to obtain the policies within the nonconvex PFs. Moreover, we devise a new baseline for EUPG to reduce the variance of gradient updates and adopt a weight vector adaptation method to improve diversity. The experimental results show that MORL/D-VR achieves a desirable performance in handling problems with different convex and nonconvex PFs and outperforms current state-of-the-art MORL algorithms.
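
The point about nonconvex PFs can be made concrete with a small sketch. It compares a linear weighted sum with the standard Tchebycheff scalarization on three hypothetical policies whose return vectors are invented for illustration; the function names and the ideal point `(1.0, 1.0)` are likewise assumptions, and the sketch does not reproduce MORL/D-VR or EUPG.

```python
def weighted_sum(returns, weights):
    """Linear scalarization of a return vector (larger is better)."""
    return sum(w * r for r, w in zip(returns, weights))


def tchebycheff(returns, weights, ideal):
    """Weighted Tchebycheff distance to the ideal point (smaller is better)."""
    return max(w * abs(z - r) for r, w, z in zip(returns, weights, ideal))


# Hypothetical two-objective returns. Policy C is Pareto optimal but lies in
# a nonconvex (concave) part of the front, below the line joining A and B.
policies = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
weights = (0.5, 0.5)
ideal = (1.0, 1.0)

best_linear = max(policies, key=lambda k: weighted_sum(policies[k], weights))
best_tcheby = min(policies, key=lambda k: tchebycheff(policies[k], weights, ideal))

print("weighted sum picks:", best_linear)
print("Tchebycheff picks:", best_tcheby)
```

With these numbers the weighted sum scores A and B at 0.5 and C at 0.4, so C is never selected, whereas the Tchebycheff value of C (0.3) is lower than that of A and B (0.5), so C is recovered.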

Abstract:
Hyperspectral anomaly detection is a crucial technique for recognizing abnormal pixels in hyperspectral images (HSIs), that is, those with distinct spectral characteristics from those of the surrounding background. Traditional methods always fall short in effectively leveraging the information regarding the spectral and spatial aspects of the dataset simultaneously, limiting their detection performances. This article proposes a novel framework using U-Net, termed hybrid convolution and transformer-based U-Net (HCT-Unet), which integrates convolution with a multihead attention mechanism in Transformer for enhanced hyperspectral anomaly detection. To ensure a more comprehensive understanding of spatial and spectral interactions, the HCT-Unet architecture capitalizes on the strengths of local feature extraction of convolutional layers and the capabilities of the long-range dependency modeling of Transformers. A key innovation of this framework is an error attention mechanism, which facilitates adaptive multiscale feature fusion and enhances the feature representation capacity. Furthermore, a new anomaly score calculation method is proposed, which combines reconstruction error with the pixelwise structural similarity index (SSIM) to determine pixel anomaly from both local structural preservation and global spectral consistency perspectives. Experiments carried out on seven different hyperspectral datasets reveal that the proposed method consistently outperforms the widely accepted state-of-the-art methods in hyperspectral anomaly detection.

Affiliations: College of Information Science and Technology and Artificial Intelligence and the College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing, China; School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong, China; Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology, Hefei, China; Chinese Academy of Forestry, Institute of Forest Resource Information Techniques, Beijing, China; College of Information Science and Technology and Artificial Intelligence, the State Key Laboratory of Tree Genetics and Breeding and the Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, China

Abstract:
Recently, there has been a surge in the development of robust norm distance-based linear discriminant analysis (LDA) techniques, which have garnered significant attention in the field of feature extraction. However, a persistent unresolved issue is that the successful suppression of outliers may inadvertently impede the accurate discrimination of normal points. To solve this problem, in this article, we study a novel robust LDA measured by double capped L_p-norm distance (CLD) metrics with min constraints (DCLDA) to learn robust discriminant projections, in which normal points and outliers are treated separately. Specifically, the proposed model uses a double capped L_p-norm with “Min” constraints to measure the distances for between- and within-class dispersions. The model ensures accurate discrimination of normal points through the L_p-norm, while also eliminating the exaggerated effect of outliers that may arise from larger p values. The resulting objective is nontrivial because of its nonconvexity and nonsmoothness. As one of the major contributions of this article, we introduce a new reformulation that provides an objective problem theoretically equivalent to the original. Using this reformulation, we develop an effective iterative algorithm to solve the proposed model. The algorithm is proven to be convergent through rigorous theoretical analysis. Extensive experiments were conducted on several real-world datasets across different image classification tasks to showcase the effectiveness of the proposed method.
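
To illustrate why capping limits the influence of outliers, the sketch below uses one common form of a capped distance, min(||x - y||_p^p, cap). This form, the cap value, and the function name `capped_lp_distance` are assumptions for illustration; the exact double capped formulation of DCLDA is not given in the abstract and may differ.

```python
def capped_lp_distance(x, y, p=2.0, cap=1.0):
    """Return min(sum_i |x_i - y_i|^p, cap): the p-th power L_p distance, capped."""
    distance = sum(abs(a - b) ** p for a, b in zip(x, y))
    return min(distance, cap)


class_mean = [0.0, 0.0]
normal_point = [0.3, 0.4]   # close to the mean: distance is kept as-is
outlier = [10.0, 10.0]      # far from the mean: distance is clipped to the cap

print(capped_lp_distance(normal_point, class_mean))
print(capped_lp_distance(outlier, class_mean))
```

The normal point keeps its true squared distance (about 0.25), so it is still discriminated accurately, while the outlier contributes at most the cap instead of 200.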

Abstract:
Protein–protein interactions (PPIs) and their interaction sites (PPISs) hold immense potential for elucidating cellular mechanisms and advancing targeted drug development. While deep learning has driven progress in PPI research by capturing protein features, it remains limited by its overreliance on sequence information and its inability to effectively integrate internal structural features of proteins. To address these challenges, we propose MEGAE, a novel model capable of high-precision prediction of PPIs and PPISs. MEGAE reconstructs amino acid microenvironments through a vector quantization autoencoder, integrating physicochemical properties, structural details, and sequence data to provide a comprehensive representation of proteins. We introduce a multiview random masking training strategy that injects controlled randomness into the reconstruction process to enhance the robustness of microenvironment embeddings. The model combines these fused embeddings with protein graphs and protein interaction networks, leveraging graph neural networks (GNNs) to capture multilevel relationships from local amino acid interactions to global signal network connections, thereby achieving precise predictions. Experimental results demonstrate that MEGAE outperforms state-of-the-art sequence- and structure-based methods across multiple datasets, exhibiting higher accuracy in predicting interaction types and interaction sites. This advancement underscores the potential of microenvironment-aware modeling in uncovering complex protein interactions.

Abstract:
Learning rules are critical to the problem-solving ability of neural networks. Significant progress has been made in neural networks based on deoxyribonucleic acid (DNA) strand displacement (DSD). Traditional chemical reaction networks (CRNs) usually focus on the implementation of one type of learning rule. The coexistence of multiple learning rules remains challenging. In this article, CRNs based on DSD are constructed. The networks consist of a weight multiplication module, an activation function module, a learning signal module, a weight update module, and a weight output module. By exploring the concentration of auxiliary strands in the modules, discrete perceptron, Hebbian, and filtered learning rules are simulated successfully. The feasibility is verified through a simple instance. Modules are also used to build a classification model that can learn about thyroid disease and make predictions about test categories. The simulation is verified by the software Visual DSD. This article will provide a theoretical basis for biomedical prediction and identification.
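
For readers unfamiliar with the learning rules named above, the sketch below writes out the textbook discrete perceptron and Hebbian weight updates in ordinary software. It is not a model of the DSD reaction networks or of Visual DSD; the 0/1 step activation, the linear output used in the Hebbian rule, the learning rates, and the function names are assumptions for illustration.

```python
def dot(weights, inputs):
    """Weighted sum of the inputs (the role of a weight multiplication module)."""
    return sum(w * x for w, x in zip(weights, inputs))


def perceptron_update(weights, inputs, target, lr=1.0):
    """Discrete perceptron rule: w <- w + lr * (target - y) * x, with y in {0, 1}."""
    output = 1 if dot(weights, inputs) >= 0 else 0
    return [w + lr * (target - output) * x for w, x in zip(weights, inputs)]


def hebbian_update(weights, inputs, lr=0.5):
    """Hebbian rule: w <- w + lr * y * x, with y the linear output."""
    output = dot(weights, inputs)
    return [w + lr * output * x for w, x in zip(weights, inputs)]


print(perceptron_update([0.0, 0.0], [1.0, 1.0], target=0))
print(hebbian_update([1.0, 0.0], [1.0, 2.0]))
```

The perceptron rule changes the weights only when the thresholded output disagrees with the target, whereas the Hebbian rule needs no target and strengthens weights whenever input and output are active together.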

Abstract:
Object navigation (ObjectNav), which enables an agent to seek any instance of an object category, has shown great advances. However, current agents are built upon occlusion-prone visual observations or compressed 2-D maps, which hinder their embodied perception of 3-D scene geometry. Furthermore, existing methods usually decouple ObjectNav into the exploration and exploitation subtasks, easily leading to ambiguous object localization and blind exploration. To address these issues, we first propose an embodied contrastive learning (ECL) method with geometric consistency (GC) and behavioral awareness (BA), which motivates agents to encode 3-D scene layouts and semantic cues actively. The BA is modeled by predicting navigational actions based on multiframe visual images, as behaviors causing differences between adjacent visual sensations are crucial for learning correlations among continuous visions. The GC is modeled by aligning the behavior-aware visual stimulus with 3-D semantic shapes through unsupervised contrastive learning. Then, based on the above ECL pretraining, a coarse-to-fine ObjectNav policy with explorer and discriminator cooperation is proposed, inspired by the treasure-hunting mindset. Concretely, the explorer is designed to adaptively switch the action spaces, thereby switching the global and local exploration thoughts according to the accumulated scene priors. The discriminator is designed to discriminate the target’s authenticity using behavior-aware visual features and geometric invariance priors, which permits mimicking the human behavior of “approaching to confirm” when distinguishing objects from a distance. As expected, our ECL method performs well on object detection (ObjDet) and instance segmentation (InstSeg) tasks. Our ECL-enhanced ObjectNav strategy outperforms state-of-the-art (SOTA) methods on Matterport3D (MP3D), Gibson, and HM3D datasets.

Abstract:
Robust and efficient tracking and mapping are critical for underwater vehicles, but remain challenging due to degraded visual quality, ambiguous features, and limited computational resources. Although recent deep learning-based stereo matching methods have significantly improved geometric perception for robots, most existing approaches struggle to simultaneously achieve high speed and strong generalization. To address these challenges, we propose SAFT, a tracking and mapping framework based on self-supervised, robust, and real-time stereo matching. SAFT introduces three key innovations: 1) SAFT-Stereo, a novel stereo matching network that integrates cost aggregation with iterative optimization to enable efficient disparity estimation in feature-sparse regions; 2) a spatiotemporal self-supervised loss that leverages both spatial and temporal constraints to provide stable training signals in textureless regions; and 3) SAFT-DSOL, a real-time tracking and mapping algorithm that integrates the self-supervised models to achieve robust localization and dense reconstruction. Extensive experiments on both public and custom underwater datasets demonstrate that SAFT-Stereo achieves the best generalization performance among all real-time methods, while requiring only 1/6 of the inference time of RT-IGEV++. Moreover, the proposed SAFT-DSOL enables stable and efficient tracking and achieves real-time dense reconstruction in indoor shipwreck scenarios. The code is available at github.com/c237814486/SAFT-Stereo

Abstract:
Multimodal knowledge graph completion (MMKGC) enhances the precision and breadth of application of knowledge graphs by integrating rich data from various modalities, steadily increasing its appeal in the research community. Prior studies mainly focus on the common representation of different modalities while neglecting the different and complementary features. On the contrary, some works tend to model triples of each modality separately while overlooking the similarities between modalities. It is challenging to associate the heterogeneous modalities effectively for MMKGC. In this article, we introduce a novel MMKGC framework by cross-modal interaction with similarity-enhancing and difference-embracing (CISEDE), which leverages both the similarities and differences among multimodal entities by a proposed cross-modal interaction mechanism. In the cross-modal interaction, multihead attention is employed to enhance similarity information from multimodal entities and embrace different information by linking various modal triples. Through relation-guided fusion, the modal triples are decoded and merged for MMKGC. The experimental results on three commonly used datasets, FB15k-237, WN9, and WN18RR, show that the proposed method achieves state-of-the-art performance.

Abstract:
Multimodal cross-city semantic segmentation aims to adapt a network trained on multiple labeled source domains (MSDs) from one city to multiple unlabeled target domains (MTDs) in another city, where the multiple domains refer to different sensor modalities. However, remote sensing data from different sensors increases the extent of domain shift in the fused domain space, making feature alignment more challenging. Meanwhile, traditional fusion methods only consider complementarity within MSDs (or MTDs), which wastes cross-domain relevant information and neglects control over domain shift. To address the above issues, we propose a similarity-inspired fusion and invertible transformation learning network (SFITNet) for multimodal cross-city semantic segmentation. To alleviate the increasing alignment difficulty in multimodal fused domains, an invertible transformation learning strategy (ITLS) is proposed, which adopts a topological perspective on unsupervised domain adaptation. This strategy aims to simulate the potential distribution transformation function between the MSD and the MTD based on invertible neural networks (INNs) after feature fusion, thereby performing distribution alignment independently within the two feature spaces. A cross-domain similarity-inspired information interaction module (CDSiM) is also designed, which considers the correspondence between the MSD and the MTD in the fusion stage, effectively utilizes multimodal complementary information and promotes the subsequent alignment of fused domain shifts. The semantic segmentation tests are completed on the public C2Seg-AB dataset and a new multimodal cross-city Su-Wu dataset. Compared with some state-of-the-art techniques, the experimental results demonstrated the superiority of the proposed SFITNet.

Abstract:
The development of sophisticated models for video-to-video synthesis has been facilitated by recent advances in deep reinforcement learning (RL) and generative adversarial networks (GANs). In this article, we propose RL-V2V-GAN, a new deep neural network approach based on RL for unsupervised conditional video-to-video synthesis. While preserving the unique style of the source video domain, our approach aims to learn a mapping from a source video domain to a target video domain. We train the model using policy gradient and employ convolutional long short-term memory (ConvLSTM) layers to capture the spatial and temporal information by designing a fine-grained GAN architecture and incorporating spatiotemporal adversarial goals. The adversarial losses aid in content translation while preserving style. Unlike traditional video-to-video synthesis methods requiring paired inputs, our proposed approach is more general because it does not require paired inputs. Thus, when dealing with limited videos in the target domain, that is, few-shot learning, it is particularly effective. Our experiments show that RL-V2V-GAN can produce temporally coherent video results. These results highlight the potential of our approach for further advances in video-to-video synthesis.

Abstract:
In standard reinforcement learning, since the uncertainty of task objectives is not adequately considered in the policy training, the policy achieves poor generalization for the out-of-distribution (OOD) tasks. Although considerable efforts have been made to enhance the generalization for OOD tasks, most of these methods overlook the structural information of task representations in latent space during the generation of extrapolative data, resulting in biased and blurred data embeddings, which then affect the policy generalization. To address this issue, we propose a context-based meta-reinforcement learning (meta-RL) method, namely latent variable distribution enhancement sampler (LVDES), which enhances the policy generalization on OOD tasks by providing efficient task representation space and accurate augmentation policy training data for OOD tasks. Specifically, the proposed LVDES consists of four modules: a task inference module, a task separation module, a latent enhancement module (LEM), and a policy module. The task inference module is used to identify the task. The task separation module (TSM) learns a representation space with highly structured separability. The LEM generates relevant additional task trajectories for augmenting policy training data. The policy module learns a policy to solve tasks. By using efficient task representation space and augmented trajectory data, the exploration efficiency and generalization of the policy for OOD tasks can be enhanced by our LVDES method. Extensive experiments are conducted to demonstrate the effectiveness of our method in comparison with existing methods on the MuJoCo and Meta-World benchmarks. The experimental results show that the task completion accuracy of our LVDES on OOD tasks is increased by 60.20%, with the average exploration time being reduced by 62.99% in comparison with the most effective current method, which demonstrates that our LVDES can achieve great policy generalization on OOD tasks.

Abstract:
We study the leader–follower consensus problem in multiagent systems with heterogeneous agent dynamics and multiple internal players per agent, each with distinct and interaffected objectives. Formulated as a multiplayer differential game per agent, the goal is to achieve output consensus among all agents while ensuring Nash equilibrium controls across each agent’s internal players. To address this challenge, we introduce a distributed control framework that integrates both feedforward (regulator-based) and feedback (game-theoretic Riccati-based) components. We further design a FilterNet reinforcement learning (RL) architecture that solves the control solutions while eliminating the need for large-scale distributed data storage. Organized into four layers, FilterNet handles admissible policy identification, online initialization, asynchronous updates for Nash policy convergence, and real-time regulator solutions. This design reduces data requirements, ensures initial excitation, and accelerates convergence. Theoretical guarantees establish conditions for solvability and convergence. Numerical simulations and comparisons with existing methods confirm the effectiveness and superiority of the proposed approach.

Abstract:
Partial multilabel feature selection (PMLFS) is a prevalent subject that aims to enhance the performance of multilabel learning (MLL) in the context of noisy labels. In PMLFS, a crucial aspect is handling the false positive labels hidden in the candidate label set, as the imprecise annotations could mislead the feature selection process. However, many existing approaches for partial label disambiguation rely on topology information and tend to be error-prone. Besides, feature selection frameworks are often built upon a linear regression model, leading to a reliance on the classifier and a deficiency in exploring local structures. Focusing on the issues above, this article proposes a novel two-stage PMLFS method, resorting to the ideology of granular computing. In the first stage, a label disambiguation method is developed using label-specific information. Specifically, a specific granular ball computing model is designed to characterize the distribution of datapoints labeled differently, and therefore, using the affinity relationships among samples and balls, the label-specific information concealed in the data distribution can be captured for label disambiguation. In the second stage, a filter-based feature selection method that explores the local structure of samples is presented. This method relies on a devised fuzzy decision neighborhood rough set (FDNRS) to capture more detailed membership information by maximizing the neighborhood consistency of samples’ related labels. Simultaneously, the feature selection method minimizes the uncertainty derived from unrelated labels. Extensive experiments on 12 datasets in terms of four evaluation metrics demonstrated the effectiveness of the proposed approach.

Abstract:
Nowadays, billions of phones, internet-of-things (IoT), and edge devices around the world generate data continuously, enabling many machine-learning (ML)-based products and applications. However, due to increasing privacy concerns and regulations, these data tend to reside on devices (clients) instead of being centralized for performing traditional ML model training. Federated learning (FL) is a distributed approach in which a single server and multiple clients collaboratively build an ML model without moving data away from clients. Whereas existing studies on FL have their own experimental evaluations, most experiments were conducted using a simulation setting or a small-scale testbed. This might limit the understanding of FL implementation in realistic environments. In this empirical study, we systematically conduct extensive experiments on a large network of IoT and edge devices (called IoT–Edge devices) to present FL real-world characteristics, including learning performance and operation (computation and communication) costs. Moreover, we mainly concentrate on heterogeneous scenarios, which is the most challenging issue of FL. By investigating the feasibility of on-device implementation, our study provides valuable insights for researchers and practitioners, promoting the practicality of FL and assisting in improving the current design of real FL systems.

Abstract:
Optimization problems that have multiple optimal solutions and whose objectives and constraints must be evaluated through expensive simulations or physical experiments are referred to as expensive constrained multimodal optimization problems (ECMMOPs). Under limited real function evaluations (FEs), it is challenging to find multiple optimal solutions accurately while satisfying constraints. To address these issues, this article studies a self-clustering particle swarm optimization algorithm with modal detection informed classification evaluation (MDICE) to solve ECMMOPs. To deal with multimodality, a surrogate-assisted self-clustering update mechanism is first designed to update individuals in each modality. Following that, a novel modal detection strategy is proposed based on the awareness of fitness landscapes to identify all potential modal seeds. For better utilization of FEs, a modality-guided classification evaluation strategy is designed to efficiently generate infilling samples for each constraint and modality. Moreover, to address the complex constraints, a surrogate-assisted feasibility search strategy is developed to quickly search for feasible solutions at a lower evaluation cost. Experimental results on 33 benchmark functions with various characteristics indicate that MDICE outperforms four state-of-the-art surrogate-assisted evolutionary algorithms.

Affiliations: Institute for Sustainable Industries and Liveable Cities (ISILC), Victoria University, Footscray, VIC, Australia; School of Information Technology, Deakin University, Burwood, VIC, Australia; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; RMIT Centre of Cyber Security Research and Innovation, School of Computing Technologies, RMIT University, Melbourne, VIC, Australia

Abstract:
With the rise of the smart industry, machine learning (ML) has become a popular method to improve the security of the Industrial Internet of Things (IIoT) by training anomaly detection models. Federated learning (FL) is a distributed ML scheme that facilitates anomaly detection on IIoT by preserving data privacy and breaking data silos. However, poisoning attacks pose significant threats to FL, where adversaries upload poisoned local models to the aggregation server, thereby degrading model accuracy. The prevalence of non-independent and identically distributed (non-IID) data across IIoT devices further exacerbates this threat, as it naturally leads to diverse local models, making malicious ones harder to distinguish. To address the above challenges, we propose a deep-layer sign-sharing personalized FL (DSPFL) scheme. DSPFL innovatively aggregates only the signs of stochastic gradients (SignSGD) from the deep layers of local models during training. This targeted aggregation enhances the robustness of the shared components against poisoning attacks, while shallow layers are retained locally to preserve personalization. This integrated approach improves the accuracy and resilience of personalized local models on IIoT devices under poisoning attacks. Extensive experimental results show that DSPFL consistently achieves up to 20% higher and more stable overall personalized model accuracy compared to state-of-the-art methods under specific poisoning attacks.
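
The idea of sharing only gradient signs from deep layers can be sketched as follows. The sketch uses coordinate-wise majority voting, the usual aggregation rule for signSGD; whether DSPFL aggregates signs in exactly this way is an assumption, and the layer names, gradient values, and function names are invented for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def aggregate_deep_signs(client_grads, deep_layers):
    """Majority-vote the gradient signs of the shared (deep) layers only."""
    aggregated = {}
    for layer in deep_layers:
        per_client = [grads[layer] for grads in client_grads]
        # One vote per client and coordinate; magnitudes are discarded.
        votes = [sum(sign(g) for g in column) for column in zip(*per_client)]
        aggregated[layer] = [sign(v) for v in votes]
    return aggregated


# Two benign clients and one hypothetical poisoned client with huge gradients.
client_grads = [
    {"shallow": [0.1, -0.2], "deep": [0.5, -0.3, 0.2]},
    {"shallow": [0.3, 0.1], "deep": [0.4, -0.1, -0.6]},
    {"shallow": [9.0, 9.0], "deep": [-100.0, 100.0, -100.0]},
]

print(aggregate_deep_signs(client_grads, deep_layers=["deep"]))
```

Because each client contributes only one vote per coordinate, the poisoned client's large magnitudes cannot dominate the shared update, and the "shallow" entries never leave the clients.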

Abstract:
Class incremental learning (CIL) offers a promising framework for continuous fault diagnosis (CFD), allowing networks to accumulate knowledge from streaming industrial data and recognize new fault classes. However, current CIL methods assume a balanced data stream, which does not align with the long-tail distribution of fault classes in real industrial scenarios. To fill this gap, this article investigates the impact of long-tail bias in the data stream on the CIL training process through experimental analysis. Observations show that long-tail bias in the data stream has a cascading effect, affecting both the retention of old task knowledge and the learning of new tasks. Concurrently, the incremental model encounters challenges in identifying samples that conflict with its biases. Accordingly, we propose a CFD method called long-tail CIL via bias calibration (LTCIL-BC), which aims to improve the learning of bias-conflicting samples through bias exploration and debiasing. Specifically, LTCIL-BC simultaneously trains a primary debiased network and an auxiliary biased network. Then, a bias-indicating score is developed to provide insight into model bias and data bias based on the prediction error of the primary and auxiliary models, respectively. LTCIL-BC subsequently adjusts the logits of the debiased network using the bias-indicating score to guide optimization, thereby better utilizing the role of old class exemplars and reducing catastrophic forgetting. Experiments on power system (PS) and secure water treatment (SWaT) datasets demonstrate the superior performance of LTCIL-BC in CFD, achieving up to 9% improvement over state-of-the-art baselines in multiple long-tailed CIL settings. Comprehensive results demonstrate the effectiveness of LTCIL-BC in jointly addressing data and model bias during calibration and prioritizing bias-conflicting samples.

Abstract:
This article proposes a hierarchical neural learning (HNL) algorithm for optimal tracking control (OTC) of nonlinear strict-feedback systems (SFSs) with unmatched disturbances (uMDs) and unknown dynamics. Leveraging the recursive structure of SFSs, we introduce a virtual target (VT) construction scheme in which each VT is a nonlinear mapping of the current state and desired output, thereby eliminating the noncausality problem that typically plagues discrete-time SFS control. The VTs serve as auxiliary inputs for low-order subsystems, while a time-varying affine Hamilton–Jacobi–Isaacs (HJI) formulation establishes an explicit relationship between the auxiliary control and the disturbance. The controller is synthesized directly from input–output data, removing the need for an accurate plant model. Within an adaptive dynamic programming (ADP) framework, we further enhance the neural architecture by replacing the conventional action network with a tracking network (T-network) whose energy function merges gradient information with future tracking errors, ensuring that each policy update simultaneously reduces control effort and improves tracking accuracy. Simulations confirm that the proposed HNL scheme achieves outstanding performance in both (optimal) tracking modes, exhibiting strong robustness to uMDs and significant model uncertainties.

Abstract:
To address redundant information accumulation, low computational efficiency, and fuzzy feature allocation in multiscale time-series prediction with the traditional deep echo state network (DeepESN), this article proposes an interlayer sparse compression-based DeepESN model (ICS-DESN). The model combines the sparse sampling technique of compressive sensing with the hierarchical dynamic feature extraction mechanism of DeepESN: it introduces an adaptive compressed sampling module between layers and uses a Gaussian observation matrix to reduce the dimension of the high-dimensional state, which effectively suppresses the stacking of redundant information in the deep network and explicitly allocates multiscale temporal features. Theoretical analysis proves that ICS-DESN satisfies the stability condition of the echo state property (ESP) by constraining the weighted spectral radius of the reservoir. In the experiments, we use multiscenario time-series datasets, including logistic chaotic systems, Lorenz attractors, sunspot data, the NASDAQ stock index, the ETTh1 dataset, and a weather dataset, to validate the effectiveness of the model. The results show that, compared with baseline models, ICS-DESN significantly reduces prediction errors [mean squared error (MSE) and mean absolute error (MAE)] while demonstrating higher computational efficiency and robustness. This research provides an efficient theoretical framework for complex time-series modeling and has potential application value in resource-constrained scenarios, such as edge computing.
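
The interlayer compression step can be sketched as a random linear projection. The example below draws a Gaussian observation matrix and maps a high-dimensional reservoir state to a lower-dimensional one before it would be passed to the next layer. The dimensions, the 1/m variance scaling, the seed, and the function names are assumptions for illustration; the adaptive part of the sampling module and the reservoir dynamics are not modeled.

```python
import random


def gaussian_observation_matrix(m, n, seed=0):
    """Draw an m-by-n matrix with i.i.d. N(0, 1/m) entries."""
    rng = random.Random(seed)
    scale = (1.0 / m) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def compress(state, phi):
    """Project an n-dimensional state to m dimensions: z = phi @ state."""
    return [sum(p * s for p, s in zip(row, state)) for row in phi]


n_reservoir, n_compressed = 10, 3                      # hypothetical layer sizes
phi = gaussian_observation_matrix(n_compressed, n_reservoir, seed=1)
state = [0.1 * i for i in range(n_reservoir)]          # stand-in reservoir state

compressed = compress(state, phi)
print(len(state), "->", len(compressed))
```

Because the projection is fixed and linear, it adds no trainable parameters between layers, which is consistent with the efficiency motivation stated in the abstract.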

Abstract:
With the growth of multisource sensor technology, multimodal learning has become pivotal in remote sensing (RS) image segmentation. Despite its potential, current methods face challenges in acquiring large-scale paired samples. When annotated optical images are available, but synthetic aperture radar (SAR) images lack annotations, learning discriminative features for SAR images from optical images becomes difficult. Unsupervised domain adaptation (UDA) offers a potential solution to this challenge, which we refer to as unpaired cross-modality UDA. In this article, we propose unlocking pseudolabel potential and alignment (ULPA) for unpaired cross-modality adaptation in RS image segmentation, a novel one-stage adaptation framework designed to enhance cross-modality knowledge transfer. Our approach employs a prototypical multidomain alignment (PMDA) strategy, which reduces the modality gap through contrastive learning between features and prototypes of identical classes across different modalities. In addition, we introduce the unreliable-sample-guided feature contrast (UFC) loss to address the underutilization of unreliable pixels during training. This strategy separates reliable and unreliable pixels based on prediction confidence, assigning unreliable pixels to a category-wise queue of negative samples, thus ensuring all candidate pixels contribute to the training process. Extensive experiments show that the integration of PMDA and UFC loss can lead to more effective cross-modality domain alignment and substantially boost the model’s generalization capability.

Abstract:
This article proposes a new cyberattack on decentralized federated learning (DFL), named user isolation poisoning (UIP). While following the standard DFL protocol of receiving and aggregating benign local models, a malicious user strategically generates and distributes compromised updates to undermine the learning process. The objective of the new UIP attack is to diminish the impact of benign users by isolating their model updates, thereby manipulating the shared model to reduce the learning accuracy. To realize this attack, we design a novel threat model that leverages an adversarial message-passing graph (MPG) neural network. Through iterative message passing, the adversarial MPG progressively refines the representations (also known as embeddings or hidden states) of each benign local model update. By orchestrating feature exchanges among connected nodes in a targeted manner, the malicious users effectively curtail the genuine data features of benign local models, thereby diminishing their overall influence within the DFL process. The MPG-based UIP attack is implemented in PyTorch, demonstrating that it effectively reduces the test accuracy of DFL by 49.5% and successfully evades existing cosine similarity- and Euclidean distance-based defense strategies.

Abstract:
In this article, we introduce a deep unfolding framework for Tail-iterative soft thresholding algorithm (ISTA) and Tail-fast ISTA (FISTA), extending classical sparse recovery algorithms into learned architectures and improving upon existing unfolding techniques. By combining the interpretability of iterative solvers with the adaptability of model-based networks, our approach achieves efficient and robust recovery of sparse signals. Tail-based methods incorporate an iterative support estimation step, where the support and target estimations are refined alternately, providing a key advantage over traditional approaches. We integrate this into our architecture, enhancing both recovery performance and noise robustness. We compare the proposed methods against classical solvers, including FISTA and Tail-FISTA, as well as deep unfolding techniques, LISTA and DU-FISTA, across various sparsity levels, dynamic ranges (DRs), and both noiseless and noisy conditions. In noiseless cases, our methods achieve slightly lower performance than classical solvers but with significantly reduced computational costs. Under heavy noise and a high number of nonzero elements, where classical methods struggle, our learned approaches remain resilient and achieve improved recovery rates. To evaluate generalization, we also tested our methods on data generated with a perturbed sensing matrix. In this case, under noisy scenarios, our proposed methods outperform classical sparse recovery algorithms. The proposed framework is general and applies to any linear sparse recovery task in compressed sensing (CS), offering computational efficiency, robustness to noise, and adaptability to real-world data, showcasing the advantages of deep unfolding techniques with iterative support estimation.
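
For readers unfamiliar with the classical solver being unfolded, the following is a minimal sketch of ISTA with soft thresholding for min 0.5*||Ax - y||^2 + lam*||x||_1. The optional `support` argument is a hypothetical simplification of the tail idea: indices assumed to lie in the estimated support are left unthresholded. The alternating refinement of the support and the learned (unfolded) parameters described in the abstract are not shown.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of t * |v|)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def ista(a, y, lam, step, iters=100, support=None):
    """Plain ISTA; entries whose index is in `support` skip the thresholding."""
    support = set(support or ())
    a_t = [list(col) for col in zip(*a)]  # transpose of a
    x = [0.0] * len(a[0])
    for _ in range(iters):
        residual = [r - yi for r, yi in zip(matvec(a, x), y)]
        grad = matvec(a_t, residual)  # gradient of the data-fidelity term
        z = [xi - step * g for xi, g in zip(x, grad)]
        x = [zi if i in support else soft_threshold(zi, lam * step)
             for i, zi in enumerate(z)]
    return x


# Toy problem with an identity sensing matrix (step = 1 is valid here).
a = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
x_plain = ista(a, y, lam=1.0, step=1.0, iters=5)              # [2.0, 0.0]
x_tail = ista(a, y, lam=1.0, step=1.0, iters=5, support={1})  # [2.0, 0.5]
```

In the toy problem, the small entry is zeroed by plain ISTA but preserved when its index is treated as part of the support, which is the intuition behind exempting the estimated support from shrinkage.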

Affiliations: College of Information Science and Engineering, the Key Laboratory of Computing and Stochastic Mathematics (Ministry of Education), and the School of Mathematics and Statistics, Hunan Normal University, Changsha, China; College of Business, Hunan Normal University, Changsha, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; Department of Radiology, Chongqing University Three Gorges Hospital, Chongqing University, Chongqing, China

Abstract:
Timely risk prediction of Alzheimer’s disease (AD) holds significant clinical value. However, the inherent fuzziness of disease information hinders the deeper understanding of AD pathogenesis and limits the effectiveness of current predictive models. This article explores the staged evolutionary patterns of AD by integrating fuzzy graph-based disease modeling and deep learning. First, we use fuzzy graphs to quantify interpathogeny associations through fuzzy memberships. Second, we propose a fuzzy entropy propagation model to mathematically describe AD deterioration as the spread of fuzzy entropy information in fuzzy graphs. Finally, we introduce a novel fuzzy graph evolutionary generative adversarial network (FGE-GAN) for disease risk prediction and pathogeny extraction. In the generator of FGE-GAN, fuzzy graph convolution (FGC) layers are designed based on the mathematical model to capture AD’s evolutionary patterns with interpretability. Experiments on multiple brain disease datasets indicate that FGE-GAN outperforms state-of-the-art methods in disease risk prediction. In addition, the extracted multiomics pathogenies provide valuable insights for early intervention. The code is available at: github.com/fmri123456/FGE-GAN

Abstract:
Few-shot class-incremental learning (FSCIL) presents a greater challenge compared with few-shot task-incremental learning (FSTIL) due to the need to classify all previous classes without prior knowledge of the session identifier (session-ID). To address this, we propose Bamboo, a novel framework for FSCIL that introduces a cascading inference mechanism to explicitly infer the session-ID for each sample. This mechanism is enabled by a novel, session-specific equiangular tight frame prototype (ETF-P) classifier. By adaptively fusing session-agnostic and session-specific semantics, the ETF-P classifier reliably determines if a sample belongs to its associated session, which is the core decision required at each step of the cascade. Considering the incremental nature of the learning process, which resembles the continuous growth of bamboo, we treat the base session classifier as the foundational bamboo node and progressively add new session classifiers as additional nodes on top. During the testing phase, each sample flows sequentially through the bamboo nodes, from top to bottom, to determine its session-ID and to be classified accordingly. Overall, the Bamboo framework is capable of perceiving session-ID without prior knowledge and classifying each sample within the correct session, leading to state-of-the-art performance on multiple benchmark datasets.
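
The top-to-bottom cascade can be sketched as a loop over session nodes. This is a hedged illustration only: the `(belongs, classify)` pairs are hypothetical stand-ins for the ETF-P classifiers, and treating the base session as the fallback when no later node claims the sample is an assumption, not a detail stated in the abstract.

```python
def cascade_infer(nodes, sample):
    """Walk from the newest node (top) to the base node (bottom).

    nodes: list of (belongs, classify) callables, base session first.
    Returns (session_id, predicted_label).
    """
    for session_id in range(len(nodes) - 1, 0, -1):
        belongs, classify = nodes[session_id]
        if belongs(sample):
            return session_id, classify(sample)
    # Assumed fallback: the base session handles everything not claimed above.
    _, classify = nodes[0]
    return 0, classify(sample)


# Toy setup: samples are integers and session k owns values in [10k, 10k + 9].
def make_node(k):
    return (lambda s: 10 * k <= s <= 10 * k + 9, lambda s: "class-%d" % s)


nodes = [make_node(k) for k in range(3)]
result_new = cascade_infer(nodes, 25)
result_base = cascade_infer(nodes, 4)
```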

Abstract:
The criminal court view generation (CCVG) task aims to produce succinct and coherent summaries of fact descriptions, providing interpretable opinions for verdicts. Traditional text generation evaluation metrics, such as ROUGE, BLEU, and BERTSCORE, are extensively employed for this task and measure performance by averaging the assessment scores of all samples within the test set. However, these sample-averaged metrics encounter two primary dilemmas: 1) they fail to fairly assess overall evaluation scores across different case types and 2) they overlook the measurement of the degree of performance imbalance between case types. To fill this research gap, we propose two novel case-type-oriented evaluation metrics: Case-type-oriented Text Generation (CTG) and Case-type-oriented Imbalance Performance (CIP). First, CTG mitigates the unfair assessment among different case types by assigning equal weight to each type. Second, CIP evaluates performance imbalance by measuring the distance between the performance of each case type and the overall performance. We provide three theorems to elucidate the properties of CIP, demonstrating that CIP can effectively identify the extent to which a CCVG model achieves balanced generation performance across different case types. Furthermore, we propose an embarrassingly simple and effective charge-guided encoder–decoder (CGED) framework to enhance performance fairly across different case types in encoder–decoder pretrained language models (PLMs). Code is available at https://yuquanle.github.io/Case-type-oriented-metrics-homepage/
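
The difference between sample-averaged and case-type-oriented scoring can be made concrete with a small sketch. The equal-weight average below follows the description of CTG; the imbalance measure is an assumed mean absolute deviation used only to illustrate the idea of measuring distance from the overall performance, and is not the paper's CIP definition. All function names and the toy scores are hypothetical.

```python
from collections import defaultdict


def per_type_means(scores, case_types):
    """Average the per-sample scores within each case type."""
    groups = defaultdict(list)
    for score, case_type in zip(scores, case_types):
        groups[case_type].append(score)
    return {t: sum(v) / len(v) for t, v in groups.items()}


def sample_average(scores):
    """Conventional metric: every sample has equal weight."""
    return sum(scores) / len(scores)


def case_type_average(scores, case_types):
    """Case-type-oriented metric: every case type has equal weight."""
    means = per_type_means(scores, case_types)
    return sum(means.values()) / len(means)


def imbalance(scores, case_types):
    """Assumed measure: mean absolute gap between each type and the overall score."""
    means = per_type_means(scores, case_types)
    overall = sum(means.values()) / len(means)
    return sum(abs(m - overall) for m in means.values()) / len(means)


# A frequent case type scores well and a rare one scores poorly.
scores = [0.8, 0.8, 0.8, 0.8, 0.4]
types = ["theft", "theft", "theft", "theft", "fraud"]
avg_by_sample = sample_average(scores)          # about 0.72
avg_by_type = case_type_average(scores, types)  # about 0.60
gap = imbalance(scores, types)                  # about 0.20
```

The sample average is dominated by the frequent case type, while the equal-weight average exposes the weaker performance on the rare one.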

Abstract:
Recognizing endangered bird species in complex outdoor environments has attracted considerable attention in the fields of computer vision and machine learning. However, fine-grained bird image classification (FBIC) is susceptible to problems such as arbitrary postures, low interclass discriminability, and occlusions. We propose a novel semantic homology relationship representation learning method for fine-grained bird classification with large language models, namely HomLLM, to address these challenges in FBIC effectively. Our proposed model aims to learn homology relationship representations adaptively by identifying invariant structural correspondences between visual features and semantic descriptions, using limited bird data and base class labels. Our approach yields two key findings: 1) invariant homology in key regions of birds that maintain structural consistency across different postures and 2) homological relationships that establish essential taxonomic markers among similar bird classes. Based on these insights, we propose two new modules: the semantic homology generation (SHG) module and the homology relationship mining (HRM) module. Specifically, in SHG, bird features are described at multiple granularities through a large language model (LLM) to establish semantic homology. In HRM, feature adaptation is performed separately for textual and visual information, and cross-modal homological interaction is performed hierarchically. In addition, we propose a hierarchical homology interaction scheme to integrate multilevel homological features while preserving structural consistency. Experiments on the commonly used bird datasets CUB-200-2011 and NABirds demonstrate that HomLLM exhibits better performance than state-of-the-art (SOTA) methods.

Abstract:
Diffusion models, as a class of generative frameworks based on step-wise denoising, have recently attracted significant attention in the field of medical image segmentation. However, existing diffusion-based methods typically rely on static fusion strategies to integrate conditional priors with denoised features, making it difficult to adaptively balance their respective contributions at different denoising stages. Moreover, these methods often lack explicit modeling of pixel-level uncertainty in ambiguous regions, which may lead to the loss of structural details during the iterative denoising process, ultimately compromising the accuracy (Acc) and completeness of the final segmentation results. To this end, we propose FEU-Diff, a diffusion-based segmentation framework that integrates fuzzy evidence modeling and uncertainty fusion (UF) mechanisms. Specifically, a fuzzy semantic enhancement (FSE) module is designed to model pixel-level uncertainty through Gaussian membership functions and fuzzy logic rules, enhancing the model’s ability to identify and represent ambiguous boundaries. An evidence dynamic fusion (EDF) module estimates feature confidence via a Dirichlet-based distribution and adaptively guides the fusion of conditional information and denoised features across different denoising stages. Furthermore, the UF module quantifies discrepancies among multisource predictions to compensate for structural detail loss during the iterative denoising process. Extensive experiments on four public datasets show that FEU-Diff consistently outperforms state-of-the-art (SOTA) methods, achieving an average gain of 1.42% in the Dice similarity coefficient (DSC), 1.47% in intersection over union (IoU), and a 2.26 mm reduction in the 95th percentile Hausdorff distance (HD95). In addition, our method generates uncertainty maps that enhance clinical interpretability.
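
As a small illustration of how a Gaussian membership function can express pixel-level ambiguity, the sketch below scores how close a predicted foreground probability is to 0.5. The choice of center, the width `sigma`, and the `ambiguity` helper are assumptions made for illustration; the FSE module's actual membership functions and fuzzy rules are not specified in the abstract.

```python
import math


def gaussian_membership(x, center, sigma):
    """Degree in [0, 1] to which x belongs to a fuzzy set centered at `center`."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def ambiguity(prob, sigma=0.1):
    """Assumed rule: a pixel is most ambiguous when its probability is near 0.5."""
    return gaussian_membership(prob, 0.5, sigma)


# Membership values for a confident pixel and a boundary-like pixel.
confident_pixel = ambiguity(0.95)
boundary_pixel = ambiguity(0.55)
```

A pixel with probability 0.55 receives a much higher ambiguity degree than one with probability 0.95, which is the kind of signal an uncertainty map could be built from.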

Abstract:
The vast availability of free data has been critical to the success of large language models (LLMs). With the widespread use of LLMs, more and more concerns have been raised about the unauthorized use of publicly available data. To protect data from unauthorized use for training models, researchers have proposed adding imperceptible perturbations into image data so that models would be misled by the generated shortcut features and cannot mine information from these images. However, due to the inherent discrete property and semantic complexity of texts, directly applying these methods to text will cause semantic changes, resulting in meaningless shortcut features being constructed. To tackle this problem, in this article, we design a novel Unlearnable text examples generation algorithm via syntax-oriented shortcut (UTE-SS) by incorporating the syntactic structure of texts. Specifically, we propose a syntax template generator (STG) to generate the optimal perturbing syntax for a given category, which will realize imperceptible perturbations. Then, a perturbing text generator (PTG) is designed to perturb the in-class texts with the selected syntax template to stably deviate from the original texts. Along this line, models will be misled to learn the shortcut between the syntax template and the category, so as to keep text examples unlearnable. Extensive experiments over eight advanced Transformer-based pretrained language models (PLMs) on four different natural language processing (NLP) tasks demonstrate the effectiveness and flexibility of our proposed algorithm. Our method is easy to implement, and the code is publicly available at https://github.com/libolb/UTE-SS.

Abstract:
High-frequency trading (HFT) requires fast data processing without information lags for precise stock price forecasting. This high-paced stock price forecasting is usually based on vectors that need to be treated as sequential and time-independent signals due to the time irregularities that are inherent in HFT. A well-documented and tested method that considers these time irregularities is a type of recurrent neural network (NN), named long short-term memory (LSTM) NN. This type of NN is formed based on cells that perform sequential and stale calculations via gates and states without knowing whether their order, within the cell, is optimal. In this article, we propose a revised and real-time adjusted LSTM cell that selects the best gate or state as its final output. Our cell is running under a shallow topology, has a minimal look-back period, and is trained online. This revised cell achieves lower forecasting error compared to other recurrent NNs (RNNs) for online HFT forecasting tasks such as the limit order book (LOB) mid-price (MP) prediction as it has been tested on two high-liquid U.S. and two less-liquid Nordic stocks.

Abstract:
Deep reinforcement learning (DRL) excels at learning control policies in high-dimensional action spaces, making it crucial for robotic manipulation. However, its real-world application is limited by costly and risky data collection. Offline reinforcement learning (offline RL) addresses this issue by training on precollected datasets but struggles with Q value overestimation in high-dimensional discrete action spaces, where the number of out-of-distribution (OOD) actions rapidly increases, negatively impacting training stability. In this work, we propose the most overestimated Q value regularization (MQR), a novel offline RL algorithm that penalizes the action with the most overestimated Q value, effectively mitigating overestimation in high-dimensional discrete action spaces. By regulating the action most affected by Q value overestimation, rather than applying uniform penalties across the entire action space as in existing methods, MQR further prevents the policy from converging incorrectly. We evaluate MQR on a robotic pushing and grasping task, a challenging high-dimensional discrete action space problem, in both simulated and real-world environments with random, dense, and unknown object arrangements. The results demonstrate that MQR significantly outperforms baseline algorithms, achieving a clearance rate of 96.94% in simulations and 99.04% in real-world dense configurations, while maintaining high action efficiency and stability. These findings highlight MQR’s robustness, scalability, and adaptability for robotic manipulation, showcasing its potential for real-world deployment in industrial robotics. The code used in our research is publicly available at https://github.com/Hanyang-Robot/MQR

Abstract:
This article introduces a novel fuzzy logic-enhanced neuroadaptive sliding mode control (FLENNSMC) framework, developed for vehicular platoon systems subject to a confluence of challenges. Leveraging the synergistic integration of fuzzy logic’s interpretive strengths and neural networks’ adaptive learning capabilities, FLENNSMC effectively addresses nonlinear dynamics, stochastic disturbances, actuator faults, and stringent asymmetric spacing constraints. We propose a Takagi–Sugeno (T-S) fuzzy model to structure the learning process and a fuzzy logic-enhanced radial basis function neural network (FLERBFNN) for robust approximation of unknown functions, including unmodeled dynamics and fault signals. The controller design incorporates a fault-tolerant control mechanism for enhanced robustness, an asymmetric barrier Lyapunov function (BLF) to strictly enforce spacing constraints, and a Nussbaum function to compensate for actuator faults with unknown directions. The fuzzy logic-enhanced structure allows for localized and efficient learning, which reduces computational burden and improves adaptation speed. Through a rigorous stochastic Lyapunov–Krasovskii stability analysis, we derive sufficient LMI-based conditions for the uniform ultimate boundedness (UUB) of tracking errors in the mean square sense and guarantee a mixed H-infinity/passivity performance. Extensive simulations on a 2-D multilane vehicular platoon demonstrate the superior performance of the proposed FLENNSMC compared to conventional neuroadaptive control approaches, particularly highlighting the benefits of fuzzy logic in structuring the learning process and handling complex uncertainties. Simulation code is available at https://github.com/zhanganguo/FLENNSMC-Platoon-Control-Simulation

Abstract:
This article investigates the system modeling problem for the dynamical process of human brain activity in human–robot cognitive interaction (HRCI). An important novelty of the proposed approaches is to build a computational model of a human-distributed robot-lumped parameter system (HDRLPS) that describes the inherent dynamical principle of human brain activity (with spatiotemporal-varying characteristic) undergoing the interaction between the intrinsic cognitive dynamics and extrinsic robot stimuli. A deterministic learning (DL)-based spatiotemporal dynamics identification scheme is proposed to accurately identify the spatiotemporal dynamics of HDRLPS and obtain the associated knowledge as a constant radial basis functional neural network (RBF NN) model. A spatiotemporal dynamics estimator is designed with this model, which can accurately evaluate and monitor the dynamical process of human brain activity in real-time HRCI by the generated dynamics-synchronized state. The effectiveness and practicability of the approaches in the dynamics identification and evaluation for the human brain activity in HRCI are validated by thorough analysis, including the mathematical proof, the simulation study, and the brain–computer interface (BCI) experiment using publicly available datasets. Our method is compared with state-of-the-art (SOTA) methods, such as LGGNet, EEGNet, Tsception, EEG-Deformer, EEG-Transformer, and EEGViT. The results show that our method can outperform these methods with better recognition accuracy and macro-F1 scores. The source code can be found at: https://github.com/alonexing/source_code/tree/master

Abstract:
Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using only one source domain data to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. In this study, we propose a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance (UG) to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer (FAT) addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs uncertainty estimation to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the model’s applicability in medical imaging modalities.
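
The idea of changing the amplitude spectrum while keeping the phase spectrum can be illustrated with a generic amplitude-mixing sketch. This is a swapped-in, simplified technique: it works on 1-D signals with a naive DFT and linearly blends amplitudes with a reference signal, whereas the semantic, angular-point-based modulation and uncertainty guidance of FIESTA are not reproduced. The function names and the `alpha` parameter are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def amplitude_mix(x, ref, alpha):
    """Blend the amplitude of x with that of ref; keep the phase of x."""
    mixed = []
    for cx, cr in zip(dft(x), dft(ref)):
        amp = (1.0 - alpha) * abs(cx) + alpha * abs(cr)
        mixed.append(cmath.rect(amp, cmath.phase(cx)))
    # For real inputs the mixed spectrum stays conjugate-symmetric,
    # so the imaginary part of the result is only numerical noise.
    return [v.real for v in idft(mixed)]


x = [1.0, 2.0, 3.0, 4.0]
ref = [2.0, 2.0, 2.0, 2.0]
unchanged = amplitude_mix(x, ref, alpha=0.0)  # recovers x
restyled = amplitude_mix(x, ref, alpha=1.0)   # amplitude of ref, phase of x
```

With `alpha=0.0` the input is recovered, and with `alpha=1.0` the output takes on the flat amplitude profile of the reference, which shows how appearance-like statistics can change while the phase (structure) is held fixed.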

Abstract:
Federated learning (FL) facilitates collaborative training among multiple clients while preserving data privacy by eliminating raw data transmission. However, the inherent data heterogeneity among participants induces bias during collaborative learning, significantly degrading the performance of local models. Existing FL solutions face critical challenges in achieving efficient knowledge transmission, particularly with respect to insufficient information extraction or excessive communication costs, which result in slow convergence and inferior performance. To address these limitations, we propose a novel FL framework in a synergy of multi-level prototype-based contrastive learning (CL) and soft label generation, named FedMPS. The proposed method first constructs multi-level prototypes from different layers of the model to capture semantic information in high-level features and detailed information in low-level features. These prototypes are then utilized through CL to enhance inter-class discriminability and intra-class consistency in the feature space. In addition, a prototype-guided soft label generation module is introduced to model latent inter-class relationships in the output space. Instead of exchanging model parameters, FedMPS transmits only prototypes and soft labels, effectively reducing global knowledge shift and communication costs. Extensive experimental studies on six publicly available datasets validate the effectiveness of the proposed method when compared to the current state-of-the-art FL approaches. The code is available at github.com/wenxinyang1026/FedMPS
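
A common way to build class prototypes is to average the feature vectors of each class, and repeating this for features taken from several layers gives one set of prototypes per level. The sketch below illustrates that under the assumption of mean-based prototypes; the function names and layer keys are hypothetical, and the contrastive loss and soft label generation of FedMPS are not shown.

```python
def class_prototypes(features, labels):
    """Return {label: mean feature vector} for one layer's features."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def multi_level_prototypes(layer_features, labels):
    """Return {layer name: {label: prototype}} for features from several layers."""
    return {name: class_prototypes(feats, labels)
            for name, feats in layer_features.items()}


# Toy client data: three samples, two classes, features from two layers.
labels = [0, 0, 1]
layer_features = {
    "low": [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    "high": [[0.5], [1.5], [4.0]],
}
prototypes = multi_level_prototypes(layer_features, labels)
```

Only these small per-class vectors, rather than full model parameters, would need to be communicated, which is the source of the communication savings the abstract describes.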

Abstract:
Spiking neural networks (SNNs) can be operated in an event-driven manner to reduce the energy consumption of artificial neural networks (ANNs), and they have attracted enormous research interest for their high biological plausibility and powerful spatiotemporal information processing. However, representative studies only evaluated SNNs on static temporal tasks or short sequence tasks, which could not fully demonstrate the advantages of SNNs in spatiotemporal learning. In addition, we point out that existing directly trained SNNs face the problems of long-term memory, network degeneration, gradient saturation, and heterogeneity learning, which limit the performance of SNNs. In this article, we propose the channelwise regional integrate and multiple firing (CRIMF) neuron to improve the spatiotemporal learning of SNNs. First, the CRIMF neuron contains a new internal state of regional current that enhances the memory of spiking neurons and facilitates the learning of temporal information over long time steps. Second, the CRIMF neuron is implemented with a multiple firing mechanism; it is able to adjust the distribution of membrane potential and membrane potential gradient in the single firing mechanism, thus mitigating underactivation and gradient saturation. Third, the CRIMF neuron is trained with the channelwise learning strategy for the targeted learning of different types of temporal features, and an index of differentiation degree is proposed to visualize the effectiveness of the channelwise learning strategy. We also introduce the regional current reset equation and normalize the input of postsynaptic neurons in the spatiotemporal dimension to avoid network degeneration. Finally, we select two emotion electroencephalogram (EEG) datasets and perform the evaluations based on manual features and raw signals.
Experimental results show that CRIMF-based SNNs outperform the state-of-the-art methods on the static temporal task, and CRIMF neurons are superior to advanced spiking neurons and recurrent units of ANNs on the dynamic temporal task, with low energy consumption.

Abstract:
Hyperspectral image (HSI) super-resolution reconstruction is a challenging ill-posed inverse problem, which seeks to enhance the spatial resolution of low-resolution hyperspectral images (LR-HSIs) by integrating complementary information from high-resolution multispectral images (HR-MSIs), ultimately generating high-resolution HSIs (HR-HSIs). Existing methods commonly employ residual connections and deep layer stacking to facilitate information propagation. While residual connections effectively preserve gradient flow, we observe that naively increasing network depth in high-dimensional spectral tasks can lead to feature redundancy and performance saturation. To address these challenges, this article presents a novel Butterfly residual network (BRNet) that incorporates spectral Transformers and depth-wise convolutions to optimize both accuracy and computational efficiency of hyperspectral super-resolution reconstruction from two perspectives: learning strategy and feature extraction. Regarding learning strategy, a recursive structure coupled with a fusion parameter generation technique is proposed to promote efficient feature fusion and enable adaptive network pruning, thereby reducing redundant information and enhancing computational efficiency. For feature extraction, spectral Transformer and depth-wise convolution are employed to capture spectral and spatial features, respectively, effectively leveraging their complementary advantages across different dimensions. A specialized spectral-spatial interaction (SSI) module is then incorporated to effectively fuse the extracted features, thereby enriching the diversity of network features. Additionally, the convolutional gated feed-forward network (FFN) is designed to bolster the network’s ability to capture local features while significantly reducing the computational complexity. 
Experimental evaluations on three hyperspectral datasets demonstrate that the proposed method outperforms existing state-of-the-art super-resolution reconstruction methods across various performance metrics, validating its effectiveness and superiority.

Abstract:
High-performance and compliant manipulator control requires accurate dynamics models. However, real manipulators operate in open, changing scenarios, and a controller with a fixed structure and fixed parameters often fails to control the manipulator well. In this article, we develop a new evolving control framework that combines offline and online learning to improve the accuracy and compliance of manipulators in open, changing scenarios. The joint framework consists of offline learning based on a deep fuzzy neural network (DFNN) and online learning based on an evolving fuzzy system (EFS), where online learning uses both indirect learning and direct learning based on an expert knowledge base. To evaluate the effectiveness of the proposed method, tracking control simulations are performed on a PUMA560 manipulator as well as experiments on a real-world six-degree-of-freedom manipulator. The results show that the proposed method can effectively improve tracking control accuracy, significantly reduce feedback control torque, and improve adaptability to changing environments.

Affiliations: Department of Dermatology and Skin Science, Photomedicine Institute, the Centre for Clinical Epidemiology and Evaluation, the School of Biomedical Engineering, and Vancouver Coastal Health Research Institute, The University of British Columbia, Vancouver, BC, Canada; Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada; Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, Guangdong, China

Abstract:
The seven-point checklist (7PCL) is a widely used diagnostic tool in dermoscopy for identifying malignant melanoma by assigning point values to seven specific attributes. However, the traditional 7PCL is limited to distinguishing between malignant melanoma and melanocytic nevi (MN) and falls short in scenarios where multiple skin diseases with appearances similar to melanoma coexist. To address this limitation, we propose a novel diagnostic framework that integrates a clinical knowledge-based topological graph (CKTG) with a gradient diagnostic strategy featuring a data-driven weighting (GD-DDW) system. The CKTG captures both the internal and external relationships among the 7PCL attributes, while the GD-DDW emulates dermatologists’ diagnostic processes, prioritizing visual observation before making predictions. Additionally, we introduce a multimodal feature extraction approach leveraging a dual-attention mechanism to enhance feature extraction through cross-modal interaction and unimodal collaboration. This method incorporates meta-information to uncover interactions between clinical data and image features, ensuring more accurate and robust predictions. Our approach, evaluated on the EDRA dataset, achieved an average AUC of 88.6%, demonstrating superior performance in melanoma detection and feature prediction. This integrated system provides data-driven benchmarks for clinicians, significantly enhancing the precision of melanoma diagnosis.
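
For context, the traditional checklist is commonly described as assigning 2 points to each of three major criteria and 1 point to each of four minor criteria, with a total of 3 or more regarded as suspicious for melanoma. The sketch below encodes that commonly cited fixed-weight form only; the identifier names are illustrative, and the data-driven weighting (GD-DDW) and graph components described in the abstract are not reproduced.

```python
# Commonly cited form of the seven-point checklist (identifiers are illustrative).
MAJOR_CRITERIA = (
    "atypical_pigment_network",
    "blue_whitish_veil",
    "atypical_vascular_pattern",
)
MINOR_CRITERIA = (
    "irregular_streaks",
    "irregular_pigmentation",
    "irregular_dots_globules",
    "regression_structures",
)


def seven_point_score(findings):
    """Sum 2 points per major criterion and 1 point per minor criterion present."""
    present = set(findings)
    major = sum(2 for name in MAJOR_CRITERIA if name in present)
    minor = sum(1 for name in MINOR_CRITERIA if name in present)
    return major + minor


def is_suspicious(findings, threshold=3):
    """Flag a lesion when the total score reaches the threshold."""
    return seven_point_score(findings) >= threshold


lesion = ["blue_whitish_veil", "irregular_streaks"]
score = seven_point_score(lesion)  # 2 + 1 = 3
flagged = is_suspicious(lesion)
```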

Abstract:
Noninvasively predicting the status of isocitrate dehydrogenase (IDH) and chromosome arms 1p/19q preoperatively on multisequence magnetic resonance imaging (MRI) images is helpful for prognosis and optimal therapy planning of patients with gliomas. However, effectively learning discriminative features from MRI images for predicting IDH mutation and 1p/19q codeletion status remains challenging due to the high heterogeneity of gliomas. A dual structural feature exploration and alignment network (DSFEAnet) was proposed to effectively explore representative features associated with the intratumoral and marginal heterogeneity of gliomas for accurate prediction. First, a match and mismatch feature extraction (MMFE) module was introduced to extract image structural features related to intratumoral heterogeneity, such as information associated with tumor core localization and T2-fluid-attenuated inversion recovery (FLAIR) mismatch sign. Second, a graph-based geometry exploration (GGE) module was developed to explore graph structural features related to marginal heterogeneity. In this module, the vertex associations and variations perpendicularly along a 3-D tumor surface were integrated as a graph, which can effectively perceive changes in locations, sizes, and marginal textures of gliomas, thus enhancing the feature representational ability to describe glioma heterogeneity. Finally, a dual structural feature alignment (DSFA) module was incorporated to narrow the gaps among intra- and interstructural features. It can adaptively align and fuse different features and thus further improve the overall prediction performance. The proposed DSFEAnet was evaluated using a multicenter dataset, and its robustness was demonstrated on an independent clinical dataset. 
Specifically, preoperative MRI images of 560 glioma samples were collected from the publicly available The Cancer Imaging Archive (TCIA) (n = 203, age: 51.82 (15.21) years, male/female: 109/94, IDH-mutant/wild-type: 90/113, 1p/19q-codeleted/noncodeleted: 27/176), Nanfang Hospital (n = 136, age: 41.96 (12.29) years, male/female: 80/56, IDH-mutant/wild-type: 53/83, 1p/19q-codeleted/noncodeleted: 31/105), and Zhujiang Hospital (n = 221, age: 43.82 (17.36) years, male/female: 134/87, IDH-mutant/wild-type: 94/127, 1p/19q-codeleted/noncodeleted: 34/187). Our DSFEAnet achieved an AUC of 87.72% for IDH mutation status prediction and an AUC of 80.52% for 1p/19q codeletion status prediction on the Nanfang Hospital dataset. Finally, the interpretability of the proposed modules was assessed to highlight the effectiveness of our method. Overall, the DSFEAnet exhibits great potential for predicting IDH mutation and 1p/19q codeletion status.

Abstract:
The olfactory transduction process commences when odorant molecules bind to specific olfactory receptors (ORs) located in the nasal cavity. Recognizing the interactions between odorant molecules and ORs remains a significant challenge due to the complex and nonlinear nature of the molecule–receptor relationship. Addressing this challenge is pivotal for advancing our understanding of human olfaction mechanisms and aiding in the development of novel synthetic pharmaceuticals. The primary difficulty arises from the intricate interactions between odorant molecules and ORs, where diverse molecules with varying physical and chemical properties can activate specific receptors, and conversely, individual receptors can bind multiple distinct molecules. In this study, we present a novel approach for predicting molecule–receptor interactions by leveraging multimodal deep learning networks to precisely identify specific ORs for given molecules. Our method achieves an accuracy of 95.1% when evaluated on a newly curated dataset. Notably, the proposed method achieves substantial improvements in performance metrics compared with other deep learning classification models and existing recognition approaches and exhibits robustness against discontinuities in the mapping of molecule structures to ORs. In addition, we developed a space distribution map to elucidate the structural intricacies of diverse receptors, revealing the clustering patterns among receptors. Our method facilitates a deeper understanding of the interaction mechanisms between molecules and receptors, laying a foundation for the digitization of receptor reactions.

Abstract:
Artificial neural networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their closed-box nature poses trustworthiness challenges. In certain situations, ensuring trust in predictions might require following specific partial monotonicity constraints. However, certifying whether an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this article presents a novel algorithm (LipVor) that certifies whether a closed-box model, such as an ANN, is positive based on a finite number of evaluations. Consequently, since partial monotonicity can be expressed as a positivity condition on partial derivatives, LipVor can certify whether an ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the closed-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify whether the function is positive in the domain. Unlike prior methods, our approach certifies partial monotonicity without constrained architectures or piecewise linear activations. Therefore, LipVor could open up the possibility of using unconstrained ANNs in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied to them.
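The Lipschitz argument in this abstract can be illustrated with a deliberately simplified one-dimensional sketch. If f has Lipschitz constant L and f(x) > 0, then f stays positive on the open interval of radius f(x)/L around x, so a finite set of evaluations can cover a bounded domain. The code below sweeps an interval from left to right instead of using a Voronoi diagram, so it is not the LipVor algorithm itself; the function name, the status strings, and the `max_evals` parameter are hypothetical.

```python
from typing import Callable, List, Tuple


def certify_positive_1d(
    f: Callable[[float], float],
    lipschitz: float,
    lo: float,
    hi: float,
    max_evals: int = 10_000,
) -> Tuple[str, List[float]]:
    """Try to certify f(x) > 0 on [lo, hi] from finitely many evaluations.

    Assumes |f(a) - f(b)| <= lipschitz * |a - b| on [lo, hi].
    Returns a status ("certified", "counterexample", or "inconclusive")
    together with the list of evaluated points.
    """
    if lipschitz <= 0 or lo > hi:
        raise ValueError("need lipschitz > 0 and lo <= hi")
    points: List[float] = []
    x = lo
    for _ in range(max_evals):
        value = f(x)
        points.append(x)
        if value <= 0:
            # A non-positive evaluation disproves positivity outright.
            return "counterexample", points
        # f is positive on the open interval (x - radius, x + radius).
        radius = value / lipschitz
        if x + radius > hi:
            # The remaining part of the domain lies inside that interval.
            return "certified", points
        # The boundary point x + radius is not covered, so evaluate it next.
        x += radius
    return "inconclusive", points
```

To check monotonicity in this simplified setting, `f` would be the derivative of the model with respect to the input of interest, and `lipschitz` would need to bound how fast that derivative changes.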

Abstract:
Wearable sensors have found numerous applications in health and wellness promotion and have achieved great success leveraging advancements in deep learning. However, the development of robust models continues to be hindered by issues related to sensor noise, inconsistent sampling rates, and individual differences. Topological data analysis (TDA) has emerged as a viable solution to extract robust features from such time-series data by converting them into persistence images (PIs), which capture intrinsic characteristics and demonstrate resilience to noise and signal variations. However, the computational costs of TDA pose significant challenges for small devices with limited resources. To more efficiently incorporate topological features, we utilize knowledge distillation (KD), which is a promising way to generate a smaller model using larger models. Multiple teachers can be adopted to enrich features in KD. However, this approach presents two key challenges: 1) differences in feature dimensions from multimodal data and 2) conflicting knowledge provided by the different teachers, both of which can degrade the student model’s performance. To address these issues, we propose a novel KD framework called multimodal global latent workspace-based KD (mGLW-KD) that is motivated by global workspace theory (GWT) from cognitive neuroscience. GWT models how the brain integrates and distributes relevant information across different neural modules through a shared workspace, and it includes attentional control and working memory to prioritize and retain key information. Inspired by this theory, mGLW-KD incorporates a working memory module to unify diverse knowledge from multiple teacher models into a shared latent workspace, facilitating efficient knowledge transfer to the student model.
By integrating topological insights with cognitive principles, mGLW-KD addresses the challenges posed by wearable sensor data and enables the student model to achieve superior performance using only time-series input during inference.
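As background for the multi-teacher setting described in this abstract, the following is a minimal sketch of a conventional logit-based KD loss in which the teachers' temperature-softened distributions are simply averaged. This plain averaging is the naive baseline in which conflicting teachers can blur the target; it is not mGLW-KD, which the abstract describes as using a working memory module and a shared latent workspace. All function names and the default temperature are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_teacher_kd_loss(
    student_logits: Sequence[float],
    teacher_logits_list: Sequence[Sequence[float]],
    temperature: float = 4.0,
) -> float:
    """KL(mean teacher distribution || student distribution), scaled by T^2."""
    if not teacher_logits_list:
        raise ValueError("at least one teacher is required")
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_classes = len(student_logits)
    # Naive fusion: average the softened teacher distributions class by class.
    target = [
        sum(p[c] for p in teacher_probs) / len(teacher_probs)
        for c in range(n_classes)
    ]
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0.0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

With two teachers that disagree strongly, the averaged target becomes close to uniform, which gives a concrete sense of the conflicting-knowledge problem mentioned above.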

Abstract:
Time-variant-gain zeroing neural networks (TVG-ZNNs) are among the most powerful solvers for time-variant linear matrix-vector inequalities (TVLMVIs). Although TVG-ZNNs with complex nonlinear activation functions achieve effective convergence within finite or predefined time, they incur high computational costs and face challenges in precisely predefining their actual convergence time. In contrast, TVG-ZNNs with linear activation functions offer lower computational costs but struggle to achieve convergence within a finite or predefined time. In addition, the gain values of most existing TVG-ZNNs tend to increase over time, resulting in a significant rise in computational costs. To address these contradictory issues, we propose a novel hybrid-gain ZNN without a nonlinear activation function (HG-ZNN-WNAF) to solve TVLMVIs in both noisy and noise-free environments. Specifically, a new hybrid gain is cleverly designed to construct the HG-ZNN-WNAF activated by a linear activation function, while ensuring that the gain value does not keep increasing over time. Unlike the state-of-the-art TVG-ZNNs with or without nonlinear activation functions, our proposed HG-ZNN-WNAF achieves precisely predefined-time convergence due to the hybrid gain, meaning its actual convergence time can be accurately predefined. Additionally, the piecewise design of the hybrid gain, along with the use of the simple linear activation function, effectively reduces the model’s computational cost. Rigorous theoretical analysis demonstrates the precisely predefined-time convergence ability of the HG-ZNN-WNAF in both noisy and noise-free environments. Simulation and physical experiments validate the theoretical analysis and demonstrate that the HG-ZNN-WNAF achieves state-of-the-art performance in terms of convergence speed, robustness, and computational cost.