TKDE2026

Abstract:
Trust relationships play a crucial role in various domains, such as social spam detection, retweet behavior analytics, and recommendation systems. Trust is often implicit and difficult to observe directly in the real world, as it is driven by people’s underlying intentions and motivations. Therefore, when evaluating trust, it is critical to analyze not only user behavior data but also the intentions behind these behaviors that lead to trust. Existing trust evaluation methods often neglect the underlying reasons behind connections, such as shared hobbies or belonging to the same community. As a result, these methods cannot differentiate the genuine intentions that lead to trust, resulting in an inaccurate evaluation of hidden trust relationships. To address this issue, we propose a novel Intent-based model for Trust Evaluation (INTRUST). This model can distinguish the intent behind high-order information in social communities using hypergraphs. First, we use hyperedges to represent high-order correlations between user-to-item and user-to-user interactions. Then, we construct $K$ intent prototypes, which serve as foundational elements to build trust. Furthermore, we distinguish $K$ independent intent subgraphs from these high-order correlations. To enhance the generalization and robustness of the model, we employ self-supervised learning and construct contrastive views at the node level, hyperedge level, and node-hyperedge level. Extensive experiments on real-world datasets demonstrate that our model outperforms state-of-the-art approaches in terms of trust evaluation accuracy and efficiency.

Abstract:
Knowledge distillation is a practical tool for enhancing the performance of small-capacity student networks on downstream tasks, but it comes at the cost of a lengthy distillation process due to the online inference of teacher networks, especially when there is a large capacity gap between teacher and student. Therefore, in this paper, we propose a fast distillation framework called TeDri, which is based on region images and saves relevant regional information and its teacher guidance offline. Specifically, first, to alleviate the lack of diversity caused by the fixed augmentation path in region images, we propose Teacher-driven MixUp strategies with mild intensity and advocate binding the mixing factor $\lambda$ with teacher guidance confidence, where more confident category representations dominate the MixUp process. Furthermore, recognizing the need to evaluate these randomly cropped regions, we propose region contrastive learning, which encourages the student network to mimic the region partitioning behavior of the teacher, promoting a comprehensive understanding of global semantic content from multiple local perspectives. Finally, we introduce region mutual learning, employing spatial constraints among regions to steer the student network toward consistent content interpretation across localized regions. Experiments on CIFAR-100 and ImageNet-1K validate the effectiveness of the proposed TeDri, achieving competitive performance while significantly reducing training time.
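
As a rough illustration of binding the mixing factor to teacher confidence, the sketch below mixes two samples with a weight proportional to the teacher's confidence in each. The function name, the use of the top softmax probability as "confidence", and the normalization rule are assumptions made for this example; the abstract does not specify the exact rule used by TeDri.

```python
def confidence_bound_mixup(x_a, x_b, p_a, p_b):
    """Mix two samples; the mixing factor follows the teacher's confidence.

    x_a, x_b: feature vectors (lists of floats) of the two region images.
    p_a, p_b: teacher soft labels (probability vectors) for the two regions.
    Assumption: confidence is the largest teacher probability.
    """
    conf_a, conf_b = max(p_a), max(p_b)
    lam = conf_a / (conf_a + conf_b)  # the more confident sample dominates
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    p_mix = [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]
    return lam, x_mix, p_mix


lam, x_mix, p_mix = confidence_bound_mixup(
    [1.0, 1.0], [0.0, 0.0], [0.9, 0.1], [0.6, 0.4]
)
```

Here the first sample has teacher confidence 0.9 and the second 0.6, so the mixing factor is 0.9 / 1.5 and the first sample contributes more to both the mixed input and the mixed soft label.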

Abstract:
Explainable clustering has become increasingly important as it presents the clustering results in a manner that can be easily understood by end-users. However, most existing explainable clustering approaches generate only a single explanation, such as a decision tree, for the given clustering result, overlooking that multiple valid interpretations may exist. To address this limitation, we propose the CME (Clustering with Multiview Explanations) algorithm. This method constructs a diverse set of candidate decision trees by retaining multiple splitting points at each node, where each splitting point is valid in a statistical sense. The tree similarity based on optimal path matches across corresponding clusters is then used to build a similarity graph. A subset of representative decision trees is selected by solving the minimum dominating set problem defined on the corresponding similarity graph. Experiments on 10 real-world categorical datasets demonstrate that CME provides multiple complementary explanations that may be missed by existing algorithms. Moreover, these additional decision trees can be more accurate and interpretable than those identified by baseline methods.
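
The representative-selection step can be pictured with a small sketch. Minimum dominating set is NP-hard in general, so the code below uses the standard greedy approximation; this is an illustrative stand-in, not necessarily the solver used by CME, and the graph here is a made-up similarity graph whose nodes play the role of candidate trees.

```python
def greedy_dominating_set(adj):
    """adj: dict node -> set of neighbours (undirected graph).

    Returns a list of nodes such that every node is either chosen or
    adjacent to a chosen node (greedy approximation, not guaranteed minimum).
    """
    undominated = set(adj)
    chosen = []
    while undominated:
        # Pick the node that newly dominates the most nodes (itself + neighbours).
        best = max(sorted(adj), key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen


# Hypothetical similarity graph: tree 0 is similar to trees 1-3, tree 4 is isolated.
similarity_graph = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: set()}
representatives = greedy_dominating_set(similarity_graph)  # [0, 4]
```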

Abstract:
Diversified route planning finds multiple paths that are sufficiently different from each other while being as short as possible. It is of great significance for alleviating traffic by providing alternative routes during navigation. However, finding the optimal result is NP-hard, and the existing solutions offer either high quality (exact path enumeration-based methods) or high efficiency (alternative path methods). Their performance is highly affected by the network properties and query parameters, an effect that has never been investigated before. Therefore, this paper proposes a hybrid diversified routing system that can handle any query efficiently with quality as high as possible. Specifically, we first analyze the path enumeration problem from the ground up and unify all the existing algorithms theoretically to identify the factors that affect algorithm performance. After that, we review and select the alternative path methods to identify the suitable ones for our system. Finally, we propose a query classification module to estimate the hardness of a query and determine how it should be processed. Extensive experiments on real-life networks validate the effectiveness and efficiency of our hybrid system compared with state-of-the-art solutions.

Abstract:
Order-preserving pattern matching (OPPM) is a specialized area within the domain of pattern recognition and string matching. This specialized area is dedicated to identifying patterns in sequences where the intrinsic order of elements is crucially important. This comprehensive review provides an in-depth analysis of diverse order-preserving pattern matching techniques, focusing on their algorithms and methodologies. Particular attention is paid to the challenges researchers face in preserving order during pattern matching. The review also evaluates the performance and scalability of various techniques to handle large-scale datasets. By discussing the current state of OPPM research, we identify gaps, opportunities, and potential avenues for future exploration. Through this exploration, we aim to contribute valuable insights that will guide researchers and practitioners in advancing the frontiers of OPPM research, shaping the trajectory of this field in the coming years.
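
For readers new to the problem, a naive matcher makes the definition concrete: a text window matches the pattern when its elements have the same relative order. The sketch below assumes all values within a window are distinct and re-sorts every window, so it is far slower than the specialized algorithms such a review covers; the function names are illustrative.

```python
def order_signature(seq):
    """Indices of seq listed from smallest to largest value (distinct values assumed)."""
    return tuple(sorted(range(len(seq)), key=lambda i: seq[i]))


def oppm_naive(text, pattern):
    """Return all start positions where a window of text is order-isomorphic to pattern."""
    m = len(pattern)
    target = order_signature(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if order_signature(text[i:i + m]) == target
    ]


# The pattern "low, high, middle" occurs at positions 0 and 2.
matches = oppm_naive([10, 22, 15, 30, 20, 18, 27], [1, 3, 2])  # [0, 2]
```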

Abstract:
The recommendation system, as a widely used and effective tool to alleviate information overload, has been receiving increasing attention regarding its issues of bias and fairness. Many studies have focused on addressing fairness on the item side, targeting item fairness by minimizing exposure discrepancies of items among similar individuals. However, in real-world recommendation scenarios, many relevant items requiring similar exposure to users may exhibit certain dissimilarities, and existing methods cannot handle this case. To address this, we define a broader item fairness recommendation issue aimed at improving fairness within specified groups of related individual items, which we term “intra-group item fairness”. To solve this issue, we propose a Group-oriented Individual Fairness recommendation model called GIFRec. First, we introduce a global exposure balance module to mitigate exposure imbalances at a global level, with the help of multimodal information contained in each item. Then, at the group level, we propose a group fusion embedding representation method, allowing individual items within the same group to adaptively share group information. Additionally, as unfair training opportunities may arise for different items during model training, we propose a general fair intra-group optimization method to reduce individual training biases within the same group. Extensive experiments conducted on four real-world datasets demonstrate the effectiveness of our approach with an average improvement of 11.27% in accuracy and 24.91% in fairness compared to eight SOTA methods.

Abstract:
Heterogeneous graph neural networks (HGNNs) are effective for modeling multi-relational structured data. Existing HGNNs usually assume the training samples are relatively sufficient, thus focusing on improving the predictive performance by complicating the model architecture with more learnable parameters. In this paper, we instead explore how to design HGNNs when training labels are scarce, under which we observe that existing HGNNs suffer from serious overfitting issues. Inspired by the graph random neural network (GRAND)—a consistency regularization framework for graph learning, we propose a simple yet efficient R-GRAND framework to overcome the issues above. R-GRAND is a general relation-aware consistency regularized training method with both labeled and unlabeled nodes to facilitate the model’s generalization capability. It designs a lightweight relational graph convolution neural network (SRGC) as the backbone model to deal with the heterogeneous information. To enable regularized training, we further advance the data augmentation methods of GRAND with a Multi-block DropEdge strategy. The proposed training framework not only excels with its default SRGC backbone but also effectively enhances the performance of other HGNN architectures, such as RGCN and Simple-HGN. Extensive experiments on seven heterogeneous graph datasets demonstrate that R-GRAND can achieve remarkable performance improvements over state-of-the-art HGNNs with better generalization ability and high efficiency.

Abstract:
In-memory key-value storage necessitates a substantial quantity of computation and storage resources for both performance and scalability, thereby diminishing the resources available for user applications. The emergence of programmable network hardware, including SmartNICs and programmable switches, provides the opportunity to offload operations from server CPUs. We present Epiphron, a novel distributed in-memory key-value store architecture that co-designs with off-path SmartNICs and programmable switches. Facing the limited performance of off-path SmartNICs, Epiphron achieves high resource efficiency while maintaining load balancing and fault tolerance by (i) hybridizing erasure coding with replication in storage management, (ii) accelerating read operations with a new data plane design (conflict detection and RDMA-compatible forwarding) on programmable switches, and (iii) employing a network protocol extended from one-sided RDMA. We evaluate Epiphron on Barefoot Tofino switches, NVIDIA BlueField-2 SmartNICs, and commodity servers. The experimental results demonstrate that compared to existing solutions, Epiphron improves throughput by up to 2.2× and consumes 47% less memory while completely bypassing server CPUs.
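
To see why erasure coding saves memory relative to replication, consider the simplest possible erasure code: one XOR parity block over several equally sized data blocks. This is only a toy sketch under that assumption; the abstract does not state which code or block layout Epiphron uses.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


data = [b"key1", b"val7", b"abcd"]   # three data blocks on three servers
parity = xor_blocks(data)            # one parity block on a fourth server

# Suppose the server holding data[1] fails: XOR the survivors with the parity.
recovered = xor_blocks([data[0], data[2], parity])  # b"val7"
```

With three data blocks and one parity block, the store keeps 4/3 of the raw data size and survives one lost block, whereas two full replicas would keep 2x for the same single-failure tolerance.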

Abstract:
The proliferation of location-based services has facilitated the generation of extensive semantic trajectories for moving objects. In this paper, we investigate the problem of semantic trajectory similarity search, a fundamental task of trajectory analysis. The objective is to identify “similar” trajectories to a query trajectory by considering both spatial proximity and semantic similarity. Existing studies predominantly utilize keyword-based discrete semantic modeling, which fails to capture broader semantics, such as text descriptions. Moreover, these studies typically develop search algorithms on a single machine, resulting in limited processing capability. To this end, we propose an effective framework to perform distributed semantic trajectory similarity searches. We introduce a novel method for representing semantic trajectories by employing both geographical sequences and semantic sequences. This approach enhances spatial-semantic awareness in trajectory modeling. Next, we implement the query framework in Spark and propose a carefully designed two-phase, computation-aware partitioning architecture. This is the first architecture to consider both semantic and spatial aspects of trajectories, as well as inter-trajectory distances, to guide the partitioning process. It enables efficient pruning of most partitions in the cluster when processing queries. Within each local partition, we construct an ST-HNSW index to further accelerate queries. Our framework supports three widely used trajectory distance measures: DTW, LCSS, and EDR. Extensive experiments on two real datasets and one synthetic dataset demonstrate the significant advantages of our design, achieving efficiency improvements of 1 to 3 orders of magnitude over baseline and alternative approaches.
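
Of the three supported distance measures, DTW is the easiest to show in a few lines. The sketch below is the textbook dynamic program over purely spatial points; how the framework combines it with semantic similarity is not described in the abstract, so that part is left out.

```python
import math


def dtw(a, b):
    """Dynamic time warping distance between two trajectories.

    a, b: lists of (x, y) points. Runs in O(len(a) * len(b)) time.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# A trajectory that lingers at (1, 0) still aligns perfectly with the original.
same_route = dtw([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 0), (1, 0), (2, 0)])  # 0.0
```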

Abstract:
Bipartite graphs are widely used in many real-world applications, where discovering clusters is crucial for understanding their underlying structure. However, most existing clustering methods for bipartite graphs enforce the assignment of all vertices to clusters, often neglecting the important roles of outliers and hubs. To address this limitation, we extend the structural clustering model from unipartite to bipartite graphs. This extension is non-trivial due to the lack of common neighbors in bipartite graphs, which renders traditional similarity measures less effective. Recognizing that similarity is key to structural clustering, we resort to butterflies, the fundamental building blocks of bipartite graphs, to define a more effective similarity measure. Building on this, we further propose a novel structural clustering model, SBC, tailored for bipartite graphs. To enable clustering under this model, we develop efficient online and index-based methods, along with a dynamic maintenance method to accommodate graph updates over time. Extensive experiments on real-world bipartite graphs demonstrate that: (1) the SBC model greatly enhances clustering quality, achieving higher modularity while effectively identifying outliers and hubs; and (2) our proposed clustering methods are highly scalable, enabling the processing of graphs with up to 12.2 million edges within 2 seconds.
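
A butterfly is a 2×2 biclique: two vertices on one side that are both connected to the same two vertices on the other side. The sketch below counts the butterflies that contain a given edge, which is one plausible raw ingredient for a butterfly-based similarity; the actual similarity definition of the proposed model is not given in the abstract, and the function names are illustrative.

```python
def right_adjacency(adj_left):
    """Invert a left-to-right adjacency dict into a right-to-left one."""
    adj_right = {}
    for u, neighbours in adj_left.items():
        for v in neighbours:
            adj_right.setdefault(v, set()).add(u)
    return adj_right


def butterflies_on_edge(adj_left, adj_right, u, v):
    """Number of butterflies containing the edge (u, v), u on the left side."""
    total = 0
    for w in adj_right[v]:
        if w != u:
            # u and w both see v; every other shared neighbour closes a butterfly.
            total += len(adj_left[u] & adj_left[w]) - 1
    return total


# Complete bipartite graph K_{2,3}: each edge lies in exactly 2 butterflies.
adj_left = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "c"}}
adj_right = right_adjacency(adj_left)
count = butterflies_on_edge(adj_left, adj_right, "u1", "a")  # 2
```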

Abstract:
When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub-optimal performance of end-to-end encoders. To address this issue, we follow the pre-training+prompting paradigm to utilize pre-trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine-grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class-specific and class-agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper-parameter experiments further show the effectiveness of DGP.

Abstract:
Multimodal clustering (MMC) overcomes the limitations of unimodal methods by integrating information from multiple sources, but the complexity of heterogeneous information coupling hinders effective feature extraction. Critically, existing MMC paradigms primarily focus on capturing consensus through coarse-grained cross-modal alignment. However, such task-agnostic strategies overlook the differences in the utility of feature information across varying task environments. In the absence of task-centric guidance, models often struggle to effectively distinguish task-relevant critical information from task-irrelevant redundant noise during the disentanglement process, leading to information confusion in the representation space. To address this challenge, we propose a deep disentangled multimodal clustering method guided by information theory, named DRLMMC, which employs a tripartite information optimization mechanism to achieve deep disentanglement of cross-modal representations. 1) We design modality-specific encoders to construct nonlinear mapping spaces, transforming the reconstruction mechanism of autoencoders into an information-theoretic mutual information (MI) constraint problem and preserving the unique features of different modalities; 2) To establish cross-modal semantic associations, we construct a cross-modal shared information extraction module and, based on an information-theoretic framework, design an optimization objective function that progressively aligns multimodal feature subspaces through MI maximization and contrastive learning, capturing task-relevant invariant features across modalities; 3) We propose a unique-information dynamic perception module, which employs a conditional MI projection network combined with learning distribution regularization to adaptively extract and enhance modality-specific, task-relevant unique information.
Experimental results demonstrate that DRLMMC outperforms existing state-of-the-art methods on multimodal benchmark datasets, exhibiting excellent generalization ability. Notably, it achieves precise disentanglement of cross-omics features in multi-omics analysis, offering a novel methodological approach for handling complex biomedical data.

Abstract:
The long-tail recommendation problem remains a significant challenge in modern recommender systems, primarily due to data sparsity and popularity bias, which hinder the accurate ID representation of users and items. Recent advancements in large language models (LLMs) have enabled the direct modeling of user and item semantic representations, offering potential improvements in representation learning through the alignment of these two types of representations. However, systems relying on LLM representation alignment face two critical challenges: (1) the substantial differences between LLMs and recommendation models in terms of training objectives, phases, and data; (2) the pervasive popularity bias in collaborative data. These challenges create a semantic gap between ID representations and semantic representations. Directly aligning these representations risks introducing recommendation-irrelevant noise, disrupting the collaborative information embedded in ID representations, and ultimately leading to suboptimal recommendation outcomes. To address this gap, we propose DeltaRec, a Double-enhancement framework for long-tail Recommendation. DeltaRec tackles the long-tail recommendation problem through two approaches. First, it incorporates semantic information for all items. Second, it provides additional supervision signals specifically for long-tail items. The framework begins by disentangling ID representations into interest representations and conformity representations. To integrate semantic information from LLMs while preserving popularity information, we design a contrastive learning-based semantic alignment module that aligns interest representations with semantic representations. Furthermore, to enhance the representation learning of unpopular items, we introduce a ranking-based behavior alignment module, which provides additional supervision signals for these items. 
To avoid introducing recommendation-irrelevant noise and disrupting collaborative semantics due to excessive alignment, we propose a curriculum learning-based training mechanism. Extensive experiments on real-world datasets demonstrate that DeltaRec effectively mitigates popularity bias and significantly improves long-tail recommendation performance without relying on prior knowledge of popularity distributions.

Abstract:
Sequence classification is a fundamental research issue in data mining and machine learning. However, existing sequence classification methods primarily focus on improving the prediction performance. Although a few methods attempt to enhance interpretability, they often fail to provide intuitive model explanations and come with high computational costs. To fill this gap, we propose an interpretable sequence classification algorithm based on a decision set. Each rule in the decision set is associated with only one discriminative pattern (subsequence), and the classification decision is made based on the single best-matched rule. Hence, the proposed method has good interpretability, since the classification decision is solely determined by one simple, intuitive rule. Experimental results on real-world data sets demonstrate that our algorithm outperforms the state-of-the-art interpretable sequence classification methods in terms of both interpretability and classification accuracy.

Abstract:
Electrical closeness centrality is a classical and robust graph centrality measure. However, existing algorithms for computing electrical closeness centrality are often computationally expensive for large graphs, as they require determining the diagonal elements of the pseudo-inverse of the graph Laplacian matrix, denoted as $L^\dagger$. To address this challenge, we propose novel solutions for approximating $L^\dagger$ by establishing a connection with the inverse of a Laplacian submatrix $L_v$, which is obtained by removing the $v$-th row and column from the original Laplacian matrix $L$. A key advantage of this connection is that $L_v^{-1}$ admits various insightful combinatorial interpretations. Specifically, we present two novel interpretations of $L_v^{-1}$ based on spanning trees and loop-erased random walks, which facilitate the development of efficient sampling algorithms. Building upon these theoretical insights, we introduce two algorithms for efficiently approximating electrical closeness centrality. We extensively evaluate the performance of our algorithms on five real-world datasets. Experimental results demonstrate that our approaches significantly outperform state-of-the-art methods by several orders of magnitude in both running time and estimation accuracy.
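
The link between spanning trees and loop-erased random walks is best known through Wilson's algorithm, which samples a uniformly random spanning tree by repeatedly running a random walk until it hits the current tree and keeping only its loop-erased path. The sketch below shows that classic procedure as background; it is not the estimator proposed in the paper, whose details the abstract does not give.

```python
import random


def wilson_spanning_tree(adj, root, rng):
    """Sample a uniform random spanning tree of a connected undirected graph.

    adj: dict node -> list of neighbours. Returns parent pointers toward root.
    """
    in_tree = {root}
    parent = {}
    for start in adj:
        # Random walk from start until the tree is hit. Overwriting nxt[u]
        # on every revisit erases loops implicitly.
        nxt = {}
        u = start
        while u not in in_tree:
            nxt[u] = rng.choice(adj[u])
            u = nxt[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            parent[u] = nxt[u]
            u = nxt[u]
    return parent


graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
tree = wilson_spanning_tree(graph, root=0, rng=random.Random(0))
```

Every non-root node ends up with exactly one parent, so the result always has one edge fewer than the number of nodes.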

Abstract:
$k$-means clustering is a fundamental problem in many scientific and engineering domains. The optimization problem associated with $k$-means clustering is nonconvex, for which standard algorithms are only guaranteed to find a local optimum. Leveraging the hidden structure of local solutions, we propose a general algorithmic framework for escaping undesirable local solutions and recovering the global solution or the ground truth clustering. This framework consists of iteratively alternating between two steps: (i) detect mis-specified clusters in a local solution, and (ii) improve the local solution by non-local operations. We discuss specific implementations of these steps, and elucidate how the proposed framework unifies many existing variants of $k$-means algorithms through a geometric perspective. We also present two natural variants of the proposed framework, where the initial number of clusters may be over- or under-specified. We provide theoretical justifications and extensive experiments to demonstrate the efficacy of the proposed approach.
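
The local-optimum issue is easy to reproduce. The sketch below runs plain Lloyd iterations on six one-dimensional points forming three well-separated pairs. With a poor initialization, two centers share one pair while a single center covers the other two pairs, which is the kind of mis-specified local solution the framework aims to detect. The toy data and function name are assumptions for illustration.

```python
def lloyd_1d(points, centers, iters=100):
    """Plain Lloyd iterations in one dimension; returns (centers, cost)."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost


points = [0, 1, 10, 11, 20, 21]
stuck_centers, stuck_cost = lloyd_1d(points, [0, 1, 15])   # cost 101.0
good_centers, good_cost = lloyd_1d(points, [0, 10, 20])    # cost 1.5
```

In the stuck solution, splitting the over-wide cluster around 15.5 and merging the two centers near 0 and 1 would recover the good solution, which is an example of a non-local operation in the sense of step (ii).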

Affiliations: Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China; School of Information Science and Technology, Northwest University in Xi’an, Shaanxi, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China; State Key Laboratory of Electromechanical Integrated Manufacturing of High-Performance Electronic Equipment at Xidian University, Xi’an, China

Abstract:
Sample-anchor co-clustering has demonstrated potential in improving clustering efficiency; however, existing methods face two major limitations. First, the intrinsic geometric relationships among anchors are often overlooked, leading to insufficient smoothness in the anchor cluster structure. Second, the inability to directly infer discrete one-hot pseudo-labels for both samples and anchors undermines the stability and interpretability of clustering results. To address these challenges, we propose BGFC, a bipartite graph factorization clustering model. BGFC employs non-negative matrix factorization of the bipartite graph to directly generate one-hot pseudo-labels for both samples and anchors, enhancing local consistency in label assignments. In addition, a compact anchor similarity graph is constructed and refined via low-rank decomposition to explicitly promote the consistency of pseudo-labels among geometrically related anchors. An alternating optimization algorithm is developed to jointly update all model variables, enabling efficient and scalable training. Extensive experiments on benchmark datasets demonstrate that BGFC consistently outperforms state-of-the-art co-clustering methods in both clustering performance and computational efficiency.

Abstract:
In high-dimensional data, features often exhibit complex correlations and redundancies that hinder effective learning and reduce model interpretability. Therefore, extracting critical feature structural information from such data is essential for effective feature selection. This structural information can be represented as a feature graph, which reveals associations among features, including redundancies and correlations. However, most existing feature graph-based methods adopt a single strategy, focusing either on selecting relevant features or removing redundant ones, often resulting in suboptimal performance. Moreover, these methods typically construct static feature graphs using all input features, which may introduce irrelevant or detrimental relationships. Such simplistic approaches and low-quality graphs limit the effectiveness of feature selection. To address these limitations, we propose the Dynamic Feature Graph (DFG) framework that jointly learns feature graphs and selects features. By decoupling feature associations according to category, the DFG framework dynamically identifies key feature relationships, eliminating redundancy across classes while preserving essential intra-class correlations. This results in the selection of both relevant and non-redundant features. Additionally, we introduce sample manifolds with a rank equality constraint to ensure the selection of category-related features. To the best of our knowledge, this is the first dynamic feature graph approach in the field. Extensive experiments on 15 widely used real-world datasets demonstrate that DFG outperforms 20 state-of-the-art feature selection methods, highlighting its effectiveness in supervised and unsupervised settings.

Affiliations: Computer Network Information Center, CAS and University of Chinese Academy of Sciences, Beijing, China; Department of Computer Science, Portland State University, Portland, OR, USA; Department of Computer and Information Science, The State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China; Arizona State University, Tempe, AZ, USA; Virginia Tech, Blacksburg, VA, USA; IBM T. J. Watson Research Center, Yorktown Heights, NY, USA; Duke University, Durham, NC, USA

Abstract:
Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples. By leveraging data augmentation techniques, AI models can achieve significantly improved applicability in tasks involving scarce or imbalanced datasets, thereby substantially enhancing their generalization capabilities. Existing surveys each focus on a single data modality and categorize methods from modality-specific and operation-centric perspectives. They therefore lack a consistent summary of data augmentation methods across multiple modalities and limit the understanding of how existing data samples serve the data augmentation process. To bridge this gap, this survey proposes a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities by investigating how to take advantage of the intrinsic relationship between and within instances. Additionally, it categorizes data augmentation methods across five data modalities through a unified inductive approach.

Abstract:
Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To tackle these issues, we propose Graph Condensation via Gaussian Process (GCGP), a novel and efficient framework that optimizes a compact, high-fidelity condensed graph, enabling effective training of various GNNs with reduced computational cost. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges.
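
The continuous relaxation of a binary edge can be sketched with the binary Concrete (relaxed Bernoulli) distribution: add logistic noise to a logit, divide by a temperature, and apply a sigmoid. The sketch below handles a single edge with plain floats; the parameter names are assumptions, and how GCGP parameterizes and optimizes the full adjacency matrix is not specified in the abstract.

```python
import math


def binary_concrete(logit, u, temperature):
    """Relaxed Bernoulli sample in (0, 1).

    logit: learnable log-odds that the edge exists.
    u: a Uniform(0, 1) draw, passed in explicitly to keep the function deterministic.
    temperature: smaller values push samples toward 0 or 1.
    """
    noise = math.log(u) - math.log(1.0 - u)  # Logistic(0, 1) noise
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))


soft_edge = binary_concrete(logit=0.0, u=0.5, temperature=1.0)    # 0.5
nearly_on = binary_concrete(logit=2.0, u=0.5, temperature=0.1)    # close to 1
nearly_off = binary_concrete(logit=-2.0, u=0.5, temperature=0.1)  # close to 0
```

Because the output is a smooth function of the logit, gradients can flow through the sampled edge weight, which is the property that makes gradient-based optimization of a discrete structure possible.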

Abstract:
Survival analysis is extensively employed to analyze the probability of the event of interest, particularly in the medical field. Most current research treats patients as isolated entities, neglecting the complex associations among them, which leads to underutilization of valuable information. Recently, several studies have addressed this limitation by incorporating patient graph structures. However, these approaches generally overlook two critical issues: 1) the exploration of heterogeneous inter-patient relationships, and 2) flexible and scalable inductive inference for test samples. To overcome these challenges, this study introduces a novel framework, Multiplex Graph Guided Deep Survival Analysis (MGG-Surv). Specifically, we employ multiplex patient graphs to capture comprehensive inter-patient associative information. Furthermore, we propose a teacher-student dual network architecture, where the teacher network encodes multiplex graphs, and the learned graph knowledge is transferred to the student network via a unidirectional connection termed Graph-Guided Distillation. The student network integrates this graph knowledge to predict survival outcomes without requiring the patient graphs. These innovative designs facilitate comprehensive integration of inter-patient relationships while achieving flexible and scalable graph-free inference. Experiments on four datasets, encompassing both single and competing risks, demonstrate the superior performance of our framework.

Abstract:
In this paper, we investigate a novel Influence Persistence Maximization (InfPM) problem in temporal social networks. Given a temporal graph, InfPM aims to identify a fixed seed node set S that maximizes the total duration of persistent influence across consecutive snapshots. After proving that InfPM is NP-hard, monotonic, and non-submodular, we develop two efficient solutions: (1) RevG, a reverse greedy algorithm that iteratively removes low-contribution nodes, and (2) LRep, a replacement-based method that progressively improves the quality of the seed node set. To accelerate influence computation in RevG and LRep, we propose a new influence computation method integrating snapshot compression, probability-aware sampling, and a specialized influence estimator offering unbiased estimation. Additionally, we explore a practical variant of InfPM, termed Win-InfPM, which relaxes the requirement of consecutive snapshots by introducing a flexible time window model. Extensive experiments on seven real-world networks demonstrate that (1) RevG and LRep effectively identify high-quality seed nodes, achieving up to 100% improvement in total influence persistence over the baselines; and (2) the proposed influence computation method improves the efficiency of RevG and LRep by up to 400%, while maintaining comparable influence persistence.
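
The reverse greedy idea (start from all candidates and repeatedly drop the node that contributes least) can be sketched generically as follows. The objective used here is a toy coverage count standing in for influence persistence. It, the `reach` data, and the function names are hypothetical, and the sketch omits the sampling-based influence estimation the abstract describes.

```python
def reverse_greedy(candidates, k, objective):
    """Shrink the candidate set to k nodes by removing, at each step,
    the node whose removal leaves the largest objective value."""
    seeds = set(candidates)
    while len(seeds) > k:
        # sorted() makes tie-breaking deterministic (first maximal node wins).
        victim = max(sorted(seeds), key=lambda v: objective(seeds - {v}))
        seeds.remove(victim)
    return seeds


# Toy stand-in objective: number of distinct items reached by the seed set.
reach = {"a": {1, 2, 3}, "b": {3}, "c": {4, 5}, "d": {1}}


def coverage(seed_set):
    covered = set()
    for node in seed_set:
        covered |= reach[node]
    return len(covered)


selected = reverse_greedy(reach.keys(), k=2, objective=coverage)
```

In this toy instance the redundant nodes `b` and `d` are removed first, leaving `a` and `c`. Each round evaluates the objective once per remaining node, which is why a fast influence estimator matters for this style of algorithm.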

Abstract:
Network embedding is a fundamental technique to project a network into a lower-dimensional space while preserving similarities among nodes. Traditional network embeddings primarily capture node proximity, making them effective for community detection but insufficient for identifying roles, i.e., patterns of interaction beyond local neighborhoods. To address this limitation, we introduce a simple and efficient embedding technique based on approximate variants of equitable partitions. Our approach, called ε-BE, introduces a user-tunable tolerance parameter relaxing the otherwise strict condition for exact equitable partitions, which can hardly be found in real-world networks. We exploit a relationship between equitable partitions and equivalence relations for Markov chains and ordinary differential equations to develop a partition refinement algorithm for computing an approximate equitable partition in polynomial time. We extend this framework to weighted and directed networks, ensuring applicability to a more general class of graphs and filling a gap in the literature where few approaches are present. We compare our method against state-of-the-art embedding techniques on synthetic and real-world networks. We report comparable, when not superior, performance for visualization, classification, clustering, and regression tasks with smaller running times, enabling the embedding of large-scale networks that could not be efficiently handled by most of the competing techniques. These results and the capability to handle weighted and directed networks make our approach a compelling alternative for structural network embedding.

Abstract:
As an efficient model compression technique, knowledge distillation has become an important research topic in the field of deep learning. However, the requirement of pre-trained teacher networks makes the process cumbersome and inefficient, which prompted researchers to propose a more efficient mechanism. Therefore, self-knowledge distillation (SKD) is proposed, which does not require assistance from additional teacher networks. In previous surveys, self-knowledge distillation has usually been considered a special case of knowledge distillation. In recent years, significant progress has been made in self-knowledge distillation, which has evolved beyond the functions or roles of traditional knowledge distillation. However, there is no dedicated and comprehensive survey of self-knowledge distillation methods up to now. Therefore, this paper reviews and investigates existing self-knowledge distillation methods from a comprehensive perspective. Specifically, first, based on the differences in knowledge sources, this paper categorizes self-knowledge distillation methods into three types, label knowledge-based, feature knowledge-based and data knowledge-based. Then, this paper introduces the evaluation protocol and performance of SKD. In particular, the commonly used experimental datasets and evaluation networks are summarized, aiming to encourage researchers to choose common network architectures and evaluation datasets for promoting the standardization of fair comparison of self-knowledge distillation methods. Finally, this paper summarizes the applications of self-knowledge distillation in different task scenarios, enabling researchers to quickly locate the relevant task fields. Furthermore, this paper introduces the technical purposes of applying self-knowledge distillation and explores the core motivations for using SKD across different tasks, facilitating the application of self-knowledge distillation to fields in which it has not yet been widely explored.

Abstract:
Partial Multi-Label Learning (PML) is an emerging weakly supervised learning framework, where each instance contains a candidate label set with only some labels being ground-truth labels. Many existing PML methods recover the information of the ground-truth label set through k-Nearest Neighbor (kNN) disambiguation. However, this popular strategy might be suboptimal, as it makes disambiguation for a given instance based solely on its neighbors’ features and class labels, i.e., the local structural information in the feature space, thereby missing the opportunity to explicitly and sufficiently leverage the global structural information in the feature space to facilitate disambiguation. In this paper, we propose a novel algorithm called PRAG, i.e., PaRtiAl multi-label learning by exploiting Global information, which incorporates the global factor obtained from the features of all the training instances into the kNN disambiguation process. Specifically, we learn for each instance a global factor vector, which captures the global affinity between an instance and each label across the feature space. This global factor vector is continuously updated through iterative propagation, with each iteration computing the global factor vector based on the similarity between the instance’s features and a dynamically constructed label prototype for each label. The label prototype is formed by aggregating the features of all training instances weighted by their current estimated confidence for that label. Crucially, the global factor vector serves as a weighting mechanism during aggregation of the neighbor labels in the kNN disambiguation step. It effectively injects global structural information into the local disambiguation process, providing a more robust estimation of label confidence by mitigating the limitations of relying solely on potentially noisy local neighbors. Based on the estimated label confidence, PRAG then exploits label correlations to classify instances. 
We conducted extensive experiments on various real and synthetic datasets, and the results show the superiority of PRAG compared to the state-of-the-art methods.
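
One propagation step of the prototype-based global factor described above can be sketched as follows. The abstract specifies confidence-weighted label prototypes and a feature-to-prototype similarity. The use of cosine similarity, the weighted-mean normalization, and all function names are assumptions made for illustration rather than PRAG's actual formulation.

```python
import math


def label_prototypes(features, confidence):
    """Confidence-weighted mean of instance features, one prototype per label."""
    n_labels = len(confidence[0])
    dim = len(features[0])
    prototypes = []
    for label in range(n_labels):
        total = sum(row[label] for row in confidence)
        proto = [0.0] * dim
        if total > 0:
            for x, row in zip(features, confidence):
                for d in range(dim):
                    proto[d] += row[label] * x[d] / total
        prototypes.append(proto)
    return prototypes


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def global_factors(features, confidence):
    """Similarity of every instance to every label prototype (assumed cosine)."""
    prototypes = label_prototypes(features, confidence)
    return [[cosine(x, p) for p in prototypes] for x in features]


features = [[1.0, 0.0], [0.0, 1.0]]
confidence = [[1.0, 0.0], [0.0, 1.0]]
factors = global_factors(features, confidence)
```

In an iterative scheme, the resulting factors would feed back into the label confidences before the prototypes are recomputed. How that update and the kNN weighting are defined is left to the paper.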

Abstract:
Self-supervised learning (SSL) provides a promising paradigm for hypergraph representation learning without reliance on costly labels. However, existing hypergraph SSL methods predominantly employ contrastive learning with instance-level discrimination, encountering two significant challenges: (1) Unreliable negative sampling, where arbitrarily selected negative samples introduce bias by misclassifying similar and dissimilar pairs; and (2) High computational cost, as effective training requires a large number of negative samples. To address these limitations, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised objectives. Specifically, two sampling-free objectives based on canonical correlation analysis serve as node- and group-level signals, while a hierarchical membership-level contrastive objective exploits the cascading overlap structure of hypergraphs. Beyond these challenges, deep hypergraph models are prone to biased predictions against groups defined by sensitive attributes (e.g., gender and race). We theoretically show that imbalanced group contributions during hypergraph message passing amplify such biases. To address this, we propose FairHSSL, a fairness-aware extension of SE-HSSL with a two-level debiasing augmentation strategy. Specifically, we construct a fair hypergraph view via complementary feature- and structure-level adjustments. At the feature level, orthogonal projection removes sensitive information from node representations; at the structure level, rebalance-based perturbation equalizes group contributions during message passing. By aligning the fair and original views under SSL, the model mitigates bias while preserving informative signals. Extensive experiments on 10 real-world hypergraphs demonstrate the effectiveness and efficiency of SE-HSSL.

Abstract:
A directed hypergraph, which consists of nodes and hyperarcs, is a higher-order data structure that naturally models directional group interactions (e.g., chemical reactions of molecules). Although there have been extensive studies on local structures of (directed) graphs in the real world, those of directed hypergraphs remain unexplored. In this work, we focus on measurements, findings, and applications related to local structures of directed hypergraphs, and they together contribute to a systematic understanding of various real-world systems interconnected by directed group interactions. Our first contribution is to define 91 directed hypergraphlets (DHGs), which disjointly categorize directed connections and overlaps among four node sets that compose two incident hyperarcs. Our second contribution is to develop exact and approximate algorithms for counting the occurrences of each DHG. Our last contribution is to characterize 11 real-world directed hypergraphs and individual hyperarcs in them using the occurrences of DHGs, which reveals clear domain-based local structural patterns. Our experiments demonstrate that our DHG-based characterization gives up to 12% and 33% better performances on hypergraph clustering and hyperarc prediction, respectively, than baseline characterization methods. Moreover, we show that CODA-A, which is our proposed approximate algorithm, is up to 36× faster than its competitors with similar characterization quality.

Abstract:
Network embedding (NE) aims to learn low-dimensional node representations, wherein both neural-based (NNE) and factorization-based (FNE) methods commonly employ negative sampling (NS) as an essential component. However, the role of NS differs markedly between these two paradigms: in NNE, negative samples are randomly chosen to facilitate efficient training, while in FNE, the distribution of negative samples plays a pivotal role in deriving the factorized matrix. In this work, we propose LocAPS (Local cluster-based Adaptive Positive Sampling), a novel sampling strategy that adaptively determines positive samples for each node based on local clustering. Building on LocAPS, we develop an enhanced NNE method, VERSE+, which achieves both sampling and training in linear time. For FNE, we introduce an adaptive negative sampling distribution derived from LocAPS, which tailors the sampling probability for each node. This distribution informs the construction of a factorized matrix that adaptively retains information from the similarity matrix. Moreover, its node-wise nature enables the development of FREDE+, an efficient streaming-style NE method with linear time and space complexity. We conduct extensive experiments on multiple real-world datasets, evaluating our methods on node classification and link prediction tasks, demonstrating their effectiveness and superior performance.

Abstract:
Diffusion models have demonstrated promising potential in recommender systems owing to their powerful generative ability. However, due to the inherent sparse nature of real-world recommendation data and the inconsistency in the variation of reconstruction and ranking losses during training, existing works suffer from two issues: 1) Randomly sampled Gaussian noise addition tends to obscure original user preferences. 2) Training for generation and preference learning tasks interferes with each other, limiting the generative ability of the model. To address these issues, we propose SemDiff, a semantic guided diffusion-based collaborative filtering framework. For the first issue, instead of using random Gaussian noise, we leverage rich semantic information by incorporating auxiliary signals from text or image modalities to enhance the input data of denoising model. In response to the second issue, based on a comprehensive analysis of the mutual influence between generation and preference learning in diffusion recommender systems, we build a collaborative training objective strategy to transform the interference between them into mutual collaboration, which jointly enhances the effectiveness of model training. Additionally, we employ multiple GCN layers only during inference to incorporate higher-order co-occurrence information while maintaining training efficiency. Extensive experiments on four real-world datasets demonstrate that SemDiff significantly outperforms state-of-the-art methods. Our SemDiff offers an effective solution for enhancing recommendation performance and suggests a novel paradigm for applying diffusion methods in recommender systems.

Abstract:
Social media platforms have democratised information creation and dissemination by empowering users worldwide to reach vast audiences almost instantaneously. However, these platforms have also become vectors for spreading misinformation such as rumours. If left unchecked, rumours have the potential to cause great economic and political damage and even worsen public health crises. Automated rumour detection systems are imperative to deal with the volume and velocity of information being exchanged on these platforms. Graph Neural Network (GNN)-based approaches have recently emerged as state-of-the-art (SOTA) in automated rumour detection. Despite their performance, these models remain largely opaque, making explaining their predictions challenging, particularly when dealing with noisy social media data. Existing graph explainability techniques struggle to produce high-fidelity contrastive explanations when presented with such noisy data especially when dealing with a multiclass classification problem. To address the issue of noise susceptibility, we propose a novel framework to maximise both contrastivity and fidelity by reframing the explanation task as a maximum margin optimisation problem. Specifically, we impose constraints on explanation set membership and on the influence difference of the prediction explanation set to each other class explanation set. Extensive experiments on real-world datasets show that the proposed method outperforms SOTA methods.

Abstract:
Active learning (AL) is a semi-supervised learning paradigm with human-machine interaction and a limited annotation budget. However, few AL studies have explored distribution inconsistency between the data and the population. In this paper, we consider a basic form of the aforementioned issue, i.e., the training data is non-independently and identically distributed (non-IID) sampled from a class uniformly distributed population. Accordingly, we propose a naïve sample selection plugin, namely generalized active stratified sampling (GASS), to rebalance the sample size of each class during the AL iterative process, resulting in a progressive approximation to the population. We generalize statistical stratified sampling to support the uncertainty strata criterion, forming the statistical foundation of GASS. This method, as a plugin, can seamlessly collaborate with popular information-based strategies. GASS shows superior rebalancing capabilities by analyzing the statistical moment and the class imbalanced index under the Probably Approximately Correct (PAC) theory. Furthermore, models derived with GASS have low Rademacher complexity (RC), indicating low generalization error bounds, and GASS also exhibits strong robustness to prediction perturbations. Experiments were conducted on 5 benchmark image datasets, and the results show that GASS significantly boosts the test accuracy by about 2.38%/3.19% (paired t-test p = 0.01/0.04) and reduces the empirical RC by about 1.42%/1.94% (paired t-test p = 0.01/0.05) on average in class imbalanced/balanced scenarios, respectively. This study establishes a potential benchmark for information-based AL.

Abstract:
The densest subgraphs in multilayer (ML) graphs unveil intricate relationships that are missed by simple graph representations, offering profound insights and applications across diverse domains. In this paper, we present a layer-oriented view of existing density measures for ML graphs and highlight their problems in identifying the densest subgraphs under the layer-oriented densities, including inefficiency, poor approximation ratios, and the lack of a unified algorithmic framework. In light of this, we introduce a new family of vertex-oriented density measures called generalized density. The two parameters q and p allow the generalized density to flexibly adjust its focus in the density evaluation. We investigate the problem of finding the ML subgraph that maximizes the generalized density and show that the problem can be solved using a unified greedy vertex peeling framework with strong approximation guarantees for half of the (q, p) parameter space. Specifically, for four regimes of (q, p), we design tailored vertex-peeling strategies that lead to approximation algorithms with provable approximation ratios and precise time complexity bounds. We also develop a highly efficient implementation that reduces the execution time of greedy peeling to near-linear time for two of the four explored regimes of (q, p). Extensive experiments on ten real-world ML graphs reveal that our generalized density and greedy peeling algorithms can effectively uncover different types of dense ML subgraphs in large-scale ML graphs.
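
For readers unfamiliar with greedy vertex peeling, the sketch below shows the classic single-layer version: repeatedly delete a minimum-degree vertex and remember the intermediate subgraph with the highest edges-per-vertex density. This is a deliberate simplification. It uses ordinary average-degree density on one layer, not the (q, p) generalized density or the tailored peeling rules of the paper, and the function name and example graph are illustrative.

```python
def greedy_peel(adjacency):
    """Return (best_nodes, best_density) found by min-degree peeling.

    adjacency: dict mapping each node to the set of its neighbours
    (undirected, no self-loops). Density is |E| / |V|.
    """
    adj = {u: set(vs) for u, vs in adjacency.items()}  # work on a copy
    edges = sum(len(vs) for vs in adj.values()) // 2
    best_density, best_nodes = -1.0, set()
    while adj:
        density = edges / len(adj)
        if density > best_density:
            best_density, best_nodes = density, set(adj)
        # Peel a vertex of minimum current degree (ties broken by node order).
        u = min(sorted(adj), key=lambda v: len(adj[v]))
        for w in adj[u]:
            adj[w].discard(u)
        edges -= len(adj[u])
        del adj[u]
    return best_nodes, best_density


# A 4-clique {1, 2, 3, 4} with a pendant vertex 5 attached to node 1.
graph = {
    1: {2, 3, 4, 5},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3},
    5: {1},
}
dense_nodes, dense_value = greedy_peel(graph)
```

This simple version rescans all vertices in each round. A bucket queue keyed by degree brings the single-layer procedure down to linear time, which is the kind of implementation detail that matters at scale.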

Abstract:
We study the problem of efficiently computing rankings of joinable attributes in data lakes. Traditional set-overlap measures produce numerous false positives in this scenario, while modern, more accurate Table Representation Learning (TRL) techniques incur prohibitive computational costs. In contrast to the state-of-the-art, we adopt a novel notion of join quality tailored to data lakes relying on a metric that combines multiset Jaccard and cardinality proportion. The proposed metric merges the best of both worlds by leveraging syntactic measures while achieving accuracy scores comparable to those of TRL approaches. Generating rankings of joinable pairs is highly scalable at both preparation and query time, since we train a general-purpose predictive model. Predictions are based on data profiles, succinct and efficiently computed representations of dataset characteristics. Our experiments show that our system, Freyja, matches or improves upon the results obtained by the state-of-the-art while reducing execution costs by orders of magnitude.
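
The two ingredients of the join-quality metric can be computed directly on small columns, as sketched below. The definitions used are assumptions: multiset Jaccard is taken as the sum of per-value minimum counts over the sum of per-value maximum counts, and cardinality proportion as the ratio of the smaller to the larger number of distinct values. The abstract does not state the exact formulas or how the two are combined, so no combined score is computed here.

```python
from collections import Counter


def multiset_jaccard(column_a, column_b):
    """Multiset (bag) Jaccard: sum of min counts / sum of max counts."""
    counts_a, counts_b = Counter(column_a), Counter(column_b)
    intersection = sum((counts_a & counts_b).values())  # per-value minimum
    union = sum((counts_a | counts_b).values())         # per-value maximum
    return intersection / union if union else 0.0


def cardinality_proportion(column_a, column_b):
    """Ratio of the smaller to the larger distinct-value count."""
    distinct_a, distinct_b = len(set(column_a)), len(set(column_b))
    larger = max(distinct_a, distinct_b)
    return min(distinct_a, distinct_b) / larger if larger else 0.0


left = ["x", "x", "y"]
right = ["x", "y", "y", "z"]
mj = multiset_jaccard(left, right)        # 2 shared occurrences out of 5
cp = cardinality_proportion(left, right)  # 2 distinct values vs 3
```

Unlike plain set overlap, the multiset variant is sensitive to repeated values, which is one way duplicated keys can lower an otherwise perfect-looking match.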

Abstract:
Personalized search has been proven to be an effective method to improve ranking quality by tailoring result lists according to the user’s search history. Previous studies achieve personalization by learning a user interest profile from the search log, and decide the candidate document’s ranking score by calculating its relevance with the learned profile vector. However, existing approaches overlook fine-grained interaction signals by treating the candidate document separately from the user’s search history, relying solely on comparisons with a unified interest vector for re-ranking. Leveraging history-document interactions is not trivial due to the challenge of assessing the contributions of fine-grained matching signals within complex evolving interest patterns. In this paper, we address this challenge by helping the model understand these interactions within the evolving interest process through their integration into the interest profiling procedure. Specifically, we hierarchically incorporate these interaction signals as document-aware interests into behavior representations, employing explicit balancing and differentiation mechanisms, while jointly learning the interest pattern from both actual clues derived from original interests and potential insights provided by document-aware interests. Experimental results show that our model obtains substantial improvements over existing methods.

Abstract:
Maritime trajectory modeling is crucial for ensuring the safety and efficiency of maritime transportation. However, the unique challenges of the open ocean—such as the lack of a pre-determined road network and infrequent vessel interactions—render traditional land-based route prediction systems inadequate. To overcome these obstacles, we present ST-Shape, a spatio-temporal trajectory indexing method designed to swiftly retrieve pertinent historical maritime trajectories, thereby facilitating long-term trajectory prediction. ST-Shape approximates trajectories using two-dimensional polygons and constructs the index with shape indices, thus preserving the spatio-temporal properties of each trajectory. Concurrently, we introduce a straightforward yet robust model that harnesses this indexed data to predict vessel movements. To underpin our research, we have curated a comprehensive maritime trajectory dataset from the Atlantic and Pacific Oceans, classified according to diverse navigational scenarios. Our endeavor serves as a foundational step towards expedited spatio-temporal trajectory retrieval for maritime trajectory prediction, marking a significant stride in enhancing maritime safety and navigational efficiency.

Abstract:
To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework that effectively generates and utilizes multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.

Abstract:
Multiplex graphs represent diverse real-world interactions among entities, where multiple relationship types coexist within the same set of entities. These graphs introduce privacy risks, as data collectors can exploit cross-layer dependencies to infer hidden and sensitive connections. In this work, we propose a C2P-M framework that identifies and protects critical connections while preserving the structural information in multiplex graphs. Unlike conventional methods for single-layer graphs that perturb all edges uniformly, C2P-M selectively protects critical connections, maintaining the analytical usability of the graph. To achieve this, we introduce the multiplex p-cohesion model, which incorporates new score functions that account for both intra-layer and inter-layer dependencies, enabling precise identification of critical connections for each vertex. For privacy protection, our method protects the identified critical connections, leveraging an adaptive Randomized Response (RR) mechanism to ensure ε-Local Differential Privacy (LDP). We formally prove that C2P-M satisfies ε-LDP. Extensive experiments on eight real-world multiplex graph datasets demonstrate that C2P-M significantly outperforms baseline privacy-preserving methods, achieving a better privacy-utility trade-off.
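
As background for the Randomized Response mechanism mentioned above, the following sketch shows the basic binary form applied to a single edge bit: the true bit is reported with probability e^ε / (1 + e^ε) and flipped otherwise, so the ratio of report probabilities is at most e^ε. This is the textbook mechanism, not the adaptive variant of C2P-M, and the function names are illustrative.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability p, its complement otherwise."""
    p = keep_probability(epsilon)
    return bit if rng.random() < p else 1 - bit


def debias_mean(observed_mean, epsilon):
    """Unbiased estimate of the true fraction of 1-bits (requires epsilon > 0).

    E[observed] = pi * p + (1 - pi) * (1 - p), solved here for pi.
    """
    p = keep_probability(epsilon)
    return (observed_mean - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(7)
true_bits = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = debias_mean(sum(reports) / len(reports), epsilon=1.0)
```

Smaller ε moves the keep probability toward 0.5, giving stronger privacy but a noisier debiased estimate. Perturbing only selected connections is what the abstract presents as the route to better utility.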

Abstract:
Graph-based clustering has been extensively explored and applied due to its exceptional performance. However, most existing methods operate directly in the original high-dimensional space, where complex nonlinear structures and redundant noisy features often obscure the intrinsic data distribution. Consequently, constructing a reliable similarity graph in such a space is inherently challenging, as uncertainty and noise can significantly degrade clustering performance. To address this issue, this paper proposes a novel graph-based clustering method, Weighted Subspace Graph Learning (WSGL). Specifically, WSGL leverages kernel principal component analysis (Kernel PCA) to construct multiple kernel-based subspaces, effectively capturing nonlinear structures while reducing redundancy and noise. This strategy enhances subspace features from different perspectives, providing a more comprehensive understanding of the data distribution. Next, WSGL learns pairwise relationships across these subspaces, fully exploiting their complementary information to mitigate the limitations of relying on a single original space for capturing the global data structure. Furthermore, to ensure that the learned similarity graph preserves the same number of connected components as the ground-truth clusters, we impose a low-rank constraint on the graph structure. Additionally, considering the varying quality of different subspaces, WSGL introduces a dynamic weighting mechanism that adaptively assigns weights to subspaces based on their contribution to clustering performance, allowing high-quality subspaces to play a more dominant role in the final clustering results. Extensive experiments on multiple high-dimensional datasets demonstrate that WSGL surpasses state-of-the-art methods, validating its effectiveness and superiority in complex clustering tasks.

Abstract:
Owing to their widespread practicability, recommendation systems play an important role in our daily life. Recently, several studies have examined recommendations based on learning from user purchase history using association rules to extract the complementary and substitution relationships between items. However, the integration of generalized rules into a customized recommendation model is a challenging task. This study proposes a novel recommendation model, A3BRec, which incorporates association rules with a transformer network to predict potentially interesting items for the next basket. We introduce a ternary-stage framework to integrate basket associations into sequential next-basket recommendations. Furthermore, extensive experiments were conducted on real-world datasets to demonstrate the performance and superiority of the proposed model over the state-of-the-art methods for various evaluation metrics. We also use a case study to show the improvement and influence of the proposed ternary integration in A3BRec on recommendation quality.

Abstract:
Clustering validation is a fundamental task in cluster analysis. While many clustering validity indices have been proposed, most existing internal validity indices are defined heuristically, lacking a solid statistical foundation and interpretation. To address this limitation, we introduce a new internal validity index that employs hypothesis testing to determine whether two samples belong to the same cluster and defines the index as the proportion of correctly identified sample pairs based on their cluster memberships. To demonstrate the advantages of the proposed validity index, we conduct experiments on various synthetic and real-world data sets. The experimental results indicate that our validation index can outperform both classic and state-of-the-art internal validation indices.

Abstract:
Imbalanced class distribution disrupts the training of a classifier, resulting in biases favoring majority classes. Data oversampling is a common strategy to tackle this issue. However, traditional methods may generate incorrect and unnecessary instances when facing complex data challenges, such as class overlap, small disjuncts, and noise samples. Therefore, there is a need for an oversampling method that can accurately characterize the data distribution. This paper introduces a novel deep generative oversampling approach for balancing the imbalanced tabular data by leveraging diffusion models and Generative Adversarial Networks (GANs). The model comprises a generator constructed from diffusion models and a discriminator with a Noise-Sensitive Auxiliary Classifier (NSAC) and is trained through an adversarial process. The synergy of these two models enhances stability and sample quality compared to GANs, with faster sampling speed and better conditional generating ability than diffusion models. In experimental validation across 22 real-world datasets, our method consistently outperforms six counterparts regarding Accuracy, F1-score, and MCC for binary and multi-class scenarios. Notably, our approach enhances classifier accuracy for minority classes while maintaining a high level for the majority class, a facet often compromised by other algorithms.

Abstract:
Recovering trajectories of all moving vehicles from urban-scale cameras is an attractive but challenging topic for massive video data management. Existing solutions frame it as an iterative image clustering problem. The snapshots from the same vehicle are grouped within a cluster, which is further refined according to the spatial-temporal attributes. However, these approaches exhibit expensive iterative clustering overhead and ineffective exploitation of spatial-temporal clues. Moreover, they are designed for batch processing, facing performance degradation when handling newly collected surveillance data. In this paper, we propose a novel joint representation clustering framework, which recovers trajectories from vehicle snapshots in an efficient and accurate fashion and is inherently suited for processing video streaming data. Technically, spatial-temporal features are explicitly extracted to construct the joint representation, eliminating the need for iterative refinement, which significantly reduces computational overhead. Furthermore, we present a simple yet effective clustering scheme with one-pass scan on joint representations to generate large-scale clusters. To mitigate the dependency on external data, a joint training method based on self-supervised learning is introduced. We conduct extensive experiments in both batch and streaming modes. The results show that in the batch mode, TRACER achieves a speedup of at least 2.3× and yields recovery F1-score improvements of 1.7%-19.6%. In the streaming experimental setup, it achieves 1.1%-27.6% improvement in F1-score, and reduces the average snapshot processing time by up to 84.8%.

Abstract:
Graph data mining techniques in real-world scenarios often encounter significant computational challenges, especially when the graph contains a large number of nodes and edges. Recently, Graph Condensation (GC) has emerged to offer data-centric solutions that address the challenge of graph volume, enhancing the efficiency of graph data mining and storage. Current methods in GC rely solely on optimizing heuristic metrics of one-way maintenance of key information in the condensed graph. However, the maintenance of key information may be insufficient due to the significant condensation ratio, yet these methods lack an effective mechanism to verify and compensate for that. To this end, this paper aims to enhance the maintenance of key information through a reconstruction-based alignment mechanism. More specifically, inspired by the Kolmogorov Complexity, we revisit the theoretical foundations of GC and propose a way-back mechanism that introduces a feedback loop of learning to reconstruct the original graph from the condensed graph, with the objective of key information alignment, namely the WbGC. We modify several GC methods with our mechanism, and the experiments show that our approach provides an enhanced solution for GC.

Abstract:
Graph Neural Networks (GNNs) have become widely popular across various applications, with their vulnerability to adversarial attacks being a key concern. Among the different types of graph attacks, Restricted Black-box Attacks (RBAs) present the strictest constraints, as attackers have limited access only to node features and graph structure. Existing RBAs rely on homophily assumptions or shift-based losses as their objectives to conduct structural perturbations, but we demonstrate that all of these approaches fail on heterophilic graphs. To address this challenge, we introduce node-wise distance metrics as the objective to fundamentally quantify the quality of the graph structure after perturbations. Our theoretical results show that the proposed objective allows RBAs to effectively handle graphs beyond homophily. Leveraging this objective, we propose HetAttack, a scalable method that significantly reduces the distinguishability of nodes on the victim graph. Experiments on both synthetic and real-world graphs confirm the efficacy of HetAttack across varying levels of homophily, achieving performance comparable to split-unknown white-box attacks without prior knowledge of labels or the target model.

Abstract:
In the realm of stock prediction, machine learning models encounter considerable obstacles due to the inherent low signal-to-noise ratio and the nonstationary nature of financial markets. These challenges often result in spurious correlations and unstable predictive relationships, leading to poor performance of models when applied to out-of-sample (OOS) domains. To address these issues, we investigate Domain Generalization techniques, with a particular focus on causal representation learning to improve a prediction model’s generalizability to OOS domains. By leveraging multi-factor models from econometrics, we introduce a novel error bound that explicitly incorporates causal relationships. In addition, we present the connection between the proposed error bound and market nonstationarity. We also develop a Causal Discovery technique to discover invariant feature representations, which effectively mitigates the proposed error bound, and the influence of spurious correlations on causal discovery is rigorously examined. Our theoretical findings are substantiated by numerical results, showcasing the effectiveness of our approach in enhancing the generalizability of stock prediction models.

Abstract:
Random walk centrality is a fundamental metric in graph mining for quantifying node importance and influence, defined as the weighted average of hitting times to a node from all other nodes. Despite its ability to capture rich graph structural information and its wide range of applications, computing this measure for large networks remains impractical due to the computational demands of existing methods. In this paper, we present a novel formulation of random walk centrality, underpinning two scalable algorithms: one leverages approximate Cholesky factorization and sparse inverse estimation, while the other samples rooted spanning trees. Both algorithms operate in near-linear time and provide strong approximation guarantees. Extensive experiments on large real-world networks, including one with over 10 million nodes, demonstrate the efficiency and approximation quality of the proposed algorithms.
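
To make the definition concrete, the following is a minimal sketch of the weighted average of hitting times on a tiny undirected graph. It assumes that the weights are the stationary probabilities of a simple random walk (degree divided by twice the number of edges), which the abstract does not state, and it solves the hitting-time equations exactly by dense elimination. This is the naive baseline that becomes impractical on large networks, not either of the proposed scalable algorithms. The function names are hypothetical.

```python
def hitting_times_to(adj, target):
    """Expected number of steps for a simple random walk to first reach
    `target`, from every node of a connected undirected graph."""
    nodes = [v for v in adj if v != target]
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented system (I - P_restricted) h = 1, one row per non-target node.
    A = [[0.0] * n + [1.0] for _ in range(n)]
    for v in nodes:
        i = idx[v]
        A[i][i] += 1.0
        for u in adj[v]:
            if u != target:
                A[i][idx[u]] -= 1.0 / len(adj[v])
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                for c in range(col, n + 1):
                    A[r][c] -= f * A[col][c]
    h = {target: 0.0}
    for v in nodes:
        h[v] = A[idx[v]][n] / A[idx[v]][idx[v]]
    return h


def average_hitting_time(adj, target):
    """Hitting times to `target`, averaged with stationary weights
    deg(v) / (2 * number of edges). The weighting is an assumption."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    h = hitting_times_to(adj, target)
    return sum(len(adj[v]) / two_m * h[v] for v in adj)


# Path graph 0 - 1 - 2: the middle node is reached faster on average.
path = {0: [1], 1: [0, 2], 2: [1]}
print(average_hitting_time(path, 1), average_hitting_time(path, 0))
```

A smaller averaged hitting time indicates a node that random walks reach quickly, so the middle node of the path is the more central one under this measure.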

Abstract:
With the successful application of granular computing in anomaly detection, a variety of tools including fuzzy information entropy can achieve superior detection results. However, fuzzy information entropy calculates fuzzy similarity through a global strategy, ignoring the local information in the data. To address this deficiency, this paper constructs a fuzzy $k$NN entropy theory and applies it to identify anomalies. Firstly, fuzzy $k$-similarity and fuzzy $k$NN are defined, and $k$NN entropy theory and the related information-theoretic metrics are proposed. Then, the relevant definitions and propositions of fuzzy $k$NN entropy, fuzzy $k$-joint entropy, fuzzy $k$-conditional information entropy, as well as fuzzy $k$-mutual information are elaborated. Based on the proposed theory, an anomaly detection model is constructed. At first, the fuzzy $k$-similarity relation matrix is constructed based on the fuzzy $k$-similarity in the proposed theory, and the relative fuzzy $k$NN entropy is calculated. Based on the relative fuzzy $k$NN entropy, the fuzzy $k$-relation anomaly degree is defined to characterize the anomaly intensity of fuzzy $k$NN information granules. Then, the anomaly factor based on fuzzy $k$NN entropy is built to represent the anomaly degree of data objects. Finally, the corresponding Fuzzy $k$NN Entropy-based Anomaly Detection algorithm (F$k$EAD) is designed. Comparative experiments are conducted with 11 state-of-the-art anomaly detection methods on thirty public datasets. The results reveal that the proposed method achieves better performance.

Affiliations: School of Computer Science and Engineering, Northeastern University, Shenyang, China; School of Computer Science and Engineering, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Data Analytics and Optimization for Smart Industry, the Key Laboratory of Intelligent Computing of Medical Images, Ministry of Education, Northeastern University, Shenyang, China; Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong

Abstract:
A great number of graph analysis algorithms involve iterative computations, which dominate the runtime. Accelerating iterative graph computations has become the key to improving the performance of graph algorithms. While numerous studies have focused on reducing the runtime of each iteration to improve efficiency, the optimization of the number of iterations is often overlooked. In this work, we first establish a correlation between vertex processing order and the number of iterations, providing an opportunity to reduce the number of iterations. We propose a metric function to evaluate the effectiveness of vertex processing order in accelerating iterative computations. Leveraging this metric, we propose a novel graph reordering method, GoGraph, which constructs an efficient vertex processing order. Additionally, for evolving graphs, we further propose a metric function designed to evaluate the effectiveness of vertex processing orders in response to graph changes and provide three optional methods for dynamically adjusting the vertex processing order. Our experimental results illustrate that GoGraph surpasses current state-of-the-art reordering algorithms, improving runtime by an average of 1.83× (up to 3.34×). Compared to traditional synchronous computation methods, our approach enhances the speed of iterative computations by up to 4.46×. In dynamic scenarios, incremental GoGraph can reduce end-to-end time by 43% on average (up to 48%).
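
The correlation between vertex processing order and iteration count can be illustrated with a small sketch. The example below runs an in-place (asynchronous) Bellman-Ford shortest-path computation on a directed path and counts how many rounds change any distance under two different vertex orders. It is only an illustration of the phenomenon; it does not implement GoGraph's metric function or its reordering method, and the function name `sssp_rounds` is hypothetical.

```python
import math


def sssp_rounds(edges, n, source, order):
    """In-place Bellman-Ford: vertices are processed in `order`, and each
    update is visible immediately to vertices processed later in the round.
    Returns the distances and the number of rounds that changed a distance."""
    incoming = {v: [] for v in range(n)}
    for u, v, w in edges:
        incoming[v].append((u, w))
    dist = [math.inf] * n
    dist[source] = 0.0
    rounds = 0
    while True:
        changed = False
        for v in order:
            for u, w in incoming[v]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    changed = True
        if not changed:
            return dist, rounds
        rounds += 1


# Directed path 0 -> 1 -> 2 -> 3 -> 4 with unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)]
_, forward_rounds = sssp_rounds(edges, 5, 0, [0, 1, 2, 3, 4])
_, reverse_rounds = sssp_rounds(edges, 5, 0, [4, 3, 2, 1, 0])
print(forward_rounds, reverse_rounds)
```

Processing vertices along the direction of the edges lets a single round propagate the distance all the way down the path, whereas the reverse order advances it by only one vertex per round.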

Abstract:
Spatial regionalization is the process of grouping a set of spatial areas into spatially contiguous and homogeneous regions. This paper introduces the Incremental Max-P regionalization with statistical constraints (IMS) problem, a regionalization process that supports enriched user-defined constraints based on statistical aggregate functions and supports incremental updates. In addition to enabling richer constraints, it allows users to employ multiple constraints simultaneously to significantly push the expressiveness and effectiveness of the existing regionalization literature. The IMS problem is NP-hard and significantly enriches the existing regionalization problems. Such a major enrichment introduces several challenges in both feasibility and scalability. To address these challenges, we propose the FaCT algorithm, a three-phase heuristic approach that finds a feasible set of spatial regions that satisfy IMS constraints while supporting larger datasets than the existing literature. FaCT supports local and global incremental updates when there are changes in attribute values or constraints. In addition, we incorporate the Iterated Greedy algorithm with FaCT to further improve the solution quality of the IMS problem and the classical max-p regions problem. Our extensive experimental evaluation has demonstrated the effectiveness and scalability of our techniques on several real datasets.

Abstract:
Canonical correlation analysis (CCA) is a widely used multivariate analysis technique for explaining the relation between two sets of variables. It achieves this goal by finding linear combinations of the variables with maximal correlation. Recently, under the assumption that leading canonical directions are sparse, various penalized CCA procedures have been proposed for high dimensional data applications. However, all these procedures have the inconvenience of not preserving the sparsity among the retained leading canonical directions. To address this issue, two new sparse CCA methods are proposed in this paper. The first method is obtained by diagonal thresholding of two square matrices derived from the cross-covariance matrix of the two sets of variables, where each matrix characterizes one set of variables. A model selection criterion is used to select the number of variables to retain from each matrix diagonal. The second method is derived within an adaptive alternating penalized least squares framework where the $\ell_2^1$-norm is used as a penalty promoting block sparsity. Compared to existing sparse CCA methods, the proposed methods have the advantage of preserving the sparsity across the retained canonical loading vectors. Their performance is illustrated in an extended experimental study, which shows the superior performance of the proposed methods.

Abstract:
Transaction flow networks are crucial in detecting illicit activities such as wash trading, credit card fraud, cashback arbitrage fraud, and money laundering. Our collaborator, Grab, a leader in digital payments in Southeast Asia, faces increasingly sophisticated fraud patterns in its transaction flow networks. In industry settings such as Grab’s fraud detection pipeline, identifying fraudulent activities heavily relies on detecting dense flows within transaction networks. Motivated by this practical foundation, we propose the $S$-$T$ densest flow (STDF) query. Given a transaction flow network $G$, a source set $S$, a sink set $T$, and a size threshold $k$, the query outputs subsets $S' \subseteq S$ and $T' \subseteq T$ such that the maximum flow from $S'$ to $T'$ is densest, with $|S' \cup T'| \geq k$. Recognizing the NP-hardness of the STDF query, we develop an efficient divide-and-conquer algorithm, Conan. Driven by industry needs for scalable and efficient solutions, we introduce an approximate flow-peeling algorithm to optimize the performance of Conan, enhancing its efficiency in processing large transaction networks. Our approach has been integrated into Grab’s fraud detection scenario, resulting in significant improvements in identifying fraudulent activities. Experiments show that Conan outperforms baseline methods by up to three orders of magnitude in runtime and more effectively identifies the densest flows. We showcase Conan’s applications in fraud detection on transaction flow networks from our industry partner, Grab, and on non-fungible tokens (NFTs).

Abstract:
Deep learning-based multi-view clustering techniques have attracted considerable attention due to their ability to recover missing views in incomplete multi-view scenarios. Nevertheless, in the federated multi-view learning scenario, these techniques are often constrained by the inherently decentralized nature of the data. Consequently, most existing federated incomplete multi-view methods primarily rely on internal correlations within a single view to recover missing data, failing to leverage complex global cross-view dependencies. This inherent limitation renders them particularly vulnerable to high missing rates, leading to a sharp decline in clustering accuracy and substantially restricting their applicability in complex, real-world scenarios. To address this issue, we propose the Federated Incomplete Multi-view clustering framework with Cross-view relationship Imputation, termed FIMCI. Specifically, we employ a Transformer-based encoder at each client and server to capture cross-view relationships, thereby completing missing data recovery and extracting view-specific information. We then design a dynamic view-fusion mechanism at the server, which adaptively assigns view weights and provides feedback to the clients. Furthermore, we implement category-level contrastive learning to enhance the robustness of the consensus representation through pseudo-label generation. In this way, FIMCI explores the consistency and complementarity between views through global view weight allocation and local view encoders, enabling the completion of missing views and clustering tasks while better protecting data privacy. Experimental results on multiple multi-view datasets verify that our method outperforms existing advanced methods in terms of both performance and efficiency.

Abstract:
Out-of-distribution (OOD) detection has attracted increasing attention for identifying test samples that exhibit a distributional shift from the training dataset in practical deep learning applications. With the significant advancements in graph deep learning for graph representation, graph OOD detection has emerged as a research problem. Graph contrastive learning (GCL) is applied to graph OOD detection due to its capacity for learning discriminative representations in a self-supervised manner, thereby eliminating the need for time-consuming and labor-intensive label information. However, existing methods often neglect the explicit consideration of underlying semantics behind graph data distribution for OOD detection. We observe that naive data augmentations in GCL may inadvertently compromise the intrinsic graph structure while retaining redundant structural information, which hinders semantic discrimination between graphs. Additionally, Euclidean space embedding struggles to maintain hierarchical structural consistency, making it challenging to meaningfully capture the hierarchical semantic distribution of graph data. In response to these issues, we propose a novel framework termed HGOOD-D, which aims to explore latent semantic hierarchies in hyperbolic space for graph OOD detection. Specifically, we design a bottleneck graph extractor grounded in the information bottleneck (IB) principle, which captures the minimal sufficient information to distinguish graph patterns. Based on this, we introduce hierarchical contrastive learning to capture the hierarchical semantics within graph data distribution. These methods are based on hyperbolic space embedding that can preserve complex inter-relationships in graph hierarchies, thereby mitigating data distortion. Comprehensive evaluations on ten widely used benchmark datasets show that HGOOD-D consistently surpasses current state-of-the-art approaches in graph OOD detection.
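
As background for why hyperbolic embeddings suit hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model HGOOD-D uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball,
    using d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2) (1 - |v|^2)))."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


origin = (0.0, 0.0)
# The same Euclidean step of 0.05 costs far more distance near the boundary.
near_center = poincare_distance((0.00, 0.0), (0.05, 0.0))
near_boundary = poincare_distance((0.90, 0.0), (0.95, 0.0))
print(near_center, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, so there is exponentially more room for the leaves of a hierarchy than in a Euclidean embedding of the same dimension.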

Abstract:
Multi-view clustering (MVC) with bipartite graphs has been extensively studied to rapidly handle multi-source heterogeneous information via sparse anchors. However, most existing methods follow a two-stage learning paradigm that first learns a continuous label matrix and then discretizes it, which not only introduces extra trade-off parameters but also yields suboptimal solutions. Also, numerous methods still exhibit limited scalability for large-scale problems. Thus, this paper proposes two novel models for discrete, trade-off parameter-free, and rapid MVC. First, the Bipartite Graph Discrete Reconstruction (BGDR) model uniquely leverages the discrete label matrices of both samples and anchors to dynamically reconstruct a consensus bipartite graph across views. This concise reconstruction style eliminates redundant computations, and the anchor labels enrich the cluster partition information during reconstruction, enhancing both accuracy and efficiency. The final clustering outcomes are directly acquired via discrete sample labels. Second, to free the optimization time overhead from the limitation of sample size, we further devise the Compact Graph Discrete Reconstruction (CGDR) model, which reconstructs a smaller compact affinity graph among anchors for significant acceleration. Original sample labels are then obtained by label propagation. Systematic experiments show that both models achieve superior results in terms of efficacy and efficiency.

Abstract:
Text-Attributed Heterogeneous Graphs (TAHGs) integrate topological relationships with rich textual node attributes, offering expressive representations for complex multi-faceted data. While recent methods jointly leverage textual and structural information, they still face two critical limitations: (i) existing approaches are constrained to neighborhood modeling, failing to capture semantic dependencies in higher-order topologies; (ii) current techniques lack adequate unified alignment strategies, limiting dynamic interaction across modalities. To address these challenges, we propose SATH, a self-supervised information aggregation model for TAHGs, designed to effectively leverage textual and structural information within TAHGs. SATH aggregates higher-order neighbor textual attributes through comparative learning, and dynamically aligns these attributes to higher-order topologies through a unified strategy. This approach integrates both types of information effectively, enhancing the expressiveness and discriminative capability of the learned node representations in downstream tasks. Extensive experiments on real-world datasets demonstrate that SATH significantly outperforms baseline models while eliminating the need for manual meta-path design or text feature concatenation. It also improves efficiency and scalability on large-scale TAHGs, achieving superior representation quality in TAHG-based tasks.

Abstract:
Biclique percolation community (BPC) search is a fundamental problem in bipartite graph analysis and has many applications. In many real-world scenarios, bipartite graphs are temporal, and ignoring time information can lead to communities that lack temporal cohesiveness. Motivated by this, we propose efficient index-based approaches for both static and temporal BPC search. Based on the index, we can obtain the static BPCs in near-optimal time with well-bounded index space. For temporal bipartite graphs, we introduce the $(\alpha, \beta, \theta)$-TemBPC model, which captures both structural and temporal cohesiveness. We extend the BPC-Index to efficiently enumerate and process temporal bicliques. Furthermore, several key optimizations are incorporated to accelerate the search processing. We conduct extensive experiments on 11 real bipartite graphs, and the experimental results demonstrate the effectiveness of the BPC models and the efficiency of our static and temporal BPC search algorithms.

Abstract:
Community extraction for multi-layer networks is a fundamental problem. Community extraction methods for multi-layer networks typically rely on a single-layer network, weights for each layer, and an additional clustering step. Current methods usually utilize average or empirical weights for each layer, which may not be reasonable. To address the challenge of assigning better weights to the layers, this paper proposes a Spectral Clustering Community Detection (SCCD) optimization model based on a unified similarity matrix. Specifically, the unified similarity matrix is defined via an optimization model that uses a weighted combination of similarity matrices from each layer, where the layer weights are learned under simplex constraints. A rank constraint is further imposed on the Laplacian matrix of the unified similarity matrix, which intrinsically leads to the division of the nodes into the desired number of clusters. Then, the SCCD optimization model is proposed, and its ability to generate community partitions is shown. An alternating minimization algorithm with simplex projections and closed-form updates is developed to solve the model, and its convergence is proven. Finally, numerical experiments are conducted on both synthetic and real-world multi-layer networks. Compared with several methods reported in the references, our method improves average NMI by 6.78% and ARI by 6.50% over the strongest baselines on the AUCs, UCI_mfeat, Wikipedia, and Primaryschool networks—see Table III.
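
Learning layer weights under simplex constraints typically requires a projection step onto the probability simplex. The following is a minimal sketch of the standard sort-based Euclidean projection, shown only to illustrate what such a step looks like; the abstract does not specify which projection routine SCCD uses, and the function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex
    {w : w_i >= 0, sum(w) = 1}, using the classical sort-based method."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # Keep the threshold from the largest j whose entry stays positive.
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


# Hypothetical unnormalized weights for three network layers.
weights = project_to_simplex([0.9, 0.4, -0.2])
print(weights)
```

The result is non-negative and sums to one, so it can be used directly as a set of layer weights in a weighted combination of similarity matrices.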

Abstract:
The deployment of sensors enables data-driven urban management, but necessitates inductive spatio-temporal kriging to infer unmonitored areas. Existing methods impute these unknown observations by smoothing temporal features based on spatial dependencies, overlooking the decoupling of inherent properties and dynamic correlations in message passing. In particular, the inherent properties reveal non-transitive signals, and current coupled aggregation leads to inaccurate results. To this end, we propose TempoRAl deCoupled Kriging, named TRACK, to decouple two factors with the help of node-specific inherency. Specifically, we first construct a node-specific profile to represent its inherency including geographical and periodic features, which is subsequently transformed into decoupling prompts. Secondly, the coupled temporal features are separated through querying each prompt embedding, facilitating precise temporal aggregation for inherent properties and spatial aggregation for dynamic correlations. Finally, a multi-task training strategy is further adopted to mimic the inductive scenarios during testing. We evaluate TRACK on four real-world datasets spanning urban traffic and air quality prediction tasks. TRACK achieves state-of-the-art performance, with average improvements of 3.10% in MAE and 4.45% in RMSE over strong baselines. Moreover, we further demonstrate its robust generalization in a challenging cross-city inductive setting.

Abstract:
The inherent fluctuations in the stock market present significant challenges in understanding stock dynamics, especially for investment decisions based on stock ranking. Recent advancements in learning-based methods have led to promising results in exploring temporal dependencies to understand stock movements. However, they often assume stable, certain, and reliable environments, narrowing their insight into the complex and fluctuating nature of markets. This complexity is driven by two influential factors: the explicit consistency of dynamic yet stable trends across diverse temporal patterns, coupled with the implicit interplay of logic and possibility under uncertainty. Hence, we introduce a Sensitivity-aware Dependency Learning solution (SDL) for stock ranking. With bridging the ideal and reality in mind, SDL captures short-term fluctuations under the guidance of long-term dependencies, associated with the augmentation of counterfactual knowledge. Specifically, SDL devises a Short-term Co-integration Detector (SCD) that concentrates on capturing time-varying correlations and immediate market reactions, in addition to multi-period attention. Furthermore, a Long-term Co-movements Tracker (LCT) takes advantage of enduring industry relationships and incorporates counterfactual knowledge, allowing the model to generalize beyond observed patterns and identify diverse long-term trends. Comprehensive experiments on five real-world stock markets demonstrate that our proposed SDL outperforms several representative baselines.

Abstract:
The surge in online interactions and advances in Big Data techniques have generated vast amounts of hybrid data, in the sense that the data contain symbolic, numerical, or missing features, and usually only a small number of data objects possess true labels due to high annotation costs. A necessary step in fully releasing the potential of these partially labeled hybrid data is feature selection, for which the neighborhood rough set (NRS) is an efficient mathematical method. In NRS, setting proper neighborhood granules greatly influences the effectiveness and robustness of the algorithms built on it. However, existing methods usually determine the optimal neighborhood radius of the neighborhood granule via computationally intensive grid search, where the neighborhood radius for each object is the same, i.e., “unadaptive”. Some methods investigate adaptive granulation strategies, yet they inevitably hinge on preset parameters or a priori knowledge. To tackle this problem, we propose an adaptive granules-enabled semi-supervised feature selection method that can adaptively generate suitable neighborhood radii for both labeled and unlabeled objects. The core idea lies in using the purity of decision labels as the threshold for granularity maximization construction. Then, by combining neighborhood entropy and local density, a feature metric is designed to measure feature significance. A semi-supervised feature selection algorithm is utilized to select a feature subset by using the information from both labeled and unlabeled objects. Instead of hinging on expert knowledge, the proposed method relies only on the data per se. Experimental results on real-world datasets demonstrate the effectiveness of the designed method and its superiority over other state-of-the-art methods.

Abstract:
Large-scale approximate nearest neighbor search (ANNS) has become a fundamental operation in a wide range of modern applications, including recommendation systems and large language models. Partition-based indexes have emerged as a popular solution for billion-scale ANNS tasks, serving as the basis for many ANNS approaches. However, our analysis indicates that, to achieve optimal search performance on billion-scale datasets, an extremely large number of partitions (tens or even hundreds of millions) is often required for fine-grained partitioning of the feature space. Relying on full-precision distance calculations for constructing and querying such a large number of partitions imposes significant time costs. In this work, we propose a novel geometric distance inference mechanism that leverages geometric relationships to expedite distance computations between the vector and space partitions. By reusing intermediate or offline-computed distance information, this method substantially reduces the overhead of full-precision calculations. We also introduce a clustering paradigm for generating space partitions that incorporates this geometric distance pattern, which can be seamlessly integrated with other indexing schemes such as vector quantization and proximity graphs. Through detailed complexity analysis and extensive experiments on billion-scale datasets, we confirm the efficiency of our geometric index (GI) design. Empirical results show that GI-based solutions consistently surpass various baseline methods on search efficiency. In particular, they offer considerable acceleration (exceeding a factor of 2.0) at high recall levels (e.g., Recall10@10 = 95%) to partition-based solutions, while also demonstrating comparable or superior query throughput relative to leading graph-based indexes.

Abstract:
Concept drift—characterized by the evolving and unpredictable nature of data distributions—is a persistent challenge for long-term modeling in non-stationary environments. In multi-stream scenarios, this problem is further complicated by the dynamic and often entangled dependencies across streams, which cannot be effectively handled by traditional single-stream adaptation methods. To address these challenges, we propose a Multi-scale Adaptive Convolutional Graph framework for multi-stream concept drift, called MACG. Our framework represents the dynamic inter-dependencies between streams through a flexible and efficient graph structure, driving stable, long-term multi-step prediction tasks across multiple streams. In MACG, conventional reliance on pre-defined graphs is replaced by a multi-scale adaptive convolutional graph learned directly from historical data. By considering multiple time scales, this adaptive structure is able to capture richer and more complex spatial–temporal dependencies, thereby overcoming the limitations of static graphs and achieving high generalizability to unknown or evolving data distributions. During online testing, the model adaptively updates its graph structure and proactively predicts drift by identifying structurally analogous patterns between current drift events and historical drift patterns. This joint design effectively mitigates catastrophic forgetting by leveraging both the recurrence and learnability of drift patterns. Rather than merely repeating past responses, the adaptive graph enables the model to generalize transferable patterns and dynamically update dependencies in drift-affected regions, ensuring that adaptations are both context-aware and responsive to new changes. Extensive experiments on four large-scale real-world datasets show that MACG consistently outperforms state-of-the-art baselines in prediction accuracy.

Abstract:
Online learning for data streams has gained significant attention in recent years. Existing methods impose various limitations on the feature space, classes, and labels of data streams. In contrast, real-world data streams often exhibit dynamic feature spaces, missing or partial feature values, undefined classes, and inaccessible labels. These complexities limit the applicability of current online learning methods to real-world scenarios. To fill the gap, this study explores a new online learning problem and proposes a novel algorithm: Online Learning for Fickle Data Streams without Labels (OLFL). Specifically, OLFL rests on three main ideas: 1) It tackles dimensional turbulence through a completion strategy that collaborates with weight vectors, reducing the over-reliance on surviving features; 2) It implements known class classification and new class detection via the density of core points within the neighborhood of each instance; 3) It utilizes a constraint-based adaptive coefficient vector to update the model. We evaluate OLFL on extensive real-world and synthetic datasets against strong baselines, covering fickle feature spaces, emerging classes, and varying missingness rates. We further conduct ablation studies and case analyses. The results consistently show that OLFL outperforms the baselines in both accuracy and efficiency.

Abstract:
Bipartite graphs are widely used to model relationships between entities of different types, where nodes are divided into two disjoint sets. Similarity search, a fundamental operation that retrieves nodes similar to a given query node, plays a crucial role in various real-world applications, including machine learning and graph clustering. However, existing state-of-the-art methods often struggle to accurately capture the unique structural properties of bipartite graphs or fail to incorporate the informative node attributes, leading to suboptimal performance. Besides, their high computational complexity limits scalability, making them impractical for large graphs with millions of nodes and tens of thousands of attributes. To overcome these challenges, we first introduce Attribute-augmented Hidden Personalized PageRank (AHPP), a novel random walk model designed to blend seamlessly both the higher-order bipartite structure proximity and attribute similarity. We then formulate the similarity search over attributed bipartite graphs as an approximate AHPP problem and propose two efficient push-style local algorithms with provable approximation guarantees. Finally, extensive experiments on real-world and synthetic datasets validate the effectiveness of AHPP and the efficiency of our proposed algorithms when compared with fifteen competitors.
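
For readers unfamiliar with push-style local algorithms, the sketch below shows the classical forward push procedure for ordinary Personalized PageRank on an undirected graph. It is not the AHPP model or either of the paper's two algorithms, which also account for bipartite structure and node attributes; the function name `forward_push` and the parameter names `alpha` (teleport probability) and `eps` (residual threshold) are assumptions chosen for illustration.

```python
from collections import deque


def forward_push(adj, source, alpha=0.15, eps=1e-4):
    """Classical forward push: returns (estimate, residual) dictionaries.
    A node is pushed while its residual is at least eps * degree.
    Assumes every node has at least one neighbor."""
    estimate = {v: 0.0 for v in adj}
    residual = {v: 0.0 for v in adj}
    residual[source] = 1.0
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        degree = len(adj[u])
        if residual[u] < eps * degree:
            continue
        mass = residual[u]
        estimate[u] += alpha * mass          # keep a fraction alpha at u
        residual[u] = 0.0
        share = (1.0 - alpha) * mass / degree
        for v in adj[u]:                     # spread the rest to the neighbors
            residual[v] += share
            if v not in queued and residual[v] >= eps * len(adj[v]):
                queue.append(v)
                queued.add(v)
    return estimate, residual


# Small undirected example graph (adjacency lists).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
scores, leftover = forward_push(graph, source=0)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

The procedure only touches nodes whose residual exceeds the threshold, which is what makes it local: its cost depends on `alpha` and `eps` rather than on the size of the whole graph.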

Abstract:
Few-shot node classification (FSNC) is a challenging task in graph analysis, where the goal is to classify unlabeled nodes in a graph using only a few labeled nodes as references. To tackle the label shortage problem, many meta-learning methods have been proposed to extract meta-knowledge from base classes with abundant labeled nodes and transfer the learned knowledge to classify nodes from novel classes. However, the theoretical foundation of meta-knowledge remains unexplored, and existing solutions often struggle when dealing with complex or noisy graphs. To address these issues, we propose a novel and effective meta-learning framework for FSNC based on structural information theory. First, we introduce the concept of minimal sufficient meta-knowledge, a theoretical principle inherited from information bottleneck, which optimally balances the expressiveness and robustness of the learned meta-knowledge. Guided by this principle, we develop a meta-learning model, named SE-FSNC, that extracts the minimal sufficient meta-knowledge using an encoding tree derived from the input graph with minimal structural entropy. We then propose an effective algorithm to train SE-FSNC by incorporating the encoding tree with graph contrastive learning. Extensive experiments on several datasets demonstrate the superiority of our model compared with other state-of-the-art methods.

Affiliations: School of Computer Science and Engineering, Southeast University, Nanjing, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong; National Key Laboratory of Information Systems Engineering, Department of Systems Engineering, National University of Defense Technology, Changsha, China; Singapore University of Technology and Design, Singapore; School of Computer and Control Engineering, Yantai University, Yantai, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

Abstract:
Incomplete multi-view clustering (IMVC) has attracted increasing attention in recent years, owing to the prevalence of missing data in real-world multi-view scenarios. Existing imputation-based IMVC methods partially mitigate the impact of missing information but still face three key limitations: (i) overlooking latent structural relationships among samples, which leads to imputed representations deviating from the true distribution; (ii) decoupling imputation from clustering, which reduces the discriminability of the recovered representations; and (iii) exhibiting low efficiency, which makes it difficult to balance recovery quality and inference speed under complex missing scenarios. To address these issues, we propose a Structure-Aware Conditional Diffusion Generation (SACDG) framework. During training, SACDG first models local structural relationships via adaptive neighborhood graphs and injects them as conditional priors into the diffusion model, where a cross-attention mechanism integrates these priors into the noise prediction process to learn structure-aware generative capability. Meanwhile, a semantic distribution alignment module is introduced to leverage pseudo-labels for enforcing cross-view consistency, thereby enhancing semantic discriminability. During inference, SACDG integrates cross-view structural information through cross-view adjacency fusion to guide the reverse denoising trajectory, and employs deterministic DDIM sampling to efficiently and stably recover the representations of missing views. Extensive comparative experiments and ablation studies on multiple benchmark datasets demonstrate that SACDG achieves superior clustering performance and improved efficiency over state-of-the-art methods.

Abstract:
Accurate classification of healthcare time series is critical for clinical decision-making. However, existing models often struggle under real-world data shifts and lack interpretability—two key requirements for reliable medical deployment. To address these challenges, we propose SHINE, a novel end-to-end framework that learns disentangled and shift-invariant representations by modeling the generative process of multivariate healthcare signals. Specifically, SHINE first introduces a genuine data representation learning that disentangles healthcare signals into trend, seasonality, and noise components, reflecting distinct temporal dynamics of healthcare series. Then, we inject several inductive biases into each component to encourage latent representations to be invariant to data shifts and aligned with their corresponding semantic units. Extensive experiments on six healthcare benchmarks spanning ECG, EEG, and continuous glucose monitoring (CGM) domains—under a variety of simulated real-world shift scenarios—demonstrate that SHINE consistently outperforms state-of-the-art baselines, providing robust performance and clinically meaningful interpretations grounded in the estimated components.

Abstract:
As a basic machine learning task, Multi-View Classification (MVC) has garnered considerable attention and achieved great success. However, existing MVC methods, especially late-fusion-style ones, still suffer from two problems: 1) hidden valuable information is not well exploited; 2) there is a lack of interaction before decision making. To address these problems, we propose a novel framework named “TrashtoTreasure” that leverages mutual information to effectively exploit hidden valuable information. Specifically, the framework explicitly disentangles multi-view information into “useful” components and “trash” (noisy) components, and further extracts potentially valuable “treasure” information from the “trash” components of all views. Additionally, we design a tailored objective function that facilitates the effective separation of “useful” and “trash” components, as well as the synergistic extraction of “treasure” information. This function guides model optimization through triple mutual information constraints. Experimental results on synthetic data and several real-world data sets verify the effectiveness and superiority of the proposed method. The fresh perspective offered by this article may inspire more interesting exploration in this direction.

Abstract:
Semi-supervised community detection seeks to find a specified community type when only a few communities are labeled. Existing “select-then-refine” pipelines often start from misaligned cores and rely on reinforcement learning or generative adversarial networks, increasing computational cost and limiting scalability. We address these issues with a unified energy framework under crystallization kinetics that jointly models energy, structure, and growth. Based on this perspective, we propose CLique ANNealing (CLANN), which first employs a Nucleus Proposer to select candidate cliques as community cores under four physics-inspired criteria. A learning-free Transitive Annealer then iteratively merges neighboring cliques and repositions the nucleus, enabling spontaneous, scalable community growth. Evaluated on diverse real-world and synthetic networks, CLANN surpasses state-of-the-art baselines by a wide margin while running faster on large graphs, demonstrating that the energy-driven crystallization kinetics framework is both principled and practical for semi-supervised community detection.

Abstract:
Graph Machine Learning (Graph ML) has witnessed substantial advancements in recent years. With their remarkable ability to process graph-structured data, Graph ML techniques have been extensively utilized across diverse applications, including critical domains like finance, healthcare, and transportation. Despite their societal benefits, recent research highlights significant safety concerns associated with the widespread use of Graph ML models. Lacking safety-focused designs, these models can produce unreliable predictions, demonstrate poor generalizability, and compromise data confidentiality. In high-stakes scenarios such as financial fraud detection, these vulnerabilities could jeopardize both individuals and society at large. Therefore, it is imperative to prioritize the development of safety-oriented Graph ML models to mitigate these risks and enhance public confidence in their applications. In this survey paper, we explore three critical aspects vital for enhancing safety in Graph ML: reliability, generalizability, and confidentiality. We categorize and analyze threats to each aspect under three headings: model threats, data threats, and attack threats. This novel taxonomy guides our review of effective strategies to protect against these threats. Our systematic review lays a groundwork for future research aimed at developing practical, safety-centered Graph ML models. Furthermore, we highlight the significance of safe Graph ML practices and suggest promising avenues for further investigation in this crucial area.

Abstract:
Distributed ledgers are enabling novel applications in traditional domains, such as finance, healthcare and supply chains, and in emerging domains such as the metaverse. We observe that the current ecosystem is fragmented, with different blockchains operating in silos. Interledger applications that can access resources in different ledgers can tap into the billions of dollars worth of locked resources, but they require support for interledger communication. Existing interledger communications, however, are either insecure, inefficient, or application specific. Our goal is to design a system for interledger applications. To this end, we design and implement ChainHub, which achieves strong security guarantees and high performance while providing the general message passing abstraction to the applications. Our system leverages trusted hardware for performance and threshold signature schemes for strong security. It consists of multiple lightweight clients that verify transactions within their trusted execution environments before generating threshold signatures. We conduct extensive performance evaluations of ChainHub and compare it against three state-of-the-art systems, namely WeCross, Cosmos IBC and LayerZero-V2. The results show that ChainHub is efficient, achieving up to 42× higher throughput than the baselines.

Abstract:
Sequential change detection in high-dimensional dynamic networks has attracted growing attention in modern applications. A major challenge is that as network scale increases, limited sensing resources make it difficult to fully observe links at each time point, resulting in partial observability and complicating detection. To tackle this, we propose a latent bi-transit network (LBTN) that learns unobserved edge formation, latent node states, and the evolution mechanisms of dynamic networks. Based on LBTN, we design a variational Bayesian method to infer sparse node-level changes and employ a likelihood ratio test as the detection statistic. By further formulating the statistic as the reward function of a combinatorial multi-armed bandit (CMAB) problem, we develop a Thompson sampling strategy to adaptively select edges for observation, balancing exploration and exploitation. Comprehensive simulations and real-world experiments show that our method consistently outperforms existing baselines and its variants across diverse scenarios.

Abstract:
We propose the DPSM method, a density-based node clustering approach that automatically determines the number of clusters and can be applied in both data space and graph space. Unlike traditional density-based clustering methods, which necessitate calculating the distance between any two nodes, our proposed technique determines density through a propagation process, thereby making it suitable for a graph space. In DPSM, nodes are partitioned into small clusters based on propagated density. The partitioning technique has been proved to be sound and complete. We then extend the concept of spectral clustering from individual nodes to these small clusters, while introducing the CluCut measure to guide cluster merging. This measure is modified in various ways to account for cluster properties, and thus provides guidance on when to terminate the merging process. Various experiments have validated the effectiveness of DPSM and the accuracy of these conclusions.

Abstract:
Real-world data are complex, generally formed by the interaction of different latent factors. Disentanglement of these latent factors can effectively improve the robustness and interpretability of sample representation. However, most existing disentangled multi-view clustering methods focus on the irrelevance of disentangled representations, ignoring the semantic relevance invariance between different latent factors. To address this issue, we propose disentangled contrastive multi-view clustering via semantic relevance invariance (DMVCS) to learn the disentangled representations and maintain their semantic relevance. Specifically, we first decompose each view into consistent and specific representations by maximizing semantic consistency and minimizing the correlation between multiple views. Meanwhile, to ensure that different disentangled representations have similar semantic relevance, a cross-component semantic relevance alignment module is proposed. Combined with the hierarchical sampling strategy, the learned semantic relevances are aligned progressively in a locally structure-aware manner. Besides, to learn a clustering-friendly unified representation, we propose multi-hop neighbor contrastive learning to extend the range of positive samples. Comprehensive experiments on ten public multi-view datasets demonstrate that DMVCS outperforms the state-of-the-art clustering methods.

Abstract:
Rule discovery is a fundamental task in data analysis, with broad applications in data cleaning, knowledge extraction, and decision making. However, existing methods often generate a large number of functionally redundant rules, with a high time cost. To address this, a recent line of work, the first to introduce diversified top-k rule discovery, aims to identify a set of top-ranked rules that are both relevant and diverse. Despite this advancement, it still suffers from high user interaction overhead, computational inefficiency, and the inability to handle a common scenario of selecting a diverse subset from an existing rule set. In this paper, we propose a user-friendly and efficient framework for diversified top-k rule discovery. As a testbed, we consider Entity Enhancing Rules (REEs), which subsume common association rules and data quality rules as special cases. Our method allows users to specify lightweight preference templates, which are used to train a correlation model that captures user preferences and generates subjective embeddings for predicates and rules. Based on these embeddings, we define an objective function to jointly measure the relevance and diversity of rules in a unified vector space; moreover, we formulate and study two key problems: (i) selecting diversified top-k rules from an existing redundant rule set, and (ii) discovering diversified top-k rules directly from raw data. We prove that both problems are intractable and propose effective algorithms; in particular, the second problem is more challenging and thus we further optimize its solution with carefully designed pruning strategies and parallel optimization. Extensive evaluation on real-world datasets demonstrates that our algorithms consistently identify top-ranked relevant and diverse rules, achieving an average 14.4× speedup (up to 35.57×) over the state-of-the-art method.

Abstract:
Wildcard Keyword Searchable Encryption (WKSE) has grown into a ubiquitous tool. It enables clients to search desired files with wildcard expressions. Although promising, previous schemes confront three barriers: (1) An adversary can launch a correlation attack to acquire the similarity between keywords. (2) The WKSE schemes exhibit false positives which can lead to wrong search results. (3) Existing feature extraction strategies limit the flexibility of search expressions. In this paper, we propose a Multi-Character Searchable Encryption scheme (MCSE) that overcomes the aforementioned barriers. To resist correlation attacks, we design the randomize-pad model to encrypt the vector. To eradicate false positives, we apply the vector space model and complete feature extraction strategies so that a feature set uniquely identifies a keyword or expression. To enhance search flexibility, we introduce three distinct feature extraction strategies for keyword expressions, wildcard expressions, and logical expressions, enabling effective multi-character search. These strategies enable indexes to accommodate the search of diverse expressions. Finally, we prove that MCSE is indistinguishable against chosen-feature attacks and implement MCSE on two real datasets. Compared with state-of-the-art schemes, the experiment results show that MCSE achieves good performance.

Abstract:
Temporal graph representation learning seeks to capture the intrinsic evolution of nodes in temporal graphs for various applications. While existing models primarily learn node representations by aggregating temporal information from historical interactions of nodes, they often overlook the critical structural impacts arising from these interactions. To address this issue, we propose a Structure-aware model for Temporal Graph representation learning (STG), a framework that explicitly incorporates the impacts of evolving structural roles to enhance the learned node representations. Specifically, STG encodes distinct structural roles of nodes by extracting both single-unit and multi-unit interaction patterns. These roles are then transformed into the Fourier domain for a deeper analysis of the complex structural dynamics. To capture the structural impacts on future node interactions, we design a dynamic filter to process these roles. The filter is equipped with a personalized weight coefficient generator to perform the interaction-specific analysis. Finally, we employ a mixer to collaboratively aggregate the temporal and structural information to obtain structure-aware temporal node representations. Extensive experiments conducted on several real-world temporal graph datasets demonstrate the superior performance of our model in dynamic link prediction tasks under both transductive and inductive settings.

Abstract:
Federated graph learning (FGL) aims to collaboratively train graph neural networks (GNNs) among multiple clients, where each client owns a subgraph of a global model. A key challenge in FGL arises from the possible interconnections between nodes distributed across different subgraphs, leading to an incomplete capture of neighborhood knowledge within the graph. Existing FGL frameworks attempt to learn missing neighborhood knowledge by generating pseudo nodes or transmitting missing node embeddings directly across clients, which is either only suitable for 1-hop neighbor nodes or comes with high communication costs when training deeper GNNs. In this paper, we propose a novel framework for FGL named Fed²GNN that can fully capture neighborhood knowledge while achieving low communication costs. More specifically, we propose ego-tree, a new graph structure that is easy to build and allows us to reconstruct the neighborhood faithfully. Furthermore, we design an encoder-decoder-based method to build the ego-tree. The encoder enables clients to transmit encoded information essential for tree construction with minimal communication costs, while the decoder empowers clients to build the ego-tree by decoding the received information. Extensive experiments on real-world network datasets show the effectiveness of our framework for training deep GNNs, with about 100× less communication compared to prior works.

Abstract:
Unsupervised graph alignment finds the node correspondence between a pair of attributed graphs by only exploiting graph structure and node features. One category of recent studies first computes the node representation and then matches nodes with the largest embedding-based similarity, while the other category reduces the problem to optimal transport (OT) via Gromov-Wasserstein learning. However, model expressiveness remains largely unexplored, as does how theoretical expressivity impacts prediction accuracy. We investigate the model expressiveness from two aspects. First, we characterize the model’s discriminative power in distinguishing matched and unmatched node pairs across two graphs. Second, we study the model’s capability of guaranteeing node matching properties such as one-to-one matching and mutual alignment. Motivated by our theoretical analysis, we put forward a hybrid approach named CombAlign with stronger expressive power. Specifically, we enable cross-dimensional feature interaction for OT-based learning and propose an embedding-based method inspired by the Weisfeiler-Lehman test. We also apply non-uniform marginals obtained from the embedding-based modules to OT as priors for more expressiveness. Based on that, we propose a traditional algorithm-based refinement, which combines our OT and embedding-based predictions using the ensemble learning strategy and reduces the problem to maximum weight matching. With carefully designed edge weights, we ensure these matching properties and further enhance prediction accuracy. Through extensive experiments, we demonstrate a significant improvement of 14.5% in alignment accuracy compared to state-of-the-art approaches and confirm the soundness of our theoretical analysis.

Abstract:
The advent of Single Instruction Multiple Data (SIMD) instructions in modern processors has revolutionized data processing by enabling simultaneous computation across multiple data elements. While database systems have extensively adopted SIMD for traditional operations, its potential for complex event pattern matching remains largely unexplored. This paper presents a novel approach that bridges this gap through bit-parallel processing enhanced with AVX-512 vectorization. Our approach encodes event streams into compact bit sequences, where each bit corresponds to a time slice, and an event’s presence is marked by a 1-bit when its timestamp falls within the respective slice. This representation enables the formulation of bit-parallel operations that natively enforce complex event constraints, including temporal window requirements and event ordering relationships. We develop a family of bit-parallel algorithms that leverage this representation for continuous event matching, and further optimize their performance through SIMD vectorization (AVX-512 instructions) to exploit modern hardware parallelism. Experimental evaluations on both real-world and synthetic datasets demonstrate the superiority of our method, achieving at least 35.7x improvement in query efficiency compared to state-of-the-art alternatives.
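
The bit-sequence encoding described above can be sketched in a few lines. The example below is only an illustration of the idea, not the paper's algorithms: it uses arbitrary-precision Python integers in place of AVX-512 registers, and the function names, the slice width, and the single "A followed by B within a window" constraint are assumptions made for the example.

```python
def encode(timestamps, slice_width, num_slices):
    """Return an int whose bit i is 1 if some timestamp falls in time slice i."""
    bits = 0
    for t in timestamps:
        i = int(t // slice_width)
        if 0 <= i < num_slices:
            bits |= 1 << i
    return bits


def followed_within(a_bits, b_bits, window, num_slices):
    """Bit i of the result is 1 if B occurs in slice i and A occurs
    in one of the `window` slices strictly before slice i."""
    mask = (1 << num_slices) - 1
    reach = 0
    for d in range(1, window + 1):
        reach |= a_bits << d  # an A in slice i-d "reaches" slice i
    return reach & b_bits & mask


NUM_SLICES = 16
a = encode([0.5, 7.2], slice_width=1.0, num_slices=NUM_SLICES)   # A in slices 0 and 7
b = encode([2.1, 12.0], slice_width=1.0, num_slices=NUM_SLICES)  # B in slices 2 and 12
hits = followed_within(a, b, window=3, num_slices=NUM_SLICES)
matches = [i for i in range(NUM_SLICES) if (hits >> i) & 1]
```

Both the ordering constraint (A before B) and the window constraint are enforced by shifts and bitwise AND/OR alone, which is what makes this style of matching a natural fit for wide vector registers.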

Abstract:
Utilizing pre-trained generative models for sentiment element extraction has recently significantly enhanced aspect-based sentiment analysis benchmarks. Nonetheless, these models have two significant drawbacks: 1) high computational cost in both inference time and hardware requirements; and 2) a lack of explicit modeling, as they model the connections between sentiment elements with fragile natural or notational language target sequences. To overcome these challenges, we present a novel opinion tree parsing model designed to swiftly parse sentiment elements from an opinion tree. This approach not only accelerates the process but also explicitly unveils a more comprehensive and fully articulated aspect-level sentiment structure. Our method begins by introducing a pioneering context-free opinion grammar to standardize the opinion tree structure. Subsequently, we leverage a neural chart-based opinion tree parser to thoroughly explore the interconnections among sentiment elements and parse them into a structured opinion tree. Extensive experiments underscore the effectiveness of our proposed model and the capability of the opinion tree parser, particularly when coupled with the introduced context-free opinion grammar. Crucially, the results confirm the superior speed of our model compared to the SOTA baselines.

Abstract:
Popularity bias is a common challenge in recommender systems. It often causes unbalanced item recommendation performance and intensifies the Matthew effect. Due to limited user-item interactions, unpopular items are frequently constrained to the embedding neighborhoods of only a few users, leading to representation collapse and weakening the model’s generalization. Although existing supervised alignment and reweighting methods can help mitigate this problem, they still face two major limitations: (1) they overlook the inherent variability among different Graph Convolutional Network (GCN) layers, which can result in negative gains in deeper layers; (2) they rely heavily on fixed hyperparameters to balance popular and unpopular items, limiting adaptability to diverse data distributions and increasing model complexity. To address these challenges, we propose the Graph-Structured Dual Adaptation Framework (GSDA), a dual adaptive framework for mitigating popularity bias in recommendation. Our theoretical analysis shows that supervised alignment in GCNs is hindered by the over-smoothing effect, where the distinction between popular and unpopular items diminishes as layers deepen, reducing the effectiveness of alignment at deeper levels. To overcome this limitation, GSDA integrates a hierarchical adaptive alignment mechanism that counteracts entropy decay across layers together with a distribution-aware contrastive weighting strategy based on the Gini coefficient, enabling the model to adapt its debiasing strength dynamically without relying on fixed hyperparameters. Extensive experiments on three benchmark datasets demonstrate that GSDA effectively alleviates popularity bias while consistently outperforming state-of-the-art methods in recommendation performance.
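
For readers unfamiliar with the Gini coefficient mentioned above, the sketch below computes it for a list of per-item interaction counts. How GSDA turns this statistic into contrastive weights is not stated in the abstract, so the example stops at the coefficient itself; the function name and the sample counts are assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative values.

    0.0 means all items are equally popular; values near 1.0 mean
    interactions are concentrated on very few items.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n,
    # with x sorted ascending and i starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


balanced = gini([5, 5, 5, 5])    # every item equally popular
skewed = gini([0, 0, 0, 10])     # one item holds all interactions
```

Because the coefficient is computed from the data itself, a weighting scheme built on it can react to how skewed a dataset is without a hand-tuned balance parameter.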

Abstract:
Temporal interactions form the crux of numerous real-world scenarios, thus necessitating effective modeling in temporal graph representation learning. Despite extensive research within this domain, we identify a significant oversight in current methodologies: the temporal-spatial dynamics in graphs, encompassing both structural and temporal coherence, remain largely unaddressed. In an effort to bridge this research gap, we present a novel framework termed Graph Representation learning enhanced by Periodic and Community Interactions (GRPCI). GRPCI consists of two primary mechanisms devised explicitly to tackle the aforementioned challenge. Firstly, to utilize latent temporal dynamics, we propose a novel periodicity-based neighborhood aggregation mechanism that underscores neighbors engaged in a periodic interaction pattern. This mechanism seamlessly integrates the element of periodicity into the model. Secondly, to exploit structural dynamics, we design a novel contrastive-based local community representation learning mechanism. This mechanism features a heuristic dynamic contrastive pair sampling strategy aimed at enhancing the modeling of the latent distribution of local communities within the graphs. Through the incorporation of these two mechanisms, GRPCI markedly augments the performance of graph networks. Empirical evaluations, conducted via a temporal link prediction task across five real-life datasets, attest to the superior performance of GRPCI in comparison to existing state-of-the-art methodologies. The results of this study validate the efficacy of GRPCI, thereby establishing a new benchmark for future research in the field of temporal graph representation learning. Our findings underscore the importance of considering both temporal and structural consistency in temporal graph learning, and advocate for further exploration of this paradigm.

Abstract:
Rationalization, a data-centric framework, aims to build self-explanatory models to explain the prediction outcome by generating a subset of human-intelligible pieces of the input data. It involves a cooperative game model where a generator generates the most human-intelligible parts of the input (i.e., rationales), followed by a predictor that makes predictions based on these generated rationales. Conventional rationalization methods typically impose constraints via regularization terms to calibrate or penalize undesired generation. However, these methods suffer from a problem called mode collapse, in which the predictor produces correct predictions yet the generator consistently outputs rationales with collapsed patterns. Moreover, existing studies are typically designed separately for specific collapsed patterns, lacking a unified consideration. In this paper, we systematically revisit cooperative rationalization from a novel game-theoretic perspective and identify the fundamental cause of this problem: the generator no longer tends to explore new strategies to uncover informative rationales, ultimately leading the system to converge to a suboptimal game equilibrium (correct predictions versus collapsed rationales). To solve this problem, we then propose a novel approach, Game-theoretic Policy Optimization oriented RATionalization (PoRat), which progressively introduces policy interventions to address the game equilibrium in the cooperative game process, thereby guiding the model toward a more optimal solution state. We theoretically analyse the cause of such a suboptimal equilibrium and prove the feasibility of the proposed method. Furthermore, we validate our method on nine widely used real-world datasets and two synthetic settings, where PoRat achieves up to 8.1% performance improvements over existing state-of-the-art methods.

Affiliations: State Key Laboratory of Digital Intelligent Technology for Unmanned Coal Mining, School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China; Zhejiang University, Hangzhou, China; Tokyo Institute of Technology, Meguro, Japan; Beijing University of Posts and Telecommunications, Beijing, China; Nanjing University of Aeronautics and Astronautics, Nanjing, China; Jinan University, Guangzhou, China; University of Southern Queensland, Toowoomba, Australia

Abstract:
Truth discovery has emerged as an effective tool to mitigate data inconsistency in crowdsensing by prioritizing data from high-quality responders. While local differential privacy (LDP) has emerged as a crucial privacy-preserving paradigm, existing studies under LDP rarely explore a worker’s participation in specific tasks for sparse scenarios, which may also reveal sensitive information such as individual preferences and behaviors. Existing LDP mechanisms, when applied to truth discovery in sparse settings, may create undesirable dense distributions, provide insufficient privacy protection, and introduce excessive noise, compromising the efficacy of subsequent non-private truth discovery. Additionally, the interplay between noise injection and truth discovery remains insufficiently explored in the current literature. To address these issues, we propose a lOcally differentially private truth diSCovery approach for spArse cRowdsensing, namely OSCAR. The main idea is to use advanced optimization techniques to reconstruct the sparse data distribution and re-formalize truth discovery by considering the statistical characteristics of injected Laplacian noise while protecting the privacy of both the tasks being completed and the corresponding sensory data. Specifically, to address the data density concerns while alleviating noise, we design a randomized response-based Bernoulli matrix factorization method BerRR. To recover the sparse structures from densified, perturbed data, we formalize a 0-1 integer programming problem and develop a sparse recovery solving method SpaIE based on implicit enumeration. We further devise a Laplacian-sensitive truth discovery method LapCRH that leverages maximum likelihood estimation to re-formalize truth discovery by measuring differences between noisy values and truths based on the statistical characteristic of Laplacian noise. 
Our comprehensive theoretical analysis establishes OSCAR’s privacy guarantees, utility bounds, and computational complexity. Experimental results show that OSCAR surpasses state-of-the-art methods by at least 30% in accuracy.
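
As a rough illustration of the two LDP building blocks named in this abstract (randomized response and Laplacian noise), the sketch below perturbs a worker's participation bit and sensed value locally. It is not the BerRR, SpaIE, or LapCRH method; the function names, the privacy budget `epsilon`, and the `sensitivity` parameter are assumptions made only for this example.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Keep the true bit with probability e^eps / (e^eps + 1), otherwise flip it."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_keep else 1 - bit


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_value(value, sensitivity, epsilon, rng):
    """Laplace mechanism: add noise with scale sensitivity / epsilon (assumed names)."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def estimate_participation_rate(noisy_bits, epsilon):
    """Debias the observed fraction of 1s to estimate the true fraction."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(noisy_bits) / len(noisy_bits)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(42)
    true_bits = [1] * 500 + [0] * 9500          # sparse: 5% participation
    noisy_bits = [randomized_response(b, 1.0, rng) for b in true_bits]
    print("observed density:", sum(noisy_bits) / len(noisy_bits))
    print("debiased estimate:", estimate_participation_rate(noisy_bits, 1.0))
```

With `epsilon = 1` and a true participation rate of 5%, the perturbed bits contain about 29% ones in expectation, which is one way to see the densification of sparse data that the abstract describes.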

Abstract:
Identifying locally dense communities closely connected to the user-initiated query node is crucial for a wide range of applications. Existing approaches either solely depend on rule-based constraints or exclusively utilize deep learning technologies to identify target communities. This raises an important question: can deep learning be integrated with rule-based constraints to elevate the quality of community search? In this paper, we answer this question affirmatively by introducing a novel approach called Neural Community Search via Attribute-augmented Conductance, abbreviated as NCSAC. Specifically, NCSAC first proposes a novel concept of attribute-augmented conductance, which harmoniously blends the (internal and external) structural proximity and the attribute similarity. Then, NCSAC extracts a coarse candidate community of satisfactory quality using the proposed attribute-augmented conductance. Subsequently, NCSAC frames the community search as a graph optimization task, refining the candidate community through sophisticated reinforcement learning techniques, thereby producing high-quality results. Extensive experiments on six real-world graphs and ten competitors demonstrate the superiority of our solutions in terms of accuracy, efficiency, and scalability. Notably, the proposed solution outperforms state-of-the-art methods, achieving an impressive F1-score improvement ranging from 5.3% to 42.4%.
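
For readers unfamiliar with the underlying measure, the sketch below computes classical (purely structural) conductance, i.e. the number of edges leaving a community divided by the smaller of the two degree volumes. The attribute-augmented variant is not defined in this abstract, so it is deliberately not reproduced here; the adjacency-dictionary representation and function name are assumptions of this example.

```python
def conductance(adj, community):
    """Classical conductance of `community` in an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    Returns cut(S, V \\ S) / min(vol(S), vol(V \\ S)).
    """
    inside = set(community)
    cut = 0       # edges with exactly one endpoint inside the community
    vol_in = 0    # sum of degrees of nodes inside
    vol_out = 0   # sum of degrees of nodes outside
    for u, nbrs in adj.items():
        if u in inside:
            vol_in += len(nbrs)
            cut += sum(1 for v in nbrs if v not in inside)
        else:
            vol_out += len(nbrs)
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has zero volume")
    return cut / denom


if __name__ == "__main__":
    # Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
    graph = {
        0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
        3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
    }
    print(conductance(graph, {0, 1, 2}))  # one cut edge over volume 7
```

Lower values indicate a community that is well separated from the rest of the graph.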

Abstract:
Important communities are densely connected subgraphs containing vertices with high importance values, which have received wide attention recently. However, existing methods, predominantly based on the $k$-core model, suffer from limitations such as rigid degree constraints and suboptimal density, often failing to capture highly important vertices. To address these limitations, we propose a new community model based on pseudoarboricity that guarantees near-optimal density while preserving important vertices. Further, we introduce a novel problem of Pseudoarboricity-based Skyline Important Community (PSIC), which uniquely treats density and importance as independent attributes. To efficiently address PSIC, we first devise a basic algorithm $\mathsf{ClimbStairs}$, which iteratively refines communities by peeling vertices with low importance. To boost efficiency, we develop an advanced algorithm $\mathsf{DivAndCon}$, which employs a recursive divide-and-conquer strategy combined with weight-based and pseudoarboricity-based pruning techniques, significantly reducing the search space. For massive graphs with billions of edges, inspired by a recursive division tree, we develop several parallel algorithms utilizing thread-pool and free-synchronization mechanisms. Finally, we conduct extensive experiments on 10 real-world networks, and the results demonstrate the superiority of our solutions in terms of effectiveness, efficiency, and scalability.

Abstract:
The COVID-19 pandemic not only triggered a global health crisis but also amplified public panic through the rapid spread of misinformation. Understanding public sentiment and identifying the causes of sudden sentiment spikes is therefore critical for ensuring accurate information dissemination and guiding effective policymaking. However, mining such causes from social media remains challenging. Tweets collected during sentiment spike periods are often short, noisy, and dominated by repetitive background topics, making it difficult for existing topic models to separate emerging issues from long-standing discussions. To address these challenges, we propose the Sentiment Variation-aware Emerging Topics Mining Model (SVETM), a probabilistic graphical framework that leverages user sentiment variation between adjacent time windows as a guiding signal to distinguish emerging topics from background content. We further reformulate inference as a maximum a posteriori (MAP) problem and develop an efficient variational inference algorithm for scalable learning. Extensive experiments on a large-scale COVID-19 Twitter dataset demonstrate that SVETM outperforms strong baselines in terms of topic coherence, interpretability, and its ability to uncover the underlying causes of sentiment spikes.

Affiliations: School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia; Department of Computer Science and Technology, Tongji University, Shanghai, China; School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia; School of Computer Science, University of Technology Sydney, Sydney, NSW, Australia; Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China

Abstract:
With growing demands for data privacy and model robustness, graph unlearning (GU), which erases the influence of specific data on trained GNN models, has gained significant attention. However, existing exact unlearning methods suffer from either low efficiency or poor model performance. While more utility-preserving and efficient, current approximate methods require access to the forget set during unlearning, which makes them inapplicable in immediate deletion scenarios, thereby undermining privacy. Additionally, these approximate methods, which attempt to directly perturb model parameters, still raise significant concerns regarding unlearning power in empirical studies. To fill the gap, we propose Transferable Condensation Graph Unlearning (TCGU), a data-centric solution to graph unlearning. Specifically, we first develop a two-level alignment strategy to pre-condense the original graph into a compact yet utility-preserving dataset for subsequent unlearning tasks. Upon receiving an unlearning request, we fine-tune the pre-condensed data with a low-rank plugin, to directly align its distribution with the remaining graph, thus efficiently revoking the information of deleted data without accessing them. A novel similarity distribution matching approach and a discrimination regularizer are proposed to effectively transfer condensed data and preserve its utility in GNN training, respectively. Finally, we retrain the GNN on the transferred condensed data. Extensive experiments on 7 benchmark datasets demonstrate that TCGU can achieve superior performance in terms of model utility, unlearning efficiency, and unlearning efficacy compared to existing GU methods. To the best of our knowledge, this is the first study to explore graph unlearning with immediate data removal using a data-centric approximate method.

Abstract:
Graph pattern mining (GPM) is essential for uncovering complex patterns and relationships in graph data, with applications spanning social network analysis, bioinformatics, and recommendation systems. However, existing GPM systems face significant challenges, including high computational costs, limited scalability, and inefficiencies in handling large datasets. These systems can be categorized into two paradigms: embedding-centric systems, which struggle with the exponential growth of the search space, and pattern-centric systems, which often fail to leverage the full potential of input patterns. Despite their individual strengths, a critical research gap exists in understanding the comparative limitations of these approaches and the specific bottlenecks that hinder their performance. To address these limitations, we propose the gDAG model, a novel framework that unifies the computational processes of both paradigms, enabling comprehensive performance analysis. The gDAG model serves as the foundation for our BLITZ system, which incorporates innovative optimization techniques, such as path merging and quick counting. Our experimental results demonstrate that BLITZ achieves an average speedup of 10x in mining time compared to existing methods and provides a robust framework for future research.

Abstract:
Recommender systems suffer from biases that cause the collected feedback to incompletely reveal user preference. While debiasing learning has been extensively studied, existing methods mostly focus on the specialized (called counterfactual) test environment simulated by random exposure of items, significantly degrading accuracy in the typical (called factual) test environment based on actual user-item interactions. In fact, each test environment highlights the benefit of a different aspect: the counterfactual test emphasizes user satisfaction in the long term, while the factual test focuses on predicting subsequent user behaviors on platforms. Therefore, it is desirable to have a model that performs well on both tests rather than only one. In this work, we introduce a new learning framework, called Bias-adaptive Preference distillation Learning (BPL), to gradually uncover user preferences with dual distillation strategies. These distillation strategies are designed to drive high performance in both factual and counterfactual test environments. Employing a specialized form of teacher-student distillation from a biased model, BPL retains accurate preference knowledge aligned with the collected feedback, leading to high performance in the factual test. Furthermore, through self-distillation with reliability filtering, BPL iteratively refines its knowledge throughout the training process. This enables the model to produce more accurate predictions across a broader range of user-item combinations, thereby improving performance in the counterfactual test. Comprehensive experiments validate the effectiveness of BPL in both factual and counterfactual tests.

Abstract:
Recent hash-based collaborative filtering (Hash-CF) approaches employ the efficient Hamming distance between learned binary representations to accelerate recommendations. Benefiting from its probabilistic nature, the Variational Autoencoder (VAE) enables robust Hash-CF with stronger generalization ability. However, VAE-based Hash-CF still faces two challenging problems: 1) Traditional VAE urges the latent variables of different users (or items) to fit a unified and monotonous prior distribution, and lacks considerations for distinctive characteristics of users (or items). The obtained representations of users and items with slight individual differentiation may further weaken the performance of Hash-CF for subsequent personalized recommendations. 2) Hash-CF under the VAE framework requires discrete optimization on latent Bernoulli distributions, which is NP-hard. In this paper, we propose a Dual Discrete Collaborative Filtering (DDCF) approach, including a cluster-enhanced representation generation module and a CNF-enabled discrete optimization module. The former module mainly develops cluster-aware latent space to generate discriminative representations for users or items with significantly different characteristics. The latter module employs Continuous Normalizing Flow (CNF) to achieve discrete optimization on latent Bernoulli distributions steadily and effectively. Extensive experiments conducted on multiple real-world datasets demonstrate the superiority of our DDCF compared with the state-of-the-art methods in terms of effectiveness and efficiency.
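
To make the retrieval step concrete, the following minimal sketch ranks items by Hamming distance between binary codes, using XOR and a bit count on packed integers. It only illustrates why binary representations make scoring cheap; it does not implement DDCF or any learning step, and the helper names and toy codes are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single integer code."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def top_k_items(user_code, item_codes, k):
    """Return the k item ids closest to the user code (ties broken by id)."""
    ranked = sorted(
        item_codes,
        key=lambda item: (hamming(user_code, item_codes[item]), item),
    )
    return ranked[:k]


if __name__ == "__main__":
    user = pack_bits([1, 0, 1, 1])
    items = {
        "a": pack_bits([1, 0, 1, 1]),
        "b": pack_bits([0, 1, 0, 0]),
        "c": pack_bits([1, 0, 0, 1]),
    }
    print(top_k_items(user, items, 2))
```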

Abstract:
Signed graphs, with friendly (positive) and antagonistic (negative) edges, capture important structural properties of real-world phenomena. The structural balanced clique model has recently been formulated to identify polarized structures in signed graphs, where a graph is a structural balanced clique if it is a clique and its vertices can be divided into two sets with positive intra-set and negative cross-set edges. However, this model’s rigidity restricts its applicability in practice. In this paper, we consider structural balanced near-cliques by allowing a few missing connections and deviations from structural balance theory. Specifically, we adopt the definition of $k$-plex to represent a near-clique. We prove that enumerating all maximal structural balanced $k$-plexes is #P-hard. To solve this problem, we first propose a backtracking algorithm $\mathsf{MBPE}\text{-}\mathsf{BK}$, by drawing inspiration from the well-known Bron-Kerbosch algorithm. However, $\mathsf{MBPE}\text{-}\mathsf{BK}$’s performance is unsatisfactory due to the issue of overlapping candidate sets. We then propose the algorithm $\mathsf{MBPE}$ to overcome this issue by adopting a different strategy at the root level of the search tree, and prove that $\mathsf{MBPE}$ achieves a better time complexity than $\mathsf{MBPE}\text{-}\mathsf{BK}$ (i.e., $\mathcal{O}^*(2^\delta)$ versus $\mathcal{O}^*(3^{\delta D})$). Finally, we adopt the minimum-degree branching strategy to improve the worst-case time complexity of $\mathsf{MBPE}$ to $\mathcal{O}^*(\alpha_k^\delta)$, where $\alpha_k < 2$ is a constant that depends only on $k$. Extensive experiments on real-world and synthetic datasets demonstrate the efficiency of our algorithms and the effectiveness of our model.
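
The definition of a structural balanced clique quoted above can be checked directly, as in the sketch below. It covers only the strict clique model, not the relaxed $k$-plex variant or the enumeration algorithms; the edge-sign dictionary, the function name, and the choice to accept an all-positive clique (one side empty) are assumptions of this example.

```python
from itertools import combinations


def is_balanced_clique(vertices, sign):
    """Check the strict structural balanced clique definition.

    sign: dict mapping frozenset({u, v}) to +1 or -1 for each existing edge.
    """
    vertices = list(vertices)
    if not vertices:
        return True
    # 1) Every pair of vertices must be connected (clique condition).
    for u, v in combinations(vertices, 2):
        if frozenset((u, v)) not in sign:
            return False
    # 2) In a clique the split is forced by any anchor vertex:
    #    positive neighbours share its side, negative neighbours go opposite.
    anchor = vertices[0]
    side = {anchor: 0}
    for v in vertices[1:]:
        side[v] = 0 if sign[frozenset((anchor, v))] > 0 else 1
    # 3) Verify positive intra-set and negative cross-set edges.
    for u, v in combinations(vertices, 2):
        expected = 1 if side[u] == side[v] else -1
        if sign[frozenset((u, v))] != expected:
            return False
    return True


if __name__ == "__main__":
    signs = {
        frozenset(("a", "b")): 1,
        frozenset(("a", "c")): -1,
        frozenset(("b", "c")): -1,
    }
    print(is_balanced_clique(["a", "b", "c"], signs))
```

A triangle with three negative edges fails the check, since no two-way split can make all three cross-set.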

Abstract:
In signed graph analysis, the balanced clique model has received increasing attention recently. A clique is balanced if it can be divided into two disjoint subgroups, where internal connections are positive and intergroup connections are negative. However, the requirement of fully negative intergroup connection is too strict, and it may fail to retrieve some important communities, considering the unbalanced distribution of positive and negative edges in real-world signed networks. Motivated by this, we leverage the concept of $k$-plex and propose a novel model, called Balanced $k$-CliPlex ($k$-BCP), which relaxes the negative connections between two subgroups in a balanced clique. Given a signed graph, in this paper, we aim to enumerate all the maximal $k$-BCPs with a size constraint, which is proved to be NP-hard. To solve the problem, a reasonable baseline algorithm is first proposed by extending the existing approach for maximal balanced clique enumeration and equipping it with two acceleration techniques. To scale for large graphs, we further introduce a partition method that can significantly reduce the search space, and propose three optimization strategies to filter unnecessary search branches during the enumeration. Comprehensive experiments are conducted over 10 real-world networks to demonstrate the efficiency and effectiveness of the proposed techniques and model.

Affiliations: School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing, China; School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China; School of Automation Science and Electrical Engineering, Beihang University, Beijing, China; School of Computing, Macquarie University, Sydney, NSW, Australia

Abstract:
In the era of Big Data and generative artificial intelligence (AI), discovering the truth about various objects from different sources has become a pressing topic. Existing studies primarily focus on dependent sources with conflicting information, where sources may copy information from each other. However, real-world scenarios are often more complex, with dynamic dependence relationships among sources over time. This complexity makes it much more difficult to discover the truth. One of the key challenges centers on measuring the dynamic dependence among sources. To address this challenge, we have developed three models: Depen_Simple, Depen_Complex, and Depen_Dynamic. These models are based on the Hidden Markov Model (HMM) and are designed to handle different types of dependencies, namely simple source dependence, complex source dependence, and dynamic source dependence. Based on the constructed models, we propose a generic framework for discovering the latent truth, which is evaluated using three HMM-based methods. We conduct extensive experiments on three real-world datasets to evaluate the performance of the proposed methods, and the results demonstrate that all three methods achieve higher accuracy than the state-of-the-art methods.

Abstract:
Graph Knowledge Distillation (GKD) has made remarkable progress in graph representation learning in recent years. Despite its great success, GKD often operates in a label-dependent manner, relying heavily on a large number of labels. Besides, we observe that GKD encounters the issue of embedding collapse, as merely maximizing the consistency between the teacher and student is insufficient for heterophilic graphs. To tackle these challenges, we propose a Self-Supervised Distillation framework named SSD. To realize label independence, the framework is built on contrastive learning. Specifically, we design a Topology Invariance Block (TIB) and a Feature Invariance Block (FIB) to distill semantic invariance from unlabeled data. Each block includes a teacher-student architecture, which is trained by a projection-based contrastive loss. To avoid embedding collapse, the loss pays attention to two critical aspects: (1) preserving consistency maximization between the same node representations related to teacher and student (positive pairs), and (2) ensuring consistency minimization between negative pairs, which include the final teacher and final student representation pairs and hidden teacher representation pairs. Under the guidance of self-distillation in each block, TIB captures the topology invariance while FIB learns the feature invariance. Additionally, cross-distillation is applied between two blocks, allowing each block to gain additional contrastive knowledge from the other, resulting in improved feature representations and enhanced classification performance. Comprehensive experimental results on 10 datasets demonstrate that our model achieves superior performance in the node classification task. In summary, SSD offers a novel paradigm for self-supervised knowledge distillation on graph-structured data.

Abstract:
Knowledge Graphs (KGs), with their rich semantics and friendly structure, are crucial for enhancing AI systems’ capability for understanding, reasoning, and cross-domain applications. However, KGs often face limitations in scale and quality, exhibiting incompleteness in not only missing relations (targeted by link prediction), but also missing semantic class information for numerous entities, which is equally critical for schema-level semantics and downstream reasoning. There are two main issues in KG entity classification. First, common feature representation learning methods possess a ‘black box’ nature, undermining the inherent interpretability and complex semantic structure of KGs. Second, despite many studies addressing the importance of difficulty information of each instance for enhancing classifier performance, existing models often treat all entities uniformly, ignoring the impact of varying classification difficulty on the learning process. To address these issues, we propose a credible entity classification method for KGs based on the classification difficulty of entities, named CECKG. In this method, we first introduce an interpretable entity feature representation technique to preserve the original semantics of KGs, rather than directly mapping entity features to a low- or high-dimensional vector space. Moreover, to achieve credible entity classification in KGs, we incorporate the assessment idea of degree of credibility (Cr) into the design of CECKG, creating a progressive ensemble learning model that transitions from easy to difficult. The CECKG model focuses not only on the overall classification performance but also on the credibility of each entity’s prediction. To reduce the computational cost of the classification difficulty of entities, a low-cost alternative is also proposed based on the intrinsic structural properties of KGs.
A series of experimental results on five public KG datasets show that our proposed method outperforms fourteen state-of-the-art entity classification models in terms of accuracy and Cr.

Abstract:
Bipartite graphs are commonly used to model relationships between two distinct entities in real-world applications, such as user-product interactions, user-movie ratings and collaborations between authors and publications. A butterfly (a 2 × 2 bi-clique) is a critical substructure in bipartite graphs, playing a significant role in tasks like community detection, fraud detection, and link prediction. As more real-world data is presented in a streaming format, efficiently counting butterflies in streaming bipartite graphs has become increasingly important. However, most existing algorithms assume that duplicate edges are absent, an assumption that rarely holds in real-world graph streams. As a result, they tend to sample edges that appear multiple times, leading to inaccurate results. The only algorithm designed to handle duplicate edges is $\mathsf{FABLE}$, but it suffers from significant limitations, including high variance, substantial time complexity, and memory inefficiency due to its reliance on a priority queue. To overcome these limitations, we introduce $\mathsf{DEABC}$ (Duplicate-Edge-Aware Butterfly Counting), an innovative method that uses bucket-based priority sampling to accurately estimate the number of butterflies, accounting for duplicate edges. Compared to existing methods, $\mathsf{DEABC}$ significantly reduces memory usage by storing only the essential sampled edge data while maintaining high accuracy. We provide rigorous proofs of the unbiasedness and variance bounds for $\mathsf{DEABC}$, ensuring that it achieves high accuracy. We compare $\mathsf{DEABC}$ with state-of-the-art algorithms on real-world streaming bipartite graphs. The results show that our $\mathsf{DEABC}$ outperforms existing methods in memory efficiency and accuracy, while also achieving significantly higher throughput.
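
As a point of reference for what is being estimated, the sketch below counts butterflies exactly on a stream that may repeat edges, by first collapsing duplicates and then counting pairs of common neighbours. It is an offline baseline that stores every distinct edge, not the bucket-based priority sampling described in the abstract; the `(left, right)` tuple format and the function name are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations


def count_butterflies(edge_stream):
    """Exact butterfly count for a bipartite edge stream with possible duplicates.

    edge_stream: iterable of (left_vertex, right_vertex) pairs.
    """
    # Using sets collapses duplicate edges, so repeats do not inflate the count.
    right_nbrs = defaultdict(set)
    for left, right in edge_stream:
        right_nbrs[left].add(right)

    total = 0
    # Two left vertices with c common right neighbours form C(c, 2) butterflies.
    for u, w in combinations(list(right_nbrs), 2):
        c = len(right_nbrs[u] & right_nbrs[w])
        total += c * (c - 1) // 2
    return total


if __name__ == "__main__":
    stream = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"),
              ("u1", "v1"), ("u2", "v2")]  # last two edges are duplicates
    print(count_butterflies(stream))
```

The pairwise loop is quadratic in the number of left vertices, which is exactly the kind of cost that motivates sampling-based estimators on large streams.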

Abstract:
In recent years, the missing data problem in multi-view multi-label classification (MvMlC) has attracted extensive attention from researchers, with numerous solutions for partial multi-view incomplete multi-label classification (PMvIMlC) emerging. Nevertheless, two critical challenges persist. One is suboptimal coarse-grained multi-view fusion: traditional dynamic fusion at the view level is unable to accommodate the practical fusion demands of samples with diverse qualities. The other is neglecting latent information within missing labels: during the training phase, existing works only focus on the limited supervised information of observed labels while ignoring the underlying information at missing positions. To address these issues, we propose Evidential Reliable Fusion for Partial Multi-view Incomplete Multi-label Classification, termed ERF. ERF comprises two core modules: 1) an uncertainty-guided fusion module via evidence theory and 2) an adaptive negative label pseudo-labeling module. The former quantifies sample-level uncertainty of each view based on evidence theory, which is then used to guide multi-view fusion, enabling a fine-grained, instance-level multi-view fusion scheme. For the latter, leveraging the model’s perception ability for neighboring samples in the label space, we design a strategy to select reliable negative pseudo-labels. This module enhances supervisory information to aid model training by recovering reliable negative pseudo-labels. Extensive experiments demonstrate that our ERF delivers significantly superior classification performance over existing methods.

Abstract:
Graph anomaly detection (GAD) is critical for identifying abnormal nodes in graph-structured data from diverse domains, including cybersecurity and social networks. Existing GAD methods often follow the “one-model-for-one-dataset” learning paradigm, requiring dataset-specific training to achieve optimal performance. However, this paradigm suffers from limitations, such as high computational and data costs, limited generalization and transferability to new datasets, and challenges in privacy-sensitive scenarios where access to full datasets or sufficient labels is restricted. To address these limitations, we propose a novel generalist GAD paradigm that aims to develop a unified model capable of detecting anomalies on multiple unseen datasets without retraining/fine-tuning or customization. To this end, we propose a few-shot generalist GAD method with three key designs, namely feature Alignment, a Residual encoder, and in-Context learning, abbreviated as ARC. As a generalist approach, ARC only requires a few labeled normal samples during prediction on any unseen graphs. Specifically, ARC consists of three modules: a feature Alignment module to unify and align features across datasets, a Residual graph encoder to capture dataset-agnostic anomaly representations, and a cross-attentive in-Context learning module to score anomalies using few-shot normal context. Building on ARC, we further introduce $\mathrm{ARC}_{\mathrm{zero}}$ for the zero-shot generalist GAD setting, which selects representative pseudo-normal nodes via a pseudo-context mechanism and thus enables fully label-free inference on unseen datasets. Experiments on 17 real-world datasets demonstrate that ARC and $\mathrm{ARC}_{\mathrm{zero}}$ effectively detect anomalies, exhibit strong generalization ability, and perform efficiently under few-shot and zero-shot settings.

Affiliations: National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; School of Computer Science and Jiangsu Provincial Key Laboratory of Internet of Things Intelligent Perception and Computing, Nanjing University of Posts and Telecommunications, Nanjing, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong

Abstract:
Cross-chain transaction processing is pivotal to blockchain interoperability, enabling coordinated state transitions across multiple blockchains to support increasingly complex decentralized applications (dApps). However, existing atomicity-preserving mechanisms, predominantly based on two-phase commit (2PC) protocols, are hindered by sequential coordination, prolonged state locking, and high susceptibility to cascading aborts. These limitations severely degrade throughput and latency under contention, undermining practical deployability. This paper proposes Furion, a novel cross-chain transaction processing mechanism that achieves both atomicity and efficiency. Furion introduces multi-future exploration, a new execution paradigm that explicitly materializes multiple possible futures of cross-chain states via multi-versioning. By speculatively executing transactions across feasible state evolutions, Furion eliminates blocking on unresolved dependencies and fundamentally avoids cascading aborts. To further unlock concurrency in the finalization phase, Furion employs preemptive voting, which allows local transactions to cast commit or abort votes early when their outcomes are invariant across all state versions. Experimental evaluations demonstrate that Furion significantly outperforms state-of-the-art systems, achieving substantially higher throughput, lower latency, and markedly reduced abort rates under skewed and highly contended workloads.

Affiliations: School of Computing Technologies, RMIT University, Melbourne, VIC, Australia; Department of Mathematics, City University of Hong Kong, Hong Kong, SAR, China; School of Information and Communication Technology, Griffith University, Southport, QLD, Australia; Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, Jinhua, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; School of Computer Science and Technology, Tianjin University, Tianjin, China; Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

Abstract:
Recent years have witnessed fast developments of graph neural networks (GNNs) that have benefited myriad graph analytic tasks and applications. Most GNNs rely on the homophily assumption that nodes belonging to the same class are more likely to be connected. However, as a ubiquitous graph property in numerous real-world scenarios, heterophily, i.e., nodes with different labels tend to be linked, significantly limits the performance of tailor-made homophilic GNNs. Hence, GNNs for heterophilic graphs are gaining increasing research attention to enhance graph learning with heterophily. In this paper, we provide a comprehensive review of GNNs for heterophilic graphs. Specifically, we propose a systematic taxonomy that governs existing heterophilic GNN models, along with general summaries and detailed analyses. Furthermore, we discuss the relationship between heterophily and various graph research domains, aiming to facilitate the development of more effective GNNs across a spectrum of practical applications and learning tasks in the graph research community. In the end, we point out potential directions to advance and inspire future research and applications on heterophilic graph learning with GNNs.
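
The homophily assumption described above is often quantified with an edge homophily ratio, the fraction of edges whose endpoints share a label. The sketch below computes it; this particular measure is a common convention assumed here for illustration and is not specified in the abstract, and the function name and toy graphs are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label.

    edges: list of (u, v) pairs; labels: dict mapping node -> class label.
    Values near 1 suggest homophily, values near 0 suggest heterophily.
    """
    if not edges:
        raise ValueError("edge homophily is undefined for a graph with no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


if __name__ == "__main__":
    labels = {0: "a", 1: "a", 2: "b", 3: "b"}
    print(edge_homophily([(0, 1), (1, 2), (2, 3)], labels))   # mostly same-label edges
    print(edge_homophily([(0, 2), (0, 3), (1, 2)], labels))   # only cross-label edges
```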

Abstract:
Spatio-temporal time series (STTS) play a crucial role in domains such as traffic forecasting, energy scheduling, and financial analysis. However, accurate and efficient prediction remains challenging due to the complex dynamic dependencies across temporal and spatial dimensions. Existing Graph Neural Networks (GNNs) often struggle to balance effectiveness and efficiency when modeling dynamic spatio-temporal relations. Meanwhile, Large Language Models (LLMs) exhibit strong capabilities in modeling long-range dependencies and generalization under few-shot and zero-shot conditions, yet their ability to capture spatio-temporal structures remains limited. To address this, we propose LAD-SGNN, a unified framework that integrates the advantages of graph-based and language-based modeling. Specifically, LAD-SGNN employs Spectral Graph Convolution on the Stiefel manifold (SGSC) together with Linear Dynamic Graph Optimization (LDGOSM) to efficiently extract dynamic spatio-temporal features with nearly linear complexity. Structured prompts and a spatio-temporal alignment mechanism are then designed to fuse the extracted dynamic spectral information with task semantics, which are fed into a lightweight LLM to achieve unified modeling of spatio-temporal structures and semantic reasoning. In this framework, SGSC efficiently represents dynamic spatial dependencies in the spectral domain, while the prompt-based alignment strategy explicitly injects spatio-temporal information into the LLM, thereby enhancing generalization across datasets. Extensive experiments on multiple real-world spatio-temporal datasets demonstrate that LAD-SGNN significantly improves prediction accuracy in both zero-shot and few-shot scenarios.

Abstract:
Multi-label feature selection plays a critical role in data management and analysis by reducing feature dimensionality while preserving discriminative capability. However, real-world multi-label datasets commonly exhibit label coverage imbalance, causing feature evaluation to be dominated by labels with high coverage. Moreover, feature redundancy is typically estimated using averaged dependency measures, which underestimate dominant redundant relationships under heterogeneous information scales. To address these challenges, we propose a multi-label feature selection method, termed Complementary and Redundancy-Aware Feature Selection for Imbalanced Coverage (CIRFS). CIRFS introduces a coverage-aware label weighting strategy that explicitly models label coverage and normalized label frequency to dynamically mitigate well-covered label dominance. In addition, it adopts a maximum redundancy ratio criterion to characterize feature redundancy from a worst-case information perspective, enabling accurate identification of dominant redundant relationships. Furthermore, mutual information (MI) and stabilized conditional mutual information (CMI) are jointly integrated to capture complementary aspects of feature-label information that cannot be fully characterized by either measure alone. Experiments on 14 real-world multi-label datasets demonstrate that CIRFS outperforms nine representative feature selection methods across four evaluation metrics.
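
To illustrate why averaging can hide a dominant redundant relationship, the sketch below estimates mutual information between discrete variables from counts and compares the mean and the maximum redundancy of a candidate feature against already selected features. It is not the CIRFS criterion (the redundancy ratio, coverage weighting, and stabilized CMI are not defined in the abstract); all names and the toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def redundancy_scores(candidate, selected):
    """Return (mean, max) mutual information between a candidate and selected features."""
    scores = [mutual_information(candidate, f) for f in selected]
    return sum(scores) / len(scores), max(scores)


if __name__ == "__main__":
    candidate = [0, 0, 1, 1]
    selected = [[0, 0, 1, 1],   # identical to the candidate: fully redundant
                [0, 1, 0, 1]]   # independent of the candidate
    print(redundancy_scores(candidate, selected))
```

In this toy case the averaged score is half of the worst-case score, even though the candidate duplicates an already selected feature.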

Abstract:
Ordinal classification, where labels follow a natural order, has gained increasing attention, particularly in the deep learning community, due to its relevance in tasks such as age estimation, medical grading, and quality assessment. Despite the growing number of deep ordinal classification methods, a comprehensive experimental analysis of their core ordinal components remains lacking. This work presents a systematic evaluation of deep ordinal classifiers by analysing the impact of three key modelling choices: the loss function, output layer, and labelling strategy. To analyse their effects, we adopt a unified architecture and evaluate one nominal and 19 ordinal configurations, resulting from the combination of two loss functions, two output layers, and five labelling strategies. These configurations are assessed on 12 diverse ordinal image datasets using six performance metrics, including both ordinal and nominal measures. Results show that ordinal output layers consistently outperform softmax, and that soft labelling generally improves generalisation. While categorical cross-entropy achieves better average performance, especially on nominal metrics, no configuration performs best across all datasets. Statistical analyses indicate significant interactions between losses, outputs, labelling strategies, and datasets, highlighting the need to adapt methodological choices to specific tasks. These findings provide valuable guidance for designing robust deep ordinal classification models.

Abstract:
Isolation Levels (ILs) act as correctness contracts between applications and database management systems (DBMSs). The complex code logic and concurrent interactions among transactions make it hard to expose violations of the various ILs claimed by DBMSs. With the recent proliferation of new DBMSs, especially cloud ones, there is an urgent demand for a general way to detect bugs that violate various ILs. The core challenges come from three requirements: (a) lightweight operation (verifying without modifying the application logic in workloads or the source code of DBMSs), (b) generality (verifying various ILs), and (c) efficiency (performing efficient verification on a long-running workload). To this end, we propose Leopard, a practical bug-finding tool. To be lightweight, we infer transaction dependencies from the time intervals of operations collected on the client side, without modifying the source code of DBMSs. For generality, based on a thorough analysis of existing concurrency control protocols, we summarize and abstract four mechanisms that can implement ILs in all commercial DBMSs we have investigated. For efficiency, we design a two-level pipeline to organize and sort massive time intervals in a time- and memory-conservative way, and we propose a mechanism-mirrored verification that simulates the concurrency control protocols implemented in DBMSs to achieve high throughput. Experimental results show that Leopard outperforms the existing methods Cobra and Elle. In practice, Leopard can verify various ILs on any workload running on all commercial DBMSs. Moreover, it has discovered 49 bugs undetected by other existing methods.

Abstract:
Recently, research on fairness in recommendation systems has garnered widespread attention. Moreover, numerous fair recommendation models have been developed for scenarios with limited sensitive information, thereby alleviating the issue of missing sensitive information. However, the performance of these methods still tends to decline significantly when sensitive attributes are extremely scarce. In this paper, we propose FairCL, a novel fair recommendation framework designed to perform effectively under limited sensitive attribute information. FairCL features a contrastive learning-based sensitive attribute encoder that can be integrated with existing fair recommendation algorithms. By leveraging both collaborative information and item side information, we predict unknown sensitive attributes and apply contrastive learning for sensitive attribute modeling. Furthermore, we theoretically demonstrate how FairCL can be integrated with mutual information-based and adversarial learning-based fairness algorithms. Extensive experiments on three real-world datasets show that FairCL significantly enhances fairness, even when only a small portion of users’ sensitive attributes are known.

Abstract:
Crowdsourcing, which decomposes tasks or projects through the internet platform and outsources them to a large number of people, has gained significant academic and industry interest due to its efficiency. Although prior theoretical studies have employed evolutionary game theory to study crowdsourcing systems, existing studies have predominantly focused on the game interactions between crowdsourcing worker and requester, often neglecting the platform as a stakeholder. Here we propose an evolutionary game model for tripartite crowdsourcing, accounting for bounded rationality and potential collusion between platform and crowdsourcing worker, with incentives designed to counteract free-riding and deceptive feedback. Our theoretical results indicate that the system can achieve a desired state where workers provide high-quality work, the platform always implements diligent monitoring, and requesters adopt integrity strategies. Moreover, when the platform consistently adopts the monitoring strategy, the system exhibits oscillatory dynamics, with the probabilities of the worker providing high-quality work and the requester adopting an integrity strategy fluctuating within a bounded interval. Finally, we present numerical examples to validate our theoretical results.

Abstract:
Within the domain of data mining, one critical objective is the discovery of sequential rules with high utility. The goal is to discover sequential rules that exhibit both high utility and strong confidence, which are valuable in real-world applications. However, existing high-utility sequential rule mining algorithms suffer from redundant utility computations, as different rules may consist of the same sequence of items. When these items can form multiple distinct rules, additional utility calculations are required. To address this issue, this study proposes a sequential rule mining algorithm that utilizes segmentation guided by confidence (RSC), which employs confidence-guided segmentation to reduce redundant utility computation. It adopts a method that precomputes the confidence of segmented rules by leveraging the support of candidate subsequences in advance. Once the segmentation point is determined, all rules with different antecedents and consequents are generated simultaneously. RSC uses a utility-linked table to accelerate candidate sequence generation and introduces a stricter utility upper bound, called the reduced remaining utility of a sequence, to address sequences with duplicate items. Finally, the proposed RSC method was evaluated on multiple datasets, and the results demonstrate improvements over state-of-the-art approaches.

Abstract:
The rapid spread of fake news on social media has significantly increased the importance of computational detection methods. Graph-based approaches, particularly Graph Neural Networks (GNNs), have emerged as powerful tools for modeling news propagation patterns. Despite their potential, current GNN-based methods still face challenges in robustness and interpretability due to two key shortcomings: they inadequately filter out irrelevant user-induced noise within propagation graphs, and their shallow architectures fail to effectively capture the intricate long-range dependencies characteristic of news propagation. To overcome these limitations, we propose NEGT (Noise-filtering Enhanced Graph Transformer), a novel graph Transformer framework explicitly designed for fake news detection. NEGT introduces a noise-augmented information bottleneck strategy embedded within its self-attention mechanism, effectively identifying and removing task-irrelevant interactions. Additionally, we propose a novel relational propagation graph encoding strategy that explicitly captures multi-scale user relationships and propagation depth, enabling NEGT to model long-sequence propagation dependencies accurately. Experiments on various benchmark datasets show that NEGT surpasses current methods in accuracy, noise robustness, and interpretability.

Abstract:
In-memory dynamic graph processing faces three critical challenges: limited DRAM capacity, inefficient concurrent update/query handling, and vulnerability to crashes. Traditional segment-level systems struggle with write amplification on emerging Storage-Class Memory (SCM), while existing persistent-memory systems suffer from coarse-grained synchronization and high recovery overhead. This study presents the Storage-Class Memory Dynamic Graph (SMDG) processing framework, an architecture-level redesign centered on the block as the atomic unit across storage, concurrency, and recovery. The system addresses these challenges through three key innovations. First, a block-granular storage design organizes adjacency data at fixed-size block granularity on heterogeneous DRAM-SCM architecture, employing buffered batched writes to significantly reduce write amplification while preserving logarithmic update complexity. Second, block-level multi-version concurrency control maintains timestamped block versions under per-vertex read-write synchronization to provide task-ordered snapshot visibility for concurrent queries without copying entire vertices or pages. Third, a block-granular crash recovery protocol with decentralized per-vertex logs enables independent parallel reconstruction, ensuring application-level semantic consistency while achieving substantially faster recovery than sequential approaches. Experimental results validate that this unified block-granular design improves update efficiency, sustains mixed update-query workloads with controlled memory overhead, and accelerates crash recovery compared with prior dynamic graph systems.

Abstract:
Multi-behavior recommender systems have demonstrated their effectiveness in mitigating issues such as data sparsity by incorporating auxiliary behaviors into the target behavior. However, existing multi-behavior recommendation approaches typically take one of two directions: (1) fusing behavior-specific preference features from various behavior interaction graphs explicitly or implicitly for recommendation; or (2) utilizing behavior-unified preference features from the unified interaction graph for recommendation or to initialize features for subsequent modeling. These methods fail to exploit the integration of behavior-unified global and behavior-specific local preference features, resulting in incomplete preference modeling. To address this issue, in this work, we propose a novel method called Unifying Global and Local Preferences (UGLP) for multi-behavior recommendation. In UGLP, we design a behavior feature fusion network that consists of global and local fusion modules for comprehensive and fine-grained user preferences. The global fusion module performs graph convolution on behavior-unified global and behavior-specific local interaction graphs to obtain behavior-unified and behavior-specific features. The behavior-unified and behavior-specific features are then fused into globally fused features via a gating network. The local fusion module then performs cross-behavior fusion on these globally fused features via another gating network. We introduce a contrastive learning module to promote preference alignment and knowledge transfer from auxiliary behaviors to the target behavior. Additionally, we incorporate a GCN refinement module to adjust the fused features to ensure that both global and local user preferences are learned. Experimental results on three real-world datasets verify that our method is able to surpass various state-of-the-art models. For instance, our method outperforms the best baseline by an average of 16.43% and 14.36% in terms of HR@10 and NDCG@10, respectively.
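
As a point of reference for the reported metrics, HR@10 and NDCG@10 are often computed per user with a single held-out target item. The sketch below assumes that leave-one-out setup, which the abstract does not state; the function names and the sample list are illustrative only.

```python
import math

def hit_ratio_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG with one relevant item: 1 / log2(rank + 2) for a 0-based rank."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)
    return 1.0 / math.log2(rank + 2)

# Hypothetical ranked recommendation list for one user.
ranked = ["i7", "i3", "i9", "i1"]
hr = hit_ratio_at_k(ranked, "i9", k=10)
ndcg = ndcg_at_k(ranked, "i9", k=10)
```

Averaging both values over all test users gives the dataset-level scores; the ideal DCG is 1 here because only one item is relevant.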

Abstract:
Spatial crowdsourcing platforms have become indispensable in addressing the evolving needs of modern society. These platforms facilitate essential services such as ride-sharing, on-demand food delivery, and efficient parcel distribution. However, the uneven distribution of workers and requests under a single-platform setting may lead to the loss of tasks. To address this issue, we introduce the Cross Online Matching (COM) problem, which facilitates collaboration among multiple platforms. We first propose DemCOM and RamCOM, which adopt deterministic greedy and randomized trade-off strategies, respectively. Furthermore, we develop a Utility-Distribution Aware Cooperative Online Matching (UDACOM) algorithm that leverages supply-demand relationships to optimize decision-making. Theoretical analysis confirms the competitive ratios of our algorithms. Validated on both real and synthetic datasets, our approach significantly outperforms state-of-the-art methods, achieving a 5% increase in total revenue and a 3% improvement in the successful matching rate.

Abstract:
Spectral clustering has received widespread attention for its effectiveness in handling nonconvex geometries. The classic methods of spectral clustering include Ratio Cut (Rcut), Normalized Cut (Ncut), and Min-Max Cut (MMcut). Among them, the objective function of MMcut is more reasonable. Unfortunately, existing methods cannot solve the MMcut problem without relaxing the discrete or nonnegative constraints. To this end, based on coordinate descent, we propose a basic optimization algorithm that solves the MMcut problem without relaxing any constraint. We then propose a fast version of the solver to improve computational efficiency. In addition, we prove the convergence of the proposed solver, evaluate its computational complexity, and discuss the connection between MMcut and Ncut. Finally, extensive experiments are performed to evaluate the effectiveness of our proposed method.
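
To make the three objectives concrete, the following minimal sketch evaluates Rcut, Ncut, and MMcut for a given discrete partition using their textbook definitions. It is not the coordinate descent solver from the paper; the function name and the toy matrix are assumptions, conventions in the literature differ by constant factors, and every cluster is assumed to have nonzero within-cluster weight.

```python
def cut_objectives(W, labels):
    """Return (Rcut, Ncut, MMcut) for a symmetric similarity matrix W
    (list of lists) and one cluster label per node."""
    n = len(W)
    rcut = ncut = mmcut = 0.0
    for c in sorted(set(labels)):
        inside = [i for i in range(n) if labels[i] == c]
        outside = [i for i in range(n) if labels[i] != c]
        cut = sum(W[i][j] for i in inside for j in outside)    # weight leaving the cluster
        within = sum(W[i][j] for i in inside for j in inside)  # weight inside the cluster
        vol = cut + within                                     # total degree of the cluster
        rcut += cut / len(inside)
        ncut += cut / vol
        mmcut += cut / within  # assumes within > 0
    return rcut, ncut, mmcut

# Toy graph: two tight pairs (0,1) and (2,3) joined by a weak edge (1,2).
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
rcut, ncut, mmcut = cut_objectives(W, [0, 0, 1, 1])
```

Because vol = cut + within, each Ncut term equals m / (1 + m), where m is the matching MMcut term; this is one elementary relation between the two objectives.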

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Huawei Technologies Company, Ltd., Hangzhou, China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China; Yangtze Delta Region Institute (Quzhou), School of Computer Science and Engineering, University of Electronic Science and Technology of China, Quzhou, China; Hong Kong University of Science and Technology, Hong Kong, SAR, China

Abstract:
Velocity control in autonomous driving is an emerging technology that has achieved rapid progress over the last decade. However, existing velocity control studies ignore cascading disturbances in multi-lane scenarios and usually ignore the negative impact caused by harsh velocity decisions. To address these issues, we propose a reinforcement learning-based framework, called RISE (contRol velocIty for autonomouS vEhicle) to make velocity decisions for an autonomous vehicle in multi-lane traffic scenarios. To detect latent disturbances in the traffic flow, we propose a novel state encoder to learn the spatio-temporal correlation between different vehicles based on a well-designed impact graph. Afterward, we introduce an actor-critic paradigm to make velocity decisions with the aid of a hybrid reward function considering four optimization objectives: safety, efficiency, comfort, and impact. In particular, the impact term can penalize the harsh decisions of the autonomous vehicle, thus encouraging it to reduce the negative impacts on traffic flow. Further, we propose an improved RISE (RISE++) framework that incorporates a motion prediction model to augment state features for reasonable decisions, a modification of the reward function for energy efficiency, and a multi-worker paradigm for training efficiency. Extensive experiments offer evidence that the proposed framework can advance the state of the art in terms of effectiveness and efficiency.

Abstract:
Sequential recommendation aims to derive insights from user interaction records and make predictions based on relationships between users and items. However, most existing approaches do not effectively integrate user intents and preferences, which limits their capability to capture user behavior patterns. Additionally, these methods often perform poorly in sparse data scenarios. To address these challenges, we propose EL4SR, a sequential recommendation approach that integrates intent understanding and preference learning. EL4SR simultaneously learns user intents and preferences through a dual-channel recommendation module, modeling both item and popularity sequences to enable mutual learning that captures the combined effects of intent and preference. Moreover, we enhance intent learning through contrastive learning, improving adaptability in sparse data contexts. We design several augmentation operators to improve the performance and robustness of EL4SR. Extensive experiments on MovieLens and Amazon demonstrate the performance of our proposed method across various scenarios.

Abstract:
Time series anomaly detection is crucial in fields such as industrial monitoring, financial risk management, and network security. Graph Neural Networks (GNNs) have demonstrated strong capabilities in capturing multivariate dependencies. However, existing methods often fail to adequately account for the temporal proximity between adjacent time points and are susceptible to the influence of weak or noisy connections during graph-based representation learning. To address these challenges, we propose LaGraph, a novel framework that integrates GNNs with a mask-optimized attention mechanism. Specifically, LaGraph decomposes input sequences into stable and trend components using an Expert Decomposition Block. The trend component is processed via a Multi-layer Convolution Block, while the stable component is modeled with a Proximity-enhanced Graph Convolutional Network that incorporates a Laplacian kernel to capture local temporal dependencies. Additionally, a Mask-optimized Multi-head Attention Block, based on the Straight-Through Estimator (STE), mitigates the negative effects of less informative edges, enhancing both representation quality and reconstruction performance. Extensive experiments on five real-world benchmark datasets demonstrate that LaGraph consistently outperforms state-of-the-art methods, verifying its effectiveness and superiority for time series anomaly detection.
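
The role of a Laplacian kernel in capturing temporal proximity can be illustrated with a small sketch. The abstract does not specify how LaGraph wires the kernel into its graph convolution, so the code below only shows the kernel on time indices; the function name and the bandwidth `sigma` are assumptions.

```python
import math

def laplacian_kernel_matrix(num_steps, sigma=1.0):
    """K[i][j] = exp(-|i - j| / sigma): weight decays with temporal distance."""
    return [[math.exp(-abs(i - j) / sigma) for j in range(num_steps)]
            for i in range(num_steps)]

# Proximity weights among four consecutive time points.
K = laplacian_kernel_matrix(4, sigma=2.0)
```

Adjacent time points receive the largest off-diagonal weights, so a model that uses such a matrix as edge weights emphasizes local temporal dependencies.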

Abstract:
Fake news detection is a hot topic in the social media mining research community. Recent studies have shown that sentiment signals can significantly benefit detection performance. However, most existing methods treat sentiment merely as an auxiliary feature, while the more sophisticated social sentiment interactions have rarely been explored. In this paper, we propose a novel framework named ReFEND, which leverages the sentiment resonances among social users (i.e., social sentiment resonances) and the sentiment relationship between news content and user comments to improve detection performance. Specifically, we first utilize sentiment scorers to assess the sentiment of comments and identify users’ emotional tendencies. Then we construct a sentiment-aware multi-relational graph to capture the social sentiment resonances evoked by the content and the interactions between comments and news. Next, we leverage the relational graph convolutional network (RGCN), which specializes in handling multi-relational graph data, to learn the interactions on the sentiment-aware graph. To the best of our knowledge, this is the first effort to leverage social sentiment resonances for fake news detection. Experimental results on three datasets indicate that ReFEND significantly outperforms the state-of-the-art sentiment-based methods in terms of F1 and accuracy. In addition, ablation studies demonstrate the effectiveness of the components designed in ReFEND.

Abstract:
Inductive knowledge graph completion (KGC) aims to represent unseen entities and complete triplets in emerging knowledge graphs (KGs). However, existing studies ignore that unseen elements combined with seen ones constitute a holistic new relational graph, in which emerging KGs inevitably affect the original ones. Therefore, it is necessary not only to predict triplets in emerging KGs but also to further improve the completeness of the original ones, considering the semantic and topological variations in the holistic new graph. To fill this gap, we formulate a new IT (Inductive-Transductive) KGC task to transductively complete triplets inside original KGs after entities in the emerging scenario are represented and fine-tuned in an inductive manner. To handle this task, we propose a novel model named StaR (Self-adaptive Retroaction-aware Representation), which consists of two modules: 1) a self-adaptive semantic encoding network that adaptively adjusts the embeddings of seen entities to their surrounding semantic mutations; 2) a relation-aware transformer layer that represents both seen and unseen entities in a unified representation space and generalizes evolving reasoning paradigms to the whole graph. Our experimental results demonstrate that, compared with state-of-the-art methods, StaR is not only competitive in inductive KGC for unseen entities but also further improves the completeness of the original parts inside the holistic new relational graph in our IT KGC task.

Abstract:
With the widespread adoption of collaborative filtering techniques for personalized recommendations, exposure bias has become a significant challenge. Exposure bias refers to the tendency of recommendation models to disproportionately favor items with high exposure over those with low exposure. In graph collaborative filtering that uses graph neural networks (GNNs) for recommendations, exposure bias can be exacerbated due to 1) the reliance on positive feedback during graph construction and 2) the effects of the neighbor aggregation step in GNNs. To tackle this challenge, we propose a novel and efficient framework called FUGCF (training-Free and Unbiased Graph Collaborative Filtering) to improve both the accuracy and bias mitigation of graph-based personalized recommendations. FUGCF employs a two-stage calculation strategy: it estimates exposure probabilities in the first stage and then leverages these exposure probabilities to help derive debiased node embeddings in the second stage. Furthermore, we design a training-free estimation method for FUGCF based on closed-form solutions to enhance its computation efficiency. The extensive experiments on a synthetic dataset and three real-world datasets demonstrate the effectiveness of FUGCF in reducing exposure bias, improving recommendation accuracy, and optimizing computation efficiency.

Abstract:
Multimodal recommender systems (RSs) represent items in the catalog through multimodal data (e.g., product images and descriptions) that, in some cases, might be noisy or (even worse) missing. In those scenarios, the common practice is to drop items with missing modalities and train the multimodal RSs on a subsample of the original dataset. To date, the problem of missing modalities in multimodal recommendation has received limited attention in the literature and lacks a precise formalisation of the kind available for missing information in traditional machine learning. In this work, we first provide a problem formalisation for missing modalities in multimodal recommendation. Second, by leveraging the user-item graph structure, we re-cast the problem of missing multimodal information as a problem of graph feature interpolation on the item-item co-purchase graph. On this basis, we propose four training-free approaches that propagate the available multimodal features throughout the item-item graph to impute the missing features. Extensive experiments on popular multimodal recommendation datasets demonstrate that our solutions can be seamlessly plugged into any existing multimodal RS and benchmarking framework while still preserving (or even widening) the performance gap between multimodal and traditional RSs. Moreover, we show that our graph-based techniques can perform better than traditional imputations in machine learning under different missing modalities settings. Finally, we analyse (for the first time in multimodal RSs) how feature homophily calculated on the item-item graph can influence our graph-based imputations.
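
The idea of propagating available features over the item-item graph can be sketched as follows. This is a generic neighbor-averaging scheme under stated assumptions, not necessarily any of the four approaches proposed in the paper; the function name, the dictionary-based graph, and the iteration count are all illustrative.

```python
def propagate_features(neighbors, features, num_iters=50):
    """Impute missing item features by repeated neighbor averaging.

    neighbors: dict item -> list of neighboring items in the item-item graph
    features:  dict item -> list of floats, or None when the modality is missing
    """
    dim = len(next(f for f in features.values() if f is not None))
    known = {i for i, f in features.items() if f is not None}
    # Missing features start at zero; known ones keep their observed values.
    current = {i: list(f) if f is not None else [0.0] * dim
               for i, f in features.items()}
    for _ in range(num_iters):
        updated = {}
        for item in current:
            nbrs = neighbors.get(item, [])
            if item in known or not nbrs:
                updated[item] = current[item]  # clamp known (or isolated) items
            else:
                updated[item] = [sum(current[n][d] for n in nbrs) / len(nbrs)
                                 for d in range(dim)]
        current = updated
    return current

# Toy co-purchase chain a - b - c where item b has no visual features.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": None, "c": [3.0, 2.0]}
imputed = propagate_features(neighbors, features)
```

No training is involved: the imputed vector for `b` is simply the average of its neighbors' observed features, and observed features are never overwritten.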

Abstract:
Data cleaning (DC) is a crucial yet challenging step for many data engineering tasks. Traditional pre-configuration DC methods rely heavily on predefined rules or constraints, demanding significant domain knowledge and manual effort. While configuration-free DC approaches have been explored, they still demand extensive feature engineering or labeled data for intensive model training. In this paper, we propose a zero-training and interpretable DC system, named ZeroDC, that leverages large language models (LLMs) to generate data cleaning rules and chain-of-thoughts (CoTs), without the need for model training. ZeroDC consists of two modules, iterative detection rule generation (IDG) and training-free explainable correction (TEC). To generate high-quality error detection rules with minimal human feedback, IDG first bootstraps a set of rules via contrastive rule initiation on sampled syntactic and semantic contrastive pairs. It then progressively enhances them through an iterative rule refinement workflow that selects the most informative elements for updates. TEC constructs a contextual-relevant tuple retriever using a weighted cosine similarity function to efficiently identify the most relevant tuples for each dirty value, reducing redundancy in the LLM prompts and lowering computational costs. It further prompts for generating correction CoTs for user-corrected representative values, as well as prompts for creating correction rules and explainable corrections, which automatically provide explanations for correction results, all without the need for model training. Extensive experiments conducted on various real-world datasets demonstrate that ZeroDC achieves, on average, a 5.36% increase in accuracy and an 8.16x speedup compared to state-of-the-art methods.

Abstract:
Graph-level clustering aims to partition a set of graphs into different clusters and has important applications in social networks, bioinformatics, etc. Although there have been some approaches to graph-level clustering such as various graph kernels and graph neural networks, it remains a huge challenge to select kernels and neural network architectures, since the task is unsupervised. Moreover, the clustering accuracy and model interpretability of these approaches are low and should be improved to satisfy practical needs. To address these issues, in this work, we propose a graph-level clustering method that uses Bayesian optimization to integrate various graph kernels (BOGK). BOGK aggregates the similarity matrices generated by different graph kernels and automatically learns the aggregation weights and a thresholding parameter via maximizing internal cluster validity indices. Our BOGK is free of manual hyperparameter tuning via Bayesian optimization, while it enjoys considerable interpretability, as the weight for each similarity matrix represents the importance of different structural or pattern information in graphs. Experimental results show that our BOGK outperforms the state-of-the-art on ten graph benchmark datasets.

Abstract:
In numerous artificial intelligence applications, the collaborative efforts of multiple intelligent agents are imperative for the successful attainment of target objectives. To enhance coordination among these agents, a distributed communication framework is often employed, wherein each agent must be capable of encoding information received from the environment and determining how to share it with other agents as required by the task at hand. However, indiscriminate information sharing among all agents can be resource-intensive, and the adoption of manually pre-defined communication architectures imposes constraints on inter-agent communication, thus limiting the potential for effective collaboration. Moreover, the communication framework often remains static during inference, which may result in sustained high resource consumption, as in most cases, only key decisions necessitate information sharing among agents. In this study, we propose a novel approach where the communication structure between agents is represented as a learnable graph. We frame this challenge as the task of identifying the optimal communication graph while allowing the architecture parameters to be updated through regular optimization, which requires a bi-level optimization process. By applying continuous relaxation to the graph structure and integrating attention mechanisms, our method, CommFormer, effectively optimizes the communication graph and simultaneously refines the architectural parameters via gradient descent in an end-to-end manner. Additionally, we introduce a temporal gating mechanism for each agent, enabling dynamic decisions on whether to receive shared information at a given time, based on current observations, thus improving decision-making efficiency. Comprehensive experiments conducted across a range of cooperative tasks demonstrate the robustness of our model. Our approach enables agents to develop more coordinated and sophisticated strategies, maintaining effectiveness even with varying agent counts.

Abstract:
Given a social network $G=(V,E)$, the unconstrained profit maximization problem aims to identify a subset $S \subseteq V$ that maximizes the net profit, defined as the expected influence spread $\Gamma(S)$ of set $S$ minus the associated cost $c(S)$, i.e., $\Gamma(S) - c(S)$. However, this problem presupposes an unlimited budget, which is often impractical in real scenarios. Motivated by this, we investigate the budgeted profit maximization (BPM) problem by adding a budget constraint. Unfortunately, addressing the BPM problem with a theoretical approximation guarantee remains relatively under-explored in the literature. In response, assuming $\Gamma(S)$ is known for any $S \subseteq V$, we propose an algorithm that guarantees returning a set $S^o$ such that $\Gamma(S^o) - c(S^o) \geq (1 - \frac{1}{e}) \frac{\Gamma(S^*)}{2} - \frac{c(S^*)}{2}$, where $S^*$ denotes an optimal solution for BPM. Then, we develop a practical solution, which uses the reverse reachable set (RR-set) technique for influence estimation, without assuming knowledge of $\Gamma(S)$, while still maintaining a strong approximation guarantee. Additionally, similar to existing RR-set-based solutions for influence cascade-related problems, our RR-set-based solution relies on generating a large number of random RR-sets to accurately estimate $\Gamma(S)$. However, the existing RR-set generation method suffers from high memory stall rates due to its irregular memory access patterns, leaving room for further efficiency improvement. Therefore, we propose a new RR-set generation method that utilizes batch execution and cache prefetching. When a memory access is required, instead of stalling while waiting for data, the CPU first issues an asynchronous prefetch request to load the target data into the cache, and then switches to processing the generation of other RR-sets within the same batch, effectively hiding memory access latency. This method can be seamlessly integrated into existing RR-set-based solutions to improve their efficiency. Finally, we conduct extensive experiments on real, large-scale datasets to demonstrate the effectiveness and efficiency of our proposed solutions.
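
The RR-set estimate of the net profit can be sketched in a few lines. The code below assumes the independent cascade diffusion model, which the abstract does not name, and shows only plain sequential RR-set generation, not the batched, prefetching variant proposed in the paper; all function names and the toy graph are illustrative.

```python
import random
from collections import deque

def random_rr_set(in_edges, nodes, rng):
    """Sample one RR-set under the (assumed) independent cascade model.

    in_edges: dict v -> list of (u, p), meaning edge u -> v is live with probability p
    """
    root = rng.choice(nodes)
    rr, queue = {root}, deque([root])
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr

def estimate_profit(seed_set, cost, rr_sets, num_nodes):
    """Estimate Gamma(S) - c(S): spread is n times the fraction of RR-sets hit by S."""
    covered = sum(1 for rr in rr_sets if rr & seed_set)
    spread = num_nodes * covered / len(rr_sets)
    return spread - sum(cost[v] for v in seed_set)

# Toy chain 0 -> 1 -> 2 where every edge is always live.
nodes = [0, 1, 2]
in_edges = {1: [(0, 1.0)], 2: [(1, 1.0)]}
cost = {0: 0.5, 1: 0.2, 2: 0.1}
rng = random.Random(0)
rr_sets = [random_rr_set(in_edges, nodes, rng) for _ in range(200)]
profit = estimate_profit({0}, cost, rr_sets, num_nodes=3)
```

In this toy chain node 0 reaches every node, so each RR-set contains it and the estimated spread is exactly 3; with probabilistic edges the estimate becomes accurate only as the number of RR-sets grows, which is why generation speed matters.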

Abstract:
Time series forecasting has become a critical task in data engineering, with the volume of time series data projected to reach 180 ZB by 2025. Traditional forecasting models are typically constrained to single domains, missing opportunities for transferring temporal patterns across different domains. Through analysis, we observe that time series from different domains, despite their distinct statistical characteristics, can be fundamentally understood through temporal dependency patterns, which manifest as either long-term dependencies (like trends and cycles) or short-term dependencies (like fluctuations and abrupt changes). This observation motivates us to rethink cross-domain modeling from the perspective of dependency preferences. We propose LSTPO, a novel framework that captures cross-domain commonalities through temporal dependency preferences and leverages a meta-learning-based approach to prevent forgetting during cross-domain training. LSTPO dynamically models changes in preference over time and swiftly adapts to preference variations across different domains, enabling robust cross-domain forecasting. Through extensive experimental evaluations, we show that LSTPO substantially outperforms state-of-the-art forecasting methods while enhancing model transferability under few-shot learning conditions.

Abstract:
Nonnegative low-rank matrix approximation is an important technique in data analysis for extracting meaningful patterns from high-dimensional nonnegative data. This nonnegative low-rank approximation problem is studied and a new block-splitting method is developed in this paper. This new method enforces the low-rank constraint by utilizing QR decomposition and adopts a semismooth Newton method to address the related convex subproblems efficiently through the dual formulation of the nonnegative low-rank matrix approximation problem. Theoretical analysis confirms the convergence of the new method. Several real datasets are used to demonstrate the efficiency of the proposed method.

Abstract:
The advancement of Internet technology has spurred a rise in the dissemination of misinformation, which has had profoundly negative impacts across a wide array of fields. To address this issue, the field of Misinformation Detection (MD), which focuses on the automated identification of online misinformation, has gained significant traction among researchers. In our study, we introduce an innovative plug-and-play augmentation technique for MD, termed DEtecting Misinformation by Uncovering Commonsense Conflict (Demuc). Our approach is grounded in previous psychological research suggesting that fake content often contains commonsense conflicts. Accordingly, we develop commonsense expressions for articles to highlight potential conflicts between the inferred commonsense triplets and the established ones derived from reliable commonsense reasoning tools. Depending on the tool used, we derive two variants: Demuc-klm, which uses the knowledge language model COMET, and Demuc-llm, which uses large language models. These generated expressions are then applied as augmentations to each article, enabling any MD method to be trained on these augmented datasets. Additionally, we have manually compiled a new dataset CoMis, which consists exclusively of fake articles characterized by commonsense conflicts. By integrating Demuc with various existing MD frameworks and evaluating them on four public benchmark datasets and CoMis, our empirical findings show that both Demuc-klm and Demuc-llm consistently and significantly outperform current MD baselines, while also generating precise commonsense expressions.

Abstract:
With the rapid growth of mobile devices and applications, a prodigious amount of spatio-temporal data is generated constantly. To process this data for applications like traffic forecasting, existing spatio-temporal systems rely on the move-data-to-computation paradigm. However, this approach incurs significant data movement overhead between hosts and storage devices, particularly when a spatio-temporal query is executed on a non-preferred data layout or when the query has a small result size due to its inherent nature. To address this issue, this work introduces Groundhog, an efficient in-storage computing technique designed specifically for spatio-temporal queries, aimed at reducing unnecessary data movement and computations. Groundhog introduces three key designs for efficient in-storage computing: (i) a self-contained and segment-based storage model, which is lightweight for in-storage computing and enables fine-grained pruning for spatio-temporal queries; (ii) a set of fine-grained techniques to optimize query processing inside storage devices for spatio-temporal queries; and (iii) an in-storage-computing-aware query planner, which offloads spatio-temporal queries in a fine-grained manner using a cost-based approach. We implemented Groundhog on real hardware and demonstrated how to apply fine-grained techniques to accelerate various spatio-temporal queries. Extensive experiments conducted on real-world datasets demonstrate that Groundhog achieves significant performance improvements, with latency reductions of up to 81% for widely used spatio-temporal queries compared to host computing solutions.
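
To make the idea of segment-level pruning concrete, the sketch below shows a generic min/max (zone-map style) filter over self-describing segments. It is only an illustration under stated assumptions: the `Segment` class, its bounding metadata, and `range_query` are hypothetical and are not taken from Groundhog's actual storage model or query planner.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t); this layout is an assumption


@dataclass
class Segment:
    """A self-describing block of points with min/max bounds per dimension."""

    points: List[Point]

    def __post_init__(self):
        xs, ys, ts = zip(*self.points)
        self.lo = (min(xs), min(ys), min(ts))
        self.hi = (max(xs), max(ys), max(ts))

    def overlaps(self, q_lo, q_hi):
        """True if the segment's bounding box intersects the query box."""
        return all(
            self.lo[i] <= q_hi[i] and q_lo[i] <= self.hi[i] for i in range(3)
        )


def range_query(segments, q_lo, q_hi):
    """Return matching points and the number of segments actually scanned.

    Segments whose bounds miss the query box are skipped without reading
    their points, which is where the saved data movement would come from.
    """
    results, scanned = [], 0
    for seg in segments:
        if not seg.overlaps(q_lo, q_hi):
            continue
        scanned += 1
        results.extend(
            p for p in seg.points
            if all(q_lo[i] <= p[i] <= q_hi[i] for i in range(3))
        )
    return results, scanned


if __name__ == "__main__":
    segments = [
        Segment([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]),
        Segment([(10.0, 10.0, 10.0), (11.0, 11.0, 11.0)]),
    ]
    print(range_query(segments, (0.5, 0.5, 0.5), (2.0, 2.0, 2.0)))
```

A query with a small result touches only the segments whose bounds intersect it, so the fewer segments overlap, the less data has to leave the storage device.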

Abstract:
Graph-based Twitter bot detectors have proven more effective than feature-based and text-based ones. Mainstream detectors only employ friend relationships, which brings two limitations: (i) friend relationships are sparse and ignore implicit interactions between users, and (ii) bots follow humans to expand their influence, challenging the homophily principle. This paper aims to learn a homophilous context graph containing implicit interactions, which faces two challenges: (i) existing homophily measures are influenced by the class distribution, which makes them unsuitable for the class-imbalanced setting of bot detection, and (ii) the existing graph learning paradigm introduces noisy neighbors and consumes computing resources. To this end, we first propose a class-independent homophily measure, which is proven to be robust to class distribution. Meanwhile, we propose HCGBot, which transforms graph learning into similarity metric learning. HCGBot contains a neighbor-mask GNN layer, which masks users that rarely interact implicitly and extracts topology and weight information from the context graph. Finally, we design a hybrid loss to optimize HCGBot, which maximizes the class-independent homophily measure while detecting bots. Extensive experiments prove that HCGBot achieves the best performance and learns a more homophilous context graph with high efficiency. Further analysis illustrates that HCGBot can detect social bots in more realistic situations.
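
The sensitivity of common homophily measures to class distribution can be seen with a small example. The sketch below computes the standard edge homophily ratio (the fraction of edges joining same-label nodes), which is not the class-independent measure proposed in the paper. The helper names and the toy complete graph are assumptions made for illustration.

```python
from itertools import combinations


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    edges = list(edges)
    if not edges:
        raise ValueError("edge_homophily needs at least one edge")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def complete_graph_edges(n):
    """All undirected edges on nodes 0..n-1; wiring ignores labels entirely."""
    return list(combinations(range(n), 2))


if __name__ == "__main__":
    edges = complete_graph_edges(10)
    # Same wiring, two different class distributions.
    balanced = {i: ("bot" if i < 5 else "human") for i in range(10)}
    imbalanced = {i: ("bot" if i < 1 else "human") for i in range(10)}
    print(edge_homophily(edges, balanced))
    print(edge_homophily(edges, imbalanced))
```

On this label-agnostic graph the ratio is about 0.44 with five bots and 0.8 with a single bot, even though the wiring is identical. The score rises purely because one class dominates, which is the kind of bias a class-independent measure is meant to remove.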

Affiliations: College of Informatics, Huazhong Agricultural University, Key Laboratory of Smart Farming for Agricultural Animals, Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; School of Computing, Binghamton University, State University of New York, Binghamton, NY, USA; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Abstract:
Link prediction for attributed graphs has garnered significant attention due to its ability to enhance predictive performance by leveraging multi-modal node attributes. However, real-world challenges such as privacy concerns, content restrictions, and attribute constraints often result in nodes facing varying degrees of missing modalities in their attributes, significantly limiting the effectiveness of existing approaches. Building on this fact, we propose a model for link PRediction in attrIbuted networkS with uncertain Modalities (PRISM), which learns the shared representations across various scenarios of missing modalities through dual-level adversarial training. PRISM comprises four modules, i.e., a GCN extractor, an adversarial extractor, an attentive fusion, and an adaptive aggregator. The GCN extractor leverages graph convolutional networks (GCN) to extract fundamental representations from the network topology. The adversarial extractor employs dual-level adversarial training to acquire the shared representations across various multi-modal scenarios at the node-level and link-level, respectively. The attentive fusion applies the multi-head attention mechanism to integrate the shared representations and the fundamental representations. The adaptive aggregator comprehensively considers both node-level and link-level representations to predict the existence of links. Experimental evaluation using real-world datasets demonstrates that PRISM significantly outperforms existing state-of-the-art link prediction methods for multi-modal attributed graphs under missing modalities by improving the Recall@50 metric (R@50) by up to 38.79%.

Abstract:
Privacy-Preserving Graph Similarity matching Query (PPGSQ) can retrieve the encrypted data graphs that approximately match the encrypted query graph from the graph database. Existing PPGSQ schemes adopt the pivot filter and the global filter to measure the similarity of two graphs, which imposes a heavy computation burden on clients and fails to filter out many mismatched data graphs. In addition, traversing the whole graph database to execute PPGSQ can greatly reduce query efficiency. To address these issues, we propose two privacy-preserving graph similarity matching query schemes in this paper. We first present a basic scheme with linear query complexity. We adopt the branch-based lower bound of edit distance to efficiently measure the similarity of two encrypted graphs, which can reduce the computation overhead for clients and improve the lower bound of MGED. To facilitate effective pruning and enhance query efficiency, we give an improved scheme by designing a novel tree-based secure index, which achieves sublinear query complexity. Our schemes achieve the necessary privacy without losing the ability to query. To further protect the number of branches/vertices, we give a succinct discussion on how to use homomorphic Paillier encryption to encrypt this number. We analyze the security of our schemes and conduct an experimental evaluation on a real-world graph database to show the efficiency of the proposed schemes.
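
For readers unfamiliar with Paillier encryption, the toy sketch below shows its additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so encrypted counts can be combined without decrypting them. This is a textbook construction with tiny, insecure primes chosen for readability. The function names are hypothetical, and how the paper actually applies Paillier to branch or vertex counts is not specified here.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two distinct primes (NOT secure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                # standard simple choice of generator
    mu = pow(lam, -1, n)     # modular inverse; needs Python 3.8+
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    """Enc(m) = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pub
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    pub, priv = keygen(1009, 1013)
    c_a = encrypt(pub, 12, rng)  # e.g. a count held by one party
    c_b = encrypt(pub, 30, rng)
    print(decrypt(pub, priv, add_encrypted(pub, c_a, c_b)))  # 42
```

Because encryption is randomized, two encryptions of the same count generally produce different ciphertexts, which is what hides the number itself from an observer.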

Abstract:
Rating prediction is a classic task in recommendation systems, aiming to accurately estimate user ratings for various items. Historical ratings typically exhibit a non-uniform distribution, leading recommendation models to favor predicting high-frequency ratings. We refer to the inconsistency between predicted ratings and users’ true preferences caused by non-uniform rating distributions as rating bias. To mitigate this bias, existing studies capture various interaction behavior patterns and employ knowledge distillation techniques to improve the network’s ability to model user preferences. However, due to the model being trained on datasets with non-uniform rating distributions, the rating bias may propagate through the knowledge distillation process across different behaviors, thereby contaminating the modeling of users’ true preferences. To this end, we propose a novel Adversarial Counterfactual Distillation (ACD) framework for the rating prediction task, aimed at eliminating rating bias. Specifically, we design a Counterfactual Distillation Module from a causal reasoning perspective to facilitate knowledge transfer across various interaction behaviors while concurrently mitigating bias contamination. Furthermore, we introduce an Adversarial Debiasing Module to dynamically adjust the debiasing strength, ensuring that the model maintains an optimal balance between effective knowledge transfer and bias mitigation. Extensive experiments demonstrate the superior performance of our proposed ACD framework.

Abstract:
Non-volatile memory (NVM), as an emerging storage technology, offers several advantageous features for OLTP engines, including byte-addressability, high capacity, low energy consumption, and data persistence across power failures. Despite these benefits, the current mainstream OLTP engines still commonly adopt a hybrid architecture that deeply couples DRAM with NVM, which results in a complex system architecture and high recovery costs. In this paper, we aim to construct a highly available, stable, and recoverable OLTP engine that guarantees ACID properties through an agile system architecture. We introduce AKV (Agile Key-Value), an NVM-only OLTP storage engine designed to provide effective space utilization, high throughput, and fast failure recovery. AKV addresses the challenges of NVM space management, write redundancy, and concurrency control with two novel techniques: dual-version concurrency control and circular dual-version storage. Experimental results demonstrate that AKV achieves higher throughput (up to 69.7%) and faster recovery (up to 54×) compared to existing storage engines in most scenarios of the TPC-C benchmarks. Additionally, the codebase of AKV (4k+ lines) is more concise than that of SOTA OLTP engines like Zen (8k+ lines) and Falcon (11k+ lines). In addition, this study innovatively proposes a read abort optimization strategy based on dynamic version changes. The experimental results show that this strategy can significantly reduce the transaction abort rate of AKV in specific workload scenarios while maintaining stable system throughput, achieving a maximum reduction of up to 73% in the abort count.

Abstract:
Urban time series, such as mobility flows, energy consumption, and pollution records, encapsulate complex urban dynamics and structures. However, data collection in each city is impeded by technical challenges such as budget limitations and sensor failures, necessitating effective data imputation techniques that can enhance data quality and reliability. Existing imputation models, categorized into learning-based and analytics-based paradigms, grapple with the trade-off between capacity and generalizability. Collaborative learning to reconstruct data across multiple cities holds the promise of breaking this trade-off. Nevertheless, urban data's inherent irregularity and heterogeneity issues exacerbate challenges of knowledge sharing and collaboration across cities. To address these limitations, we propose a novel collaborative imputation paradigm leveraging meta-learned implicit neural representations (INRs). INRs offer a continuous mapping from domain coordinates to target values, integrating the strengths of both paradigms. By imposing embedding theory, we first employ continuous parameterization to handle irregularity and reconstruct the dynamical system. We then introduce a cross-city collaborative learning scheme through model-agnostic meta learning, incorporating hierarchical modulation and normalization techniques to accommodate multiscale representations and reduce variance in response to heterogeneity. Extensive experiments on a diverse urban dataset from 20 global cities demonstrate our model's superior imputation performance and generalizability, underscoring the effectiveness of collaborative imputation in resource-constrained settings.

Abstract:
Fraud detection is a typical data mining task in the field of finance. In recent years, owing to their capability of mining hidden associations between entities, graph neural networks (GNNs) have been widely applied to detect financial fraudsters. However, the data aggregation process of GNNs is fragile and can be deliberately attacked by fraudsters, so prior work has explored methods to enhance the robustness of GNN-based fraud detection models. Most existing models, however, are built on ideal settings; because real-life criminals tend to attack as far as they can reach, these models struggle to provide a unified, effective approach for attacked and unattacked data of different scales in realistic scenarios. Furthermore, mainstream robust defense models that indiscriminately modify and truncate data lose important information from the large unattacked portion of the original graph, which lowers their overall fraud detection precision. Therefore, in this work, we propose a novel generative fraud detection framework called GSRGNN. In particular, we first design a generative structure to obtain augmented node features. Then we prioritize the nodes with high degrees to create a more stable graph structure on local distributions. Finally, we pass the enhanced features and structure with the input ones in pairs through GNN layers, and create synthetic representations with abundant information and sufficient resistance to perturbations for subsequent fraud detection. In addition, we design a novel black-box attack algorithm to realistically imitate the perturbations conducted by fraudsters on graph features and structure. Experiments on the world's leading electronic trading platform and public anti-fraud datasets demonstrate the outstanding performance of our proposed method compared with state-of-the-art models, showing its superiority in precision and robustness on financial fraud detection tasks.

Affiliations: Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China; CAIR, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Pak Shek Kok, Hong Kong; School of Data Science & Engineering, East China Normal University, Shanghai, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; Department of Computer Science, Aalborg University, Aalborg, Denmark

Abstract:
With the proliferation of GPS-equipped edge devices, massive amounts of trajectory data are generated and accumulated in various domains, driving numerous urban applications. However, due to the limited data acquisition capabilities of edge devices, many trajectories are often recorded at low sampling rates, reducing the effectiveness of these applications. To address this issue, we aim to recover high-sample-rate trajectories from low-sample-rate ones, enhancing the usability of trajectory data. Recent approaches to trajectory recovery often assume centralized data storage, which can lead to catastrophic forgetting, where previously learned knowledge is entirely forgotten when new data arrives. This not only poses privacy risks but also degrades performance in decentralized settings where data streams into the system incrementally. To enable decentralized training and streaming trajectory recovery, we propose a Lightweight incremental framework for federated Trajectory Recovery, called LightTR+, which is based on a client-server architecture. Given the limited processing capabilities of edge devices, LightTR+ includes a lightweight local trajectory embedding module that enhances computational efficiency without compromising feature extraction capabilities. To mitigate catastrophic forgetting, we propose an intra-domain knowledge distillation module. Additionally, LightTR+ features a meta-knowledge enhanced local-global training scheme, which reduces communication costs between the server and clients, further improving efficiency. Extensive experiments offer insight into the effectiveness and efficiency of LightTR+.

Abstract:
Multi-hop Knowledge Graph Reasoning (KGR) seeks to identify accurate answers within Knowledge Graphs (KGs) via multi-step reasoning, predominantly utilizing reinforcement learning (RL) to enhance the efficiency of the reasoning process. Unlike traditional Knowledge Graph Embedding (KGE) methods, RL-based approaches offer superior interpretability. However, these methods often underperform due to two critical limitations: (1) their over-reliance on Horn rules for reasoning paths, which restricts their expressive power; and (2) inadequate utilization of reasoning states during the process. To address these issues, we propose a novel RL-based framework, RAR, which shifts focus from individual paths to subgraph structures for more robust predictions. RAR frames the retrieval of reasoning subgraphs from the KG as a Markov Decision Process (MDP) and incorporates a subgraph retriever. To efficiently explore the extensive subgraph space, we integrate multi-agent RL to enhance the retriever’s capabilities. Additionally, RAR features an advanced analyst module that meticulously examines reasoning states. These modules function iteratively: the retriever expands the subgraph, followed by the analyst module’s in-depth analysis. The insights gained are then used to inform subsequent retrieval steps. Ultimately, the predicted scores from both modules are synthesized to produce more precise posterior scores. Experimental results across multiple datasets demonstrate RAR’s efficacy, showcasing a notable improvement over existing state-of-the-art RL-based KGR methods.

Abstract:
Spatio-temporal traffic data imputation is a fundamental component in intelligent transportation systems, which can significantly improve data quality and enhance the accuracy of downstream data mining tasks. Recently, low-rank tensor representation has shown great potential for spatio-temporal traffic data imputation. However, the low-rank assumption focuses on the global structure, neglecting the critical spatial topology and local temporal dependencies inherent in spatio-temporal data. To address these issues, we propose a topology-induced low-rank tensor representation (TILR), which can accurately capture the underlying low-rankness of the spatial multi-scale features induced by topology knowledge. Moreover, to exploit local temporal dependencies, we suggest a learnable convolutional regularization framework, which not only includes some classical convolution-based regularizers but also leads to the discovery of new convolutional regularizers. Equipped with the suggested TILR and convolutional regularizer, we build a unified low-rank tensor model harmonizing spatial topology and temporal dependencies for traffic data imputation, which is expected to deliver promising performance even under extreme and complex missing scenarios. To solve the proposed nonconvex model, we develop an efficient alternating direction method of multipliers (ADMM)-based algorithm and analyze its computational complexity. Extensive experiments demonstrate that the proposed model outperforms state-of-the-art baselines for various missing scenarios. These results reveal the critical synergy between topology-aware low-rank constraint and temporal dynamic modeling for spatio-temporal data imputation.

Abstract:
While the stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces the common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies.

Abstract:
Event reasoning is to reason with events and certain inter-event relations. These cutting-edge techniques possess crucial and fundamental capabilities that underlie various applications. Large language models (LLMs) have made advances in event reasoning owing to their wealth of training. However, the LLMs commonly used today still do not consistently demonstrate proficiency in managing event reasoning as humans do. This discrepancy arises from not explicitly modeling events and their relations and from insufficient knowledge of event relations. In addition, the different reasoning paradigms of the LLMs are trained in an imbalanced way. In this paper, we propose WizardEvent to synthesize data from the unlabeled corpus with the proposed hybrid event-aware instruction tuning. Specifically, we first represent the events and their relations in a novel structure and then extract the knowledge from the raw text. Second, we introduce hybrid event reasoning paradigms with four reasoning formats. Lastly, we wrap our constructed event relational knowledge with the paradigms to create the instruction tuning dataset. We fine-tune the model with this enriched dataset, significantly improving the event reasoning. The performance of WizardEvent is rigorously evaluated through extensive experiments. The results demonstrate that WizardEvent substantially outperforms baselines, indicating the effectiveness of our approach.

Abstract:
Multi-View Clustering (MVC) has gained increasing attention due to its ability to effectively leverage the complementary information of multi-view data. Despite the success of existing MVC methods in many real-world applications, they often overlook the discrepancy of view-specific latent distribution and struggle to ensure the completeness of the multi-view data. To address these challenges and harness the powerful feature extraction capability of deep networks, we propose a novel Contrastive and Dual Adversarial Representation Learning method for Multi-view Clustering, termed as CDARL, to solve multi-view clustering problems with both complete and incomplete multi-view data. Specifically, CDARL employs alternating adversarial and contrastive learning to align the view-specific representations, driving them into the same semantic latent space to minimize the discrepancy in view-specific distributions. In addition, a consensus latent representation is learned by an adaptive fusion block that integrates information from multiple views. The consensus representation is further refined through adversarial learning modeling the transformation of the standard Gaussian distribution to the original data distribution. Moreover, the proposed method incorporates an imputation strategy designed to handle the incomplete multi-view data clustering task. This strategy utilizes both reconstructed samples and cross-view neighbors to impute missing views from the latent space and the original space, thereby preserving clustering information, which ensures the quality and feasibility of the imputed samples. Experimental results on six widely used datasets have verified the competitiveness of the proposed CDARL method against state-of-the-art methods in MVC problems with complete and incomplete multi-view data.

Affiliations: Propulsion and Space Research Center, Technology Innovation Institute, Abu Dhabi, UAE; Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China; Department of Computer Science, Khalifa University, Abu Dhabi, UAE; Institute for Infocomm Research (I²R), Agency for Science, Technology and Research (A*STAR), Singapore; Institute for Infocomm Research (I²R), Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore; Information Systems Technology and Design, Singapore University of Technology and Design, Singapore; James Watt School of Engineering, University of Glasgow, Glasgow, U.K.

Abstract:
Source-free domain adaptation (SFDA) adapts a pre-trained model from a labeled source domain to an unlabeled target domain without source data access, preserving privacy. While SFDA is common in computer vision, it remains largely unexplored in time series analysis, where existing methods struggle to capture temporal dynamics and often produce overconfident predictions on out-of-distribution samples. We propose MAsk And imPUte (MAPU), which tackles temporal consistency through a novel imputation task, where randomly masked time series signals are recovered within the learned embedding space. During adaptation, a dedicated temporal imputer guides the target model to generate features that maintain temporal consistency with source features. However, MAPU relies on standard softmax predictions, leading to overconfident predictions on target samples that fall outside the source domain's support. To address this limitation, we introduce Evidential-MAPU (E-MAPU), which leverages evidential uncertainty estimation to identify these out-of-support samples and adapts the feature extractor to map them closer to the source domain's support, while keeping the classifier fixed. Extensive experiments on five real-world time series datasets demonstrate significant performance improvements over existing methods. Our approaches effectively handle various time series domain adaptation challenges while maintaining computational efficiency, achieving state-of-the-art performance through their uncertainty-aware adaptation strategy.

Abstract:
Few-shot knowledge graph completion (few-shot KGC) mines unseen knowledge by leveraging meta-learning and contrastive learning to achieve accurate predictions with limited triples. Recent studies have focused on designing distance or similarity metrics to provide better knowledge representation between entities and relations. However, three issues with negative sampling remain unexplored: 1) the construction of negative queries heavily relies on manual experience in selecting candidate tail entities, 2) the constructed negative queries may mislabel potential true facts, and 3) the varying difficulties of negative queries are ignored. To solve the above issues, in this paper, we introduce curriculum learning into few-shot KGC and propose a novel few-shot KGC framework empowered by an adaptive negative sampling mechanism, which can eliminate the dependence on any additional manual experience, reduce mislabeling, and generate negative queries with appropriate difficulty. Specifically, the proposed framework includes two alternating phases. In the negative sampling phase, we first design a novel scoring function based on positive-unlabeled learning with a type-related candidate encoder, and then build a pacing function based on a variable-speed sliding window to select negative queries with appropriate learning difficulty at the current training step. In the meta-training phase, we develop an adapted triple-oriented knowledge encoder to provide accurate representations for queries. Experimental results demonstrate that the proposed framework outperforms the state-of-the-art baselines and provides negative queries with appropriate difficulty in few-shot KGC.

Abstract:
The documents mentioning a target entity are essential prerequisites of various applications, such as market intelligence analysis, knowledge base enrichment, fact checking and retrieval augmented generation. A simple solution to acquire these documents is to exploit search engines via querying the name of the target entity. However, the name of the target entity appearing in a returned document does not necessarily mean it really refers to the target entity due to the name ambiguity, as it may refer to another entity sharing the same name as the target entity. Thus, in this paper, we explore a new task of targeted document detection, which aims to detect those targeted documents (i.e., documents really mentioning the target entity) from the given candidate documents each of which contains an ambiguous name of the target entity. We propose GADE+, a novel Graph-based Anchor-enhanced framework to solve the task of targeted Document dEtection by leveraging both the local relevance information (via a local relevance model) and the global cross-document interactions (via an anchor-based global interaction model) jointly. The framework GADE, as presented in our conference paper, relies on a complete graph to model global cross-document interactions, resulting in many useless interactions and limited scalability. To address the above issues, we first introduce virtual anchors representing virtual candidate documents, and construct a document interaction bipartite graph between candidate documents and virtual anchors. Then we apply a graph neural network over the graph to model the global cross-document interactions via anchor-based message passing mechanism. To further learn more discriminative representations of virtual anchors and candidate documents, an anchor-guided regularization is devised to explicitly improve the inter-class separability of virtual anchors and the intra-class compactness of candidate documents. 
We construct four labeled datasets for this task based on Wikipedia and Web documents respectively, and a thorough experimental study shows that our framework GADE+ significantly outperforms all the baseline methods in terms of F1-score.

Abstract:
Graph Contrastive Learning (GCL) has recently garnered significant attention for enhancing recommender systems. Most existing GCL-based methods perturb the raw data graph to generate views, performing contrastive learning across these views to learn generalizable representations. However, most of these methods rely on data- or model-based augmentation techniques that may disrupt interest consistency. In this paper, we propose a novel interest-aware augmentation approach based on diffusion models to address this issue. Specifically, we leverage a conditional diffusion model to generate interest-consistent views by conditioning on node interaction information, ensuring that the generated views align with the interests of the nodes. Based on this augmentation method, we introduce DiffCL, a graph contrastive learning framework for recommendation. Furthermore, we propose an easy-to-hard generation strategy. By progressively adjusting the starting point of the reverse denoising process, this strategy further enhances effective contrastive learning. We evaluate DiffCL on three public real-world datasets, and results indicate that our method outperforms state-of-the-art techniques, demonstrating its effectiveness.

Abstract:
A precise workload forecaster is the key to effective resource management, system scalability, and overall operational efficiency in cloud environments. However, real-world cloud systems frequently operate in dynamic and unpredictable settings, causing workloads that exhibit significant diversity and fluctuations. To address these problems, we introduce OMCR, a novel online multivariate forecaster for cloud resource management, that overcomes the limitations of existing static forecasting methods through online learning. OMCR integrates long-term memory with a rapid response mechanism to short-term changes in cloud systems, while also considering the impact of multivariate relationships on workload prediction. OMCR minimizes its reliance on historical data, thereby reducing training difficulty and maintaining lower prediction loss in the long run. OMCR also offers an adaptive approach to forecasting peak workloads in a certain time span, which helps cloud resource management. Experimental results demonstrate the superior performance of our proposed framework compared to state-of-the-art methods in MAE and MSE metrics when forecasting cloud workloads.

Abstract:
Learning-to-Rank (LTR) models built on Transformers have been widely adopted to achieve commendable performance in web search. However, these models predominantly emphasize relevance, often overlooking broader aspects of user satisfaction such as quality, authority, and recency, which collectively enhance the overall user experience. Addressing these multifaceted elements is essential for developing more effective and user-centric search engines. Nevertheless, training such comprehensive models remains challenging due to the scarcity of annotated query-webpage pairs relative to the vast number of webpages available online and the billions of daily search queries. Concurrently, industry research communities have released numerous open-source LTR datasets with well-annotated samples, though these datasets feature diverse designs of LTR features and labels across heterogeneous domains. Inspired by recent advancements in pre-training transformers for enhanced performance, this work explores the pre-training of LTR models using both labeled and unlabeled samples. Specifically, we leverage well-annotated samples from heterogeneous open-source LTR datasets to bolster the pre-training process and integrate multifaceted satisfaction features during the fine-tuning stage. In this paper, we propose S^3PRank (Satisfaction-oriented Learning to Rank with Semi-supervised Pre-training).
Specifically, S^3PRank employs a three-step approach: (1) it exploits unlabeled/labeled data from the search engine to pre-train a self-attentive encoder via semi-supervised learning; (2) it incorporates multiple open-source heterogeneous LTR datasets to enhance the pre-training of the relevance tower through shared parameters in cross-domain learning; (3) it integrates a satisfaction tower with the pre-trained relevance tower to form a deep two-tower aggregation structure, and fine-tunes the combination of the pre-trained self-attentive encoder and the two-tower structure using search engine data with various learning strategies. To demonstrate the effectiveness of our proposed approach, we conduct extensive offline and online evaluations using real-world web traffic from Baidu Search. Comparisons against a number of advanced baselines confirm the advantages of S^3PRank in producing high-performance ranking models for web-scale search.

Abstract:
Merging multi-source time series data in cloud servers significantly enhances the effectiveness of analyses. However, privacy concerns are hindering time series analytics in the cloud. In response, numerous secure time series analytics schemes have been designed to address privacy concerns. Unfortunately, existing schemes suffer from severe performance issues, making them impractical for real-world applications. In this work, we propose novel secure time series analytics schemes that break through the performance bottleneck by substantially improving both communication and computational efficiency without compromising security. To attain this, we open up a new technical roadmap that leverages the idea of a mixed model. Specifically, we design a non-interactive secure Euclidean distance protocol by tailoring homomorphic secret sharing to suit subtractive secret sharing. Additionally, we devise a different approach to securely compute the minimum of three elements, simultaneously reducing computational and communication costs. Moreover, we introduce a rotation concept, design a rotation-based hybrid comparison mode, and finally propose our fast secure top-k protocol that can dramatically reduce comparison complexity. With the above secure protocols, we propose a practical secure time series analytics scheme with exceptional performance and a security-enhanced scheme that considers stronger adversaries. Formal security analyses demonstrate that our proposed schemes can achieve the desired security requirements, while the comprehensive experimental evaluations illustrate that our schemes outperform the state-of-the-art scheme in both computation and communication.

Abstract:
In this paper, considering the memory capability of fractional-order reservoirs and the immunity of integer-order reservoirs, a serial-parallel fractional-integer-order echo state network (SP-FIO-ESN) model is proposed for time series prediction. First, exploiting the superior adaptive capability of variational mode decomposition (VMD), the input signal is decomposed into multiple input subsequences, and thus the internal features of the signal are extracted. Second, according to the variational mode decomposition and phase space reconstruction methods, the number of serial reservoirs and the number of parallel reservoirs of SP-FIO-ESN are determined. Third, in order to ensure the stability of SP-FIO-ESN, a sufficient stability criterion for SP-FIO-ESN is given. Meanwhile, the SP-FIO-ESN reservoir parameters are optimized based on the black-winged kite algorithm (BKA). Finally, in order to verify the effectiveness of the proposed method for different learning tasks, numerical simulation datasets and photovoltaic/wind power generation forecasting datasets are used.

Abstract:
Graph Self-Supervised Learning (GSSL) has emerged as a powerful paradigm for generating high-quality representations for graph-structured data. While multi-scale graph contrastive learning has received increasing attention, many existing methods still predominantly focus on a single graph abstraction level. To address this limitation, we propose a unified contrastive framework that can target node-level, proximity-level, cluster-level, and graph-level information and integrate them through a linear combination of similarity scores on positive pairs and dissimilarity scores (i.e., similarity scores on negative pairs). Furthermore, current approaches typically assign uniform penalty strengths to all examples, which reduces optimization flexibility and leads to ambiguous convergence status. To overcome this, we introduce a novel parameter-free fine-grained self-weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores. The proposed mechanism emphasizes the scores that deviate significantly from their target values. Our approach not only enhances optimization flexibility but also eliminates the computational overhead of hyperparameter tuning in conventional multi-task GSSL methods. Comprehensive experiments on real-world datasets show that our methods consistently outperform state-of-the-art approaches across downstream tasks, including classification, clustering, and link prediction, in both single-level and multi-level scenarios.

Abstract:
The robustness of intelligent IoT device networking is vital for maintaining communication connectivity within intelligent manufacturing systems, impacting the reliability of the customized Industrial Internet of Things (CuIoT). Current studies enhance network connectivity and resilience against cyber attacks through combinatorial optimization theory by redeploying topologies. However, these approaches often overlook the transformative potential of network motifs in the optimization process. To address this, we introduce CuIoT-MET, an innovative approach that enhances CuIoT robustness by leveraging motif evolutionary transfer knowledge from historical evolution processes. By analyzing changes in connection relationships and emphasizing network motifs’ unique contributions, we design a novel robustness metric to optimize the evolutionary trajectory, resulting in more robust CuIoT connection patterns. Extensive experiments show that CuIoT-MET outperforms state-of-the-art methods in improving network robustness.

Abstract:
The Influence Maximization (IM) problem aims to identify a small set of influential users, as seed users, to maximize their influence spread in a social network. Recently, graph representation learning approaches have gained wide attention in the IM field for their ability to encode social influence patterns into user representations, which are then used by various strategies to identify target seed users. While effective, these graph learning-based IM methods face two main limitations. First, they fail to model the influence propagation process explicitly, limiting their ability to capture the essential underlying propagation patterns. Second, they build representations in Euclidean space, which cannot reflect the latent hierarchical structure of social influence distribution. As a result, the learned representations are ineffective in supporting seed user selection. To address these limitations, we propose a novel hyperbolic learning-based IM method, HIM, which leverages hyperbolic representation learning to estimate users’ influence strength from social data, particularly historical propagation processes, for solving IM tasks. Unlike previous approaches, HIM comprises two key components. First, Hyperbolic Influence Representation encodes influence spread patterns from both the social network and influence propagation instances into hyperbolic user representations. When learning from these data sources, the geometric properties of hyperbolic space naturally place highly influential users closer to the space origin, enabling practical estimation of influence strength from the distances of learned representations. Second, Adaptive Seed Selection introduces a novel scoring mechanism grounded in estimated influence strength. It leverages the geometric advantages of hyperbolic space to incrementally refine scores using multiple types of hyperbolic distance information, enabling flexible and effective seed user selection. 
Extensive experiments on five network datasets demonstrate the superior effectiveness and efficiency of our method under various diffusion models with both known and unknown propagation parameters, highlighting its potential for solving IM problems in large-scale, real-world social networks.
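
The idea of reading influence strength off the distance to the origin can be sketched as follows. The sketch assumes the Poincaré ball model of hyperbolic space, which the abstract does not name, and the helper `rank_by_origin_distance` is a hypothetical illustration rather than HIM's scoring mechanism.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / denom)


def rank_by_origin_distance(embeddings):
    """Order users from closest to farthest from the origin.

    Assumption: a smaller distance to the origin indicates a more influential
    user, following the geometric property described in the abstract.
    """
    dim = len(next(iter(embeddings.values())))
    origin = [0.0] * dim
    return sorted(embeddings, key=lambda u: poincare_distance(embeddings[u], origin))
```

For a point at Euclidean norm r, this distance equals 2·artanh(r), so it grows without bound as a point approaches the boundary of the ball, which is what gives hyperbolic space room to separate hierarchy levels.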

Abstract:
Spatial co-location pattern mining aims to discover sets of spatial features that prevalently occur together. However, existing methods for local co-location pattern (LCP) mining often yield prevalent regions with limited spatial coverage, resulting in the overestimation of pattern interest measures and the omission of implicit spatial associations. Furthermore, the distinction between pattern categories is often obscured, as global co-location patterns can be regarded as special cases of LCPs. To address these limitations, we propose a novel framework that can identify prevalent regions with broader coverage and achieve clear pattern classification. We first prove that the problem of maximizing each prevalent region is NP-hard. Then we formally define the concept of maximal prevalent regions as a viable alternative and develop a heuristic dynamic spatial expansion algorithm for their efficient identification. In addition, we introduce a spatial occupancy ratio and a three-level classification scheme (rare, common, and significant) to replace the traditional global/local dichotomy. Finally, a Delaunay-triangulation-based method is employed to quantify the coverage of non-convex regions, ensuring accurate occupancy calculations. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and efficiency of the proposed framework.

Abstract:
With the rapid evolution of software ecosystems, malware has become increasingly sophisticated, posing severe threats to cybersecurity. Modern malware often exhibits complex hierarchical behaviors that can be naturally modeled as a Graph of Graphs (GoG): a global Function Call Graph (FCG) whose function nodes can be annotated with fine-grained Control Flow Graphs (CFGs). This nested structure makes detection challenging for existing methods. In this work, we propose a novel end-to-end framework, namely Malware Graph of Graphs Detection Network (MGDN), specifically optimized for malware detection on hierarchical program graphs. MGDN integrates four modules, including Graph Feature Construction, FCG Malware Discrimination, CFG Risk Attribution, and Integrated Malware Prediction, to capture multi-level semantic patterns and detect malicious programs. Extensive experiments on large-scale real-world datasets demonstrate that MGDN consistently outperforms state-of-the-art generic GNNs, hierarchical graph learning methods, and specialized malware detection baselines across multiple metrics. Empirically, MGDN achieves average relative improvements of 16.22% in PR-AUC, 9.91% in Macro-F1, and 1.89% in Micro-F1 against the strongest competing baselines across all evaluated datasets, highlighting its effectiveness and robustness in diverse real-world scenarios.

Abstract:
In recent years, table reasoning has garnered substantial research interest, particularly regarding its integration with Large Language Models (LLMs), which have revolutionized natural language applications. Existing LLM-based studies typically achieve step-by-step thinking for table reasoning guided by task semantics. While these approaches emphasize autonomous exploration and enhance fine-grained table understanding, they often overlook systematic thinking in the reasoning process. This oversight can lead to omitted steps, disorganized logic and misleading results, especially in complex scenarios. In this paper, we propose PoTable, a novel stage-oriented plan-then-execute approach that incorporates systematic thinking into table reasoning. Specifically, PoTable involves several distinct analytical stages with clear objectives to provide adequate guidance. To accomplish stage-specific goals, PoTable employs a plan-then-execute mechanism: it first plans the operation chain based on the stage objective, and then executes operations sequentially through code generation, real-time running and feedback processing. Consequently, PoTable produces reliable table reasoning results with highly accurate, step-wise commented and completely executable programs. It mirrors the workflow of a professional data analyst, offering advantages in both accuracy and explainability. Finally, we conduct extensive experiments on four datasets from the WikiTQ and TabFact benchmarks, where the results demonstrate the effectiveness, efficiency and explainability of PoTable.

Abstract:
Recommendation systems alleviate the issue of information overload via modeling user preferences from interaction sequences. Although self-attention-based sequential models effectively capture long-range dependencies, they are susceptible to noise amplification in sparse sequences and over-smoothing of item representations, which obscures true user intent and reduces sensitivity to fine-grained behavioral changes. To overcome these challenges, we propose SAFA, a sparse sequential recommendation framework comprising: (1) an adaptive sparse attention mechanism that suppresses noisy interactions while preserving embedding diversity; (2) a frequency-aware encoder that decomposes interaction sequences into low-frequency components for long-term preference modeling and high-frequency components for short-term intent dynamics; and (3) a simplified focal loss that removes the class-balancing term while preserving the focusing factor, emphasizing hard-to-predict samples rather than class priors. Experiments on seven benchmark datasets demonstrate that SAFA consistently achieves state-of-the-art performance with average improvements of up to 3.77%, 4.10% and 4.25% in terms of HR@5, HR@10 and HR@20, respectively, and 4.30%, 4.78% and 4.58% in terms of NDCG@5, NDCG@10 and NDCG@20, respectively, over the best competing model. Ablation studies verify the importance of each component, with notable performance degradation upon removing the sparse attention or frequency-aware encoder. Overall, SAFA enhances sequential recommendation by improving robustness and discriminative learning under noisy and sparse conditions.
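
The simplified focal loss in component (3) can be sketched as below, assuming it takes the standard focal loss form -alpha_t (1 - p_t)^gamma log(p_t) with the class-balancing weight alpha_t dropped. The function name, the softmax over item logits, and the default gamma are illustrative assumptions, not details taken from the abstract.

```python
import math


def simplified_focal_loss(logits, target, gamma=2.0):
    """Focal loss without the class-balancing term: -(1 - p_t)^gamma * log(p_t).

    `logits` holds one score per candidate item and `target` is the index of
    the true next item (hypothetical setup for illustration).
    """
    # Log-softmax computed with the max-shift trick for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p = logits[target] - log_z
    p = min(1.0, math.exp(log_p))  # guard against rounding slightly above 1
    return -((1.0 - p) ** gamma) * log_p
```

With gamma = 0 the expression reduces to ordinary cross-entropy. A larger gamma shrinks the loss of well-predicted samples much more than that of poorly predicted ones, which is the "focus on hard samples" behavior the abstract describes.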

Abstract:
Recommender Systems (RS) have been shown to be vulnerable to injective attacks, where attackers inject limited fake user profiles to promote the exposure of target items to real users for unethical gains (e.g., economic or political advantages). Since attackers typically lack knowledge of the victim model deployed in the target RS, existing methods resort to using a fixed surrogate model to mimic the potential victim model. Despite considerable progress, we argue that the assumption that poisoned data generated for the surrogate model can be used to attack other victim models is wishful. When there are significant structural discrepancies between the surrogate and victim models, the attack transferability inevitably suffers. Intuitively, if we can identify the worst-case victim model and iteratively optimize the poisoning effect specifically against it, then the generated poisoned data would be better transferred to other victim models. However, exactly identifying the worst-case victim model during the attack process is challenging due to the large space of victim models. To this end, in this work, we propose a novel attack method called Sharpness-Aware Poisoning (SharpAP). Specifically, it employs the sharpness-aware minimization principle to seek the approximately worst-case victim model and optimizes the poisoned data specifically for this worst-case model. The poisoning attack with SharpAP is formulated as a min-max-min tri-level optimization problem. By integrating SharpAP into the iterative process for attacks, our method can generate more robust poisoned data which is less sensitive to the shift of model structure, mitigating the overfitting to the surrogate model. Comprehensive experimental comparisons on three real-world datasets demonstrate that SharpAP can significantly enhance the attack transferability.

Abstract:
The cross-fertilization of the fast-developing AI technology and spatial indexing has given rise to spatial learned indexes. However, these indexes rely on historical data distributions to build models, which limits their ability to anticipate data that has not yet arrived. To address this, we propose a novel Spatio-Temporal Update Method (STUM) that enhances conventional spatial learned indexes by introducing a Spatial Delta Area (SDA) for updates without altering their hierarchical structure. STUM learns spatio-temporal auto-correlation from historical data and integrates predicted future distributions. We apply STUM to the Spatial Learned Block Range INdex (SLBRIN), resulting in the development of the Spatio-Temporal Updatable learned Block Range INdex (STUBRIN), which adopts Revmap to integrate spatio-temporal sequence predictions with the spatial block range. STUBRIN optimizes the retraining process by learning the temporal continuity from spatial distribution and fusing it into the error threshold control mechanism and historical delta learning mechanism. Our results show that STUBRIN achieves 1.9-2.4×, 1.8-13.3×, and 3.4-6.7× better build, query, and update performance, respectively, compared to state-of-the-art methods. Additionally, STUBRIN offers superior query and update stability. For concurrent learned indexes, we have also designed parallel scheduling for STUBRIN, which improves build, query, and update performance by 6.2-6.8×, 0.3-4.2×, and 2.6-5.5×, respectively, without increasing the index size.

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; Department of Computer Science, City University of Hong Kong, Hong Kong, SAR China; School of Software, Dalian University of Technology, Dalian, China; School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Sichuan Province Commercial Investment Group Company, Ltd., Chengdu, China; Data Science and Analytics Thrust, The Hong Kong University of Science and Technology, Guangzhou, China

Abstract:
Betweenness centrality is one of the key centrality measures in many applications, including community detection in biological networks, vulnerability detection in communication networks, and misinformation filtering in social networks. The top-K group betweenness centrality problem is to find a group of K nodes in a network such that the total fraction of shortest paths that pass through the K nodes is maximized. Existing studies proposed randomized sampling algorithms for the problem. We notice that these studies ensure that the maximum deviation of the estimated centrality of every group from its expectation is no greater than a small given threshold for all potential groups with no more than K nodes, thereby generating too many samples, as the number of such groups is prohibitively large. In contrast, in this paper we first devise a novel algorithm that estimates the centrality of a tentative group adaptively and stops immediately once the centrality is large enough; otherwise, the algorithm uses more samples to find a better group. We then theoretically show that, even though the proposed algorithm uses far fewer samples, it can still find a performance-guaranteed group with high probability. Experimental results with real-world networks demonstrate that the number of samples used by the proposed algorithm is up to 36 times smaller than that of the state-of-the-art, while the centrality of the group found by the algorithm is no more than 4.5% smaller than that of the latter.
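
The sampling idea behind such algorithms can be sketched with a basic estimator: draw random node pairs, sample one shortest path per pair uniformly at random, and report the fraction of sampled paths that touch the group. This is only the plain estimator, not the adaptive stopping rule proposed in the paper. The function names are hypothetical, the graph is assumed to be unweighted, and path endpoints are counted as lying on the path, which is one of several conventions.

```python
import random
from collections import deque


def sample_shortest_path(adj, s, t, rng):
    """Sample one shortest s-t path uniformly at random; return None if t is unreachable."""
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}  # sigma counts shortest paths from s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], sigma[v], preds[v] = dist[u] + 1, 0, []
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                preds[v].append(u)
    if t not in dist:
        return None
    path = [t]
    while path[-1] != s:
        cands = preds[path[-1]]
        # Choosing a predecessor proportionally to its path count makes the path uniform.
        path.append(rng.choices(cands, weights=[sigma[p] for p in cands])[0])
    return path[::-1]


def estimate_group_centrality(adj, group, num_samples, seed=0):
    """Estimate the fraction of shortest paths that contain at least one group node."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    group = set(group)
    hits = 0
    for _ in range(num_samples):
        s, t = rng.sample(nodes, 2)
        path = sample_shortest_path(adj, s, t, rng)
        if path is not None and group.intersection(path):
            hits += 1
    return hits / num_samples
```

On a star graph every shortest path contains the center, so the estimate for the group consisting of the center alone is exactly 1 regardless of the number of samples. The number of samples needed for a reliable estimate on less trivial graphs is precisely what the abstract's adaptive algorithm aims to reduce.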

Abstract:
Cognitive Diagnosis is a critical task in computer-assisted education, aimed at assessing students’ mastery of knowledge concepts and analyzing exercise indices. This direction has received considerable research attention over the past few decades. However, the inherent heterogeneity in students’ abilities introduces significant challenges to accurate exercise indices estimation, resulting in biases that lead to inaccurate diagnostics within student groups and undermine the generalizability of exercise indices across diverse groups. To address these challenges, we propose a Counterfactual Adaptive-Debiasing Framework (CADF) for Cognitive Diagnosis, which employs a causal graph to model the intricate relationships among key variables influencing student performance and knowledge mastery. Specifically, by introducing exercise adjustment factors, we capture both the intrinsic attributes of exercises and their dynamic adaptability to individual students. Then, to disentangle the direct and indirect effects of these factors, we adopt a counterfactual inference approach to answer the critical question: How would the diagnostic feedback from a cognitive diagnosis model change if it were only directly influenced by exercise adjustment factors? This allows CADF to retain the beneficial indirect effects while neutralizing the direct effects that introduce bias, thereby achieving debiased exercise indices estimation. Finally, extensive experiments on three real-world datasets demonstrate that CADF significantly reduces bias in exercise indices estimation and enhances the accuracy of diagnostic feedback.

Abstract:
Imbalanced data classification is a hot topic in neural network learning. Current neural network methods rely on reweighting and resampling techniques to solve the imbalanced data problem, but they show inconsistent rebalancing behavior and are prone to performance fluctuation across diverse imbalanced data distributions. Robust optimization is a promising direction to alleviate these problems by enhancing the robustness of neural networks. However, designing an uncertain set for robust optimization is an open problem in the imbalance learning domain. In this paper, we develop a novel data-driven robust optimization neural network method for imbalanced data classification. Specifically, we design a more aggressive local span space to generate new minority samples of the uncertain set, providing the maximum uncertain set for minority rebalancing with the robustness enhancement property. Here, we perform the robust optimization with a more stable initial state by pre-training a classification model. Then, we replace the distribution discrepancy constraint with a novel model evaluation constraint to preserve the local details of data distribution. The generated samples of the maximum uncertain set are further applied to fine-tune the delay robustness enhancement neural network, achieving the local data perturbation adaptation with empirical risk minimization. We validated the proposed method on multiple imbalanced data sets with varied imbalance configurations. The test results showed that the proposed method performed better than the state-of-the-art rebalancing methods, revealing that the robustness enhancement is an important factor in improving the stability of imbalanced learning.

Abstract:
Multi-hop Knowledge Base Question Answering (KBQA) aims to find answer entities in the knowledge base that are multiple hops away from the entities in the question. Information retrieval-based (IR-based) methods extract a pivotal subgraph from the entire KB to locate candidate answers and then evaluate their plausibility through semantic matching with the question. However, we observed that the extracted subgraphs often include nodes that are weakly related or irrelevant to the question. Without a proper node filtering mechanism, the number of irrelevant nodes grows as the number of hops increases, leading to excessive consumption of computational resources. To address these challenges, this study introduces an efficient LLM-based subgraph retrieval method for multi-hop knowledge base question answering, M-ER. The framework leverages Monte Carlo Tree Search (MCTS) to transform subgraph exploration into a tree-structured search process. During the MCTS selection phase, nodes that are highly relevant to the question are prioritized for inclusion in the subgraph, eliminating the need to traverse all nodes in the KB. The framework further incorporates a large language model (LLM) to refine the search direction, ensuring that exploration remains focused on nodes relevant to the question. In addition, selected nodes are quantitatively scored, and these scores are fed back into the node selection process to effectively filter out irrelevant candidates, thereby improving the quality of the subgraph. This mechanism not only narrows the search space but also enhances the overall efficiency of multi-hop KBQA. Experiments on the WebQSP benchmark demonstrate that M-ER achieves 78.88% on the Hits@1 metric, while also improving computational efficiency. These results not only validate the effectiveness of M-ER, but also offer a viable technical path to balance performance and computational efficiency.
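
The selection phase described above can be illustrated with a minimal sketch. It assumes the widely used UCT rule (mean value plus an exploration bonus) for choosing which child to expand. The abstract does not specify M-ER's selection formula or how LLM relevance scores enter it, so the function names, the exploration constant, and the relation names used as child labels are all hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCT score: average value plus an exploration bonus for rarely visited children."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=1.4):
    """Pick the child with the highest UCT score.

    `children` maps a child label to (accumulated relevance score, visit count).
    """
    return max(children, key=lambda name: uct_score(*children[name], parent_visits, c))
```

For example, `select_child({"born_in": (4.0, 5), "spouse": (0.5, 5)}, parent_visits=10)` favors the branch with the higher accumulated relevance, so low-scoring branches are visited less often and contribute fewer nodes to the retrieved subgraph.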

Abstract:
The generation of voluminous scientific data poses significant challenges for efficient storage, transfer, and analysis. Recently, error-bounded lossy compression methods emerged due to their ability to achieve high compression ratios while controlling data distortion. However, they often overlook the inherent spatial and temporal correlations within scientific data, thus missing opportunities for higher compression. In this paper we propose GraphComp, a novel graph-based method for error-bounded lossy compression of scientific data. We perform irregular segmentation of the original grid data and generate a graph representation that preserves the spatial and temporal correlations. Inspired by Graph Neural Networks (GNNs), we then propose a temporal graph autoencoder to learn latent representations that significantly reduce the size of the graph, effectively compressing the original data. Decompression reverses the process and utilizes the learnt graph model together with the latent representation to reconstruct an approximation of the original data. The decompressed data are guaranteed to satisfy a user-defined point-wise error bound. We compare our method against the state-of-the-art error-bounded lossy methods (i.e., HPEZ, SZ3.1, SPERR, and ZFP) on large-scale real and synthetic data. GraphComp consistently achieves the highest compression ratio across most datasets, outperforming the second-best method by margins ranging from 22% to 50%.
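
A point-wise error bound of the kind mentioned above is commonly enforced by quantizing the residual between each original value and its model reconstruction. The sketch below shows that generic technique under stated assumptions: the abstract does not describe how GraphComp enforces its bound, and the function names and the use of a uniform quantization step of twice the error bound are illustrative choices.

```python
def quantize_residuals(values, predictions, error_bound):
    """Map each residual (value - prediction) to an integer code.

    With a step of 2 * error_bound, rounding to the nearest step keeps the
    reconstruction within error_bound of the original value.
    """
    step = 2.0 * error_bound
    return [round((v - p) / step) for v, p in zip(values, predictions)]


def reconstruct(predictions, codes, error_bound):
    """Rebuild approximate values from predictions and stored integer codes."""
    step = 2.0 * error_bound
    return [p + q * step for p, q in zip(predictions, codes)]
```

The better the predictions (here, what a learned model would reconstruct), the more codes collapse to zero or small integers, which is what makes the stored stream highly compressible by a subsequent entropy coder.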

Abstract:
Time series forecasting is integral to diverse fields, significantly influencing both industrial production and human activities. In the real world, time-series data often exists in the form of streaming data and its distribution often changes over time, which is referred to as concept drift. This results in a gradual decline in the performance of traditional deep learning models as time progresses. Current online learning methods attempt to alleviate this issue by employing online adaptation methods in the temporal dimension. However, they overlook the changes and drifts in the inter-variable association within time series. To this end, we propose a novel Online Dynamic Graph Network (ODGNet). ODGNet represents variable associations as a matrix polynomial and acquires polynomial coefficients based on online gradients, which models the evolutionary trends of spatial patterns. Furthermore, we emphasize the lossless information mapping between the adjacency matrix and its corresponding polynomial coefficient. Based on this mapping, a graph memory module with low memory consumption is proposed to avoid catastrophic forgetting when graph drift occurs. In addition, a graph drift awareness mechanism is designed to rapidly detect graph drift. Experimental results demonstrate that ODGNet achieves a significant reduction in forecasting error compared to existing online learning methods on twelve benchmark datasets.
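
To make the matrix-polynomial representation concrete, the sketch below builds an adjacency matrix as a weighted sum of powers of a base matrix. The base matrix and coefficient values are hypothetical; the abstract does not say which matrix the polynomial is taken in or how many terms are used.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def polynomial_adjacency(base, coeffs):
    """Return sum_k coeffs[k] * base**k, where base**0 is the identity."""
    n = len(base)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [[0.0] * n for _ in range(n)]
    for c in coeffs:
        for i in range(n):
            for j in range(n):
                result[i][j] += c * power[i][j]
        power = matmul(power, base)
    return result

base = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical two-variable base graph
coeffs = [0.5, 0.3, 0.2]         # hypothetical coefficients
adj = polynomial_adjacency(base, coeffs)
```

For a fixed base matrix, the three coefficients fully determine the adjacency matrix, which illustrates why remembering coefficients rather than whole graphs can keep memory consumption low.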

Abstract:
A critical limitation of conventional sequential recommendation models (SRMs) is their reliance on observed user-item interaction sequences within a closed-world setting, which hinders their ability to generalize to unseen or infrequent items. Recently, Large Language Models (LLMs) have shown remarkable promise in recommendation systems due to their vast world knowledge and advanced reasoning capabilities. Current research has predominantly explored two approaches: using LLMs to directly generate recommendations and distilling knowledge from LLMs to enhance conventional SRMs. However, these approaches face two major challenges: (1) high inference costs, as they require LLM responses during inference, either for generating predictions or as supplementary input; (2) inadequate distillation of the reasoning process, as existing methods focus mainly on improving embeddings or aligning outputs, without fully integrating LLMs’ inherent reasoning capabilities. To address these issues, we propose LCKD-SR, an LLM-driven Cascaded Knowledge Distillation framework for Sequential Recommendation. In this framework, an LLM, a Teacher SRM, and a Student SRM form a hierarchical distillation structure, enabling an LLM-free inference by using only the Student model. Beyond traditional embedding and ranking distillation, our framework abstracts the LLM’s sequential reasoning abilities by identifying key interactions that subsequently guide the Teacher’s attention using learnable markers. The Student model, which mirrors the architecture of the Teacher, achieves seamless knowledge alignment from the Teacher across all three aspects. Extensive experiments demonstrate the effectiveness and efficiency of the proposed LCKD-SR, showcasing its scalability to perform multi-level knowledge transfer while enabling LLM-independent inference, thereby overcoming the inference cost and reasoning limitations of existing methods.

Abstract:
Graph-structured data appears in domains such as molecular analysis, social networks, and program optimization, where graphs often exhibit implicit heterogeneity, as nodes may look homogeneous in type yet differ significantly in semantics or functionality. Graph Neural Networks (GNNs), while powerful on homophilic graphs, tend to degrade in such settings due to polarity confusion, over-smoothing, and inefficiency caused by dense propagation. We propose a polarity-aware framework for graph classification that addresses these challenges through adaptive directional sparse aggregation. The framework introduces a polarity-aware propagation mechanism that adaptively reinforces or inverts neighbor signals, mitigating contamination under heterophily. A polarity-guided sparse aggregation operator further alleviates over-smoothing, improves scalability by constraining redundant connections, and condenses information flow into more effective representations, while maintaining unbiased estimation with controlled variance. We provide theoretical analyses that characterize the computational complexity, stability properties, and expressive behavior of signed directional aggregation. Extensive experiments on molecular and social graph benchmarks with implicit heterophily demonstrate consistent improvements in graph classification accuracy and efficiency. Our method achieves a 2.36% improvement when compared with the strongest baseline on each dataset. In addition, it improves accuracy by 4.53% on average on program optimization strategy recognition tasks, reaching 80.12% overall.

Abstract:
Hybrid Approximate Nearest Neighbor Search (Hybrid ANNS) is a foundational search technology for large-scale heterogeneous data and has gained significant attention in both academia and industry. However, current approaches overlook the heterogeneity in data distribution, thus ignoring two major challenges: the Compatibility Barrier for Similarity Magnitude Heterogeneity and the Tolerance Bottleneck to Attribute Cardinality. To overcome these issues, we propose the robuSt heTerogeneity-Aware hyBrid retrievaL framEwork, STABLE, designed for accurate, efficient, and robust hybrid ANNS under datasets with various distributions. Specifically, we introduce an enhAnced heterogeneoUs semanTic perceptiOn (AUTO) metric to achieve a joint measurement of feature similarity and attribute consistency, addressing similarity magnitude heterogeneity and improving robustness to datasets with various attribute cardinalities. Thereafter, we construct our Heterogeneous sEmantic reLation graPh (HELP) index based on AUTO to organize heterogeneous semantic relations. Finally, we employ a novel Dynamic Heterogeneity Routing method to ensure an efficient search. Extensive experiments on five feature vector benchmarks with various attribute cardinalities demonstrate the superior performance of STABLE.

Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Shenzhen Smart City Communication Co., Ltd., Shenzhen, China; Department of Electronic Engineering, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China; College of Civil Engineering, Southwest Forestry University, Kunming, China; College of Soil and Water Conservation, Southwest Forestry University, Kunming, China

Abstract:
Fine-grained air pollution forecasting is crucial for urban management and the development of healthy buildings. Deploying portable sensors on mobile platforms such as cars and buses offers a low-cost, easy-to-maintain, and wide-coverage data collection solution. However, due to the random and uncontrollable movement patterns of these non-dedicated mobile platforms, the resulting sensor data are often incomplete and temporally inconsistent. By exploring potential training patterns in the reverse process of diffusion models, we propose Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). STeP-Diff leverages DeepONet to model the spatial sequence of measurements along with a PDE-informed diffusion model to forecast the spatio-temporal field from incomplete and time-varying data. Through a PDE-constrained regularization framework, the denoising process asymptotically converges to the convection-diffusion dynamics, ensuring that predictions are both grounded in real-world measurements and aligned with the fundamental physics governing pollution dispersion. To assess the performance of the system, we deployed 59 self-designed portable sensing devices in two cities, operating for 14 days to collect air pollution data. Compared to the second-best performing algorithm, our model achieved improvements of up to 89.12% in MAE, 82.30% in RMSE, and 25.00% in MAPE, with extensive evaluations demonstrating that STeP-Diff effectively captures the spatio-temporal dependencies in air pollution fields.

Abstract:
Multivariate time series classification (MTSC) plays a critical role in a wide range of real-world applications, such as healthcare, finance, and industrial monitoring. This paper proposes a triple-fusion network (TriFusNet), a novel convolutional network designed to address the challenges of MTSC. TriFusNet employs a specialized architecture that captures both variable-specific features and features shared across variables through the parallel use of standard, depth-wise, and shared-kernel convolutions. A hierarchical triple-fusion strategy is introduced to enhance representation learning across three stages: input-level fusion transforms raw variables, intermediate-level fusion integrates heterogeneous features, and output-level fusion improves decision robustness. Extensive experiments on 26 benchmark datasets show that TriFusNet outperforms 15 competitive baselines, achieving the best average rank (4.3077) with a Win/Draw/Loss of 5/4/17. The effectiveness of its architectural design and parameter settings is empirically validated, and a qualitative theoretical discussion is conducted to support the proposed fusion strategy. These results highlight TriFusNet's strong potential for real-world applications involving complex and high-dimensional time series data.

Abstract:
Dynamic graph processing is becoming increasingly critical across a wide range of domains, including social networks, financial transactions, and business intelligence. Its effectiveness relies heavily on optimizations in both storage and analytics, which are essential for improving system performance, throughput, and scalability. While dynamic graph processing has attracted significant research attention and yielded notable progress, a comprehensive analysis that integrates advancements in both dynamic graph storage and analytics remains lacking. To address this gap, this paper presents a thorough review of state-of-the-art techniques that support dynamic graph processing, with a particular focus on storage and analytical methods. Specifically, we first outline the fundamental challenges and core design principles in the field. Then, we systematically classify and summarize existing approaches, encompassing dynamic graph storage and analytics optimizations across both CPU and GPU platforms. Finally, we identify key research gaps and suggest promising directions for future work. This survey presents a comprehensive and up-to-date review of the literature on dynamic graph processing, offering valuable insights for both new and established researchers and contributing to the advancement of the field.

Abstract:
Large language models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Prompt Compression (LLM-DPC). Our method reduces the number of prompt tokens while minimizing any degradation in LLM performance. We model prompt compression as a Markov Decision Process (MDP), enabling the DPC-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DPC-Agent that balances the compression ratio, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DPC-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression ratios.
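
One simple way to balance the three reward terms named above is a weighted sum. The sketch below is purely illustrative: the abstract does not give the reward's functional form, so the linear combination, the weights, and the assumption that quality and retention are scored in [0, 1] are all hypothetical.

```python
def compression_reward(n_original, n_kept, output_quality, key_retention,
                       weights=(1.0, 1.0, 1.0)):
    """Hypothetical reward: weighted sum of three terms.

    compression    fraction of prompt tokens removed
    output_quality score in [0, 1] for the LLM output on the compressed prompt
    key_retention  fraction of key information still present
    """
    w_c, w_q, w_k = weights
    compression = 1.0 - n_kept / n_original
    return w_c * compression + w_q * output_quality + w_k * key_retention

# Removing 60 of 100 tokens while keeping all key information.
reward = compression_reward(100, 40, output_quality=0.9, key_retention=1.0)
```

Under this form, an agent that deletes more tokens gains on the first term but loses on the other two once it starts removing content the output depends on.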

Abstract:
Abnormal behavior detection is crucial in many fields, such as social networks, financial transactions, and cybersecurity. However, it poses significant challenges due to the intricate structural evolution of heterogeneous graphs and the need for explainable models. To address these issues, we propose a novel method called Explainable anomalous behavior (edge) detection for dynamic heterogeneous Graphs (ExpGraph). ExpGraph captures relation-aware structural evolution to model temporal behavioral patterns and introduces a prototype alignment mechanism to improve both performance and interpretability. Specifically, prototype alignment enhances detection by encouraging discriminative representations of normal behaviors, which facilitates more accurate identification of anomalies. It also improves interpretability by enabling intuitive explanations through measuring how anomalous behaviors differ from learned normal prototypes. We conduct extensive experiments to evaluate ExpGraph against advanced competitors. The results demonstrate that ExpGraph is 16.2% more effective than other methods on average. Moreover, it offers a deeper insight into abnormal behaviors in dynamic heterogeneous graphs.

Affiliations: College of Big Data and Intelligent Engineering, Chongqing Key Laboratory of Cloud-Edge Collaboration and Security for Intelligent Manufacturing, Chongqing College of Industrial Internet, Yangtze Normal University, Chongqing, China; Chongqing College of Artificial Intelligence, Chongqing Key Laboratory of Computational Intelligence, Key Laboratory of Big Data Intelligent Computing, Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing, China; Chongqing Key Laboratory of Brain-Inspired Cognitive Computing and Educational Rehabilitation for Children with Special Needs, Chongqing Normal University, Chongqing, China; Chongqing Ant Consumer Finance Company, Ltd., Ant Group, Chongqing, China

Abstract:
Although spectral clustering is capable of identifying clusters of arbitrary shapes, its high time and space complexity poses limitations in large-scale data clustering applications. To tackle this problem, researchers have proposed using anchor points to construct the similarity matrix, thereby reducing time and space complexity. However, current methods for generating anchor points do not fit the data well and are limited in approach. To improve upon existing anchor point generation methods, we propose a pseudo-label-based anchor point generation approach and develop a fast spectral clustering algorithm for large-scale data, named FSC-PLGB. The algorithm first randomly selects r points as an initial granular-ball, applies K-Means on these points to obtain pseudo-labels, calculates the pseudo-purity of the granular-ball based on these pseudo-labels, and then performs granular-ball division based on this pseudo-purity to generate anchor points. A similarity matrix is constructed between all sample points and anchor points, and finally, spectral clustering is applied to obtain the clustering results. The experimental results demonstrate that our proposed algorithm exhibits exceptional efficiency and significant superiority on large-scale datasets.

Abstract:
Source identification is a foundational task in multimedia forensics, enabling the attribution and verification of digital content. While existing methods have achieved significant progress for static data, they often fail to generalize effectively on sequential data, which exhibit unique challenges such as temporal dependencies and dynamic variations caused by environmental and transmission factors. These challenges are further exacerbated in real-world scenarios, where cross-domain variations—spanning devices, software, and transmission protocols—significantly degrade the performance of traditional approaches. To address these limitations, we propose VoVAE, a probabilistic variational framework tailored for generalizable source identification in sequential data. VoVAE explicitly models temporal dependencies while disentangling dynamic variations (e.g., transmission distortions) from static source-specific features (e.g., device patterns) within a decoupled but complementary feature space. By separating these factors, VoVAE enables the extraction of robust and transferable representations, ensuring accurate source attribution across diverse and unseen conditions. We evaluate VoVAE on two challenging forensic applications: cross-domain VoIP phone call identification and cross-domain video source camera identification, using the VPCID and QUFVD datasets. Experimental results demonstrate that VoVAE outperforms state-of-the-art methods, achieving significant improvements in generalization across cross-device, cross-software, and cross-brand scenarios. Comprehensive ablation studies further highlight the importance of dynamic representation learning and feature disentanglement in capturing temporal patterns and enhancing robustness to domain shifts. These findings establish VoVAE as a scalable and robust solution for source identification in sequential data across diverse forensic scenarios.

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Huawei Technologies Company, Ltd, Hangzhou, China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China; Yangtze Delta Region Institute (Quzhou), School of Computer Science and Engineering, University of Electronic Science and Technology of China, Quzhou, China; Hong Kong University of Science and Technology, Kowloon, Hong Kong; Yangtze Delta Region Institute (Quzhou), School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), Quzhou, China

Abstract:
Autonomous driving is an emerging technology that has developed rapidly over the last decade, with decision-making remaining a critical challenge, particularly due to its significant role in traffic congestion. In this paper, we propose a novel perception-and-decision framework, called HEAD, which consists of an enhanced perception module and a maneuver decision module to address this challenge. In the enhanced perception module, a graph-based state prediction model with a strategy of phantom vehicle construction is proposed to address incomplete vehicle features and predict future states in parallel. Then in the maneuver decision module, a deep reinforcement learning-based model is designed to learn a driving policy based on a parameterized action Markov decision process. A hybrid reward function takes into account aspects of safety, efficiency, comfort, and impact to guide the autonomous vehicle to make optimal maneuver decisions. To make our framework applicable to more scenarios, we further propose an improved HEAD (HEAD++) framework that makes the autonomous vehicle adapt to various road structures, such as lane merging and diverging scenarios. Besides, we develop a style tuning module in HEAD++, which supports personalized driving style tuning. To mitigate high training costs, an efficient style tuning method with approximate gradient descent is proposed to reduce the number of training iterations. Extensive experiments demonstrate the effectiveness of our framework. Compared to state-of-the-art methods, HEAD++ reduces overall traffic disturbance by 23.3%–40.9%, lowers collision risk by 4.5%–17.8%, and improves passenger comfort by 13.1%–30.5%, while maintaining high traffic efficiency.

Abstract:
Graph data has become increasingly important in the AI and Big Data era. However, graph data analysis raises privacy concerns since it often originates from individual users. As a privacy regulation, the right to be forgotten has been established to allow users to erase their data hosted by a third party. When users request to delete their information from the original graph, the deletion must be synchronized to analysis results, like graph statistics or pre-trained AI models. In existing works, much effort has been made to fulfill the right to be forgotten for complicated graph learning models. In this work, we aim at a fundamental query: graph summarization, which serves as a building block for many graph analysis tasks. Since re-summarizing the graph from scratch whenever a data removal request is received can be costly, we present a novel approach to graph summarization that anticipates potential deletion requests. Inspired by machine unlearning, we define this problem as graph unsummarization, which has three goals: efficiency, forgetting quality, and utility. Towards these goals, we propose SUGPT, a graph summarization and unsummarization method based on matrix partition and trie. The essence of SUGPT is to identify similarities between vertices by embedding matrix partitions into a trie structure, to accelerate summary updating upon deletion requests. We prove the forgetting quality of SUGPT theoretically, and our extensive experiments demonstrate that SUGPT balances efficiency and utility well in graph analysis.

Abstract:
Tensor-based multi-view clustering is a popular approach. It can enhance representation learning by exploring higher-order correlations among views. However, two key issues remain unsolved. First, minimizing the tensor rank is a complex multi-objective optimization problem, so finding a suitable optimization strategy is an open problem. Moreover, most tensor methods require two phases to obtain the consensus matrix, which usually leads to suboptimal performance. To address these issues, we propose a Tensorial Multi-view Clustering via Alternative Rank Minimization and Inter-view Alignment (ARIA), in which multiple low-rank matrices and the consistent matrix are jointly optimized in a unified framework. Specifically, we stack the representations obtained from different views into a higher-order tensor. Then, a non-convex alternative rank-minimizing regularization is introduced to achieve a tighter approximation of the rank function. Besides, we impose intra-view alignment constraints to establish a connection between inter-view and intra-view. Unlike the previous method, it is a one-step strategy to obtain the consensus representation. Notably, our approach requires only linear complexity, and thus it can be successfully applied in large-scale clustering tasks. Extensive experiments validate the effectiveness and scalability of the proposed method.

Abstract:
Human activity is intermittent, and social interaction changes over time, reflecting the time-varying nature of social networks. However, due to the complexity and dynamics of time-varying networks, the analysis of opinion dynamics with decision-making over time-varying social networks is still a challenging problem. In this paper, we study time-varying social network opinion dynamics under a one-step-ahead optimal decision-making mechanism. An explicit relationship between the supremum/infimum of the ultimate opinions, the maximum desired opinion and the maximum/minimum intensity of the influence of players is given. We provide criteria for determining whether an individual achieves the desired opinion. Moreover, an explicit relationship between the ultimate opinions and the influence weight of the decision-making mechanism is presented. Besides, we employ our theoretical framework to analyze time-varying Friedkin-Johnsen opinion dynamics under a one-step-ahead optimal decision-making mechanism. Based on real networks (the Dolphin network and the western US power grid) and the corresponding datasets, simulation experiments applying our theory illustrate that the dolphins achieve the performance desired by the keepers and the substations achieve the voltage regulation rates desired by the technicians.

Abstract:
Accurate resource planning in large-scale systems relies on reliable predictions of future workloads, a task inherently challenged by their variability and dynamism. Previous prediction methods are either ineffective at handling the changing dynamics of the series or are black boxes that do not admit effective theoretical analysis. To address these issues, we design an effective ensemble framework, Interval Prediction with Online Chasing (IPOC), tailored for multi-step interval forecasting in real-time systems. Theoretically, by formulating the task as a Dynamic Deterministic Markov Decision Process (Dd-MDP), an advanced theoretical framework is introduced to analyze problem solvability and derive conditions for the existence of feasible solutions. Incorporating the proposed Adaptive Copula Conformal Inference (ACCI) module and a well-designed Chasing Oracle, IPOC captures the changing dynamics and temporal dependencies to enable multi-step forecasting. We organically integrate advanced online learning theories with time series forecasting tasks to construct a forecasting framework that is both theoretically rigorous and practically effective. Theoretical analysis underpins IPOC’s effectiveness, demonstrating sublinear regret and adherence to confidence interval specifications. The chasing regret of the Chasing Oracle is $O(L_c)$, and the overall regret of IPOC is $O(\sqrt{L_c T \log |\mathcal{F}|})$. Empirically, IPOC is validated through extensive experiments on five real-world datasets, including public datasets and different types of workload collected from Bytedance Cloud, with comparisons to 25 baselines and 4 forecasting horizons (1/5/10/30). Specifically, IPOC achieves an average reduction of over 20% in RMSE/MAE/SMAPE/$\rho$-risk compared to baselines across five datasets. Besides, we apply our model to a case study on predictive auto-scaling tasks in actual large-scale cloud systems to validate its utility.
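
For readers unfamiliar with online conformal methods, the sketch below shows the basic adaptive conformal inference update from the literature, in which the working miscoverage level is nudged after every observation. It is not the ACCI module itself, whose details the abstract does not give; the step size `gamma` and the numeric values are hypothetical.

```python
def aci_update(alpha_t, target_alpha, covered, gamma=0.05):
    """One step of the basic adaptive conformal inference rule.

    alpha_t      current working miscoverage level
    target_alpha desired long-run miscoverage (e.g. 0.1 for 90% intervals)
    covered      True if the last interval contained the observed value
    """
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

after_miss = aci_update(0.1, 0.1, covered=False)  # level drops, intervals widen
after_hit = aci_update(0.1, 0.1, covered=True)    # level rises, intervals tighten
```

A miss lowers the working level, so the next interval is built from a more extreme quantile and becomes wider. This feedback loop is what lets coverage track the target even when the workload distribution drifts.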

Abstract:
Accurately answering range queries while protecting user privacy is critical in large-scale data collection scenarios. Current solutions for this problem based on local or shuffled differential privacy often use hierarchical trees, grids, and matrix mechanisms, which rely on domain decomposition and assume uniform distribution within sub-domains, resulting in significant accumulated noise errors and non-uniform errors when encountering non-uniform data distributions. To solve this problem, we propose a data distribution-aware structure in the shuffle model of differential privacy, called PriPL-Tree. This structure uses a piecewise linear function to fit the data, eliminate non-uniform errors, and reduce noise, thanks to its concise tree structure and privacy amplification from shuffling. To build this tree with a balance of privacy, accuracy, and efficiency, we devise a novel node frequency estimation protocol for enhanced privacy amplification, a numerically optimized tree construction method for efficiency, and a weighted tree optimization method for improved accuracy. Additionally, we combine PriPL-Tree with grids to adapt to multi-dimensional scenarios with optimally and non-uniformly allocated privacy budgets among dimensions. Through rigorous theoretical analysis and extensive experiments, we demonstrate the effectiveness and efficiency of our methods.
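
The benefit of a piecewise linear fit for range queries can be seen in a small worked example: once the density is described by a few breakpoints, a range query is answered by integrating the line segments it overlaps. The sketch leaves out the tree structure, the noise, and the shuffling of PriPL-Tree entirely, and the breakpoints are hypothetical.

```python
def segment_mass(x0, y0, x1, y1, a, b):
    """Integral over [a, b] of the line through (x0, y0) and (x1, y1),
    restricted to the segment's own interval [x0, x1]."""
    lo, hi = max(a, x0), min(b, x1)
    if lo >= hi:
        return 0.0
    slope = (y1 - y0) / (x1 - x0)
    f_lo = y0 + slope * (lo - x0)
    f_hi = y0 + slope * (hi - x0)
    return 0.5 * (f_lo + f_hi) * (hi - lo)  # trapezoid area is exact for a line

def range_query(breakpoints, a, b):
    """Estimated fraction of the data falling in [a, b]."""
    return sum(segment_mass(x0, y0, x1, y1, a, b)
               for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]))

# Hypothetical triangular density on [0, 2] with total mass 1.
breakpoints = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
total = range_query(breakpoints, 0.0, 2.0)
middle = range_query(breakpoints, 0.5, 1.5)
```

For this density the query [0.5, 1.5] evaluates to 0.75. Assuming a uniform distribution over [0, 2] would give 0.5 for the same query, which is the kind of non-uniform error the abstract refers to.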

Abstract:
Probabilistic approaches, particularly Bayesian methods, are a cornerstone of data cleaning, yet they often depend on complex prior distributions that require costly and labor-intensive expert input. Our prior work, BClean, alleviated this burden by introducing automatic Bayesian network (BN) construction and lightweight user constraints (UCs), but it still fundamentally relies on manually provided prior knowledge. In this paper, we present BClean+, an enhanced Bayesian data cleaning system that extends BClean with a novel framework for automated prior generation. BClean+ leverages Large Language Models (LLMs) to identify attribute semantics and automatically synthesizes format patterns as UCs, while continuously maintaining a reusable template library. It also enhances BN construction through hierarchical structure discovery, improving interpretability and enabling more effective refinement for accurate inference. By integrating the automatically generated UCs into its Bayesian inference framework, BClean+ achieves more robust and accurate cleaning. Moreover, the framework generalizes to the synthesis of probabilistic programming language (PPL) code for systems such as PClean, thereby addressing a critical usability challenge in PPL-based data cleaning. Extensive experiments on real-world datasets demonstrate that BClean+ achieves an average F1-score of 0.89 (up to 0.98), outperforming state-of-the-art methods by 0.42 on average (up to 0.57), while reducing user configuration time from hours to under five minutes, with an average of 113.28× speedup in total runtime over BClean and other baselines.

Abstract:
Multi-view clustering (MVC) leverages complementary information across heterogeneous views to improve unsupervised partitioning. Nevertheless, the majority of existing MVC methods critically assume that samples are fully aligned across views, an assumption frequently violated in practice when multi-view data are collected from independent sources without any correspondence. This gives rise to Completely Unaligned multi-view Clustering (CUC), where cross-view sample correspondences are entirely unknown, fundamentally impeding effective multi-view fusion. Prior CUC-oriented methods typically infer inter-view relations from distance/similarity matrices; however, severe cross-view heterogeneity often induces over-smoothing in such matrices, leading to unreliable matching signals and degraded clustering performance. To address these issues, we propose Cross-view Graph Matching for Completely Unaligned multi-view Clustering (CGM-CUC), a unified framework that couples structure-aware representation learning with progressive cross-view alignment. Specifically, CGM-CUC introduces a bipartite graph–based sample re-encoding mechanism to enhance discriminative structural cues, and an iterative cross-view matching network that progressively refines permutation matrices to recover latent correspondences. Moreover, we develop an alignment-guided optimization strategy that mitigates the over-smoothing effect in similarity estimation, thereby stabilizing the matching process and improving downstream clustering. Extensive experiments on multiple benchmark datasets demonstrate that CGM-CUC consistently achieves superior performance over state-of-the-art baselines, with particularly notable gains under fully unaligned view settings.

Abstract:
The Cloud-Edge-Device (CED) architecture has emerged as a new framework for real-time data processing in the Internet of Things (IoT) era. However, the edge and device face significant resource constraints that prevent them from storing or processing full datasets. Effective data partitioning across CED architectures is therefore critical for supporting real-time decision-making. However, existing static and coarse-grained dynamic methods fail to adapt to changing workloads and to meet real-time processing demands. To address this issue, we propose FinePar, a fine-grained dynamic data partitioning framework based on deep reinforcement learning (DRL), coupled with an efficient data allocation strategy. FinePar combines horizontal and vertical partitioning to optimize data partitioning across CED architectures, reducing data transfer volume and shortening execution time. We use DRL to adjust data partitioning strategies in real time based on task demands and resource states. To achieve end-to-end optimization, we design an efficient data allocation strategy. We verified the effectiveness of FinePar through extensive experiments. Experimental results show that FinePar reduces edge-side latency by 80% under resource constraints and dynamically adapts to workload changes.

Abstract:
Despite encouraging achievements, the practical application of recommendation systems still faces two key issues. The first is how to better understand multimodal real-time requests, which are the more mainstream request behavior in industrial scenarios; the second is how to effectively capture users’ dynamic needs, which change with temporal and spatial conditions. The breakthroughs in text understanding and generation capabilities of Large Language Models (LLMs) have demonstrated their tremendous potential in precise recommendation systems, particularly through an enhanced understanding of user intent. To address these issues, we propose a novel Spatial-Temporal Multimodal LLM for generative recommendation. Specifically, on the basis of behavior data constructed from Alipay, a spatial-temporal knowledge-guided fine-tuning module is proposed to capture specific needs in users' real-time requests. Furthermore, a preference discovery module is developed to learn user preferences in visual queries from a multimodal request perspective. Meanwhile, a personalized recommendation module is designed to aggregate spatial-temporal knowledge and user preferences for generative recommendation. Experimental results on a real-world deployed generative recommendation task from the ‘Explore’ scenario in Alipay demonstrate the effectiveness of the proposed framework.

Abstract:
To manage massive trajectory data, we propose a novel cloud-based trajectory data management technique, Springbok, which leverages cloud storage to balance performance and monetary costs. Unlike existing key-value, relational, and time series databases, Springbok natively models trajectories as first-class data objects via a spatio-temporal series data model, enabling efficient insertion, query processing, and cost-aware storage management in cloud environments. It further employs an optimized indexing scheme that accounts for both the characteristics of trajectory data and the properties of cloud storage, enabling efficient query execution. In addition, Springbok adopts a tiered cloud storage architecture with carefully designed data layouts, flushing, access, and compression policies, guided by cloud storage performance characteristics and pricing models, to jointly optimize performance and cost. We implemented a fully functional prototype of Springbok supporting core trajectory queries, deletion, and crash recovery, and evaluated it using both real-world and synthetic datasets. The results show that Springbok achieves comparable or better performance than state-of-the-art systems in most cases while significantly reducing monetary costs, demonstrating its effectiveness in balancing performance and cost.

Abstract:
This paper addresses the challenge of identifying super spreaders within large, high-speed data streams. In these streams, data is segmented into flows, with each flow’s spread defined as the number of distinct items it contains. A super spreader is characterized as a flow with a notably large spread. Measuring flow spread requires counting any item at its first appearance and ignoring its subsequent duplicate appearances. Current compact solutions, known as sketches, are designed to fit within the constrained memory of online devices. However, existing sketches face accuracy challenges in spread tracking due to the substantial memory needed to remove the impact of duplicate appearances for measuring a single flow’s spread—a problem that compounds with increasing flow counts. We propose a novel sketch-based solution to address these limitations. At its core, our approach features an innovative non-duplicate sampler that eliminates duplicate appearances of any item, enabling accurate flow spread calculation using simple counters. Combined with our exponential-weakening decay mechanism that emphasizes large flows, the solution significantly improves super spreader detection accuracy. We provide rigorous theoretical analysis of our method and validate its performance through trace-driven experiments. Results demonstrate that our approach statistically outperforms existing state-of-the-art solutions in super spreader identification. Moreover, it achieves the fastest super spreader restoration time and reduces bandwidth consumption by an order of magnitude during remote offline restoration.
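
The definition of flow spread above can be made concrete with an exact, memory-hungry baseline. The sketch below is not the sketch-based method of the paper: it simply stores every distinct item per flow in a set, which is the cost that compact sketches try to avoid. The function names and the `(flow, item)` stream format are assumptions made for illustration.

```python
from collections import defaultdict


def flow_spreads(stream):
    """Exact spread per flow: the number of distinct items seen in that flow.

    `stream` is an iterable of (flow, item) pairs (an assumed format).
    """
    seen = defaultdict(set)
    for flow, item in stream:
        # A set counts an item at its first appearance and ignores duplicates.
        seen[flow].add(item)
    return {flow: len(items) for flow, items in seen.items()}


def super_spreaders(stream, threshold):
    """Flows whose exact spread is at least `threshold` (hypothetical cutoff)."""
    return {f for f, s in flow_spreads(stream).items() if s >= threshold}


if __name__ == "__main__":
    stream = [("a", 1), ("a", 2), ("a", 1), ("b", 7), ("a", 3), ("b", 7)]
    print(flow_spreads(stream))
    print(super_spreaders(stream, threshold=3))
```

Memory here grows with the total number of distinct (flow, item) pairs, which is why this baseline does not fit the constrained memory of online devices.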

Affiliations: School of Software Engineering, Chengdu University of Information Technology, Chengdu, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; OceanBase, Hangzhou, China; School of Management, Chengdu University of Information Technology, Chengdu, China; School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China; Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology, Hefei, China

Abstract:
Automatic database parameter tuning is a significant challenge for database administrators (DBAs) in artificial intelligence (AI) enabled database (DB) systems. Optimizing key parameters is crucial for identifying critical interactions among them. Aiming to overcome the disadvantages of existing methods, we propose a collaborative multi-agent model called CMA+DB to automatically tune DB parameters in an effective and efficient fashion. CMA+DB integrates three components: SAPM (Single-Agent Pre-trained Model), MATM (Multi-Agent Joint Training Model), and PJTM (Probability-based Joint Training Model). SAPM applies the deep deterministic policy gradient to explore the impact of a single agent on DB performance, MATM uses multi-agent deep deterministic policy gradients to find agents that work collaboratively to improve DB performance, and PJTM enhances parameter tuning by important agents based on a probabilistic selection factor. In the CMA+DB model, each agent is responsible for tuning a portion of the parameters, and multiple agents collaborate to recommend the optimal parameter configuration. This hybrid model can expand the number of tunable parameters in order to perform parameter tuning at both the function and parameter levels (i.e., global, DB, and session level). Experimental results reveal that CMA+DB converges fastest (measured when reaching the largest throughput), on average 14.83% faster than the state-of-the-art (SOTA) algorithms on the TPC-C benchmark. After the SAPM training phase, CMA+DB outperforms the SOTA models in throughput. Furthermore, the DB performance of CMA+DB can be improved by 1.758% through the MATM and PJTM training phases.

Abstract:
Graph condensation (GC) improves the efficiency of GNN training by condensing a large-scale graph into a compact synthetic graph. However, existing GC methods suffer from time-consuming optimization processes, and the underlying mechanisms driving their effectiveness remain unexplored. In this paper, we provide novel insights into the optimization strategies of GC, demonstrating that various methods ultimately converge to the class-level feature matching between the original and condensed graphs. Building on this understanding, we further refine the unified class-to-class matching paradigm into a fine-grained class-to-node paradigm, unveiling that the core mechanism of GC is a class-wise clustering problem in the latent space. Accordingly, we propose Deep Clustering-based Graph Condensation (DeepCGC), an efficient GC framework that integrates a clustering-based optimization objective with an invertible relay model. Extensive experiments show that DeepCGC achieves state-of-the-art efficiency and accuracy. Notably, it condenses the million-scale Ogbn-products graph in around 40 seconds, a $10^2\times$ to $10^4\times$ speedup over existing methods, while boosting accuracy by up to 4.6%.

Abstract:
Although the latest artificial intelligence technologies can greatly improve work efficiency by automatically generating feasible solutions in the digital world (DW), they are incapable of discovering or creating new knowledge, i.e., they lack human intelligence and creativity. To overcome this limitation, this article describes and elaborates the masterplan of the digital intelligent world (DIW), wherein everyone has an intelligent agent (IA) for searching, exchanging, and processing information and knowledge autonomously. First, a data-information-knowledge-intelligence (DIKI) model is proposed to illustrate the challenges of creating intelligence from raw data, and of realizing the DIW from the DW. Specifically, the DIW adopts knowledge-driven approaches and could achieve huge productivity enhancement through cross-domain innovations and deep intelligentization with broader creativity. Second, at the individual level, a knowledge processing architecture of the IA is defined to support knowledge-centric operations and services. Third, at the system level, a knowledge market (KM) framework is established for fair, effective, and autonomous collaboration among massive numbers of IAs. Inspired by basic laws in statistical thermodynamics, information sciences, and economics, three fundamental principles are developed and discussed for guaranteeing a prosperous KM and a sustainable DIW.

Abstract:
Graph database engines play a pivotal role in efficiently storing and managing graph data across various domains, including bioinformatics, knowledge graphs, and recommender systems. Graph databases must be accurate because errors lead to faulty analysis. Current bug-detection approaches are confined to specific graph query languages, limiting their applicability to graph database engines that use different graph query languages across various domains. Moreover, they require extensive prior knowledge to generate queries for detecting bugs. To address these challenges, we introduce DGDB, a novel paradigm harnessing large language models (LLMs), such as ChatGPT, for comprehensive bug detection in graph database engines. DGDB leverages ChatGPT to generate high-quality queries for different graph query languages. It subsequently employs differential testing to identify bugs in graph database engines. We applied this paradigm to graph database engines based on Cypher, Gremlin, and SPARQL, and detected a total of 23 previously unknown wrong-result bugs. DGDB achieves at least a 20.41% improvement in the non-empty-result query ratio and detects more than three times as many bugs as existing state-of-the-art methods on Cypher-based graph database engines, with further significant gains when employing more advanced LLMs.

Abstract:
Various graph models have emerged to meet diverse application needs, each with unique characteristics and specialties. Managing and analyzing graph data inevitably requires interactions across different models to serve upstream business requirements. Therefore, an Extract-Transform-Load (ETL) tool designed to bridge different graph models is desired. In this paper, we propose GETL, a generalized graph ETL framework capable of automatically identifying graph model schemas and performing seamless data conversion among RDF, RDF-star, labeled property graph, and the relational model. This is attributed to GETL’s unified graph representation model, constructed as nested < > pairs, offering powerful capabilities in graph representation and model compatibility. Additionally, we develop a unified programming interface to support complex graph transformation tasks. It is built upon the Gremlin syntax and provides strong expressive capabilities. Finally, our evaluation demonstrates that GETL outperforms state-of-the-art solutions in terms of model conversion efficiency and data manipulation language (DML) intelligibility.

Abstract:
Collaborative allocation effectively integrates multi-party data and improves the quality of collaborative data-driven decisions. Equitable data utility valuation accurately quantifies contributions and stimulates collaborative engagement, and thus forms the foundation for collaborative allocation. The Shapley value is the dominant allocation scheme, making it the first choice for data utility valuation. However, calculating the exact Shapley value requires an exponential number of utility function evaluations and a factorial number of marginal contribution calculations, which limits scalability for large datasets. Most mainstream methods use various approximation techniques, including Monte Carlo sampling, lightweight model replacement, and stratified computation, to reduce computational costs. Nonetheless, these methods lack guaranteed theoretical bounds on the approximation errors incurred when reducing computational costs. Finding the optimal trade-off between computational cost and approximation error is essential for practical data valuation. In this paper, we propose a stratified framework named Light Shapley for calculating Shapley values by incorporating quantization-aware training. For scenarios involving more players, we propose a cost-first method that achieves significant computational cost reductions while keeping the error within acceptable ranges. For scenarios with fewer players, we propose an error-first method that reduces the computational cost to less than half of the exact calculation while maintaining accuracy. Theoretical analysis and experimental results provide compelling evidence that Light Shapley balances computational cost and approximation error, enabling efficient and effective data utility valuation.
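
To illustrate the Monte Carlo sampling baseline mentioned above (not the Light Shapley framework itself), the following is a minimal sketch of permutation-sampling Shapley estimation. The function name, the `utility` callback signature, and the toy utility are assumptions chosen for illustration.

```python
import random


def monte_carlo_shapley(players, utility, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over
    randomly sampled player orderings.

    `utility` is an assumed callback mapping a set of players to a number.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = set()
        prev = utility(coalition)
        for p in order:
            coalition.add(p)
            cur = utility(coalition)
            totals[p] += cur - prev  # marginal contribution of p in this order
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}


if __name__ == "__main__":
    # Toy symmetric game: the utility of a coalition is its size squared.
    estimate = monte_carlo_shapley(["x", "y", "z"], lambda s: len(s) ** 2)
    print(estimate)
```

Each sampled ordering costs one utility evaluation per player, so the total cost is linear in the number of sampled permutations rather than exponential in the number of players; the price is a sampling error that shrinks only as more permutations are drawn.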

Abstract:
The dynamic nature of streaming data often introduces distribution shifts that challenge typical text classification models. This paper proposes an online learning framework tailored for streaming text classification under distribution shifts. First, we decompose a neural network-based text classification model into distinct modules and analyze the varying impact of updating these modules under different types of shifts. Based on this insight, we define three novel indicators to efficiently measure the extent of distribution shifts without evaluating the entire model. These indicators enable the development of predictive models that dynamically optimize module update strategies, balancing learning efficiency and accuracy in real-time. To the best of our knowledge, this is the first approach to systematically adapt model updates according to a trade-off between efficiency and accuracy in online text classification. Extensive experiments on real-world streaming datasets demonstrate the effectiveness of our method, which consistently outperforms both static update strategies and state-of-the-art online text classification models.

Abstract:
In information systems lacking decision-making information, effectively leveraging fuzzy rough sets for outlier detection in complex data is challenging, especially in capturing inherent uncertainty and multi-granularity characteristics to construct discriminative outlier scores. Existing fuzzy rough set-based outlier detection methods often suffer from three key limitations: (1) local data distributions are often ignored when calculating fuzzy relation matrices, resulting in inaccurate fuzzy similarity representations; (2) using all objects in fuzzy upper and lower approximations can weaken noise resistance and increase computational complexity; (3) single-granularity data processing reduces efficiency and may fail to capture the multi-granularity nature of data, thereby limiting the adaptability of these methods in complex data environments. To address these issues, we propose a method that fuses Natural neighbor fuzzy approximations with Granular-ball representation for Outlier Detection (NGOD), which integrates the multi-granularity granular-ball representation and fuzzy rough sets to improve the effectiveness and robustness of unsupervised outlier detection. Specifically, we first define a local distribution-aware fuzzy relation, enabling more discriminative similarity calculations between samples. To improve the effectiveness and robustness of fuzzy upper and lower approximations, we propose a multi-granularity natural neighbor fuzzy approximation model, which effectively utilizes the inherent uncertainty and local abnormal information of data in approximations. Moreover, by introducing natural neighbors, NGOD can adaptively capture local abnormal information in the data without setting neighborhoods manually. Finally, the outlier factor of each sample is calculated in NGOD to measure its outlier degree. Extensive experiments on diverse datasets demonstrate that NGOD outperforms state-of-the-art methods, validating its superior performance and adaptability.

Abstract:
Semi-supervised learning (SSL) problems are challenging, appear in many domains, and are particularly relevant to streaming applications, where data are abundant but labels are not. The problem tackled here is classification over an evolving data stream where labels are rare and distributed randomly. We propose SLEADE (Stream LEArning by Disagreement Ensemble), a novel method that exploits disagreement-based learning and unsupervised drift detection to leverage unlabeled data during training. SLEADE uses pseudo-labeled instances to augment the training set of each member of an ensemble using a majority-trains-the-minority scheme. The impact of the pseudo-labeled data is controlled by a weighting function that considers the confidence in the prediction attributed by the ensemble members. SLEADE exploits unsupervised drift detection, which allows the ensemble to respond to changes. We present several experiments using real and synthetic data to illustrate the benefits and limitations of SLEADE compared to existing algorithms.
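
One way to read the majority-trains-the-minority idea is sketched below. This is an illustrative interpretation, not SLEADE's actual procedure: the function name, the strict-majority rule, and the use of the vote fraction as the pseudo-label weight are all assumptions (the abstract only states that a weighting function based on ensemble confidence is used).

```python
from collections import Counter


def majority_trains_minority(predictions):
    """Decide how one unlabeled instance is used for training.

    `predictions` holds one predicted label per ensemble member.
    Returns (pseudo_label, weight, minority_indices), or None when no strict
    majority exists. The weight is the fraction of members that agree, used
    here as a stand-in confidence measure (an assumption).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes * 2 <= len(predictions):
        return None  # no strict majority: skip this instance
    minority = [i for i, p in enumerate(predictions) if p != label]
    return label, votes / len(predictions), minority


if __name__ == "__main__":
    # Members 0 and 1 agree, so member 2 would be trained on the pseudo-label.
    print(majority_trains_minority(["spam", "spam", "ham"]))
```

Only the disagreeing members receive the pseudo-labeled instance, so members that already agree with the majority are not reinforced on their own predictions.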

Abstract:
Anchor graph learning has become a widely used technique for significantly reducing the computational complexity of multi-view clustering methods. However, most existing approaches select anchors independently for each view and then generate the consensus graph by directly fusing all anchor graphs. This process overlooks the correspondence between anchor sets across different views, i.e., the column order correspondence of the anchor graphs. To address this limitation, we propose a novel anchor-based tensor multi-rank constraint multi-view clustering method (TMC). Specifically, TMC captures the high-order structural information of the original data by constructing an anchor graph tensor and enforcing a multi-rank constraint to induce a block-diagonal structure. Additionally, to enhance anchor consistency across all views, we stack the anchors of each view into an anchor tensor and impose a low-rank constraint on it. In this way, the block-diagonal structure of each anchor graph maintains an approximate alignment between anchors. Furthermore, we provide a theoretical proof that the generated anchor graphs inherently exhibit a block-diagonal structure. Extensive experimental results on six multi-view datasets demonstrate that TMC outperforms existing state-of-the-art methods, highlighting its effectiveness in multi-view clustering tasks.

Abstract:
In database systems, cost estimation for query plans has a variety of uses, including query optimization, resource management, load balancing, query scheduling, performance monitoring, and automated maintenance. Existing methods mainly targeted B-tree-based systems, where costs and cardinality are highly correlated. However, LSM-Trees, due to their unique storage structure, violate the assumptions of existing learning methods, causing cardinality to be irrelevant to cost estimation. In addition, the constantly changing data layout leads to severe data drifts when updating data, preventing current learning-based models from accurately estimating costs in an agile way. To address these challenges, we propose a dual-layer end-to-end cost estimation model for LSM-Tree-based database systems. This model treats cost estimation as a regression problem and comprises two layers: the storage layer and the query plan layer. The storage layer employs lightweight neural networks to leverage data distribution, provide information to the query plan layer, and address the challenges posed by data drift in LSM-Trees. The query plan layer uses the Transformer framework and incorporates structural information to learn the representation of plans. This dual-layer architecture allows our model to effectively embed storage information and query plan tree details. The results show that our proposed model achieves state-of-the-art cost estimation accuracy for database systems based on LSM-Trees. Additionally, our architecture significantly reduces the model’s updating costs, ensuring robust performance amid frequent data drifts.

Abstract:
Launching attacks against community detection to significantly deteriorate its performance has received increasing research attention recently due to the importance and wide applications of community detection. However, we observe that most previous attacks suffer from two major weaknesses: i) the negligence of community structures in their proxy metrics, and ii) limited attack scope. To tackle these issues, we propose a new research problem, Perturbing Community Detection with $\mu$-Triad Minimization ($\mu$-PerCD), based on a new metric proposed in this paper, named $\mu$-triad, to attack community detection more effectively. We first present analysis results to justify the effectiveness of $\mu$-triads by comparing it with many other candidate proxy metrics. Also, we illustrate the rationale behind the formulation of $\mu$-PerCD problem with experiments on real datasets. Then, we analyze the NP-hardness of $\mu$-PerCD and propose two $\frac{1}{4}(1-\frac{1}{e})$-approximation algorithms, named $\mu$-Triad Minimization with Edge Addition ($\mu$MEA) and $\mu$MEA+, where $\mu$MEA+ is an efficiency-enhanced version of $\mu$MEA while retaining the approximation ratio. Extensive experiments on real datasets demonstrate the effectiveness of the proposed algorithm in attacking various community detection algorithms, significantly outperforming the other state-of-the-art baselines.

Abstract:
Graph clustering is a fundamental data mining task that clusters vertices into different groups. The structural graph clustering algorithm (SCAN) is a widely used graph clustering algorithm that derives not only clustering results, but also special roles of vertices such as hubs and outliers. In this paper, we consider structural graph clustering on dynamic graphs under Jaccard similarity. The state-of-the-art index-based solution focuses on static graphs and incurs prohibitive update costs to maintain indices. Recently, an efficient approximate dynamic structural graph clustering algorithm under Jaccard similarity, DynStrClu, was proposed. However, that solution needs to fix the input parameters, while the parameter settings of SCAN usually need to be fine-tuned to achieve good clustering results. Motivated by these limitations, we present a study on devising effective index structures for the SCAN algorithm on dynamic graphs. Similar to the state-of-the-art dynamic scheme, our main idea for reducing the time complexity is to introduce approximation into the clustering results. However, our solution does not need to fix the input parameters. To achieve this, our solution includes two key components. The first is to maintain a bottom-$k$ sketch for each vertex so that the similarities of affected vertices can be easily updated. The second is a bucketing strategy that allows us to update clustering results and roles of vertices efficiently. Our theoretical analysis shows that our proposed algorithm achieves $O(\log\log\frac{M+m}{p_f}\cdot\log\frac{M+m}{p_f})$ expected update cost and guarantees to return approximate clustering results with probability $1-p_f$ after up to $M$ updates. Extensive experiments show that our solution is up to two orders of magnitude faster than the state-of-the-art index-based solution while still achieving high-quality clustering results.
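
The bottom-k sketch named above is a standard technique for estimating Jaccard similarity from small per-set summaries. The following is a minimal sketch of the textbook construction applied to two neighbor sets; it does not reproduce the index or update scheme of the paper, and the hash choice and function names are assumptions.

```python
import hashlib


def _hash(x):
    """Map an item to a 64-bit integer (SHA-1 prefix, an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")


def bottom_k(items, k):
    """Bottom-k sketch: the k smallest hash values of the set."""
    return sorted({_hash(x) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches.

    The k smallest hashes of the merged sketches are the bottom-k sketch of
    A ∪ B; the fraction of them present in both sketches is the estimate.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_k = sorted(a | b)[:k]
    if not union_k:
        return 0.0
    return sum(1 for h in union_k if h in a and h in b) / len(union_k)


if __name__ == "__main__":
    k = 64
    nbrs_u = set(range(0, 300))    # hypothetical neighbor set of vertex u
    nbrs_v = set(range(100, 400))  # hypothetical neighbor set of vertex v
    print(estimate_jaccard(bottom_k(nbrs_u, k), bottom_k(nbrs_v, k), k))
```

When k is at least the size of the union, the sketches hold every hash and the estimate equals the exact Jaccard similarity (barring hash collisions); for smaller k the result is an approximation whose accuracy improves with k.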

Abstract:
Graph pre-training has been concentrated on graph-level tasks involving small graphs (e.g., molecular graphs) or learning node representations on a fixed graph. Extending graph pre-trained models to web-scale graphs with billions of nodes in industrial scenarios, while avoiding negative transfer across graphs or tasks, remains a challenge. We aim to develop a general graph pre-trained model with inductive ability that can make predictions for unseen new nodes and even new graphs. In this work, we introduce a scalable transformer-based graph pre-training framework called PGT (Pre-trained Graph Transformer). Based on the masked autoencoder architecture, we design two pre-training tasks: one for reconstructing node features and the other for reconstructing local structures. Unlike the original autoencoder architecture where the pre-trained decoder is discarded, we propose a novel strategy that utilizes the decoder for feature augmentation. Our framework, tested on the publicly available ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges, achieves state-of-the-art performance, showcasing scalability and efficiency. We have deployed our framework on Tencent’s online game data, confirming its capability to pre-train on real-world graphs with over 540 million nodes and 12 billion edges and to generalize effectively across diverse static and dynamic downstream tasks.

Abstract:
Graph pattern queries (GPQs) over RDF graphs extend basic graph patterns to support variable-length paths (VLPs), thereby enabling complex knowledge retrieval and navigation. Generally, a variable-length path describes the reachability between two vertices via a given property within a specified range. With the increasing scale of RDF graphs, it is necessary to design a partitioning method to achieve efficient distributed queries. Although many partitioning strategies have been proposed for large RDF graphs, most existing methods result in numerous inter-partition joins when processing GPQs, which degrades query performance. In this paper, we formulate a new partitioning problem, MaxLocJoin, which aims to minimize inter-partition joins during distributed GPQ processing. For MaxLocJoin, we propose a partitioning framework (PIP) based on property-induced subgraphs, which consist of edges with a specific set of properties. The framework first finds a locally joinable property set using a cost-driven algorithm, LJPS, where the cost depends on the sizes of weakly connected components within its property-induced subgraphs. Subsequently, the graph is partitioned according to the weakly connected components. The framework achieves two key objectives: first, it enables complete local processing of all variable-length path queries (eliminating inter-partition joins); second, it minimizes the number of inter-partition joins required for traditional graph pattern queries. Moreover, we identify two types of independently executable queries (IEQs): the locally joinable IEQ and the single-property IEQ. A query decomposition algorithm is then designed to transform every GPQ into one of them for independent execution in distributed environments. In experiments, we implement two prototype systems based on Jena and Virtuoso, and evaluate them over both real and synthetic RDF graphs. The results show that MaxLocJoin achieves performance improvements from 2.8x to 10.7x over existing methods.

Abstract:
Spatial crowdsourcing (SC) has become increasingly popular. As a critical issue in SC, task assignment currently faces challenges due to the imbalanced spatiotemporal distribution of tasks. Hence, many related studies and applications focusing on cross-platform task allocation in SC have emerged. Existing work primarily focuses on maximizing the total revenue of the inner platform in cross-platform task assignment. In this work, we formulate an SC problem called Cross Dynamic Task Assignment (CDTA) to maximize the overall utility and propose improved solutions aiming at creating a win-win situation for the inner platform, task requesters, and outer workers. We first design a hybrid batch processing framework and a novel cross-platform incentive mechanism. Then, with the purpose of allocating tasks to both inner and outer workers, we present a KM-based algorithm that obtains an accurate assignment result in each batch and a density-aware greedy algorithm with high efficiency. To maximize the revenue of the inner platform and outer workers simultaneously, we model the competition among outer workers as a potential game that is shown to have at least one pure Nash equilibrium and develop a game-theoretic method. Additionally, an improved algorithm based on simulated annealing is proposed to avoid falling into local optima. Finally, since random thresholds lead to unstable results when picking tasks that are preferentially assigned to inner workers, we devise an adaptive threshold selection algorithm based on multi-armed bandits to further improve the overall utility. Extensive experiments demonstrate the effectiveness and efficiency of our proposed algorithms on both real and synthetic datasets.

Abstract:
As graph-based data storage becomes increasingly prevalent, the demand for robust privacy protection during data publication has intensified. A fundamental tension, however, exists between the level of privacy protection and the utility of the anonymized graph: achieving stronger privacy often necessitates substantial alterations to the original graph (e.g., adding or removing nodes or edges), which can compromise critical graph attributes and reduce practical value. Addressing this challenge, we propose a two-phase privacy protection framework, 1HIkDA, designed to ensure both 1-hop indistinguishability and $k$-degree anonymity, thus offering enhanced resilience against unknown structural attacks. To optimize the usability of anonymized graphs, our framework integrates two distinct algorithms: the Greedy-based 1-hop Indistinguishable (G1HI) algorithm and the Wasserstein and Regularization-based $k$-degree Anonymity (WRkDA) algorithm. The G1HI algorithm achieves neighborhood indistinguishability with minimal modifications and low costs, while the WRkDA algorithm minimizes the Wasserstein distance between degree distributions pre- and post-anonymization to meet $k$-degree anonymity requirements. Extensive experiments conducted on nine synthetic and real-world networks demonstrate the effectiveness and efficiency of the proposed scheme.
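
For readers unfamiliar with the term, a graph is k-degree anonymous when every degree value that occurs in it is shared by at least k nodes. The check below illustrates that standard definition only; it is not the WRkDA algorithm, and the function names and edge-list format are assumptions.

```python
from collections import Counter


def degrees(edges):
    """Degree of every endpoint in an undirected edge list of (u, v) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_k_degree_anonymous(edges, k, nodes=()):
    """True if each occurring degree value is shared by at least k nodes.

    `nodes` optionally lists isolated nodes that appear in no edge.
    """
    deg = degrees(edges)
    all_nodes = set(deg) | set(nodes)
    degree_counts = Counter(deg.get(n, 0) for n in all_nodes)
    return all(count >= k for count in degree_counts.values())


if __name__ == "__main__":
    cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]  # every node has degree 2
    path = [("a", "b"), ("b", "c")]           # degrees 1, 2, 1
    print(is_k_degree_anonymous(cycle, 4))
    print(is_k_degree_anonymous(path, 2))
```

An anonymization algorithm has to edit the graph until such a check passes, which is where the trade-off with utility described above arises.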

Abstract:
Weakly-Supervised Anomaly Detection (WSAD) has garnered increasing research interest in recent years, as it enables superior detection performance while demanding only a small fraction of labeled data. However, existing WSAD methods face two major limitations. From the data aspect, they struggle to detect anomalies between normal clusters or collective anomalies due to overlooking the multi-distribution and complex manifolds of real-world data. From the label aspect, they fall short of detecting unknown anomalies because of the label-insufficiency and anomaly contamination. To address these issues, we propose MMM, a unified WSAD framework for multi-distributional data. The framework consists of three components: a Multi-distribution data modeler captures latent representations of complex data distributions, followed by a Multiform feature extractor that extracts multiple underlying features from the modeler, highlighting the characteristics of potential anomalies. Finally, a Multi-strategy anomaly score estimator converts these features into anomaly scores, with the aid of a novel training approach with three strategies that maximize the utility of both data and labels. Experimental results showed that MMM achieved superior performance and robustness compared to state-of-the-art WSAD methods, while providing interpretable results that facilitate practical anomaly analysis.

Abstract:
Multivariate time series (MTS) anomaly detection identifies abnormal patterns where each timestamp contains multiple variables. Existing MTS anomaly detection methods fall into three categories: reconstruction-based, prediction-based, and classifier-based methods. However, these methods face three key challenges: (1) unsupervised learning methods, such as reconstruction-based and prediction-based methods, rely on error thresholds, which can lead to inaccuracies; (2) semi-supervised methods mainly model normal data and often underuse anomaly labels, limiting detection of subtle anomalies; (3) supervised learning methods, such as classifier-based approaches, often fail to capture local relationships, incur high computational costs, and are constrained by the scarcity of labeled data. To address these limitations, we propose Moon, a supervised modality conversion-based multivariate time series anomaly detection framework. Moon enhances the efficiency and accuracy of anomaly detection while providing detailed anomaly analysis reports. First, Moon introduces a novel multivariate Markov Transition Field (MV-MTF) technique to convert numeric time series data into image representations, capturing relationships across variables and timestamps. Since numeric data retains unique patterns that cannot be fully captured by image conversion alone, Moon employs a Multimodal-CNN to integrate numeric and image data through a feature fusion model with parameter sharing, enhancing training efficiency. Finally, a SHAP-based anomaly explainer identifies key variables contributing to anomalies, improving interpretability. Extensive experiments on six real-world MTS datasets demonstrate that Moon outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy, and 10.8% in interpretation performance.
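
To illustrate the idea of turning a numeric series into an image, here is a minimal sketch of the classical single-variable Markov Transition Field. This is not the MV-MTF proposed in the abstract, whose multivariate construction is not described there. The rank-based binning (ties broken by position) and all function names are assumptions made for the example.

```python
def quantile_bins(series, n_bins):
    """Assign each point to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    bins = [0] * len(series)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(series)
    return bins


def transition_matrix(bins, n_bins):
    """Row-normalized first-order transition frequencies between bins."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def markov_transition_field(series, n_bins):
    """Entry (i, j) is the transition probability from bin(x_i) to bin(x_j)."""
    bins = quantile_bins(series, n_bins)
    w = transition_matrix(bins, n_bins)
    return [[w[bi][bj] for bj in bins] for bi in bins]


# Hypothetical toy series with a repeating ramp.
series = [1, 2, 3, 4, 1, 2, 3, 4]
mtf = markov_transition_field(series, n_bins=2)
```

The result is a square matrix with one row and column per timestamp, so it can be treated as a single-channel image and passed to a convolutional model.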

Abstract:
Data has become a critical economic asset in recent years. To enable secure and reliable access to data assets, the combination of symmetric searchable encryption (SSE) and Hybrid-storage blockchains (HSB) offers a promising solution by storing the authenticated data structure (ADS) on-chain and encrypted data off-chain, thus enabling efficient and authenticated encrypted queries. However, existing encrypted query schemes in HSB either lack support for conjunctive queries, a commonly used and important query pattern in databases, or exhibit low query efficiency in conjunctive queries. vsChain was the first scheme to support secure and authenticated conjunctive queries in HSB but had drawbacks in terms of high query and authentication costs. To overcome these limitations, we introduce SeaCQ, a novel scheme for secure and efficient authenticated conjunctive queries. SeaCQ employs a meticulously designed two-stage authenticated query process to achieve optimal query efficiency. It also incorporates a customized double-layer authentication mechanism to ensure the correctness and completeness of query results efficiently while providing error localization. Additionally, we present an extension of SeaCQ, SeaCQ, which is a gas-efficient version that utilizes a constant-size on-chain ADS. Our security analysis and experimental results validate the security and efficiency of the proposed schemes.

Affiliations: Department of Computer Science, Shantou University, Shantou, China; College of Computer Science and Technology, Huaqiao University, Xiamen, China; Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, College of Electronic and Information Engineering, Southwest University, Chongqing, China; School of Computing and Information Science, Faculty of Science and Engineering, Anglia Ruskin University, Cambridge, U.K.; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Department of Computer Science, City University of Hong Kong, Hong Kong

Abstract:
Recently, neighbor-based contrastive learning has been introduced to effectively exploit neighborhood information for clustering. However, these methods rely on the homophily assumption—that connected nodes share similar class labels and should therefore be close in feature space—which fails to account for the varying homophily levels in real-world graphs. As a result, applying contrastive learning to low-homophily graphs may lead to indistinguishable node representations due to unreliable neighborhood information, making it challenging to identify trustworthy neighborhoods with varying homophily levels in graph clustering. To tackle this, we introduce a novel neighborhood Neutral Contrastive Graph Clustering method NeuCGC that extends traditional contrastive learning by incorporating neutral pairs—node pairs treated as weighted positive pairs, rather than strictly positive or negative. These neutral pairs are dynamically adjusted based on the graph’s homophily level, enabling a more flexible and robust learning process. Leveraging neutral pairs in contrastive learning, our method incorporates two key components: 1) an adaptive contrastive neighborhood distribution alignment that adjusts based on the homophily level of the given attribute graph, ensuring effective alignment of neighborhood distributions, and 2) a contrastive neighborhood node feature consistency learning mechanism that leverages reliable neighborhood information from high-confidence graphs to learn robust node representations, mitigating the adverse effects of varying homophily levels and effectively exploiting highly trustworthy neighborhood information. Experimental results demonstrate the effectiveness and robustness of our approach, outperforming other state-of-the-art graph clustering methods.

Abstract:
Computing graph-propagation based node similarities is a fundamental operator in many graph mining and graph learning tasks. The state-of-the-art approach to compute the graph-propagation based similarity is based on a push-style iterative framework. The push framework is very efficient when the resulting node similarity vector π has a small L1-norm (e.g., personalized PageRank and heat kernel PageRank). However, we find that when π has a large L1-norm (e.g., Katz scores and exponential communicability), such a framework is inefficient. To overcome this issue, we propose a novel framework, called AdaPush, which is more efficient and flexible than the state-of-the-art (SOTA) framework. Based on the AdaPush framework, we develop two new algorithms with two different carefully-designed randomized acceleration techniques, respectively. We prove that both of our new algorithms can achieve a relative-error guarantee. Additionally, a striking feature of our algorithms is that their time complexity is insensitive to ‖π‖₁, thus they are efficient even when ‖π‖₁ is large. Extensive experiments on 5 large real-life datasets demonstrate that our algorithms substantially outperform the SOTA algorithms for computing Katz score and exponential communicability in terms of both running time and estimation accuracy.
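
As a point of reference for the similarity measures mentioned above, the following sketch computes a truncated single-source Katz score by plain power iteration. It assumes the common definition in which the score of node v is the sum over walk lengths i ≥ 1 of beta^i times the number of length-i walks from the source to v. It is a baseline illustration only and is neither the push framework nor AdaPush; the function name and parameters are hypothetical.

```python
def katz_from_source(adj, source, beta, num_terms):
    """Truncated Katz scores from `source` on a graph given as adjacency lists.

    adj[u] lists the out-neighbors of u. The series is cut off after
    num_terms walk lengths.
    """
    n = len(adj)
    walks = [0.0] * n          # number of walks of the current length
    walks[source] = 1.0        # one walk of length 0: the source itself
    scores = [0.0] * n
    weight = 1.0
    for _ in range(num_terms):
        nxt = [0.0] * n
        for u, count in enumerate(walks):
            if count:
                for v in adj[u]:
                    nxt[v] += count
        walks = nxt
        weight *= beta
        for v in range(n):
            scores[v] += weight * walks[v]
    return scores


# Hypothetical directed path 0 -> 1 -> 2.
path = [[1], [2], []]
scores = katz_from_source(path, source=0, beta=0.5, num_terms=3)
l1_norm = sum(scores)
```

Unlike personalized PageRank, whose scores sum to at most 1, nothing bounds `l1_norm` here by a constant: on dense graphs the number of walks grows quickly with length, which is the large-norm regime the abstract refers to.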

Abstract:
Workload prediction in multi-tenant edge cloud platforms (MT-ECP) is crucial for efficient application deployment and resource provisioning. However, the heterogeneous application patterns, variable infrastructure performance, and frequent deployments in MT-ECP pose significant challenges for accurate prediction. Existing clustering-based methods often incur excessive costs due to maintaining multiple data clusters and models, while end-to-end time-series prediction methods struggle with dynamic environments. To address these challenges, we perform a comprehensive analysis of a large-scale workload dataset from a real-world MT-ECP and propose DynEformer, an end-to-end framework with global pooling and static context awareness, offering a unified workload prediction scheme for dynamic MT-ECP. Meticulously designed global pooling and information merging mechanisms can effectively identify and utilize global application patterns to drive local workload predictions. The integration of static content-aware mechanisms enhances model robustness in real-world scenarios. We also extend DynEformer's capabilities to long-term workload forecasting (LTLF) and long-period service (LPS) tasks. Experiments on six real-world datasets demonstrate that DynEformer achieves state-of-the-art performance, with a 32% relative improvement over nine baselines and a 52% improvement in application switching and new entity scenarios. Additional experiments on long-term prediction and online learning further confirm its effectiveness for LTLF and LPS tasks.

Abstract:
With the popularity of federated learning, federated domain generalization (FedDG) has attracted increasing attention. Existing work on federated learning indicates that the generalization performance of the global model can be improved when it is obtained by aggregating local models with suitable weights. However, existing methods for calculating these weights do not fully utilize the influence of data on the global model update, which leaves room to further improve the generalization performance of the global model. In this paper, we propose DI (data influences), a method that uses the influence of data on the global model update to calculate dynamic weights for the local models in each round of training. Specifically, the first component of DI, the data influence calculator (DIC), calculates the local weights of each local model from the influence of its data on the global model update, using the influence function to carry out the calculation. The second component, the data influence adjuster (DIA), calculates the global weights (which are used in the aggregation process of the global model) from the local weights. Extensive experiments indicate that our method significantly improves the generalization performance of models. In particular, our method improves model accuracy on the benchmark datasets PACS, OfficeHome, and Office-31 by 1.79%, 1.61%, and 2.39% on average, respectively.

Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities in various natural language tasks. However, they often struggle with hallucinations or reasoning errors, particularly when handling domain-specific knowledge or complex multi-hop reasoning. The integration of knowledge graph (KG) provides LLMs with structured and reliable contextual knowledge, effectively mitigating issues of factual accuracy and incomplete reasoning chains. Nevertheless, existing KG-guided LLM reasoning methods still face challenges, including narrow answer coverage, limited accuracy in multi-hop reasoning, and inefficiency caused by frequent LLM API calls. To address these problems, we propose ELMK (Enhancing Large language models reasoning via Multi-path optimization on Knowledge graph), a novel KG-based LLM reasoning method that improves output comprehensiveness and interpretability. ELMK follows a retrieval–embedding–reasoning pipeline. First, depth-first search is used to extract relevant reasoning graphs, then a multi-path encoder is trained to semantically encode the question and candidate paths for precise path selection. Subsequently, the multi-path exploration strategy divides paths into multiple semantic clusters and selects the most similar paths from each cluster to ensure diverse coverage and complete answers. Finally, these paths are combined with prompts to guide LLMs toward reliable outputs. Extensive experiments on public benchmarks demonstrate that ELMK outperforms several state-of-the-art methods in terms of performance and generates more faithful and interpretable reasoning results.

Abstract:
Graph classification is a classic data mining task in graph-related domains. Graph Neural Networks (GNNs) have emerged as essential methods for graph classification due to their powerful ability to extract knowledge from graph-level data. Most methods employ a single GNN to learn graph representations and may overfit small datasets. Some ensemble graph neural networks combine multiple independent GNNs as an ensemble classifier, leveraging collective intelligence to provide more robust collective decisions to alleviate overfitting. Despite their effectiveness, they hardly consider and analyze the underlying implications behind the consistency and inconsistency of individual decisions. Moreover, concrete ensemble learning strategies for diverse individual decisions are currently lacking. Therefore, we propose an ensemble graph neural network with individual decision feedback that efficiently combines multiple GNNs for graph classification. In particular, we conduct a detailed analysis of the potential information fed back from individual decisions and propose an individual decision feedback weighting method to dynamically adjust model training, encouraging GNNs to focus more on the crucial graphs for training. Besides, we design a credibility-evaluation-based decision fusion method to fuse the most reliable individual decisions as the collective decision. Finally, we evaluate the performance of our method by conducting extensive experiments and demonstrating its effectiveness.

Abstract:
Temporal knowledge graph (TKG) extrapolation aims to predict future, previously unseen events based on historical facts. However, most existing temporal knowledge graph extrapolation methods either focus on global cyclic regularities or on local adjacent transitions. These methods overlook the multi-granularity nature of temporal signals and often rely on heuristic fusion schemes that are sensitive to noise. To address these limitations, we propose MTRM, a Multi-granularity Trend Retrieval and Modeling framework for TKG extrapolation. Specifically, we first apply semantic clustering to retrieve a compact set of long-term trend clusters from sequences of historical subgraphs, capturing enduring interaction patterns. Then, we introduce a trend-aware attention-enhancing evolution module with an auxiliary contrastive loss to learn fine-grained short-term dynamics by aligning each hidden state with its subsequent subgraph. To integrate information at different granularities, we design a multi-granularity attention layer that adaptively fuses the long-term clusters with the short-term trend states for each query entity. Additionally, an inter-granularity contrastive objective is employed to align these representations and enhance robustness to noisy snapshots. Experiments on four benchmark datasets demonstrate that MTRM outperforms state-of-the-art baselines by up to 5.89% in mean reciprocal rank (MRR), indicating improved robustness on large-scale noisy event streams. Moreover, MTRM provides interpretable insights into how long- and short-term temporal granularities jointly drive future-event prediction.

Abstract:
Annotating data is a time-consuming and costly task, but it is inherently required for supervised machine learning. Active Learning (AL) is an established method that minimizes human labeling effort by iteratively selecting the most informative unlabeled samples for expert annotation, thereby improving the overall classification performance. Even though AL has been known for decades (Settles, 2009), it is still rarely used in real-world applications. As indicated in two web surveys among the NLP community about AL (Tomanek et al., 2009; Romberg et al., 2025), two main reasons continue to hold practitioners back from using AL: first, the complexity of setting AL up, and second, a lack of trust in its effectiveness. We hypothesize that both reasons share the same culprit: the large hyperparameter space of AL. This mostly unexplored hyperparameter space often leads to misleading and irreproducible AL experiment results. In this study, we first compiled a large hyperparameter grid of over 4.6 million hyperparameter combinations, second, recorded the performance of all combinations in the largest AL study conducted so far, and third, analyzed the impact of each hyperparameter on the experiment results. Rather than merely reporting correlations, we explicitly focus on distilling these results into practitioner-oriented rules of thumb for designing AL experiments under realistic resource constraints. In the end, we give recommendations about the influence of each hyperparameter, demonstrate the surprising influence of the concrete AL strategy implementation, and outline an experimental study design for reproducible AL experiments with minimal computational effort, thus contributing to more reproducible and trustworthy AL research in the future.

Abstract:
Data streams with varying feature spaces have received extensive attention recently, while the common concept drift in them remains underexplored. Unsupervised concept drift detectors can report potential drifts without class labels, making them suitable for practical scenarios where labeling is usually costly and difficult. However, existing unsupervised detectors usually operate under fixed feature spaces. To address this limitation, a Matching Degree Histogram-based unsupervised detector for data streams with Varying Feature Spaces (MDH-VFS) is proposed. Changes in input features are refined into four scenarios, specifying the sources of concept drifts in such data streams. Based on this, MDH-VFS monitors the distribution of each feature independently using the fix-slide windows model. A matching degree-based histogram (MD-Histogram) supporting online updating is proposed to model data distribution. MD-Histogram requires no prior distributions and captures data change more sensitively than traditional histograms. The dissimilarity between two MD-Histograms is measured by the Hellinger distance, and drift is detected using an adaptive thresholding strategy. Both the drift positions and drift features can be reported. Experimental results show that MDH-VFS can not only effectively detect drifts in data streams with varying feature spaces (achieving average F1-score/MCC above 77% and outperforming nine existing detectors with improvements of at least 43%), but also improve the classification performance of downstream learning algorithms (reaching a maximum average accuracy of 88% and yielding up to 7.23% improvement).
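
The comparison step described above can be sketched with ordinary count histograms. The example below computes the Hellinger distance between a reference window and a current window of one feature and flags drift against a fixed threshold. Plain histograms stand in for the MD-Histogram, and the fixed threshold stands in for the adaptive thresholding strategy, since neither is specified in the abstract; all names and values are hypothetical.

```python
import math


def normalize(counts):
    """Turn bin counts into a probability vector."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def hellinger(p_counts, q_counts):
    """Hellinger distance between two histograms over the same bins.

    Ranges from 0 (identical distributions) to 1 (disjoint support).
    """
    p = normalize(p_counts)
    q = normalize(q_counts)
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def drift_detected(reference, current, threshold):
    """Flag drift when the distance exceeds a threshold (fixed in this sketch)."""
    return hellinger(reference, current) > threshold


# Hypothetical per-feature bin counts from two windows.
reference_window = [40, 30, 20, 10]
stable_window = [38, 32, 19, 11]
shifted_window = [5, 10, 30, 55]
```

Running the check once per feature, as the abstract describes, would report both where the drift occurred and which feature caused it.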

Abstract:
Complex event recognition (CER) refers to identifying specific patterns composed of several primitive events in event stores. Since full-scanning event stores to identify primitive events that hold query constraint conditions incurs costly I/O overhead, a mainstream and practical approach is to use index techniques to obtain these events. However, prior index-based approaches suffer from significant I/O and sorting overhead when processing the query with high predicate selectivity or long query window, which leads to high query latency. To address this issue, we propose ACER, a Range Bitmap-based index, to accelerate CER. First, ACER achieves a low index space overhead by grouping the events with the same type into a cluster and compressing the cluster data, reducing I/O overhead when reading indexes. Second, ACER builds Range Bitmaps for queried attributes and ensures that the events of each cluster in the index block are chronologically ordered. Then, ACER can always obtain ordered query results for a specific event type through merge operations, avoiding sorting overhead. Most importantly, ACER avoids unnecessary disk accesses in indexes and events via window-wise filtering, thus reducing the I/O overhead further. Lastly, we propose an enhanced version of ACER (ACER-E) by optimizing the read/write operation of index blocks and variable query order. Our extensive experiments demonstrate that ACER and ACER-E reduce the query latency by up to one order of magnitude compared with SOTA techniques.

Affiliations: School of Computer Science and Engineering, Central South University, Changsha, China; Department of Computing, Hong Kong Polytechnic University, Hong Kong, China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China; School of Science & Engineering, East China Normal University, Shanghai, China; School of Software Technology, Zhejiang University, Zhejiang, China; Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA; Department of Computer Science, Aalborg University, Aalborg, Denmark

Abstract:
Time series anomaly detection aims to identify samples that deviate from a normal distribution in a time series, which is practically important to a variety of real-world applications. Existing approaches are mostly centralized and domain-specific, and thus they are hard to generalize to time series of different domains that are decentralized due to the privacy concerns and the resulting data silos across institutions. To bridge this gap, we propose FAST-MAD, the first resource-aware framework for efficient federated time series anomaly detection. Operating under a client-server architecture, different clients in FAST-MAD can handle time series from distinct domains. In particular, FAST-MAD first employs a multi-resolution transformation module to capture hierarchical local semantics, frequency-oriented patching as well as inter-time-series interaction. An LLM serves as the main body of the local model for each client, owing to its strong knowledge transfer capabilities. Further, an adaptive modularized separation mechanism is integrated with sharded federated training to reduce computational costs, which innovatively splits the LLM into a U-shaped architecture. To address data heterogeneity across different clients, we propose a decomposed client-server alignment mechanism, featuring a tailored low-rank parameter decomposition that extracts domain-common knowledge. Extensive experiments on multiple cross-domain time series datasets offer insight into the effectiveness and efficiency of FAST-MAD, which outperforms SOTA baselines by up to 10.25% in terms of F1-score and reduces the training time by 40.93%.

Abstract:
Real-world dynamic systems often exhibit time-varying behavior. While state space models (SSMs) have shown great potential for sequence modeling, especially in capturing long-range dependencies, most previous studies have been limited to time-invariant dynamics. To overcome this limitation, we propose a neural network architecture based on time-varying SSMs with dynamics that evolve over time, called dynamic SSMs. To enhance scalability and efficiency, several techniques are introduced, including a sparsification strategy via diagonalization and fast tensor convolution with quasi-linear complexity in sequence length. Extensive experiments on both synthetic and real-world datasets show that the proposed model consistently outperforms existing state-of-the-art methods. Moreover, the model achieves significantly lower time and space complexity compared to architectures such as Transformer and LSTM. This work advances the theoretical foundation of SSM-based neural networks in deep learning and promotes their further development.

Abstract:
Graph Neural Networks (GNNs) have achieved remarkable success in processing graph data. However, when a graph contains node label noise (label errors) and structural noise (edge errors), the performance of GNNs degrades severely. In learning with noisy labels, methods based on sample selection and label generation show promise. Unfortunately, most of these methods address only node label noise or only edge noise and cannot handle both types at the same time. Furthermore, these methods lack measures against error accumulation during training. In this work, we propose a simple yet efficient method, named GNN Defender (GDF), to address both types of noise through cross-correction and two regularization techniques. Specifically, GDF consists of two networks, Nnet and Enet: Nnet predicts node categories, and Enet predicts the probability that two nodes are connected. For sample selection, the mean distribution distance between the node category predictions and the original node labels is regarded as the selection indicator, and node labels whose distance is less than the mean are considered clean. Similarly, for edges, the indicator is replaced with the mean cosine similarity of connected nodes to adaptively select clean edges. For cross-correction, we use the clean probabilities of the edges between connected nodes as weights to generate pseudo labels for target nodes on the subgraph. The pseudo labels generated from edge information carry more credible supervision, thereby correcting noisy node labels. Next, we compare the label consistency between connected nodes to correct noisy edges. It is worth noting that the clean probabilities of node labels are used to smooth the graph structure, improving the quality of edge correction. Finally, we propose category consistency regularization and subgraph clustering regularization to weaken the impact of error accumulation. Extensive experiments on single-noise and mixed-noise datasets show that our proposed framework far outperforms current baseline methods, with improvements ranging from 1.43% to 14.72%.
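
The mean-threshold selection rule for node labels can be sketched as follows. The abstract does not say which distribution distance is used, so this example assumes the L1 distance between the predicted class probabilities and the one-hot encoding of the given label. The function names are hypothetical and the sketch covers only the node side of the selection step, not Nnet, Enet, or the cross-correction.

```python
def label_distance(probs, label):
    """L1 distance between predicted probabilities and a one-hot label.

    The choice of L1 is an assumption made for this sketch.
    """
    return sum(abs(p - (1.0 if c == label else 0.0)) for c, p in enumerate(probs))


def select_clean_nodes(pred_probs, labels):
    """Return indices of nodes whose distance is below the mean distance."""
    distances = [label_distance(p, y) for p, y in zip(pred_probs, labels)]
    mean_distance = sum(distances) / len(distances)
    return [i for i, d in enumerate(distances) if d < mean_distance]


# Hypothetical predictions for three nodes over two classes.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
given_labels = [0, 1, 1]   # the third label disagrees with the prediction
clean = select_clean_nodes(pred_probs, given_labels)
```

Because the threshold is the mean of the current distances rather than a fixed constant, it adapts as the predictions change during training. The edge-side rule in the abstract follows the same pattern with cosine similarity of connected nodes as the indicator.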

Abstract:
Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data streams. The UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than that of prior works, without sacrificing the speed of processing. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times.
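
For context on what a traditional sketch looks like, here is a minimal Count-Min sketch, a widely used baseline for per-key frequency estimation in fixed memory. It is offered only as background and is not UCL-sketch; the abstract does not name the sketches it compares against, and the hashing scheme below (BLAKE2b salted with the row index) is an assumption chosen for determinism.

```python
import hashlib


class CountMinSketch:
    """Fixed-size counter table; estimates never fall below the true count."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by salting with the row number.
        data = f"{row}:{key}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions only add to counters, so the minimum is the best bound.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))


# Hypothetical stream of keys.
stream = ["a"] * 5 + ["b"] * 3 + ["c"]
cms = CountMinSketch(width=64, depth=4)
for key in stream:
    cms.update(key)
```

Hash collisions inflate the estimates as the table shrinks, which is the coarse-accuracy behavior under tight memory that the abstract attributes to traditional sketches.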

Abstract:
Predicting future connections among Uncrewed Aerial Vehicles (UAVs) is one of the fundamental tasks in UAV Ad Hoc Networks (UANETs). In adversarial environments where UAV operational information is unavailable, future link prediction must rely solely on the observed historical topological data. However, the highly dynamic and sparse nature of UANET topologies poses substantial challenges in capturing structural and temporal link formation patterns. Most existing link prediction methods focus only on single-scale structural features while neglecting the effects of network sparsity, thus limiting their performance when applied to UANETs. In this paper, we propose MUST, a Multi-scale Structural-Temporal link prediction model for UANETs. In our model, multi-scale structural representations are learned using a weighted graph attention network combined with multi-scale pooling, capturing features at the levels of individual UAVs, UAV communities, and the entire network, which are then fused via concatenation. Then, a stacked long short-term memory network is employed to learn the temporal dynamics of these multi-scale structural features. To address the impact of network sparsity, we develop a tailored loss function that emphasizes the contribution of existing links during training. We validate the performance of MUST using several UANET datasets generated through simulations. Extensive experimental results demonstrate that MUST achieves state-of-the-art link prediction performance in highly dynamic and sparse UANETs.

Abstract:
Deep cross-modal hashing is widely studied for its low storage cost and high retrieval efficiency. Despite recent progress, existing deep cross-modal hashing methods still face critical challenges. Most existing methods use Euclidean space for embedding to measure semantic similarity, but its volume grows polynomially with dimension, worsening the curse of dimensionality. In contrast, methods based on spherical space usually use cosine similarity as the metric, effectively mitigating the aforementioned problem by normalizing the embedding vectors. Nevertheless, such methods consider only the direction to determine the category, ignoring the uncertainty measure in the embedding space, and thus have a limited ability to preserve inherent multimodal semantics. In this paper, with a novel extension of the maximum entropy distribution on the surface of a hypersphere, the von Mises-Fisher (vMF) distribution, a novel deep cross-modal hashing method, named Deep Stochastic Spherical Hashing (DSSH), is designed to utilize uncertain information to guide the hashing process and produce discriminative modality-invariant hash codes. Specifically, to learn explicit uncertainty in the learned embedding space, the spherical von Mises-Fisher distribution is applied for the first time in deep cross-modal hashing, where the direction of the sample embedding controls its position on the hypersphere, thereby preventing its semantic content, and its norm parameterizes the determinism of the distribution. In addition, a stochastic spherical von Mises-Fisher loss is proposed to preserve the mode-specific semantic information of the sample, achieving the alignment of different modalities and semantic embeddings. Extensive experiments on four benchmark datasets show that our DSSH framework outperforms existing state-of-the-art methods.

Abstract:
The challenges associated with large-scale user-item interaction graphs have attracted increasing attention in graph-based recommendation systems, primarily due to computational inefficiencies and inadequate information propagation. Existing methods provide partial solutions but suffer from notable limitations: model-centric approaches, such as sampling and aggregation, often struggle with generalization, while data-centric techniques, including graph sparsification and coarsening, lead to information loss and ineffective handling of bipartite graph structures. Recent advances in graph condensation offer a promising direction by reducing graph size while preserving essential information, presenting a novel approach to mitigating these challenges. Inspired by the principles of democracy, we propose DemoRec, a framework that leverages graph condensation to generate user and item representatives for recommendation tasks. By constructing a compact interaction graph and clustering nodes with shared characteristics from the original graph, DemoRec significantly reduces graph size and computational complexity. Furthermore, it mitigates the over-reliance on high-order information, a critical challenge in large-scale bipartite graphs. Extensive experiments conducted on four public datasets demonstrate the effectiveness of DemoRec, showcasing substantial improvements in recommendation performance, computational efficiency, and robustness compared to SOTA methods.

Abstract:
The automated grading of subjective answers is crucial for reducing manual workload and enhancing feedback efficiency in online education, particularly for short answer scoring (SAS). However, in scenarios with limited labeled data, existing methods face challenges in sample diversity and scoring consistency due to limited training data and misaligned label distributions. While data augmentation and transfer learning have been employed to address these issues, rule-based approaches often lack textual variability, and general-domain embeddings struggle to align with domain-specific scoring criteria. To overcome these limitations, we propose the Scoring with Contextual Alignment and Language Enhancement (SCALE) framework, a novel LLM-driven training paradigm that synthesizes diverse responses while preserving scoring consistency. SCALE leverages a knowledge graph-based generation strategy to enhance sample diversity by substituting key phrases with contextually aligned alternatives and employs a style rewrite prompt to introduce linguistic variations. To mitigate label inconsistency, we introduce a polish align prompt that refines synthetic and real samples into a shared semantic subspace, training an annotator model for aligned scoring. Additionally, an entity-aware enhancement mechanism improves comprehension of formulas and quantitative content. Extensive experiments on multilingual and multi-domain datasets demonstrate that SCALE achieves state-of-the-art performance, improving Pearson scores by 4.9%, 2.27%, and 1.34% over BERT, RoBERTa, and ERNIE 3.0, respectively.

Abstract:
The widespread dissemination and misleading impact of fake news on the web have become a significant concern for the public and the government. Discovering fake news is crucial for ensuring that users receive authentic information and maintaining social harmony. However, most existing entity-based fake news detection methods have two issues: i) methods for acquiring additional information through entities lack flexibility and real-time capabilities; and ii) approaches using entities to capture news semantics have not adequately revealed the interactions between words in the text. To address these issues, we propose a Multi-graph Semantic-aware Adaptive Graph Convolutional Network (MgSAN), which comprehensively captures the semantic information of news texts by constructing multiple semantic graphs and learns the features from these graph structures using an adaptive graph convolutional network (SwiGCN). Specifically, we design a global semantic interaction graph to capture the complex interactions between words, generating a comprehensive textual semantic representation. We also employ an entity-noun relationship graph to mine deep semantic associations, enhancing the model’s understanding of fine-grained textual meaning. Additionally, we develop an adaptive graph convolutional network to effectively extract and aggregate feature information from different graph structures. Finally, we introduce a fusion module to integrate both global and local fine-grained semantic information, forming a rich composite semantic representation and thereby improving the effectiveness of fake news detection. Extensive experimental results on three public benchmark datasets verify the effectiveness of MgSAN, which outperforms state-of-the-art detection models.

Abstract:
Previous research on event log data analysis has primarily focused on identifying critical and frequent events, as well as qualitatively assessing correlations between event occurrences. However, the probabilistic behavior of frequently occurring events over time remains poorly understood. Through an in-depth exploratory analysis, we reveal that the (log) inter-arrival times of events follow a bimodal mixture distribution, suggesting the presence of transitions between latent states. To better understand the data-generating mechanism underlying these frequent events, we employ Markov Modulated Renewal Processes (MMRPs), a type of hidden Markov model, to capture the patterns exhibited in the inter-arrival times between successive events. Due to limitations in record precision, some inter-arrival times are recorded as zero. To address this issue, we propose a simple data imputation algorithm to generate non-zero inter-arrival times, facilitating inference on the inter-arrival time distributions and the underlying MMRPs. The effectiveness of the algorithm is validated using synthetic data. Finally, we evaluate the proposed model on real manufacturing system data, uncovering key insights into system states.

Abstract:
Streaming graphs widely exist in various application domains due to their excellent capability to capture temporal relationships between different entities. In recent years, outsourcing streaming graphs to the cloud for storage and analytics has become increasingly popular. Among others, pattern detection on streaming graphs, which aims to continuously detect subgraphs matching a given query pattern, benefits practical applications like credit card fraud detection and cyber-attack detection. However, conducting such streaming graph analytics in the cloud also raises critical privacy concerns. This paper introduces GraphGuard, the first system aimed at privacy-preserving pattern detection on outsourced streaming graphs. GraphGuard is designed through a tailored synergy of insights from graph modeling, lightweight secret sharing, edge differential privacy, and data encoding/padding. It conceals edge and vertex labels, as well as the relationship between vertices, for both the outsourced streaming graph and query pattern. We implement GraphGuard and perform comprehensive performance evaluations. The results show that GraphGuard is able to securely perform one detection on a streaming graph’s snapshot (with a sliding time window of size 50,000) in just a few seconds. In comparison to a baseline utilizing general secure multiparty computation techniques, GraphGuard is up to 60× faster in query latency and achieves up to 98% savings in communication.
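
The abstract above names lightweight secret sharing as one ingredient of GraphGuard but does not give the protocol. As a minimal sketch of plain additive secret sharing only, the Python below splits an integer-encoded value (for example, a label) into shares and shows that parties can add shared values locally. The ring size `RING`, the function names, and the two-party default are illustrative assumptions, not details taken from the paper.

```python
import secrets

RING = 2 ** 32  # assumed ring size, chosen only for illustration


def share(value, n_parties=2):
    """Split `value` into n additive shares that sum to `value` mod RING."""
    shares = [secrets.randbelow(RING) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % RING
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % RING


def add_shared(a_shares, b_shares):
    """Each party adds the two shares it holds; no communication is needed."""
    return [(a + b) % RING for a, b in zip(a_shares, b_shares)]


if __name__ == "__main__":
    label_shares = share(12345)
    print(reconstruct(label_shares))
    print(reconstruct(add_shared(share(7), share(35))))
```

Addition is free in this scheme, which is why it is usually described as lightweight; comparisons and multiplications need extra machinery that this sketch does not cover.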

Abstract:
Many real-world datasets, including stock prices and disease records, are represented as regular or irregular tensors across multiple domains. How can we accurately capture patterns from both irregular and regular tensors in a newly emerging domain by leveraging existing ones from multiple domains? This problem is crucial for applications such as identifying patterns of new diseases using data from existing ones. A main challenge is that the new target tensors contain limited information due to their recent emergence. Previously, PARAFAC2- and PARAFAC-based methods have been widely used to find patterns in irregular and regular tensors, respectively, by decomposing them into latent factors. However, they cannot effectively transfer knowledge from previously known tensors to the new one. In this work, we propose a fast and accurate domain adaptation method for tensor decomposition. We propose Meta-P2 for irregular tensors and Meta-P for regular tensors. Both Meta-P2 and Meta-P learn general and easily adaptable information, referred to as the meta factor, from multiple source domains. Using this meta factor, they efficiently identify patterns in a new target tensor. Extensive experiments on real-world datasets show that Meta-P2 and Meta-P achieve state-of-the-art performance across various downstream tasks, including missing value prediction and anomaly detection.

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China; School of Mathematical Sciences, Peking University, Beijing, China; Department of Statistics, University of Wisconsin–Madison, Madison, WI, USA; Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA; Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong; Terminus Group, Beijing, China

Abstract:
Graph domain adaptation has emerged as a critical challenge in real-world applications, where labeled graph data is often scarce and expensive to obtain. While existing methods have shown promise, they typically require access to source domain data, which may be restricted due to privacy concerns or data regulations. To address these limitations, we investigate the challenging yet practical problem of source-free graph domain adaptation. We propose a new approach named Robust Cross Supervision with Target Mining (ROSE) for this problem. ROSE achieves robustness by considering the complementary topology of graphs. The model consists of a message-passing branch for local semantic learning and a graph-kernel branch for global structural capture. Both branches are incorporated into a unified cross-supervision framework. To improve the robustness of the optimization process, we explore the context of the target domain and divide the target data into a discriminant set and an anchor set. We then incorporate the two tasks into a meta-learning optimization framework. Extensive experiments on benchmark datasets demonstrate that ROSE consistently yields superior performance compared with a wide range of baselines.

Abstract:
Account risk detection, which aims to identify accounts at forced liquidation risk within financial account-asset bipartite graphs, is crucial for ensuring financial market stability and economic resilience. Although traditional node-classification-based anomaly detection techniques can be applied to this task, these approaches often exhibit two key limitations: 1) insufficient consideration of asset fluctuations, resulting in unsatisfactory accuracy; and 2) scalability challenges, making them unsuitable for large-scale financial graphs. To address these issues, we propose RiskGuard, a novel framework for account risk detection that integrates auxiliary asset prediction and gradient-based sampling. First, we introduce an auxiliary asset prediction paradigm to capture the critical influence of asset fluctuations on account risk. Rather than solely predicting account risk, our unified model employs a Temporal-Attention Net (TANet) to jointly predict asset fluctuations and account risk. This auxiliary task enables the model to learn fluctuation-aware asset representations, significantly enhancing prediction accuracy. To overcome scalability challenges, we design GLUE, an online graph sampler leveraging gradient entropy. GLUE dynamically adjusts sampling weights based on model gradients and graph structure, prioritizing high-entropy nodes in the neighborhood for improved efficiency. Extensive experiments on five financial datasets demonstrate that RiskGuard outperforms existing techniques in accuracy while achieving high efficiency in processing large financial graphs.

Abstract:
Industrial time series analysis is the core foundation of equipment status monitoring and industrial intelligence. However, the long-tailed distribution of time series caused by low-frequency, low-probability events seriously restricts the performance of analysis models. In key scenarios such as industrial process prediction and anomaly detection, data analysis models face obvious performance bottlenecks due to insufficient representation of tail events. Existing data augmentation methods are limited both in capturing tail patterns and in modeling long-distance time series dependencies. To address this challenge, this paper proposes an industrial long-tailed time series generator (DLTTS) based on a diffusion model. First, a hybrid generation architecture is constructed that deeply integrates the encoder-decoder Informer structure with the traditional diffusion process to maintain long-distance temporal patterns and produce long-distance time series output. Second, we propose a Fourier-based batch-Monte Carlo (FBMC) loss to enhance the model’s ability to capture low-frequency events, thereby improving the quality of tail time series generation. Experiments show that DLTTS maintains the authenticity and diversity of generated tail time series in industrial-grade long-tailed time series generation tasks. It also exhibits robust performance in imputation and forecasting tasks, verifying the advantages of this method for cross-task application.

Abstract:
Local Differential Privacy (LDP) enables massive data collection and analysis while protecting end users’ privacy against untrusted aggregators. It has been applied to various data types (e.g., categorical, numerical, and graph data) and application settings (e.g., static and streaming). Recent findings indicate that LDP protocols can be easily disrupted by poisoning or manipulation attacks, where an attacker can leverage injected/corrupted fake users to send crafted data to the aggregator in order to manipulate the final estimate of the aggregator. However, current attacks primarily target static protocols, neglecting the security of LDP protocols in streaming settings. Our research fills the gap by developing novel fine-grained manipulation attacks against LDP protocols for data streams. By reviewing the attack surfaces in existing algorithms, we introduce a unified attack framework with composable modules, which can manipulate the LDP estimated stream toward a target stream. Our attack framework can adapt to state-of-the-art streaming LDP algorithms with different analytic tasks (e.g., frequency and mean) and LDP models (event-level, user-level, and w-event level). We verify our attacks theoretically and validate them through extensive experiments on real-world datasets. Finally, we explore a possible defense mechanism for mitigating our attacks.
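
To make the manipulation idea above concrete, the sketch below uses generalized randomized response (GRR), a standard static LDP frequency protocol, at a single timestamp. It is not the streaming attack framework of the paper. The helper names, the toy frequencies, and the choice of 100 fake users are assumptions made for illustration. Fake users who skip perturbation and report a target value directly inflate the aggregator's debiased estimate for that value.

```python
import math


def grr_probs(eps, d):
    """GRR keeps the true value with prob. p, reports each other value with prob. q."""
    e = math.exp(eps)
    return e / (e + d - 1), 1.0 / (e + d - 1)


def expected_counts(true_freqs, n, eps):
    """Expected report counts from n honest users (noise-free, for illustration)."""
    p, q = grr_probs(eps, len(true_freqs))
    return [n * (f * p + (1 - f) * q) for f in true_freqs]


def grr_estimate(counts, eps):
    """Aggregator's unbiased frequency estimate from the observed report counts."""
    n = sum(counts)
    p, q = grr_probs(eps, len(counts))
    return [(c / n - q) / (p - q) for c in counts]


if __name__ == "__main__":
    eps, target = 1.0, 2
    honest = expected_counts([0.5, 0.3, 0.2], n=1000, eps=eps)
    poisoned = list(honest)
    poisoned[target] += 100  # 100 fake users all report the target value
    print(grr_estimate(honest, eps))
    print(grr_estimate(poisoned, eps))
```

Because the aggregator divides by `p - q`, which shrinks as `eps` decreases, each crafted report is amplified more strongly under tighter privacy budgets.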

Abstract:
Individual fairness (IF) in graph neural networks (GNNs), which requires that similar individuals receive similar outcomes from GNNs, is a critical issue. Despite its importance, this area remains largely unexplored in terms of (1) a clear understanding of what induces individual unfairness in GNNs and (2) a comprehensive consideration of how to identify similar individuals. To bridge these gaps, we conduct a preliminary analysis to explore the underlying reason for individual unfairness and observe correlations between IF and similarity consistency, a concept introduced to evaluate the discrepancy in identifying similar individuals based on graph structure versus node features. Inspired by our observations, we introduce two metrics to assess individual similarity from two distinct perspectives: topology fusion and feature fusion. Building upon these metrics, we propose Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight behind SaGIF is the integration of individual similarities by independently learning similarity representations, leading to an improvement of IF in GNNs. Our experiments on several real-world datasets validate the effectiveness of our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms state-of-the-art IF methods while maintaining utility performance.

Abstract:
Trajectory similarity in road networks is pivotal for numerous applications in transportation, urban planning, and ridesharing. However, due to the varying lengths of trajectories, employing similarity metrics directly on raw trajectory data (e.g., DTW (Yi et al., 1998)) becomes impractical at scale. Therefore, current research primarily revolves around applying deep learning to embed trajectories into vector representations, i.e., embeddings, enabling the application of simpler (and indexable) similarity metrics such as Euclidean distance. Existing research either embeds trajectories independently of the downstream tasks or tailors the embedding specifically for a designated similarity metric. While the former offers versatility and allows for easy fine-tuning to accommodate various metrics, the latter typically yields more effective results but necessitates reconfiguration for different, yet similar metrics. Moreover, both approaches neglect the intrinsic spatiotemporal continuity in trajectory data, resulting in suboptimal trajectory modeling. Our objective is to address these modeling limitations and get the best of both worlds. Initially, we generate an embedding through pre-training, decoupled from any particular similarity metric. Subsequently, through a meticulous yet less complex fine-tuning process, we enhance the embedding to encapsulate the nuances of a designated similarity metric. Moreover, a significant aspect of our approach lies in our trajectory modeling that captures spatiotemporal continuity, which mainly consists of a trajectory-oriented road segment embedding and a Transformer encoder enhanced by spatiotemporal semantics inherent in road network-constrained trajectories. Our experimental results demonstrate the superiority of our approach in approximating multiple trajectory similarity metrics over existing state-of-the-art models from both categories of approaches.
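
For context on why raw-trajectory metrics scale poorly, the following is a minimal sketch of the textbook dynamic time warping (DTW) recurrence on 2-D point sequences. It illustrates the classic metric the abstract cites, not the proposed embedding model. The function name and the Euclidean point cost are assumptions.

```python
import math


def dtw(traj_a, traj_b):
    """Classic DTW distance between two point sequences, O(len_a * len_b) time."""
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    # table[i][j] = best cost of aligning the first i points of a with the first j of b
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(traj_a[i - 1], traj_b[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j],      # repeat a point of traj_b
                table[i][j - 1],      # repeat a point of traj_a
                table[i - 1][j - 1],  # advance both
            )
    return table[n][m]


if __name__ == "__main__":
    a = [(0, 0), (1, 0), (2, 0)]
    b = [(0, 0), (1, 0), (1, 0), (2, 0)]
    print(dtw(a, b))
```

Every pairwise comparison fills a table whose size is the product of the two trajectory lengths, whereas comparing two fixed-size embeddings with Euclidean distance costs time linear in the embedding dimension and can be served from a vector index.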

Abstract:
Spatio-temporal data proliferates in numerous real-world domains, such as transportation, weather, and energy. Spatio-temporal deep learning models aim to utilize useful patterns in such data to support tasks like prediction, imputation, and anomaly detection. However, previous one-to-one deep learning models designed for specific tasks typically require separate training for each use case, leading to increased computational and storage costs. To address this issue, one-to-many spatio-temporal foundation models have emerged, offering a unified framework capable of solving multiple spatio-temporal tasks. These foundation models achieve remarkable success by learning general knowledge from spatio-temporal data or by transferring the general capabilities of pre-trained language models. While previous surveys have explored spatio-temporal data and methodologies separately, they have not comprehensively examined how foundation models are designed, selected, pre-trained, and adapted. As a result, the overall pipeline for spatio-temporal foundation models remains unclear. To bridge this gap, we provide an up-to-date review of previous spatio-temporal foundation models from the pipeline perspective. The pipeline begins with an introduction to different types of spatio-temporal data, followed by details of data preprocessing and embedding techniques. The pipeline then presents a novel data property taxonomy to divide existing methods according to data sources and dependencies, supporting efficient and effective model design and selection for researchers. On this basis, we further illustrate the training objectives of primitive models, as well as the adaptation techniques of transferred models. Overall, our survey provides a clear and structured pipeline to understand the connection between core elements of spatio-temporal foundation models while guiding researchers to get started quickly. Additionally, we introduce emerging opportunities such as multi-objective training in the field of spatio-temporal foundation models, providing valuable insights for researchers and practitioners.

Abstract:
As geospatial data from web platforms becomes increasingly accessible and regularly updated, urban representation learning has emerged as a critical research area for advancing urban planning. Recent studies have developed foundation model-based algorithms to leverage this data for various urban-related downstream tasks. However, current research has inadequately explored deep integration strategies for multiscale, multimodal urban data in the context of urban foundation models. This gap arises primarily because the relationships between micro-scale (e.g., individual points of interest and street view imagery) and macro-scale (e.g., region-wide satellite imagery) urban features are inherently implicit and highly complex, making traditional interaction modeling insufficient. This paper introduces a novel research problem – how to learn multiscale urban representations by integrating diverse geographic data modalities and modeling complex multimodal relationships across different spatial scales. To address this significant challenge, we propose UrbanMFM, a spatial graph-based multiscale foundation model framework explicitly designed to capture and leverage these intricate relationships. UrbanMFM utilizes a self-supervised learning paradigm that integrates diverse geographic data modalities, including POI data and urban imagery, through novel contrastive learning objectives and advanced sampling techniques. By explicitly modeling spatial graphs to represent complex multiscale urban relationships, UrbanMFM effectively facilitates deep interactions between multimodal data sources. Extensive experiments on datasets from Singapore, New York, and Beijing demonstrate that UrbanMFM outperforms the strongest baselines significantly in four representative downstream tasks. By effectively modeling spatial hierarchies with diverse data, UrbanMFM provides a more comprehensive and adaptable representation of urban environments.

Abstract:
Local differential privacy (LDP) provides a strict privacy guarantee in a distributed environment. Recent studies demonstrated that LDP protocols are vulnerable to data poisoning attacks where an attacker can manipulate the perturbed result on the local side and send bogus data to skew the final estimate on the server. Unfortunately, existing attack detection methods do not create an effective attack indicator and rely on particular characteristics of LDP protocols. As a result, they typically exhibit limited detection performance. In this paper, we use log-likelihood as the attack indicator and propose a chain-style detection to enhance the detection effectiveness, in which the attack impact can propagate along the chain and exhibit a clear anomaly signal even under stealthy attack scenarios. The experimental results show that our detection consistently outperforms the existing methods. On four datasets covering categorical and numerical data, our detection achieves an F1 score exceeding 96% in most cases. It even remains above 0.9 under stealthy attack settings, outperforming the state-of-the-art detection by up to 0.25.

Abstract:
Recommender systems have become increasingly influential in shaping user behavior and decision-making, highlighting their growing impact in various domains. Meanwhile, the widespread adoption of machine learning models in recommender systems has raised significant concerns regarding user privacy and security. As compliance with privacy regulations becomes more critical, there is a pressing need to address the issue of recommendation unlearning, i.e., eliminating the memory of specific training data from the learned recommendation models. Despite its importance, traditional machine unlearning methods are ill-suited for recommendation unlearning due to the unique challenges posed by collaborative interactions and model parameters. This survey offers a comprehensive review of the latest advancements in recommendation unlearning, exploring the design principles, challenges, and methodologies associated with this emerging field. We provide a unified taxonomy that categorizes different recommendation unlearning approaches, followed by a summary of widely used benchmarks and metrics for evaluation. By reviewing the current state of research, this survey aims to guide the development of more efficient, scalable, and robust recommendation unlearning techniques. Furthermore, we identify open research questions in this field, which could pave the way for future innovations not only in recommendation unlearning but also in a broader range of unlearning tasks across different machine learning applications.

Abstract:
Active learning can effectively reduce the cost of labeling while enhancing model classification performance. However, prior studies have indicated that imbalanced class distributions adversely impact active learning, leading to diminished model effectiveness. Existing approaches to imbalanced active learning often neglect the multi-class imbalance problem and suffer from low performance and high time consumption. To address these issues, this paper introduces a hybrid active learning method with an online weighted broad learning system (HAL-OWBLS). Its main advantages include: (1) We optimize the initial labeled instance selection through an approximate query strategy to avoid the cold-start problem and introduce a sample selection strategy based on double uncertainty to enhance the rationality of active learning iterations. (2) A weighted broad learning system (WBLS) is chosen as the classifier, and an improved weighting strategy is adopted for multi-class imbalanced data. (3) We theoretically derive an efficient online updating model for WBLS, which reduces the time cost of active learning iterations by using only newly labeled samples for fast updating. The proposed HAL-OWBLS algorithm achieves better performance and robustness than existing related algorithms on various multi-class imbalanced data sets.

Affiliations: College of Computer Science and Electronics Engineering, Hunan University, Changsha, China; School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Network Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing, China

Abstract:
As deep learning (DL) continues to advance, effective feature extraction from large-scale data remains crucial for enhancing model performance. To leverage the advantages of the frequency domain, such as concentrated signal energy, prominent data features, and rich detailed characteristics, this paper proposes a novel frequency-domain feature extraction method. However, existing frequency component selection algorithms often struggle to adapt to diverse tasks, tend to yield only locally optimal solutions, and require prolonged processing times. To overcome these limitations, we introduce the Adaptive Fast Frequency Selection (AFFS) algorithm, which seamlessly integrates a frequency component selection factor layer into DL models to identify globally optimal frequency combinations suited to various downstream tasks. We further analyze the relationship between selected frequency components and model performance, providing theoretical guarantees regarding optimality, robustness, and generalization error bounds. Moreover, a fast selection procedure is developed to exploit the empirically observed rapid convergence of the selection-factor ranking, significantly accelerating the selection process. Extensive experiments on five datasets, ten DL models, and two subsequent tasks demonstrate that AFFS achieves superior performance: even when the input data size is reduced to only 10% of the original frequency features, model classification accuracy improves by approximately 1%, while the early stopping mechanism shortens the selection process by about 80%.

Abstract:
Time series forecasting faces significant challenges due to non-stationary components that obscure underlying patterns. While Transformer-based models are effective at capturing stationary components, they struggle with non-stationary dynamics and multivariate dependencies. In this paper, we propose FreqEvo, a lightweight Frequency Domain Feature Enhancement module for time series forecasting. FreqEvo progressively filters frequency components from high to low amplitude, ensuring the preservation of informative features while reducing noise. By integrating recursive Fourier-based residual modeling and cross-domain attention, FreqEvo effectively refines low-amplitude frequency features and stabilizes the embeddings, outperforming traditional low-pass filtering and random frequency selection methods in capturing both short-term and long-term dependencies. Experimental results on benchmark datasets demonstrate that FreqEvo outperforms state-of-the-art (SOTA) models and serves as a plug-and-play module to enhance existing Long-Term Sequence Forecasting (LSTF) models.

Abstract:
Large-scale social networks can be modeled as decentralized graphs, where each node holds a part of the overall network. Local differential privacy (LDP) has been widely adopted in decentralized graph analysis to ensure privacy for individual nodes. However, existing LDP-based methods often fail to accommodate personalized privacy requirements due to their uniform encoding and equal perturbation mechanisms. To address this issue, we propose PEGS, a novel privacy-preserving decentralized graph synthesis approach that significantly improves utility while respecting user-specific privacy preferences. Specifically, we introduce interactive local differential privacy (iLDP), a new edge-level definition of LDP that relaxes the constraints of node-independent perturbation, thereby enabling the fulfillment of individual privacy needs. Furthermore, we develop a decentralized graph perturbation framework offering three levels of privacy settings. To optimize the balance between information preservation and privacy, we design encoding and perturbation mechanisms leveraging information entropy tailored to different privacy levels. Extensive experimental evaluations and rigorous theoretical analysis demonstrate that our method produces high-quality synthetic graphs while adhering to iLDP guarantees.

Abstract:
A recent trend in urban computing involves utilizing multi-modal data for urban region embedding, which can then be applied to a variety of downstream urban sensing tasks. Many previous studies rely on multi-graph embedding techniques and follow a two-stage paradigm: first building a k-nearest neighbor graph based on fixed region correlations for each view, and then blending multi-view information in a posterior stage to learn region representations. However, multi-graph construction and multi-graph representation learning are not associated in most existing two-stage studies, so the relationship between them, through which each could provide complementary information to the other, is not leveraged. In this paper, we unify these two stages into one by constructing learnable weighted complete graphs of regions and propose a new one-stage Region Embedding method with Adaptive region correlation Discovery (READ). Specifically, READ comprises three modules, including a disentangled region feature learning module utilizing a city-context Transformer to encode regions’ semantic and mobility features, and an adaptive weighted multi-graph construction module that builds multiple complete graphs with learnable weights based on disentangled features of regions. In addition, we propose a multi-graph representation learning module to yield effective region representations that integrate information from multiple graphs. We conduct thorough experiments on three downstream tasks to assess READ. Experimental results demonstrate that READ considerably outperforms state-of-the-art baseline methods in urban region embedding.
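
The first stage of the two-stage paradigm described above, building a fixed k-nearest neighbor graph from one view's region features, can be sketched as follows. This illustrates the baseline construction that READ replaces with learnable weighted complete graphs, not READ itself. The function name, the Euclidean distance, and the toy features are assumptions.

```python
import math


def knn_graph(features, k):
    """Map each region index to the indices of its k nearest other regions."""
    graph = {}
    for i, feat_i in enumerate(features):
        # Sort all other regions by Euclidean distance (ties broken by index).
        ranked = sorted(
            (math.dist(feat_i, feat_j), j)
            for j, feat_j in enumerate(features)
            if j != i
        )
        graph[i] = [j for _, j in ranked[:k]]
    return graph


if __name__ == "__main__":
    region_features = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
    print(knn_graph(region_features, k=1))
```

The resulting edges are fixed once the features and `k` are chosen, so the later representation learning stage cannot revise them. This is the decoupling that the abstract identifies as a limitation.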

Abstract:
Graph retrieval (GR), a ranking procedure that aims to sort the graphs in a database by their relevance to a query graph in decreasing order, has wide applications across diverse domains, such as visual object detection and drug discovery. Existing GR approaches usually compare graph pairs at a detailed level and generate quadratic similarity scores. In realistic scenarios, conducting quadratic fine-grained comparisons is costly, yet coarse-grained comparisons would result in performance loss. Moreover, label scarcity in real-world data brings extra challenges. To tackle these issues, we investigate a more realistic GR problem, namely, efficient graph retrieval (EGR). Our key intuition is that, since there are numerous underutilized unlabeled pairs in realistic scenarios, by leveraging the additional information they provide, we can achieve speed-up while simplifying the model without sacrificing performance. Following our intuition, we propose an efficient model called Dual-Tower Model with Dividing, Contrasting and Alignment (TowerDNA). TowerDNA utilizes a GNN-based dual-tower model as a backbone to quickly compare graph pairs in a coarse-grained manner. In addition, to effectively utilize unlabeled pairs, TowerDNA first identifies confident pairs from unlabeled pairs to expand labeled datasets. It then learns from the remaining unconfident pairs via graph contrastive learning with geometric correspondence. To integrate all semantics with reduced biases, TowerDNA generates prototypes using labeled pairs, which are aligned within both confident and unconfident pairs. Extensive experiments on diverse realistic datasets demonstrate that TowerDNA achieves comparable performance to fine-grained methods while providing a 10× speed-up.

Abstract:
Recent research and practical scenarios demonstrate that integrating transaction-based graph computing into blockchain has become a critical focus in consortium networks. The on-chain Transactional Graph Processing Applications (TGPAs) have become popular in the blockchain. TGPA leverages blockchain consensus mechanisms to prevent malicious peer attacks while utilizing graph computing to enable powerful analytical capabilities. However, a fundamental challenge exists: blockchain operates on a computation-before-consensus principle, while graph computing requires intermediate result-sharing based on mutual trust. This isolation between the two mechanisms fails to ensure both computational trustworthiness and consensus efficiency. Thus, TGPA necessitates an integrated solution combining consensus and graph computing. Besides, the trusted high-communication environment of graph computing conflicts with the untrusted high-communication environment of blockchain. Communication efficiency and data trustworthiness are existing challenges of the solution. This paper presents a Graph partitioning-based Byzantine Fault Tolerance (GBFT) mechanism for the computing-consensus integration. GBFT integrates graph computing’s shuffling and merging phases with the consensus phase to achieve parallel computation and synchronized consensus. Additionally, GBFT incorporates a grouping-partitioning strategy and granular communication methods to enhance both trustworthiness and efficiency. Theoretical analysis proves that GBFT reduces communication and latency complexity to O(x), where x represents the number of peers. Experimental evaluations demonstrate that GBFT achieves superior consensus capacity, graph computing capacity, and communication scalability performance. It also supports various graph algorithms across different data scales.

Abstract:
Currently, cross-lingual named entity recognition tasks primarily rely on knowledge distillation as a key technique. The design of diverse teacher models aims to guide the training of the target student model, effectively addressing the scarcity of target language data. However, existing methods often overemphasize the design of teacher models, overlooking the crucial significance of high-quality hard-label data. To address this issue, this paper proposes a multi-channel, multi-teacher cross-lingual named entity recognition approach (TSH-MC), comprising a translation teacher, a similarity teacher, and a hard-label screening module, with the objective of enhancing the training effectiveness of the target model. To fully exploit intermediate layer information, we introduce multi-channel knowledge distillation techniques, facilitating information exchange between the intermediate layers of teacher and student models, resulting in performance improvement. The proposed TSH-MC method has been experimentally validated on six languages from the CoNLL2002, CoNLL2003, and WikiAnn datasets, successfully demonstrating its effectiveness.

Abstract:
In this work, we focus on the task of learning the promising graph for clustering and present a novel Tensorized Graph Learning (TGL) framework, which synergizes the neighbor and self-expressiveness information. The main proposition is that the graph with neighbor information and the graph with self-expressiveness information describe the underlying clustering structure from two different perspectives and can be regarded as two views of data. To this end, our TGL converts a single-view graph learning task into a multi-view graph learning task. Specifically, it jointly learns these two graphs with a low-rank tensor constraint, which pursues the consistency in the high-order tensor space. Both the neighbor information and self-expressiveness information of data can be excavated during the graph learning process. Extensive experimental results show the promising performance of the proposed TGL in comparison to several state-of-the-art clustering algorithms.

Abstract:
Graph-level anomaly detection (GLAD) aims to distinguish anomalous graphs that exhibit significant deviations from others. The graph-graph relationship, revealing the deviation and similarity between graphs, offers global insights into the entire graph level for highlighting the anomalies’ divergence from normal graph patterns. Thus, understanding graph-graph relationships is critical to boosting models on GLAD tasks. However, existing deep GLAD algorithms heavily rely on Graph Neural Networks that primarily focus on analyzing individual graphs. These methods overlook the significance of graph-graph relationships in telling anomalies from normal graphs. In this paper, we propose a novel model for Graph-level Anomaly Detection using the Transformer technique, namely GADTrans. Specifically, GADTrans builds the transformer upon crucial subgraphs mined by a parametrized extractor, for modeling precise graph-graph relationships. The learned graph-graph relationships put effort into distinguishing normal and anomalous graphs. In addition, a specific loss is introduced to guide GADTrans in highlighting the deviation between anomalous and normal graphs while underlining the similarities among normal graphs. GADTrans achieves model interpretability by delivering human-interpretable results, which are learned graph-graph relationships and crucial subgraphs. Extensive experiments on six real-world datasets verify the effectiveness and superiority of GADTrans for GLAD tasks.

Affiliations: College of Computer and Data Science, Fuzhou University, Fuzhou, China; College of Computer and Data Science, Engineering Research Center of Big Data Intelligence, Ministry of Education, Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China; School of Computer and Software Engineering, Xihua University, Chengdu, China; School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China

Abstract:
Multi-party urban flow analysis is a crucial task in smart cities. However, existing analysis methods struggle to balance data privacy with spatio-temporal feature capture. Capturing the complete spatio-temporal features of multi-party urban flow data while protecting data privacy is therefore of great importance in multi-party urban flow analysis. To address both concerns, this paper proposes a spatio-temporal federated analysis model for multi-party urban flow mining that protects data privacy while capturing spatio-temporal features completely. First, a multi-party urban flow mining framework based on federated learning is proposed to capture the complete spatio-temporal feature information of multi-party urban flow data and mine urban flow pattern knowledge while protecting data privacy. Second, to address the communication cost of multi-party urban flow analysis, we propose a lazy aggregation method based on similarity clustering, which improves the communication efficiency between clients and the server. Further, we propose a similarity evaluation criterion for urban flow data based on a step function, which can effectively calculate the similarity between urban flow data. Finally, we compare the proposed model with several benchmark methods on Chengdu Didi order data and point-of-interest data to demonstrate its effectiveness, and we visualize and analyze the spatio-temporal features.

Abstract:
Facing climate change, the transformation to renewable energy poses stability challenges for power grids due to their reduced inertia and increased decentralization. Traditional dynamic stability assessments, crucial for safe grid operation with higher renewable shares, are computationally expensive and unsuitable for large-scale grids in the real world. Although multiple proofs in the network science have shown that network measures, which quantify the structural characteristics of networked dynamical systems, have the potential to facilitate basin stability prediction, no studies to date have demonstrated their ability to efficiently generalize to real-world grids. With recent breakthroughs in Graph Neural Networks (GNNs), we are surprised to find that there is still a lack of a common foundation about: Whether network measures can enhance GNNs’ capability to predict dynamic stability and how they might help GNNs generalize to realistic grid topologies. In this paper, we conduct, for the first time, a comprehensive analysis of 48 network measures in GNN-based stability assessments, introducing two strategies for their integration into the GNN framework. We uncover that prioritizing measures with consistent distributions across different grids as the input or regarding measures as auxiliary supervised information improves the model’s generalization ability to realistic grid topologies, even when models trained on only 20-node synthetic datasets are used. Our empirical results demonstrate a significant enhancement in model generalizability, increasing the R^2 performance from 66% to 83%. When evaluating the probabilistic stability indices on the realistic Texan grid model, GNNs reduce the time needed from 28,950 hours (Monte Carlo sampling) to just 0.06 seconds. This study could provide fundamental insights into basin stability assessments using GNNs, setting a new benchmark for future research.

Abstract:
With the rapid development of digital technologies, a large range of real-world systems, spanning from cloud servers and IoT devices to industrial control systems, continuously generate vast amounts of time series data. Time series anomaly detection (AD) plays a crucial role in maintaining system stability by identifying unusual patterns that deviate from normal distributions, with the primary challenge lying in learning effective anomaly-discriminative representations. Recently, diffusion models have been applied to time series AD due to their strong representational capabilities. However, existing diffusion-based methods typically rely on reconstruction errors, which not only fail to fully exploit the representational potential of diffusion models but are also computationally intensive. To address these limitations, through experimental observation and theoretical analysis, we show that specific regions of the diffusion noises exhibit stronger representation capabilities for normal patterns, which can be leveraged to enhance AD performance and reduce computational costs. Building on these insights, we propose NoiseAD, a diffusion noise-guided anomaly detection method incorporating an optimal noise steps selection approach to identify diffusion steps with higher resolution. Extensive experiments on diverse benchmarks demonstrate the superiority of NoiseAD over state-of-the-art methods, further substantiated by insightful visualizations.

Abstract:
Event forecasting is an important task in temporal knowledge graphs (TKGs), aiming to leverage historical memories like humans to make informed decisions about future unknown events. However, according to the multiple intelligence and encoding-retrieval specificity theories, the previous approaches conflict with two fundamental human decision-making paradigms, resulting in incomplete representations of historical events over TKGs and thereby hindering later future forecasting performance. On the one hand, humans recall memories from multiple perspectives, while previous event forecasting works have unilaterally viewed historical sequences as a single-perspective paradigm, resulting in incomplete learning of structural features. On the other hand, the human memory recollection (recall) is a bidirectional process. Existing approaches merely focus on capturing unidirectional evolutionary patterns in chronological order, failing to properly model the time-varying retracing–retrieval processes of memory recollection over time. In this paper, we propose a novel event forecasting method in TKGs, namely Cog-RMH, mimicking the human Cognitive paradigm and Recalling Multiview History to support future decision making. To address the former challenge, we extract inherent structural features of historical concurrent events derived from associative thinking (amygdala), spatial context (hippocampus), and logical reasoning (prefrontal cortex) intelligences, and synthesize their effects on future events. To tackle the latter issue, we propose an encoder-decoder architecture with stacked gated recurrent units that simulate bidirectional memory retracing and retrieval containing cognitive dependencies. We further introduce a retracing–retrieval attention mechanism to model the time-varying human emphasis on different timestamp events during recall. 
Extensive experiments show that Cog-RMH achieves significantly improved event forecasting performance on four public TKG benchmarks in comparison with the existing state-of-the-art baselines.

Abstract:
Knowledge Graph Completion (KGC) is an essential task aimed at mitigating the issue of incompleteness in knowledge graphs, thereby enhancing their utility for various downstream applications. Existing KGC models predominantly fall into two categories: structure-based and semantic-based approaches. Structure-based methods often encounter challenges with long-tail entities due to the scarcity of structural information and imbalanced entity distributions. Conversely, semantic-based methods, while addressing those limitations, necessitate extensive training of language models and specific finetuning for each knowledge graph, thus constraining their practical efficiency. To alleviate those limitations in both approaches, in this paper, we propose KICGPTv2, an innovative framework that synergizes a large language model (LLM) with traditional KGC methods. This integration effectively mitigates the long-tail entity problem without incurring significant additional training overhead. Central to the KICGPTv2 model is a novel in-context learning strategy, termed Knowledge Prompt, which encodes structural knowledge into demonstrations to effectively guide the LLM. Comprehensive evaluations on various KGC tasks, including link prediction, relation prediction, and triple classification, underscore the efficacy of the KICGPTv2 model, highlighting its ability to achieve competitive performance with reduced training demands and without the need for finetuning.

Abstract:
This paper presents a systematic literature review (SLR) focused on the implementation of chatbots using Large Language Models (LLMs), aimed at providing insights into the architectures, frameworks, best practices, and evaluation metrics that are shaping the field. By analyzing 39 primary studies, the review addresses six key research questions, exploring common architectures such as client-server and Retrieval-Augmented Generation (RAG), and identifying frequently utilized models, including the GPT family, BERT, and open-source models like LLaMA. The paper evaluates the performance of these models across various domains, emphasizing the impact of fine-tuning, prompt engineering, and embedding techniques on accuracy and domain-specific relevance. Additionally, it highlights the critical evaluation metrics used in LLM-based chatbot systems, including accuracy, user satisfaction, content quality, safety, and efficiency. Ethical considerations, including data governance, bias mitigation, and fairness audits, are also discussed to ensure responsible deployment of LLM chatbots. The review concludes with an exploration of the trade-offs between performance, cost-efficiency, and scalability, providing a comprehensive framework for future research and development of LLM-based chatbot applications.

Abstract:
We introduce the triangle-densest-k-subgraph problem (TDkS) for undirected graphs: given a size parameter k, compute a subset of k vertices that maximizes the number of induced triangles. The problem corresponds to the simplest generalization of the edge-based densest-k-subgraph problem (DkS) to the case of higher-order network motifs. We prove that TDkS is NP-hard and is not amenable to efficient approximation, in the worst-case. By judiciously exploiting the structure of the problem, we propose a relaxation algorithm for the purpose of obtaining high-quality, sub-optimal solutions. Our approach utilizes the fact that the cost function of TDkS is submodular to construct a convex relaxation for the problem based on the Lovász extension for submodular functions. We demonstrate that our approaches attain state-of-the-art performance on real-world graphs and can offer substantially improved exploration of the optimal density-size curve compared to sophisticated approximation baselines for DkS. We use document summarization to showcase why TDkS is a useful generalization of DkS.
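
To make the problem statement concrete, here is a minimal brute-force sketch of the objective: count the triangles induced by a vertex subset and search all subsets of the given size. The function names are hypothetical, and exhaustive search is used only because it is easy to verify on toy graphs. It is not the relaxation algorithm the abstract proposes, and it would not scale beyond very small inputs given that the problem is NP-hard.

```python
from itertools import combinations

def induced_triangles(edges, subset):
    """Count triangles whose three vertices all lie in `subset`."""
    chosen = set(subset)
    adj = {v: set() for v in chosen}
    for u, v in edges:
        if u in chosen and v in chosen and u != v:
            adj[u].add(v)
            adj[v].add(u)
    count = 0
    for a, b, c in combinations(sorted(chosen), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            count += 1
    return count

def tdks_brute_force(nodes, edges, k):
    """Exhaustively find a size-k subset maximizing induced triangles.

    Illustrative only: enumerates all C(n, k) subsets.
    """
    best = max(combinations(sorted(nodes), k),
               key=lambda subset: induced_triangles(edges, subset))
    return set(best), induced_triangles(edges, best)

# Toy graph: a 4-clique on {0, 1, 2, 3} with a path 3-4-5 attached.
nodes = range(6)
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
best_subset, best_count = tdks_brute_force(nodes, edges, 4)
```

On this toy graph the search selects the 4-clique, which contains four triangles, while any subset that includes vertex 4 or 5 contains at most one.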

Abstract:
Multimodal sarcasm detection is receiving increasing attention due to people’s growing interest in posting multimodal information. The key factor of multimodal sarcasm detection is to leverage incongruity information across different modalities. Existing works are mainly based on the late fusion strategy, simply concatenating the intra- and inter-modal incongruity features, and are prone to learning surface patterns. In contrast, this work mainly focuses on modeling the deep fusion of intra- and inter-modal incongruity information. To this end, this work first discusses the incompatibility between the two kinds of incongruity features within existing multimodal frameworks. Under this motivation, we further propose an end-to-end cooperative framework dubbed Cooperative Multimodal Incongruity Learning (CoMIL). Specifically, our approach incorporates a primary module to model the deep fusion of intra- and inter-modal incongruity information. To prevent the integrated inter-modal visual information from disturbing the modeling of intra-text incongruity, CoMIL introduces a cooperative mechanism incorporating a reference module which focuses on token-level correlations as a structural guidance to the primary module. Based on the proposed cooperative mechanism, the intra- and inter-modal incongruity information can be compactly and compatibly integrated into deep features of neural models. Extensive experiments are conducted to validate the effectiveness of our proposed CoMIL approach.

Abstract:
Graph clustering, which aims to divide nodes in the graph into several distinct clusters, is a fundamental yet challenging task. Benefiting from the powerful representation capability of deep learning, deep graph clustering (DGC) methods have achieved great success in recent years. However, survey papers on the topic remain relatively scarce, and a summary of the field is urgently needed. With this motivation, we conduct a comprehensive survey of DGC. Firstly, we introduce the formulaic definition, evaluation, and development of this field. Secondly, the taxonomy of DGC methods is presented based on four different criteria, including graph type, network architecture, learning paradigm, and clustering method. Thirdly, we carefully analyze the existing methods via extensive experiments and summarize the challenges and opportunities from five perspectives, including graph data quality, stability, scalability, discriminative capability, and unknown cluster number. Besides, the applications of DGC methods in six domains, including computer vision, natural language processing, recommendation systems, social network analysis, bioinformatics, and medical science, are presented. Last but not least, this paper provides open resource support, including 1) a collection of state-of-the-art DGC methods (papers, codes, and datasets) and 2) a flexible and extensible Python library for DGC. We hope this work can serve as a quick guide and help researchers overcome challenges in this vibrant field.

Abstract:
Efficient storage and query processing over tabular data, while balancing storage cost, query latency, and memory footprint, remains a fundamental challenge in the database community. In this work, we propose DeepMapping++, a neural-based data representation that leverages the memorization capability of deep neural networks to support efficient query processing in resource-constrained environments. DeepMapping++ has two flavors: DeepMapping-L for lossless look-up queries on categorical data and DeepMapping-R for approximate range aggregation queries on numerical data. DeepMapping-L integrates a lightweight auxiliary structure to correct prediction errors and to efficiently support data modification operations, including insertions, deletions, and updates. DeepMapping-R further incorporates a buffer structure for caching partially aggregated values to reduce the need for model retraining. Experiments on real-world, synthetic, and benchmark datasets demonstrate the effectiveness of DeepMapping-L and DeepMapping-R.

Abstract:
Cloud-assisted data sharing offers substantial convenience to resource-constrained users, enabling them to outsource data to public clouds and permit authorized users to perform retrievals. However, ensuring data privacy remains a significant challenge. Attribute-Based Keyword Search (ABKS) facilitates authorized searches over encrypted data, yet most existing ABKS schemes are primarily tailored for exact searches. They lack effective support for fuzzy search, which severely limits the practicality and efficiency. In this paper, we propose ESA, a novel privacy-preserving framework supporting multi-keyword fuzzy search with arbitrary Boolean semantics and fine-grained access control in data sharing. ESA introduces the Prime Filter, a novel data structure, along with a divide-and-conquer access control mechanism to achieve secure, efficient, and accurate fuzzy search. To enable authorized multi-keyword fuzzy search, the Prime Filter is designed as a unified primitive for both indexing and querying. It seamlessly integrates fuzzy search and access-policy enforcement by leveraging the indivisibility of prime numbers. To ensure reliable access control during both the search and decryption phases, ESA proposes a divide-and-conquer mechanism that leverages blockchain to coordinate the off-chain and on-chain indexes. Finally, we extend ESA from the perspective of non-repudiation. Formal security analysis demonstrates that ESA is secure under the known background model. Extensive evaluations confirm ESA’s advantages, demonstrating search speeds up to 2–86× faster than state-of-the-art methods, with precision exceeding 95%.
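
The abstract does not describe how the Prime Filter is built, so the following is only a generic illustration of how divisibility by primes can encode set membership and Boolean keyword tests. All names are hypothetical, the mapping from keywords to primes is an assumption, and the sketch involves no encryption, fuzzy matching, or access control, so it carries none of ESA's security properties.

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def build_index(vocabulary, doc_keywords):
    """Assign each keyword a distinct prime and multiply those in the document.

    Hypothetical encoding: by unique factorization, a keyword's prime divides
    the product exactly when the keyword is in the document.
    """
    ordered = sorted(vocabulary)
    prime_of = dict(zip(ordered, first_primes(len(ordered))))
    index_value = 1
    for keyword in set(doc_keywords):
        index_value *= prime_of[keyword]
    return prime_of, index_value

def contains_all(prime_of, index_value, query):
    """Conjunctive (AND) query: every keyword's prime divides the index."""
    return all(index_value % prime_of[kw] == 0 for kw in query)

def contains_any(prime_of, index_value, query):
    """Disjunctive (OR) query: at least one keyword's prime divides the index."""
    return any(index_value % prime_of[kw] == 0 for kw in query)

vocabulary = {"cloud", "graph", "key", "search"}
prime_of, index_value = build_index(vocabulary, ["cloud", "search"])
```

Here a single integer answers both AND and OR queries through modulo tests, which hints at why one structure could serve both indexing and querying.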

Abstract:
Real-world analytics hinges on tabular data, yet prevailing learners face a triple bind: tree ensembles excel on fixed schemas but cannot generalize to new columns, neural nets learn rich features yet overfit small tables, and recent transfer approaches falter when schemas diverge. We tackle these limitations with GTab, a Gradient-Boosting Bipartite Graph Neural Network that marries decision-tree residual refinement with self-supervised graph representation learning. GTab maps each table to an instance-feature bipartite graph, where a GNN, optimised jointly with contrastive, clustering, and reconstruction objectives, captures feature-feature, instance-instance, and cross-type relations. Boosted trees ingest the GNN's gradients, correcting residual errors and injecting the strong inductive bias of split-based models. Building on this backbone, we introduce three variants: E-GTab ensembles multiple overlapping feature sub-graphs for robust classic prediction; I-GTab inductively attaches unseen feature nodes, enabling feature-incremental inference without retraining; and T-GTab pre-trains on a source schema and lightly fine-tunes on a target schema to achieve zero-shot and transfer learning across heterogeneous tables. Across 20 public benchmarks and five clinical trials, GTab consistently ranks first: it outperforms tree, neural, and graph baselines on static tasks, surpasses prior art (and an oracle) when half the test-time columns are unseen, and delivers higher AUC than the leading transformer baseline in both cross-dataset and zero-shot transfer – all with a unified architecture. GTab thus offers a principled, scalable, and adaptable solution to holistic tabular data prediction, bridging the gap between classic ensembles and modern self-supervised representation learning.

Affiliations: School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; Department of Computer Science, Aalborg University, Copenhagen, Denmark; College of Information Science and Technology, Jinan University, Guangzhou, China; School of Computer Science and Engineering, Central South University, Changsha, China; Department of Computer Science, Aalborg University, Aalborg, Denmark

Abstract:
Next Point-of-interest (POI) recommendation has been widely used in real scenarios to predict the next possible location based on user behavior patterns. However, existing methods predominantly rely on spatio-temporal associations and check-in sequence relationships between users and POIs, which fall short for users with limited interactions with POIs. Moreover, user preferences are inherently multi-dimensional, so user selections are often influenced by multiple factors such as location categories and multi-modal information. To mitigate these issues, we introduce a Multi-Modal Knowledge Graph Modeling of Multi-Dimensional User Preferences for Next-POI Recommendation (M4Rec for short). First, we define a multi-modal knowledge graph to organize the relationships among users, locations, categories, and multi-modal information. Subsequently, we use the multi-modal knowledge graph-based relation-aware network to derive comprehensive entity representations from the constructed knowledge graph. Next, employing the temporal knowledge prediction method, we predict the user’s next-POI category and next-POI. Finally, the recommendation results are obtained by enhancing the corresponding location prediction scores through category semantics. Extensive experiments conducted on real-world datasets validate the superiority of our proposed method over state-of-the-art competitors.

Abstract:
Spatio-temporal data are continuously generated and grow exponentially with the rapid proliferation of mobile devices. Efficient storage and indexing of spatio-temporal Big Data are crucial for supporting fast queries. Due to their high write throughput and scalability, distributed NoSQL data stores such as HBase are widely adopted as storage engines in spatio-temporal Big Data systems. These systems require a key-value-based storage model to convert one or a group of spatio-temporal data points into a single key-value pair. However, existing techniques cannot govern an inherent dilemma caused by adding a time dimension within spatial information, that is, we can either optimize for space-preferred or time-preferred queries but not both. In this paper, we propose Cymo, a flexible storage model with query-aware indexing that dynamically adapts to varying query patterns across different spatial regions and evolving query behaviors over time. Our key idea is to partition the spatio-temporal space into multiple subspaces, allowing each subspace to optimize its storage model based on its specific query characteristics. Cymo features a learning model that predicts query patterns for each subspace using historical workloads. Additionally, it introduces a virtual layer that abstracts the heterogeneity of different storage models, providing a unified query interface. This design ensures that storage model variations remain transparent to applications while enabling effective query optimization based on diverse query patterns. We have implemented Cymo on HBase and integrated it into GeoMesa, a representative spatio-temporal Big Data system. Experimental results on real taxi datasets demonstrate that Cymo significantly improves query performance, achieving speedups of 1.53× to 10× compared to GeoMesa. The open-source implementation of Cymo has been released for public access.
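
The dilemma between space-preferred and time-preferred queries can be seen in a toy key-value layout. The key formats below are hypothetical and are not Cymo's or GeoMesa's actual encodings. They only illustrate that, in a store that sorts rows by key, placing the spatial cell first makes one kind of query a short contiguous scan and the other a long one, and vice versa.

```python
from itertools import product

def space_first_key(cell_id, t_bucket):
    """Hypothetical row key: spatial cell first, then time bucket."""
    return f"{cell_id:06d}|{t_bucket:010d}"

def time_first_key(cell_id, t_bucket):
    """Hypothetical row key: time bucket first, then spatial cell."""
    return f"{t_bucket:010d}|{cell_id:06d}"

def scan_span(key_fn, points, wanted):
    """Rows one contiguous range scan must touch to cover all wanted points.

    points and wanted are lists of (cell_id, t_bucket) pairs; rows are
    assumed to be stored in lexicographic key order.
    """
    ordered = sorted(key_fn(c, t) for c, t in points)
    wanted_keys = {key_fn(c, t) for c, t in wanted}
    hits = [i for i, key in enumerate(ordered) if key in wanted_keys]
    return hits[-1] - hits[0] + 1

# 4 spatial cells x 4 time buckets, one record each.
points = list(product(range(4), range(4)))
one_cell_all_times = [(2, t) for t in range(4)]   # space-preferred query
all_cells_one_time = [(c, 1) for c in range(4)]   # time-preferred query

spans = {
    ("space_first", "space_query"): scan_span(space_first_key, points, one_cell_all_times),
    ("time_first", "space_query"): scan_span(time_first_key, points, one_cell_all_times),
    ("space_first", "time_query"): scan_span(space_first_key, points, all_cells_one_time),
    ("time_first", "time_query"): scan_span(time_first_key, points, all_cells_one_time),
}
```

In this toy setup each layout serves its matching query with a 4-row scan but needs a 13-row scan for the other, which is the kind of trade-off that motivates choosing the layout per subspace according to its workload.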

Abstract:
The proliferation of fake reviews, often produced by organized groups, undermines consumer trust and fair competition on online platforms. These groups employ sophisticated strategies that evade traditional detection methods, particularly in cold-start scenarios involving newly launched products with sparse data. To address this, we propose the Diversity- and Similarity-aware Dynamic Graph Attention-enhanced Graph Convolutional Network (DS-DGA-GCN), a new graph learning model for detecting fake reviewer groups. DS-DGA-GCN achieves robust detection since it focuses on the joint relationships among products, reviews, and reviewers by modeling product-review-reviewer networks. DS-DGA-GCN also achieves adaptive detection by integrating a Network Feature Scoring (NFS) system and a new dynamic graph attention mechanism. The NFS system quantifies network attributes, including neighbor diversity and network self-similarity, as a unified feature score. The dynamic graph attention mechanism improves adaptability and computational efficiency by capturing features related to temporal information, node importance, and global network structure. Extensive experiments conducted on two real-world datasets derived from Amazon and Xiaohongshu demonstrate that DS-DGA-GCN significantly outperforms state-of-the-art baselines, achieving accuracies of up to 89.8% and 88.3%, respectively.

Abstract:
Science Question Answering (SQA) is an important task for evaluating models’ capability to reason with scientific knowledge. However, the extensive availability of scientific information (e.g., basic concepts in biology, physics, and chemistry) in pre-trained corpora may cause large language models (LLMs) to rely more on memorized information rather than actual reasoning when answering questions. This reliance persists even with techniques like Chain-of-Thought prompting, resulting in shallow understanding and limited reasoning of scientific knowledge. Therefore, to enhance LLMs’ capacity to comprehend and apply scientific knowledge, we propose a framework called Multi-Agent Cooperation-based Knowledge Exemplification (MCKE). Specifically, MCKE leverages knowledge alongside questions to create exemplified knowledge, promoting deeper understanding through novel knowledge representation. To better evaluate the model’s ability to reason and apply knowledge, we introduce NovSciQA, a multiple-choice question answering dataset based on newly created scientific knowledge. This dataset covers multi-subject scientific knowledge and questions that do not exist in reality, making it impossible for the model to rely on memorized answer-related information to answer questions. Experimental results show that the MCKE framework outperforms baselines, and the NovSciQA dataset effectively assesses models’ knowledge understanding and application.

Abstract:
Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors are significantly different from the normal ones, under the condition that only a limited number of annotated anomalies but abundant unlabeled samples are available. A major challenge for this task is to learn a meaningful latent feature representation that reduces intra-class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self-supervised feature learning methods for graph anomaly detection, their strategies are not specifically tailored to the unique requirements of graph anomaly detection, which motivates our exploration of a more domain-specific feature learning approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy specifically tailored for graph anomalies. Our approach is built upon a multi-task learning scheme designed to extract robust feature representations, through synthesized anomalies. We generate these synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type. This design ensures that the learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real-world patterns, they provide valuable auxiliary data for effective feature learning—much like the way features learned from classifying ImageNet images are used in various downstream computer vision tasks. Additionally, we adopt a two-phase learning strategy to balance the influence of synthetic anomalies and real data. The process begins with an initial warm-up phase using only synthetic samples, followed by a full-training phase that integrates both tasks. Numerous experiments on public datasets demonstrate the superior performance of our proposed strategy, in comparison with those of its competitors.

Affiliations: Key Laboratory of Advanced Microprocessor Chips and Systems, College of Computer Science and Technology, National University of Defense Technology, Changsha, China; National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Laboratory of Advanced Biotechnology, Beijing Institute of Biotechnology, Beijing, China; College of Meteorology and Oceanology, National University of Defense Technology, Changsha, China

Abstract:
Rumor detection is essential for building a responsible web and internet ecosystem, and it has attracted significant attention from the research community. However, emerging topic rumor detection, i.e., identifying rumors at the early stages of a topic’s emergence, when only limited discussions can be observed, remains a challenge. Technically, this scenario is accompanied by data scarcity on emerging topics and by a data distribution discrepancy between old topics and emerging new topics. In this paper, we propose a new framework termed LLM-driven ADversarial Example Synthesis (LADES) for emerging topic rumor detection. LADES utilizes Large Language Models (LLMs) for generating readable and contextually coherent adversarial examples. The generated adversarial examples not only expand the training set to tackle the data scarcity issue, but also act as a bridge to connect the data distribution of old and new topics. To overcome training instability in adversarial example generation, LADES introduces a gradient-free Markov Chain Monte Carlo (MCMC) sampling method. This method ensures adversarial examples are readable and contextually coherent by harnessing LLMs, while promoting effective attacks through entropy-based sampling that targets model uncertainty. To mitigate the impact of potential mislabeling in synthetic data, LADES implements a meta-mixed-learning mechanism. This mechanism dynamically adjusts the weights of synthetic adversarial examples, guided by limited labeled data from emerging topics, thereby alleviating the data noise. Extensive experiments conducted on three real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art (SOTA) baselines in terms of emerging topic rumor detection.

Abstract:
Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as the factual knowledge itself. Despite the potential of such paths, extracting reliable reasoning paths from KGs poses two challenges: graph structures are complex, and multiple candidate paths are generated, making it difficult to distinguish useful paths from redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.

Abstract:
Data quality is essential for the performance of deep neural networks in various fields. However, label noise and class imbalance are common data issues, which cause deep learning models to overfit in real-world scenarios. Recent research solves the learning with noisy labels (LNL) problem by employing label correction or loss adjustment methods, which often rely on uncertainty estimation. Unfortunately, these methods usually do not work well with imbalanced datasets. To address both noisy and imbalanced data biases, we analyze the limitations of current discriminative models in uncertainty estimation, and propose Class-aware Multi-granularity Co-Diffusion models (CaMCoD), which leverage a generative uncertainty and inconsistency loss adjustment method to generate labels more robustly. Specifically, we reframe the LNL problem as a robust diffusion-generative process, i.e., labels are generated by gradually refining an initial random guess. First, we use coarse-grained uncertainty from the diffusion model to achieve more accurate confidence estimates. This guides the model to generate correct labels on a broader level. Then, we leverage the fine-grained inconsistency of co-diffusion models during reverse denoising to determine the learnable weight for each sample, which can mitigate the risk of the model overfitting to noisy samples. Finally, we apply class-aware loss adjustments to reduce data bias caused by class imbalance. Experiments on both synthetic and real-world datasets demonstrate that our method performs well in imbalanced and noisy scenarios.

Abstract:
Research on learned cardinality estimation has made significant progress in recent years. However, existing methods still face distinct challenges that hinder their practical deployment in production environments. We define these challenges as the “Trilemma of Cardinality Estimation”, where learned cardinality estimation methods struggle to balance generality, accuracy, and updatability. To address these challenges, we introduce DistJoin, a join cardinality estimator based on efficient distribution prediction using multi-autoregressive models. Our contributions are threefold: (1) We propose a method to estimate join cardinality by leveraging the probability distributions of individual tables in a decoupled manner. (2) To meet the requirements of efficiency for DistJoin, we develop Adaptive Neural Predicate Modulation (ANPM), a high-throughput distribution estimation model. (3) We demonstrate that an existing similar approach suffers from variance accumulation issues by formal variance analysis. To mitigate this problem, DistJoin employs a selectivity-based approach to infer join cardinality, effectively reducing variance. In summary, DistJoin not only represents the first data-driven method to support both equi and non-equi joins simultaneously but also demonstrates superior accuracy while enabling fast and flexible updates. The experimental results demonstrate that DistJoin achieves the highest accuracy, robustness to data updates, generality, and comparable update and inference speed relative to existing methods.

Abstract:
Retrieval-augmented generation (RAG) provides an efficient solution for expanding the knowledge boundaries of large language models (LLMs), where the indexing serves as a compass to guide LLMs in locating query-relevant external knowledge. Nevertheless, current indexing methods commonly encounter a critical challenge: native indexing is convenient to construct, but it usually disrupts contextual associations and constrains the expressive capacity of rich knowledge. Conversely, knowledge indexing can structure contextual knowledge, but it is often based on preset schemas that limit its generalizability. To address this challenge, we propose a universal and flexible knowledge indexing method called pseudo-graph (PG) indexing. During the indexing construction phase, we use advanced LLMs to transform the knowledge of each raw text into a concise and structured mind map, organizing intra-document knowledge. Subsequently, independent mind maps are linked by associating highly relevant topics or consistent facts across documents, thereby establishing inter-document knowledge connections. Eventually, using the resulting knowledge network PG as the knowledge indexing can circumvent the challenges associated with schema design reliant on preset knowledge and relationship types. During the knowledge retrieval phase, we develop a PG knowledge retriever to mimic human note-reviewing, adaptively navigating and recalling query-relevant knowledge from PG. Experimental results demonstrate that retrieving relevant pseudo-subgraphs from the PG via PG indexing and retriever significantly improves performance in fact-based Q&A, hallucination correction, and two multi-document Q&A tasks, achieving $F1_{QE}$ improvements of 15.85%, 8.12%, 3.34%, and 5.73%, respectively, and outperforming the state-of-the-art baseline KGP-LLaMA. Our code is available at: https://github.com/IAAR-Shanghai/PGRAG.

Abstract:
Knowledge base question answering (KBQA) refers to the task of answering natural language questions using factual information from large-scale knowledge bases (KBs). To obtain accurate answers, recent research optimizes semantic parsing methods, a major KBQA approach, with large language models (LLMs), where concise logical forms (LFs) are generated by LLMs and executed in KBs. Although these methods demonstrate superior performance, they still encounter the problem that some generated LFs fail to yield answers when executed, significantly limiting their effectiveness. To mitigate this issue, we propose KARV, a Knowledge-Assisted reasoning path Reconstruction and hierarchical Voting approach for non-executable LFs. This method extracts semantic knowledge from KBs as guidance to correct and reconstruct reasoning paths, deriving answers through a voting-based strategy. The insight is that non-executable LFs generated by LLMs still contain rich semantic information, and the knowledge retrieved from KBs can effectively correct them. Specifically, we fine-tune LLMs to generate high-quality LFs, and the non-executable LFs are decomposed into multiple path branches based on mentioned entities. Semantic knowledge from KBs is then leveraged to correct the entities and relations within these branches, effectively reconstructing the reasoning paths. To obtain precise final answers, we apply a hierarchical voting strategy both within and across the non-executable LFs. Our proposed method achieves state-of-the-art performance on benchmarks including WebQuestionsSP (WebQSP), ComplexWebQuestions (CWQ), and FreebaseQA.
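
The idea of voting "both within and across" logical forms can be sketched as a two-level majority vote. This is only an illustrative assumption about what hierarchical voting might look like: the abstract does not specify KARV's voting rules, and `vote`, `hierarchical_vote`, and the sample answers are hypothetical.

```python
from collections import Counter


def vote(answers):
    """Return the most frequent answer, or None when there are no answers."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def hierarchical_vote(answers_per_lf):
    """Vote within each logical form first, then across the per-form winners."""
    winners = [vote(branch_answers) for branch_answers in answers_per_lf]
    return vote([w for w in winners if w is not None])


# Hypothetical answers produced by the reconstructed branches of four LFs.
answers_per_lf = [
    ["Paris", "Paris", "Lyon"],  # within-LF winner: Paris
    ["Lyon", "Lyon", "Paris"],   # within-LF winner: Lyon
    ["Paris"],                   # within-LF winner: Paris
    [],                          # no branch produced an answer
]
final_answer = hierarchical_vote(answers_per_lf)
```

Ties are resolved by first occurrence in this sketch; a real system would need an explicit tie-breaking rule.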

Abstract:
With growing client diversity, model-heterogeneous personalized federated learning (MHPFL) supports collaboration over structure-heterogeneous client models. However, existing MHPFL methods only achieve client-level personalization but ignore inherent discrepancies within each client’s different data samples, leading to limited model performance. To this end, we propose a novel model-heterogeneous personalized Federated learning with Mixture of Experts (pFedMoE) to achieve a fine-grained data-level personalization. As the first work that incorporates MoE in MHPFL, it introduces three innovations: (1) Because different clients hold heterogeneous local models, we add a small proxy global homogeneous feature extractor shared by clients for knowledge exchange. (2) To achieve a fine-grained data-level personalization, we construct a personalized local MoE for each client: a local expert (local heterogeneous client model’s feature extractor), a global expert (global proxy homogeneous feature extractor), and a local personalized gating network, which dynamically balances the generalization and personalization of the local model at the data sample level. (3) We customize a lightweight linear gating network to capture the generalized and personalized data characteristics of each local data sample. We theoretically prove its $\mathcal{O}(1/T)$ convergence rate. Experiments on 3 benchmark image datasets, 1 real-world image dataset and 1 real-world text dataset against 9 baselines demonstrate its state-of-the-art model accuracy with up to 2.79% accuracy improvement while saving up to 43.12% computational overhead and keeping satisfactory communication costs.
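
The per-sample mixing of a local expert and a global expert through a linear gating network can be sketched as follows. This is a minimal illustration under assumptions: the softmax normalisation, the function names, and the toy weights are not taken from the paper, and real feature extractors would be neural networks rather than fixed vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_gate(x, weights, bias):
    """Linear gating network: one score per expert, normalised with softmax."""
    scores = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


def mix_experts(gate, local_repr, global_repr):
    """Weighted sum of the local-expert and global-expert representations."""
    g_local, g_global = gate
    return [g_local * l + g_global * g for l, g in zip(local_repr, global_repr)]


# Toy sample with two input features; the weights are hypothetical.
x = [1.0, 2.0]
weights = [[0.5, -0.25], [0.0, 0.0]]  # 2 experts x 2 features
bias = [0.0, 0.0]
gate = linear_gate(x, weights, bias)
mixed = mix_experts(gate, local_repr=[1.0, 0.0], global_repr=[0.0, 1.0])
```

Because the gate depends on the input sample `x`, each sample receives its own balance between the personalized and the generalized representation.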

Abstract:
Traffic Signal Control plays a vital role in modern traffic management. However, most existing methods focus exclusively on vehicle flow, neglecting the critical role of pedestrians, leading to suboptimal performance in intersections with mixed vehicle-pedestrian traffic. Pedestrian behavior presents unique challenges due to its irregularity and flexibility, such as non-lane-based movements and uncertain crossing directions, which cannot be modeled by existing methods. To address this limitation, we propose VPLight, a comprehensive framework designed to manage both Vehicle and Pedestrian dynamics in traffic signal control. Specifically, we first design the Pedestrian Feature Extractor to capture the spatiotemporal dynamics of pedestrian movement, offering a robust representation of their irregular patterns. Subsequently, to coordinate traffic signal control at multiple intersections, we develop a novel communication approach called V-Comm to enable effective integration among intersections. Extensive experiments show that VPLight outperforms state-of-the-art baselines by significant margins (up to +44.04%). Our results demonstrate that VPLight effectively addresses the challenges of mixed vehicle-pedestrian traffic control and enhances the overall traffic flow efficiency across the road network.

Abstract:
The evolutionary dynamics of complex systems encode critical information about their functional organization. In particular, the generation times of edges reveal key aspects of historical development in networked systems such as protein–protein interaction networks, ecosystems, and social networks. Accurately recovering these temporal processes is of significant scientific value—for example, in elucidating the mechanisms underlying protein interaction evolution. However, existing methods typically assume access to partially time-stamped networks and often struggle to generalize across domains. They perform poorly in recovering edge generation times in static networks without temporal annotations. To address this challenge, we propose a comparative paradigm that enables cross-network learning by jointly training on multiple temporal networks. This framework captures structural–temporal correlations that generalize across networks and improves accuracy by 16.98% on average compared to separate training strategies. Furthermore, to mitigate the scarcity of real temporal data, we introduce a novel diffusion-based generative model for synthesizing large quantities of pseudo-temporal networks. By integrating both real and generated samples during training, our joint strategy yields an additional 5.46% improvement in predictive accuracy, demonstrating the effectiveness of data augmentation in enhancing generalization.

Abstract:
In modern quantitative finance and portfolio-based investment, modeling latent interactions among stocks is paramount for prediction and decision-making on profit, risk management, hedging, etc. While previous studies have constructed complex stock graphs for applying sophisticated variants of graph neural networks (GNNs), existing graph modeling approaches still face two limitations: 1) Correlation-based statistical relationships fail to unveil nuanced stock interactions effectively and determine directional influence. 2) Rigid and static graphs overlook the evolving graph structure of stocks in volatile financial systems. In this paper, we propose a dynamic-causal graph neural network (DC-GNN) to discover causal interactions from the non-stationary price time series and dynamically model graph structures for stock movement prediction. More specifically, we identify the pattern prototypes of all directed stock pairs from long-term price movement knowledge to quantify their causal interactions. These prototypes capture the pattern-to-pattern correspondence across time series based on symbolic dynamics. By inferring real-time stock networks from the prototypes, we encapsulate neighbor-induced causal impacts within heterogeneous edges to model bullish, bearish, and neutral effects among stocks. Extensive experiments conducted on real-world trading data demonstrate the superiority of the proposed framework over various state-of-the-art baselines and its effectiveness, robustness, and interpretability in Fintech.

Abstract:
Knowledge graphs (KGs) can provide structured knowledge to assist large language models (LLMs) in interpretable reasoning. Knowledge graph question answering (KGQA) is a typical benchmark to evaluate KG-enhanced LLM methods. Previous methods of KG-enhanced LLMs for KGQA mainly include: 1) origin question-oriented methods, which perform KG retrieval based solely on the original question without explicitly analyzing multi-step reasoning logic; and 2) stepwise reasoning-oriented methods, which alternate between LLM generating the next reasoning step and targeted KG retrieval but lack systematic planning, leading to poor controllability. To tackle these limitations, we propose KELGoP, a framework of KG-enhanced LLM based on global planning. We propose fine-grained question categorization based on reasoning patterns and corresponding category-driven question decomposition for complex questions, enabling more controllable reasoning and atomic KG retrieval targeted to sub-questions. Furthermore, we propose an adaptive strategy that allows adjusting the reasoning pattern based on the performance of question answering, making the reasoning more flexible and robust. Finally, we introduce several efficient atomic KG retrieval strategies that operate on KG subgraphs to assist the LLM in answering atomic-level questions. A series of experiments on KGQA datasets demonstrate that our proposed framework achieves superior performance compared to existing baselines.

Abstract:
Identifying harmful memes is challenging due to their implicit meanings, which are not always evident from texts and images alone. Existing solutions often lack clear explanations to justify their decisions. To address this gap, we propose an explainable approach, ExplainHM++, which detects harmful memes by reasoning over competing rationales from both harmful and harmless perspectives. First, inspired by the capabilities of Large Multimodal Models (LMMs) in text generation and multimodal reasoning, we develop ExplainHM, a one-stage multimodal debate in which LMMs generate explanations through contradictory arguments. Second, we fine-tune a small language model to serve as a judge in the debate, improving the integration of harmfulness rationales with the multimodal content of memes. However, we observe that a naive multimodal debate remains vulnerable, as it heavily depends on the inherent reasoning ability of LMMs to understand the memes. Given the evolving and noisy nature of memes, we further introduce a meme sample retrieval mechanism and a retrieval-augmented debate paradigm to strengthen and refine LMM-generated explanations. Extensive experiments on three public meme datasets demonstrate that ExplainHM++ not only outperforms state-of-the-art methods but also provides superior, interpretable explanations for harmful meme detection.

Abstract:
Recent years have witnessed increasing attention on the semantic knowledge integration between curated knowledge bases (CKBs) and open knowledge bases (OKBs), which is non-trivial due to the intrinsically heterogeneous features involved in CKBs and OKBs. OKB canonicalization and OKB linking are regarded as two vital tasks to achieve the knowledge integration. Although these two tasks are inherently complementary to each other, previous studies solve them only separately or via superficial interaction. To address this issue, we propose CLUE+, a novel framework that jointly encodes the OKB and CKB into a unified embedding space, to tackle OKB canonicalization and OKB linking simultaneously and make them benefit each other reciprocally. We design an expectation-maximization (EM) based approach to iteratively refine the unified embedding space by alternating between seed generation and embedding refinement, leveraging the deep interaction between OKB canonicalization and OKB linking. Curriculum learning is employed to yield high-quality canonicalization seeds and linking seeds adaptively, according to two elaborately designed metrics (i.e., a margin-based linking metric and an entropy-based cluster metric). Additionally, active learning is incorporated to further complement the seed generation process by selectively annotating the most informative noun phrases within low-quality clusters, driven by an innovative acquisition function comprising three key criteria (i.e., uncertainty, diversity and specificity). A thorough experimental study over two public benchmark data sets demonstrates that our proposed CLUE+ consistently outperforms state-of-the-art baselines for the task of OKB canonicalization (resp. OKB linking) in terms of average F1 (resp. accuracy).

Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; College of Mathematics and Informatics, South China Agricultural University, Guangzhou, China; School of Information Engineering, Guangdong University of Technology, Guangzhou, China; Peng Cheng Laboratory, Shenzhen, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

Abstract:
Exposure bias is a notorious issue in Recommender Systems (RSs), often resulting in the overexposure of a few items while a majority of items are underexposed. Most existing methods mitigate exposure bias in offline recommendations, but few studies address such bias in Interactive Recommender Systems (IRSs). However, the exposure bias will be increasingly amplified over time as users interact with the system. Worse still, bias amplification will further lead to a series of problems such as echo chambers and filter bubbles. To address these issues, we propose a novel paradigm, Causal Intervention-Inspired Policy Learning to Mitigate Exposure Bias for Interactive RECommendation (CIREC), which consists of Intervention-based Causal Discovery (ICD) and Causality-Inspired Interactive Policy Learning (CIIPL). Firstly, ICD meticulously examines the causal relationships inherent in the interactive recommendation model and formulates a novel Dynamic Causal Graph (DCG). The DCG is employed to generate the temporal conditional probability. Subsequently, the causal intervention is performed on the DCG, which serves to induce the temporal interventional probability. Afterwards, CIIPL utilizes the temporal conditional probability to estimate the cause-effect relationships between exposure bias and prediction scores during the phase of causality-inspired interactive training. It then employs the temporal interventional probability to alleviate the negative impact of exposure bias in the phase of causal intervention-based inference. Extensive experiments conducted on a series of public benchmarks substantiate the efficacy and efficiency of our proposed CIREC in mitigating exposure bias within the IRS.

Abstract:
As neural network models grow larger and more complex, federated learning (FL) faces challenges in terms of communication and computation efficiency. To address these issues, layer-wise learning has been proposed. Existing approaches do not leverage useful properties of layer-wise learning, including update locking and variations in convergence rates, thereby resulting in sub-par model performance. To bridge this gap, we propose the Federated Mean Block Difference-based global model aggregation approach with Patience-based local training (FedMBDP). We automatically partition the neural network model into uncoupled blocks and progressively train them. Determining which blocks to train and aggregate becomes a critical task. To improve computation efficiency, we propose a patience-based local training algorithm to adaptively select training blocks, reducing computation latency. To improve communication efficiency, we introduce a mean block difference-based global model aggregation algorithm to dynamically select blocks for aggregation to minimize communication latency. We provide the convergence analysis of FedMBDP. Extensive experiments on three widely adopted benchmark datasets show that FedMBDP achieves superior performance compared to six state-of-the-art approaches. It reduces FL training latency by 26.37% compared to the best baseline, while achieving similar test accuracy.

Abstract:
Nowadays, social media platforms have become primary channels for dissemination of fake news. On these platforms, user comments provide direct reactions and insights into the content being shared, offering valuable clues for effective fake news detection. However, existing approaches predominantly analyze comments from an isolated, single-comment perspective, overlooking the broader insights from the entire comment section. To address this limitation, this paper comprehensively considers three key factors within the comment section: emotional evolution, semantic evolution, and diversity of user attention, based on which a novel fake news detection model MESE is proposed by mining the emotional and semantic evolution from user comments. Specifically, to capture the diversity of user attention toward different news segments, we first propose a news-conditioned comment attention mechanism to obtain news-enhanced comment representations. Next, a gating mechanism is introduced to deeply integrate emotional and semantic features. Additionally, we develop a comment emotional and semantic evolution module to capture shifts in public reactions over time. Finally, these diverse representations are fused to generate prediction results. Extensive experiments on two public datasets demonstrate the superior performance of MESE. Further case studies and ablation experiments validate the rationality of our design and the effectiveness of the model components.

Abstract:
Distributed machine learning provides an efficient solution for large-scale data processing through parallel computing. However, current distributed learning relies on global or local paradigms and cannot adaptively adjust decision boundaries in complex data environments. To address this problem, we propose a Distributed Hybrid Learning algorithm based on Fisher Linear Discriminant (DHL-FLD). Specifically, DHL-FLD consists of a global pre-learning phase and a subspace local learning phase. On the one hand, the global pre-learning phase is designed to divide the data space, which can obtain the data structure information. On the other hand, the local learning phase dynamically adjusts and optimizes the decision boundaries, guided by the structural information and distributional properties of the data. Theoretically, we establish the generalization bound of DHL-FLD using the integral operator technique and verify the scalability and robustness of DHL-FLD. The effectiveness of DHL-FLD is demonstrated through extensive experiments on real datasets.
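
For readers unfamiliar with the building block named in DHL-FLD, the classical two-class Fisher Linear Discriminant direction is w = S_w^{-1}(m_1 - m_0), where S_w is the within-class scatter matrix and m_0, m_1 are the class means. The sketch below computes it for 2-D points in pure Python. It illustrates only the textbook discriminant, not the distributed hybrid algorithm itself; the helper names and toy data are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, mu):
    """2x2 scatter matrix: sum over points of (p - mu)(p - mu)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Fisher discriminant direction w = Sw^{-1} (m1 - m0) for 2-D data."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Apply the closed-form inverse of the 2x2 matrix Sw to diff.
    return [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]


# Toy data: two unit squares separated along the first axis.
class0 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class1 = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0), (4.0, 1.0)]
w = fisher_direction(class0, class1)


def project(p):
    """Project a point onto the discriminant direction."""
    return w[0] * p[0] + w[1] * p[1]
```

Projecting the points with `project` places all of `class0` below all of `class1` along `w`, so a single threshold on the projection separates the toy classes.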

Abstract:
Event Causality Extraction (ECE) aims to extract causal event pairs from text. Existing methods overlook the interplay between causal event pairs and their corresponding textual evidence (e.g., causal event mention pairs), and fail to effectively leverage global causal dependency information. To address these issues, we propose a Mention-level Causal Evidence and Global Causal Graph Reasoning (MLCE-GCGR) framework to enhance ECE. First, we introduce an auxiliary Event Mention Causality Extraction (EMCE) task, which extracts causal event mention pairs, to provide evidence for the main ECE task, and design a Dual-level Interaction Enhancement (DLIE) strategy to enhance the bidirectional interplay between event-level and mention-level causality. Second, we develop a Global Causal Graph Reasoning (GCGR) module that simulates human-like multi-turn reasoning, aiming to progressively refine the causal graph by capturing global dependencies among event mentions, types, and arguments. Experiments on four benchmark datasets show that our method outperforms state-of-the-art approaches. Moreover, by extracting causal event mention pairs as supporting evidence, our approach improves the interpretability of structured causality extraction.

Abstract:
Spatial stratified heterogeneity refers to the pattern variation of the target phenomenon across different regions. Current measures mainly quantify spatially stratified heterogeneity in terms of the consistency between strata and the target variable and neglect the complexity of stratification, whereas complex stratification may lead to overfitting and an overestimated degree of heterogeneity. To address this issue, this paper enhances the relative-entropy-based spatial stratified heterogeneity measure to unit explanatory power using the entropy of the stratification. The proposed method first quantifies the stratification complexity using the minimum theoretical number of bits for encoding it, then characterizes the degree of heterogeneity per bit, i.e., unit explanatory power, by the quotient of the relative-entropy-based measure and the stratification complexity. Additionally, this paper develops two visualization tools for interpreting and comparing unit explanatory power and reveals the relation among spatial stratified heterogeneity, stratification complexity, and the log-likelihood function. Finally, we conduct experiments on both illustrative and real-life data sets to show the advantages of the unit explanatory power over traditional spatial stratified heterogeneity measures.
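
The quotient described above can be sketched in a few lines. The sketch assumes that the stratification complexity is the Shannon entropy, in bits, of the stratum-size proportions, and it takes the relative-entropy-based heterogeneity value as a given input, because the abstract defines neither precisely. All names and numbers are hypothetical.

```python
import math
from collections import Counter


def stratification_entropy_bits(strata_labels):
    """Shannon entropy (bits) of the distribution of units over strata."""
    n = len(strata_labels)
    counts = Counter(strata_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def unit_explanatory_power(heterogeneity, strata_labels):
    """Heterogeneity per bit of stratification complexity (assumed form)."""
    bits = stratification_entropy_bits(strata_labels)
    if bits == 0:
        raise ValueError("a single stratum has zero complexity")
    return heterogeneity / bits


# The same hypothetical heterogeneity score under a coarse and a fine stratification.
coarse = ["a", "a", "b", "b"]  # two equal strata: 1 bit
fine = ["a", "b", "c", "d"]    # four equal strata: 2 bits
uep_coarse = unit_explanatory_power(0.6, coarse)
uep_fine = unit_explanatory_power(0.6, fine)
```

With an identical heterogeneity score, the finer stratification gets half the unit explanatory power, which reflects the penalty on complex stratifications that the abstract motivates.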

Abstract:
Existing temporal knowledge graph completion research often extracts features from isolated aspects of temporal facts, which hinders the modeling of multi-faceted semantic structures under complex temporal dynamics. This work begins with the dynamic characteristics and behavior patterns of entities under independent perturbations of relations or time, and reveals the following scenarios: 1) The sensitivity of relations to temporal changes: Different relations between entities show distinct temporal sensitivities as time evolves. 2) Latent perturbations between relations: Different relations mutually perturb and interact, revealing latent internal associations. 3) Snail shell-like time growth effect: Timestamps can be modeled as the outward growth of a spiral shell, where each spiral marks a trace of progress, reflecting the cumulative and progressive nature of time. To capture these scenarios, we design a novel model, FNES, inspired by the natural structure of snail shells, which maps facts onto corresponding equiangular spirals. Specifically, FNES treats each entity as an initial radial distance, maps each relation to a unique rotation angle, and uses each timestamp as a constant controlling the rate of radial expansion. As relations gradually extend and timestamps expand outward, the spiral structure progressively takes shape. To prevent intersections between spiral trajectories constructed by different entities, FNES learns the modulus distribution differences between head and tail entities along the $z$-axis. In this way, we provide a compact and coherent encoding structure for facts while simulating real-world rules to the greatest extent possible. Empirically, FNES significantly outperforms SOTA models on six public benchmarks as well as industrial mining datasets.
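
The geometric picture can be made concrete with the textbook equiangular (logarithmic) spiral r = a * exp(b * theta). The sketch below follows the abstract's wording by reading a as the entity's initial radial distance, theta as the relation's rotation angle, and b as the timestamp-dependent growth constant. This mapping is an illustrative assumption and not FNES's actual scoring function, which the abstract does not give.

```python
import math


def spiral_radius(initial_radius, angle, growth_rate):
    """Radius of an equiangular spiral: r = a * exp(b * theta)."""
    return initial_radius * math.exp(growth_rate * angle)


def spiral_point(initial_radius, angle, growth_rate):
    """Cartesian coordinates of the spiral point at the given angle."""
    r = spiral_radius(initial_radius, angle, growth_rate)
    return (r * math.cos(angle), r * math.sin(angle))


# Hypothetical values: entity -> a, relation -> theta, timestamp -> b.
a, b = 2.0, 0.3
start = spiral_point(a, 0.0, b)
quarter_turn = spiral_point(a, math.pi / 2, b)

# Equal angle increments scale the radius by the same factor, the defining
# property of an equiangular spiral.
ratio_1 = spiral_radius(a, 1.0, b) / spiral_radius(a, 0.5, b)
ratio_2 = spiral_radius(a, 1.5, b) / spiral_radius(a, 1.0, b)
```

A larger growth constant `b` makes the same rotation land farther from the origin, which is one way to read "each timestamp as a constant controlling the rate of radial expansion".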

Abstract:
Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search with theoretical guarantees. Traditional LSH-based methods mainly focus on improving the efficiency and accuracy of the query phase by designing different query strategies, but pay little attention to improving the efficiency of the indexing phase. They typically fine-tune existing data-oriented partitioning trees to index data points and support their query strategies. However, their strategy of directly partitioning the multidimensional space is time-consuming, and performance degrades as the space dimensionality increases. In this paper, we design an encoding-based tree called Dynamic Encoding Tree (DE-Tree) to improve the indexing efficiency and support efficient range queries. Based on DE-Tree, we propose a novel LSH scheme called DET-LSH. DET-LSH adopts a novel query strategy, which performs range queries in multiple independent index DE-Trees to reduce the probability of missing exact NN points. Extensive experiments demonstrate that while achieving the best query accuracy, DET-LSH achieves up to 6x speedup in indexing time and 2x speedup in query time over the state-of-the-art LSH-based methods. In addition, to further improve the performance of DET-LSH, we propose PDET-LSH, an in-memory method that exploits the parallelization opportunities provided by multicore CPUs. PDET-LSH exhibits considerable advantages in indexing and query efficiency, especially on large-scale datasets. Extensive experiments show that, while achieving the same query accuracy as DET-LSH, PDET-LSH offers up to 40x speedup in indexing time and 62x speedup in query answering time over the state-of-the-art LSH-based methods. Our theoretical analysis demonstrates that DET-LSH and PDET-LSH offer probabilistic guarantees on query answering accuracy.
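
As background for the LSH setting, the sketch below shows the classical p-stable hash function for Euclidean distance, h(x) = floor((a . x + b) / w), where a is a Gaussian random vector, b is a uniform offset in [0, w), and w is the bucket width. It illustrates generic LSH hashing only and is not the DE-Tree or the DET-LSH query strategy; the parameter values and helper names are hypothetical.

```python
import math
import random


def make_hash(dim, bucket_width, rng):
    """Build one p-stable LSH function h(x) = floor((a . x + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / bucket_width)

    return h


# Four independent hash functions form a compound signature.
rng = random.Random(0)
hashes = [make_hash(dim=3, bucket_width=4.0, rng=rng) for _ in range(4)]


def signature(x):
    """Tuple of bucket ids, one per hash function."""
    return tuple(h(x) for h in hashes)


sig = signature([1.0, 2.0, 3.0])
```

Nearby points share bucket ids with higher probability than distant points, which is the property that LSH-based indexes rely on.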

Abstract:
With the rapid advancement and widespread application of graph neural networks (GNNs), collaborative graph learning, in which multiple parties collaboratively construct a GNN model using their respective graph data, has attracted increasing attention. However, this paradigm also raises significant privacy concerns, as both nodes and edges may contain sensitive personal information, while existing privacy-preserving schemes often come at the cost of degraded model performance or substantial system overhead. Therefore, this paper proposes an efficient and privacy-preserving collaborative learning framework on vertically partitioned graph data, dubbed Plog. Specifically, we first design a decomposition algorithm to split the sparse adjacency matrix into the summation of multiple independent permutations, which are lightweight, parallelizable, and well-suited for secure multi-party computation. Building on this, a weighted oblivious batch permutation protocol is carefully customized based on correlated randomness to securely and efficiently compute adjacency matrix multiplications, addressing the core efficiency bottleneck in GNN inference and training. The selective security of Plog is formally verified under the ideal-real paradigm. Extensive experimental results on three real-world datasets demonstrate that compared to the state-of-the-art scheme, Plog can reduce online communication rounds by 46% and achieve a 1.73× speedup in the overall inference and training time.

Abstract:
Fully mining the differential features of samples from different classes in overlapping areas is both the key to and the main difficulty in improving imbalanced classification performance under complex distribution patterns. Although existing data-level and algorithm-level methods have achieved good results in dealing with overlapping problems, sample generation and classifier training heavily rely on distribution information, and the ability to mine differential information is limited. This paper proposes a dual imbalanced classification framework with feature transfer guided by a memory compensation strategy, which enhances the model's ability to mine differential features by constructing a feature space with better inter-class separability. In the traditional classification branch, a feature extraction network maps original samples to the feature space and a traditional classifier is used to classify the features. In the compensation classification branch, a feature memory module based on an iterative clustering strategy is designed, separately obtaining and saving the correctly classified feature centers of different classes. Moreover, a feature transfer module based on vector combination theory is proposed, combining “push” and “pull” vectors to transfer the misclassified features to the non-overlapping areas corresponding to the same class feature memory module, thereby constructing a feature space with better inter-class separability. Finally, a classification compensation strategy based on feature similarity is designed, integrating the prediction results of the traditional classifier and feature memory module as the final classification results. Experimental results on 50 imbalanced datasets show the proposed method outperforms 28 typical imbalanced classification methods in F1-score and G-mean. The performance improvement is especially significant on 20 severely overlapping datasets.

Abstract:
Spatial Transcriptomics offers unprecedented opportunities to explore tissue architecture by capturing gene expression with spatial context. However, effectively learning discriminative and spatially smooth representations for accurate spatial domain identification remains a significant challenge. To address this, we propose CSMVL, a multi-view representation learning framework to learn high-quality spot representations by synergistically enhancing both discriminability and spatial continuity. CSMVL introduces a cluster structure learning strategy that guides cell representations within the same domain toward their cluster center while simultaneously separating distinct cluster centers, thereby improving intra-domain compactness and inter-domain separability. Furthermore, graph smoothness regularization is introduced to ensure that representations of spatially adjacent cells within the same domain transition smoothly, reflecting the inherent spatial continuity of biological tissues. Extensive experiments on public ST datasets demonstrate CSMVL’s superiority, achieving an average ARI of 71.64% and NMI of 73.43%, outperforming existing state-of-the-art methods.

Abstract:
High-utility itemset mining (HUIM) is an advanced problem of frequent itemset mining, considering the frequency of occurrence and quantitative criteria such as unit profit. Because HUIM can be applied to a broad spectrum of knowledge discovery work, various algorithmic improvements have been studied over the past two decades. On the other hand, limited efforts have been made to take advantage of hardware performance despite significant changes in hardware trends. This paper presents a novel parallelization method called DPHIM (Dynamic Parallelization for High-utility Itemset Mining). DPHIM dynamically decomposes a high-utility itemset mining task into subtasks to utilize logical parallelism and carefully assigns the subtasks and their related data to physical resources such as processing cores and nearby memory in a NUMA-aware manner. Through rigorous and diverse experiments, we found that DPHIM achieved speeds up to 72.7 times faster than the fully tuned serial execution, up to 23.5 times faster than static partitioning, and up to 2.5 times faster than the best case of alternative dynamic parallel executions for a variety of datasets and configurations on DRAM. We also demonstrated that DPHIM effectively worked on persistent memory; it offered similar thread scalability trends and was 1.1 to 2.4 times slower on persistent memory.

Abstract:
With the widespread adoption of cloud storage, time-series databases have become indispensable for managing and analyzing sequential data generated on the user side over time (i.e., time-series data), thereby alleviating the computational and storage burden on resource-constrained users. However, critical security and privacy challenges—such as query privacy leakage, data exposure, and threats to storage integrity—remain inadequately addressed by existing solutions. To this end, we propose VMPQ, an efficient protocol for privacy-preserving and verifiable multi-predicate queries over time-series databases. Specifically, we introduce a new cryptographic primitive, verifiable offline/online private information retrieval (V-OO-PIR), which supports sublinear retrieval complexity while simultaneously ensuring both query privacy and result verifiability against untrusted servers. Building on V-OO-PIR, we design a dual-layer security framework that integrates replicated secret sharing (RSS) and secure multiparty computation (MPC): 1) RSS splits time-series data into two shares stored across two non-colluding servers, ensuring data confidentiality and mitigating exposure risks, and 2) MPC performs secure multiplication directly on these shares, enabling efficient evaluation of multi-predicate queries without reconstructing the original data. As a result, VMPQ ensures query privacy by preventing servers from inferring user interests across multiple predicates, while simultaneously guaranteeing data confidentiality and the verifiability of query results. Theoretical analysis confirms the security of VMPQ against malicious adversaries. Experimental results demonstrate that VMPQ reduces query latency by up to 5× compared to the state-of-the-art solution Waldo, while also enhancing throughput and preserving high storage efficiency through optimized database encoding.

Abstract:
In recent years, the development of smart edge computing systems that process information locally has been on the rise. Many near-sensor machine learning (ML) approaches have been implemented to introduce accurate and energy-efficient template matching operations in resource-constrained edge sensing systems, such as wearables. To introduce novel solutions that can be viable for extreme edge cases, hybrid solutions combining conventional and emerging technologies have started to be proposed. Deep Neural Networks (DNNs) optimised for edge applications, alongside new approaches to computing (both device- and architecture-wise), could be strong candidates for implementing edge ML solutions that aim at competitive classification accuracy while using a fraction of the power of conventional ML solutions. In this work, we propose a hybrid software-hardware edge classifier aimed at extreme edge near-sensor systems. The classifier consists of two parts: (i) an optimised digital tinyML network, working as a front-end feature extractor, and (ii) a back-end RRAM-CMOS analogue content addressable memory (ACAM), working as a final stage template matching system. The combined hybrid system exhibits a competitive trade-off in accuracy versus energy metric with E_front-end = 96.23 nJ and E_back-end = 1.45 nJ for each classification operation compared with 78.06 μJ for the original teacher model, representing a 792-fold reduction, making it a viable solution for extreme edge applications.

Abstract:
Time series classification (TSC) is a critical area with broad applications. In the field of evidence theory, quantum evidence theory (QET) offers a promising framework for one-dimensional TSC tasks, leveraging the capabilities of quantum basic probability amplitude (QBPA) to capture two-dimensional uncertainty. However, as the first step in applying QET to TSC, how to construct QBPA remains an open issue. In this paper, a novel approach to generate QBPA is devised. Specifically, we first apply the discrete Fourier transform (DFT) to the original data, extracting two-dimensional features embedded in the magnitude and phase from the frequency domain based on the first few frequency components, achieved by setting a threshold frequency index (TFI) to limit the frequencies considered. Next, we introduce the complex dual Gaussian fuzzy number (CDGFN) as a carrier for QBPA, effectively representing two-dimensional uncertainty in the data. A CDGFN-based multisource information fusion (CDGFN-MSIF) algorithm for decision-making is proposed to combine information from different frequency components. Finally, the decision-making algorithm is validated on multiple time series datasets. Experimental results highlight the superior performance of the proposed approach over other state-of-the-art models, demonstrating its effectiveness and enhanced classification accuracy.
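
The feature-extraction step described above can be sketched as follows: compute the DFT and keep the magnitude and phase of the lowest frequency components up to a threshold index. Starting at index 1 (skipping the DC term) and the function name `dft_features` are assumptions; the CDGFN construction and the fusion algorithm are not covered.

```python
import cmath
import math


def dft_features(series, tfi):
    """Return (magnitude, phase) pairs for frequency indices 1..tfi."""
    n = len(series)
    features = []
    for k in range(1, tfi + 1):
        # k-th DFT coefficient: sum_t x[t] * exp(-2*pi*i*k*t/n)
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)
        )
        features.append((abs(coeff), cmath.phase(coeff)))
    return features


# One full cosine period over 8 samples: all energy sits at frequency index 1.
series = [math.cos(2 * math.pi * t / 8) for t in range(8)]
features = dft_features(series, tfi=2)
```

Each retained component yields a two-dimensional feature (magnitude and phase), which matches the two-dimensional uncertainty the abstract refers to.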

Abstract:
Unsupervised anomaly detection in multivariate time series can prevent large-scale system failures and is crucial for various applications. Most existing methods only consider a single temporal pattern and insufficiently model the normal pattern, causing the model to learn incorrect temporal patterns from anomalous or noisy data. This poses a significant challenge for accurate anomaly detection. To overcome these challenges, we introduce a new dynamic decomposition and reconstruction anomaly detection algorithm, DMRAD. DMRAD captures various regular patterns of multivariate time series by designing a dynamic decomposition module that learns trend and seasonal features. By integrating improved channel and temporal attention mechanisms, DMRAD effectively learns the correlations within the sequence and dependencies across different sequences, thereby enhancing the model’s capacity to distinguish between features and extract relevant information. DMRAD incorporates a latent anomaly-noise detection algorithm to identify and suppress the influence of noise and latent anomalies, elevating the overall accuracy of anomaly detection. Extensive experimental comparisons demonstrate that DMRAD achieves state-of-the-art performance on a variety of datasets for real-world application scenarios.

Abstract:
Data valuation provides a principled framework for quantifying the contribution of data to model training. It plays a crucial role in trustworthy machine learning (ML) by supporting data curation, enhancing interpretability, and enabling fair incentive mechanisms in data markets. Shapley value is a popular method for data valuation, but accurate estimation remains computationally expensive, particularly at the dataset level. In this paper, we introduce Ensemble Shapley, an efficient framework tailored for dataset-level valuation on the sharded structure. To reduce the computational costs, we propose a two-phase estimation method that apportions the intensive contribution computation costs across disjoint data shards and strategically reuses the computation results, achieving efficient contribution evaluation through the ensemble of shard models. However, weak shard models trained on noisy data may degrade ensemble models’ performance. To solve this, we introduce a behavior-driven guided sampling method that pairs noisy datasets with benign ones, ensuring reliable contribution estimates despite the noise. We also derive an advantageous lower bound for the number of evaluation iterations that balances efficiency and accuracy by the number of shards. Experimental results show Ensemble Shapley has superior efficiency over existing methods while maintaining comparable accuracy across various ML tasks, and demonstrates strong scalability and integration potential.
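
For reference, the quantity being estimated is the classical Shapley value. The sketch below computes it exactly by averaging marginal contributions over all orderings, which is only feasible for a handful of players and is the cost that the abstract's method aims to avoid. The shard names and the accuracy table are hypothetical, and this is not the Ensemble Shapley estimator.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical utility: accuracy of a model trained on a given set of datasets.
ACCURACY = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.6,
    frozenset({"B"}): 0.6,
    frozenset({"C"}): 0.0,
    frozenset({"A", "B"}): 0.8,
    frozenset({"A", "C"}): 0.6,
    frozenset({"B", "C"}): 0.6,
    frozenset({"A", "B", "C"}): 0.8,
}
values = shapley_values(["A", "B", "C"], ACCURACY.__getitem__)
```

In this toy table, C never changes the accuracy and receives a value of zero, while A and B are interchangeable and split the total of 0.8 equally.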

Affiliations: State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China; College of Electrical Engineering and Automation, Fuzhou University, Fuzhou, China; Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Mathematics, Hangzhou Normal University, Hangzhou, China

Abstract:
Integrating graph neural networks (GNNs) with variational inference (VI) provides a promising direction for blending structured prior knowledge with observational empirical data for data-driven industrial process modeling. However, this task requires inference of the normalized adjacency matrix (NAM), where each row is normalized to be non-negative and to sum to one, matching the support of the Dirichlet distribution. This requirement presents two main technical challenges: 1) intractable Kullback-Leibler (KL) divergence optimization between Dirichlet distributions, and 2) constrained optimization for standard-gradient-descent-based neural network parameter optimization. To handle issue 1), we first formulate the inference of the NAM as a differential equation simulation problem and derive an easy-to-implement expression to iteratively improve the KL divergence without explicitly computing it. Based on this, to alleviate issue 2), we use Riemannian optimization to precondition this simulation procedure, which ensures that the inferred NAM conforms to the row-normalization constraint. After that, we collectively designate these approaches for NAM inference as Preconditioned-Simulation-Induced Variational Inference (ψ-VI), and provide theoretical guarantees of convergence. On this foundation, we propose a new graph neural network architecture, the Preconditioned-Simulation-Induced-based Variational Graph Neural Network (ψ-VGNN) for industrial process modeling. Finally, we validate the efficacy of ψ-VGNN through comprehensive experiments on industrial modeling tasks.

Abstract:
Natural Language Processing (NLP) aims to analyze text or speech via techniques in the computer science field. It serves applications in the domains of healthcare, commerce, education, and so on. Particularly, NLP has been widely applied to the education domain and its applications have enormous potential to help teaching and learning. In this survey, we review recent advances in NLP with a focus on solving problems relevant to the education domain. In detail, we begin with introducing the related background and the real-world scenarios in education to which NLP techniques could contribute. Then, we present a taxonomy of NLP in the education domain and highlight typical NLP applications including question answering, question construction, automated assessment, and error correction. Next, we illustrate the task definition, challenges, and corresponding cutting-edge techniques based on the above taxonomy. In particular, LLM-involved methods are included for discussion due to the wide usage of LLMs in diverse NLP applications. After that, we showcase some off-the-shelf demonstrations in this domain, which are designed for educators or researchers. At last, we conclude with five promising directions for future research, including generalization over subjects and languages, deployed LLM-based systems for education, adaptive learning for teaching and learning, interpretability for education, and ethical consideration of NLP techniques.

Abstract:
Clustering is an unsupervised learning task that groups data points by their inherent similarities. Non-automatic clustering algorithms face significant challenges when the true number of clusters is unknown or changes dynamically, as they require this number to be predefined. This paper provides a comprehensive review of automatic clustering algorithms specifically designed to handle such uncertainty. These algorithms are systematically classified from three key perspectives: clustering framework (classical vs. deep), clustering strategy (e.g., density-based, model-based, graph-theoretic, subspace methods), and the use of labeled data (unsupervised vs. semi-supervised). We analyze each algorithm based on its core principles, key contributions, strengths, and limitations. Furthermore, we address the current challenges in this area and propose future research directions to enhance the scalability, robustness, and effectiveness of automatic clustering algorithms.

Abstract:
Bipartite graphs are a fundamental resource for inferring user preferences and understanding consumer decision-making behavior, driving the rapid development of graph-based recommendation systems. The challenge of feedback scarcity (i.e., sparse interactions) in graph-based recommendation systems has prompted extensive research on exploration-exploitation strategies to trade off exploration-exploitation in interactions. However, the tight coupling between exploration-exploitation behavior and bipartite graph learning poses a major obstacle to achieving the balance needed for long-term recommendation gains (i.e., enhancing diversity while maintaining accuracy). To better understand this issue, we conduct preliminary theoretical analysis and find that uneven user-item interactions heighten the risk of exploration-exploitation imbalance. Even popular graph Transformer architectures, though effective in exploration-exploitation, may still exhibit uncertain and unintended behavior, which can result in the imbalance issues. To address this problem, we propose a pane-aware graph Transformer architecture for personalized recommendation from a dynamic resource allocation perspective. Our approach builds upon graph Transformers and introduces two key modules: 1) pane partitioning and 2) a two-stage learning strategy, to explicitly intervene in the exploration-exploitation process of graph Transformers, rather than relying on the models to perform potentially uncertain and unintended balancing behavior. Specifically, the first module uses clustering and large language model reasoning to accurately constrain exploration-exploitation regions, guiding the balancing behavior of downstream graph Transformers. The second module separates training focuses to improve information allocation and ensure marginal gains. Extensive experiments on real-world datasets demonstrate that the proposed method effectively balances exploration-exploitation in user-item interactions and achieves long-term recommendation gains.

Abstract:
Accurately mapping unstructured cyber threat intelligence (CTI) text to standardized attack techniques remains a fundamental yet challenging task, especially under zero-shot conditions where no labeled instances from the target CTI corpus are available. Existing methods largely rely on heuristic rules or supervised classifiers, and therefore often generalize poorly to unseen, long-tail, or evolving techniques. To overcome these limitations, we propose RA-CTI, a retrieval-augmented Large Language Model (LLM) framework for contextual technique identification, which instantiates a general Retrieve–Expand–Infer paradigm for aligning free-form text with ontology-driven label inventories. RA-CTI reformulates MITRE ATT&CK technique identification as semantic evidence discovery and reference-guided reasoning through three coordinated stages: 1) dense semantic retrieval for efficient candidate acquisition, 2) query expansion and evidence aggregation to improve recall and coverage, and 3) reference-aware inference that conditions on explicit MITRE ATT&CK technique definitions for fine-grained semantic alignment. An optional weakly supervised retriever adaptation module further improves domain robustness without relying on manual annotations. Extensive experiments on multiple public CTI datasets show that RA-CTI consistently outperforms state-of-the-art baselines in zero-shot multi-label identification, achieving better efficiency–accuracy trade-offs and strong generalization across LLM backbones.

Abstract:
In federated recommendation systems, model poisoning attacks aim to manipulate the gradient information of multiple target items sent back from local clients to the central server, with the goal of abnormally increasing their exposure across the system. Existing multi-target attack approaches directly manipulate multiple target items and apply a uniform attack strategy to all target items, which may lead to suboptimal promotion effectiveness. To address this issue, we introduce ProitMTA, a novel multi-target model poisoning attack framework that introduces proxy items and provides tailored attack strategies for target items. ProitMTA employs a three-stage process that balances the promotion of multiple target items while preserving recommendation quality. First, proxy item generation uses a Gaussian Mixture Model to create proxy items that represent diverse attack strategies. Second, proxy attack construction designs customized gradient manipulation strategies for each proxy item. Finally, proxy-based target item attack transfers these strategies to actual target items, enhancing their promotion while minimizing the negative impact on system performance. Through comprehensive experiments on multiple base federated recommendation frameworks and diverse real-world datasets, we demonstrate that ProitMTA outperforms existing attack methods, achieving higher success rates in target item promotion with minimal system-wide performance degradation. Our research highlights the vulnerability of federated recommendation systems when facing multi-target poisoning attacks and underscores the importance of researching effective defense mechanisms.

Abstract:
Differential privacy (DP) is the leading standard for privacy protection, providing rigorous privacy guarantees for various data. However, its conventional approach of treating all records uniformly regarding privacy risk and using a non-adaptive privacy budget (ε) often compromises data utility in subsequent analyses. This uniform treatment and fixed ε can introduce significant perturbations, making the secondary use of shared data challenging. To overcome these limitations, we introduce a novel record-sensitivity-aware and Particle Swarm Optimization (PSO)-driven customised ε-DP method for data perturbation. Our approach significantly enhances data utility without compromising privacy in data sharing by introducing three key optimisations to the traditional DP framework: First, we partition records into three sensitivity classes (high, medium, and low) based on the privacy risk. Second, we adopt a PSO mechanism to determine the optimal ε for each partition, perturbing data with a variable ε that considers sensitivity, rather than using a single, fixed ε for the entire dataset. Finally, noise is injected by grouping attributes horizontally, rather than adding noise to each attribute independently, to prevent the generation of inconsistent values in the perturbed data. Detailed experiments on real benchmark and synthetic datasets demonstrate the superiority of our method in terms of utility and privacy across seven evaluation metrics, compared to the latest state-of-the-art ε-DP methods.
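
The effect of a per-class privacy budget can be sketched with the standard Laplace mechanism, whose noise scale is the query sensitivity divided by ε. The ε values below are hypothetical placeholders (the abstract selects them with PSO, which is not implemented here), and the horizontal attribute grouping is also omitted.

```python
import random

# Hypothetical per-class budgets; smaller epsilon means stronger perturbation.
EPSILON = {"high": 0.5, "medium": 1.0, "low": 2.0}


def laplace_scale(query_sensitivity, epsilon):
    """Scale b of the Laplace mechanism: b = query_sensitivity / epsilon."""
    return query_sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(records, query_sensitivity, rng):
    """Add class-dependent Laplace noise to each (value, sensitivity_class) record."""
    return [
        value + laplace_noise(laplace_scale(query_sensitivity, EPSILON[cls]), rng)
        for value, cls in records
    ]


rng = random.Random(42)
records = [(52.0, "high"), (37.5, "medium"), (41.0, "low")]
noisy = perturb(records, query_sensitivity=1.0, rng=rng)
```

With these placeholder budgets, records in the high-sensitivity class receive noise with four times the scale of records in the low-sensitivity class, so low-risk records are distorted less than they would be under a single strict ε.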

Abstract:
As a critical branch of time series analysis, time series forecasting (TSF) focuses on predicting future trends based on historical data, and it plays a pivotal role in a wide range of applications, including meteorology, finance, and healthcare. Recently, deep learning has demonstrated significant potential in TSF. While several existing surveys have systematically summarized deep learning-based methods, we complement these foundational works by investigating emerging models, such as large language models (LLMs), and providing an in-depth comparative analysis of distinct models alongside the challenges currently facing the field. Specifically, we propose a hierarchical taxonomy based on model structural dependency, categorizing existing studies into model-specific and model-agnostic frameworks. The model-specific framework is further divided into discriminative and generative paradigms, accompanied by a detailed comparison of these distinct model types. Moreover, we systematically review prevalent time-series datasets across diverse domains, analyze their key statistics, and summarize evaluation metrics. Finally, we analyze the key challenges currently faced by TSF and explore potential future research directions. Through this systematic review and forward-looking analysis, we aim to provide novel perspectives and establish a clear classification framework of TSF methods.

Abstract:
The data collected from a mobile user in social networking services or similar platforms can contain diverse attributes with interdependencies, which together define the user’s mobility behavior. However, the current deep learning-based approaches for generating synthetic mobility data mostly use data representations that cannot fully capture the diversity and the dependencies in mobility data. In this paper, we introduce the concept of mobility graph as a graph that represents an individual’s mobility data, and we subsequently define a scalable approach for generating synthetic mobility graphs based on real mobility graphs using a recurrent neural network (RNN). Our motivation lies in the fact that contrary to other representations, graphs are flexible and can capture various attributes and dependencies. By performing an experimental evaluation, we show that the mobility graphs generated by our approach can retain the useful statistical features of the real graphs (i.e., their utility), while varying sufficiently from them (to preserve privacy). For our evaluation, we implement a data pipeline that transforms real-world mobility data into mobility graphs. We also introduce various utility and privacy metrics. Considering the rise of deep learning on graphs, the present work can be used as the basis for developing and testing more advanced approaches for mobility data generation or other mobility-related tasks.

Affiliations: School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China; Department of Network Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Department of Electrical and Computer Engineering, State University of New York at Stony Brook, Stony Brook, NY, USA; School of Software Engineering, Sun Yat-Sen University, Zhuhai, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing, China

Abstract:
Deep learning is applied to various tasks, such as image recognition and self-driving. Training acceleration is crucial for the further development of deep learning, as efficient training algorithms can help greatly reduce the time consumption and hardware usage while making real-time updates of large-scale deep learning models possible. The mainstream methods realize it through distributed training or network pruning. The former relies on abundant hardware resources and the latter may suffer from a non-negligible performance drop. In this paper, we propose CORESTR, a data-efficient training framework that asynchronously utilizes heterogeneous hardware resources. The framework consists of two major procedures. We first characterize the training status of each instance and propose a representative instance selection algorithm for reducing the total number of instances participating in each epoch of training. In the second procedure, we design a lightweight sample weighting mechanism based on meta-learning to closely approximate the convergence quality using a representative instance set selected from the full training dataset. We present the theoretical rationale for our approach and evaluate its training performance with several classical models and datasets. Experimental results demonstrate that our training method can achieve an average speedup of 4.8× and reach a higher final accuracy compared with state-of-the-art methods while relying on only a small part of the training data.

Abstract:
Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short-videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which aims to provide a large-scale and real-world short-video propagation network across various platforms to facilitate research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across the 5 biggest Chinese platforms, annotated with the propagation influence from level 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). Our NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short-videos. Comprehensive experimental results evaluated by both classification and regression metrics on our XS-Video dataset indicate the superiority of our method for SPIR. Our dataset and code will be open upon acceptance.

Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable success in various scenarios. However, their impressive performance is under the assumption of class balance (i.e., equal training sample distribution across various categories). Once trapped in the class-imbalanced issue, the GNN-based models typically under-represent the minority ones, resulting in decreased performance compared to balanced graphs. A promising solution is to balance the graph in a generative manner. However, the existing studies overlook the consistency between the synthesized sample and its corresponding class. Furthermore, the homophily assumption (i.e., like attracts like) undermines the topological diversity of graphs, thereby complicating the capability of models to capture the true distribution and boundaries of the categories. To this end, we propose a Consistency-Aware and Loose Homophily guided generative method for class-imbalanced graphs, namely GraphCALH. Specifically, we design a consistency-aware feature synthesis method to balance the node-wise characteristics and the class-wise commonality for the synthesized samples. Moreover, we devise a loose homophily guided topology modeling method to enrich the topological diversity and simplify category boundaries. The experimental results on eleven class-imbalanced datasets demonstrate that the proposed GraphCALH outperforms ten state-of-the-art methods.

Abstract:
Distance-based unsupervised text classification is a method within text classification that leverages the semantic similarity between a label and a text to determine label relevance. This method provides numerous benefits, including fast inference and adaptability to expanding label sets, as opposed to zero-shot, few-shot, and fine-tuned neural networks that require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is “similar” to a label or query. Similarity between a text and label is determined in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we perform an exploratory study to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets by conducting experiments on a diverse collection of realistic multi-label text classification (MLTC) datasets. We find that similarity distributions show statistically significant differences across models, datasets, and even label sets. We propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
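
One plausible reading of label-specific thresholding is sketched below: for each label, pick the similarity cut-off that maximizes F1 on a validation set. The F1 objective, the candidate grid (the observed similarities), and the toy data are assumptions; the abstract does not state which criterion the authors optimize.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from two equally long lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(similarities, y_true):
    """Pick the observed similarity value that maximizes F1 for one label."""
    candidates = sorted(set(similarities))
    return max(
        candidates,
        key=lambda t: f1_score(y_true, [s >= t for s in similarities]),
    )


def fit_thresholds(val_sims, val_labels):
    """val_sims[label] and val_labels[label] are per-text lists for that label."""
    return {
        label: best_threshold(val_sims[label], val_labels[label])
        for label in val_sims
    }


# Toy validation data: text-label similarities and ground-truth assignments.
val_sims = {
    "sports": [0.82, 0.40, 0.75, 0.30],
    "politics": [0.35, 0.55, 0.20, 0.50],
}
val_labels = {
    "sports": [True, False, True, False],
    "politics": [False, True, False, True],
}
thresholds = fit_thresholds(val_sims, val_labels)
```

In this toy data the two labels end up with different cut-offs (0.75 and 0.5), which illustrates why a single uniform threshold can be a poor fit when similarity distributions differ across labels.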

Abstract:
In the existing traffic prediction scenarios, the lack of accompanying event data, noise interference and insufficient supervised signals seriously restrict the effect of actual traffic prediction. Meanwhile, currently prevalent graph neural networks often struggle to capture effective semantic structures when dealing with learning tasks involving diverse specific events, consequently exhibiting limited generalization and transfer capabilities. This study focuses on crowd gathering events in transportation scenarios and conducts quantitative analysis of their potential risks to traffic network. Relying on the massive online crowd query data produced in Location Based Services (LBS), we propose a generative strategy for node and edge augmentation based on event-traffic interactions, which seeks to generate richer supervised signals. Furthermore, in response to the generative graph structure derived from event chains that fail to match the contextual semantic information, we utilize comparative learning for self-supervised training as the auxiliary proxy task of time series prediction. Experiments on the benchmark datasets of real road networks show that the proposed method is effective in identifying the traffic risk of road segments, especially when the breakdown probability is greater than 50%.

Abstract:
Recommender systems (RSs), as crucial components of online services, can help users efficiently obtain information they may like. In reality, RSs face long-term threats. Attackers manipulate recommendation results by injecting malicious data in order to obtain benefits. At present, research on the security of RSs lacks a comprehensive understanding of attack capabilities. Moreover, existing defense strategies have not yet been systematically associated with attack characteristics. More importantly, existing defense methods rarely focus on real unlabeled data in practical application scenarios for anomaly detection and forensics. Therefore, this survey systematically analyzes the security of RSs and provides new insights. Specifically, from an attack perspective, we first categorize attack models into: attack strategies based on targets, attack strategies against security and privacy, attack strategies based on prior knowledge, and attack strategies against other RSs. Second, from a defense perspective, we divide existing detection models into: behavioral representation based on statistics, detection based on hidden features, detection against privacy attacks, anomaly discovery based on association mining, and abnormality forensics for real-world data. Finally, we propose several potential research directions aimed at providing guidance for the security research of RSs.

Abstract:
As a pivotal variant of multi-label classification, hierarchical text classification (HTC) faces unique challenges due to its intricate taxonomic hierarchy. Recent state-of-the-art approaches improve performance by considering both the global hierarchy covering all labels and the local hierarchy indicating the substructure of sample-specific ground-truth labels. However, they often over-condense hierarchical information into one or several tokens, which may cause the loss of useful knowledge. Accordingly, we propose a dual classifier model with global and local hierarchies (DCGL). It adopts prompt tuning-based BERT as the backbone, where the global hierarchy is integrated into the soft prompt template. The resulting classifier branch is termed the global pipeline. To mitigate information loss caused by hierarchy condensation, we introduce a parallel local hierarchy-aware classifier pipeline. This local pipeline acquires label-level classification features through text propagation on the label hierarchy and aligns these features with oracle label representations of the local hierarchy via graph contrastive learning, which serves as a novel strategy for local hierarchy incorporation. Thereby, DCGL obtains more granular and targeted features and captures local hierarchy information such as label co-occurrence and local structure. Moreover, since the global and local pipelines capture distinct yet complementary information, we further apply mutual knowledge distillation to bridge the gap between their output logits and facilitate mutual learning. To better control the degree of distillation, we design a dynamic temperature negatively correlated with label confidence. Comprehensive experiments demonstrate that our DCGL outperforms several representative HTC methods.

Abstract:
Traffic prediction is essential for modern transportation systems, enhancing traffic management and urban planning. Accurate predictions of traffic flow and speed are crucial for understanding road usage, mitigating congestion, and providing real-time traffic monitoring and dynamic route guidance, thus improving road safety and infrastructure efficiency. Traditional research has often focused on predicting traffic flow or speed independently, leading to higher resource consumption due to the need for separate models. Few studies have explored the simultaneous prediction of both metrics, with recent attempts failing to account for spatial correlations, resulting in suboptimal performance. To address these challenges, we propose MTNet, a multi-task learning framework for joint traffic flow and speed prediction. MTNet employs a Transformer-like Encoder-Decoder architecture to process and enhance feature representations, capturing complex spatio-temporal correlations. Specifically, MTNet extracts intra-task dependencies using a cross-task interaction module and models task-specific spatiotemporal dependencies using spatial and temporal-aware modules with cascaded residual structures. Additionally, spatio-temporal positional encoding is integrated to increase awareness of long-term and long-distance dependencies. Extensive experiments on three diverse traffic datasets—Manchester, PeMSD4, and PeMSD8—demonstrate that MTNet significantly outperforms state-of-the-art methods in both traffic flow and speed prediction. MTNet achieves substantial improvements in prediction accuracy and efficiency, striking an optimal balance between performance and computational resource usage.

Abstract:
Dynamic Graph Neural Networks (GNNs) combine temporal information with GNNs to capture structural, temporal, and contextual relationships in dynamic graphs simultaneously, leading to enhanced performance in various applications. As the demand for dynamic GNNs continues to grow, numerous models and frameworks have emerged to cater to different application needs. There is a pressing need for a comprehensive survey that evaluates the performance, strengths, and limitations of various approaches in this domain. This paper aims to fill this gap by offering a thorough comparative analysis and experimental evaluation of dynamic GNNs. It covers 91 dynamic GNN models with a novel taxonomy, 17 dynamic GNN training frameworks, and commonly used benchmarks. We also evaluate the experimental results of ten representative dynamic GNN models and five frameworks on six datasets. Evaluation metrics focus on convergence accuracy, training efficiency, and GPU memory usage, enabling a thorough performance comparison across various models and frameworks. From the analysis and evaluation results, we identify key challenges and offer principles for future research to enhance the design of models and frameworks in the dynamic GNNs field.

Abstract:
Implicit sentiment analysis (ISA) presents significant challenges due to the absence of salient cue words. Previous methods have struggled with insufficient data and limited reasoning capabilities to infer underlying opinions. Integrating multi-task learning (MTL) with large language models (LLMs) offers the potential to enable models of varying sizes to reliably perceive and recognize genuine opinions in ISA. However, existing MTL approaches are constrained by two sources of uncertainty: data-level uncertainty, arising from hallucination problems in LLM-generated contextual information, and task-level uncertainty, stemming from the varying capacities of models to process contextual information. To handle these uncertainties, we propose MT-ISA, a novel MTL framework that enhances ISA by leveraging the generation and reasoning capabilities of LLMs through automatic weight learning (AWL). Specifically, MT-ISA constructs auxiliary tasks using generative LLMs to supplement sentiment elements and incorporates automatic MTL to fully exploit auxiliary data. We introduce data-level and task-level AWL, which dynamically identify relationships and prioritize more reliable data and critical tasks, enabling models of varying sizes to adaptively learn fine-grained weights based on their reasoning capabilities. Three strategies are investigated for data-level AWL, which are integrated with homoscedastic uncertainty for task-level AWL. Extensive experiments reveal that models of varying sizes achieve an optimal balance between primary prediction and auxiliary tasks in MT-ISA. This underscores the effectiveness and adaptability of our approach.
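For readers unfamiliar with homoscedastic uncertainty weighting, the sketch below shows the commonly used form of that technique, in which each task loss is scaled by a learnable log-variance. It illustrates the general idea only and is not MT-ISA's implementation; the function name and the choice of the regression-style weighting `0.5 * exp(-s) * L + 0.5 * s` are assumptions.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(
    task_losses: Sequence[float], log_vars: Sequence[float]
) -> float:
    """Combine task losses with homoscedastic-uncertainty weights.

    log_vars[i] plays the role of s_i = log(sigma_i^2), a per-task value
    that would be learned jointly with the model (assumption: here it is
    simply passed in). A large s_i down-weights task i, while the
    0.5 * s_i term penalizes inflating the uncertainty without limit.
    """
    return sum(
        0.5 * math.exp(-s) * loss + 0.5 * s
        for loss, s in zip(task_losses, log_vars)
    )


# With all log-variances at zero, every task gets weight 0.5.
equal_weighting = uncertainty_weighted_loss([1.0, 2.0], [0.0, 0.0])

# For a single task with loss L, the objective is minimized at s = log(L).
at_optimum = uncertainty_weighted_loss([2.0], [math.log(2.0)])
below = uncertainty_weighted_loss([2.0], [0.0])
above = uncertainty_weighted_loss([2.0], [2.0])
```

Because the weights are learned rather than hand-tuned, tasks with noisier losses receive a smaller share of the gradient.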

Abstract:
Identifying influential nodes in complex networks poses a significant challenge, requiring a delicate balance among efficiency, accuracy, and redundancy elimination. To address this, we propose TIG-IM (A Two-stage Inductive GNN for Influence Maximization), a novel framework based on GNN. In Stage 1, we construct a multi-task learning-based global scorer. This scorer utilizes a Graph Attention Network (GAT) to simultaneously regress the SIR propagation score and our newly proposed 3D-Bridging Centrality. This multi-task design enables efficient identification of cross-community bridge nodes to generate a high-quality candidate pool. In Stage 2, we design a local selector that introduces a composite loss function with sparsity and diversity regularization. This selector performs a secondary re-ranking on the candidate subgraph to produce a final seed set that is both structurally dispersed and possesses complementary propagation potential. Theoretical analysis proves that our scorer satisfies monotonicity and that the local loss can be interpreted as a MAP estimation. Extensive experiments on twelve real-world networks from diverse domains, including social, academic, and biological, demonstrate that TIG-IM significantly outperforms various state-of-the-art baselines in both spreading effectiveness and computational efficiency. Ablation studies further validate the synergistic value of our three core components: 3D-Bridging Centrality, sparsity penalty, and diversity-based re-ranking.

Abstract:
AI has increasingly influenced modern society, recently in particular through significant advancements in Large Language Models (LLMs). However, high computational and storage demands of LLMs still limit their deployment in resource-constrained environments. Knowledge distillation addresses this challenge by training a small student model from a larger teacher model. Previous research has introduced several distillation methods for both generating training data and training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated and compared. In this work, we enlarge the set of available methods by applying critique-revision prompting to distillation for data generation and by synthesizing existing training methods. We systematically compare the distillation methods on the widely used Commonsense Question-Answering (CQA), Extended Stanford Natural Language Inference (ESNLI), and StrategyQA datasets. While we measure performance via student model accuracy, we employ a human-grounded study to evaluate explainability. We contribute new distillation methods and their comparison in terms of both performance and explainability. This should further advance the distillation of small language models and, thus, contribute to broader applicability and faster diffusion of language models.

Abstract:
Temporal knowledge graphs (TKGs) effectively capture the dynamic evolution of events over time, emerging as a critical driving force in the advancement of artificial intelligence. In recent years, temporal knowledge graph reasoning (TKGR) has garnered significant attention for its ability to address the intrinsic incompleteness of TKGs. Among various TKGR methods, reinforcement learning (RL)-based multi-hop reasoning stands out due to the decision-making capabilities and interpretability. However, existing multi-hop reasoning methods are predominantly designed for the transductive setting where test entities are observed during training, and they exhibit limited performance in the fully-inductive setting where training and test entities are entirely disjoint. Moreover, the sparse links of newly emerged unseen entities in TKGs hinder multi-hop reasoning methods from utilizing sufficient actions to construct multi-hop relational paths, ultimately impairing reasoning accuracy. To address these challenges, we propose ARLIE (Adaptive Reinforcement Learning with Inductive Embeddings), a novel method capable of conducting multi-hop reasoning in both fully-inductive and transductive settings over TKGs. Specifically, ARLIE consists of the following two key components. (1) A context-based inductive representation method generates fine-grained embeddings for unseen entities by exploiting query-related contextual information. (2) After obtaining temporal evolution and semantic dependencies of unseen entities, an action-augmented adaptive RL framework leverages diverse actions to infer missing elements step-by-step over TKGs. Finally, experimental results show that ARLIE surpasses state-of-the-art TKGR methods across both fully-inductive and transductive settings.

Abstract:
The processing of continuous data streams in non-stationary environments has gained increasing attention. However, supervised online learning is often limited by label availability. Furthermore, it is crucial to develop a stable and high-performance online method in non-stationary environments. To tackle these issues, we propose a dynamic chunk-based active learning framework (DCAL). This framework includes a dynamic dual-stage query strategy and an enhanced active learning model. Specifically, the proposed query strategy, referred to as DyDQS, evaluates sample value comprehensively by considering local density, uncertainty, and dynamic imbalance ratio. This approach selects samples that are both representative and uncertain, while also enhancing the likelihood of selecting minority class samples. Additionally, we introduce an enhanced active learning model, named eBLS-W, which is based on the broad learning system (BLS). We redesign the update rule of BLS and equip it with a kernel mapping to improve its robustness and performance, enabling it to better handle non-stationary environments. The effectiveness of the DyDQS, eBLS-W, and DCAL was validated through experiments on synthetic datasets with drift and real-world datasets. The results demonstrate that our approach outperforms other advanced methods in terms of robustness and accuracy.

Abstract:
The latent factor analysis (LFA) model has been widely used to uncover latent relationships from high-dimensional sparse (HiDS) matrices. However, the performance of LFA depends largely on the hyper-parameter value used in the model training. Traditional hyper-parameter tuning methods such as grid search suffer from inefficiency and inaccuracy. In recent years, the particle swarm optimization (PSO) algorithm offers an intelligent approach to adaptively adjust the hyper-parameter of LFA. However, the global optimal solution of the hyper-parameter tuning problem is not fixed due to its dynamic decision space. Therefore, it is difficult for PSO to determine the best hyper-parameter for each training iteration. To address this problem, this paper proposes a novel hyper-parameter adaptive adjustment algorithm called dynamic stochastic reorientation PSO (DSR-PSO) that adapts to constantly changing decision spaces. By randomly adjusting the search directions of particles and perturbing the elite particles, the dynamic property of the DSR-PSO can be enhanced, so that the hyper-parameter can be adjusted in real time throughout the model training process. Furthermore, this paper proves the convergence of the DSR-PSO and gives its convergence condition by discussing the distribution of the characteristic roots. Finally, this paper proposes the DSR-PSO-based LFA (DPL) model by incorporating the DSR-PSO-based hyper-parameter adjustment into the LFA to promote its model training, and analyzes its complexity. Experimental results on benchmark datasets show that the proposed DPL surpasses state-of-the-art LFA models in terms of accuracy and efficiency.

Abstract:
Trustworthy explanations are essential for supporting decision-making in high-stakes domains such as healthcare, finance, and environmental risk management. Counterfactual explanations provide actionable insights by revealing how small changes to input features can alter a model’s prediction. However, for such explanations to be reliable and interpretable, they must be not only valid but also plausible, reflecting realistic alternatives within the data-generating process. Existing approaches often approximate plausibility using statistical similarity or domain constraints, neglecting the underlying causal dependencies among variables. In this work, we introduce a novel framework for generating causally plausible counterfactuals by optimizing their class-conditional likelihood under a neural structural causal model (SCM). This model provides a modular, heteroscedastic representation of structured causal knowledge, enabling probabilistic evaluation of candidate explanations, conditioned on causal relationships. Through empirical validation across synthetic, semi-synthetic, and real-world datasets, we demonstrate that our approach improves both statistical and causal plausibility without sacrificing validity or proximity. The proposed method bridges structural modeling with interpretable machine learning, offering a principled and scalable approach to trustworthy explanation in data-driven systems.

Abstract:
Sea surface temperature (SST) prediction is a critical task in ocean science, supporting various applications, such as weather forecasting, fisheries management, and storm tracking. While existing data-driven methods have demonstrated significant success, they often neglect to leverage the rich domain knowledge accumulated over the past decades, limiting further advancements in prediction accuracy. The recent emergence of large language models (LLMs) has highlighted the potential of integrating domain knowledge for downstream tasks. However, the application of LLMs to SST prediction remains underexplored, primarily due to the challenge of integrating ocean domain knowledge and numerical data. To address this issue, we propose Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework for global SST prediction. To the best of our knowledge, this work presents the first systematic effort to construct an Ocean Knowledge Graph (OKG) specifically designed to represent diverse ocean knowledge for SST prediction. We then develop a graph embedding network to learn the comprehensive semantic and structural knowledge within the OKG, capturing both the unique characteristics of individual sea regions and the complex correlations between them. Finally, we align and fuse the learned knowledge with fine-grained numerical SST data and leverage a pre-trained LLM to model SST patterns for accurate prediction. Extensive experiments on the real-world dataset demonstrate that OKG-LLM consistently outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and potential to advance SST prediction.

Abstract:
With the rapid advancement of model architectures, the accuracy of industrial predictive modeling now largely hinges on data quality. However, real-world industrial datasets frequently contain low-quality samples that compromise model performance. While existing data preprocessing methods can effectively remove salient outliers, they persistently struggle to detect latent anomalies. To address this challenge, this paper proposes a fast data attribution-based dataset selection method for regression models, termed FASTDAR, which enables the model to identify training samples that are detrimental to its performance and subsequently perform dataset selection. FASTDAR integrates deep network data attribution into the Leave-One-Out (LOO) influence calculation paradigm of linear regression models through model linearization and parameter dimensionality reduction. Considering the synergy among samples, the truncated Monte Carlo method is adopted to estimate marginal influences of each sample, and sample utility is defined for dataset selection. Validation on real-world industrial datasets demonstrates the effectiveness and practicality of our method. Experimental results show that models trained on FASTDAR-selected data achieve significant performance improvements on both validation and test sets, outperforming multiple baseline methods.
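The appeal of the LOO paradigm for linear regression is that leave-one-out effects have a closed form, so no retraining is needed. The sketch below illustrates this with the standard identity for ordinary least squares, where the LOO prediction residual equals the ordinary residual divided by one minus the leverage. It covers only a single-feature linear model and is not the paper's method; all function names are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_residuals_closed_form(xs: List[float], ys: List[float]) -> List[float]:
    """LOO prediction residuals from one fit: e_i / (1 - h_ii)."""
    n = len(xs)
    slope, intercept = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    out = []
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mx) ** 2 / sxx  # h_ii for simple regression
        residual = y - (slope * x + intercept)
        out.append(residual / (1.0 - leverage))
    return out


def loo_residuals_refit(xs: List[float], ys: List[float]) -> List[float]:
    """Reference implementation: refit the model n times, once per held-out point."""
    out = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append(ys[i] - (slope * xs[i] + intercept))
    return out


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 20.0]  # the last point is a deliberate outlier
fast = loo_residuals_closed_form(xs, ys)
slow = loo_residuals_refit(xs, ys)
most_suspicious = max(range(len(xs)), key=lambda i: abs(fast[i]))
```

The closed form agrees with brute-force refitting while requiring a single fit, which is the property that makes linearizing a deep model attractive for fast attribution.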

Abstract:
Community Detection (CD) in weighted social networks is a highly active research field, celebrated for its profound practical implications across a multitude of disciplines. Genetic algorithms (GAs) are frequently explored to tackle CD problems, leveraging their capability to navigate the extensive discrete search space effectively. Throughout the evolutionary process, genetic operators such as crossover and mutation assume pivotal roles in effectively exploring the vast solution space. Nonetheless, prevailing GA-based approaches often ignore crucial topology information, particularly information regarding edge weights, resulting in compromised algorithm performance. In light of this, this paper introduces Edge Information-based GA (EIGA) to effectively solve CD problems in weighted networks. This is achieved specifically through the innovative designs of edge-weight-aware crossover and mutation operators. These novel edge-weight-aware operators improve the extraction of meaningful community structures, advancing knowledge discovery from social networks. Empirical findings demonstrate the superior performance of EIGA over numerous state-of-the-art algorithms across various real-world and synthetic benchmark networks.

Abstract:
In data science, predictive tasks such as classification, regression, and missing value imputation are fundamental challenges in tabular data analysis. This research investigates the application of Large Language Models (LLMs) to these tasks. While LLMs excel in natural language understanding, their effectiveness on structured tabular data remains limited due to minimal exposure during pretraining. To address this gap, we construct a large-scale corpus of annotated tables and introduce a tailored pretraining framework. Our trained model achieves significant improvements over baselines, with an average gain of 8.9% in classification and 10.7% in regression tasks. We further evaluate its performance in zero-shot and few-shot prediction, as well as in-context learning scenarios. Extensive experiments demonstrate substantial gains over existing benchmarks, highlighting the potential of LLMs for tabular data processing. Additionally, we apply our approach across multiple open-source LLMs and demonstrate its generalizability. This work establishes a new benchmark for enhancing tabular intelligence through LLM-based pretraining.

Abstract:
Graph contrastive learning (GCL) is a powerful self-supervised learning approach. However, existing GCL methods are designed for homophilic graphs, using low-pass filters that struggle to capture high-frequency components in heterophilic graphs. We propose Graph Contrastive Learning with Regularization and stabilization techniques enhanced high-pass Filter (GCLRF). The REgularization and Stabilization techniques enhanced High-pass filter (RESH) can serve as a mutually promoting plug-in, significantly improving the performance of various homophilic GCL training strategies on heterophilic graphs. We also investigate four component orderings in RESH and identify the optimal fusion mechanism, demonstrating its critical impact on performance. Experiments show that GCLRF achieves state-of-the-art (SOTA) performance across six benchmark datasets in node classification and clustering. Notably, on the Cornell dataset, GCLRF improves classification accuracy by 6.76% and achieves a 23.64% relative improvement in clustering normalized mutual information (NMI).