TKDE2025

Abstract:
The incorporation of target distribution significantly enhances the success of deep clustering. However, most of the related deep clustering methods suffer from two drawbacks: (1) manually-designed target distribution functions with uncertain performance and (2) cluster misassignment accumulation. To address these issues, a Self-Correcting Clustering (Self-CC) framework is proposed. In Self-CC, a robust target distribution solver (RTDS) is designed to automatically predict the target distribution and alleviate the adverse influence of misassignments. Specifically, RTDS divides the high confidence samples selected according to the cluster assignments predicted by a clustering module into labeled samples with correct pseudo labels and unlabeled samples of possible misassignments by modeling its training loss distribution. With the divided data, RTDS can be trained in a semi-supervised way. The critical hyperparameter which controls the semi-supervised training process can be set adaptively by estimating the distribution property of misassignments in the pseudo-label space with the support of a theoretical analysis. The target distribution can be predicted by the well-trained RTDS automatically, optimizing the clustering module and correcting misassignments in the cluster assignments. The clustering module and RTDS mutually promote each other forming a positive feedback loop. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed Self-CC.
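
The loss-based division described above can be illustrated with a minimal sketch. The abstract does not say which model RTDS fits to the training losses; the code below assumes a two-component one-dimensional Gaussian mixture, a common choice in noisy-label learning, and treats the low-loss component as "correct pseudo labels". The function names `fit_two_gaussians` and `split_by_loss` and the `threshold` parameter are hypothetical.

```python
import math


def _gauss(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(losses, iters=50):
    """Fit a 2-component 1-D Gaussian mixture with EM (assumed model, not from the paper)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]
    variances = [max(hi - lo, 1e-3) ** 2 / 4] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each loss value.
        resp = []
        for x in losses:
            p = [weights[k] * _gauss(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-4)  # floor keeps the density finite
    return weights, means, variances


def split_by_loss(losses, threshold=0.5):
    """Return (labeled, unlabeled) index lists; low-loss samples count as labeled."""
    weights, means, variances = fit_two_gaussians(losses)
    clean = 0 if means[0] <= means[1] else 1
    labeled, unlabeled = [], []
    for i, x in enumerate(losses):
        p = [weights[k] * _gauss(x, means[k], variances[k]) for k in range(2)]
        posterior_clean = p[clean] / (sum(p) or 1e-300)
        (labeled if posterior_clean > threshold else unlabeled).append(i)
    return labeled, unlabeled
```

In this sketch the labeled indices would feed the supervised part of a semi-supervised objective and the unlabeled indices the unsupervised part; how Self-CC weights the two is set adaptively in the paper and is not reproduced here.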

Abstract:
Graph representation provides a more effective method for describing the underlying data relationships. Nonetheless, the vast majority of data consists solely of feature information without a corresponding graph structure, rendering graph representation techniques ineffective. Much of the existing research on graph data has concentrated on how to effectively characterize graph nodes, with little focus on how to adaptively construct internal structures and potential connections between sample pairs. On the other hand, the existing graph construction techniques generate linear inter-instance affinity distributions based on a probabilistic perspective, which might not give a true picture of the relationships. To overcome the above problems, motivated by the fact that samples and inter-sample affinities can be viewed as the source and strength of a magnetic field, respectively, a novel tangent-based affinity measurement algorithm that utilizes a parameter to dynamically adjust the sparsity of the magnetic field is derived. In addition, Adaptive Magnetic-Graph Clustering (AMGC) is designed for graph representation and clustering. AMGC ensures instance-level and cluster-level consistency using a novel dual decoder, where the reconstructed graph retains local affinity and global topology, and contrastive learning defines new sample pairs based on positive-incentive noise, making the learned embedding more discriminative. Finally, we perform empirical experiments to demonstrate the superiority of the model.

Abstract:
Label enhancement (LE) remains a challenging task that aims to mitigate the lack of label distributions. Existing LE work typically focuses on formulating a projection between the feature space and the label distribution space from a discriminative model perspective, which preserves relevance consistency, i.e., the sign of the recovered label distribution should be consistent with the logical label. Different from previous algorithms, we formulate this problem from a causal perspective and present a novel LE method via the structured causal model (LESCM). Specifically, the proposed LESCM establishes the causal graph under the assumption that the label distribution is a cause of the feature and the logical label, which naturally satisfies the definition of label distribution learning (LDL). By capturing the underlying causal relationships, we can significantly boost the interpretability and identifiability of label enhancement. Meanwhile, in addition to relevance consistency, LESCM is encouraged to maintain order consistency, which assigns a higher description degree in the recovered label distribution to the positive labels than to the negative labels. Empirically, extensive experiments on several label distribution learning data sets validate the effectiveness of LESCM.

Abstract:
Provenance is a standardised record that describes how entities, activities, and agents have influenced a piece of data; it is commonly represented as graphs with relevant labels on both their nodes and edges. With the growing adoption of provenance in a wide range of application domains, users are increasingly confronted with an abundance of graph data, which may prove challenging to process. Graph kernels, on the other hand, have been successfully used to efficiently analyse graphs. In this paper, we introduce a novel graph kernel called provenance kernel, which is inspired by and tailored for provenance data. We employ provenance kernels to classify provenance graphs from three application domains. Our evaluation shows that they perform well in terms of classification accuracy and yield competitive results when compared against existing graph kernel methods and the provenance network analytics method, while being more efficient in computing time. Moreover, the provenance types used by provenance kernels are a symbolic representation of a tree pattern which can, in turn, be described using the domain-agnostic vocabulary of provenance. Provenance types therefore allow for the creation of explanations of predictive models built on them.

Abstract:
Graph neural networks (GNNs) are recognized as a significant methodology for handling graph-structured data. However, with the increasing prevalence of learning scenarios involving multiple graphs, traditional GNNs mostly overlook the relationships between nodes across different graphs, mainly because traditional message passing is confined to each individual graph. In this paper, we propose a novel GNN architecture called cross-graph interaction networks (GInterNet) to enable inter-graph message passing. Specifically, we develop a cross-graph topology construction module to uncover and learn the potential topologies between nodes across different graphs. Furthermore, we establish inter-graph message passing based on the learned cross-graph topologies, achieving cross-graph interaction by aggregating information from different graphs. Finally, we employ cross-graph construction functions involving the relationships between contextual information and cross-graph topology structure to iteratively update the cross-graph topologies. Unlike existing related approaches, GInterNet is designed as a cross-graph interaction paradigm for inter-graph message passing. It enables multi-graph interaction during the message passing process. Additionally, it is a plug-and-play framework that can be easily embedded into other models. We evaluate its performance in semi-supervised and unsupervised learning scenarios involving multiple graphs. A detailed theoretical analysis and extensive experimental results show that GInterNet improves the performance and robustness of the base models.

Abstract:
Knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer from two significant problems. First, knob tuning performs many useless evaluations, even with diverse searching methods, because knobs differ in their sensitivity to a given workload. Second, a single evaluation of a knob configuration may lead to overestimation or underestimation because of query performance uncertainty. To solve the above problems, we propose a query uncertainty-aware knob classifier, called KnobCF, to enhance knob tuning. Our method has three contributions: (1) We propose uncertainty-aware configuration estimation to improve the tuning process. (2) We design a few-shot uncertainty estimator that requires no extra data collection, ensuring high efficiency in practical tasks. (3) We provide a flexible framework that can be integrated into existing knob tuners and DBMSs without modification. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves the tuning results. In particular, on TPCC, our method achieves competitive tuning results with only 60% to 70% of the time consumed by full workload evaluations.

Affiliations: College of Electronic and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, State Key Laboratory of Intelligent Autonomous Systems, Frontier Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai, China; College of Electronic and Information Engineering, Tongji University, Shanghai, China; Faculty of Data Science, City University of Macau, Taipa, China; School of Computer Science and Technology, Tongji University, Shanghai, China

Abstract:
Traditional shortest-path graph kernels generate for each graph a histogram-like feature map, whose elements represent the number of occurrences of non-isomorphic shortest paths in this graph. The histogram-like feature map does not contain the distributions of the shortest paths within and across graphs, causing inaccurate graph similarities. To this end, we propose a novel graph kernel called the Distributional Shortest-Path (DSP) graph kernel to embrace both types of distribution information. Since the distribution of substructures (e.g., the shortest paths) follows a power law like that of words in natural language, we utilize neural language models to learn each node’s distributional shortest-path feature map, encompassing the distributions and dependencies of the shortest paths in each graph. Moreover, we design the Partition Kernel (PK) to capture the dataset-wide distribution information of the shortest paths. PK projects similar (i.e., belonging to the same partition) distributional shortest-path node feature maps to the same point in the Reproducing Kernel Hilbert Space. Finally, Kernel Mean Embedding (KME) is applied to compute graph feature maps and efficiently construct the DSP graph kernel. Empirical experiments demonstrate that DSP outperforms state-of-the-art graph kernels on most benchmark datasets.
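
As a point of reference, the traditional histogram-like feature map that this abstract criticizes can be sketched as follows. This is a simplified illustration, not the DSP kernel: it assumes unweighted, undirected graphs given as adjacency dictionaries with integer node ids, and it identifies a shortest path only by its two endpoint labels and its length. The names `shortest_path_histogram` and `sp_kernel` are hypothetical.

```python
from collections import Counter, deque


def shortest_path_histogram(adj, labels):
    """Count shortest paths keyed by (endpoint label, endpoint label, length)."""
    hist = Counter()
    for src in adj:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if src < dst:  # count each unordered node pair once
                a, b = sorted((labels[src], labels[dst]))
                hist[(a, b, d)] += 1
    return hist


def sp_kernel(h1, h2):
    """Kernel value as the dot product of two histogram feature maps."""
    return sum(count * h2[key] for key, count in h1.items())
```

Two graphs are compared only through matching counts, so information about how the paths are distributed within each graph or across the dataset is lost, which is the gap the abstract says DSP addresses.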

Abstract:
Low-rank latent factorization of tensors is a powerful method for analyzing high-dimensional and incomplete (HDI) data derived from cyber-physical systems, particularly when computational resources are limited. However, traditional tensor factorization models are inherently linear and struggle to capture the complex nonlinear spatiotemporal dependencies embedded in the data. This paper introduces a novel latent factorization model, namely Auto-encoding Neural Tucker Factorization (ANTucF) for accurate spatiotemporal representation learning on the HDI tensor. It constructs a low-rank Tucker factorization-based neural network to capture a potential latent manifold in space and time, built upon three core ideas: a) applying density-oriented modeling principles with neural networks to facilitate latent feature learning via positional and temporal encoding of mode indices; b) constructing a Tucker interaction tensor to represent all possible spatiotemporal interactions among distinct spatial and temporal modes; and c) enhancing the uniqueness of the core tensor in Tucker factorization by incorporating nonlinear spatiotemporal representation learning via auto-encoding latent interaction learning. The ANTucF model outperforms several state-of-the-art LFT models in estimating missing observations on real-world datasets. Additionally, visualizations demonstrate its ability to capture finer spatiotemporal dynamics by nonlinearly exploiting an optimal Tucker core tensor using a data-driven approach.
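
For context, the linear Tucker model that the abstract contrasts with ANTucF reconstructs each tensor entry as a multilinear combination of a core tensor and one factor matrix per mode. The sketch below shows only this standard linear form for a third-order tensor; it is not ANTucF, and the function name `tucker_entry` and the nested-list layout are assumptions made for illustration.

```python
def tucker_entry(core, A, B, C, i, j, k):
    """Entry (i, j, k) rebuilt from a Tucker core and three factor matrices.

    core[p][q][r] is the core tensor; A, B, C hold one row of latent
    features per index of the first, second and third mode.
    """
    return sum(
        core[p][q][r] * A[i][p] * B[j][q] * C[k][r]
        for p in range(len(core))
        for q in range(len(core[0]))
        for r in range(len(core[0][0]))
    )
```

Every term is a product of latent features scaled by a core weight, which is why the plain model is linear in each factor; the abstract's contribution is to replace this interaction with a learned nonlinear one.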

Abstract:
Mining dense subgraphs on multilayer graphs offers the opportunity for more in-depth discoveries than classical dense subgraph mining on single-layer graphs. However, the existing approaches fail to ensure the denseness of a discovered subgraph on layers of users’ interest and simultaneously gain partial supports on the denseness from other layers. In this paper, we introduce a novel dense subgraph model called FocusCore (FoCore for short) for multilayer graphs, which can pay more attention to the layers focused by users. The FoCore decomposition problem, that is, identifying all nonempty FoCores in a multilayer graph, can be addressed by executing the peeling process with respect to all possible configurations of focus and background layers. Using the nice properties of FoCores, we devise an interleaved peeling algorithm and a vertex-centric algorithm toward efficient FoCore decomposition. We further design a novel cache to minimize the average retrieval time for an arbitrary FoCore without the need for full FoCore decomposition, which significantly improves efficiency in large-scale graph mining tasks. As an application, we propose a FoCore-decomposition-based algorithm to approximate the densest subgraph in a multilayer graph with a provable approximation guarantee. The extensive experiments on real-world datasets verify the effectiveness of the FoCore model and the efficiency of the proposed algorithms.
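
The peeling process mentioned above can be sketched in its classical form. The code below is not the FoCore definition, which distinguishes focus and background layers; it assumes a simplified setting in which a vertex survives only if its degree is at least `k` in every given layer. The function name `peel` and the input layout (one adjacency dictionary of sets per layer, all over the same vertex set) are hypothetical.

```python
def peel(layers, k):
    """Return the vertices whose degree stays >= k in every layer after peeling."""
    alive = set(layers[0])
    degree = [{v: len(layer[v]) for v in alive} for layer in layers]
    # Start with every vertex that already violates the degree bound in some layer.
    stack = [v for v in alive if any(d[v] < k for d in degree)]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed through an earlier stack entry
        alive.remove(v)
        # Removing v lowers the degree of its surviving neighbours in each layer.
        for layer, d in zip(layers, degree):
            for u in layer[v]:
                if u in alive:
                    d[u] -= 1
                    if d[u] < k:
                        stack.append(u)
    return alive
```

Running such a routine once per configuration of focus and background layers is the brute-force strategy that the abstract's interleaved and vertex-centric algorithms aim to improve on.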

Abstract:
Patent data generally includes information from different perspectives or of different types, and its heterogeneous attributes can be greatly beneficial to data clustering analysis. However, existing patent analysis methods focus mainly on patent text cues; such a strategy depends only on feature information to capture the data characteristics and fails to produce an informative multi-type patent representation. Therefore, in this paper, to model the underlying structure and relationships of patent data, we employ a knowledge graph to depict the heterogeneous attributes of patents, and propose a novel Knowledge Graph-based Patent Clustering (KGPC) method, in which relationship reconstruction in the knowledge graph and clustering-oriented representation refinement for patent clustering are jointly considered. The model has three components, i.e., entity representation refinement, relationship reconstruction, and self-supervised entity clustering. Given a patent knowledge graph as input, the entity representation refinement can be mutually boosted by the relationship reconstruction and self-supervised clustering objective, thereby leading to a balanced clustering-oriented output. Extensive experiments on several real-world patent knowledge graph datasets validate the effectiveness of KGPC when compared with the state-of-the-art.

Abstract:
Most previous learning-based graph matching algorithms solve the quadratic assignment problem (QAP) by dropping one or more of the matching constraints and adopting a relaxed assignment solver to obtain sub-optimal correspondences. Such relaxation may actually weaken the original graph matching problem, and in turn hurt the matching performance. In this paper, we propose a deep learning-based graph matching framework that works for the original QAP without compromising on the matching constraints. In particular, we design an affinity-assignment prediction network to jointly learn the pairwise affinity and estimate the node assignments, and we then develop a differentiable solver inspired by the probabilistic perspective of the pairwise affinities. Aiming to obtain better matching results, the probabilistic solver refines the estimated assignments in an iterative manner to impose both discrete and one-to-one matching constraints. The proposed method is trained in a supervised manner and evaluated on several benchmarks covering semantic keypoint correspondence, matching of social networks, and pure QAP instances. It exhibits state-of-the-art matching performance on all benchmarks.

Abstract:
Causal learning is a recent and widely adopted paradigm to handle algorithmic discrimination. Contemporary causality-based studies on fairness only capture the unfair causal effect of a single-dimensional sensitive attribute (i.e., individual-dimension, like gender) on the decision. They neglect the socially constructed nature of individual attributes, such as macro-dimensional factors. However, social science research shows that discrimination against an individual may be related to disadvantaged treatments, which operate at the macro-dimension (e.g., neighborhood economic level). This multi-dimensional conceptualization is pertinent to matters of fairness, and it is crucial to be fair for individuals across multiple dimensions. The hidden confounder is another bottleneck for addressing fairness concerns based on causal techniques. To tackle these issues, we present an approach, called MultiCFL, which accounts for multi-dimensional sources of discrimination and unifies them via causal tools. To handle hidden confounders, MultiCFL first trains a causal effect variational autoencoder as the causal estimator to learn the causal mechanisms behind observational data. Subsequently, it makes selective use of estimated causal relationships to construct a predictive model with multi-dimensional fairness. Experimental results confirm the effectiveness of MultiCFL, and prove the necessity of considering multiple dimensional properties to mitigate unfairness.

Abstract:
A transparent decision-making process is essential for developing reliable and trustworthy recommender systems. For sequential recommendation, it means that the model can identify key items that account for its recommendation results. However, achieving both interpretability and recommendation performance simultaneously is challenging, especially for models that take the entire sequence of items as input without screening. In this paper, we propose an interpretable framework (named PTSR) that enables a pattern-wise transparent decision-making process without extra features. It breaks the sequence of items into multi-level patterns that serve as atomic units throughout the recommendation process. The contribution of each pattern to the outcome is quantified in the probability space. With a carefully designed score correction mechanism, the pattern contribution can be implicitly learned in the absence of ground-truth key patterns. The final recommended items are those that most key patterns strongly endorse. Extensive experiments on five public datasets demonstrate remarkable recommendation performance, while statistical analysis and case studies validate the model interpretability.

Abstract:
Ensemble clustering can utilize the complementary information among multiple base clusterings, and obtain a clustering model with better performance and more robustness. Despite its great success, there are still two problems in the current ensemble clustering methods. First, most ensemble clustering methods treat all base clusterings equally. Second, the final ensemble clustering result often relies on k-means or other discretization procedures to uncover the clustering indicators, thus obtaining unsatisfactory results. To address these issues, we propose a novel ensemble clustering method based on structured graph learning, which can directly extract clustering indicators from the obtained similarity matrix. Moreover, our method takes the correlation among the base clusterings into full consideration and can effectively reduce the redundancy among them. Extensive experiments on artificial and real-world datasets demonstrate the efficiency and effectiveness of our method.
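
One common way to build a similarity matrix from base clusterings is the co-association matrix, sketched below. The abstract does not state which similarity matrix the proposed method uses, so this is only an illustration of the general idea; the function name `co_association` and the optional `weights` argument, which shows how base clusterings could be treated unequally, are hypothetical.

```python
def co_association(base_clusterings, weights=None):
    """Weighted fraction of base clusterings that put samples i and j together.

    base_clusterings is a list of label lists, one per base clustering, all
    of the same length. With weights=None every base clustering counts equally.
    """
    if weights is None:
        weights = [1.0] * len(base_clusterings)
    total = sum(weights)
    n = len(base_clusterings[0])
    return [
        [
            sum(w for labels, w in zip(base_clusterings, weights)
                if labels[i] == labels[j]) / total
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With equal weights the entry for samples `i` and `j` is simply the share of base clusterings that agree on them; non-uniform weights down-weight unreliable or redundant base clusterings.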

Abstract:
In many real-world applications, sequential rule mining (SRM) can offer prediction and recommendation functions for a variety of services. It is an important technique of pattern mining to discover all valuable rules that can reveal the temporal relationship between objects. Although several algorithms of SRM are proposed to solve various practical problems, there are no studies on the problem of targeted mining. Targeted sequential rule mining aims to obtain those interesting sequential rules that users focus on, thus avoiding the generation of other invalid and unnecessary rules. It can further improve the efficiency of users in analyzing rules and reduce the consumption of computing resources. In this paper, we first present the relevant definitions of target sequential rules and formulate the problem of targeted sequential rule mining. Then, we propose an efficient algorithm called TaSRM. Several pruning strategies and an optimization are introduced to improve the efficiency of TaSRM. Finally, a large number of experiments are conducted on different benchmarks, and we analyze the results in terms of running time, memory consumption, and scalability, as well as query cases with different query rules. It is shown that the novel algorithm TaSRM and its variants can achieve better experimental performance compared to the baseline algorithm.

Abstract:
Shapelets are interclass discriminative subsequences that can be used to characterize target classes. Learning shapelets by continuous optimization has recently been studied to improve classification accuracy. However, there are two issues in previous studies. First, since the locations where shapelets appear in the time series are determined by only their shapes, shapelets may appear at incorrect and non-discriminative locations in the time series, degrading the accuracy and interpretability. Second, the theoretical interpretation of learned shapelets has been limited to binary classification. To tackle the first issue, we propose a continuous optimization that learns not only shapelets but also their probable locations in a time series, and we show theoretically that this enhances feature discriminability. To tackle the second issue, we provide a theoretical interpretation of shapelet closeness to the time series for target / off-target classes when learning with softmax loss, which allows for multi-class classification. We demonstrate the effectiveness of the proposed method in terms of accuracy, runtime, and interpretability on the UCR archive.
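
The first issue above, that a shapelet's location is determined by shape alone, can be seen in the usual shapelet-to-series distance: the minimum distance over all sliding windows. The sketch below shows that conventional distance only, not the location-aware method proposed in the paper; the function name `shapelet_distance` and the use of mean squared distance are assumptions.

```python
import math


def shapelet_distance(series, shapelet):
    """Return (min mean-squared distance, start index) over all sliding windows."""
    length = len(shapelet)
    best, best_pos = math.inf, -1
    for start in range(len(series) - length + 1):
        d = sum(
            (series[start + i] - shapelet[i]) ** 2 for i in range(length)
        ) / length
        if d < best:  # strict comparison keeps the earliest best match
            best, best_pos = d, start
    return best, best_pos
```

When a series contains the same shape at several positions, every position matches equally well and the reported location is arbitrary, which is the ambiguity that learning probable locations is meant to resolve.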

Abstract:
The rapid growth of graph data poses significant challenges in storage, transmission, and particularly the training of graph neural networks (GNNs). To address these challenges, graph condensation (GC) has emerged as an innovative solution. GC focuses on synthesizing a compact yet highly representative graph, enabling GNNs trained on it to achieve performance comparable to those trained on the original large graph. The notable efficacy of GC and its broad prospects have garnered significant attention and spurred extensive research. This survey paper provides an up-to-date and systematic overview of GC, organizing existing research into five categories aligned with critical GC evaluation criteria: effectiveness, generalization, efficiency, fairness, and robustness. To facilitate an in-depth and comprehensive understanding of GC, this paper examines various methods under each category and thoroughly discusses two essential components within GC: optimization strategies and condensed graph generation. We also empirically compare and analyze representative GC methods with diverse optimization strategies based on the five proposed GC evaluation criteria. Finally, we explore the applications of GC in various fields, outline the related open-source libraries, and highlight the present challenges and novel insights, with the aim of promoting advancements in future research.

Abstract:
Causal feature selection has recently received increasing attention in machine learning and data mining, especially in the era of Big Data. Existing causal feature selection algorithms select unique causal features of the single class label as the optimal feature subset. However, a single class label usually has multiple classes, and it is unreasonable to select the same causal features for different classes of a single class label. To address this problem, we employ the class-specific mutual information to evaluate the causal information carried by each class of the single class label, and theoretically analyze the unique relationship between each class and the causal features. Based on this, a Label-aware Causal Feature Selection algorithm (LaCFS) is proposed to identify the causal features for each class of the class label. Specifically, LaCFS uses the pairwise comparisons of class-specific mutual information and the size of class-specific mutual information values from the perspective of each class, and follows a divide-and-conquer framework to find causal features. The correctness and application condition of LaCFS are theoretically proved, and extensive experiments are conducted to demonstrate the efficiency and superiority of LaCFS compared to the state-of-the-art approaches.

Abstract:
Graph kernels have been regarded as a successful tool for handling a variety of graph applications since they were proposed. However, most of the proposed graph kernels are based on the R-convolution framework, which decomposes graphs into a set of substructures at the same abstraction level and compares all substructure pairs equally; these methods inherently overlook the utility of the hierarchical structural information embedded in graphs. In this paper, we propose Hierarchical Abstracting Graph Kernels (HAGK), a novel set of graph kernels that compare graphs’ hierarchical substructures to capture and utilize the latent hierarchical structural information fully. Instead of generating non-structural substructures, we reveal each graph’s hierarchical substructures by constructing its hierarchical abstracting, specifically, the hierarchically organized nested node sets adhering to the principle of structural entropy minimization. To compare a pair of hierarchical abstractings, we propose two novel substructure matching approaches, Local Optimal Matching (LOM) and Priority Ordering Matching (POM), to find appropriate matching between the substructures by different strategies recursively. Extensive experiments demonstrate that the proposed kernels are highly competitive with the existing state-of-the-art graph kernels, and verify that the hierarchical abstracting plays a significant role in the improvement of the kernel performance.

Abstract:
Topic modeling is a commonly used text analysis tool for discovering latent topics in a text corpus. However, while topics in a text corpus often exhibit a hierarchical structure (e.g., cellphone is a sub-topic of electronics), most topic modeling methods assume a flat topic structure that ignores the hierarchical dependency among topics, or utilize a predefined topic hierarchy. In this work, we present a novel Hierarchical Deep Document Model (HDDM) to learn topic hierarchies using a variational autoencoder framework. We propose a novel objective function, sum of log likelihood, instead of the widely used evidence lower bound, to facilitate the learning of hierarchical latent topic structure. The proposed objective function can directly model and optimize the hierarchical topic-word distributions at all topic levels. We conduct experiments on four real-world text datasets to evaluate the topic modeling capability of the proposed HDDM method compared to state-of-the-art hierarchical topic modeling benchmarks. Experimental results show that HDDM achieves considerable improvement over benchmarks and is capable of learning meaningful topics and topic hierarchies. To further demonstrate the practical utility of HDDM, we apply it to a real-world medical notes dataset for clinical prediction. Experimental results show that HDDM can better summarize topics in medical notes, resulting in more accurate clinical predictions.

Abstract:
With the rise of Large Language Models (LLMs), tourists increasingly use them for route planning by entering keywords for attractions, instead of relying on traditional manual map services. LLMs provide generally reasonable suggestions, but often fail to generate optimal plans that account for detailed user requirements, given the vast number of potential POIs and possible routes based on POI combinations within a real-world road network. In this case, a route-planning API could serve as an external tool, accepting a sequence of keywords and returning the top-k best routes tailored to user requests. To address this need, this paper introduces the Keyword-Aware Top-k Routes (KATR) query, which provides more flexible and comprehensive semantics for route planning and caters to various user preferences, including flexible POI visiting order, flexible travel distance budget, and personalized POI ratings. Subsequently, we propose an explore-and-bound paradigm to efficiently process KATR queries by eliminating redundant candidates based on estimated score bounds from global to local levels. Extensive experiments demonstrate our approach’s superior performance over existing methods across different scenarios.

Abstract:
In the field of Machine Learning (ML) and data-driven applications, one of the significant challenges is the change in data distribution between the training and deployment stages, commonly known as distribution shift. This paper outlines different mechanisms for handling two main types of distribution shifts: (i) covariate shift, where the values of features or covariates change between train and test data, and (ii) concept/semantic shift, where the model experiences a shift in the concept learned during training due to the emergence of novel classes in the test phase. Our contributions are threefold. First, we formalize distribution shifts, describe how conventional methods fail to handle them adequately, and call for a model that can perform well under all types of distribution shifts simultaneously. Second, we discuss why handling distribution shifts is important and provide an extensive review of the methods and techniques that have been developed to detect, measure, and mitigate the effects of these shifts. Third, we discuss the current state of distribution shift handling mechanisms and propose future research directions in this area. Overall, we provide a retrospective synopsis of the literature on distribution shift, focusing on out-of-distribution (OOD) data that has been overlooked in the existing surveys.

Abstract:
A wide range of applications manage interval data, with selections and overlap joins being the most fundamental querying operations. Selection queries are typically evaluated using interval indexing. However, the state-of-the-art HINT index and its competitors are only designed for single query requests, while modern systems receive a large number of queries at the same time. In view of this challenge, we study the batch processing of selection queries on HINT. We propose two novel strategies termed level-based and partition-based, which operate in a per-level fashion, i.e., they collect the results for all queries at an index level before moving to the next. The new strategies reduce the cache misses when climbing the index hierarchy, and in particular, partition-based can prevent scanning every index partition more than once. Our experiments on real-world intervals showed that our batch strategies always outperform a baseline which executes queries in a serial fashion, and that partition-based is overall the most efficient one. Motivated by our shared computation techniques for query batches, we also study overlap joins anew across the entire spectrum of different setups, based on the (pre-)existence of interval indexing. For unindexed inputs, we enhance the state-of-the-art optFS join algorithm with the effective partitioning proposed for HINT; for indexed inputs, we propose a novel algorithm, HINT-join, which concurrently scans the input indices, joining partition pairs with optFS. Our tests showed the advantage of HINT-join over indexed nested-loops solutions that employ either B+-trees or probing a single HINT, even when powered by our partition-based batch processing.
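
To make the overlap join operation concrete, the sketch below shows a basic forward-scan plane sweep over two unindexed inputs. It is a textbook-style illustration under stated assumptions, not optFS or HINT-join: intervals are closed `(start, end)` tuples, two intervals overlap when each starts no later than the other ends, and the function name `overlap_join` is hypothetical.

```python
def overlap_join(R, S):
    """Return sorted (i, j) index pairs such that R[i] and S[j] overlap."""
    r = sorted(range(len(R)), key=lambda i: R[i][0])  # R indices by start
    s = sorted(range(len(S)), key=lambda j: S[j][0])  # S indices by start
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if R[r[i]][0] <= S[s[j]][0]:
            # The R interval starts first: scan S forward while S starts before it ends.
            end = R[r[i]][1]
            t = j
            while t < len(s) and S[s[t]][0] <= end:
                out.append((r[i], s[t]))
                t += 1
            i += 1
        else:
            # The S interval starts first: scan R forward symmetrically.
            end = S[s[j]][1]
            t = i
            while t < len(r) and R[r[t]][0] <= end:
                out.append((r[t], s[j]))
                t += 1
            j += 1
    return sorted(out)
```

Each overlapping pair is reported exactly once, when the interval with the earlier start is processed, so no duplicate elimination is needed.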

Abstract:
Efficient incomplete multi-view clustering has received increasing attention due to its ability to handle large-scale and missing data. Although existing methods have promising performance, 1) they typically generate anchors directly from incomplete and noisy raw data, resulting in incomplete anchor coverage and unreliable results; 2) they typically use only sparse regularization to remove noise and overlook outliers; 3) they ignore the inherent consistency of features in a view. To address these issues, we propose a smoothness-induced efficient incomplete multi-view clustering (SEIC) method. SEIC regards available data as natural anchors selected from complete data, and performs matrix decomposition only on them to obtain reliable small-size representation matrices. View-specific representation matrices are constructed as a tensor to capture consensus and guide matrix decomposition. More significantly, we enforce both smoothness and low-rank coupling on the tensor. Smoothness induces continuous variation of the tensor to further eliminate noise and enhance the relation among features. Benefiting from the noise robustness of SEIC, we design an adaptive noise balance parameter that renders SEIC parameter-free. Furthermore, by constructing a sparse anchor graph on the learned tensor, we propose the spectral clustering version SEIC-SC. Experiments on multiple datasets demonstrate the superior performance and efficiency of SEIC and SEIC-SC.

Abstract:
Travel Time Estimation (TTE) stands as a cornerstone of efficient transportation systems. However, the critical imperative of privacy preservation within the TTE context remains notably underexplored. This gap underscores the pressing necessity for innovative solutions that prioritize the safeguarding of users’ geo-privacy, particularly in light of the expanding prevalence of data-driven TTE algorithms. In this paper, a novel privacy-preserving TTE framework, CRATE, is proposed to ensure comprehensive privacy preservation for TTE without compromising service quality. CRATE achieves this objective by identifying random routes within a transportation network that yield identical travel times to the actual, privacy-rich route. This is accomplished through exploiting the embedding representations for road segments and routes, followed by the development of a highly efficient heuristic for random route generation. Furthermore, a travel time aggregation and calibration model is devised to enhance estimation accuracy while upholding user privacy. Case studies conducted on three real-world vehicular trajectory datasets demonstrate that CRATE attains comparable estimation accuracy to state-of-the-art non-privacy-preserving TTE algorithms while maintaining strict privacy protection. Additionally, CRATE’s efficiency is showcased through deployment on both high- and low-end mobile handsets spanning the past decade.

Abstract:
Most clustering algorithms require setting one or more parameters, which rely on prior knowledge or are constantly adjusted based on external indicators. To address the issues of requiring external index guidance, blindness, and time-consuming parameter setting for clustering algorithms on complex data, we propose a novel Parameter-Adaptive Border Peeling clustering algorithm (PABP). The PABP algorithm initially employs the maximum number of neighbors identified through natural neighbor search to automatically ascertain the number of local neighborhoods. At the same time, the Gaussian kernel bandwidth can be adaptively obtained in density measurement, which can highlight high-density areas. Secondly, the number of peels is adaptively determined by the coefficient of variation of density during the iterative border peeling process. Lastly, labels are assigned to core points based on graph connections, while the clustering of border points is accomplished via label propagation. PABP does not require users to adjust parameters based on prior knowledge or external indicators throughout the entire process. In the experiment, PABP was compared with seven other advanced clustering algorithms on 13 synthetic datasets, 10 UCI datasets, and Olivetti Face and MNIST datasets. The results indicate that the clustering performance of PABP is superior to the compared algorithms.

Abstract:
Multi-layer graphs have emerged as a new representation of multi-faceted relationships between entities in the real world. Community detection on multi-layer graphs has been investigated to gain deeper insights into the modular structures of real-world graphs. As an effective and efficient approach to community detection, structural clustering has been investigated on single-layer graphs. However, it has been overlooked in the study of community detection on multi-layer graphs. In this paper, we give a formulation of structural clustering on multi-layer graphs for the first time. Two polynomial-time algorithms are proposed to solve the problem. Furthermore, two indexes, namely the core index and the interval index, with respective preferences to time efficiency and space efficiency, are designed to improve the efficiency of the algorithms. The experiments demonstrate the effectiveness of structural clustering in improving the quality of community detection results on multi-layer graphs. The experiments also verify the improvement in running time due to the use of the proposed indexes.

Abstract:
Cohesive subgraph computation on bipartite graphs has drawn significant research interest recently. As a popular cohesive subgraph model, the $k$-bitruss is defined as the maximal subgraph where each edge is contained in at least $k$ butterflies (a butterfly is a (2, 2)-biclique). The bitruss decomposition problem, which aims to compute all $k$-bitrusses for $k \geq 0$, is widely studied. The state-of-the-art CPU-based solutions require extensive costs to construct an index structure for grouping butterflies, leading to scalability challenges on large bipartite graphs. In this paper, we explore bitruss decomposition with GPU by leveraging the parallel computing capabilities of GPU architectures. As the index-based approach requires extensive space and the memory resources of GPUs are limited, we propose GBiD, which is a peeling-based algorithm on GPUs that utilizes a block-centric computation scheme to enable space-efficient bitruss decomposition without any indexing structure. In addition, cost-aware common neighbor exploration and neighbor list accessing optimizations are proposed to enhance GBiD by reducing the cost of enumerating butterflies and accessing the graph structure during the peeling process. Extensive experiments conducted on 10 real-world datasets demonstrate that our proposed techniques significantly surpass existing CPU-based solutions in terms of both space and time efficiency.
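
To make the bitruss definition concrete, the following Python sketch counts the butterflies containing each edge and then peels edges in order of support. It is a naive sequential illustration, not GBiD or any of the CPU-based index methods; the edge-list input format (upper vertex, lower vertex) and the function names are assumptions.

```python
from collections import defaultdict


def butterfly_support(edges):
    """Number of butterflies containing each edge (u, v) of a bipartite graph."""
    adj_u = defaultdict(set)  # upper vertex -> set of lower neighbors
    adj_v = defaultdict(set)  # lower vertex -> set of upper neighbors
    for u, v in edges:
        adj_u[u].add(v)
        adj_v[v].add(u)
    support = {}
    for u, v in edges:
        count = 0
        for u2 in adj_v[v]:
            if u2 != u:
                # Every common lower neighbor of u and u2 other than v
                # closes one butterfly through the edge (u, v).
                count += len(adj_u[u] & adj_u[u2]) - 1
        support[(u, v)] = count
    return support


def bitruss_numbers(edges):
    """Naive peeling: map each edge to the largest k whose k-bitruss contains it.

    Support is recomputed from scratch after every removal, which is simple
    but far slower than incremental updates.
    """
    remaining = set(edges)
    result = {}
    k = 0
    while remaining:
        support = butterfly_support(remaining)
        edge = min(remaining, key=lambda e: support[e])
        k = max(k, support[edge])
        result[edge] = k
        remaining.remove(edge)
    return result
```

The $k$-bitruss can then be read off as the set of edges whose bitruss number is at least $k$.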

Abstract:
The increasing prevalence of large-scale graphs presents a significant challenge for Graph neural networks (GNNs) training due to their computational demands, limiting the applicability of GNNs in various scenarios. In response to this challenge, graph condensation (GC) is proposed as a promising acceleration solution, focusing on generating an informative compact graph that enables efficient training of GNNs while retaining performance. Despite the potential to accelerate GNN training, existing GC methods overlook the quality of large training graphs during both the training and inference stages. They indiscriminately emulate the training graph distributions, making the condensed graphs susceptible to noises within the training graph and significantly impeding the application of GC in intricate real-world scenarios. To address this issue, we propose robust graph condensation (RobGC), a plug-and-play approach for GC to extend the robustness and applicability of condensed graphs in noisy graph structure environments. Specifically, RobGC leverages the condensed graph as a feedback signal to guide the denoising process on the original training graph. A label propagation-based alternating optimization strategy is in place for the condensation and denoising processes, contributing to the mutual purification of the condensed graph and training graph. Additionally, as a GC method designed for inductive graph inference, RobGC facilitates test-time graph denoising by leveraging the noise-free condensed graph to calibrate the structure of the test graph. Extensive experiments show that RobGC is compatible with various GC methods, significantly boosting their robustness.

Abstract:
Traditional statistical time series forecasting models rely on model identification methods to identify the worthiest model variants to investigate; therefore, the model parameters change with the statistical features of rolling windows to reach optimality. Currently, although deep-learning-based methods achieve promising multivariate forecasting performance, their representations of variable correlations are consistent regardless of the observed local time series properties and dynamic cross-variable relations, rendering them prone to overfitting. To bridge this gap, we propose FPPformer-MD, a novel inconsistent time series forecasting transformer. FPPformer-MD leverages multiresolution analysis to transform each univariate series into multiple frequency scales and evaluate the local variable correlations via their variances. Thus, FPPformer-MD receives richer input features, and its inner inconsistent cross-variable attention mechanism enables the adaptive extraction of cross-variable features. To further alleviate the overfitting problem, we apply dynamic mode decomposition to perform cross-variable data augmentation, which reconstructs the sequence outliers with other correlated sequences during the model training process. Extensive experiments conducted on thirteen real-world benchmarks demonstrate the state-of-the-art performance of FPPformer-MD.

Abstract:
The $k$-vertex connected ($k$-VC) subgraph, which remains connected after the removal of any fewer than $k$ vertices, is an essential structure in graph mining. It has found many applications, such as survivable network design and web search optimization. However, existing studies focus on mining maximal $k$-VCs, which are excessively large yet less cohesive in real applications. In this paper, we study the minimum $k$-VC search (MinVC) problem, seeking to find a $k$-VC with the minimum number of vertices. We formally prove that this problem is NP-hard and then propose two algorithms to obtain the exact solution. The basic method, called Enum, follows a branch-and-bound framework with some pruning rules, which directly enumerates all possible vertex sets. Nonetheless, it suffers from efficiency issues due to the non-hereditary property of the $k$-VC model. To address this challenge, we propose an advanced method, called VCtoB, which divides the MinVC problem into several new sub-problems, called the fixed-size $k$-VC problems. Each of them can be solved efficiently by exploiting the hereditary property of the $s$-bundle model. Finally, our empirical experiments on 139 real-world networks demonstrate that VCtoB achieves performance improvement of up to six orders of magnitude over the baseline.
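
The $k$-VC definition and the MinVC objective can be illustrated with a brute-force Python sketch. It is exponential and intended only for tiny graphs; it is neither Enum nor VCtoB. The adjacency format (a dict mapping each vertex to a set of neighbors), the function names, and the common convention that a $k$-VC has more than $k$ vertices are assumptions.

```python
from itertools import combinations


def is_connected(vertices, adj):
    """Check whether the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in vertices and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices


def is_k_vertex_connected(vertices, adj, k):
    """True if the induced subgraph has more than k vertices and stays
    connected after removing any set of fewer than k vertices."""
    vertices = set(vertices)
    if len(vertices) <= k:
        return False
    for r in range(k):
        for removed in combinations(vertices, r):
            if not is_connected(vertices - set(removed), adj):
                return False
    return True


def minimum_k_vc(adj, k):
    """Smallest vertex set inducing a k-VC, found by trying sizes in
    increasing order; returns None if no k-VC exists."""
    nodes = sorted(adj)
    for size in range(k + 1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_k_vertex_connected(subset, adj, k):
                return set(subset)
    return None
```

On a triangle with one pendant vertex attached, the whole graph is 1-VC but not 2-VC, and the minimum 2-VC is the triangle itself.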

Abstract:
Network motifs provide deep insight into the functional abilities of networks, and have proven useful in various practical applications. Existing studies reveal that different definitions of motifs may be needed for different temporal networks. In this study, we focus on a class of temporal networks in which the nodes and edges remain fixed, but the edge labels vary regularly with timestamps. First, we propose a proper definition of temporal motifs, which appear continuously within sufficiently large time intervals, to properly reinterpret the recurrent and statistically significant nature of motifs in temporal networks. Second, we develop a low polynomial time solution to find temporal motifs for all possible time intervals with a top-to-bottom and right-to-left scheme, based on the analyses of the properties of temporal motifs. Third, we develop a theoretically faster incremental solution to efficiently find temporal motifs to support continuous updates of temporal networks, by identifying unaffected time intervals and unnecessary edges. Finally, we have conducted extensive experiments to verify the efficiency and usefulness of our static and incremental solutions.

Abstract:
Non-negative Matrix Factorization (NMF) is an intensively used technique for obtaining parts-based, lower dimensional and non-negative representation. Researchers in biology, medicine, pharmacy and other fields often prefer NMF over other dimensionality reduction approaches (such as PCA) because the non-negativity of the approach naturally fits the characteristics of the domain problem and its results are easier to analyze and understand. Despite these advantages, obtaining exact characterization and interpretation of the NMF’s latent factors can still be difficult due to their numerical nature. Rule-based approaches, such as rule mining, conceptual clustering, subgroup discovery and redescription mining, are often considered more interpretable but lack lower-dimensional representation of the data. We present a version of the NMF approach that merges rule-based descriptions with advantages of part-based representation offered by the NMF. Given the numerical input data with non-negative entries and a set of rules with high entity coverage, the approach creates the lower-dimensional non-negative representation of the input data in such a way that its factors are described by the appropriate subset of the input rules. In addition to revealing important attributes for latent factors, their interaction and value ranges, this approach allows performing focused embedding potentially using multiple overlapping target labels.

Abstract:
In this paper, we study two different problems that investigate relations between given vertices $s$ and $t$. The first problem is to generate the $k$-hop-constrained $s$-$t$ path graph, i.e., the subgraph consisting of all paths from $s$ to $t$, where each path is not longer than $k$ and $s$ and $t$ each appear only once. To solve the first problem, we propose the A-BiBFS$^{+}$ method enhanced with the reduced neighbor index and an approximate vertex grouping strategy. The second problem is to generate the $k$-hop-constrained $s$-$t$ simple path graph, i.e., the subgraph consisting of all $k$-hop-constrained simple paths from $s$ to $t$, which is proved to be NP-hard on directed graphs. Based on A-BiBFS$^{+}$, we propose the EVE method to tackle the second problem, which exploits the paradigm of edge-wise examination rather than exhaustively enumerating all simple paths. Extensive experiments show that both A-BiBFS$^{+}$ and EVE significantly outperform all baselines. Moreover, by taking EVE as a built-in block, state-of-the-art methods for hop-constrained simple path enumeration can be accelerated by up to an order of magnitude.
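
For intuition about the second problem, the hop-constrained simple path graph can be built by the exhaustive enumeration that the abstract says EVE avoids. The Python sketch below is that naive baseline only; the adjacency-dict input format and the function name are assumptions.

```python
def simple_path_graph(adj, s, t, k):
    """Edges lying on at least one simple path from s to t with at most k hops.

    `adj` maps each vertex to an iterable of its out-neighbors (directed graph).
    """
    edges = set()
    path = [s]
    on_path = {s}

    def dfs(u):
        if u == t:
            # Record every edge of the completed s-t path.
            edges.update(zip(path, path[1:]))
            return
        if len(path) - 1 == k:
            # Hop budget used up without reaching t.
            return
        for v in adj.get(u, ()):
            if v not in on_path:
                path.append(v)
                on_path.add(v)
                dfs(v)
                on_path.discard(v)
                path.pop()

    dfs(s)
    return edges
```

The running time grows with the number of simple paths, which can be exponential in the hop bound, so this is only suitable for checking small cases.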

Abstract:
Disentanglement techniques used in collaborative filtering uncover interaction intents between nodes, improving the interpretability of node representations and enhancing recommendation performance. However, existing disentanglement methods still face the following two problems. 1) They focus on local structural features derived from direct node interactions, overlooking the comprehensive graph structure, which limits disentanglement accuracy. 2) The disentanglement process depends on backpropagation signals derived from recommendation tasks, lacking direct supervision, which may lead to biases and overfitting. To address the issues, we propose the Intent Propagation Contrastive Collaborative Filtering (IPCCF) algorithm. Specifically, we design a double helix message propagation framework to more effectively extract the deep semantic information of nodes, thereby improving the model's understanding of interactions between nodes. An intent message propagation method is also developed that incorporates graph structure information into the disentanglement process, thereby expanding the consideration scope of disentanglement. In addition, contrastive learning techniques are employed to align node representations derived from the structure and intents, providing direct supervision for the disentanglement process, mitigating biases, and enhancing the model's robustness to overfitting. The experiments on three real data graphs illustrate the superiority of the proposed approach.

Abstract:
Multi-view spectral clustering has attracted considerable attention since it can explore common geometric structures from diverse views. Nevertheless, existing min-min framework-based models adopt internal minimization to find the view combination with the minimized within-cluster variance, which will lead to effectiveness loss since the real clusters often exhibit high within-cluster variance. To address this issue, we provide a novel scalable min-max multi-view spectral clustering (SMMSC) model to improve clustering performance. Besides, anchor graphs, rather than full sample graphs, are utilized to reduce the computational complexity of graph construction and singular value decomposition, thereby enhancing the applicability of SMMSC to large-scale applications. Then, we rewrite the min-max model as a minimized optimal value function, demonstrate its differentiability, and develop an efficient gradient descent-based algorithm to optimize it with linear computational complexity. Moreover, we demonstrate that the resultant solution of the proposed algorithm is the global optimum. Numerous experiments on different real-world datasets, including some large-scale datasets, demonstrate that SMMSC outperforms existing state-of-the-art multi-view clustering methods regarding clustering performance.

Abstract:
Multi-view Clustering (MVC) has achieved significant progress, with many efforts dedicated to learning knowledge from multiple views. However, most existing methods are either not applicable or require additional steps for incomplete MVC. Such a limitation results in poor-quality clustering performance and poor missing view adaptation. Besides, noise or outliers might significantly degrade the overall clustering performance, and they are not handled well by most existing methods. In this paper, we propose a novel unified framework for incomplete and complete MVC named self-learning symmetric multi-view probabilistic clustering (SLS-MPC). SLS-MPC proposes a novel symmetric multi-view probability estimation and equivalently transforms the multi-view pairwise posterior matching probability into a composition of each view's individual distribution, which tolerates missing data and may extend to any number of views. Then, SLS-MPC proposes a novel self-learning probability function without any prior knowledge and hyper-parameters to learn each view's individual distribution. Next, graph-context-aware refinement with path propagation and co-neighbor propagation is used to refine pairwise probability, which alleviates the impact of noise and outliers. Finally, SLS-MPC proposes a probabilistic clustering algorithm to adjust clustering assignments by maximizing the joint probability iteratively without category information. Extensive experiments on multiple benchmarks show that SLS-MPC outperforms previous state-of-the-art methods.

Abstract:
Unsupervised domain adaptation aims to classify unlabeled data points in the target domain using labeled data points from the source domain, where the distributions of data points in the two domains differ. To address this issue, we propose a novel method called the anchor guided unsupervised domain adaptation method (AGDA). We minimize distribution divergence in a latent feature subspace using the Maximum Mean Discrepancy (MMD) criterion. Unlike existing unsupervised domain adaptation methods, we introduce anchor points in the original space and align the data of both domains to the same anchor points rather than center points to further reduce the domain difference. We optimize the anchor-based graph in the subspace to obtain discriminative transformation matrices. This enables our model to perform better on non-Gaussian distributions than methods focusing on global structure. Furthermore, the sparse anchor-based graph reduces time complexity compared to the fully connected graph, enabling exploration of local structure. Experimental results demonstrate that our algorithm outperforms several state-of-the-art methods on various benchmark datasets.
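
As a small illustration of the MMD criterion mentioned above, the Python sketch below computes the squared MMD under a linear kernel, which reduces to the squared distance between the two domain means. The abstract does not state which kernel AGDA uses, so the linear kernel, the plain list-of-lists input, and the function name are all assumptions.

```python
def linear_mmd_squared(source, target):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2.

    `source` and `target` are non-empty lists of equal-length feature vectors.
    """
    dim = len(source[0])
    mean_s = [sum(x[j] for x in source) / len(source) for j in range(dim)]
    mean_t = [sum(x[j] for x in target) / len(target) for j in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))
```

A value of zero means the two samples have identical means in the chosen feature space; it does not by itself imply that the full distributions match, which is one reason richer kernels are often used.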

Abstract:
Collaborative filtering (CF) models have demonstrated remarkable performance in recommender systems, which represent users and items as embedding vectors. Recently, due to the powerful modeling capability of graph neural networks for user-item interaction graphs, graph-based CF models have gained increasing attention. They encode each user/item and its subgraph into a single super vector by combining graph embeddings after each graph convolution. However, each hop of the neighbor in the user-item subgraphs carries a specific semantic meaning. Encoding all subgraph information into single vectors and inferring user-item relations with dot products can weaken the semantic information between user and item subgraphs, thus leaving untapped potential. Exploiting this untapped potential provides insight into improving performance for existing recommendation models. To this end, we propose the Graph Cross-correlated Network for Recommendation (GCR), which serves as a general recommendation paradigm that explicitly considers correlations between user/item subgraphs. GCR first introduces the Plain Graph Representation (PGR) to extract information directly from each hop of neighbors into corresponding PGR vectors. Then, GCR develops Cross-Correlated Aggregation (CCA) to construct possible cross-correlated terms between PGR vectors of user/item subgraphs. Finally, GCR comprehensively incorporates the cross-correlated terms for recommendations. Experimental results show that GCR outperforms state-of-the-art models on both interaction prediction and click-through rate prediction tasks.

Abstract:
Meta learning has been recognized as an effective remedy for solving the cold-start problem in the recommendation domain. Existing models aim to learn how to generalize from the user behaviors in the training set to the testing set. However, in cold-start settings, with only a small number of training samples, the testing distribution may easily deviate from the training one, which may invalidate the learned generalization patterns and lower the recommendation performance. To alleviate this problem, in this paper, we propose a robust meta recommender framework to address the distribution shift problem. Specifically, we argue that the distribution shift may exist at both the user and interaction levels, and in order to mitigate them simultaneously, we design a novel distributionally robust model by hierarchically reweighting the training samples. Different sample weights correspond to different training distributions, and we minimize the largest loss induced by the sample weights in a simplex, which essentially optimizes the upper bound of the testing loss. In addition, we analyze the convergence rates and generalization error bound of our framework to provide more theoretical insights. Empirically, we conduct extensive experiments based on different meta recommender models and real-world datasets to verify the generality and effectiveness of our framework.

Abstract:
Multi-modal content has proven to be powerful knowledge for recommendation tasks. Most state-of-the-art multi-modal recommendation methods mainly focus on aligning the semantic spaces of different modalities to enhance the item representations and pay little attention to the recommendation-relevant knowledge within the modalities. As a result, the positive effect of this relevant knowledge is reduced and the improvement in recommendation performance is limited. In this paper, we propose a multi-modal correction network termed MMCN to enhance the item representation with the important semantic knowledge in each modality by a residual structure with attention mechanisms and a hierarchical contrastive learning framework. The residual information is obtained through self-attention and cross-attention, which can learn the relevant knowledge across different modalities effectively. Hierarchical contrastive learning further captures the relevant knowledge not only at the feature level but also at the element-wise level with a matrix. Extensive experiments on three large-scale real-world datasets show the superiority of MMCN over state-of-the-art multi-modal recommendation methods.

Abstract:
Outlier Detection (OD) has attracted extensive research due to its application in many fields. The idea of neighborhood computing is one of the widely used methods in outlier analysis. Nevertheless, these methods mainly use certainty strategies to model outlier detection, so they cannot effectively handle the fuzzy information in the dataset. Moreover, they mainly focus on dealing with outlier detection in numerical data and cannot effectively find outliers in mixed-attribute data. Fuzzy information granulation theory is an effective granular computing model that allows objects to belong to a set to a certain extent (i.e., membership degree), which makes it possible to better handle uncertainty problems such as fuzziness. In this work, we propose an outlier detection model based on fuzzy neighborhoods. First, a hybrid fuzzy similarity is constructed to granulate the set of objects to form fuzzy information granules. Second, the fuzzy $k$-nearest neighbor is defined to describe the fuzzy local information. Then, the fuzzy neighborhood density is defined to indicate the degree of aggregation of each object. The smaller the fuzzy neighborhood density of an object, the more likely it is to be an outlier. Based on this idea, the fuzzy neighborhood deviation degree is defined to quantify the degree of outliers of objects. Finally, the fuzzy deviation degree on the set of conditional attributes is constructed to indicate the outlier scores of objects. Experimental comparisons with state-of-the-art methods show that the proposed method has a significant improvement on the AUC index and applies to three types of data.

Abstract:
Graph Neural Networks (GNNs) have gained attention for their ability to capture node interactions to generate node representations. However, their performance is frequently restricted in real-world directed networks with natural hierarchical structures. Most current GNNs incorporate information from immediate neighbors or within predefined receptive fields, potentially overlooking long-range dependencies inherent in hierarchical structures. They also tend to neglect node adaptability, which varies based on their positions. To address these limitations, we propose a new model called Hierarchy-Aware Adaptive Graph Neural Network (HAGNN) to adaptively capture hierarchical long-range dependencies. Technically, HAGNN creates a hierarchical structure based on directional pair-wise node interactions, revealing underlying hierarchical relationships among nodes. The inferred hierarchy helps to identify certain key nodes, named Source Hubs in our research, which serve as hierarchical contexts for individual nodes. Shortcuts adaptively connect these Source Hubs with distant nodes, enabling efficient message passing for informative long-range interactions. Through comprehensive experiments across multiple datasets, our proposed model outperforms several baseline methods, thus establishing a new state-of-the-art in performance. Further analysis demonstrates the effectiveness of our approach in capturing relevant adaptive hierarchical contexts, leading to improved and explainable node representation.

Abstract:
Optimizer-based meta-learning, specifically model-agnostic meta-learning (MAML), has emerged as a powerful tool for tackling the cold-start recommendation problem. In these meta-learning-based methods, recommendations for individual users are typically treated as separate tasks and learned independently. However, this task-by-task learning paradigm presents several observable limitations. First, learning one task at a time ignores inter-task correlations, i.e., collaborative signals, which limits the meta-model’s receptive field and prevents it from leveraging valuable shared information, ultimately leading to subpar performance. Second, the meta-model is susceptible to the task distribution, i.e., the varied preference distributions among different users, which in turn introduces biases and inconsistencies, resulting in a less robust model that may perform well on certain user groups while underperforming on others. In this paper, we explore the correlations among different tasks in cold-start recommendations and develop a novel strategy termed cross-task collaborative meta-learning (CCML). More specifically, we propose a collaborative task sampling module designed to mitigate the adverse impact of irrelevant tasks during meta-model learning. This module adaptively identifies tasks that are both similar and beneficial to the primary task, ensuring that the meta-model learns from relevant and supportive information. Additionally, to harness collaborative information across relevant tasks, we introduce a bi-level cross-task meta-training strategy. This strategy leverages multi-task learning to capture collaborative knowledge simultaneously and enhance user profiling with pertinent information. Extensive experiments on four public benchmark datasets demonstrate the advantages of CCML over many state-of-the-art cold-start recommendation methods. 
Our results show significant improvements in recommendation accuracy and robustness, highlighting the potential of cross-task collaboration in enhancing meta-learning-based recommender systems.

Abstract:
A $k$-plex is a subgraph in which each vertex can miss edges to at most $k$ vertices, including itself. The $k$-plex model has many real-world applications, such as social network analysis and product recommendation. Previous studies of $k$-plexes mainly focus on static graphs. However, in reality, relationships between two entities often occur at some specific timestamps, which can be modeled as temporal graphs. Directly extending the $k$-plex model may fail to find some critical groups in temporal graphs, which exhibit certain frequently occurring patterns. To fill the gap, in this paper, we develop a novel model, named $(k,l)$-plex, which is a vertex set that exists in no fewer than $l$ timestamps, at each of which the induced subgraph is a $k$-plex. To identify practical results, we propose and investigate two important problems, i.e., large maximal $(k,l)$-plex (MalKLP) enumeration and maximum $(k,l)$-plex (MaxKLP) identification. For the MalKLP enumeration problem, a reasonable baseline method is first proposed by extending the Bron-Kerbosch (BK) framework. To overcome the limitations of the baseline and scale to large graphs, optimized strategies are developed, including a novel graph reduction approach and search branch pruning techniques. For the MaxKLP identification task, we first design a baseline method by extending the proposed enumeration framework. Additionally, to accelerate the search, a new search framework with efficient branch pruning rules and a refined graph reduction method is developed. Finally, comprehensive experiments are conducted on 14 real-world datasets to validate the efficiency and effectiveness of the proposed techniques.
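
The two definitions above translate directly into membership checks, sketched here in Python. This only verifies a given vertex set; it performs no enumeration, reduction, or pruning. Representing the temporal graph as a dict from timestamp to an adjacency dict of neighbor sets, treating a vertex absent from a snapshot as having no edges there, and the function names are assumptions.

```python
def is_k_plex(vertices, adj, k):
    """Each vertex may miss edges to at most k vertices of the set (itself
    included), i.e. its degree inside the set is at least |S| - k."""
    vertices = set(vertices)
    return all(
        len(adj.get(v, set()) & vertices) >= len(vertices) - k
        for v in vertices
    )


def is_kl_plex(vertices, snapshots, k, l):
    """True if the vertex set induces a k-plex in at least l snapshots."""
    hits = sum(1 for adj in snapshots.values() if is_k_plex(vertices, adj, k))
    return hits >= l
```

With $k = 1$ the first check reduces to testing for a clique, since each vertex may then miss only itself.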

Abstract:
Community search on multilayer graphs has significant applications in fields such as bioinformatics, social network analysis, and financial fraud detection, offering deeper insights compared to traditional community search on single-layer graphs. However, existing approaches often suffer from several key limitations, including inefficiency and a lack of flexibility in accommodating query requirements. To address these challenges, we investigate the problem of community search over large multilayer graphs. Specifically, we introduce a novel multilayer community model called PivotTruss Community (PiTC) with provably nice structural guarantees. We formalize the PiTC search (PiTCS) problem, which aims to efficiently identify personalized PiTCs for a given query vertex. To solve the PiTCS problem, we propose an efficient algorithm and design an elegant index to accelerate the search process. In addition, we propose a parameter recommendation method to improve the usability of PiTCS. To further optimize performance, we introduce a method to compact the index by making a trade-off between search time and index size. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our proposed algorithms.

Abstract:
Cardinality estimation is a fundamental task in database management systems, aiming to predict the size of query results accurately without executing the queries. However, existing techniques either achieve low estimation accuracy or incur high inference latency. Simultaneously achieving high speed and accuracy is therefore critical for the cardinality estimation problem. In this paper, we propose a novel data-driven approach called CoDe (Covering with Decompositions) to address this problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. For each segment, CoDe utilizes tensor decomposition to accurately model its data distribution. Moreover, CoDe introduces innovative algorithms to select the best-fitting distributions for each query, combining them to estimate the final result. By employing multiple models to approximate distributions, CoDe models discrete distributions effectively while remaining computationally efficient. Notably, experimental results show that our method represents a significant advancement in cardinality estimation, achieving state-of-the-art levels of both estimation accuracy and inference efficiency. Across various datasets, CoDe achieves absolute accuracy in estimating more than half of the queries.

Abstract:
Neural Collapse (NC) presents an elegant geometric structure that enables individual activations (features), class means and classifier (weights) vectors to reach optimal inter-class separability during the terminal phase of training on a balanced dataset. Once shifted to imbalanced classification, such an optimal structure of NC can be readily destroyed by the notorious minority collapse, where the classifier vectors corresponding to the minority classes are squeezed. In response, existing works mainly optimize classifiers in an effort to recover NC. However, we discover that this squeezing phenomenon is not only confined to classifier vectors but also occurs with class means. Consequently, reconstructing NC solely at the classifier aspect may be futile, as the class means remain compressed, leading to the violation of inherent self-duality in NC (i.e., class means and classifier vectors converge mutually) and incidentally, an unsatisfactory collapse of individual activations towards the corresponding class means. To shake off these dilemmas, we present a unified All-around Neural Collapse framework (AllNC), aiming to comprehensively restore NC across multiple aspects including individual activations, class means and classifier vectors. We thoroughly analyze its effectiveness and verify its performance on multiple benchmark datasets as state-of-the-art in both balanced and imbalanced settings.

Abstract:
Graph neural networks (GNNs) provide powerful insights into brain neuroimaging technology from the view of graphical networks. However, most existing GNN-based models treat the brain connectome, derived from neuroimaging, as a homogeneous graph characterized by uniform node and edge types. In fact, emerging studies have reported and emphasized the significance of heterogeneity among human brain activities, especially between the two cerebral hemispheres. Thus, homogeneous-structured brain network-based graph methods are insufficient for modeling complicated cerebral activity states. To overcome this problem, we introduce a novel heterogeneous graph neural network (HeBrainGNN) for multimodal brain neuroimaging fusion learning. HeBrainGNN first conceptualizes the brain network as a heterogeneous graph with multiple types of nodes (representing the left and right hemispheres) and edges (categorizing intra- and interhemispheric interactions). We further develop a self-supervised pretraining strategy for this heterogeneous network to address the potential overfitting problem caused by the conflict between a large parameter size and a small medical data sample size. Empirical results show the superiority of the proposed model over other existing methods in brain-related disease prediction tasks. Ablation experiments show that our heterogeneous graph-based model attaches more importance to hemispheric connections that may be neglected due to their low strength by previous homogeneous graph models. Additional experiments reveal that our pretraining strategy not only addresses the challenge of limited labeled data but also significantly enhances accuracy, affirming the potential of our approach in advancing neuroimaging analysis.

Affiliations: Department of Strategic and Advanced Interdisciplinary Research, Pengcheng Laboratory, Shenzhen, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, School of Electronics Engineering and Computer Science, Peking University, Beijing, China; Yuanpei College, Peking University, Beijing, China; Department of Computer Science, University of Maryland, College Park, MD, USA; Key Laboratory of High Confidence Software Technologies (MOE) & School of Computer Science, Peking University, Beijing, China

Abstract:
This paper introduces a novel framework, M4, designed to estimate per-flow quantiles in data streams accurately. M4 is a versatile framework that can be integrated with a wide array of single-flow quantile estimation algorithms, thereby enabling them to perform per-flow estimation. The framework employs a sketch-based approach to provide a space-efficient method for recording and extracting distribution information. M4 incorporates two techniques: MINIMUM and SUM. The MINIMUM technique minimizes the noise on a flow from other flows caused by hash collisions, while the SUM technique efficiently categorizes flows based on their sizes and customizes treatment strategies accordingly. We demonstrate the application of M4 on three single-flow quantile estimation algorithms (DDSketch, t-digest, and ReqSketch), detailing the specific implementation of the MINIMUM and SUM techniques. We provide theoretical proof that M4 delivers high accuracy while utilizing limited memory. Additionally, we conduct extensive experiments to evaluate the performance of M4 regarding accuracy and speed. The experimental results indicate that across all three example algorithms, M4 significantly outperforms two comparison frameworks in terms of accuracy for per-flow quantile estimation while maintaining comparable speed.

Abstract:
Spatio-temporal time series forecasting has attracted great attention in various fields, including climate, power, and traffic forecasting. Recently, Spatio-temporal Graph Neural Networks (STGNNs) have shown promising performance in modeling spatial dependencies based on graph neural networks (GNNs) and temporal dependencies based on temporal learning modules. However, most STGNNs do not effectively integrate explicit and implicit relationships between nodes, nor do they adequately capture long- and short-term time dependencies. To address these challenges, this paper presents a Quaternion Spatio-temporal Graph Neural Network (QSTGNN). Specifically, the quaternion spatio-temporal graph is first constructed, such that the information of both short- and long-term time steps is preserved in a quaternion feature tensor, and the information of multiple explicit graphs and an implicit graph is integrated in a quaternion graph adjacency matrix. Then, two modules are designed: a 1D quaternion convolution module and a quaternion graph convolution module. In the 1D quaternion convolution module, complex temporal correlations among short- and long-term time steps can be well exploited by a 1D quaternion convolution operator based on the quaternion Hamilton product. In the quaternion graph convolution module, quaternion graph convolution is designed to characterize nonlinear dependencies among multiple spatial graphs, including explicit and implicit graphs. Extensive experiments are conducted on six datasets, and the results show that QSTGNN achieves state-of-the-art performance compared with ten existing methods. Explainability analysis shows that multiple spatial correlations can accurately illustrate the traffic flow and road functional information in real traffic roads.
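
For readers unfamiliar with the Hamilton product mentioned in this abstract, the following minimal Python sketch shows the standard product of two quaternions written as (real, i, j, k) tuples. It illustrates the algebraic operation only; how QSTGNN builds its convolution operators on top of it is not specified here, and the function name is an assumption.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions p and q, each given as (a, b, c, d)
    for a + b*i + c*j + d*k. The product is not commutative."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton_product(i, j))  # i * j = k
print(hamilton_product(j, i))  # j * i = -k
```

Because every output component mixes all four input components, a single quaternion weight couples the four channels of a quaternion feature, which is the usual motivation for quaternion-valued layers.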

Abstract:
Although robust tensor completion has been extensively studied, the effect of incorporating side information has not been explored. In this article, we fill this gap by developing a novel high-order robust tensor completion model that incorporates both latent and explicit side information. We base our model on the transformed t-product because the corresponding tensor tubal rank can characterize the inherent low-rank structure of a tensor. We study the effect of side information on sample complexity and prove that our model needs fewer observations than other tensor recovery methods when side information is perfect. This theoretically shows that informative side information is beneficial for learning. Extensive experimental results on synthetic and real data further demonstrate the superiority of the proposed method over several popular alternatives. In particular, we evaluate the performance of our solution based on two important applications, namely, link prediction in signed networks and rating prediction in recommender systems. We show that the proposed model, which manages to exploit side information in learning, outperforms other methods in the learning of such low-rank tensor data. Furthermore, when dealing with varying dimensions, we also design an online robust tensor completion with side information algorithm and validate its effectiveness using a real-world traffic dataset in the supplementary material.

Abstract:
Graph Neural Networks have demonstrated remarkable effectiveness in various graph-based tasks, but their inefficiency in training and inference poses significant challenges for scaling to real-world, large-scale applications. To address these challenges, a plethora of algorithms have been developed to accelerate GNN training and inference, garnering substantial interest from the research community. This paper presents a systematic review of these acceleration algorithms, categorizing them into three main topics: training acceleration, inference acceleration, and execution acceleration. For training acceleration, we discuss techniques like graph sampling and GNN simplification. In inference acceleration, we focus on knowledge distillation, GNN quantization, and GNN pruning. For execution acceleration, we explore GNN binarization and graph condensation. Additionally, we review several libraries related to GNN acceleration, including our Scalable Graph Learning library, and propose future research directions.

Abstract:
Data valuation is a core function in data markets and cooperative data sharing. Shapley value is a widely used approach to fairly measure the contribution of data points towards a collective utility (e.g., a machine learning model trained from the data). However, computing Shapley values is known to be in general #P-hard due to the exponential utility evaluation. Furthermore, the presence of dynamic data poses additional challenges due to the prohibitively expensive cost of recomputing from scratch. In this paper, we study the problem of Dynamic Shapley Value Computation, which focuses on updating Shapley values when dynamically adding or deleting data points. For adding, to prune redundant computation of overlapping model utilities, we propose the pivot-based algorithm that can reduce half the computation time in expectation. We also propose delta-based algorithms to capture Shapley value changes, which require only a smaller sample size to converge. For deleting, we present the YN-NN algorithm that derives the new Shapley values from precomputed utilities efficiently. Based on Shapley value changes, we give another version of the delta-based algorithm for deleting data points. Besides, we propose heuristic algorithms that draw on experimental observations for addition, deletion, and hybrid scenarios. Extensive experimental results demonstrate the efficiency and effectiveness of our proposed algorithms.
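
To make the cost of Shapley value computation concrete, here is a minimal Python sketch of the textbook definition: each data point's value is its marginal contribution averaged over all orderings. This is the from-scratch baseline whose cost motivates the abstract, not the pivot-based, delta-based, or YN-NN algorithms. The `utility` callable and the toy additive utility are assumptions for illustration.

```python
from itertools import permutations
from math import factorial


def exact_shapley(points, utility):
    """Exact Shapley values by enumerating all n! orderings.

    `utility` maps a frozenset of points to a number (assumed interface).
    Only feasible for very small n.
    """
    values = {p: 0.0 for p in points}
    for order in permutations(points):
        coalition = frozenset()
        prev = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = utility(coalition)
            values[p] += cur - prev  # marginal contribution of p
            prev = cur
    n_fact = factorial(len(points))
    return {p: v / n_fact for p, v in values.items()}


# Toy additive utility: each point contributes a fixed weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
print(exact_shapley(list(weights), lambda s: sum(weights[p] for p in s)))
```

With an additive utility each point's Shapley value equals its own weight, which gives a quick sanity check. Adding or deleting a single point changes the set of orderings, which is why naive recomputation after every update is so expensive.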

Abstract:
The problem of structural diversity search, which aims to find the users with the highest structural diversity in social networks, has been widely studied recently. The structural diversity of a user is measured by the number of social contexts inside his/her contact neighborhood. Three structural diversity models based on cohesive subgraph models (i.e., k-sized component, k-core, and k-truss) have been proposed. Previous solutions are CPU-based and sequential, and several of their key steps cannot be highly parallelized. GPUs offer high efficiency in parallel computing for many complex graph problems such as triangle counting, subgraph pattern matching, and graph decomposition. In this paper, we provide a unified framework that utilizes multiple GPUs to accelerate the computation of structural diversity search under the three structural diversity models mentioned above. We first propose a GPU-based lock-free method to efficiently extract ego-networks in CSR format in parallel. Second, we design detailed GPU-based solutions for computing k-sized component-based, k-core-based, and k-truss-based structural diversity scores by dynamically grouping GPU resources. To balance the workload among multiple GPUs, we propose a greedy work-packing scheme and a dynamic work-stealing strategy that maximize GPU utilization. Extensive experiments on real-world datasets validate the superiority of our GPU-based structural diversity search solutions in terms of efficiency and effectiveness.
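
As a point of reference, the k-sized component model can be sketched sequentially in a few lines of Python. The sketch assumes the common definition in which a user's score is the number of connected components of size at least k in the ego-network induced by the user's neighbors. It is a plain CPU illustration, not the GPU framework described above, and the adjacency-dictionary input is an assumed format.

```python
def structural_diversity(adj, v, k):
    """Count connected components with at least k vertices in the subgraph
    induced by the neighbors of v (v itself excluded).

    `adj` maps each vertex to the set of its neighbors (assumed format).
    """
    neighbors = set(adj[v])
    seen = set()
    count = 0
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first search restricted to the ego-network of v.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in neighbors and w not in seen:
                    seen.add(w)
                    stack.append(w)
        if size >= k:
            count += 1
    return count


adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
print(structural_diversity(adj, 0, k=2))
```

In the toy graph, the neighbors of vertex 0 form the components {1, 2}, {3}, and {4}, so only one component reaches size 2. Each ego-network is processed independently, which is the property a parallel implementation can exploit.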

Abstract:
Today, mainstream recommendation systems have achieved remarkable success in recommending items that align with user interests. However, limited attention has been paid to the perspective of item providers. Content providers often desire that all their offerings, including unpopular or cold items, are displayed and appreciated by users. To tackle the challenges of unfair exhibition and limited item acceptance coverage, we introduce a novel recommendation perspective that enables items to “select” their most relevant users. We further introduce ItemRec, a straightforward plug-and-play approach that leverages mutual scores calculated by any model. The goal is to maximize the recommendation and acceptance of items by users. Through extensive experiments on three real-world datasets, we demonstrate that ItemRec can enhance valid coverage by up to 38.5% while maintaining comparable or superior recommendation quality. This improvement comes with only a minor increase in model inference time, ranging from 1.5% to 5%. Furthermore, when compared to thirteen state-of-the-art recommendation methods across accuracy, fairness, and diversity, ItemRec exhibits significant advantages as well. Specifically, ItemRec achieves an optimal balance between precision and valid coverage, showcasing an efficiency gain ranging from 1.8 to 45 times compared to other fairness-oriented methodologies.

Abstract:
Graph clustering has become a crucial technique for uncovering community structures in complex network data. However, existing approaches often introduce cumbersome regularization or constraints to obtain balanced clustering results, thereby increasing the hyperparameter tuning burden and the number of intermediate variables. These limitations can lead to suboptimal performance, particularly in scenarios involving imbalanced clusters or large-scale datasets. Besides, most graph cut clustering methods solve two separate discrete problems, resulting in information loss and relying on time-consuming eigen-decomposition. To address these challenges, this paper proposes an effective graph cut framework, termed Harmonic MaxMin Cut (HMMC), inspired by worst-case objective optimization and the harmonic mean. Unlike traditional spectral clustering, HMMC produces all cluster assignments in a single step, eliminating the need for additional discretization and notably enhancing robustness to “worst-case cluster” boundaries. This paper further devises a fast coordinate descent (CD) solver whose complexity scales linearly with the graph size, offering a computationally efficient alternative to eigen-decomposition. Extensive experiments on real-world datasets demonstrate that HMMC is comparable to, or even surpasses, state-of-the-art methods, while also finding more favorable local solutions than non-negative matrix factorization techniques.

Abstract:
Currently, the attention mechanism has become a standard fixture in most state-of-the-art natural language processing (NLP) models, not only due to the outstanding performance it can achieve but also due to the plausible innate explanations it provides for the behaviors of neural architectures, which are notoriously difficult to analyze. However, recent studies show that attention is unstable against randomness and perturbations during training or testing, such as random seeds and slight perturbations of embedding vectors, which impedes it from becoming a faithful explanation tool. Thus, a natural question is whether we can find a substitute for the current attention that is more stable and keeps the most important characteristics of the explanation and prediction of attention. In this paper, to resolve the problem, we provide a rigorous definition of such an alternative, namely SEAT (Stable and Explainable Attention). Specifically, a SEAT should have the following three properties: (1) Its prediction distribution is enforced to be close to the distribution based on the vanilla attention; (2) Its top-k indices have large overlaps with those of the vanilla attention; (3) It is robust w.r.t. perturbations, i.e., any slight perturbation on SEAT will not change the prediction distribution too much, which implicitly indicates that it is stable to randomness and perturbations. To further improve the interpretability stability against perturbations, based on SEAT we provide another definition called SEAT++. Then we propose a method to get a SEAT++, which could be considered an ad hoc modification for canonical attention. Finally, through intensive experiments on various datasets, we compare our SEAT and SEAT++ with other baseline methods using RNN, BiLSTM, and BERT architectures via six different evaluation metrics for model interpretation, stability, and accuracy.
Results show that SEAT and SEAT++ are more stable against different perturbations and randomness while also keeping the explainability of attention, which indicates they provide more faithful explanations. Moreover, compared with vanilla attention, there is almost no utility (accuracy) degradation for SEAT and SEAT++.
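
Property (2) of SEAT, the overlap of top-k indices with vanilla attention, can be illustrated with a small Python sketch. The normalization by k and the function name are assumptions for illustration; the abstract does not state the exact overlap measure used in the paper.

```python
def top_k_overlap(weights_a, weights_b, k):
    """Fraction of shared indices among the k largest entries of two
    attention weight vectors of equal length."""

    def top_k(weights):
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        return set(order[:k])

    return len(top_k(weights_a) & top_k(weights_b)) / k


vanilla = [0.50, 0.30, 0.10, 0.10]
candidate = [0.45, 0.10, 0.35, 0.10]
print(top_k_overlap(vanilla, candidate, k=2))
```

Here the top-2 indices are {0, 1} for the first vector and {0, 2} for the second, so half of the top-k tokens are shared. A value near 1 means the substitute attention highlights largely the same tokens as vanilla attention.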

Abstract:
Traditional domain-specific causal discovery relies on expert knowledge to guide the data-based structure learning process, thereby improving the reliability of recovered causality. Recent studies have shown promise in using the Large Language Model (LLM) as causal experts to construct autonomous expert-guided causal discovery systems through causal reasoning between pairwise variables. However, their performance is hampered by inaccuracies in aligning LLM-derived causal knowledge with the actual causal structure. To address this issue, this paper proposes a novel LLM-driven causal discovery framework that limits LLM’s prior within a reliable range. Instead of pairwise causal reasoning that requires both precise and comprehensive output results, the LLM is directed to focus on each single aspect separately. By combining these distinct causal insights, a unified set of structural constraints is created, termed a harmonized prior, which draws on their respective strengths to ensure prior accuracy. On this basis, we introduce plug-and-play integrations of the harmonized prior into mainstream categories of structure learning methods, thereby enhancing their applicability in practical scenarios. Evaluations on real-world data demonstrate the effectiveness of our approach.

Abstract:
Feature-based knowledge distillation has been applied to compress modern recommendation models, usually with projectors that align student (small) recommendation models’ dimensions with teacher dimensions. However, existing studies have only focused on making the projected features (i.e., student features after projectors) similar to teacher features, overlooking investigating whether the user preference can be transferred to student features (i.e., student features before projectors) in this manner. In this paper, we find that due to the lack of restrictions on projectors, the process of transferring user preferences will likely be interfered with. We refer to this phenomenon as preference inconsistency. It greatly wastes the power of feature-based knowledge distillation. To mitigate preference inconsistency, we propose PCKD, which consists of two regularization terms for projectors. We also propose a hybrid method that combines the two regularization terms. We focus on items with high preference scores and significantly mitigate preference inconsistency, improving the performance of feature-based knowledge distillation. Extensive experiments on three public datasets and three backbones demonstrate the effectiveness of PCKD.

Abstract:
Geo-Indistinguishability (GI) is a powerful privacy model that can effectively protect location information by limiting the ability of an attacker to infer a user's true location. In real life, locations usually have different sensitivity levels in terms of privacy; for example, shopping malls might be low-sensitive while home addresses might be high-sensitive for users. However, the GI model does not consider the various sensitivity levels of locations, and applies the same perturbation to all locations to meet the highest privacy requirement. This causes overprotection of low-sensitive locations and reduces data utility. To strike a good balance between privacy and utility, in this paper, we propose a novel privacy notion, termed Location-Discriminative Geo-Indistinguishability (LDGI), which takes into account the different sensitivity levels of location privacy. With the LDGI model, we then develop a perturbation scheme called EM-LDGI based on the exponential mechanism, and an advanced scheme, MinQL, to further enhance data utility. To improve the efficiency of the proposed schemes, we design a scheme, MinQL-S, with the assistance of the spanner graph, at the cost of a slight utility degradation. We show through theoretical analysis that the proposed schemes satisfy LDGI and evaluate their performance by extensive experiments on both synthetic and real datasets. The comparison with GI mechanisms demonstrates the advantages of the LDGI model.
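
The general idea of perturbing a location with the exponential mechanism can be sketched as follows. This is a generic illustration with a single privacy parameter for all locations, in the spirit of plain GI, and it is not the EM-LDGI, MinQL, or MinQL-S scheme. The weight form exp(-epsilon * d / 2), the candidate set, and the function names are assumptions made for illustration.

```python
import math
import random


def em_distribution(true_loc, candidates, epsilon):
    """Reporting probabilities over candidate locations.

    Each candidate z gets weight exp(-epsilon * d(true_loc, z) / 2), so
    nearby candidates are more likely to be reported (assumed weight form).
    """
    weights = [math.exp(-epsilon * math.dist(true_loc, z) / 2) for z in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def perturb(true_loc, candidates, epsilon, rng=random):
    """Sample one reported location from the distribution above."""
    probs = em_distribution(true_loc, candidates, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]


candidates = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(em_distribution((0.0, 0.0), candidates, epsilon=1.0))
print(perturb((0.0, 0.0), candidates, epsilon=1.0))
```

A larger epsilon concentrates probability on candidates close to the true location (more utility, less privacy). Using one epsilon everywhere is exactly the uniform treatment that the abstract argues overprotects low-sensitive locations.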

Abstract:
The Maximizing Influence (Max-Inf) query is a fundamental operation in spatial data management. This query returns an optimal site from a candidate set to maximize its influence. Existing work commonly focuses on outdoor spaces. In practice, however, people spend up to 87% of their daily life inside indoor spaces. The outdoor techniques fall short in indoor spaces due to the complicated topology of indoor spaces. In this paper, we formulate two indoor Max-Inf queries, the Top-k Probabilistic Influence Query (TkPI) and the Collective-k Probabilistic Influence Query (CkPI), taking probability and mobility factors into consideration. We propose a novel spatial index, IT-tree, which utilizes the properties of indoor venues to facilitate the indoor distance computation, and then applies a trie to group trajectories with similar check-in partitions together, based on their sketch information. This structure is simple but highly effective in pruning the trajectory search space. To process TkPI efficiently, we devise subtree pruning and progressive pruning techniques to filter out unnecessary trajectories based on probability bounds and the monotonicity of influence probability. For CkPI queries, which constitute a submodular NP-hard problem, three approximation algorithms are provided with different strategies for computing the marginal influence value during the search. Through extensive experiments on several real indoor venues, we demonstrate the efficiency and effectiveness of our proposed algorithms.

Abstract:
Social recommendations leverage social networks to augment the performance of recommender systems. However, the critical task of denoising social information has not been thoroughly investigated in prior research. In this study, we introduce a hierarchical denoising robust social recommendation model to tackle noise at two levels: 1) intra-domain noise, resulting from user multi-faceted social trust relationships, and 2) inter-domain noise, stemming from the entanglement of the latent factors over heterogeneous relations (e.g., user-item interactions, user-user trust relationships). Specifically, our model advances a preference and social psychology-aware methodology for the fine-grained and multi-perspective estimation of tie strength within social networks. This serves as a precursor to an edge weight-guided edge pruning strategy that refines the model's diversity and robustness by dynamically filtering social ties. Additionally, we propose a user interest-aware cross-domain denoising gate, which not only filters noise during the knowledge transfer process but also captures the high-dimensional, nonlinear information prevalent in social domains. We conduct extensive experiments on three real-world datasets to validate the effectiveness of our proposed model against state-of-the-art baselines. We perform empirical studies on synthetic datasets to validate the strong robustness of our proposed model.

Abstract:
Summarization quality evaluation is a non-trivial task in text summarization. Contemporary methods can be mainly categorized into two scenarios: (1) reference-based: evaluating with a human-labeled reference summary; (2) reference-free: evaluating the consistency of the summary with the document. Recent studies mainly focus on one of these scenarios and explore training neural models to align with human criteria and finally give a numeric score. However, the models from different scenarios are optimized individually, which may result in sub-optimal performance since they neglect the shared knowledge across different scenarios. Besides, designing individual models for each scenario causes inconvenience to users. Moreover, providing only a numeric quality score cannot help users improve the summarization model, since they do not know why the score is low. Inspired by this, we propose the Unified Multi-scenario Summarization Evaluator (UMSE) and the Multi-Agent Summarization Evaluation Explainer (MASEE). More specifically, we propose a perturbed prefix tuning method to share cross-scenario knowledge between scenarios and use a self-supervised training paradigm to optimize the model without extra human labeling. Our UMSE is the first unified summarization evaluation framework that can be used in three evaluation scenarios. We propose a multi-agent summary evaluation explanation method, MASEE, which employs several LLM-based agents to generate detailed natural language explanations in four different aspects. Experimental results across three typical scenarios on the benchmark dataset SummEval indicate that our UMSE can achieve comparable performance with several existing strong methods that are specifically designed for each scenario. Intensive quantitative and qualitative experiments also demonstrate the effectiveness of our proposed explanation method, which can generate consistent and accurate explanations.

Abstract:
Detecting out-of-distribution (OOD) samples poses a significant safety challenge when deploying models in open-world scenarios. Advanced works assume that OOD and in-distribution (ID) samples exhibit a distribution discrepancy, showing an encouraging direction in estimating the uncertainty with embedding features or predicted outputs. Besides incorporating auxiliary outliers as a decision boundary, quantifying a “meaningful distance” in embedding space as an uncertainty measurement is a promising strategy. However, these distance-based approaches overlook the data structure and heavily rely on the high-dimensional features learned by deep neural networks, causing unreliable distances due to the “curse of dimensionality”. In this work, we propose a data structure-aware approach to mitigate the sensitivity of distances to the “curse of dimensionality”, where high-dimensional features are mapped to the manifold of ID samples, leveraging the well-known manifold assumption. Specifically, we present a novel distance termed tangent distance, which tackles the issue of generalizing the meaningfulness of distances on testing samples to detect OOD inputs. Our approach is inspired by manifold learning for adversarial examples, where the adversarial region probability density is close to the orthogonal direction of the manifold, and by the observation that OOD and adversarial samples share a common characteristic, namely imperceptible perturbations with a distribution shift. We therefore propose that OOD samples are relatively far away from the ID manifold, and the tangent distance directly computes the Euclidean distance between a sample and the nearest submanifold space, instantiated as the linear approximation of a local region on the manifold. We provide empirical and theoretical insights to demonstrate the effectiveness of OOD uncertainty measurements on the low-dimensional subspace.
Extensive experiments show that the tangent distance performs competitively with other post hoc OOD detection baselines on common and large-scale benchmarks, and the theoretical analysis supports our claim that ID samples are likely to reside in high-density regions, explaining the effectiveness of internal connections among ID data.

Abstract:
This paper explores explaining session-based recommendation (SR) by path reasoning. Current SR models emphasize accuracy but lack explainability, while traditional path reasoning prioritizes knowledge graph exploration, ignoring sequential patterns present in the session history. Therefore, we propose a generalized hierarchical reinforcement learning framework for SR, which improves the explainability of existing SR models via Path Reasoning, namely PR4SR. Considering the different importance of items to the session, we design the session-level agent to select the items in the session as the starting nodes for path reasoning and the path-level agent to perform path reasoning. In particular, we design a multi-target reward mechanism to adapt to the skip behaviors of sequential patterns in SR and introduce path midpoint reward to enhance the exploration efficiency and accuracy in knowledge graphs. To improve the knowledge graph’s completeness and diversify the paths of explanation, we incorporate extracted feature information from images into the knowledge graph. We instantiate PR4SR in five state-of-the-art SR models (i.e., GRU4REC, NARM, GCSAN, SR-GNN, SASRec) and compare it with other explainable SR frameworks to demonstrate the effectiveness of PR4SR for recommendation and explanation tasks through extensive experiments with these approaches on four datasets.

Abstract:
As a compromise between supervised and unsupervised learning, semi-supervised learning (SSL) harnesses both labeled and unlabeled data to enhance learning performance. Graph-based semi-supervised learning (GSSL) has emerged as a prominent approach owing to its versatility in representing sample interdependencies via graph structures. However, traditional GSSL methods face high time cost when computing matrix inverses, making them inefficient for large datasets. To address this, some researchers have introduced anchors as a bridge to accelerate the process. Nevertheless, most anchor-based models suffer from one or more of the following issues: (1) The anchor graph-based construction of the adjacency matrix has limitations; (2) The objective functions are typically non-convex, leading to local optima and requiring multiple runs to achieve good performance. To tackle these challenges, we develop a probability-driven approach to build the adjacency matrix, defining sample similarity as the probability of sharing the same anchor. Based on this strategy, we design a model (CFSL) with a strictly convex objective function, guaranteeing a globally optimal solution without iterative optimization. Experiments on multiple datasets indicate that our algorithm yields strong performance.

Abstract:
Ordinal regression (OR, also called ordinal classification) is the classification of ordinal data, in which the underlying target variable is categorical and is considered to have a natural ordinal relation with respect to the underlying explanatory variable. A key to successful OR models is to identify a data structure, the ‘natural ordinal relation’, that is common to many ordinal data and to reflect that structure in the design of those models. A recent OR study found that, in many real-world ordinal data, the conditional probability distribution (CPD) of the target variable given a value of the explanatory variable tends to be unimodal. Several previous studies thus developed unimodal likelihood models, in which a predicted CPD is guaranteed to be unimodal. However, it was also observed experimentally that many real-world ordinal data contain some values of the explanatory variable at which the underlying CPD is non-unimodal, and hence unimodal likelihood models may suffer from a bias for such a CPD. Therefore, to mitigate such a bias, we propose approximately unimodal likelihood models, which can represent unimodal CPDs as well as CPDs that are close to being unimodal. We also verify experimentally that a proposed model can be effective for statistical modeling of ordinal data and OR tasks.
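
To make the notion of a unimodal CPD concrete, the following minimal Python sketch checks whether a probability vector over ordered categories rises to a single mode and then falls. The function names and the `unimodality_violation` heuristic are illustrative assumptions and are not taken from the paper; the heuristic is only one possible way to express how far a CPD is from being unimodal.

```python
def is_unimodal(probs, tol=1e-12):
    """Return True if probs never rises again after it has started to fall."""
    falling = False
    for prev, cur in zip(probs, probs[1:]):
        if cur > prev + tol:
            if falling:
                return False  # a rise after a fall means a second mode
        elif cur < prev - tol:
            falling = True
    return True


def unimodality_violation(probs):
    """Hypothetical heuristic (not from the paper): total mass that moves the
    wrong way relative to the first maximum; 0.0 for a unimodal vector."""
    mode = probs.index(max(probs))
    violation = 0.0
    for i in range(1, len(probs)):
        step = probs[i] - probs[i - 1]
        if i <= mode:
            violation += max(0.0, -step)  # decrease before the mode
        else:
            violation += max(0.0, step)   # increase after the mode
    return violation
```

Under this sketch, a vector with a small positive violation could be read as approximately unimodal, while a violation of zero corresponds to an exactly unimodal CPD.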

Affiliations: School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi, China; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, School of Information Science and Technology, Beijing University of Technology, Beijing, China; University of Manchester, Manchester, U.K.; Institute for Infocomm Research, Singapore; ILCC, School of Informatics, University of Edinburgh, Edinburgh, U.K.

Abstract:
Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, including two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles with comprehending long-form answers with complex logic, precisely identifying the content needing revision, and preserving the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose the generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset generated from large-scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidence related to atomic facts and inputs this evidence into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over state-of-the-art methods on several datasets, with an additionally proposed new metric, Attr_p, for evaluating the precision of evidence attribution.

Abstract:
Outdoor billboard advertising has proven effective for commercial promotions, attracting potential customers, and boosting product sales. Auction serves as a popular method for leasing billboard usage rights, enabling a seller to rent billboards to winning users for predefined periods according to their bids. An effective auction algorithm is of great significance to maximize the efficiency of the billboard ecosystem. In contrast to a rich literature on Internet advertising auctions, well-crafted algorithms tailored for outdoor billboard auctions remain rare. In this work, we investigate the problem of outdoor billboard auctions, in the practical setting where bids are received and processed on the fly. Our goal is to maximize social welfare, namely the total benefits of auction participants, including the billboard service provider and the bidding users. To this end, we first formulate the billboard social welfare maximization problem into an Integer Linear Problem (ILP), and then reformulate the ILP into a compact form with a reduced size of constraints (at the cost of involving exponentially many primal variables), based on which we derive the dual problem. Furthermore, we design a dual oracle to handle the exponentially many dual constraints, avoiding exhaustive enumeration. We present a primal-dual online algorithm with an incentive-compatible pricing mechanism. Theoretical analysis proves the individual rationality, incentive compatibility, and computational efficiency of our online algorithm. Extensive experimental results show that the online algorithm is both effective and efficient, and achieves a good competitive ratio.

Abstract:
With the prevalence of social networks on online platforms, social recommendation has become a vital technique for enhancing personalized recommendations. The effectiveness of social recommendations largely relies on the social homophily assumption, which presumes that individuals with social connections often share similar preferences. However, this foundational premise has been recently challenged due to the inherent complexity and noise present in real-world social networks. In this paper, we tackle the low social homophily challenge from an innovative generative perspective, directly generating optimal user social representations that maximize consistency with collaborative signals. Specifically, we propose the Score-based Generative Model for Social Recommendation (SGSR), which effectively adapts the Stochastic Differential Equation (SDE)-based diffusion models for social recommendations. To better fit the recommendation context, SGSR employs a joint curriculum training strategy to mitigate challenges related to missing supervision signals and leverages self-supervised learning techniques to align knowledge across social and collaborative domains. Extensive experiments on real-world datasets demonstrate the effectiveness of our approach in filtering redundant social information and improving recommendation performance.

Affiliations: Department of Data Science, School of Computing, Data Sciences & Physics, College of William and Mary, Williamsburg, VA, USA; College of Computer and Information Science, Southwest University, Chongqing, China; School of Computing, Southern Illinois University, Carbondale, IL, USA; Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA; Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology, Ministry of Education, Hefei, China

Abstract:
Outlier detection is essential for data compliance, fraud prevention, and strategic decision-making. Finding outliers relies on studying the feature space to find anomalous instances. As the feature dimension increases, the process inevitably becomes more complicated, hindering models from finding genuine outliers. In this paper, we investigate an even more challenging task, the online outlier detection (OOD) problem, where data points to be examined for outlier detection are characterized by two dynamic changes: (1) increasing volume instead of a static set; and (2) an evolving feature space instead of a known set. Such instance and feature space dynamics impede traditional outlier detection (OD) techniques that rely on geometric data structure to distinguish outliers. To address this, we propose a new approach coined Online Outlier Detection in Open Feature Spaces, which circumvents this limitation by learning a latent hypersphere representation that positions regular and anomalous data points inside and outside its boundary, respectively. The crux of our approach is a tailored reconstruction loss, which allows each data point to be represented as the sum of its pertinent feature embeddings. Each of these embeddings is updated non-intrusively, supporting both efficient and incremental learning of the latent hypersphere. Extensive experiments on twelve benchmark datasets underscore the robustness and superior performance of our method against seven leading counterparts.
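
The hypersphere idea can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a data point is a mapping from feature names to values, that its latent representation is the value-weighted sum of per-feature embedding vectors, and that the score is the signed distance to the sphere boundary. The names `embed` and `outlier_score`, the weighting scheme, and the handling of unseen features are all hypothetical and not taken from the paper.

```python
import math


def embed(point, feature_embeddings):
    """Sum the embeddings of the features present in `point` (assumed
    value-weighted); features without an embedding yet are skipped."""
    dim = len(next(iter(feature_embeddings.values())))
    z = [0.0] * dim
    for name, value in point.items():
        vec = feature_embeddings.get(name)
        if vec is None:
            continue  # newly appearing feature: no embedding learned so far
        for i, v in enumerate(vec):
            z[i] += value * v
    return z


def outlier_score(z, center, radius):
    """Signed distance to the hypersphere boundary: > 0 outside, < 0 inside."""
    return math.dist(z, center) - radius
```

For example, with embeddings `{"a": [1.0, 0.0], "b": [0.0, 1.0]}`, a unit sphere centered at the origin, and the point `{"a": 3.0, "c": 9.0}`, the unseen feature `c` is ignored and the point lands outside the boundary.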

Abstract:
Burgeoning graph contrastive learning (GCL) stands out in the graph domain for its low annotation costs and high model performance improvements. It is typically composed of three standard configurations: 1) graph data augmentation (GraphDA), 2) multi-branch graph neural network (GNN) encoders and projection heads, and 3) contrastive loss. Unfortunately, diverse GraphDA may corrupt graph semantics to different extents while greatly increasing the time spent on hyperparameter search. Besides, the multi-branch contrastive framework also demands considerable training cost for encoding and projecting. In this paper, we propose a simplified GCL model that addresses these problems simultaneously using the minimal components of a general graph contrastive framework, i.e., a GNN encoder and a projection head. The proposed model treats the node representations generated by the GNN encoder and the projection head as positive pairs while considering all other representations as negatives, which not only frees the model from its dependency on GraphDA but also streamlines the traditional multi-branch contrastive learning framework into a more efficient single-stream one. Through an in-depth theoretical analysis of the objective function, we explain why the proposed model works. Empirical experiments on multiple public datasets demonstrate that the proposed model achieves performance comparable to that of current advanced self-supervised GNNs.

Abstract:
Cross-domain Recommendation (CR) has been extensively studied in recent years to alleviate the data sparsity issue in recommender systems by utilizing information from different domains. In this work, we focus on the more general Non-overlapping Cross-domain Sequential Recommendation (NCSR) scenario. NCSR is challenging because there are no overlapping entities (e.g., users and items) between domains, and only users’ implicit feedback is available, with no content information. Previous CR methods cannot solve NCSR well, since (1) they either need extra content to align domains or need explicit domain alignment constraints to reduce the domain discrepancy from domain-invariant features, (2) they pay more attention to users’ explicit feedback (i.e., users’ rating data) and cannot capture their sequential interaction patterns well, and (3) they usually perform a single-target cross-domain recommendation task and seldom investigate dual-target ones. Considering the above challenges, we propose Prompt Learning-based Cross-domain Recommender (PLCR), an automated prompting-based recommendation framework for the NCSR task. Specifically, to address challenge (1), PLCR resorts to learning domain-invariant and domain-specific representations via its prompt learning component, where the domain alignment constraint is discarded. For challenges (2) and (3), PLCR introduces a pre-trained sequence encoder to learn users’ sequential interaction patterns, and conducts a dual-learning target with a separation constraint to enhance recommendations in both domains. Our empirical study on two sub-collections of Amazon demonstrates the advantage of PLCR over related SOTA methods.

Abstract:
Group recommendation aims to recommend desired items to a group of users. Existing methods mainly adopt deterministic networks to represent groups as fixed-point vectors, assuming that their preferences lie close to these vectors in the interest space. However, each group tends to have various interests, which cannot be fully captured by fixed-point vectors and thus call for probabilistic modeling of interests as densities instead. Although this can be supported by a Variational AutoEncoder (VAE), interaction data in group recommendation are highly sparse and insufficient for VAE training, resulting in high risks of posterior collapse and deficient personalization. To this end, this paper proposes a contrastive variational learning model boosted by variational model augmentation and an easy-to-hard paradigm. Specifically, a VAE with tailored attention is first employed to represent group preferences as variational vectors for probabilistic preference modeling. Additionally, we conduct data-agnostic augmentation via learnable variational dropout, which removes redundant or irrelevant neurons in the VAE to generate sufficient meaningful augmented views for contrastive learning despite data sparsity. Difficulty-aware negative sampling is further applied to generate high-quality negative samples that adapt to the varying task difficulty required over the course of training. Finally, we utilize density-based variational alignment to guide the optimization process of contrastive learning. Experiments on four real-world datasets demonstrate the significant performance improvements of our model over SOTA methods for group recommendation.

Abstract:
In many real-world applications, data are continuously accumulated in open environments, and new classes may emerge over time. For instance, in disease diagnosis, the prevalence of a certain disease may vary seasonally, and new diseases can also emerge. This paper investigates the problem of learning from an unlabeled data stream where the label distribution evolves over time and previously unseen classes may appear. To handle the emerging new classes in online label shift, we first design a novel risk estimator by unbiased risk rewriting and mixture proportion estimation, which enables the identification of new class data. Subsequently, we employ the online ensemble paradigm for model updating to handle unknown distribution shifts. Moreover, we introduce sketching and ensemble pruning mechanisms to improve the efficiency of the algorithm, making it more lightweight and practical. The proposed approach enjoys a theoretical guarantee of dynamic regret, ensuring its effectiveness in adapting to the unknown distribution shifts and the emergence of new classes in streaming data. Experiments on diverse benchmark datasets and two real-world applications demonstrate the effectiveness of the algorithm.

Abstract:
To address the data sparsity issue, conventional graph-based models leverage structural signals from the interaction graph to embed users’ interests. However, these models learn a uniform representation for interest modeling, which blends users’ diverse intents and inevitably biases interest learning, hindering recommendations. Although the fine-grained paradigm can learn the intents of interactions separately to alleviate learning bias, the relationships among intents and the manner of disentanglement require elaborate design. Existing fine-grained models emphasize intent diversity and employ additional data splitting for disentanglement, which ignores the hierarchical relationship, exacerbates data sparsity, and increases the computational burden. To address these issues, we explore hierarchical intents and adaptive intent learning, proposing a hierarchical intent-based interest disentanglement (HIID) model for personalized recommendation. HIID introduces learnable intent queries to guide interest disentanglement from global interactions in a split-free manner. It raises a hierarchical intent hypothesis to involve hierarchical collaborative filtering (CF) signals for interest modeling, where intents within the same level appear relatively diverse, and the in-depth intents are abstracted from the superficial ones. Both adaptive intent learning and the hierarchical hypothesis help extract significant CF signals to promote personalized recommendation. Extensive experiments on public datasets show that the proposed HIID outperforms the state-of-the-art CF models for recommendation. Furthermore, HIID implements adaptive interest disentanglement in a split-free manner, improving the training efficiency of the recommender model compared to the existing fine-grained interest models.

Abstract:
Knowledge-aware recommendations improve performance by using knowledge graphs as auxiliary information. Recently, researchers have introduced the contrastive learning paradigm in knowledge-aware recommendations to enhance representation learning. However, most contrastive learning methods rely on manually or randomly generated knowledge views, making it challenging to generalize to different data distributions and alleviate knowledge noise effects. To solve these issues, we propose a mask diffusion-based contrastive learning method for knowledge-aware recommendation. Specifically, we apply local masked input to the diffusion model, using a mask prediction paradigm to adaptively generate views from both global and local perspectives, thereby enhancing the model’s generalization capability across different data distributions. Additionally, we propose a conditional inference process, leveraging user intentions to provide reasonable denoising guidance. At the same time, we design a collaborative knowledge diffusion loss aimed at improving the consistency between generated data and user behavior patterns. In this way, we combine the diffusion model with contrastive learning for the knowledge-aware recommendation, which can improve the generalization ability of the model. Our experimental results on four datasets show the effectiveness of our model.

Abstract:
Semi-supervised multi-label learning (SSMLL) involves learning a multi-label classifier from a small set of labeled data and a large set of unlabeled data. Label enhancement (LE), accounting for the relative importance of labels, has been effective in improving the performance of supervised multi-label learning models. Nevertheless, generating a robust SSMLL model with LE based on incomplete label information remains challenging. In this paper, we pioneer the idea of applying LE to SSMLL. First, we design a kNN aggregation-based method, aiming to assign pseudo-labels to unlabeled data and perform the LE process by aggregating label information from neighboring instances. Leveraging the topological structure of the feature space is an effective LE approach for training. However, LE, decoupled from the training process, lacks the dynamic feedback of the training model. To improve this, we incorporate a label propagation mechanism that iteratively optimizes the LE process with the guidance of the available label information. Moreover, we consider local label correlations according to local linear embedding to further enhance the generalization ability of the learning model. Extensive experiments demonstrate that the proposed approach can effectively recover latent label information, resulting in significant performance improvement in SSMLL.

Abstract:
Multi-label classification represented by hierarchical classification (HC) plays an important role in current large-scale problems, as it can acquire a more accurate expression of data that conforms to the human multi-granularity cognitive process. To compress the original dataset and simultaneously enhance the expressive power of models, selecting an appropriate granularity for approximately describing the classification is the main task in rough set theory. Nevertheless, current rough set theory concerns only flat classification and encounters new problems when approximately describing HC. 1) There is no measure that correctly reflects misclassification in accordance with the hierarchical accuracy of HC on the training set. 2) There is no measure relying on the distribution of the dataset that reflects the difference in generalization ability between two distinct feature sets describing HC. To address these issues, this paper utilizes the knowledge distance to characterize HC and proposes a cost-sensitive granularity selection for HC. First, HC and features are respectively granulated according to hierarchical quotient space and neighborhood granular structures. Then, knowledge distance and its extended form are employed to formulate misclassification and test costs. On this basis, a cost-sensitive neighborhood granularity selection is presented for HC. Finally, we experimentally demonstrate the excellent performance of the proposed method in terms of efficiency and HC accuracy on both synthetic and real datasets.

Abstract:
Although significant progress has been made in multi-view learning over the past few decades, it remains challenging, especially in the context of incomplete multi-view clustering, where modeling complex correlations among different views and handling missing data are key difficulties. In this paper, we propose a novel incomplete multi-view clustering network to address these issues, named Incomplete Multi-view Clustering via Multi-level Contrastive Learning (IMC-MCL). Specifically, the proposed model aims to minimize the conditional entropy between views to recover missing data via a dual prediction strategy. Moreover, the approach learns multi-level features, including latent, high-level and semantic features, with the goal of satisfying both reconstruction and consistency objectives in distinct feature spaces. In particular, latent features are utilized to accomplish the reconstruction objective, while high-level features and semantic labels are employed to achieve the two consistency goals through contrastive learning. This framework enables the exploration of shared semantics within high-level features and achieves clustering assignment using semantic features. Extensive experiments show that the proposed approach outperforms other state-of-the-art incomplete multi-view clustering methods on seven challenging datasets.

Abstract:
Data summarization aims at utilizing a small-scale summary to represent massive datasets as a whole, which is useful for visualization and information snippet generation. However, most existing studies of hierarchical summarization only work on a single tree by selecting k representative nodes, which neglects the important problem of comparative summarization on two trees. In this paper, given two trees with the same topology and different node weights, we aim at finding k representative nodes, where k_1 nodes summarize the common relationship between them and k_2 nodes highlight significantly different subtrees, subject to k_1 + k_2 = k. To optimize summarization results, we introduce a scaling coefficient for balancing the summary view between two subtrees in terms of similarity and difference. Additionally, we propose a novel definition based on the Hellinger distance to quantify the node distribution difference between two subtrees. We present a greedy algorithm, SVDT, to efficiently find high-quality results with an approximation guarantee. Furthermore, we explore an extension of our comparative summarization to handle two trees with different structures. Extensive experiments demonstrate the effectiveness and efficiency of our SVDT algorithm against existing summarization competitors.
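
For reference, the standard Hellinger distance between two discrete distributions P and Q is H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1]. The minimal Python sketch below computes it for two subtrees whose node weights are first normalized into distributions. The normalization step and the function names are assumptions made for illustration; the paper's own definition built on this distance may differ.

```python
import math


def normalize(weights):
    """Turn non-negative node weights into a probability distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)
```

Because both trees share the same topology, corresponding nodes can be aligned by position, so `hellinger(normalize(w1), normalize(w2))` compares the weight distributions of two matching subtrees.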

Abstract:
Domain adaptation has found widespread applications in real-life scenarios, especially when the target domain has limited labeled samples. However, most domain adaptation models utilize only one type of knowledge from the source domain, which is usually achieved through the strong mode of convergence. To fully incorporate multiple types of knowledge from the source domain, this paper studies, for binary classification, a novel learning paradigm, Domain Adaptation via Learning Using Statistical Invariant (DLUSI), which simultaneously combines the strong and weak modes of convergence in a Hilbert space. The strong mode of convergence undertakes the mission of learning a least squares probability output binary classification task in a general hypothesis space, while the weak mode of convergence integrates diverse knowledge by constructing meaningful statistical invariants that embody the concept of intelligence. The utilization of weak convergence shrinks the admissible set of approximation functions, and subsequently accelerates the learning process. In this paper, several statistical invariants that represent sample, feature and parameter information from the source domain are constructed. By taking an appropriate statistical invariant, DLUSI realizes some existing methods. Experimental results on synthetic data as well as the widely used Amazon Reviews and 20 News data demonstrate the superiority of the proposed method.

Abstract:
Eliminating bias from data representations is crucial to ensure fairness in recommendation. Existing studies primarily focus on weakening the correlation between data representations and sensitive attributes, yet may inadvertently steer the user representations toward another potential bias direction of the target attribute. Furthermore, they often overlook the impact of user preferences on capturing sensitive information, incurring inadequate bias elimination. In this paper, we propose a Fair Counterfactual Representations (FairCoRe) learning framework, which aims to ensure the neutrality of representations among all bias directions. First, we intervene on sensitive attributes to construct a counterfactual scenario. Then, two opposing attribute prediction tasks are respectively performed in ground-truth and counterfactual scenarios to encode sensitive information along different bias directions. Second, we design a bias-aware enhancement learning method that quantifies the respective correlation of user preferences and sensitive attributes to enhance sensitive information encoding. Finally, we introduce two mutual information optimization methods that optimize the representations to capture users’ interests and disentangle sensitive factors. Moreover, we propose an attribute neutralization strategy that refines the learned representations, ensuring sensitive attribute neutrality. Extensive experiments demonstrate that our method achieves the optimal fairness and competitive accuracy compared to state-of-the-art methods.

Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose In-Context Molecule Adaptation (ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar informative context examples. Additionally, Post-retrieval Re-ranking is composed of Sequence Reversal and Random Walk selection to further improve the quality of retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.

Abstract:
Graph-based clustering has garnered significant attention because pairwise graph similarity characterizes information precisely. Nevertheless, the post-processing step in traditional methods often limits clustering performance because of crucial information loss. Therefore, the Constrained Laplacian Rank (CLR) theory emerged to directly obtain discrete labels from an optimally structured graph, achieving desirable outcomes. However, CLR suffers from substantial time overhead, making it infeasible for large-scale data analysis. To overcome this issue, we propose Anchor-based CLR (ACLR), a simple yet effective method for efficient large-scale clustering. The ACLR method comprises four stages: (1) anchors that roughly cover the original data are selected to prepare bipartite graph construction; (2) a novel two-step probability transition (TSPT) strategy initializes a small-scale graph with random walk probability among anchors; (3) the main ACLR model alternately optimizes the connected structure of the graph and directly produces discrete anchor labels, achieving a time complexity independent of the number of samples due to the dramatically reduced graph scale; and (4) labels are propagated from anchors to samples using the K-NN algorithm. Extensive experiments demonstrate that ACLR yields superior accuracy and efficiency, particularly when applied to large-scale data.
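
Stage (4) above, propagating labels from anchors to samples, can be sketched in plain Python as a K-NN majority vote. This is a generic illustration under stated assumptions: squared Euclidean distance, a simple majority vote, ties broken in favor of the nearest anchor, and the function name `propagate_labels` are all choices made here, not details confirmed by the paper.

```python
from collections import Counter


def propagate_labels(samples, anchors, anchor_labels, k=3):
    """Assign each sample the majority label among its k nearest anchors."""
    labels = []
    for x in samples:
        # Squared Euclidean distance from x to every anchor, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, anchor)), idx)
            for idx, anchor in enumerate(anchors)
        )
        votes = Counter(anchor_labels[idx] for _, idx in dists[:k])
        # most_common keeps first-seen order among ties, i.e. the nearest anchor.
        labels.append(votes.most_common(1)[0][0])
    return labels
```

Because only the anchors carry labels from the clustering stage, the cost of this step grows with the number of samples times the number of anchors, which stays small when anchors are few.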

Abstract:
Transactional stream processing engines (TSPEs) are central to modern stream applications handling shared mutable states. However, their full potential, particularly in adaptive scheduling, remains largely unexplored. We present MorphStream, a TSPE designed to optimize parallelism and performance for transactional stream processing on multicores. Through a unique three-stage execution paradigm (i.e., planning, scheduling, and execution), MorphStream enables adaptive scheduling under varying workload characteristics. Building on this foundation, MorphStream is further enhanced with support for non-deterministic state access, employing a stateful task precedence graph to handle undefined read/write sets at runtime while guaranteeing transaction semantics. Additionally, MorphStream incorporates a generalized framework for managing window-based operations, enabling efficient tracking and maintenance of overlapping windows using multi-versioned state management. These extensions enhance the system’s ability to process dynamic and irregular workloads. Experimental results demonstrate up to 3.4 times higher throughput and 69.1% lower latency compared to state-of-the-art TSPEs, validating its scalability and adaptability in real-world streaming scenarios.

Abstract:
This paper proposes a deep pseudo-label method for unsupervised feature selection, which learns non-linear representations to generate pseudo-labels and trains a Neural Network (NN) to select informative features via self-Knowledge Distillation (KD). Specifically, the proposed method divides a standard NN into two sub-components: an encoder and a predictor, and introduces a dependency subnet. It works by self-supervised pre-training the encoder to produce informative representations and then alternating between two steps: (1) learning pseudo-labels by combining the clustering results of the encoder's outputs with the NN's prediction outputs, and (2) updating the NN's parameters by globally selecting a subset of features to predict the pseudo-labels while updating the subnet's parameters through self-KD. Self-KD is achieved by encouraging the subnet to locally capture a subset of the NN features to produce class probabilities that match those produced by the NN. This allows the model to self-absorb the learned inter-class knowledge and evaluate feature diversity, removing redundant features without sacrificing performance. Meanwhile, the potential discriminative capability of an NN can also be self-excavated without the assistance of other NNs. The two alternating steps reinforce each other: in step (2), by predicting the learned pseudo-labels and conducting self-KD, the discrimination of the outputs of both the NN and the encoder is gradually enhanced, while the self-labeling method in step (1) leverages these two improvements to further refine the pseudo-labels for step (2), resulting in superior performance. Extensive experiments show the proposed method significantly outperforms state-of-the-art methods across various datasets.

Affiliations: WeBank, Shenzhen, China; Southwestern University of Finance and Economics, Chengdu, China; Universiti Malaya, Kuala Lumpur, Malaysia; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Duke Kunshan University, Kunshan, China; Institute of Innovation, E Fund Management Company Ltd., Guangzhou, China; KTH Royal Institute of Technology, Stockholm, Sweden; Nanyang Technological University, Singapore; Beihang University, Beijing, China; Shanghai Jiao Tong University, Shanghai, China; Xi’an Jiaotong University, Xi’an, China; Huazhong University of Science and Technology, Wuhan, China; Academy for Artificial Intelligence, Hong Kong Polytechnic University, Kowloon, Hong Kong

Abstract:
Federated Foundation Models (FedFMs) represent a distributed learning paradigm that fuses the general competences of foundation models with the privacy-preserving capabilities of federated learning. This combination allows the large foundation models and the small local domain models at the remote clients to learn from each other in a teacher-student learning setting. This paper provides a comprehensive summary of the ten challenging problems inherent in FedFMs, encompassing foundational theory, utilization of private data, continual learning, unlearning, Non-IID and graph data, bidirectional knowledge transfer, incentive mechanism design, game mechanism design, model watermarking, and efficiency. The ten challenging problems manifest in five pivotal aspects: “Foundational Theory,” aiming to establish a coherent and unifying theoretical framework for FedFMs; “Data,” addressing the difficulties in leveraging domain-specific knowledge from private data while maintaining privacy; “Heterogeneity,” examining variations in data, model, and computational resources across clients; “Security and Privacy,” focusing on defenses against malicious attacks and model theft; and “Efficiency,” highlighting the need for improvements in training, communication, and parameter efficiency. For each problem, we offer a clear mathematical definition of the objective function, analyze existing methods, and discuss the key challenges and potential solutions. This in-depth exploration aims to advance the theoretical foundations of FedFMs, guide practical implementations, and inspire future research to overcome these obstacles, thereby enabling robust, efficient, and privacy-preserving FedFMs in various real-world applications.

Affiliations: College of Information Engineering, Northwest A&F University, Yangling, China; School of Software Engineering, South China University of Technology, Guangzhou, China; School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China; School of Information Science and Technology, Yunnan Normal University, Kunming, China; School of Computer Science, School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University, Xi’an, China

Abstract:
Most existing multi-view graph clustering models focus on integrating the topological structure of different views directly, which cannot efficiently stimulate the collaboration between multiple views. To alleviate this problem, this paper proposes a Triangle Topology Enhancement (T$^2$E) module, which expands two topological structures based on the raw topology of each view, including the self-triangle enhanced topology that highlights the local view information and the cross-view triangle enhanced topology containing the global-local view information. Afterward, this paper designs a novel multi-view graph clustering model, named MGC-T$^2$E, to integrate both the raw and derived topological structures and directly induce consistent clustering indicators based on a self-supervised clustering module. The experimental results demonstrate that MGC-T$^2$E achieves state-of-the-art performance compared with a large number of current competitors.

Abstract:
In the scenario of class-incremental learning (CIL), deep neural networks have to adapt their model parameters to non-stationary data distributions, e.g., the emergence of new classes over time. To mitigate the catastrophic forgetting phenomenon, typical CIL methods either cumulatively store exemplars of old classes for retraining model parameters from scratch or progressively expand model size as new classes arrive, which, however, compromises their practical value because little attention is paid to parameter efficiency. In this paper, we contribute a novel solution, effective control of the parameters of a well-trained model, by the synergy between two complementary learning subnetworks. Specifically, we integrate one plastic feature extractor and one analytical feed-forward classifier into a unified framework amenable to streaming data. In each CIL session, it achieves non-overwritten parameter updates in a cost-effective manner, neither revisiting old task data nor extending previously learned networks; instead, it accommodates new tasks by attaching a tiny set of declarative parameters to its backbone, in which only one matrix per task or one vector per class is kept for knowledge retention. Experimental results on a variety of task sequences demonstrate that our method achieves competitive results against state-of-the-art CIL approaches, especially in accuracy gain, knowledge transfer, training efficiency, and task-order robustness. Furthermore, a graceful forgetting implementation on previously learned trivial tasks is empirically investigated to make its non-growing backbone (i.e., a model with limited network capacity) suffice to train on more incoming tasks.

Abstract:
This study addresses the problem of convolutional kernel learning in univariate, multivariate, and multidimensional time series data, which is crucial for interpreting temporal patterns in time series and supporting downstream machine learning tasks. First, we propose formulating convolutional kernel learning for univariate time series as a sparse regression problem with a non-negative constraint, leveraging the properties of circular convolution and circulant matrices. Second, to generalize this approach to multivariate and multidimensional time series data, we use tensor computations, reformulating the convolutional kernel learning problem in the form of tensors. This is further converted into a standard sparse regression problem through vectorization and tensor unfolding operations. In the proposed methodology, the optimization problem is addressed using the existing non-negative subspace pursuit method, enabling the convolutional kernel to capture temporal correlations and patterns. To evaluate the proposed model, we apply it to several real-world time series datasets. On the multidimensional ridesharing and taxi trip data from New York City and Chicago, the convolutional kernels reveal interpretable local correlations and cyclical patterns, such as weekly seasonality. For the monthly temperature time series data in North America, the proposed model can quantify the yearly seasonality and make it comparable across different decades. In the context of multidimensional fluid flow data, both local and nonlocal correlations captured by the convolutional kernels can reinforce tensor factorization, leading to performance improvements in fluid flow reconstruction tasks. Thus, this study lays an insightful foundation for automatically learning convolutional kernels from time series data, with an emphasis on interpretability through sparsity and non-negativity constraints.
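
The link between circular convolution and circulant matrices mentioned above can be illustrated with a small sketch. It only shows the generic identity that a circular convolution equals a circulant matrix times a zero-padded kernel vector, which is what lets kernel estimation be posed as a linear regression problem. The helper names and the example kernel are hypothetical, and the paper's exact formulation and solver (non-negative subspace pursuit) are not reproduced here.

```python
import numpy as np

def circulant(x):
    """Return the circulant matrix whose first column is x."""
    n = len(x)
    # Column j is x cyclically shifted down by j positions,
    # so entry (i, j) equals x[(i - j) mod n].
    return np.column_stack([np.roll(x, j) for j in range(n)])

def circular_convolution(x, kernel):
    """Circular convolution via the FFT, with the kernel zero-padded to len(x)."""
    padded = np.zeros(len(x))
    padded[: len(kernel)] = kernel
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(padded)))

# Hypothetical toy series and a sparse, non-negative kernel (assumed values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.5, 0.0, 0.25])

padded_kernel = np.concatenate([kernel, np.zeros(len(x) - len(kernel))])
via_matrix = circulant(x) @ padded_kernel      # linear-model form
via_fft = circular_convolution(x, kernel)      # convolution form

print(np.allclose(via_matrix, via_fft))
```

Because the convolution is linear in the kernel entries, the kernel plays the role of a coefficient vector in a regression, and sparsity or non-negativity can then be imposed as constraints on that vector.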

Affiliations: College of Computer Science and Technology, National University of Defense Technology, Changsha, China; School of Business and Management, The Hong Kong University of Science and Technology, Hong Kong; Department of Computing (COMP) and Department of Management and Marketing (MM), The Hong Kong Polytechnic University, Hong Kong; Department of Data Science, City University of Hong Kong, Hong Kong; Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, China

Abstract:
Deep learning has been widely applied in recommender systems, which have recently achieved revolutionary progress. However, most existing learning-based methods assume that the user and item distributions remain unchanged between the training phase and the test phase. In real-world scenarios, the distribution of user and item features can naturally shift, potentially resulting in a substantial decrease in recommendation performance. This phenomenon can be formulated as an Out-Of-Distribution (OOD) recommendation problem. To address this challenge, we propose a novel Dual Test-Time-Training framework for OOD Recommendation, termed DT3OR. In DT3OR, we incorporate a model adaptation mechanism during the test-time phase to carefully update the recommendation model, allowing the model to adapt specifically to the shifting user and item features. To be specific, we propose a self-distillation task and a contrastive task to assist the model in learning both the user’s invariant interest preferences and the variant user/item characteristics during the test-time phase, thus facilitating a smooth adaptation to the shifting features. Furthermore, we provide theoretical analysis to support the rationale behind our dual test-time training framework. To the best of our knowledge, this paper is the first work to address OOD recommendation via a test-time-training strategy. We conduct experiments on five datasets with various backbones. Comprehensive experimental results demonstrate the effectiveness of DT3OR compared to other state-of-the-art baselines.

Abstract:
Effective recommender systems play a crucial role in accurately capturing user and item attributes that mirror individual preferences. Some existing recommendation techniques have started to shift their focus towards modeling various types of interactive relations between users and items in real-world recommendation scenarios, such as clicks, marking favorites, and purchases on online shopping platforms. Nevertheless, these approaches still grapple with two significant challenges: (1) insufficient modeling and exploitation of the impact of various behavior patterns formed by multiplex relations between users and items on representation learning, and (2) ignoring the effect of different relations within behavior patterns on the target relation in recommender system scenarios. In this work, we introduce a novel recommendation framework, Dual-Channel Multiplex Graph Neural Network (DCMGNN), which addresses the aforementioned challenges. It incorporates an explicit behavior pattern representation learner to capture the behavior patterns composed of multiplex user-item interactive relations, and includes a relation chain representation learner and a relation chain-aware encoder to discover the impact of various auxiliary relations on the target relation, capture the dependencies between different relations, and mine the appropriate order of relations in a behavior pattern. Extensive experiments on three real-world datasets demonstrate that our DCMGNN surpasses various state-of-the-art recommendation methods. It outperforms the best baselines by 10.06% and 12.15% on average across all datasets in terms of Recall@10 and NDCG@10, respectively. The source code of our paper is available at https://github.com/lx970414/TKDE-DCMGNN.
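
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of Recall@k and NDCG@k under one common definition with binary relevance. The function names and the toy ranking are assumptions for illustration; the abstract does not state which exact variant the authors' evaluation code uses.

```python
import math

def recall_at_k(ranked_items, relevant_items, k=10):
    """Share of the relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    # Gain 1 for each relevant item, discounted by log2(position + 1),
    # where position starts at 1 (rank index starts at 0).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # Ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy example: item "z" is relevant but never recommended.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "d", "z"}
print(recall_at_k(ranking, relevant, k=4))
print(ndcg_at_k(ranking, relevant, k=4))
```

Recall@k ignores the order within the top-k, while NDCG@k rewards placing relevant items nearer the top, which is why the two are usually reported together.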

Abstract:
The information available in multi-attributed road-social networks includes network structure, location information, and numerical attributes. Most studies mainly focus on mining communities by combining structure with attributes or structure with location, and do not consider structure, attributes, and location simultaneously. Therefore, we propose a parameter-free algorithm, called LCDMRS, to mine local communities in multi-attributed road-social networks. LCDMRS extracts a sub-network surrounding the given node and embeds it to generate vector representations of nodes that incorporate both structural and attribute information. Based on the vector representations of nodes, the average cosine similarity between nodes is designed to ensure both the structural and attribute cohesiveness of the community, while the community node density is designed to ensure the spatial cohesiveness of the community. Targeting the community node density and cosine similarity of nodes, LCDMRS takes the given node as the starting node and employs the community dominance relation to expand the community outward. Experimental results on multiple real-world datasets demonstrate that LCDMRS outperforms the compared algorithms.
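
The average cosine similarity used as a cohesiveness measure can be sketched as follows. This is a generic pairwise average over node embedding vectors; the function name and the toy embeddings are hypothetical, and the abstract does not specify how LCDMRS normalizes or weights the pairs, nor how the community node density is computed.

```python
import numpy as np

def average_cosine_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of distinct rows.

    Assumes every embedding has a non-zero norm.
    """
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    if n < 2:
        return 1.0  # a single node is trivially cohesive
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T                      # all pairwise cosine similarities
    upper = sim[np.triu_indices(n, k=1)]     # keep each unordered pair once
    return float(upper.mean())

# Hypothetical 2-dimensional embeddings of three community nodes.
community = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(average_cosine_similarity(community))
```

A candidate node that lowers this average when added would reduce the structural and attribute cohesiveness of the community under this measure.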

Abstract:
Subgraph learning has dominated most practices of improving the expressive power of Message Passing Neural Networks (MPNNs). Existing subgraph discovery policies can be classified into node-based and partition-based, both of which achieve impressive performance in most scenarios. However, both mainstream solutions still face a subgraph degradation trap. Subgraph degradation is reflected in the phenomenon that subgraph-level methods fail to offer any benefits over node-level MPNNs. In this work, we empirically investigate the existence of the subgraph degradation issue and introduce a unified perspective, perfect reconstruction, to provide insights for improving the two lines of methods. We further propose a subgraph learning strategy guided by the principle of perfect reconstruction. To achieve this, two major issues should be addressed: (i) how to ensure that the subgraphs possess ‘perfect’ information, and (ii) how to guarantee the ‘reconstruction’ power of the obtained subgraphs. First, we propose a subgraph partition strategy, Rayleigh-resistance, to extract non-overlapping subgraphs by leveraging graph spectral theory. Second, we put forward a Query mechanism to achieve subgraph-level equivariant learning, which guarantees subgraph reconstruction ability. These two parts, perfect subgraph partition and equivariant subgraph learning, are seamlessly unified as a novel Rayleigh-resistance Equivariant Subgraph learning architecture (RayE-Sub). Comprehensive experiments on both synthetic and real datasets demonstrate that our approach can consistently outperform previous subgraph learning architectures.

Abstract:
Large-scale, high-quality data are considered an essential factor for the successful application of many deep learning techniques. Meanwhile, numerous real-world deep learning tasks still have to contend with the lack of sufficient amounts of high-quality data. Additionally, issues such as model robustness, fairness, and trustworthiness are also closely related to training data. Consequently, a huge number of studies in the existing literature have focused on the data aspect in deep learning tasks. Some typical data optimization techniques include data augmentation, logit perturbation, sample weighting, and data condensation. These techniques usually come from different deep learning divisions, and their theoretical inspirations or heuristic motivations may seem unrelated to each other. This study aims to organize a wide range of existing data optimization methodologies for deep learning from the previous literature, and to construct a comprehensive taxonomy for them. The constructed taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension. On the basis of the taxonomy, connections among the extensive data optimization methods for deep learning are built in terms of five aspects. We also discuss several promising and interesting future directions. The constructed taxonomy and the revealed connections will support a better understanding of existing methods and the design of novel data optimization techniques. Furthermore, our aspiration for this survey is to promote data optimization as an independent subdivision of deep learning.

Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institute of Physical Science and Information Technology, Anhui University, Hefei, China; Zhejiang Dahua Technology Company, Ltd., Hangzhou, China; Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei, China; Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, and the School of Computer Science and Technology, Anhui University, Hefei, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract:
Knowledge tracing (KT) has been widely used in online learning systems to guide students’ future learning. However, most existing KT models primarily focus on extracting abundant information from the question sets and exploring the relationships between them, but ignore the personalized student behavioral information in the learning process. This limits the model’s ability to accurately capture the personalized knowledge states of students and reasonably predict their performance. To alleviate this limitation, we explicitly model the personalized learning process by incorporating emotions, a representative personalized behavior in the learning process, into the KT framework. Specifically, we present a novel Dual-State Personalized Knowledge Tracing with Emotional Incorporation model (DEKT) to achieve this goal. First, we incorporate emotional information into the modeling process of the knowledge state, resulting in the Knowledge State Boosting Module. Second, we design an Emotional State Tracing Module to monitor students’ personalized emotional states, and propose an emotion prediction method based on personalized emotional states. Finally, we apply the predicted emotions to enhance students’ response prediction. Furthermore, to extend the generalization capability of our model across different datasets, we design a transferred version of DEKT, named Transfer Learning-based Self-loop model (T-DEKT). Extensive experiments show that our method achieves state-of-the-art performance.

Abstract:
Dynamic ensemble has significantly greater potential to improve the classification of imbalanced data compared to static ensemble. However, dynamic ensemble schemes are far less successful than static ensemble methods in the imbalanced learning field. Through an in-depth analysis of the behavior characteristics of dynamic ensemble, we find that there are some important problems that need to be addressed to release the full potential of dynamic ensemble, including but not limited to correcting the component classifiers’ bias towards the majority classes, increasing the proportions of the positive classifiers (i.e., the component classifiers making correct predictions) for difficult samples, and providing accurate competence estimations on the hard-to-classify samples w.r.t. the classifier pool. Inspired by these, we propose a Dynamic Ensemble Framework for imbalanced data classification (imDEF). imDEF first uses the data generation method OREM$_\mathrm{G}$ to generate multiple artificial synthetic datasets, which have diverse class distributions, by rebalancing the original imbalanced data. Based on each of these synthetic datasets, imDEF then utilizes a Classification Error-aware Self-Paced Sampling Ensemble (SPSE$_\mathrm{CE}$) method to gradually focus more on difficult samples, to create a low-biased classifier pool and increase the proportions of the positive classifiers for the difficult samples. Finally, imDEF constructs a referee system to achieve the competence estimations by leveraging an Ensemble Margin-aware Self-Paced Sampling Ensemble (SPSE$_\mathrm{EM}$) method. SPSE$_\mathrm{EM}$ incrementally strengthens the learning of the hard-to-classify samples, so that the competence levels of component classifiers can be estimated accurately. Extensive experiments demonstrate the effectiveness of imDEF. The source code has been made publicly available on GitHub.

Abstract:
A multi-label classifier estimates the binary label state (relevant/irrelevant) for each of a set of concept labels, for a given instance. Probabilistic multi-label classifiers provide a distribution over all possible labelset combinations of such label states (the powerset of labels), from which we can provide the best estimate by selecting the labelset corresponding to the largest expected accuracy. Providing confidence for predictions is important for real-world application of multi-label models, as it gives the practitioner a sense of the correctness of the prediction. It has been thought that the probability of the chosen labelset is a good measure of the confidence of the prediction, but multi-label accuracy can be measured in many ways, and so confidence should align with the expected accuracy of the evaluation method. In this article, we investigate the effectiveness of seven candidate functions for estimating multi-label expected accuracy conditioned on the labelset distribution and the evaluation method. We found that most correlate with expected accuracy and have varying levels of robustness. Further, we found that the candidate functions provide high expected accuracy estimates for Hamming similarity, but a combination of the candidates provided an accurate estimate of expected accuracy for Jaccard index and Exact match.
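
Selecting the labelset with the largest expected accuracy can be sketched as a brute-force search over the powerset. This is only an illustration of the definition, using textbook forms of Hamming similarity, Jaccard index, and Exact match; the function names and the toy distribution are assumptions, and the seven candidate confidence functions studied in the article are not reproduced here.

```python
from itertools import combinations

def hamming_similarity(pred, true, labels):
    """Fraction of labels whose relevant/irrelevant state agrees."""
    return 1.0 - len(pred ^ true) / len(labels)

def jaccard_index(pred, true, labels):
    """Intersection over union; two empty labelsets count as a perfect match."""
    union = pred | true
    return 1.0 if not union else len(pred & true) / len(union)

def exact_match(pred, true, labels):
    """1 if the labelsets are identical, else 0."""
    return 1.0 if pred == true else 0.0

def expected_accuracy(pred, distribution, metric, labels):
    """Expectation of the metric when the true labelset follows the distribution."""
    return sum(p * metric(pred, true, labels) for true, p in distribution.items())

def best_labelset(distribution, metric, labels):
    """Brute-force search over the powerset (feasible only for few labels)."""
    candidates = [
        frozenset(c)
        for r in range(len(labels) + 1)
        for c in combinations(labels, r)
    ]
    return max(candidates, key=lambda c: expected_accuracy(c, distribution, metric, labels))

# Hypothetical labelset distribution over two labels (unlisted labelsets have probability 0).
labels = ("A", "B")
distribution = {
    frozenset({"A"}): 0.40,
    frozenset({"B"}): 0.35,
    frozenset({"A", "B"}): 0.25,
}

for metric in (exact_match, hamming_similarity, jaccard_index):
    choice = best_labelset(distribution, metric, labels)
    print(metric.__name__, sorted(choice), expected_accuracy(choice, distribution, metric, labels))
```

In this toy case the best labelset under Hamming similarity and Jaccard index is {A, B}, whose own probability is only 0.25 while its expected accuracy is higher, which illustrates why labelset probability alone can be a poor confidence measure.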

Affiliations: Research Center of Language Technology, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China; Research Center of Artificial Intelligence of Things, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China; Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Research Center of Artificial Intelligence of Things, Faculty of Computing, Harbin Institute of Technology, Shenzhen, Guangdong, China

Abstract:
It is a long-standing question to discover causal relations from observed variables in many empirical sciences. However, current causal discovery methods are inefficient when dealing with large-scale observed variables due to challenges in conditional independence (CI) tests or complex computations of acyclicity, and may even fail altogether. To address the efficiency issue in causal discovery from large-scale observed variables, we propose a Hierarchical Causal Discovery (HCD) framework with a bilevel policy that handles this issue by boosting existing models. Specifically, the high-level policy first finds a causal cut set to partition observed variables into several causal clusters and releases the clusters to the low-level policy. The low-level policy applies any causal discovery method to process these causal clusters in parallel and obtain intra-cluster structures for subsequent inter-cluster structure merging in the high-level policy. To avoid missing inter-cluster edges, we theoretically demonstrate the feasibility of causal cluster cut and inter-cluster structure merging. We also prove the completeness and correctness of HCD for causal discovery. Experiments on both synthetic and real-world datasets demonstrate that HCD consistently and significantly enhances the efficiency and effectiveness of existing advanced methods.

Abstract:
Real-world data is typically a noisy manifestation of a core pattern (schema), and the purpose of data mining algorithms is to uncover that pattern, thereby splitting (i.e., decomposing) the data into schema and noise. We introduce SCHENO, a principled evaluation metric for the goodness of a schema-noise decomposition of a graph. SCHENO captures how schematic the schema is, how noisy the noise is, and how well the combination of the two represents the original graph data. We visually demonstrate what this metric prioritizes in small graphs, then show that if SCHENO is used as the fitness function for a simple optimization strategy, we can uncover a wide variety of patterns. Finally, we evaluate several well-known graph mining algorithms with this metric; we find that although they produce patterns, those patterns are not always the best representation of the input data.

Abstract:
Traffic prediction is an essential task in intelligent transportation systems dealing with complex and dynamic spatio-temporal correlations. To date, most work has focused on point estimation models, which only output a single value w.r.t. an attribute of traffic data at a time, falling short of depicting diverse situations and uncertainty in the future. Besides, most methods are not flexible enough to handle real complex traffic scenarios involving missing values and non-uniformly sampled data. The interactions among different attributes of traffic data are also rarely explored explicitly. In this paper, we focus on probabilistic estimation in traffic prediction tasks, proposing a spatio-temporal multivariate probabilistic predictive model to estimate the distributions of traffic data. Specifically, we devise a multivariate spatio-temporal fusion graph block to extract spatio-temporal correlations of multiple traffic attributes at different locations. A multi-graph fusion module is designed to capture time-varying spatial relationships. We estimate the joint distributions of missing traffic data using copulas. The proposed model can simultaneously perform traffic forecasting and interpolation tasks with non-uniformly sampled data. Our experiments on two real-world traffic datasets demonstrate the advantages of our model over the state-of-the-art.

Abstract:
Existing work on knowledge graph (KG) link prediction has primarily focused on a single KG. However, a single KG is often limited by its incompleteness, encompassing missing facts, entities, and relations. This limitation subsequently restricts the practicality, as it cannot handle the queries that involve missing entities or relations within the single KG. In this article, we explore an extended link prediction task, cross-KG link prediction, which answers queries using entities or relations integrated from other KGs. The crux of this problem is transferring knowledge across KGs and fusing their embedding spaces, which possess varying schemata. We develop a relation prototype graph to model the interactions among relations from different KGs. Based on this graph, we first propose a dual-view embedding learning module to fuse embedding spaces by training with instance facts and relation prototype edges. We then introduce an attention mechanism to highlight pivotal information for specific queries, recognizing that different KGs often emphasize various domains. Moreover, we devise an augmentation strategy to generate pseudo-cross-KG facts, facilitating knowledge transfer across KGs. Using four widely-used KGs, we construct two cross-KG link prediction datasets. Extensive experimental results demonstrate the superiority of our model and the unique contributions of each module.

Abstract:
Link prediction, which aims to predict the existence of a link between two nodes in a network, has various applications ranging from friend recommendation to protein interaction prediction. Recently, Graph Neural Network (GNN)-based link prediction has demonstrated its advantages and achieved state-of-the-art performance. Typically, GNN-based link prediction can be formulated as a binary classification problem. However, in link prediction, we only have positive data (observed links) and unlabeled data (unobserved links), but no negative data. Therefore, Positive Unlabeled (PU) learning naturally fits the link prediction scenario. Unfortunately, the unknown class prior and data imbalance of networks impede the use of PU learning in link prediction. To deal with these issues, this paper proposes a novel model-agnostic PU learning algorithm for GNN-based link prediction by means of Positive-Unlabeled Area Under the Receiver Operating Characteristic Curve (PU-AUC) optimization. The proposed method is free of class prior estimation and can handle data imbalance. Moreover, we propose an accelerated method to reduce the operational complexity of PU-AUC optimization from quadratic to approximately linear. Extensive experiments back up our theoretical analysis and validate that the proposed method is capable of boosting the performance of state-of-the-art GNN-based link prediction models.
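
As a rough illustration of the quantity involved, the sketch below estimates an AUC between positive and unlabeled scores in two ways: a quadratic pairwise count and an equivalent sort-based count. This is the standard Mann-Whitney style estimator, not the paper's accelerated optimization method, and the function names and scores are hypothetical. A standard observation, stated here as background rather than as the paper's argument, is that if a fraction π of the unlabeled pairs are actually positive and scores are continuous, the expected PU-AUC equals π/2 + (1 − π)·AUC, an increasing function of the ordinary AUC, so ranking models by PU-AUC does not require knowing π.

```python
import numpy as np

def pairwise_auc(pos_scores, unl_scores):
    """Fraction of (positive, unlabeled) pairs ranked correctly; ties count 0.5.

    Compares every pair explicitly, so the cost is quadratic.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    unl = np.asarray(unl_scores, dtype=float)[None, :]
    return float(np.mean((pos > unl) + 0.5 * (pos == unl)))

def sorted_auc(pos_scores, unl_scores):
    """Same value computed with one sort plus binary searches."""
    pos = np.asarray(pos_scores, dtype=float)
    unl = np.sort(np.asarray(unl_scores, dtype=float))
    below = np.searchsorted(unl, pos, side="left")       # unlabeled scores strictly lower
    below_or_equal = np.searchsorted(unl, pos, side="right")
    ties = below_or_equal - below
    return float((below + 0.5 * ties).sum() / (len(pos) * len(unl)))

# Hypothetical link scores from some model.
positive_scores = [0.9, 0.8, 0.4]
unlabeled_scores = [0.7, 0.4, 0.3, 0.1]
print(pairwise_auc(positive_scores, unlabeled_scores))
print(sorted_auc(positive_scores, unlabeled_scores))
```

The sort-based form only evaluates the metric; making a differentiable training objective scale similarly is the harder problem the abstract refers to.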

Abstract:
Hypergraphs can naturally model group-wise relations (e.g., a group of users who co-purchase an item) as hyperedges. Hyperedge prediction is to predict future or unobserved hyperedges, which is a fundamental task in many real-world applications (e.g., group recommendation). Despite the recent breakthrough of hyperedge prediction methods, the following challenges have been rarely studied: (C1) How to aggregate the nodes in each hyperedge candidate for accurate hyperedge prediction? and (C2) How to mitigate the inherent data sparsity problem in hyperedge prediction? To tackle both challenges together, in this paper, we propose a novel hyperedge prediction framework (CASH) that employs (1) context-aware node aggregation to precisely capture complex relations among nodes in each hyperedge for (C1) and (2) self-supervised contrastive learning in the context of hyperedge prediction to enhance hypergraph representations for (C2). Furthermore, as for (C2), we propose a hyperedge-aware augmentation method to fully exploit the latent semantics behind the original hypergraph and consider both node-level and group-level contrasts (i.e., dual contrasts) for better node and hyperedge representations. Extensive experiments on six real-world hypergraphs reveal that CASH consistently outperforms all competing methods in terms of the accuracy in hyperedge prediction and that each of the proposed strategies is effective in improving the model accuracy of CASH.

Abstract:
Contrastive Learning (CL)-based recommender systems have gained prominence in the context of Heterogeneous Graphs (HGs) due to their capacity to enhance the consistency of representations across different views. However, existing frameworks often neglect the fact that user-item interactions within HGs are governed by diverse latent intents (e.g., brand preferences or demographic characteristics of item audiences), which are pivotal in capturing fine-grained relations. The exploration of these underlying intents, particularly through the lens of meta-paths in HGs, presents us with two principal challenges: i) how to integrate CL with intents; ii) how to mitigate noise from meta-path-driven intents. To address these challenges, we propose an innovative framework termed Intent-guided Heterogeneous Graph Contrastive Learning (IHGCL), which is designed to enhance CL-based recommendation by capturing the intents contained within meta-paths. Specifically, the IHGCL framework includes: i) a meta-path-based Dual Contrastive Learning (DCL) approach to effectively integrate intents into the recommendation, constructing intent-intent contrast and intent-interaction contrast; ii) a Bottlenecked AutoEncoder (BAE) that combines mask propagation with the information bottleneck principle to significantly reduce noise perturbations introduced by meta-paths. Empirical evaluations conducted across six distinct datasets demonstrate the superior performance of our IHGCL framework relative to conventional baseline methods.

Abstract:
Machine unlearning involves retracting data records and reducing their influence on trained models, which aids user privacy protection but potentially incurs a significant computational cost. Weight perturbation-based unlearning is common but typically modifies parameters globally. We propose fine-grained Top-K and Random-k parameter-perturbed inexact machine unlearning strategies that address privacy needs while keeping computational costs tractable. However, because commonly used training data are independent and identically distributed, current metrics are inadequate for quantifying the degree of unlearning achieved by inexact machine unlearning. To address this quantification issue, we introduce SPD-GAN, which subtly perturbs the distribution of the data targeted for unlearning. We then evaluate the unlearning degree by measuring the performance difference of the models on the perturbed unlearning data before and after unlearning. Furthermore, to demonstrate efficacy, we tackle the challenge of evaluating machine unlearning by assessing model generalization across unlearning and remaining data. To better assess the unlearning effect and model generalization, we propose novel metrics, namely, the forgetting rate and memory retention rate. By implementing these techniques and metrics, we achieve computationally efficacious privacy protection in machine learning applications without significant sacrifice of model performance. A by-product of our work is a novel method for evaluating and quantifying the unlearning degree.
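
The idea of perturbing only a selected subset of parameters, rather than all of them, can be sketched as follows. Everything specific here is an assumption made for illustration: the abstract does not say how the Top-K parameters are ranked or what noise is applied, so this sketch ranks by the absolute value of a generic per-parameter score and adds Gaussian noise; the function names are hypothetical.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k entries with the largest absolute score (assumed criterion)."""
    return np.argpartition(-np.abs(scores), k - 1)[:k]

def random_k_indices(n, k, rng):
    """Indices of k entries chosen uniformly at random without replacement."""
    return rng.choice(n, size=k, replace=False)

def perturb_selected(params, indices, noise_scale, rng):
    """Return a copy of params with Gaussian noise added only at the given indices."""
    out = params.copy()
    out[indices] += rng.normal(0.0, noise_scale, size=len(indices))
    return out

rng = np.random.default_rng(0)
params = np.array([0.1, -2.0, 0.3, 1.5, -0.05])   # hypothetical flattened weights

# Here the parameters themselves serve as the ranking score (an assumption).
selected = top_k_indices(params, k=2)
perturbed_top = perturb_selected(params, selected, noise_scale=0.1, rng=rng)
perturbed_rand = perturb_selected(
    params, random_k_indices(len(params), 2, rng), noise_scale=0.1, rng=rng
)
print(sorted(selected.tolist()), perturbed_top)
```

Restricting the update to k entries keeps the remaining parameters bit-for-bit identical, which is the fine-grained contrast with global weight perturbation that the abstract draws.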

Abstract:
Class imbalance learning is a challenging task in machine learning applications. To balance training data, traditional class imbalance learning approaches, such as class resampling or reweighting, are commonly applied in the literature. However, these methods can have significant limitations, particularly in the presence of noisy data, missing values, or when applied to advanced learning paradigms like semi-supervised or federated learning. To address these limitations, this paper proposes a novel and theoretically-ensured latent Feature Rectification method for clAss iMbalance lEarning (FRAME). The proposed FRAME can automatically learn multiple centroids for each class in the latent space and then perform class balancing. Unlike data-level methods, FRAME balances features in the latent space rather than the original space. Compared to algorithm-level methods, FRAME can distinguish different classes based on distance without the need to adjust the learning algorithms. Through latent feature rectification, FRAME can effectively mitigate noise contamination and missing values without worrying about structural variations in the data. In order to accommodate a wider range of applications, this paper extends FRAME to the following three main learning paradigms: fully-supervised learning, semi-supervised learning, and federated learning. Extensive experiments on 10 binary-class datasets demonstrate that FRAME achieves performance competitive with state-of-the-art methods and is robust to noise and missing values.

Abstract:
Real-time bidding (RTB) plays a pivotal role in online advertising ecosystems. Advertisers employ strategic bidding to optimize their advertising impact while adhering to various financial constraints, such as the return-on-investment (ROI) and cost-per-click (CPC). Because they primarily focus on bidding with fixed budget constraints, traditional approaches cannot effectively manage the dynamic budget allocation problem, where the goal is to achieve global optimization of bidding performance across multiple channels with a shared budget. In this paper, we propose a hierarchical multi-agent reinforcement learning framework for multi-channel bidding optimization. In this framework, the top-level strategy applies a CPC-constrained diffusion model to dynamically allocate budgets among the channels according to their distinct features and complex interdependencies. The bottom-level strategy adopts a state-action decoupled actor-critic method to address the problem of extrapolation errors in offline learning caused by out-of-distribution actions, together with a context-based meta-channel knowledge learning method to improve the state representation capability of the policy based on the shared knowledge among different channels. Comprehensive experiments conducted on a large-scale real-world industrial dataset from the Meituan ad bidding platform demonstrate that our method achieves state-of-the-art performance.

Abstract:
Hyperbolic space-based collaborative filtering has emerged as a popular topic in recommender systems. Compared to Euclidean space, hyperbolic space is better suited to the tree-like structures in user-item interactions and can achieve better recommendation performance. Although some works have been devoted to this topic and have made some progress, they use the tangent space as an approximation of hyperbolic space to implement their models. Despite their effectiveness, such methods fail to fully exploit the advantages of hyperbolic space and still suffer from the data sparsity issue, which severely limits recommendation performance. To tackle these problems, we draw on self-supervised learning and propose a novel Hyperbolic Graph Contrastive Learning (HyperCL) framework. Specifically, our framework encodes the augmentation views from both the tangent space and the hyperbolic space, and constructs the contrast pairs based on their corresponding learned node representations. Our model not only leverages the geometric advantages of both sides but also achieves seamless information transmission between the two spaces. Extensive experimental results on public benchmark datasets demonstrate that our model is highly competitive and outperforms leading baselines by considerable margins. Further experiments validate the robustness and superiority of the contrastive learning paradigm.

Abstract:
Most current session-based recommendations model session sequences solely based on the user's target behavior, ignoring the user's hidden preferences in auxiliary behaviors. Additionally, they use ordinary graphs to model one-to-one item correlations in the current session and fail to leverage other sessions to learn richer higher-order item correlations. To address these issues, a multi-behavior hypergraph contrastive learning model for session-based recommendations is proposed. This model represents all the sessions as global hypergraphs according to two types of behavior sequences. It employs contrastive learning to obtain global item embeddings, which are further aggregated to generate a global session representation that captures higher-order correlations of items from all session perspectives. A novel local heterogeneous hypergraph is designed for the current session to capture higher-order correlations between items with different behaviors in the current session, thus enhancing the local session representation. Additionally, a novel self-supervised signal is created by constructing a multi-behavior line graph, enhancing the global session representation. Finally, the local session representation, global session representation, and global item embedding are used to learn the predicted interaction probability of each item. Extensive experiments are conducted on three real datasets, and the results demonstrate that the proposed model significantly improves recommendation accuracy.

Abstract:
Many real-world networks are signed networks with positive and negative edge weights, such as social networks with positive (friend) or negative (foe) relationships between users, and gene interaction networks with positive (stimulatory) or negative (inhibitory) interactions between genes. A well-known data mining task in signed networks is to find groups of antagonistic communities, where the vertices in the same community have a strong positive relationship and the vertices in different communities have a strong negative relationship. Most existing methods find antagonistic communities by modelling a signed network as a static graph with constant positive and negative edge weights. However, since the relationship between vertices is often uncertain in many real-world networks, it is more practical and accurate to capture the uncertainty of the relationship in the network by a signed uncertain graph (SUG), where each edge is independently associated with a discrete probability distribution of signed edge weights. How to find groups of antagonistic communities in a SUG is a challenging data mining task that has not been systematically tackled before. In this paper, we propose a novel method to tackle this task. We first model a group of antagonistic communities by a set of subgraphs, where the vertices in the same subgraph have a large expectation of positive edge weights and the vertices in different subgraphs have a large expectation of negative edge weights. Then, we propose a method to efficiently find significant groups of antagonistic communities by restricting all the computations on small local subgraphs of the SUG. Extensive experiments on seven real-world datasets and a synthetic dataset demonstrate the outstanding effectiveness and efficiency of the proposed method.

Abstract:
Attributed network embedding seeks to depict each network node via a compact, low-dimensional vector while effectively preserving the similarity between node pairs, which lays a strong foundation for a great many high-level network mining tasks. With the advent of the era of Big Data, the number of nodes and edges has reached billions in many real-world networks, which poses great computational and storage challenges to the existing methods. Although some algorithms have been developed to handle billion-scale networks, they often undergo accuracy degradation or tempo-spatial inefficiency owing to attribute information loss or substantial parameter learning. To this end, we propose a simple, time- and space-efficient billion-scale attributed network embedding algorithm called SketchBANE in this paper, which strikes an excellent balance between accuracy and efficiency by adopting sparse random projection with 1-bit quantization to sketch the iterative closed neighborhood and maintain the similarity among high-order nodes in a non-learning manner. The extensive experimental results indicate that our proposed SketchBANE algorithm competes favorably with the state-of-the-art approaches, while remarkably reducing runtime and space consumption. Also, the proposed SketchBANE algorithm exhibits good scalability and parallelization.
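
As a rough illustration of the sketching idea named above (sparse random projection followed by 1-bit quantization), the following is a minimal sketch and not the SketchBANE algorithm itself. It assumes a projection matrix with entries in {+1, 0, -1} and sign-based quantization; the function names and the `density` parameter are hypothetical and do not come from the paper.

```python
import random

def sparse_projection_matrix(in_dim, out_dim, density=1 / 3, seed=0):
    """Build an out_dim x in_dim matrix with entries in {+1, 0, -1}.

    Each entry is non-zero with probability `density` (an assumed value).
    """
    rng = random.Random(seed)
    matrix = []
    for _ in range(out_dim):
        row = []
        for _ in range(in_dim):
            u = rng.random()
            if u < density / 2:
                row.append(1)
            elif u < density:
                row.append(-1)
            else:
                row.append(0)
        matrix.append(row)
    return matrix

def one_bit_sketch(vector, matrix):
    """Project the vector, then keep only one bit per output dimension."""
    return [
        1 if sum(w * x for w, x in zip(row, vector)) >= 0 else 0
        for row in matrix
    ]

def hamming_similarity(a, b):
    """Fraction of positions on which two bit sketches agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Usage: sketch two attribute vectors and compare them without any training.
projection = sparse_projection_matrix(in_dim=4, out_dim=16)
sketch_u = one_bit_sketch([1.0, 0.0, 2.0, 0.5], projection)
sketch_v = one_bit_sketch([2.0, 0.0, 4.0, 1.0], projection)
similarity = hamming_similarity(sketch_u, sketch_v)
```

Because only the sign of each projection is kept, rescaling a vector by a positive constant leaves its sketch unchanged, and each node needs just `out_dim` bits of storage.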

Abstract:
Fuzzy clustering ensemble techniques have been proven to yield more accurate and robust clustering results, with the mainstream methods relying on the fuzzy co-association (FCA) matrix. However, the inherent issues of low-value density and uniform dispersion in the FCA matrix significantly affect the performance of fuzzy clustering ensembles, an aspect that has been overlooked. To address this issue, we propose a novel framework for fuzzy clustering ensemble based on fuzzy matrix self-enhancement (FMSE). Specifically, we initially employ singular value decomposition to extract the principal components of the FCA matrix, thereby alleviating its low-value density. Second, on the basis of the criterion of fuzzy entropy, we measure the fuzziness of samples, design a metric for the fuzzy representativeness of samples, and incorporate it into a fusion-weighted structure for the reconstruction of the FCA matrix, mitigating uniform dispersion. Subsequently, on the basis of the self-enhanced fuzzy matrix model, we utilize a prototype diffusion approach to identify core samples and gradually allocate remaining samples to obtain a consensus clustering solution. Extensive comparative experiments on benchmark datasets against state-of-the-art clustering ensemble methods demonstrate the effectiveness and superiority of the proposed approach.

Abstract:
Graph clustering is highly effective in detecting complex-shaped clusters, and graph building is a crucial step in it. Nevertheless, building a reasonable graph that exhibits high connectivity within clusters and low connectivity across clusters is challenging. Herein, we design a max-ascent-angle graph called the “Y-graph”, a highly sparse graph that automatically allocates dense edges within clusters and sparse edges across clusters, regardless of their shapes or dimensionality. In the graph, every point x is allowed to connect to its nearest higher-density neighbor δ and to another higher-density neighbor γ such that the angle ∠δxγ is the largest, called the “max-ascent-angle”. By seeking the max-ascent-angle, points are automatically connected as the Y-graph, which is a reasonable graph that can effectively balance intra-cluster connectivity and inter-cluster non-connectivity. Besides, an edge weight function is designed to capture the similarity of the neighbor probability distribution, which effectively represents the density connectivity between points. By employing the Normalized-Cut (Ncut) technique, an Ncut-Y algorithm is proposed. Benefiting from the excellent performance of the Y-graph, Ncut-Y can quickly find and cut the edges located in the low-density boundaries between clusters, thereby capturing clusters effectively. Experimental results on both synthetic and real datasets demonstrate the effectiveness of the Y-graph and Ncut-Y.
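
The connection rule described above (link each point to its nearest higher-density neighbor, then to a second higher-density neighbor that maximizes the angle at the point) can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the density values are supplied by the caller, the candidate set for the second neighbor is limited to the `k` nearest higher-density points, all points are distinct, and the names `y_graph_edges` and `k` are hypothetical.

```python
import math

def angle(x, a, b):
    """Angle at vertex x between the rays x->a and x->b, in radians."""
    va = [ai - xi for ai, xi in zip(a, x)]
    vb = [bi - xi for bi, xi in zip(b, x)]
    cos = sum(p * q for p, q in zip(va, vb)) / (math.hypot(*va) * math.hypot(*vb))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

def y_graph_edges(points, density, k=5):
    """Return directed edges (i, j) from each point to up to two denser points."""
    edges = set()
    for i, x in enumerate(points):
        higher = [j for j in range(len(points)) if density[j] > density[i]]
        if not higher:
            continue  # a global density peak has no higher-density neighbor
        higher.sort(key=lambda j: math.dist(x, points[j]))
        delta = higher[0]  # nearest higher-density neighbor
        edges.add((i, delta))
        candidates = higher[1:1 + k]  # assumed candidate set for the second edge
        if candidates:
            # Second neighbor: the candidate with the largest angle delta-x-gamma.
            gamma = max(candidates, key=lambda j: angle(x, points[delta], points[j]))
            edges.add((i, gamma))
    return edges

# Usage on four 2-D points with hand-assigned densities.
points = [(0.0, 0.0), (0.9, 0.0), (-1.0, 0.0), (0.0, 2.0)]
density = [1, 3, 2, 4]
edges = y_graph_edges(points, density)
```

In this toy example, point 0 connects to point 1 (its nearest denser neighbor) and to point 2, which lies on the opposite side and therefore gives the largest angle.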

Abstract:
Graph Neural Networks (GNNs) hold promise in various application domains, but their limited explainability hinders widespread adoption, impacting customer satisfaction and loyalty. This issue intensifies when addressing the diverse explanation needs of different user groups. Current GNN explanation models focus on a single objective, neglecting varied and potentially conflicting user requirements, resulting in suboptimal outcomes. Moreover, existing models prioritize explanation objectives during multi-objective explanations, disrupting the intrinsic hierarchical structures and distant relationships within the graphs, further diminishing their effectiveness. To tackle these challenges, this paper introduces a novel multi-objective explanatory framework with hierarchical structure attribution for GNNs, termed HM-Explainer. This framework constructs a multi-objective explanation generation module based on Pareto theory to balance different and potentially conflicting explanatory objectives. Additionally, to embed hierarchical information into explanations, HM-Explainer designs node-level and cluster-level attribution modules to analyze the impact of input data on GNN decisions hierarchically. Furthermore, a self-attention mechanism is integrated into the node-level attribution module to account for the influence of distant neighbors. Finally, experiments on multiple datasets and different GNN models validate the efficacy of HM-Explainer.

Affiliations: Faculty of Data Science, City University of Macau, Macao, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China; State Key Laboratory of Digital Intelligent Modeling and Simulation and the College of Systems Engineering, National University of Defense Technology, Changsha, China; College of Cyber Security, Jinan University, Guangzhou, China; Centre for Learning, Teaching, and Technology (LTTC), The Education University of Hong Kong, Hong Kong; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Abstract:
Encoding multi-behavior information into a single graph collaborative filtering vector is an emerging challenge, as different behaviors generate distinct graphs, each with its own embedding vector. To address this problem, recent approaches typically designate the embeddings of some behaviors as primary embeddings and use the embeddings of other behaviors to enhance the primary behavior recommendation. However, these models may excel in recommending primary behaviors at the expense of degrading the performance of auxiliary behaviors. As a result, modern recommender systems often need to maintain multiple sets of collaborative filtering embeddings to achieve satisfactory recommendation performance across all behaviors. To alleviate this issue, we introduce the Behavior Merging Graph (BMG). Instead of modeling each behavior separately, BMG uses a joint graph to capture potential behavior merging sets between nodes and applies partial order theory to model the intricate structures and relational order among behavior merging sets. Based on BMG, we introduce the Behavior Merging Graph Convolutional Network (BMGCN), which aggregates neighbor information by integrating convolutional weights that account for the rank transformation of Behavior Merging Order across various behavior merging sets. Furthermore, BMGCN employs behavior merging-based sampling to guide the traditional BPR sampling process, enhancing embedding training. Experiments on three widely used datasets demonstrate that BMGCN achieves superior multi-behavior recommendation performance compared to state-of-the-art baselines.

Abstract:
Large language models have revolutionized text generation, offering significant benefits while also posing threats to society, such as copyright infringement and misinformation. To prevent harmful use, the task of detecting machine-generated content has become an important research topic, though it remains particularly challenging across diverse content domains. This paper presents DGRM, an innovative add-on module designed to improve the domain generalization capability of existing machine-generated text detectors. Our model consists of two training components. (1) Feature disentanglement separates a text’s embedding into target-specific and common attributes, thereby enhancing semantic domain generalization across different content domains. (2) Feature regularization applies constraints to these attributes to extract additional target-relevant information and ensure detection consistency under syntactic perturbations—thus achieving syntactic domain generalization. Evaluation over multiple datasets demonstrates that incorporating our module substantially improves the detection of machine-generated text across semantically and syntactically diverse domains. We hope our work contributes to mitigating the harmful use of language models.

Abstract:
Generating accurate SQL from users’ natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restrict the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we make a summary and discuss the remaining challenges in this field and suggest expectations for future research directions.

Abstract:
Input-discriminative local differential privacy (ID-LDP) protects user data with a different range of values, which improves the utility of the estimated data compared to traditional LDP. However, the existing ID-LDP methods are used for categorical data and cannot be directly applied to numerical data. In this paper, we propose a numerical data collection (NDC) framework with ID-LDP to provide discriminative protection for the data with different inputs. This framework uses a piecewise mechanism to divide the numerical data into several segments and designs two perturbation methods to minimize the mean value of numerical data based on values submitted by users. We first create an NDC-UE method that encodes the raw data into a binary vector. This method sets the bit of the uploaded data to 1 and the rest to 0, and perturbs each bit with a given probability. We further propose an NDC-GRR algorithm to perturb the numerical data with an optimal privacy budget. To reduce the complexity of NDC-GRR, we apply a greedy algorithm-based spanner to shorten the computation time and improve the accuracy. Theoretical analysis proves that our schemes satisfy the definition of ID-LDP. Experimental results based on two real-world datasets and a synthetic dataset show that the proposed schemes achieve a lower mean squared error than the benchmarks.
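
The encode-then-perturb step described for NDC-UE can be illustrated with a generic unary-encoding mechanism. The following is a minimal sketch under simplifying assumptions, not the paper's scheme: it uses equal-width segments and a single pair of probabilities `p` (keep a 1) and `q` (flip a 0 to 1) for every bit, whereas an input-discriminative mechanism would let these depend on the input. All function names are hypothetical.

```python
import random

def segment_index(value, low, high, num_segments):
    """Map a numerical value in [low, high] to an equal-width segment index."""
    width = (high - low) / num_segments
    idx = int((value - low) / width)
    return min(max(idx, 0), num_segments - 1)  # clamp the upper boundary

def unary_encode(index, num_segments):
    """Binary vector with a single 1 at the segment position."""
    return [1 if i == index else 0 for i in range(num_segments)]

def perturb_bits(bits, p, q, rng):
    """Report a 1 with probability p if the true bit is 1, else with probability q."""
    return [1 if rng.random() < (p if b == 1 else q) else 0 for b in bits]

def estimate_counts(reports, p, q):
    """Unbiased per-segment count estimates from the perturbed reports."""
    n = len(reports)
    return [
        (sum(r[i] for r in reports) - n * q) / (p - q)
        for i in range(len(reports[0]))
    ]

# Usage: three users each hold a value in [0, 1], split into 4 segments.
rng = random.Random(0)
values = [0.10, 0.55, 0.60]
reports = [
    perturb_bits(unary_encode(segment_index(v, 0.0, 1.0, 4), 4), 0.75, 0.25, rng)
    for v in values
]
estimated = estimate_counts(reports, 0.75, 0.25)
```

The estimator subtracts the expected number of spurious 1s (`n * q`) and rescales by `p - q`, so its expectation equals the true count in each segment.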

Abstract:
Addressing the persistent challenge of learning from imbalanced datasets is crucial in advancing machine learning applications. Standard machine learning algorithms typically assume that the input data is balanced, and they often struggle to effectively learn the distribution of minority class data when dealing with imbalanced data. To address this, our study designed an improved Generative Adversarial Networks (GANs) model, named MDGAN, for tabular sample synthesis to augment samples and balance the data distribution. MDGAN employs a multi-generator and multi-discriminator structure to capture non-connected subspace manifolds, thereby better fitting the complete data distribution. To enhance the diversity among the multiple generators, an exclusive loss among generators was designed, ensuring that each generator produces data of different modalities. Additionally, a contrastive loss was introduced to ensure that the generated samples better fit the minority class distribution and are separated from the majority class distribution, preventing blurred classification boundaries. Qualitative and quantitative tests were conducted on 25 real datasets, and the experimental results indicate that MDGAN outperforms traditional classical models and current advanced oversampling models.

Abstract:
Federated recommender systems (FedRecSys) have emerged as a pivotal solution for privacy-aware recommendations, balancing growing demands for data security and personalized experiences. Current research efforts predominantly concentrate on adapting traditional recommendation architectures to federated environments, optimizing communication efficiency, and mitigating security vulnerabilities. However, user personalization modeling, which is essential for capturing heterogeneous preferences in this decentralized and non-IID data setting, remains underexplored. This survey addresses this gap by systematically exploring personalization in FedRecSys, charting its evolution from centralized paradigms to federated-specific innovations. We establish a foundational definition of personalization in a federated setting, emphasizing personalized models as a critical solution for capturing fine-grained user preferences. The work critically examines the technical hurdles of building personalized FedRecSys and synthesizes promising methodologies to meet these challenges. As the first consolidated study in this domain, this survey serves as both a technical reference and a catalyst for advancing personalized FedRecSys research.

Abstract:
Index tuning is crucial for optimizing database performance by selecting optimal indexes based on workload. The key to this process lies in an accurate and efficient benefit estimator. Traditional methods relying on what-if tools often suffer from inefficiency and inaccuracy. In contrast, learning-based models provide a promising alternative but face challenges such as instability, lack of interpretability, and complex management. To overcome these limitations, we adopt a novel approach: quantifying the uncertainty in learning-based models’ results, thereby combining the strengths of both traditional and learning-based methods for reliable index tuning. We propose Beauty, the first uncertainty-aware framework that enhances learning-based models with uncertainty quantification and uses what-if tools as a complementary mechanism to improve reliability and reduce management complexity. Specifically, we introduce a novel method that combines AutoEncoder and Monte Carlo Dropout to jointly quantify uncertainty, tailored to the characteristics of benefit estimation tasks. In experiments involving sixteen models, our approach outperformed existing uncertainty quantification methods in the majority of cases. We also conducted index tuning tests on six datasets. By applying the Beauty framework, we eliminated worst-case scenarios and more than tripled the occurrence of best-case scenarios.

Abstract:
In the digital era, effective Transaction Fraud Detection (TFD) is essential to ensuring financial security. The considerable class imbalance, with legitimate transactions vastly outnumbering fraudulent ones, presents a significant challenge for TFD models to accurately identify fraudulent patterns. While existing sample-balancing strategies address class imbalance effectively in many contexts, they often fall short in TFD due to fraudsters’ sophisticated concealment tactics, which lead to pronounced behavioral overlap between fraudulent and legitimate transactions. In this paper, we introduce a novel Generative Adversarial Network-based Hybrid Sampling method (GANHS) to effectively address the class imbalance issue. GANHS employs a dual-discriminator generative adversarial network to generate synthetic samples that accurately reflect the characteristics of fraudulent activity, while an adaptive neighborhood-based undersampling technique refines these samples to minimize overlap with legitimate ones. This hybrid approach not only enhances the model’s ability to learn fraud patterns by generating high-quality samples but also improves its resilience against highly concealed fraudulent activities. Experiments on real-world and public datasets demonstrate that GANHS outperforms its competitive peers, with gains of 0.5%–8.7% in average F1-Score and 1.0%–7.0% in G-mean, highlighting its strong potential for improving the reliability and effectiveness of TFD systems in complex, high-risk financial scenarios.
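
For readers unfamiliar with the two metrics reported above, the following minimal sketch computes the F1-Score and G-mean from a binary confusion matrix using their standard definitions (F1 is the harmonic mean of precision and recall; G-mean is the geometric mean of recall and specificity). The function name and the example counts are illustrative and are not taken from the paper.

```python
import math

def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1, G-mean) for the positive (e.g. fraudulent) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# Usage with made-up counts: 10 fraudulent and 90 legitimate transactions.
f1, gmean = f1_and_gmean(tp=8, fp=2, fn=2, tn=88)
```

Both metrics ignore the sheer size of the majority class, which is why they are preferred over plain accuracy on heavily imbalanced data.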

Abstract:
Time series forecasting (TSF) has gained significant attention as a widely explored research area in diverse applications. Existing methods focus on improving performance in the most common scenarios and pay little attention to rare cases. Despite their scarcity in the data, these rare samples are more challenging, are easily overlooked by models, and contribute significantly to the total loss. In this paper, we propose a novel approach (dubbed iBACon) that overcomes this limitation by employing imbalance-aware contrastive learning and a trend-seasonal decomposition architecture, specifically designed for TSF. To this end, we first introduce the Input-Output Difference (IOD) metric as a pseudo-label and reveal the data imbalance phenomenon in TSF. The continuity of this label inherently provides a meaningful distance between targets, implying a similarity between nearby targets in both label and feature spaces. Based on this similarity, the proposed imbalance-aware contrastive loss aims to reshape feature embeddings to facilitate knowledge dissemination among challenging samples and learn specific predictive features. Finally, when combined with our trend-seasonal decomposition network, iBACon significantly improves TSF accuracy. Experiments show that iBACon enhances overall average accuracy and substantially improves accuracy on the 1-3% most challenging samples.

Abstract:
Accuracy has been the primary benchmark for assessing recommenders learned from sequential interactions. To improve user experience through diverse and novel recommendations, our paper focuses on Multi-objective Sequential Recommendation (MOSR) to balance these conflicting objectives. Although a few studies have leveraged reinforcement learning (RL) to solve MOSR, these methods can lead to sub-optimal results. First, traditional offline RL approaches typically optimize various objectives independently via multiple RL heads, accumulating prediction errors and leading to unstable performance. Furthermore, the offline policy cannot dynamically adjust objective weights during the inference stage, limiting adaptability to varying contexts. To this end, we introduce Multi-objective Decision Transformer for Reward-driven Recommendation (MODT4R), a novel framework that addresses MOSR as a sequence modeling problem. First, we propose a user trajectory to capture user state transitions along with their multi-objective interests, represented by sequential expected cumulative rewards (returns). Moreover, the supervised learning paradigm makes the training process more stable while naturally integrating multi-objective optimization into sequence modeling by using multiple returns as conditional inputs. During inference, a score function is used to adjust the weights of diversity and novelty. Experimental evaluations on real-world datasets demonstrate that MODT4R significantly enhances diversity and novelty while maintaining accuracy compared to existing state-of-the-art methods.
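
The "multiple returns as conditional inputs" idea can be made concrete with a small example. The following is a minimal sketch, assuming the usual Decision Transformer convention that the return at a step is the sum of rewards from that step to the end of the trajectory, computed separately per objective. The function name and the reward values are hypothetical, and the paper's exact reward definitions are not reproduced here.

```python
def returns_to_go(rewards):
    """Per-objective suffix sums of a trajectory's rewards.

    `rewards` is a list of per-step tuples, one entry per objective
    (for example accuracy, diversity, novelty).
    """
    running = [0.0] * len(rewards[0])
    out = []
    for step in reversed(rewards):
        running = [acc + r for acc, r in zip(running, step)]
        out.append(tuple(running))
    out.reverse()
    return out

# Usage: a three-step trajectory with two objectives per step.
trajectory_rewards = [(1, 0), (0, 1), (1, 1)]
conditioning = returns_to_go(trajectory_rewards)
```

Each tuple in `conditioning` would accompany the state at that step as the model's conditioning signal, so raising the target return for one objective at inference time asks the model for more of that objective.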

Abstract:
Anomaly detection (AD) suffers from severe performance degradation when dealing with corrupted datasets. By querying limited annotations from an oracle, active learning is a prevalent way of mitigating this problem. However, previous work ignores the particularity of the AD task in the one-class setting, where wrong pseudo-annotations of anomaly noise will mislead the active inference results. To address this challenge, we propose D²AE, a novel active AD framework that Decouples Data pools between the training and inference processes for Active Experts. Specifically, we design a data-splitting module named DSS to obtain diverse subsets and weaken the mutual interference of similar anomalies. To decouple the data, we propose an Independent Active Experts (IAE) module formed by multiple expert replications, in which each data subset is used to train one separate expert (squad) and is inferred by the other, non-training ones. To further improve the efficiency of data utilization, we propose Active Expert Squad (AES) beyond IAE by introducing Mixture-of-Experts. The commonality and specificity between expert squads promote model training and active query, respectively. We conduct extensive experiments on various image, tabular, and NLP datasets. Experimental results show the superiority of our solution compared with existing methods.

Abstract:
Graph matching is a critical task with diverse real-world applications. Current cutting-edge methodologies combine Graph Neural Networks (GNNs) with incremental anchor refinement, calculating the matching similarity directly via node embeddings. However, the direct similarity computation based on aggregated embeddings from a GNN may obscure the distinctiveness of nodes within a localized region. In addition, anchor pairs that are wrongly added during the iterations and the failure to capture relationships to anchors may further affect the performance. In order to tackle these challenges, this paper proposes a method named DeepNM, which attempts to find node matches based on their neighbors’ similarities. Specifically, DeepNM introduces a Sinkhorn-based similarity on the embeddings of a node’s neighborhood, which serves as both a training loss and a matching metric tailored to the graph matching problem. Additionally, we demonstrate that the Sinkhorn-based similarity, which relies on common neighbor statistics, is highly resilient to inaccurately identified anchor pairs within the context of incremental graph matching. Our comprehensive experiments on synthetic and real-world datasets demonstrate that DeepNM, which is compatible with the incremental graph matching paradigm, performs particularly well at matching graphs where common neighbors provide good matches. Applying the DeepNM pipeline to real social networks results in a 6% improvement, and applying the Sinkhorn similarity on knowledge graphs results in an average improvement of 1.7% over the best baseline.
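
To illustrate what a Sinkhorn-based similarity between two neighborhoods can look like, the following is a minimal sketch and not DeepNM's actual formulation. It assumes both nodes have the same number of neighbors, scores neighbor pairs by dot product, turns the score matrix into an approximately doubly stochastic soft assignment by alternating row and column normalization, and averages the scores under that assignment. The function names and the `temperature` and `iterations` parameters are hypothetical.

```python
import math

def sinkhorn(scores, temperature=0.1, iterations=50):
    """Alternately normalize rows and columns of exp(scores / temperature)."""
    plan = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(iterations):
        plan = [[v / sum(row) for v in row] for row in plan]  # rows sum to 1
        col_sums = [sum(col) for col in zip(*plan)]
        plan = [[v / col_sums[j] for j, v in enumerate(row)] for row in plan]
    return plan

def neighborhood_similarity(neigh_a, neigh_b, temperature=0.1):
    """Average pairwise score under the soft neighbor-to-neighbor assignment."""
    scores = [
        [sum(x * y for x, y in zip(u, v)) for v in neigh_b]
        for u in neigh_a
    ]
    plan = sinkhorn(scores, temperature)
    n = len(scores)
    return sum(plan[i][j] * scores[i][j] for i in range(n) for j in range(n)) / n

# Usage: neighbor embeddings of three nodes (two neighbors each).
neigh_a = [(1.0, 0.0), (0.0, 1.0)]
neigh_b = [(0.0, 1.0), (1.0, 0.0)]    # same neighbors in a different order
neigh_c = [(-1.0, 0.0), (0.0, -1.0)]  # opposite directions
same = neighborhood_similarity(neigh_a, neigh_b)
different = neighborhood_similarity(neigh_a, neigh_c)
```

Because the assignment is computed over the whole neighborhood, the score does not depend on the order in which neighbors are listed, and a single mismatched neighbor changes only part of the sum.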

Abstract:
As trustworthy AI continues to advance, the fairness issue in recommendations has received increasing attention. A recommender system is considered unfair when it produces unequal outcomes for different user groups based on user-sensitive attributes (e.g., age, gender). Some researchers have proposed data augmentation-based methods that aim to alleviate user-level unfairness by altering the skewed distribution of training data among various user groups. Despite yielding promising results, they often rely on fairness-related assumptions that may not align with reality, potentially reducing the data quality and negatively affecting model effectiveness. To tackle this issue, in this paper, we study how to implement high-quality data augmentation to improve recommendation fairness. Specifically, we propose FairDgcl, a dynamic graph adversarial contrastive learning framework that aims to improve fairness in recommender systems. First, FairDgcl develops an adversarial contrastive network with a view generator and a view discriminator to learn to generate fair augmentation strategies in an adversarial style. Then, we propose two dynamic, learnable models to generate contrastive views within the contrastive learning framework, which automatically fine-tune the augmentation strategies. Meanwhile, we theoretically show that FairDgcl can simultaneously generate enhanced representations that possess both fairness and accuracy. Lastly, comprehensive experiments conducted on four datasets demonstrate the effectiveness of the proposed FairDgcl.

Abstract:
Subgraph matching is challenging as it necessitates time-consuming combinatorial searches. Recent Graph Neural Network (GNN)-based approaches address this issue by employing GNN encoders to extract graph information and hinge distance measures to ensure containment constraints in the embedding space. These methods significantly shorten the response time, making them promising solutions for subgraph retrieval. However, they suffer from scale differences between graph pairs during encoding, as they focus on feature counts but overlook the relative positions of features within node-rooted subtrees, leading to disturbed containment constraints and false predictions. Additionally, their hinge distance measures lack discriminative power for matched graph pairs, hindering ranking applications. We propose NC-Iso, a novel GNN architecture for neural subgraph matching. NC-Iso preserves the relative positions of features by building the hierarchical dependencies between adjacent echelons within node-rooted subtrees, ensuring matched graph pairs maintain consistent hierarchies while complying with containment constraints in feature counts. To enhance the ranking ability for matched pairs, we introduce a novel similarity dominance ratio-enhanced measure, which quantifies the dominance of similarity over dissimilarity between graph pairs. Empirical results on nine datasets validate the effectiveness, generalization ability, scalability, and transferability of NC-Iso while maintaining time efficiency, offering a more discriminative neural subgraph matching solution for subgraph retrieval.

Abstract:
Metro Origin-Destination (OD) prediction is a crucial yet challenging spatial-temporal prediction task in urban computing, which aims to accurately forecast cross-station ridership for optimizing metro scheduling and enhancing overall transport efficiency. Analyzing fine-grained and comprehensive relations among stations effectively is imperative for metro OD prediction. However, existing metro OD models either mix information from multiple OD pairs from the station’s perspective or exclusively focus on a subset of OD pairs. These approaches may overlook fine-grained relations among OD pairs, leading to difficulties in predicting potential anomalous conditions. To address these challenges, we learn traffic evolution from the perspective of all OD pairs and propose a fine-grained spatial-temporal MLP architecture for metro OD prediction, namely ODMixer. Specifically, our ODMixer has a double-branch structure and involves the Channel Mixer, the Multi-view Mixer, and the Bidirectional Trend Learner. The Channel Mixer aims to capture short-term temporal relations among OD pairs, while the Multi-view Mixer concentrates on capturing spatial relations from both origin and destination perspectives. To model long-term temporal relations, we introduce the Bidirectional Trend Learner. Extensive experiments on two large-scale metro OD prediction datasets, HZMOD and SHMO, demonstrate the advantages of our ODMixer.

Abstract:
Pairwise constrained clustering, which employs the pairwise constraints to boost clustering performance, has been widely used in many applications such as face clustering and image retrieval. Due to the prevalence of multi-view data, pairwise constrained multi-view clustering has attracted increasing attention. Nevertheless, existing methods suffer from at least one of the three issues, i.e., expensive time consumption, two-stage clustering and inadequate use of pairwise constraints. To address the above issues, this paper proposes a Pairwise Constrained Bipartite Graph (PCBG) learning method for efficient one-step pairwise constrained multi-view clustering. Concretely, to encode must-link constraints, a novel comprehensive bipartite graph is elegantly designed. Meanwhile, a cannot-link regularization is derived and imposed on the comprehensive bipartite graph, which enforces cannot-link constraints to be realized with theoretically provable guarantees. Moreover, the comprehensive bipartite graph is constrained to exhibit explicit clustering partition by its connected components. Then, an efficient and convergent algorithm with theoretically proved accelerating techniques is derived for optimization, which has linear time complexity to the sample size. Extensive experimental results demonstrate the advantages of PCBG in both clustering performance and time complexity compared with state-of-the-art baselines.

Abstract:
Knowledge Graph Embedding (KGE) aims to learn dense embeddings as the representations for entities and relations in KGs. Indeed, the entities in existing KGs suffer from the data imbalance issue, i.e., there exists a substantial disparity in the occurrence frequencies among various entities. Existing KGE models pre-define a unified and fixed dimension size for all entity embeddings. However, the appropriate embedding size of an entity depends on its frequency, and a uniform embedding size may result in inadequate expression of entities, i.e., leading to overfitting for low-frequency entities and underfitting for high-frequency ones. A straightforward idea is to set the embedding size for each entity before KGE training. However, manually selecting different embedding sizes is labor-intensive and time-consuming, posing challenges in real-world applications. To tackle this problem, we propose AdaE, which adaptively learns KG embeddings with different embedding sizes during training. In particular, AdaE is capable of selecting appropriate dimension sizes for each entity from a continuous integer space. To this end, we specially tailor bilevel optimization for the KGE task, which alternately learns representations and embedding sizes of entities. Our framework is general and flexible, fitting various existing KGE models. Extensive experiments demonstrate the effectiveness and compatibility of AdaE.

Abstract:
The long-tailed data distribution frequently occurs in real-world scenarios, whereas deep learning is not effective enough for such distributions. To improve effectiveness on long-tailed data, data augmentation is widely used to balance the distribution of classes by generating new samples. However, most existing studies assume class independence by default, ignoring the effect of interrelations among classes on data augmentation, so some generated samples may be unrepresentative and useless for balancing the class distribution. Motivated by this, we propose a new data augmentation method based on sparse class-correlation exploitation, which can generate more representative samples by utilizing the class-correlation, to effectively balance the class-distribution for the long-tailed data. In the proposed method, a sparse class-correlation exploration module is first proposed to explore the potential correlations among multiple classes for boosting the classification performance. Based on the class-correlations, the pivotal seed-samples are generated by maximizing the sparse representation of challenging samples. Meanwhile, an ambiguity-filtered translation module is designed to generate more representative new samples for the target classes based on the obtained seed-samples by enhancing the class-consistency and suppressing the deviation from the target classes. In addition, we introduce the self-supervised feature and fuse it with the discriminative feature to explore more accurate class-correlations. Experimental results illustrate that the proposed method obtains better performance than the state-of-the-art methods with only a small number of generated samples.

Abstract:
Multi-view clustering aims at partitioning data into their underlying categories by mining shared and complementary information conveyed by different views. Although the integration of deep learning and disentanglement learning has markedly improved clustering performance, our analysis reveals two fundamental limitations in existing approaches: inadequate separation between view-shared and view-exclusive features; and the negative effects of clustering-irrelevant information on feature decoupling. To tackle these issues, we present a novel Disentangled Feature Learning Network (DFL-Net), which utilizes a progressive learning framework to systematically disentangle features. DFL-Net initially establishes view-shared representations through semantic disparity minimization, followed by the construction of orthogonal feature subspaces using cross-view and intra-view independence constraints to isolate view-specific features. Subsequently, DFL-Net enforces clustering consistency across views to adaptively eliminate irrelevant information, thus enhancing the overall effectiveness of disentanglement learning. The framework introduces two significant innovations: a comprehensive feature independence criterion that concurrently reduces intra-view and cross-view feature dependencies, and an irrelevance filtering mechanism that ensures cross-view clustering consistency. Extensive experiments on benchmark datasets demonstrate the superior performance of DFL-Net compared to state-of-the-art methods.

Abstract:
Anchor-based clustering methods have attracted increasing attention due to their ability to provide efficient and scalable solutions in clustering tasks, such as subspace, multi-view and ensemble clustering. Nevertheless, the majority of anchor-based methods view anchors merely as tools, concentrating on diminishing computational complexity within the original data space. In fact, clustering can be performed directly on anchors, and the anchor clustering results can then be propagated to the original data. Due to the much smaller volume of anchors, this could significantly reduce the computational complexity of clustering algorithms. Building upon this idea, in this paper, we propose a fast anchor graph clustering method (FAGC) via maximizing within-cluster similarity. Inspired by the relaxation and discretization model in spectral clustering, we also propose two corresponding models, namely FAGC-R and FAGC-D. FAGC-R first obtains the spectral embedding of anchors and then discretizes the embedding to obtain the anchor indicator matrix, while FAGC-D directly solves for the discrete anchor membership matrix. Once anchor clustering results are obtained, original data labels can be obtained through anchor label transmission. Extensive experiments conducted on synthetic and real datasets illustrate the effectiveness and efficiency of the proposed methods.
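
The final step, passing anchor labels on to the original samples, might look like the sketch below. The abstract does not spell out the transmission rule, so this assumes one common choice: each sample takes the cluster whose anchors carry the largest total affinity to it. The names `Z`, `anchor_labels`, and `propagate_anchor_labels` are hypothetical.

```python
import numpy as np

def propagate_anchor_labels(Z, anchor_labels, n_clusters):
    """Assign each sample the cluster with the largest summed anchor affinity.

    Z: (n_samples, n_anchors) non-negative sample-to-anchor affinities.
    anchor_labels: (n_anchors,) integer cluster label of each anchor.
    """
    Z = np.asarray(Z, dtype=float)
    # One-hot anchor indicator matrix of shape (n_anchors, n_clusters).
    Y = np.eye(n_clusters)[np.asarray(anchor_labels)]
    # Entry (i, c) sums the affinities from sample i to anchors in cluster c.
    scores = Z @ Y
    return scores.argmax(axis=1)

# Three samples, three anchors; anchors 1 and 2 share a cluster.
Z = [[0.9, 0.1, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]
labels = propagate_anchor_labels(Z, anchor_labels=[0, 1, 1], n_clusters=2)
```

Only the small anchor set is clustered; the propagation itself is a single matrix product over the sample-to-anchor graph.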

Abstract:
Structured proximity matrix learning, one of the mainstream directions in clustering research, refers to learning a proximity matrix with an explicit clustering structure from the original first-order proximity matrix. Due to the complexity of the data structure, the original first-order proximity matrix always lacks some must-links compared to the ground-truth proximity matrix. It is worth noting that high-order proximity matrices can provide the missing must-link information. However, computing high-order proximity matrices and clustering based on them are expensive. To solve the above problem, inspired by the anchor bipartite graph, we present a novel high-order bipartite graph proximity matrix and a fast method to compute it. The proposed high-order bipartite graph proximity matrix contains high-order proximity information and can significantly reduce the computational complexity of the whole clustering process. Furthermore, we introduce an efficient and simple high-order bipartite graph fusion framework that can adaptively assign weights to each order of the high-order bipartite graph matrices. Finally, under the Laplacian rank constraint, a consensus structured bipartite graph proximity matrix is obtained. At the same time, an efficient solution algorithm is proposed for this model. The model's efficacy is underscored through rigorous experiments, highlighting its superior clustering performance and time efficiency.

Abstract:
Mining learner preferences and needs from individual learning behavior data is a critical task in course recommendation systems. While graph-based models have shown efficacy in capturing pairwise relationships between learners and courses, they often overlook the complex higher-order interactions involving learners, courses and teachers that are essential for accurate recommendations. To address this limitation, we propose a novel Hypergraph Convolutional Network for Course Recommendation (HCNCR) framework, designed to model these higher-order interactions effectively. Our approach constructs course and learner hypergraphs based on course attributes and learner similarity relations, respectively. By employing hypergraph convolution, we capture the intrinsic higher-order relationships within these hypergraphs. Additionally, we utilize graph convolutional layers on the learner-course bipartite graph to integrate embeddings derived from hypergraphs, achieving comprehensive representations of both learners and courses. Extensive experiments conducted on real-world datasets demonstrate that HCNCR significantly outperforms existing state-of-the-art methods in course recommendation tasks.

Abstract:
Generative recommendation systems have recently seen a surge in interest, largely due to the promising advancements in generative AI. As a competitive solution for multi-behavior sequence recommendations, much of the recent research has concentrated on predicting the next item a user will likely interact with using a generative approach. However, these methods often (1) assign multiple residual quantization layers to obtain item codes, which leads to extra storage costs for more codebooks, and (2) explicitly utilize behavior sequences, leading to longer sequences and potentially increasing both training and inference time compared with the original sequences. In response to these challenges, we introduce the Implicit Multi-Behavior Generative recommendation with a mixture of quantization (IMBGen) approach in this paper. Specifically, we have devised a Mixture of Quantization (MoQ) that combines the merits of both residual and parallel quantization for a more effective tokenization process. Additionally, we propose an Implicit Behavior Modeling (IBM) framework, allowing for more efficient integration of users’ behaviors into the interacted items. Finally, we conducted extensive experiments on two widely used benchmark datasets and further confirmed our findings with an online A/B test. The results consistently demonstrate the advantages of our approach over other baseline methods.

Abstract:
Finding cohesive subgraphs from a directed graph is a fundamental approach to analyze directed graph data. We consider a new model called directed (k, ℓ)-plex for a cohesive directed subgraph, which is generalized from the concept of k-plex that is only applicable to undirected graphs. Directed (k, ℓ)-plex (or DPlex) has the connection requirements on both inbound and outbound directions of each vertex inside, i.e., each vertex disconnects at most k vertices and is meanwhile not pointed to by at most ℓ vertices. In this paper, we study the maximum DPlex search problem which finds a DPlex with the most vertices. We formally prove the NP-hardness of the problem. We then design a heuristic algorithm called DPHeuris, which finds a DPlex with the size close to the maximum one and runs practically fast in polynomial time. Furthermore, we propose a branch-and-bound algorithm called DPBB to find the exact maximum DPlex and develop effective graph reduction strategies for boosting the empirical performance. We also consider the problem of querying personalized maximum DPlex, and design a new method called DPBBQ for the problem. Finally, we conduct extensive experiments on real directed graphs. The experimental results show that (1) our heuristic method can quickly find a near-optimal solution and (2) our branch-and-bound method runs up to six orders of magnitude faster than other baselines.
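
The DPlex condition can be checked directly for a candidate vertex set, as in the sketch below. It assumes the usual k-plex convention in which a vertex counts among its own non-neighbors, so a vertex in a set S needs at least |S| - k out-neighbors and at least |S| - ℓ in-neighbors inside S; the abstract does not state this convention, and the function name `is_dplex` is hypothetical. This only verifies a given set and is not the DPHeuris or DPBB search.

```python
def is_dplex(vertices, edges, k, l):
    """Return True if `vertices` induces a directed (k, l)-plex.

    edges: iterable of directed (u, v) pairs.
    Assumed convention: each vertex counts as one of its own non-neighbors.
    """
    S = set(vertices)
    # Keep only edges inside the candidate set, ignoring self-loops.
    E = {(u, v) for (u, v) in edges if u in S and v in S and u != v}
    for v in S:
        out_deg = sum(1 for u in S if (v, u) in E)
        in_deg = sum(1 for u in S if (u, v) in E)
        # v may miss at most k out-neighbors and l in-neighbors (itself included).
        if out_deg < len(S) - k or in_deg < len(S) - l:
            return False
    return True

# A directed 3-cycle: every vertex has one out-neighbor and one in-neighbor.
cycle = [(0, 1), (1, 2), (2, 0)]
```

Under this convention the directed 3-cycle is a (2, 2)-plex but not a (1, 1)-plex, since each vertex misses one outgoing and one incoming connection besides itself.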

Abstract:
In practical applications, multi-view subspace clustering is hindered by data noise that disrupts the ideal block-diagonal structure of self-representation matrices, thereby degrading performance. Moreover, many existing methods rely solely on sample features, overlooking the valuable structural information in affinity matrices (e.g., pairwise relationships), while conventional contrastive learning strategies often introduce false negative pairs due to noise and unreliable sample selection. To address these challenges, we propose a pseudo-label guided bidirectional discriminative deep multi-view subspace clustering method (PBDMSC). Our approach first employs pseudo-label guided contrastive learning, using previous cluster assignments to select reliable positive and negative samples, which mitigates incorrect pairings and enhances low-dimensional representations. Then, a discriminative self-representation learning method is introduced that leverages pseudo-labels to enforce homogeneous expression constraints and incorporates a bidirectional attention mechanism to preserve the structured information from affinity matrices, thereby enhancing robustness. Experimental results on six real-world datasets demonstrate that our proposed method achieves state-of-the-art clustering performance.

Abstract:
Automatic evaluation of hashtag recommendation models is a fundamental task in Twitter. In the traditional evaluation methods, the recommended hashtags from an algorithm are first compared with the ground truth hashtags for exact correspondences. The number of exact matches is then used to calculate the hit rate, hit ratio, precision, recall, or F1-score. This way of evaluating hashtag similarities is inadequate as it ignores the semantic correlation between the recommended and ground truth hashtags. To tackle this problem, we propose a novel semantic evaluation framework for hashtag recommendation, called #REval. This framework includes an internal module referred to as BERTag, which automatically learns the hashtag embeddings. We investigate how the #REval framework performs under different word embedding methods and different numbers of synonyms and hashtags in the recommendation using our proposed #REval-hit-ratio measure. Our experiments with the proposed framework on three large datasets show that #REval gives more meaningful hashtag synonyms for hashtag recommendation evaluation. Our analysis also highlights the sensitivity of the framework to the word embedding technique, with #REval based on BERTag outperforming #REval based on Word2Vec, FastText, and GloVe.
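
The limitation of exact-match evaluation is easy to see in a small example. The sketch below computes precision, recall, and F1 from exact matches only; the function name and the case-insensitive comparison are assumptions for illustration, and it represents the traditional baseline rather than the #REval measure.

```python
def exact_match_scores(recommended, ground_truth):
    """Precision, recall, and F1 based only on exact hashtag matches."""
    # Case-insensitive comparison is an assumption of this sketch.
    rec = {h.lower() for h in recommended}
    truth = {h.lower() for h in ground_truth}
    hits = len(rec & truth)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# "#hoops" is semantically close to "#basketball" but earns no credit.
p, r, f1 = exact_match_scores(
    ["#nba", "#hoops", "#sports"],
    ["#nba", "#basketball"],
)
```

Here only `#nba` counts as a hit, so a recommendation that is arguably relevant scores the same as an unrelated one.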

Abstract:
Graph Neural Networks (GNNs) have been gaining more attention due to their excellent performance in modeling various graph-structured data. However, most of the current GNNs only consider fixed-neighbor discrete message-passing, disregarding the importance of the local structure of different nodes and the implicit information between nodes for smoothing features. Previous approaches either focus on adaptive selection for aggregation structures or treat discrete graph convolution as a continuous diffusion process, but none of them comprehensively considers the above issues, significantly limiting the model's performance. To this end, we present a novel approach called Flexible Diffusion Convolution (Flexi-DC), which exploits the neighborhood information of nodes to set a particular continuous diffusion for each node to smooth features. Specifically, Flexi-DC first extracts the local structure knowledge based on the degrees of nodes in the graph data and then injects it into the diffusion convolution module to smooth features. Additionally, we utilize the extracted knowledge to smooth labels. Flexi-DC is an efficient framework that can significantly improve the performance of most GNN architectures. Experimental results demonstrate that GNNs equipped with Flexi-DC outperform their vanilla implementations by an average accuracy margin of 13.24% (GCN), 16.37% (JKNet), and 11.98% (ARMA) on nine graph datasets with different homophily ratios.

Affiliations: National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Laboratory, and the Cluster and Grid Computing Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; School of Computer Science, Hong Kong Baptist University, Hong Kong; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

Abstract:
Blockchain-based query with its traceability and data provenance has become increasingly popular and widely adopted in numerous applications. Yet existing index-based query approaches are only efficient under static blockchain query workloads where the query attribute or type must be fixed. It turns out to be particularly challenging to construct an efficient index for dynamic workloads due to prohibitively long construction time and excessive storage consumption. In this paper, we present FlexIM, the first efficient and verifiable index management system for blockchain dynamic queries. The key innovation in FlexIM is to uncover the inherent characteristics of blockchain, i.e., data distribution and block access frequency, and then to optimally choose the index by utilizing reinforcement learning technique under varying workloads. In addition, we enhance and facilitate verifiability with low storage overhead by leveraging Root Merkle Tree (RMT) and Bloom Filter Merkle Tree (BMT). Our comprehensive evaluations demonstrate that FlexIM outperforms the state-of-the-art blockchain query mechanism, vChain+, by achieving a 26.5% speedup while consuming 94.2% less storage, on average, over real-world Bitcoin datasets.

Abstract:
Time series classification (TSC) is crucial in many applications, yet accurately modeling complex time series patterns remains challenging. Model-based TSC strives to aptly model time series by capturing their intrinsic temporal dynamics, deriving effective dynamic representations for classification. Despite significant progress in this domain, existing works are still constrained by a singular and overly simplistic modeling paradigm, which proves inadequate to handle the multiscale hierarchies inherent in time series. Additionally, the prevailing reliance on manual model configuration fails to address the diverse dynamic characteristics across varying data scenarios. In this paper, we amalgamate multiple recurrent reservoirs to devise a model-based Multiscale Temporal Dynamic Learning (MsDL) approach. These reservoirs are endowed with varied recurrent connection skips, ensuring a comprehensive capture of temporal dynamics across different timescales. We also present a multi-objective optimization algorithm, which adaptively configures the memory length of each reservoir, allowing for more accurate time series modeling. This optimization further encourages time series from the same class to look closer, while separating those from different classes, thereby enhancing the category-discriminability. Extensive experiments on public datasets demonstrate that MsDL outperforms the state-of-the-art methods. Additionally, ablation studies confirm that our multiscale design and optimization algorithm effectively enhance classification accuracy.

Abstract:
Post-click conversion rate (CVR) is a reliable indicator of online customers’ preferences, making it crucial for developing recommender systems. A major challenge in predicting CVR is severe selection bias, arising from users’ inherent self-selection behavior and the system’s item selection process. To mitigate this issue, the inverse propensity score (IPS) is employed to weight the prediction error of each observed instance. However, current propensity score estimations are unreliable due to the lack of a quality measure. To address this, we evaluate the quality of propensity scores from the perspective of uncertainty calibration, proposing the use of Expected Calibration Error (ECE) as a measure of propensity-score quality, which quantifies the extent to which predicted probabilities are overconfident by assessing the difference between predicted probabilities and actual observed frequencies. Miscalibrated propensity scores can lead to distorted IPS weights, thereby compromising the debiasing process in CVR prediction. In this paper, we introduce a model-agnostic calibration framework for propensity-based debiasing of CVR predictions. Theoretical analysis on bias and generalization bounds demonstrates the superiority of calibrated propensity estimates over uncalibrated ones. Experiments conducted on the Coat, Yahoo and KuaiRand datasets show improved uncertainty calibration, as evidenced by lower ECE values, leading to enhanced CVR prediction outcomes.
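
For reference, the standard binned form of Expected Calibration Error can be computed as in the sketch below. The equal-width binning, the default of 10 bins, and the function name are assumptions; the paper's exact estimator and calibration procedure are not given in the abstract.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |mean predicted prob - observed frequency|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Predicted propensity 0.9, but only 1 of 4 instances was actually observed.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

An overconfident propensity of 0.9 against an observed frequency of 0.25 would give an inverse-propensity weight of about 1.1 where roughly 4 is needed, which is the kind of distortion the abstract attributes to miscalibration.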

Abstract:
Bandit online multiclass prediction plays an important role in many real-world applications. In this paper, we propose a unified Bandit Online Multiclass Prediction (BOMP) framework. This framework is based on our proposed margin-based gradient descent approach. Its update step provides an unbiased estimate of the surrogate loss gradient and has a lower variance than existing methods. It also enables our algorithms to update even for incorrect predictions by penalizing the wrong classes. The link function of the framework can evolve over time, gradually incorporating online data information including second-order information into the potential functions. Based on the proposed framework, we investigate first-order and second-order bandit online multiclass prediction algorithms. Theoretical analysis demonstrates the superiority of our proposed update rule and bandit online multiclass prediction framework. Finally, we compare our proposed first-order and second-order bandit online multiclass prediction algorithms with several state-of-the-art methods on two synthetic and four real-world datasets. The encouraging results show that our proposed algorithms significantly outperform state-of-the-art techniques.

Abstract:
Graph Convolutional Networks (GCNs) have demonstrated remarkable success in various graph-related tasks. However, recent studies show that GCNs are vulnerable to adversarial attacks on graph structures. Therefore, how to defend against such attacks has become a popular research topic. The current common defense methods face two main limitations: (1) From the data perspective, it may lead to suboptimal results since the structural information is ignored when distinguishing the perturbed edges. (2) From the model perspective, the defenders rely on the low-pass filter of the GCN, which is vulnerable during message passing. To overcome these limitations, this paper analyzes the characteristics of perturbed edges, and based on this we propose a robust defense framework, REDE, to generate the adaptive Reliable Defense graph for multi-channel robust GCN. REDE first uses feature similarity and structure difference to discriminate perturbed edges and generates the defense graph by pruning them. Then REDE designs a multi-channel GCN, which can separately capture the information of different edges and high-order neighbors utilizing different frequency components. Leveraging this capability, the defense graph is adaptively updated at each layer, enhancing its reliability and improving prediction accuracy. Extensive experiments on four benchmark datasets demonstrate the enhanced performance and robustness of our proposed REDE over the state-of-the-art defense methods.

Abstract:
Diversifying recommendations to broaden user horizons and explore potential interests has become a prominent research area in recommender systems. Although numerous efforts have been made to enhance diverse recommendations, the trade-off between diversity and accuracy remains a significant challenge. The primary causes lie in the following two aspects: (i) the inherent goals of diversity-promoting recommendation, which are to simultaneously deliver accurate recommendations and cater to a broader spectrum of users’ interests, have not been adequately explored; and (ii) considering diversity only in the model training procedure cannot guarantee the provision of diversification services in recommender systems. In this work, we directly formulate the inherent goals of diversity-promoting recommendation as a dual-objective optimization problem by simultaneously minimizing the recommendation error and maximizing diversity. These proposed objectives are integrated into Generative Adversarial Nets (GANs) to guide the training process toward the orientation of boosting both diversification and accuracy. Additionally, we propose considering diversity in both training and serving phases. Experimental results demonstrate that our model outperforms others in both diversity and relevance. We extend DDPR to state-of-the-art CTR and re-ranking models, which also result in improved performance on these tasks, further demonstrating the applicability of our model in real-world scenarios.

Abstract:
Credit card fraud is a severe issue that causes significant losses for both cardholders and issuing banks. Existing methods utilize machine learning-based classifiers to identify fraudulent transactions from labeled transaction records. However, labeled data are often scarce compared to the billions of real transactions due to the high cost of annotation, which means that previous methods do not fully utilize the rich features of unlabeled data. Moreover, contemporary methods are unaware of the local risk structure and unable to capture certain risk patterns. Therefore, we propose the Risk-aware Gated Temporal Attention Network (RGTAN) for fraud detection in this work. Specifically, we first build a temporal transaction graph based on the transaction records, which consists of temporal transactions (nodes) and their interactions (edges). Then we leverage a Gated Temporal Graph Attention (GTGA) Mechanism to propagate messages among the nodes and learn adaptive representations of transactions. We also model the fraud patterns through risk propagation, taking advantage of the relations among transactions. More importantly, we devise a neighbor risk-aware representation learning layer to enhance our method’s perception of multi-hop risk structures. We conduct extensive experiments on a real-world credit card transaction dataset and two public fraud detection datasets. The results show that our proposed method, RGTAN, outperforms other state-of-the-art methods on all three datasets. The risk-aware semi-supervised experiments also demonstrate the excellent performance of our model with only a small fraction of manually labeled data. Moreover, RGTAN has been deployed in a world-leading credit card issuer for credit card fraud detection, and the case study results show the effectiveness of our method in uncovering real-world fraud patterns.

Abstract:
Causal discovery faces significant challenges as the number of hypotheses grows exponentially with the number of variables. This complexity becomes particularly daunting when dealing with large sets of variables. We introduce a novel divide-and-conquer method that uniquely handles this challenge. Existing division strategies often rely on conditional independence (CI) tests or data-driven clustering to split variables, which can suffer from the typical data scarcity in large-scale settings, thus leading to inaccurate division results. The proposed method overcomes this by implementing a data-independent division strategy, which constructs a prior structure, informed by potential causal relationships identified using a Large Language Model (LLM), to guide the recursive division of variables into subsets. This approach avoids the impact of data insufficiency and is robust against potential incompleteness in the prior structure. In the merging phase, we adopt a score-based refinement strategy to address fake causal links caused by hidden variables in subsets, which eliminates edges in the intersected parts of subsets to optimize the score of local structures. While maintaining both correctness and completeness under the faithfulness assumption, this novel merging approach outperforms the conventional CI-test based merging strategy in practical scenarios. Empirical evaluations on various large-scale datasets demonstrate the proposed approach's superior accuracy and efficiency compared to existing causal discovery methods.

Abstract:
Fairness in recommendation has drawn much attention since it significantly affects how users access information and how information is exposed to users. However, most fairness-aware methods are designed offline with the entire stationary interaction data to handle the global unfairness issue and evaluate their performance in a one-time paradigm. In real-world scenarios, users tend to interact with items continuously over time, leading to a dynamic recommendation environment where unfairness evolves online. Moreover, previous methods that focus on mitigating unfairness can hardly bring significant improvements to the recommendation task. Hence, in this paper, we propose a Model-agnostic Dual-side Online Fairness Learning method (MDOFair) for dynamic recommendation. First, we carefully design dynamic dual-side fairness learning to trace the rapid evolution of unfairness from both the user and item sides. Second, we combine the fairness and recommendation tasks in one unified framework to pursue a double win. Last, we present an efficient model-agnostic post-ranking method for the dynamic recommendation scenario to mitigate dynamic unfairness while significantly improving recommendation performance. Extensive experiments demonstrate the superiority and effectiveness of our proposed MDOFair when it is incorporated into existing dynamic models as a post-ranking stage.

Abstract:
Shortest path computation is ubiquitous in various applications in road networks and the index-based algorithms, especially hub labeling, can boost the query performance dramatically. However, traffic conditions keep changing in real life, making the precomputed index unable to answer the query correctly. In this work, we adopt the state-of-the-art tree decomposition-based hub labeling (TDHL) as the underlying index and design efficient algorithms to incrementally maintain the index. Specifically, we first analyze the structural stability of the index in dynamic road networks which enables us to concentrate on label value maintenance. We then introduce the minimum weight property and minimum distance property to guarantee index correctness without graph traversal. Moreover, we propose the star-centric paradigm for tracing index change and design various pruning techniques to further accelerate index maintenance. We also extend our algorithms to batch mode for shared computation, to structural maintenance for full types of updates, and generalize to all kinds of TDHL. Finally, we further improve the index maintenance efficiency and scalability of our algorithms by leveraging graph partition. Our experimental results validate the superiority of our proposals over existing solutions on both index maintenance and query processing.
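
As a rough illustration of why label value maintenance matters, the following minimal sketch shows how a generic hub-labeling query is answered from labels alone, and how a stale label value gives a wrong answer after an edge weight changes. The function name `hub_query`, the toy graph, and the hand-written labels are assumptions made for illustration; this is not the TDHL index or the maintenance algorithms of the paper.

```python
from math import inf

def hub_query(label_s, label_t):
    """Answer a distance query as min over common hubs h of d(s, h) + d(h, t)."""
    # Iterate over the smaller label and probe the larger one.
    if len(label_s) > len(label_t):
        label_s, label_t = label_t, label_s
    best = inf
    for hub, d_sh in label_s.items():
        d_ht = label_t.get(hub)
        if d_ht is not None:
            best = min(best, d_sh + d_ht)
    return best

# Toy path graph a - b - c with edge weights 2 and 3; b is the top-ranked hub.
labels = {
    "a": {"a": 0, "b": 2},
    "b": {"b": 0},
    "c": {"c": 0, "b": 3},
}
print(hub_query(labels["a"], labels["c"]))

# If the weight of edge (b, c) drops to 1, the label structure (which hubs each
# vertex stores) is unchanged; only the stored value must be refreshed.
labels["c"]["b"] = 1
print(hub_query(labels["a"], labels["c"]))
```

In this toy case the set of hubs per vertex stays the same after the update and only one stored distance changes, which is the kind of situation the structural stability analysis in the abstract refers to.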

Abstract:
Secure join schemes, an important class of queries over encrypted databases, have attracted increasing attention. While efficient querying is paramount, data owners also emphasize the significance of privacy preservation. The state-of-the-art JXT (Jutla and Patranabis, ASIACRYPT 2022) enables efficient join queries over encrypted tables with a symmetric-key solution. However, we observe that JXT inadvertently leaks undesirable query results as the number of queries increases. In this paper, we propose a novel equi-join scheme, One-Time Join Cross-Tags (OTJXT), which avoids additional result leakage across multiple queries and extends to equi-join as opposed to the natural join in JXT. Specifically, we design a new data encoding method using nonlinear transformations that reveals only the union of results for each query, without the extra leakage observed in JXT. Moreover, OTJXT addresses the linear search complexity issue (Shafieinejad et al., ICDE 2022) while preventing multiple-query leakage. Finally, we implement OTJXT and compare its performance with JXT and Shafieinejad et al.'s scheme on the TPC-H dataset. The results show that OTJXT outperforms both in search and storage efficiency, achieving a $98.5\times$ (resp., $10^6\times$) speedup in search latency and reducing storage cost by 62.5% (resp., 78.5%), compared to JXT (resp., Shafieinejad et al.'s scheme). Using OTJXT, a TPC-H query on a 40 MB database takes only 21 ms.

Abstract:
Finding top-$K$ frequent items has been a hot topic in data stream processing with wide-ranging applications. However, most existing sketch algorithms focus on finding local top-$K$ in a single data stream. In this paper, we tackle finding global top-$K$ across multiple data streams. We find that using prior sketch algorithms directly is often unfair in global scenarios, degrading global top-$K$ accuracy. We define top-$K$-fairness and show its importance for finding global top-$K$. To achieve this, we propose the Double-Anonymous (DA) sketch, where double-anonymity ensures fairness. We also propose two techniques, hot-filtering and early-freezing, to improve accuracy further. We theoretically prove that the DA sketch achieves top-$K$-fairness while maintaining high accuracy. Extensive experiments verify top-$K$-fairness in disjoint data streams, showing that the DA sketch's error is up to 129 times (60 times on average) smaller than the state-of-the-art. To enhance the applicability and technical depth, we also investigate how to extend the DA sketch to general distributed data stream scenarios and how to provide a fairer and more accurate global ranking for top-$K$ items. The experimental results show that the extended version of the DA sketch can indeed compute better rankings and still has significant advantages in general data streams.
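
To make the local-versus-global distinction concrete, the sketch below uses exact counters (not sketches, and not the DA sketch) to show how merging per-stream top-$K$ lists can miss an item that is globally most frequent. The function names and the toy streams are hypothetical and chosen only for illustration.

```python
from collections import Counter

def local_top_k(stream, k):
    """Exact top-k (item, count) pairs of one stream."""
    return Counter(stream).most_common(k)

def merge_local_top_k(streams, k):
    """Naive global answer: add up only what each stream's local top-k reports."""
    merged = Counter()
    for stream in streams:
        for item, count in local_top_k(stream, k):
            merged[item] += count
    return [item for item, _ in merged.most_common(k)]

def exact_global_top_k(streams, k):
    """Reference answer: count every item over all streams."""
    total = Counter()
    for stream in streams:
        total.update(stream)
    return [item for item, _ in total.most_common(k)]

# "x" is second in each stream but first overall (4 + 4 = 8 > 5).
streams = [["a"] * 5 + ["x"] * 4, ["b"] * 5 + ["x"] * 4]
print(merge_local_top_k(streams, 1))
print(exact_global_top_k(streams, 1))
```

Here the naive merge never sees "x" because each stream drops it locally, while the exact global count ranks it first.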

Abstract:
In multisource information fusion (MSIF), Dempster–Shafer evidence (DSE) theory offers a useful framework for reasoning under uncertainty. However, measuring the divergence between belief functions within this theory remains an unresolved challenge, particularly in managing conflicts in MSIF, which is crucial for improving decision-making. In this paper, several divergence and distance functions are proposed to quantitatively measure discrimination between belief functions in DSE theory, including the reverse evidential Kullback–Leibler (REKL) divergence, evidential Jeffrey’s (EJ) divergence, evidential Jensen–Shannon (EJS) divergence, evidential $\chi^2$ (E$\chi^2$) divergence, evidential symmetric $\chi^2$ (ES$\chi^2$) divergence, evidential triangular (ET) discrimination, evidential Hellinger (EH) distance, and evidential total variation (ETV) distance. On this basis, a generalized $f$-divergence, also called the evidential $f$-divergence (Ef divergence), is proposed. Depending on the kernel function, the Ef divergence reduces to several specific classes: EKL, REKL, EJ, EJS, E$\chi^2$ and ES$\chi^2$ divergences, ET discrimination, and EH and ETV distances. Notably, when basic belief assignments (BBAs) are transformed into probability distributions, these classes of Ef divergence revert to their classical counterparts in statistics and information theory. In addition, several Ef-MSIF algorithms are proposed for pattern classification based on the classes of Ef divergence. These Ef-MSIF algorithms are evaluated on real-world datasets to demonstrate their practical effectiveness in solving classification problems. In summary, this work represents the first attempt to extend the classical $f$-divergence within the DSE framework, capitalizing on the distinct properties of BBA functions. Experimental results show that the proposed Ef-MSIF algorithms improve classification accuracy, with the best-performing Ef-MSIF algorithm achieving an overall performance difference approximately 1.22 times smaller than the suboptimal method and 14.12 times smaller than the worst-performing method.
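
For reference, the classical (probabilistic) $f$-divergence that the evidential version generalizes can be sketched as follows. This covers only ordinary discrete probability distributions with strictly positive entries, not the evidential definitions on BBAs. The kernel names and the dictionary `KERNELS` are illustrative choices, and the Hellinger kernel is given without the factor 1/2 that some conventions include.

```python
import math

# Classical kernels f with f(1) = 0; each yields a well-known divergence.
KERNELS = {
    "KL": lambda t: t * math.log(t),
    "reverse_KL": lambda t: -math.log(t),
    "Jeffreys": lambda t: (t - 1) * math.log(t),
    "chi2": lambda t: (t - 1) ** 2,
    "squared_Hellinger": lambda t: (math.sqrt(t) - 1) ** 2,
    "total_variation": lambda t: 0.5 * abs(t - 1),
}

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), assuming all p(x), q(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.5]
q = [0.25, 0.75]
for name, kernel in KERNELS.items():
    print(name, f_divergence(p, q, kernel))
```

Swapping the kernel is the only change needed to move between divergences, which mirrors how the abstract describes the Ef divergence specializing to its named classes.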

Abstract:
The dictionary-based approach is one of the most representative types of time series classification (TSC) algorithm due to its high accuracy, efficiency, and good interpretability. However, existing studies focus on the centralized scenario where data from multiple sources are gathered. Considering that in many practical applications, data owners are reluctant to share their data due to privacy concerns, we study an unexplored problem involving collaboratively building the dictionary-based model over the data owners without disclosing their private data (i.e., in the federated scenario). We propose FedDict, a novel dictionary-based TSC approach customized for the federated setting to benefit from the advantages of the centralized algorithms. To further improve the performance and practicality, we propose a novel federated optimization algorithm for training logistic regression classifiers using dictionary features. The algorithm does not rely on any secure broker and is more accurate and efficient than existing solutions without hyper-parameter tuning. We also propose two contract algorithms for federated dictionary building, such that the user can flexibly balance the running time and the TSC performance through a pre-defined time limit. Extensive experiments on a total of 117 highly heterogeneous datasets validate the effectiveness of our methods and the superiority over existing solutions.

Abstract:
Knowledge hypergraph embedding models are usually computationally expensive due to the inherently complex semantic information. However, existing works mainly focus on improving the effectiveness of knowledge hypergraph embedding, making the model architecture more complex and redundant. It is desirable and challenging for knowledge hypergraph embedding to reach a trade-off between model effectiveness and efficiency. In this paper, we propose an end-to-end efficient knowledge hypergraph embedding model, HyCubE, which designs a novel 3D circular convolutional neural network and an alternate mask stack strategy to enhance the interaction and extraction of feature information comprehensively. Furthermore, our proposed model achieves a better trade-off between effectiveness and efficiency by adaptively adjusting the 3D circular convolutional layer structure to handle $n$-ary knowledge tuples of different arities with fewer parameters. In addition, we use a knowledge hypergraph 1-N multilinear scoring method to further accelerate model training. Finally, extensive experimental results on all datasets demonstrate that our proposed model consistently outperforms state-of-the-art baselines, with an average improvement of 8.22% and a maximum improvement of 33.82% across all metrics. Meanwhile, HyCubE is $6.12\times$ faster, uses 52.67% less GPU memory, and has 85.21% fewer parameters compared with the average of the latest state-of-the-art baselines.

Abstract:
Spatial databases play a vital role in a number of applications ranging from geographic information systems to location-based services. Application tasks typically access underlying spatial data to answer queries. However, non-experts lack the expertise necessary for formulating spatial queries. To fill in this gap, we propose an effective framework that translates natural language queries over spatial data into executable database queries, called NALSpatial. The framework consists of two core phases: (i) natural language understanding and (ii) natural language translation. Phase (i) extracts key entity information, comprehends the query intent and determines the query type by employing natural language processing techniques and deep learning algorithms. The key entities and query type are passed to phase (ii), which makes use of entity mapping rules and structured language models to construct executable database queries. NALSpatial supports dealing with five types of queries including (i) basic queries (e.g. distance and area), (ii) range queries, (iii) nearest neighbor queries, (iv) spatial join queries and (v) aggregation queries. We develop NALSpatial in an open-source extensible database system SECONDO. Extensive experiments show that NALSpatial on average achieves response time of about 2.5 seconds, translatability of 95% and translation precision of 92%, outperforming three state-of-the-art methods.

Abstract:
Efforts to predict stock market outcomes have yielded limited success due to the inherently stochastic nature of the market, influenced by numerous unpredictable factors. Many existing prediction approaches focus on single-point predictions, lacking the depth needed for effective decision-making and often overlooking market risk. To bridge this gap, we propose RAGIC, a novel risk-aware framework for stock interval prediction to quantify uncertainty. Our approach leverages a Generative Adversarial Network (GAN) to produce future price sequences infused with the randomness inherent in financial markets. RAGIC’s generator detects the risk perception of informed investors and captures historical price trends globally and locally. The risk-sensitive intervals are then built from the simulated future prices through statistical inference, incorporating horizon-wise insights. The interval’s width is adaptively adjusted to reflect market volatility. Importantly, our approach relies solely on publicly available data and incurs only low computational overhead. RAGIC’s evaluation across globally recognized broad-based indices demonstrates its balanced performance, offering both accuracy and informativeness. Achieving a consistent 95% coverage, RAGIC maintains a narrow interval width. This promising outcome suggests that our approach effectively addresses the challenges of stock market prediction while incorporating vital risk considerations.
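
One simple way to turn a set of simulated future prices into an interval is to take empirical quantiles. The sketch below assumes that reading of "statistical inference", which the abstract does not spell out; it uses Gaussian noise as a stand-in for generator output, and the function name `empirical_interval` is hypothetical. It is not RAGIC's actual interval construction.

```python
import math
import random

def empirical_interval(simulated_prices, coverage=0.95):
    """Central interval whose endpoints are order statistics of the samples."""
    xs = sorted(simulated_prices)
    n = len(xs)
    alpha = (1.0 - coverage) / 2.0
    # Round outward so the interval contains at least `coverage` of the samples.
    lower = xs[math.floor(alpha * (n - 1))]
    upper = xs[math.ceil((1.0 - alpha) * (n - 1))]
    return lower, upper

# Stand-in for generator output: 1,000 simulated prices for one future day.
rng = random.Random(0)
simulated = [100.0 + rng.gauss(0.0, 2.0) for _ in range(1000)]
low, high = empirical_interval(simulated, coverage=0.95)
inside = sum(low <= x <= high for x in simulated) / len(simulated)
print(low, high, inside)
```

With this construction, a more dispersed set of simulated prices automatically yields a wider interval, which is one way the width can track volatility.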

Abstract:
Multi-Modal Knowledge Graphs (MMKGs), comprising relational triples and related multi-modal data (e.g., text and images), usually suffer from the problems of low coverage and incompleteness. To mitigate this, existing studies introduce a fundamental MMKG fusion task, i.e., Multi-Modal Entity Alignment (MMEA) that identifies equivalent entities across multiple MMKGs. Despite MMEA’s significant advancements, effectively integrating MMKGs remains challenging, mainly stemming from two core limitations: 1) entity ambiguity, where real-world entities across different MMKGs may possess multiple corresponding counterparts or alternative identities; and 2) severe noise within multi-modal data. To tackle these limitations, a new task MMER (Multi-Modal Entity Resolution), which expands the scope of MMEA to encompass entity ambiguity, is introduced. To tackle this task effectively, we develop a novel model ADMH-ER (Adaptive Denoising Multi-modal Hybrid for Entity Resolution) that incorporates several crucial modules: 1) multi-modal knowledge encoders, which are crafted to obtain entity representations based on multi-modal data sources; 2) an adaptive denoising multi-modal hybrid module that is designed to tackle challenges including noise interference, multi-modal heterogeneity, and semantic irrelevance across modalities; and 3) a hierarchical multi-objective learning strategy, which is proposed to ensure diverse convergence capabilities among different learning objectives. Experimental results demonstrate that ADMH-ER outperforms state-of-the-art methods.

Abstract:
A colorful star motif is a star-shaped graph where any two nodes have different colors. Counting the colorful star motif can help to analyze the structural properties of real-life colorful graphs, model higher-order clustering, and accelerate the mining of the densest subgraph exhibiting $h$-clique characteristics in graphs. In this manuscript, we introduce the concept of the colorful $h$-star in a colored graph and propose two higher-order cohesive subgraph models, namely the colorful $h$-star core and the colorful $h$-star truss. We show that the colorful $h$-stars can be counted and updated very efficiently using a novel dynamic programming (DP) algorithm. Based on the proposed DP algorithm, we develop a colorful $h$-star core decomposition algorithm which takes $O(hm)$ time and $O(hn+m)$ space, and a colorful $h$-star truss decomposition algorithm which takes $O(hm^{1.5})$ time and $O(hm)$ space, where $m$ and $n$ denote the number of edges and nodes of the graph, respectively. Moreover, we also propose a graph reduction technique based on our colorful $h$-star core model to accelerate the computation of the approximation algorithm for $h$-clique densest subgraph mining. The results of comprehensive experiments on 11 large real-world datasets demonstrate the efficiency, scalability and effectiveness of the proposed algorithms.
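
A minimal sketch of how colorful stars centered at one node can be counted with a small DP is given below. It assumes that an $h$-star consists of a center plus $h-1$ leaves and that all $h$ nodes must have pairwise different colors; the paper's exact definition and DP may differ. The function name and the toy input are hypothetical.

```python
from collections import Counter

def count_colorful_stars(center_color, neighbor_colors, h):
    """Count colorful stars with h nodes centered at one node.

    Leaves must have pairwise distinct colors, all different from the center's.
    ways[j] holds the number of ways to pick j such leaves from the colors
    processed so far (an elementary symmetric polynomial of the color counts).
    """
    per_color = Counter(c for c in neighbor_colors if c != center_color)
    leaves = h - 1
    ways = [1] + [0] * leaves
    for count in per_color.values():
        # Go downward so each color contributes at most one leaf.
        for j in range(leaves, 0, -1):
            ways[j] += ways[j - 1] * count
    return ways[leaves]

# Red center with neighbors colored blue, blue, green, red, yellow.
neighbors = ["blue", "blue", "green", "red", "yellow"]
print(count_colorful_stars("red", neighbors, 3))  # 2*1 + 2*1 + 1*1 = 5
```

The inner loop runs $h-1$ times per distinct neighbor color, so the work per node is bounded by $h$ times its degree, consistent in spirit with the $O(hm)$ bound quoted above.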

Abstract:
Entity linking (EL) is a challenging task as it typically requires matching an ambiguous entity mention with its corresponding entity in a knowledge base (KB). Mainstream studies focus on learning and evaluating linking models on the same corpus and have achieved significant performance gains; however, they often overlook the ability to generalize to out-of-domain corpora, a setting that is more realistic yet much more challenging. To address this issue, we introduce a novel neural-symbolic model for entity linking, which is inspired by the symbol-manipulation mechanism in human brains. Specifically, we abstract diverse features into unified variables, then combine them using neural operators to capture diverse relevance requirements, and finally aggregate relevance scores through voting. We conduct experiments on eleven benchmark datasets with different types of text, and the results show that our method outperforms nearly all baselines. Notably, the best performance of our method on seven out-of-domain datasets highlights its generalization ability.

Abstract:
Network Representation Learning (NRL) has achieved remarkable success in learning low-dimensional representations for network nodes. However, most NRL methods, including Graph Neural Networks (GNNs) and their variants, face critical challenges. First, labeled network data, which are required for training most GNNs, are expensive to obtain. Second, existing methods are sub-optimal in preserving comprehensive topological information, including structural and positional information. Finally, most GNN approaches ignore the rich node content information. To address these challenges, we propose a self-supervised Network-to-Network framework (Net2Net) to learn semantically meaningful node representations. Our framework employs a pretext task of node position prediction (PosPredict) to effectively fuse the topological and content knowledge into low-dimensional embeddings for every node in a semi-supervised manner. Specifically, we regard a network as node content and position networks, where Net2Net aims to learn the mapping between them. We utilize a multi-layer recursively composable encoder to integrate the content and topological knowledge into the egocentric network node embeddings. Furthermore, we design a cross-modal decoder to map the egocentric node embeddings into their node position identities (PosIDs) in the node position network. Extensive experiments on eight diverse networks demonstrate the superiority of Net2Net over comparable methods.

Abstract:
Next point-of-interest (POI) recommendation predicts user’s next movement and facilitates location-based applications such as destination suggestion and travel planning. State-of-the-art (SOTA) methods learn an adaptive graph from user trajectories and compute POI representations using graph neural networks (GNNs). However, a single graph cannot capture the diverse dependencies among the POIs (e.g., geographical proximity and transition frequency). To tackle this limitation, we propose the Adaptive Graph Contrastive Learning (AGCL) framework. AGCL constructs multiple adaptive graphs, each modeling a kind of POI dependency and producing one POI representation; and the POI representations from different graphs are merged into a multi-facet representation that encodes comprehensive information. To train the POI representations, we tailor a graph-based contrastive learning, which encourages the representations of similar POIs to align and dissimilar POIs to differentiate. Moreover, to learn the sequential regularities of user trajectories, we design an attention mechanism to integrate spatial-temporal information into the POI representations. An explicit spatial-temporal bias is also employed to adjust the predictions for enhanced accuracy. We compare AGCL with 10 state-of-the-art baselines on 3 datasets. The results show that AGCL outperforms all baselines and achieves an improvement of 10.14% over the best performing baseline in average accuracy.

Abstract:
In recent years, sequence prediction, particularly in natural language processing tasks, has made significant progress due to advanced neural network architectures like Transformer and enhanced computing power. However, challenges persist in modeling and analyzing certain types of sequence data, such as human daily activities and competitive ball games. These segmented sequence data are characterized by short length, varying local dependencies, and coarse-grained unit states. These characteristics limit the effectiveness of conventional probabilistic graphical models and attention-based or recurrent neural networks in modeling and analyzing segmented sequence data. To address this gap, we introduce a novel generative model for segmented sequences, employing an ensemble of multiple variable-order Markov models (VOMMs) to flexibly represent state transition dependencies. Our approach integrates probabilistic graphical models with neural networks, surpassing the representation capabilities of single high-order or variable-order Markov models. Compared to end-to-end deep learning models, our method offers improved interpretability and reduces overfitting in short segments. We demonstrate the efficacy of our proposed method in two tasks: predicting tennis shot types and forecasting daily action sequences. These applications highlight the broad applicability of our segmented sequence modeling approach across diverse domains.
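
To illustrate what "variable-order" means in practice, the sketch below implements a single count-based variable-order Markov predictor that backs off from the longest matching context to shorter ones. It is only a simplified stand-in: the ensemble of VOMMs and the neural components described in the abstract are not modeled, and the class name, back-off rule, and toy shot sequences are assumptions.

```python
from collections import Counter, defaultdict

class VariableOrderMarkov:
    """Next-state predictor that uses the longest context seen in training."""

    def __init__(self, max_order):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-state counts

    def fit(self, sequences):
        for seq in sequences:
            for i, nxt in enumerate(seq):
                # Record the next state under every context length 0..max_order.
                for order in range(self.max_order + 1):
                    if i - order < 0:
                        break
                    self.counts[tuple(seq[i - order:i])][nxt] += 1
        return self

    def predict(self, history):
        # Try the longest usable context first, then back off.
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - order:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None

rallies = [
    ["serve", "forehand", "backhand"],
    ["serve", "forehand", "backhand"],
    ["return", "forehand", "volley"],
]
model = VariableOrderMarkov(max_order=2).fit(rallies)
print(model.predict(["return", "forehand"]))  # order-2 context was seen
print(model.predict(["lob", "forehand"]))     # backs off to order 1
```

Backing off lets short segments still produce a prediction when a long context was never observed, which is relevant to the short-length property of segmented sequences noted above.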

Abstract:
The boom in computer graphics technology facilitates the growing use of terrain data. Notably, shortest path querying on a terrain surface is central in a range of applications and has received substantial attention from the database community. Despite this, computing the shortest paths on-the-fly on a terrain surface remains very expensive, and all existing oracle-based algorithms are only efficient when the terrain surface is fixed. They rely on large data structures that must be re-constructed from scratch when updates to the terrain surface occur, which is very time-consuming. To advance the state-of-the-art, we propose an efficiently updatable $(1+\epsilon)$-approximate shortest path oracle for a set of Points-of-Interest (POIs) on an updated terrain surface, and it can be easily adapted to the case where POIs are not given as input. Our experiments show that when POIs are given (resp. not given), our oracle is up to 88 times, 12 times, and 3 times (resp. 15 times, 50 times, and 100 times) better than the best-known oracle on terrain surfaces in terms of the oracle update time, output size, and shortest path query time.

Abstract:
Hyperedges, as extensions of pairwise edges, can characterize higher-order relations among multiple individuals. Due to the necessity of hypergraph detection in practical systems, hyperedge prediction has become a frontier problem in complex networks. However, previous hyperedge prediction models encounter three challenges: (i) failing to predict dynamic and arbitrary-order hyperedges simultaneously, (ii) mixing higher-order and lower-order features together when propagating neighborhood information, and (iii) lacking the capability to learn physical evolution laws, all of which lead to poor model performance. To tackle these challenges, we propose D$^3$HP, a Dual-view Desynchronization hypergraph learning method for arbitrary-order Dynamic Hyperedge Prediction. Specifically, D$^3$HP extracts the dynamic higher-order and lower-order features of hyperedges separately through an elastic hypergraph neural network (EHGNN) and an alternate desynchronization graph convolutional network (ADGCN) at each time snapshot. EHGNN is designed to incrementally mine the implicit higher-order relations and propagate neighborhood information. Moreover, ADGCN aims to combine GCN with desynchronization learning to learn the physical evolution of lower-order relations and alleviate the over-smoothing problem. Further, we improve the prediction performance of the model by rationally fusing the features learned from the dual views. Extensive experiments on 8 dynamic higher-order networks demonstrate that D$^3$HP outperforms 14 state-of-the-art baselines.

Abstract:
Monotonic classification is a special ordinal classification task that involves monotonicity constraints between features and the decision. Monotonic feature selection can reduce dimensionality while preserving the monotonicity constraints, ultimately improving the efficiency and performance of monotonic classifiers. However, existing feature selection algorithms cannot handle large-scale monotonic data sets because they either ignore monotonic constraints or have high computational complexity. To address these issues, building on our team's previous research, we define the monotonic related family method, which has lower time complexity, to select informative features and obtain multiple reducts carrying complementary information from multiple views of the raw feature space. Using bi-directional rank mutual information, we build two trees for each feature subset and fuse all trees using the corresponding decision support level (BFMDT). Compared with six representative algorithms for monotonic feature selection, BFMDT's average classification accuracy increased by 4.06% (FFREMT), 6.77% (FCMT), 5.61% (FPRS_up), 6.05% (FPRS_down), 5.86% (FPRS_global), 4.41% (Bagging), 7.65% (REMT) and 21.89% (FMKNN). The average execution time compared to tree-based algorithms decreased by 83.41% (FFREMT), 96.96% (FCMT), 75.64% (FPRS_up), 59.43% (FPRS_down), 84.65% (FPRS_global), 81.50% (Bagging) and 63.41% (REMT), while most of the compared algorithms were unable to complete computation on six high-dimensional datasets.

Abstract:
Collective Location Selection (CLS) has received significant research attention in the spatial database community due to its wide range of applications. The CLS problem selects a group of k preferred locations among candidate sites to establish facilities, aiming to collectively attract the maximum number of users. Existing studies commonly assume every user is located in a fixed position, without considering the competition between peer facilities. Unfortunately, in real markets, users are mobile and choose which facility to patronize among a host of competitors, making traditional techniques inapplicable. To this end, this paper presents the first effort on a CLS problem in competition scenarios, called mc$^2$ls, taking into account the mobility factor. Solving mc$^2$ls is a non-trivial task due to its NP-hardness. To overcome the challenge of pruning multi-point users with highly overlapped minimum bounding rectangles (MBRs), we exploit a position count threshold and design two square-based pruning rules. We introduce IQuad-tree, a user-MBR-free index, to benefit from the hierarchical and batch-wise properties of the pruning rules. We propose a $(1-\frac{1}{e})$-approximate greedy solution to mc$^2$ls and incorporate a candidate-pruning strategy to further accelerate the computation for handling skewed datasets. Extensive experiments are conducted on real datasets, demonstrating the superiority of our proposed pruning rules and solution compared to the state-of-the-art techniques.
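
The $(1-\frac{1}{e})$ guarantee is the classical bound for greedy maximization of a monotone submodular objective. The sketch below shows that greedy pattern on a plain coverage objective, where each candidate site attracts a fixed set of users. This deliberately ignores user mobility, competition, and the pruning and indexing techniques of the paper; the function name and the toy data are hypothetical.

```python
def greedy_select(candidates, k):
    """Pick up to k sites, each time adding the one with the largest marginal gain.

    candidates: dict mapping a site id to the set of users it would attract.
    Returns the chosen sites (in pick order) and the set of users covered.
    """
    chosen, covered = [], set()
    for _ in range(min(k, len(candidates))):
        best = max(
            (site for site in candidates if site not in chosen),
            key=lambda site: len(candidates[site] - covered),
        )
        if not candidates[best] - covered:
            break  # no remaining site adds any new user
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

sites = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}}
picked, users = greedy_select(sites, k=2)
print(picked, len(users))  # C first (4 new users), then A (3 new users)
```

Because marginal gains can only shrink as more sites are chosen, candidates whose upper-bound gain is already below the current best can be skipped, which is the general idea behind candidate pruning in greedy schemes of this kind.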

Abstract:
In this paper, a multi-object recognition scenario is considered to extend the random finite set into the random permutation set. Probabilistic information on a random permutation set can be viewed as a distribution determined by three random variables. We use another emerging uncertainty representation, the order-2 information granule, to realize probabilistic information fusion on random permutation sets. First, the probabilistic information on random permutation sets is viewed as an order-2 probability distribution. Second, a corresponding information fusion approach is proposed. Finally, the proposed approach is applied to random permutation sets, resolving the decision-making issue under the multi-object recognition scenario. This paper pioneers the connection of order-2 information processing logic to a multi-object recognition task and develops the order-2 probability distribution and its combination rules. Compared to traditional probabilistic information fusion approaches, the proposed approach takes into account not only the propositions’ beliefs provided by the sources, but also the structural dependency among propositions.

Abstract:
Optimizing stock selection through stock ranking is one of the critical but intricate tasks in quantitative trading because of the non-stationary dynamics and complicated interdependencies behind stock markets. Recent studies have made efforts to model historical market movements to enhance stock selection. However, they primarily borrowed the spirit of time series modeling and sought to build a deterministic paradigm without considering uncertain fluctuations. In addition, some of these studies are tailored to explore stock correlations from a predefined (e.g., binary) graph structure and use explicitly simple relations (such as first-order relations) to guide evolving interactions. Nevertheless, aggregating predefined but shallow relationships to collaborate with stock movements may affect selection generalizability and increase the risk of portfolio failure. This study introduces a novel Relational stock selection framework via probabilistic State Space Learning (RSSL). Specifically, RSSL first attempts to build a tree-based structure to explicitly expose higher-order relations in the stock market, primarily by discovering a hierarchical delineation of ties between stocks. It then couples this structure with time-varying movements via an attention mechanism to smoothly explore the interactive correlations among different stocks. Inspired by recent state space models (SSMs) in probabilistic Bayesian learning, we devise a Probabilistic Kalman Network (PKNet) with uncertainty estimates to recursively simulate ever-changing stock volatility, enabling more promising return-risk trade-offs. The experimental results on several real-world stock market datasets demonstrate that RSSL outperforms several representative baseline methods by a significant margin.

Abstract:
Session-based Recommender Systems (SBRSs) aim at timely predicting the next likely item by capturing users’ current preferences in sessions. Existing SBRS research focuses only on maximizing session utilities, and little has been done on the fairness issue in SBRSs, which is vital but different from the same issue in traditional Recommender Systems (RSs). To fill this gap, we define a novel concept of session-oriented fairness that requires individual items to have the same exposure accumulated within each single session, which is flexible enough to provide opportunities to achieve different fairness goals. Then, we devise a Session-Oriented Fairness-Aware algorithm (SOFA) with a dual Temporal Convolutional Network (TCN) architecture: one is SOUP (Session-Oriented Utility Promoter) and the other is SODA (Session-Oriented Disparity Alleviator). Benefiting from the collaborative learning of SOUP and SODA on the evolution of accumulated exposure in sessions, SOFA effectively maximizes session-oriented fairness while maintaining high session utilities. To the best of our knowledge, this research is the first to address fairness issues in SBRSs. Extensive experiments on real-world datasets demonstrate that SOFA outperforms the state-of-the-art approaches in terms of both utility and fairness.

Abstract:
As the representative density-based clustering algorithm, density peaks clustering (DPC) has wide recognition, and many improved algorithms and applications have been extended from it. However, the DPC involving privacy protection has not been deeply studied. In addition, there is still room for improvement in the selection of centers and allocation methods of DPC. To address these issues, vertical federated density peaks clustering under nonlinear mapping (VFDPC) is proposed to address privacy protection issues in vertically partitioned data. Firstly, a hybrid encryption privacy protection mechanism is proposed to protect the merging process of distance matrices generated by client data. Secondly, according to the merged distance matrix, a more effective cluster merging under nonlinear mapping is proposed to ameliorate the process of DPC. Results on man-made, real, and multi-view data fully prove the improvement of VFDPC on clustering accuracy.
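For readers unfamiliar with the baseline that VFDPC builds on, the following is a minimal sketch of plain density peaks clustering on a single machine. It covers only the two standard quantities (local density and distance to the nearest denser point) and a simple center-selection rule; it does not model the hybrid encryption, the federated distance-matrix merging, or the nonlinear mapping described above. The function names `density_peaks` and `pick_centers` and the cutoff parameter `d_c` are illustrative assumptions.

```python
import math

def density_peaks(points, d_c):
    """Return (rho, delta) per point, following the basic DPC definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points strictly closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distance to the nearest point that has a higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance to any point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

def pick_centers(rho, delta, k):
    """Pick the k points with the largest rho * delta as cluster centers."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(order[:k])

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
    rho, delta = density_peaks(pts, d_c=1.0)
    print(pick_centers(rho, delta, 2))
```

In a vertically partitioned setting, each client would hold only some feature columns, so the `dist` matrix here stands in for the merged distance matrix that the abstract says is protected during merging.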

Abstract:
Joint multimodal entity-relation extraction (JMERE) is a challenging task that involves two joint subtasks, i.e., named entity recognition and relation extraction, from multimodal data such as text sentences with associated images. Previous JMERE methods have primarily employed 1) pipeline models, which apply pre-trained unimodal models separately and ignore the interaction between tasks, or 2) word-pair relation tagging methods, which neglect neighboring word pairs. To address these limitations, we propose a fine-grained network for JMERE. Specifically, we introduce a fine-grained alignment module that utilizes a phrase-patch to establish connections between text phrases and visual objects. This module can learn consistent multimodal representations from multimodal data. Furthermore, we address the task-irrelevant image information issue by proposing a gate fusion module, which mitigates the impact of image noise and ensures a balanced representation between image objects and text representations. Furthermore, we design a multi-word decoder that enables ensemble prediction of tags for each word pair. This approach leverages the predicted results of neighboring word pairs, improving the ability to extract multi-word entities. Evaluation results from a series of experiments demonstrate the superiority of our proposed model over state-of-the-art models in JMERE.

Abstract:
In search sessions, a series of interactions in the context has been proven to be advantageous in capturing users’ search intents. Existing studies show that designing pre-training tasks and data augmentation strategies for session search improves the robustness and generalizability of the model. However, such data augmentation strategies only focus on changing the original session structure to learn a better representation. Ignoring information from outside the session, users’ diverse and complex intents cannot be learned well by simply reordering and deleting historical behaviors, proving that such strategies are limited and inadequate. In order to solve the problem of insufficient modeling under complex user intents, we propose exploiting information outside the original session. More specifically, in this paper, we sample queries and documents from the global click-on and follow-up session graph, alter an original session with these samples, and construct a new session that shares a similar user intent with the original one. Specifically, we design four data augmentation strategies based on session graphs in view of both one-hop and multi-hop structures to sample intent-associated query/document nodes. Experiments conducted on three large-scale public datasets demonstrate that our model outperforms the existing ad-hoc and context-aware document ranking models.

Abstract:
Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping from the model parameters to the causality, and it fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the use of whole deep learning models in temporal causal discovery, we propose an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution, which aggregates each input time series along the temporal dimension under the temporal priority constraint. Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer in discovering temporal causality.

Affiliations: School of Computer Science and Technology, Zhejiang Normal University, Jinhua, China; School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China; School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; Qingdao Institute of Software, China University of Petroleum (East China), Qingdao, China; Guangzhou Institute of Technology, Xidian University, Guangzhou, China; School of Computer Science, University of Technology Sydney, Sydney, NSW, Australia; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Abstract:
In unsupervised graph anomaly detection, existing methods usually focus on detecting outliers by learning local context information of nodes, while often ignoring the importance of global context. However, global context information can provide more comprehensive relationship information between nodes in the network. By considering the structure of the entire network, detection methods are able to identify potential dependencies and interaction patterns between nodes, which is crucial for anomaly detection. Therefore, we propose an innovative graph anomaly detection framework, termed CoCo (Context Correlation Discrepancy Analysis), which detects anomalies by meticulously evaluating variances in correlations. Specifically, CoCo leverages the strengths of Transformers in sequence processing to effectively capture both global and local contextual features of nodes by aggregating neighbor features at various hops. Subsequently, a correlation analysis module is employed to maximize the correlation between local and global contexts of each normal node. Unseen anomalies are ultimately detected by measuring the discrepancy in the correlation of nodes’ contextual features. Extensive experiments conducted on six datasets with synthetic outliers and five datasets with organic outliers have demonstrated the significant effectiveness of CoCo compared to existing methods.

Abstract:
Domain generalization (DG) tasks aim to learn cross-domain models from source domains and apply them to unknown target domains. Recent research has demonstrated that diverse and rich source domain samples can enhance domain generalization capability. This work argues that the impact of each sample on the model's generalization ability varies. Even a small-scale but high-quality dataset can achieve a notable level of generalization. Motivated by this, we propose a domain-adversarial active learning (DAAL) algorithm for classification tasks in DG. First, we observe that the objective of DG tasks is to maximize the inter-class distance within the same domain and minimize the intra-class distance across different domains. We design a domain adversarial selection method that prioritizes challenging samples in an active learning (AL) framework. Second, we hypothesize that even in a converged model, some feature subsets lack discriminatory power within each domain. We develop a method to identify and optimize these feature subsets, thereby maximizing the inter-class distance of features. Lastly, we experimentally compare our DAAL algorithm with various DG and AL algorithms across four datasets. The results demonstrate that the DAAL algorithm can achieve strong generalization ability with fewer data resources, thereby significantly reducing data annotation costs in DG tasks.

Abstract:
Context-aware data selectivity in Edge Computing (EC) requires nodes to efficiently manage the data collected from Internet of Things (IoT) devices, e.g., sensors, for supporting real-time and data-driven pervasive analytics. Data selectivity at the network edge copes with the challenge of deciding which data should be kept at the edge for future analytics tasks under limited computational and storage resources. Our challenge is to efficiently learn the access patterns of data-driven tasks (analytics) and predict which data are relevant, thus, being stored in nodes’ local datasets. Task patterns directly indicate which data need to be accessed and processed to support end-users’ applications. We introduce a task workload-aware mechanism which adopts one-class classification to learn and predict the relevant data requested by past tasks. The inherent uncertainty in learning task patterns, identifying inliers and eliminating outliers is handled by introducing a lightweight fuzzy inference estimator that dynamically adapts nodes’ local data filters ensuring accurate data relevance prediction. We analytically describe our mechanism and comprehensively evaluate and compare against baselines and approaches found in the literature showcasing its applicability in pervasive EC.

Abstract:
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets. We design an algorithm working in an architecture-friendly manner (AFM), which is a way of suppressing performance-degradation factors such as the numbers of instructions, branch mispredictions, and cache misses in the CPUs of a computer system. For the AFM operation, we leverage universal characteristics (UCs) of the data, which are skewed distributions on data relationships. The UCs indicate that most of the multiplications for similarity calculations are executed on high-document-frequency terms and that most of a similarity value is obtained from the multiplications involving a few high mean-feature values. To extract this specific region of terms and mean-feature values, we construct a mean-inverted index partitioned into three regions by two structural parameters. Our algorithm optimizes the parameters by minimizing the approximate number of multiplications (corresponding to the instructions) based on our efficient pruning method, reduces conditional branches by sharing the index structure among all the objects, and keeps the frequently used data of this specific region in the caches. We experimentally demonstrate that our algorithm achieves superior speed performance on large-scale documents compared with algorithms using the state-of-the-art techniques.
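As a rough illustration of the data structure mentioned above, the sketch below runs plain spherical K-means over sparse documents and computes similarities through an inverted index built over the cluster means. It is a simplified stand-in under stated assumptions: the three-region partitioning, the two structural parameters, and the pruning method of the paper are not modeled, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else dict(vec)

def build_mean_inverted_index(means):
    """Map each term to the (cluster id, mean-feature value) pairs containing it."""
    index = defaultdict(list)
    for k, mean in enumerate(means):
        for term, value in mean.items():
            index[term].append((k, value))
    return index

def assign(docs, index, n_clusters):
    """Assign each document to the cluster mean with the highest cosine similarity."""
    labels = []
    for doc in docs:
        sims = [0.0] * n_clusters
        for term, weight in doc.items():
            # Only clusters whose mean contains this term contribute a product.
            for k, value in index.get(term, ()):
                sims[k] += weight * value
        labels.append(max(range(n_clusters), key=sims.__getitem__))
    return labels

def update_means(docs, labels, n_clusters):
    """Recompute each mean as the normalized sum of its member documents."""
    sums = [defaultdict(float) for _ in range(n_clusters)]
    for doc, k in zip(docs, labels):
        for term, weight in doc.items():
            sums[k][term] += weight
    return [normalize(s) for s in sums]

def spherical_kmeans(docs, init_means, n_iter=10):
    docs = [normalize(d) for d in docs]
    means = [normalize(m) for m in init_means]
    labels = []
    for _ in range(n_iter):
        index = build_mean_inverted_index(means)
        new_labels = assign(docs, index, len(means))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        means = update_means(docs, labels, len(means))
    return labels, means
```

Because every document is matched against the same shared index, the inner loop touches only the terms a document actually contains, which is the property the abstract's index-sharing and cache arguments rely on.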

Abstract:
Clustering, as a fundamental technique in data mining and machine learning, aims to partition data into meaningful groups based on the inherent relationships among data. However, traditional clustering algorithms typically assume convex hyperspherical geometry of data, where the clusters have clearly defined boundaries and do not overlap. In contrast, real-world data often exhibits complex and non-convex geometries, which makes these assumptions ineffective and leads to inaccurate clustering results that fail to capture the intrinsic structure. To address this challenge, the paper proposes a novel granular clustering method based on an enhanced granularity representation, which further refines the principle of justifiable granularity. By introducing a more precise and flexible hyper-box granulation mechanism, the method dynamically adapts to the topology of data, thereby improving clustering accuracy. By defining the degree of aggregation and discreteness between data points, the importance of attributes in the feature space is quantified, leading to the design of a novel hyper-box feature selection (HBFS) algorithm. This algorithm integrates the granular clustering principle to optimize the feature selection process, reducing the impact of redundant features and noise, thus improving clustering efficiency and interpretability. To validate the superiority and effectiveness of the proposed method, extensive experiments were conducted on fifteen publicly available datasets, comparing the performance of the HBFS algorithm with classical and state-of-the-art feature selection methods. The results and the statistical significance tests show that HBFS significantly outperforms existing feature selection methods across various evaluation metrics.

Abstract:
Graph anomaly detection (GAD) on attributed networks aims to capture abnormal nodes whose attributes or structures differ significantly from most nodes. The existing GAD models amplify the representation differences between normal and abnormal nodes to identify anomalies via carefully designed feature extraction modules. However, these models ignore the bottlenecks encountered by abnormal nodes in message passing. In particular, when anomalies occur at critical crossroads, the information of multiple nodes is compressed into a fixed-length representation, and the resulting over-squashing weakens the abnormal information. To address this, we propose an unsupervised STructural optimization model guided by sIMilarity reconstruction (STIM). Specifically, we define redundant edges that cause over-squashing, design the Neighbor-Structure Optimization module to filter redundant edges through an edge-dropping strategy based on critical crossroads, and optimize the graph structure to alleviate over-squashing. In addition, to alleviate the over-smoothing caused by the high inter-class node similarity of the data itself and by the edge-dropping strategy, we design the Neighbor-Similarity Reconstruction module based on similarity calculation, which guides the model to expand inter-class variation. Extensive experiments on benchmark datasets show that STIM can effectively optimize message passing and improve anomaly detection performance.

Abstract:
The quality of features plays an important role in the performance of recommender systems. Recognizing this, feature selection has emerged as a crucial technique in refining recommender systems. Recent advancements leveraging Automated Machine Learning (AutoML) have drawn significant attention, particularly in two main categories: early feature selection and late feature selection, differentiated by whether the selection occurs before or after the embedding layer. Early feature selection selects a fixed subset of features and retrains the model, while late feature selection, known as adaptive feature selection, dynamically adjusts feature choices for each data instance, recognizing the variability in feature significance. Although adaptive feature selection has shown remarkable improvements in performance, its main drawback lies in its post-embedding layer feature selection. This process often becomes cumbersome and inefficient in large-scale recommender systems with billions of ID-type features, leading to a highly sparse and parameter-heavy embedding layer. To overcome this, we introduce Adaptive Early Feature Selection (AEFS), a very simple method that not only adaptively selects informative features for each instance, but also significantly reduces the activated parameters of the embedding layer. AEFS employs a dual-model architecture, encompassing an auxiliary model dedicated to feature selection and a main model responsible for prediction. To ensure effective alignment between these two models, we incorporate two collaborative training loss constraints. Our extensive experiments on three benchmark datasets validate the efficiency and effectiveness of our approach. Notably, AEFS matches the performance of current state-of-the-art Adaptive Late Feature Selection methods while achieving a significant reduction of 37.5% in the activated parameters of the embedding layer. We believe that this work opens up new possibilities for feature selection.

Abstract:
Approximate queries offer an efficient means of analyzing massive data streams under acceptable errors. Among these, subset queries over multiple attributes are common in many real-world applications. While sketches offer promising approximate solutions for massive data streams, efficiently supporting subset queries over multiple statistical attributes remains a significant challenge. To address this, we propose Hyper-USS, a novel sketching solution that accurately and efficiently supports subset queries over data streams involving multiple statistical attributes. With Joint Variance Optimization, Hyper-USS provides unbiased estimation and optimizes estimation variance jointly, addressing the challenge of accurately estimating multiple statistical attributes in the sketch design. The algorithm records the information of keys and all attributes in one sketch, ensuring high insertion efficiency. Furthermore, its three speed-optimized versions are introduced to handle the growing number of statistical attributes in data streams. Experimental results show that Hyper-USS and its three speed-optimized versions consistently surpass state-of-the-art methods that support subset queries in both estimation accuracy and insertion throughput. Specifically, Hyper-USS improves accuracy by at least 38%, while the algorithm and its three speed-optimized versions achieve throughput improvements of up to 31.90×, 45.31×, 49.21×, and 58.03×, respectively.

Abstract:
Despite advancements using graph neural networks (GNNs) to capture complex user-item interactions, challenges persist due to data sparsity and noise. To address these, self-supervised learning (SSL) methods, particularly recent generative approaches, have gained attention due to their ability to augment graph data without requiring complex view constructions and unstable negative sampling. However, existing generative SSL solutions often focus on structural rather than semantic reconstruction (where semantics refers to collaborative signals in recommendation scenarios), limiting their potential as comprehensive recommenders. This paper explores the untapped potential of generative SSL for graph-based recommender systems. We highlight two critical challenges: first, designing effective diffusion mechanisms to enhance semantic information and collaborative signals while avoiding optimization biases; and second, developing adaptive structural masking mechanisms within graph diffusion to improve overall model performance. Motivated by these challenges, we propose a novel approach: the Guided Diffusion enhanced Mask graph AutoEncoder (GDiffMAE). GDiffMAE integrates an adaptive mask encoder for structural reconstruction and a guided diffusion model for semantic reconstruction, addressing the limitations of current methods. Experimental results on diverse datasets demonstrate that GDiffMAE consistently outperforms powerful baseline models, particularly in handling noisy data scenarios. By enhancing both structural and semantic dimensions through guided diffusion, our model advances the state of the art in graph-based recommender systems.

Abstract:
Existing cold-start recommendation methods typically use item-level alignment strategies to align the content feature and collaborative feature of warm items during model training. However, these methods are less effective for cold items with low semantic similarity to the warm items when they first appear in the test stage, as they have no historical interactions to obtain the collaborative feature. In this paper, we propose a preference aware recommendation (PARec) model with hierarchical item alignment to solve the item cold-start issue. Our approach exploits user preference from historical records to achieve group-level alignment with item content feature, enhancing recommendation performance. Specifically, our hierarchical item alignment strategy improves recommendations for both high and low similarity cold items by using item-level alignment for high similarity cold items and introducing group-level alignment for low similarity cold items. Low similarity cold items can be successfully recommended through relationships among items, captured by our group-level alignment, based on their co-occurrence possibilities and semantic similarities. For model training, a hierarchical contrastive objective function is presented to balance the performance of warm and cold items, achieving better overall performance. Extensive experiments demonstrate the effectiveness of our method, with results showing its superiority compared to state-of-the-art approaches.

Abstract:
Entity Alignment (EA) is to link potential equivalent entities across different knowledge graphs (KGs). Most existing EA methods are supervised as they require the supervision of seed alignments, i.e., manually specified aligned entity pairs. Very recently, several EA studies have made some attempts to get rid of seed alignments. Despite achieving preliminary progress, they still suffer two limitations: (1) The entity embeddings produced by their GNN-like encoders lack personalization since some of the aggregation subpaths are shared between different entities. (2) They cannot fully alleviate the distribution distortion issue between candidate KGs due to the absence of supervised signals. In this work, we propose a novel unsupervised entity alignment approach called UNEA to address the above two issues. First, we parametrically sample a tree neighborhood rooted at each entity, and accordingly develop a tree attention aggregation mechanism to extract a personalized embedding for each entity. Second, we introduce an auxiliary task of maximizing the mutual information between the input and the output of the KG encoder, which serves as a regularization to prevent the distribution distortion. Extensive experiments show that our UNEA achieves a new state-of-the-art for the unsupervised EA task, and can even outperform many existing supervised EA baselines.

Abstract:
Graphs serve as an essential data structure to model complex relationships in a variety of applications, such as social networks, web graphs, and chemical informatics. Due to the high cost of maintaining large-scale graph data and executing graph queries, data owners often outsource their graph data to a third-party service provider for graph processing. In this scenario, it is crucial to ensure the integrity of query results, as the provider may have the incentive to return only partial or tampered results to save computing resources or serve their own interests. Blockchain, as a promising solution for secure data storage and retrieval, opens up new opportunities for data management in such scenarios. To scale the blockchain, existing studies have concentrated on using off-chain storage while ensuring the integrity of query results for key-value data in hybrid-storage blockchain architectures. To the best of our knowledge, there is no work to enable the blockchain to support subgraph matching queries. In this paper, we first study the problem of authenticated subgraph matching queries. Traditional subgraph matching algorithms follow the filtering-searching paradigm. The main challenge is to design an Authenticated Data Structure (ADS) and aggregation algorithm that efficiently aggregates non-results for verification during the filtering-searching process. We first propose a vertex-based scheme - the novel ADS MELTree can generate candidate vertices and aggregate non-resulting vertices in the filtering phase, while the aggregation algorithm AMatching can aggregate invalid partial results in the search phase. Furthermore, we propose the bidirectional search aggregation algorithm AMatching and ADS MVPTree to reduce the computational cost in the search phase and to reduce the on-chain storage cost. In addition, we propose a novel path-based scheme to enhance the aggregation of non-results and accelerate the processing. 
We design the path-based ADS MPETree for generating candidate paths and aggregating non-resulting paths, and the aggregation algorithm PMatching for efficiently aggregating invalid partial results one path at a time. The results of extensive experiments on five real-world graphs demonstrate the efficiency of our proposed ADSs and aggregation algorithms.

Abstract:
With the rapid advancements in positioning technologies, the volume of spatio-temporal data has grown significantly. Analyzing the spatial and temporal characteristics of these data is imperative for uncovering underlying associations and deriving insights into natural and societal mechanisms. Clustering is a widely utilized technique for data analysis, which groups data with similar characteristics for further investigation. However, current clustering methodologies usually inadequately address temporal properties that are vital in numerous scenarios. Additionally, traditional spatio-temporal clustering approaches are constrained to standalone environments, which struggle to handle large-scale spatio-temporal datasets. To this end, we introduce DiST, the first distributed spatio-temporal clustering method, which simultaneously considers both temporal and spatial proximity. DiST comprises data partition, local clustering, and global merging stages, along with an auto-tuning framework for parameter optimization. DiST addresses key challenges, including the integration of temporal and spatial attributes, managing data duplication across distributed nodes, and selecting appropriate parameters for diverse data characteristics. Comparative experiments on two real-world datasets validate the performance and scalability of DiST, demonstrating its effectiveness in spatio-temporal data analysis.

Abstract:
Document-level relation extraction (RE) aims to determine the relations between entities scattered across different sentences through reading and reasoning. Existing methods use semantic segmentation to obtain global information among triples by analyzing entity-level matrices. However, complete document input may introduce certain interference, making it challenging to express the underlying relationships. To address this, we propose a novel approach introducing a low-entity redundancy feature map, achieved by removing certain entities. The proposed optimal path filtering (OPF) selects entity-related sentences using heuristic rules and formulates sentence selection as a set cover problem, solved via backtracking pruning. U-Net is then applied to obtain global features. Our experiment achieves state-of-the-art results on two common document-level RE datasets, Re-DocRED and CDR, outperforming previous methods.
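The set cover formulation mentioned above can be sketched as follows. This is a generic minimum set cover solved by backtracking with a best-so-far pruning bound, offered only as an illustration: the paper's heuristic rules for choosing entity-related sentences are not modeled, and the function name `min_sentence_cover` and the input layout (one set of entity names per sentence) are assumptions.

```python
def min_sentence_cover(sentences, entities):
    """Return the smallest list of sentence indices whose mentions cover
    `entities`, or None if some entity is mentioned in no sentence."""
    sentences = [set(s) for s in sentences]
    best = None

    def backtrack(chosen, uncovered):
        nonlocal best
        if not uncovered:
            if best is None or len(chosen) < len(best):
                best = list(chosen)
            return
        # Prune: at least one more sentence is needed, so this branch
        # cannot be strictly smaller than the best cover found so far.
        if best is not None and len(chosen) + 1 >= len(best):
            return
        # Branch on one uncovered entity: any valid cover must include
        # some sentence that mentions it.
        target = min(uncovered)
        for i, mentioned in enumerate(sentences):
            if target in mentioned:
                chosen.append(i)
                backtrack(chosen, uncovered - mentioned)
                chosen.pop()

    backtrack([], set(entities))
    return best

if __name__ == "__main__":
    doc = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}, {"A", "B", "C"}]
    print(min_sentence_cover(doc, {"A", "B", "C", "D"}))
```

The search is exponential in the worst case, which is acceptable for the small number of sentences in a single document but would need stronger bounds for larger inputs.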

Abstract:
Financial time series prediction is an important and challenging data mining task for quantitative investment. The inherent non-linearity, high noise, and susceptibility to various factors, such as macroeconomic conditions and market sentiment in the stock market, increase the difficulty of prediction. The financial industry mainly employs time series models or fundamental analysis methods for prediction. However, these methods fail to effectively capture the complex interrelationships between equities. In recent years, graph neural networks (GNNs), due to their powerful relational modeling capabilities, have been applied to stock prediction. However, with recent advances in computing power and the wide use of high-frequency trading techniques, existing graph-based methods still fall short in learning multi-granularity temporal relations, as they cannot effectively learn the patterns at different frequencies, e.g., minute-level, daily, and weekly. Therefore, in this paper, we propose a multi-granularity graph augmented learning framework for interrelated financial time series forecasting. We first construct a temporal return relationship graph with multi-granularity financial time series, including weekly, daily, and minute-level, to comprehensively capture the dynamic relations of equities, including both medium-term trends and short-term fluctuations. Then, to further augment the node relations, we devise an attentional graph augment module to improve the graph learning with fundamental data, which are jointly optimized in the prediction layer. We conduct extensive empirical studies on multiple datasets from both the Chinese and U.S. stock markets. The results demonstrate that our proposed model consistently outperforms existing baseline methods across four key financial metrics, namely ARR, ASR, CR, and IR, thereby validating its effectiveness and superiority.
The model has been applied and empirically tested in commercial-grade trading platforms, further demonstrating its efficiency and robustness in real-world trading environments.

Abstract:
Review summarization aims to provide a summary that covers the main aspect of the product review and reflects personal preference. Existing methods employ the historical reviews of the customer and the product to provide useful clues for the target summary generation. However, most of the existing methods model the historical reviews of customer and product indiscriminately. Since the historical customer reviews provide the personal information while the historical product reviews provide the commonly focused aspect of the product, these two types of heterogeneous information should be modeled separately. Moreover, the rating of a historical review can be seen as a high-level abstraction of the customer preference and the product, which has been ignored by most of the existing methods. In this paper, we propose the Heterogeneous Historical Review aware Review Summarization (HHRRS), which separately models the two types of historical reviews with the rating information by a graph reasoning module with a contrastive loss. We employ a multi-task paradigm that conducts review sentiment classification and summarization jointly. We also propose a novel Graph Retrieval Augmented Review Summarization (GRARS) to model the two types of heterogeneous information in a fine-grained manner. We conduct extensive experiments on four benchmark datasets, and demonstrate the superiority of HHRRS on both tasks.

Affiliations: State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Security Department, Alibaba Group, Hangzhou, China; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China; School of Computer Science and Technology, Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences, Beijing, China

Abstract:
Performing complex First-Order Logic (FOL) queries on knowledge graphs is crucial for advancing knowledge reasoning. Knowledge graphs encapsulate rich semantic interactions among entities, encompassing both explicit structural knowledge represented by triples $(e_1, r, e_2)$ and implicit relational knowledge through multi-hop paths $(e_1 \stackrel{r_1}{\rightarrow} \cdots e_3 \cdots \stackrel{r_2}{\rightarrow} e_2)$. Traditional models often focus solely on either triple-level or path-level knowledge, overlooking the benefits of integrating both to enhance logic query answering. This oversight leads to suboptimal representation learning and inefficient query reasoning. To overcome these challenges, we introduce a new Semantic-Aware representation learning model for Query-answering Embeddings (SAQE). Specifically, SAQE employs a joint learning approach that integrates triple-level and path-level knowledge semantics and captures both explicit and implicit contextual nuances within the knowledge graph, yielding more accurate and contextually relevant representations. To efficiently handle the large combinatorial search spaces in FOL reasoning, we propose a novel hierarchical reasoning optimization strategy based on a multi-hop tree, thus optimizing subqueries rooted at variable nodes in a divide-and-conquer manner. Theoretical analysis confirms that SAQE effectively supports various types of FOL reasoning and enhances generalization for query answering. Extensive experiments demonstrate that our model achieves state-of-the-art performance across several established datasets.

Abstract:
Recently, federated learning (FL) has become a prevalent algorithm to harvest data while preserving privacy. However, private information can still be compromised by local parameters during transmissions between local parties and the central server. To address this problem, local differential privacy (LDP) has been adopted. In the resulting scheme, known as federated LDP-SGD, each local device sends only perturbed parameters to the central server. However, due to the low model efficiency caused by overwhelming LDP noise, only a relaxed LDP privacy scheme, namely Gaussian mechanism, is explored in the federated LDP-SGD literature. The objective of this paper is to enable other LDP mechanisms (e.g., Laplace, Piecewise, Square Wave and Gaussian) in federated learning by enhancing their model efficiency. We first propose an analytical framework that generalizes federated LDP-SGD and derives its model efficiency. Serving as a benchmark, this framework can compare performances of different LDP mechanisms in federated learning. Based on this framework, we identify a new perspective to generally optimize federated LDP-SGD, namely, the vectorized perturbation strategy LDPVec. By only perturbing the direction of a gradient, LDPVec better preserves the descending direction of the gradient, which consequently leads to comprehensive efficiency improvements in terms of various LDP mechanisms.

Abstract:
Recent years have witnessed the trend of enhancing recommender systems with large language models (LLMs), namely, LLMRec. A common way is to fine-tune the LLMs with the instruction data transformed from user behaviors, stimulating the recommendation ability of LLMs. Similar to traditional recommender systems, integrating user data into LLMs raises privacy concerns. Users desire a tool to erase the impacts of their sensitive data from the trained models. To meet this user demand, LLMRec unlearning becomes pivotal to enable the removal of unusable data (e.g., historical behaviors) from established LLMRec models. However, existing methods mostly focus on partition strategies and approximate unlearning. These methods are not well-suited for the unique characteristics of LLMRec due to computational costs or incomplete removal. In this study, we propose the Adapter Partition and Aggregation (APA) framework for exact and efficient LLMRec unlearning while maintaining recommendation performance. APA achieves this by retraining PEFT adapters using data partitioning, constructing adapters for partitioned training data shards, and retraining only the affected adapters. To preserve recommendation performance and avoid significant inference costs, APA incorporates balanced and heterogeneous data partitioning, and parameter-level adapter aggregation with sample-adaptive adapter attention for each testing sample. Extensive experiments demonstrate the effectiveness and efficiency of our method.

Abstract:
Traditional recommendation methods, which typically focus on modeling a single user behavior (e.g., purchase), often face severe data sparsity issues. Multi-behavior recommendation methods offer a promising solution by leveraging user data from diverse behaviors. However, most existing approaches entangle multiple behavioral factors, learning holistic but imprecise representations that fail to capture specific user intents. To address this issue, we propose a multi-behavior method by modeling latent factors with an expert network (MBLFE). In our approach, we design a gating expert network, where the expert network models all latent factors within the entire recommendation scenario, with each expert specializing in a specific latent factor. The gating network dynamically selects the optimal combination of experts for each user, enabling a more accurate representation of user preferences. To ensure independence among experts and factor consistency of a particular expert, we incorporate self-supervised learning during the training process. Furthermore, we enrich embeddings with multi-behavior data to provide the expert network with more comprehensive collaborative information for factor extraction. Extensive experiments on three real-world datasets demonstrate that our method significantly outperforms state-of-the-art baselines, validating its effectiveness.

Abstract:
Multimodal news event detection aims to identify and categorize significant events across media platforms using multimodal data. Previous work was limited to a single platform and assumed complete multimodal data. In this paper, we explore a novel task of cross-platform multimodal news event detection to enhance model generalization for cross-platform scenarios. We propose a Self-Supervised Modality Complementation (SSMC) method to tackle the challenges of incomplete modalities and platform heterogeneity presented in this task. Specifically, a Missing Data Complementation (MDC) module is designed to overcome the limitations caused by incomplete modalities. It employs a separation mechanism that distinguishes between modality-specific and modality-shared features across all modalities, allowing for the augmentation of missing modalities with information extracted from common features. Meanwhile, a Multimodal Self-Learning (MSL) module addresses platform heterogeneity by extracting pseudo labels from the target platform’s multimodal views and incorporating a self-penalization mechanism to reduce reliance on low-confidence labels. Additionally, we collect a comprehensive cross-platform news event detection (CNED) dataset encompassing 37,711 multimodal samples from Twitter, Flickr, and online news media, covering 40 public news events verified by Wikipedia. Extensive experiments on the CNED dataset demonstrate the superior performance of our proposed method.

Abstract:
There is a growing interest in utilizing large language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and reasoning capabilities. In this scenario, tokenizing users and items becomes essential for ensuring seamless alignment of LLMs with recommendations. While studies have made progress in representing users and items using textual contents or latent representations, challenges remain in capturing high-order collaborative knowledge into discrete tokens compatible with LLMs and generalizing to unseen users/items. To address these challenges, we propose a novel framework called TokenRec, which introduces an effective ID tokenization strategy and an efficient retrieval paradigm for LLM-based recommendations. Our tokenization strategy involves quantizing the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving smooth incorporation of high-order collaborative knowledge and generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-K items for users, eliminating the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs, thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.

Abstract:
Although Graph Neural Networks (GNNs) have exhibited the powerful ability to gather graph-structured information from neighborhood nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention in the literature. Among the existing GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques to improve the robustness of GNNs. However, randomly dropping edges often results in bypassing critical edges. Consequently, the effectiveness of message passing is weakened. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor guiding the removal of edges, which can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor utilizes the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on eight graph benchmark datasets demonstrate that the proposed ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, demonstrating improved generalization and robustness.
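
For context, the random edge-dropping baseline that the abstract above contrasts with ADEdgeDrop can be sketched in a few lines. This is a minimal sketch of uniform random dropping only, not the adversarial predictor; the function name `drop_edges_randomly` and the edge-list representation are assumptions made for illustration.

```python
import random


def drop_edges_randomly(edges, drop_rate, seed=None):
    """Keep each edge independently with probability 1 - drop_rate.

    `edges` is a list of (u, v) pairs. Hypothetical helper that illustrates
    the random baseline, not ADEdgeDrop.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so drop_rate=0 keeps all edges
    # and drop_rate=1 removes all of them.
    return [edge for edge in edges if rng.random() >= drop_rate]


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
kept = drop_edges_randomly(edges, drop_rate=0.4, seed=7)
```

Because every edge is treated identically, nothing prevents a structurally critical edge from being removed, which is the weakness the abstract attributes to random dropping.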

Abstract:
Multi-modal emotion recognition (MER) integrates multi-modal signals to help computers comprehensively understand human emotions, which is a crucial technology in human-computer interactions. However, the amount of labeled multi-modal emotion data is small and limits MER performance due to its expensive manual annotations. Meanwhile, semi-supervised learning (SSL) methods improving MER models with enormous unlabeled data suffer from confirmation bias, resulting in biased data distribution. To tackle these challenges, this paper proposes a cyclic data distillation semi-supervised learning (CDD-SSL) for MER tasks. CDD-SSL leverages multiple pre-trained unimodal teacher models and confidence-boosting pseudo-labelling (CBPL) to boost the confidence of multi-modal ensemble outputs and distill reliable and class-representative data from numerous unlabeled data. It then utilizes reliable and less-biased data to train a multi-modal student model and provides feedback to update all unimodal teacher models. CDD-SSL is a cyclic teacher-student framework with a feedback mechanism that gradually mitigates confirmation bias and obtains an effective MER model. Experimental results on four benchmark datasets demonstrate that CDD-SSL achieves superior performance over both the semi-supervised methods and the state-of-the-art fully-supervised models in MER tasks.

Abstract:
Graph anomaly detection (GAD), which aims to identify unusual graph instances (e.g., nodes, edges, subgraphs, or graphs), has attracted increasing attention in recent years due to its significance in a wide range of applications. Deep learning approaches, graph neural networks (GNNs) in particular, have been emerging as a promising paradigm for GAD, owing to their strong capability in capturing complex structure and/or node attributes in graph data. Considering the large number of methods proposed for GNN-based GAD, it is of paramount importance to summarize the methodologies and findings in the existing GAD studies, so that we can pinpoint effective model designs for tackling open GAD problems. To this end, in this work we aim to present a comprehensive review of deep learning approaches for GAD. Existing GAD surveys are focused on task-specific discussions, making it difficult to understand the technical insights of existing methods and their limitations in addressing some unique challenges in GAD. To fill this gap, we first discuss the problem complexities and their resulting challenges in GAD, and then provide a systematic review of current deep GAD methods from three novel perspectives of methodology, including GNN backbone design, proxy task design for GAD, and graph anomaly measures. To deepen the discussions, we further propose a taxonomy of 13 fine-grained method categories under these three perspectives to provide more in-depth insights into the model designs and their capabilities. To facilitate the experiments and validation of the GAD methods, we also summarize a collection of widely-used datasets for GAD and empirical performance comparison on these datasets. We further discuss multiple important open research problems in GAD to inspire more future high-quality research in this area.

Abstract:
Information diffusion prediction is a crucial task for comprehending the dissemination process of information. Although this problem has received significant attention recently, most state-of-the-art methods primarily focus on the modelling of information cascades, while neglecting the implicit social relations between users in the social network and failing to adequately model the interrelations between the user social network and information cascades. To tackle the aforementioned issues, in this work, we propose a Dual-State Hypergraph Contrastive Learning model (DSHCL). Specifically, we first propose to construct a social hypergraph based on the social network to capture the implicit social relations. Then, for capturing the cascade level correlations among users, we generate the dual-state (i.e., static and dynamic) user representations from the user social hypergraph and information cascades. Finally, we exploit contrastive learning to model the interplay between the social network and information cascades by discriminating the dual-state representations generated from them. We conduct an empirical assessment of DSHCL across four publicly available datasets, and the findings underscore DSHCL’s superiority and the efficacy of its components.

Abstract:
Federated Recommender Systems (FedRecs) have evolved as a privacy-preserving paradigm that facilitates distributed training of personalized recommenders without sharing user data. However, FedRecs are known to be susceptible to poisoning attacks by malicious users, who aim at promoting or demoting the exposure of target items through sending malicious updates to the central server. Meanwhile, the distribution of recommendation performance among users, referred to as performance fairness, could be degraded, which is one of the major concerns of trustworthy FedRecs. This paper proposes a novel attack method, Generative Adversarial Network (GAN)-Based Collusive Poisoning Attack (GCPA). To implement GCPA, we create a GAN-based fake user synthesis strategy that mimics behaviors and preferences of real users to generate fake users. Furthermore, we design a collusion-based fairness attack strategy that changes the exposure of items to undermine fairness. To maximize the impact on the distribution of recommendation performance, we develop an adaptive clustering algorithm to identify a subset of items that significantly contribute to the uneven distribution of recommendation performance through collusion. Extensive experiments on two datasets show that GCPA effectively increases the exposure of target items while undermining the performance fairness of FedRecs. In addition, GCPA also has strong resistance to four defense methods. Meanwhile, we provide a heuristic defense method based on gradient direction and similarity against collusive poisoning attacks on FedRecs.

Abstract:
Knowledge graph reasoning (KGR) seeks to infer new factual triples from existing knowledge graphs (KGs). Recent methods have unified transductive and inductive reasoning by learning entity-independent representations through local neighboring structures. Nevertheless, these methods often encounter inefficiencies and rely on elaborate local structures without directly modeling the correlations between queries and various structures within KGs. In this paper, we propose a novel framework MulGA, which is designed to learn multi-granularity and adaptive embeddings for KGR. MulGA first employs connectivity subgraphs to uniformly and hierarchically represent query-related structures within KGs, such as triples, relation paths, and subgraphs, establishing the hierarchical relationship between structures at different granularities. Subsequently, we design a graph neural network-based multi-granularity embedding propagation module that unifies the message-passing process with the connectivity subgraph construction. This module obtains the query-related structural representations by all entities at multiple granularities, eliminating the need to explicitly extract any graph elements, thus addressing inefficiency issues. Moreover, we develop a structure-aware adaptive merging mechanism that assigns weights to different granularities and integrates them into cohesive subgraph-granularity representations for reasoning. Systematic experiments on 15 benchmarks show that MulGA achieves a significant improvement in MRR, by an average of 0.5%-1.1% on transductive tasks and 0.2%-7.3% on inductive tasks, over existing state-of-the-art methods. Moreover, MulGA exhibits faster convergence, fewer parameters, and competitive inference time, and alleviates the over-smoothing prevalent in graph neural networks.

Abstract:
With the widespread adoption of mobile internet and GPS-enabled smartphones, spatial crowdsourcing has emerged as a prevalent computing paradigm. In this paradigm, the human-machine collaborative task assignment mode, which empowers workers to select tasks based on their preferences, has become a preferred approach for various applications such as ridesharing and takeaways. Generally, the platform continuously presents a set of top-$k$ tasks to individual workers by taking into account factors like travel distance, and allows workers to select tasks from this set. This decision approach is beneficial to both platform and workers. However, it still faces significant challenges in large-scale dynamic results maintenance, which incurs considerable computational costs. In this paper, we propose a novel solution framework with an adaptive two-layer cache structure to efficiently address the problem of updating dynamic top-$k$ results. Additionally, we propose two effective learning-based methods which greatly improve the efficiency of result maintenance. Furthermore, we present a novel approach to identify and process caches that trigger intensive updates within a tight time limit, greatly reducing the peak demand for updating caches. Finally, extensive experimental results on real datasets demonstrate that our proposed algorithms exhibit strong performance across various parameter configurations.
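
To illustrate the per-worker top-$k$ computation whose maintenance cost motivates the abstract above, the following minimal sketch recomputes the result from scratch using Euclidean travel distance as the only ranking factor. It is the naive baseline, not the two-layer cache structure proposed in the paper; `top_k_tasks`, the coordinate representation, and the sample data are assumptions.

```python
import heapq
import math


def top_k_tasks(worker_pos, tasks, k):
    """Return the ids of the k tasks closest to the worker.

    `tasks` maps a task id to an (x, y) position. Hypothetical helper that
    recomputes the ranking on every call (no caching).
    """
    return heapq.nsmallest(
        k, tasks, key=lambda task_id: math.dist(worker_pos, tasks[task_id])
    )


# Assumed toy data: three tasks and one worker at the origin.
tasks = {"t1": (0.0, 1.0), "t2": (5.0, 5.0), "t3": (1.0, 1.0)}
nearest = top_k_tasks((0.0, 0.0), tasks, k=2)
```

Each call scans every task, so repeating it for every worker whenever workers move or tasks change is what makes large-scale dynamic maintenance expensive.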

Abstract:
Higher-order graph clustering partitions graphs using frequently occurring subgraphs instead of edges, proving effective in community detection and knowledge discovery. Motif conductance, known for its strong interpretability, is a leading model. However, existing motif conductance algorithms are hindered by a two-stage reweighting framework that requires enumerating motif instances to generate an edge-weighted graph for partitioning. This framework has two major drawbacks: (1) It provides only a quadratic bound for three-vertex motifs, with no provable approximation guarantees for other motifs. (2) Enumerating motif instances is computationally prohibitive for large motifs or dense graphs due to combinatorial explosions. Besides, costly spectral clustering or local graph diffusion on the edge-weighted graph limits their scalability. In this paper, we propose a novel peeling-based clustering framework, PSMC, offering a motif-independent approximation ratio for any motif. Specifically, PSMC first defines a new locally computable vertex metric Motif Resident based on the given motif. Then, it iteratively deletes vertices with the smallest motif resident using efficient dynamic update techniques, outputting a locally optimal result with approximation guarantees. Besides, we introduce several powerful optimization techniques to further reduce computational costs. Empirical results on real-world and synthetic datasets showcase our proposed solutions’ superiority over ten competitors.

Abstract:
In a world where Machine Learning (ML) is increasingly used to make predictions about critical events, such as health outcomes, it is crucial to ensure decision-makers have access to explainable, consistent, and relevant predictive features. ML prediction relies on perfect data with similar distributions for testing and validation. These results are compared with those of humans, who use more noisy and limited data. Human predictions overcome those limitations by learning from abstractions. This paper addresses these issues by conducting experiments comparing traditional machine learning methods and a previously proposed method that uses data abstractions to learn predictive feature significance. The results indicate that the previously proposed descriptive ML approach maintains higher classification accuracy and ensures the stability of feature selection as data incompleteness increases, becoming valuable under limited data scenarios. It demonstrates the possibility of developing ML capable of automatic decision-making.

Abstract:
Time series forecasting plays a crucial role across numerous domains, driving rapid development in the field. With the advent of large models, time series foundation models (TSFMs) have exhibited great generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods are widely employed to enhance the performance of foundation models on unseen data across various domains, including Large Language Models (LLMs). To explore the integration of TSFMs with retrieval-augmented methods, we introduce TimeRAF, a Retrieval-Augmented Foundation model for zero-shot time series Forecasting. A learnable retriever is employed and trained in an end-to-end fashion to extract useful information from a curated time series knowledge base. Additionally, we propose an approach called Channel Prompting for knowledge integration. Augmented by the retrieved knowledge, our TimeRAF demonstrates significant improvement across various domains and datasets. Furthermore, TimeRAF can leverage specialized knowledge bases to meet diverse application requirements. Extensive ablation studies and visualizations are provided to validate the effectiveness of our approach.

Abstract:
The field of graph foundation models (GFMs) has seen a dramatic rise in interest in recent years. Their powerful generalization ability is believed to be endowed by self-supervised pre-training and downstream tuning techniques. There is a wide variety of knowledge patterns embedded in the graph data, such as node properties and clusters, which are crucial for learning generalized representations for GFMs. We present a comprehensive survey of self-supervised GFMs from a novel knowledge-based perspective. Our main contribution is a knowledge-based taxonomy that categorizes self-supervised graph models by the specific graph knowledge utilized: microscopic (nodes, links, etc.), mesoscopic (context, clusters, etc.), and macroscopic (global structure, manifolds, etc.). It covers a total of 9 knowledge categories and 300 references for self-supervised pre-training as well as various downstream tuning strategies. Such a knowledge-based taxonomy allows us to more clearly re-examine potential GFM architectures, including large language models (LLMs), as well as provide deeper insights for constructing future GFMs.

Abstract:
Information diffusion prediction is a vital component for a wide range of social applications, including viral marketing identification and precise recommendation. Prior methods focus on modeling contextual information from a single cascade, ignoring rich collaborative information behind historical interactions across various cascades and future data within the cascade. Leveraging such interactions can substantially enhance diffusion prediction performance but presents two major challenges: (1) user intents are usually entangled behind historical interactions; and (2) utilizing future data may introduce severe training-inference discrepancies. We present MIM, a novel information diffusion model merging multi-scale interactions for improving user intent learning and behavior retrieval. Specifically, we convert cascades and social relations into multi-channel hypergraphs, where each channel depicts a common fine-grained user intent behind historical interactions across cascades. By aggregating embeddings learned through multiple channels, we obtain comprehensive intent representations. Second, we decouple past- and future-level temporal influences within a cascade via a dual temporal network. Then we implement past-future knowledge transferring to enhance the knowledge learned from the dual network via hierarchical knowledge distillation. Extensive experiments conducted on four datasets demonstrate that MIM significantly outperforms various benchmarks.

Abstract:
Community structure refers to the “small groups” in the network. Detecting community structure in networks has significant application value. With the continuous expansion and complexity of the network, the global information of the network is often difficult to obtain. On the other hand, in some cases, we pay more attention to the local community where the given node is located. Local community detection methods detect local community structure by using local information from a given node. However, many local community detection methods encounter the problem of precision limitation. Therefore, in order to alleviate such problems, we propose the FG-based method in this paper. Based on the characteristics of complex networks, a folded subgraph method is designed to consider some similar nodes as single nodes, reducing the impact of noise in the network. Furthermore, based on the folded subgraph, FG-based method designs a three-stage local expansion strategy, in which nodes with different characteristics are added to the local community in each stage. We conduct experiments on datasets and find that the FG-based method can improve the recall and precision of local community structures.

Abstract:
The rapid growth of e-commerce has intensified the demand for efficient urban logistics. Electric Vehicles (EVs), with their eco-friendly and high-efficiency features, have emerged as a promising solution for improving urban logistics efficiency. However, due to their limited battery capacity, EVs often require recharging during operations, and improper charging decisions may lead to delivery delays, resulting in a loss of platform revenue. In this paper, we explore a novel EV Charging-Aware Task Assignment (ECTA) problem in urban logistics scenarios, where the objective is to maximize platform revenue by ensuring timely task completion while meeting the charging needs of EVs. To address this challenge, we present e-Charge, an efficient two-stage framework that enables real-time optimization of two continuous processes: task assignment and charging decision. For task assignment, which focuses on matching tasks to suitable EVs, we construct a hybrid weight model that incorporates charging penalties to calculate matching weights for EVs in both active and charging states, thus improving task assignment quality. Additionally, we implement an effective vehicle selection strategy to expedite the matching process, ensuring the efficiency of task assignment. For charging decision, which focuses on determining when and where EVs should be charged, we propose a multi-agent reinforcement learning (MARL) approach to dynamically select the charging timing for EVs. To further enhance decision-making quality, we devise a hierarchical communication graph that enables better collaboration between EVs and facilitates adaptive charging decisions. Finally, extensive experiments demonstrate that e-Charge significantly outperforms compared methods, achieving higher revenue and task completion ratio across a wide range of parameter settings.

Abstract:
Synthetic data is being widely used as a replacement or enhancement for real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions, thereby retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving missing data distribution in synthetic data.

Abstract:
Constrained decoding approaches aim to control the meaning or style of text generated by a Pre-trained Language Model (PLM) for various task-specific objectives at inference time. However, these methods often guide plausible continuations by greedily and explicitly selecting targets, which, while fulfilling the task requirements, may overlook the natural patterns of human language generation. In this work, we propose a novel decoding framework, Decider, which enables us to program high-level rules on how we might effectively complete tasks to control a PLM. Differing from previous works, our framework transforms the encouragement of concrete target words into the encouragement of all words that satisfy the high-level rules. Specifically, Decider is a dual system in which a PLM is equipped and controlled by a First-Order Logic (FOL) reasoner to express and evaluate the rules, along with a decision function that merges the outputs from both systems to guide the generation. Experiments on CommonGen and PersonaChat demonstrate that Decider can effectively follow given rules to guide a PLM in achieving generation tasks in a more human-like manner.

Abstract:
Community detection is a fundamental problem and has been extensively studied. With the abundance of information in real-world networks, the discovery of communities in attribute graphs is increasingly valuable. However, numerous previous models in attribute graphs neglect the fairness concept, which plays an important role in ensuring that graph analysis is not biased toward specific groups. In this paper, we propose a novel model, named proportional fair clique (PFC). Specifically, given an attribute graph $G=(V,E,A)$, an integer $k$ and a threshold $\lambda \in [0,1/|A|]$, a subgraph $S$ of $G$ is a PFC if (i) $S$ is a clique with size at least $k$ and (ii) $|S_{a_i}|/|S| \geq \lambda$ for each attribute $a_i$ in $G$, where $S_{a_i}$ is the node set in $S$ associated with attribute $a_i$. We show that the problem of enumerating all the maximal proportional fair cliques (MPFC) is NP-hard. A reasonable baseline algorithm is first presented by extending the Bron-Kerbosch framework. To scale for large networks, we propose several optimization strategies to accelerate the computation. Finally, comprehensive experiments are conducted over 6 graphs to demonstrate the efficiency and effectiveness of the proposed techniques and model.
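
The PFC definition in the abstract above translates directly into a membership check. The sketch below tests conditions (i) and (ii) for one candidate node set; it does not enumerate maximal PFCs and is not the paper's algorithm. The function name, the edge-list and attribute-dictionary representation, and the assumption that each node carries exactly one attribute are made for illustration.

```python
from itertools import combinations


def is_proportional_fair_clique(nodes, edges, attr, all_attrs, k, lam):
    """Check conditions (i) and (ii) of the PFC definition for `nodes`.

    nodes:     candidate node set S
    edges:     iterable of undirected (u, v) pairs of the graph G
    attr:      dict node -> attribute (assumes one attribute per node)
    all_attrs: the attribute set A of G
    k, lam:    size bound and proportion threshold
    """
    nodes = set(nodes)
    if not nodes or len(nodes) < k:
        return False

    # (i) S must be a clique: every pair of nodes is adjacent.
    edge_set = {frozenset(edge) for edge in edges}
    for u, v in combinations(nodes, 2):
        if frozenset((u, v)) not in edge_set:
            return False

    # (ii) every attribute of G must cover at least a lam fraction of S.
    for a in all_attrs:
        count = sum(1 for v in nodes if attr[v] == a)
        if count / len(nodes) < lam:
            return False
    return True


# Assumed toy graph: a 4-clique with two attributes.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
attr = {1: "a", 2: "a", 3: "b", 4: "b"}
balanced = is_proportional_fair_clique({1, 2, 3, 4}, edges, attr, {"a", "b"}, 3, 0.5)
skewed = is_proportional_fair_clique({1, 2, 3}, edges, attr, {"a", "b"}, 3, 0.5)
```

In the toy graph, the full clique has each attribute on half of its nodes and passes at $\lambda = 0.5$, while the three-node clique has attribute `b` on only one third of its nodes and fails.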

Affiliations: School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing, China; School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications, Beijing, China; School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, China; School of Economics and Management, University of Science and Technology Beijing, Beijing, China; School of Electrical Engineering and Computer Science, University of Queensland, Brisbane, QLD, Australia

Abstract:
Environmental, social, and governance (ESG) serves as a crucial indicator for evaluating firms in terms of sustainable development. However, the existing ESG evaluation systems suffer from limitations, such as narrow coverage, subjective bias, and lack of timeliness. Therefore, there is a pressing need to leverage machine learning methods to predict the ESG performance of firms using their publicly available data. Traditional machine learning models encounter the feature imbalance problem due to the heterogeneity in ESG-related features. Common approaches typically involve unfolding all features, thereby granting high-dimensional folding features greater exposure and accessibility to downstream models, which results in the neglect of low-dimensional features. To fill the research gap regarding fully using the heterogeneous features of enterprises to enhance AI-based ESG prediction performance, we propose the Feature Balancing Transformer (FeBT), a model based on autoencoders and Transformer blocks. FeBT incorporates a novel feature balancing technique that compresses and enhances high-dimensional features from imbalanced data into low-dimensional representations, thereby ensuring a more balanced impact of high-dimensional and low-dimensional features on the model’s performance in the downstream ESG forecasting module. Extensive experiments verified the superior performance of FeBT compared with state-of-the-art methods in real-world ESG-related datasets and evidenced that our feature balancing module provides significant insights from high-dimensional folding features.

Abstract:
Recently, the multi-behavior information on a specific domain has been successfully exploited by aggregating diverse user behaviors to solve the problems of cold start and data sparsity in recommendations. However, the user behavior information captured from multiple behaviors in a single domain is insufficient. Our study seeks to enhance user behavior prediction by leveraging both multi-behavior information and cross-domain information in a more effective manner. In order to explore the correlations and differences between different behaviors and different domains, we propose a novel competition framework consisting of intra-domain competition and inter-domain competition for knowledge learning. Specifically, for intra-domain, a behavior competition mechanism is designed to enable the model to mine users’ interests and behavior patterns effectively. For inter-domain, a domain competition mechanism is designed to perform knowledge transfer and knowledge fusion for overlapping users in different domains. Through the competition mechanisms, our proposed Graph Competitive Transfer Network (GCTN) achieves knowledge transfer between different domains and captures users’ behavior patterns in different contexts. The effectiveness of the GCTN and its competition mechanisms has been validated through sufficient experimental trials on Douban and Amazon datasets. Compared to baseline methods, GCTN has demonstrated a marked improvement in both AUC and F1 scores.

Abstract:
This study aims to address the challenges of financial price prediction in high-frequency trading (HFT) by introducing a novel continual learning framework based on factor predictors via graph neural networks. The model integrates multi-factor pricing theory with real-time market dynamics, effectively bypassing the limitations of conventional time series forecasting methods, which often lack financial theory guidance and ignore market correlations. We propose three heterogeneous tasks, including price gap regression, changepoint detection, and price moving average regression, to trace the short-, intermediate-, and long-term trend factors present in the data. We also account for the cross-sectional correlations inherent in the financial market, where prices of different assets show strong dynamic correlations. To accurately capture these dynamic relationships, we resort to a spatio-temporal graph neural network (STGNN) to enhance the predictive power of the model. Our model allows a continual learning strategy to simultaneously consider these tasks (factors). To tackle the catastrophic forgetting in continual learning while considering the heterogeneity of tasks, we propose to calculate parameter importance with mutual information between original observations and the extracted features. Empirical studies on the Chinese futures data and U.S. equity data demonstrate the superior performance of the proposed model compared to other state-of-the-art approaches.

Abstract:
Ontologies are widely used for representing domain knowledge and meta data, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, the logical reasoning that ontologies can directly support is quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representations for the knowledge of an ontology, i.e., ontology embedding, has been widely investigated. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies and formally defines ontology embedding as well as its property of faithfulness. Based on this, it systematically categorizes and analyses a relatively complete set of over 80 papers, according to the ontologies they aim at and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL and discusses the challenges and future directions.

Abstract:
While full-graph training is effective for graph learning, it typically demands substantial memory resources. Existing multi-GPU training frameworks struggle with scalability because they require retaining data for each layer within GPU memory. In this work, we present HongTu, a memory-efficient system that supports out-of-memory full-graph GNN training on GPUs. HongTu offloads vertex data to CPU memory and employs partition parallelism training that splits and assigns large graphs to multiple GPUs. To reduce runtime memory consumption with optimal performance, HongTu utilizes a hybrid solution combining recomputation, caching, and computation-reordering, enabling efficient layer-wise intermediate data management. To address the increased communication caused by duplicated neighbor access among partitions, HongTu employs a deduplicated communication framework that converts host-GPU transfers into more efficient inter/intra-GPU data access. Additionally, HongTu tackles the load-imbalance issues in out-of-memory full-graph training, featuring a multi-objective graph partition algorithm that balances memory consumption and data transfer and maximizes the effectiveness of communication deduplication. Experiments on a 4× A100 GPU server show that HongTu can effectively train billion-edge graphs while reducing host-GPU data communication by 25% to 71%. Compared to the full-graph GNN system running on 16 CPU nodes, HongTu achieves speedups ranging from 11.4× to 21.3×.

Abstract:
The widespread adoption of smartphones and Location-Based Social Networks has led to a massive influx of spatio-temporal data, creating unparalleled opportunities for enhancing Point-of-Interest (POI) recommendation systems. These advanced POI systems are crucial for enriching user experiences, enabling personalized interactions, and optimizing decision-making processes in the digital landscape. However, existing surveys tend to focus on traditional approaches and few of them delve into cutting-edge developments, emerging architectures, as well as security considerations in POI recommendations. To address this gap, our survey stands out by offering a comprehensive, up-to-date review of POI recommendation systems, covering advancements in models, architectures, and security aspects. We systematically examine the transition from traditional models to advanced techniques such as large language models. Additionally, we explore the architectural evolution from centralized to decentralized and federated learning systems, highlighting the improvements in scalability and privacy. Furthermore, we address the increasing importance of security, examining potential vulnerabilities and privacy-preserving approaches. Our taxonomy provides a structured overview of the current state of POI recommendation, while we also identify promising directions for future research in this rapidly advancing field.

Abstract:
A modern service model known as the “hub-oriented” model has emerged with the development of mobility services. This model allows users to request vehicles from multiple companies (agents) simultaneously through a unified entry (a “hub”). In contrast to conventional services, the “hub-oriented” model emphasizes pricing competition. To address this scenario, an agent should consider its competitors when developing its pricing strategy. In this paper, we introduce DRLPG, a mixed opponent-aware pricing method, which consists of two main components, the two-stage guarantor and the end-to-end deep reinforcement learning (DRL) module, as well as interaction mechanisms. In the guarantor, we design a prediction-decision framework. Specifically, we propose a new objective function for the spatiotemporal neural network in the prediction stage and utilize a traditional reinforcement learning method in the decision stage. In the end-to-end DRL framework, we explore the adoption of conventional DRL in the “hub-oriented” scenario. Finally, a meta-decider and an experience-sharing mechanism are proposed to combine both methods and leverage their advantages. We conduct extensive experiments on real data, and DRLPG achieves an average improvement of 99.9% and 61.1% in the peak and low peak periods, respectively. Our results demonstrate the effectiveness of our approach compared to the baseline.

Abstract:
Sarcasm thrives on popular social media platforms such as Twitter and Reddit, where users frequently employ it to convey emotions in an ironic or satirical manner. The ability to detect sarcasm plays a pivotal role in comprehending individuals’ true sentiments. To achieve a comprehensive grasp of sentence semantics, it is crucial to integrate external knowledge that can aid in deciphering entities and their intricate relationships within a sentence. Although some efforts have been made in this regard, their use of external knowledge is still relatively superficial. Specifically, knowledge-enhanced entity and relationship understanding still faces significant challenges. In this paper, we propose the Knowledge Enhanced Sentiment Dependency Graph Convolutional Network (KSDGCN) framework, which constructs a commonsense-augmented sentiment graph and a commonsense-replaced dependency graph for each text to explicitly capture the role of external knowledge for sarcasm detection. Furthermore, we validate the irrational relationships between co-occurring entity pairs within sentences and background knowledge by a signed attention mechanism. We conduct experiments on four benchmark datasets, and the results show that KSDGCN outperforms existing state-of-the-art methods and is highly interpretable.

Abstract:
Bipartite graphs are a powerful tool for modeling the interactions between two distinct groups. These bipartite relationships often feature small, recurring structural patterns called motifs, which are building blocks for community structure. One promising structure is the induced 6-cycle, which consists of three nodes on each node set forming a cycle where each node has exactly two edges. In this paper, we study the problem of counting and utilizing induced 6-cycles in large bipartite networks. We first consider two adaptations inspired by previous works for cycle counting in bipartite networks. Then, we introduce a new approach for node triplets which offers a systematic way to count the induced 6-cycles, used in BatchTripletJoin. Our experimental evaluation shows that BatchTripletJoin is significantly faster than the other algorithms while being scalable to large graph sizes and number of cores. On a network with 112M edges, BatchTripletJoin is able to finish the computation in 78 mins by using 52 threads. In addition, we provide a new way to identify anomalous node triplets by comparing and contrasting the butterfly and induced 6-cycle counts of the nodes. We showcase several case studies on real-world networks from Amazon Kindle ratings, Steam game reviews, and Yelp ratings.
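
The definition of an induced 6-cycle can be made concrete with a brute-force counter. This is only a reference sketch for tiny graphs, not BatchTripletJoin; the function name and input layout are assumptions. It relies on the fact that a bipartite subgraph on three plus three nodes in which every node has exactly two edges must be a single 6-cycle.

```python
from itertools import combinations


def count_induced_6_cycles(left_adj, right_nodes):
    """Count induced 6-cycles in a small bipartite graph by exhaustive search.

    left_adj:    dict mapping each left node to the set of its right neighbours
    right_nodes: iterable of all right nodes
    """
    count = 0
    for us in combinations(sorted(left_adj), 3):
        for vs in combinations(sorted(right_nodes), 3):
            vset = set(vs)
            # Every left node must see exactly two of the three right nodes.
            left_ok = all(len(left_adj[u] & vset) == 2 for u in us)
            # Every right node must see exactly two of the three left nodes.
            right_ok = all(
                sum(1 for u in us if v in left_adj[u]) == 2 for v in vs
            )
            if left_ok and right_ok:
                count += 1
    return count


# Hypothetical example: the cycle u1-v1-u2-v2-u3-v3-u1.
cycle = {"u1": {"v1", "v3"}, "u2": {"v1", "v2"}, "u3": {"v2", "v3"}}
print(count_induced_6_cycles(cycle, ["v1", "v2", "v3"]))
```

The exhaustive search examines every pair of triplets, which is why scalable counting on networks with many millions of edges needs the more systematic approach the abstract describes.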

Abstract:
Traditional machine-learning approaches face limitations when confronted with insufficient data. Transfer learning addresses this by leveraging knowledge from closely related domains. The key in transfer learning is to find a transferable feature representation to enhance cross-domain classification models. However, in some scenarios, some features correlated with samples in the source domain may not be relevant to those in the target. Causal inference enables us to uncover the underlying patterns and mechanisms within the data, mitigating the impact of confounding factors. Nevertheless, most existing causal inference algorithms have limitations when applied to high-dimensional datasets with nonlinear causal relationships. In this work, a new causal representation method based on a Graph autoencoder embedded AutoEncoder, named GeAE, is introduced to learn invariant representations across domains. The proposed approach employs a causal structure learning module, similar to a graph autoencoder, to account for nonlinear causal relationships present in the data. Moreover, the cross-entropy loss as well as the causal structure learning loss and the reconstruction loss are incorporated in the objective function designed in a united autoencoder. This method allows for the handling of high-dimensional data and can provide effective representations for cross-domain classification tasks. Experimental results on generated and real-world datasets demonstrate the effectiveness of GeAE compared with the state-of-the-art methods.

Abstract:
With the rapid development of the Internet and the burgeoning scale of social media, Social Event Classification (SEC) has garnered increasing attention. Existing studies of SEC focus on recognizing a fixed set of social events. However, in real-world scenarios, new social events continually emerge on social media, which suggests the necessity for a practical SEC model that can swiftly adapt to the evolving environment with incremental social events. Therefore, in this paper, we study a new yet crucial problem defined as Continual Social Event Classification (C-SEC), where new events continually emerge in the sequentially collected social data. Accordingly, we propose a novel Temporal Event Knowledge Network (TEKNet) to continually learn temporal event knowledge for C-SEC with temporally incremental events. First, we conduct present event knowledge learning to learn the classification of newly emerging events in the presently incoming data. Second, we design past event knowledge replay with self-knowledge distillation to consolidate the learned knowledge of past events and prevent catastrophic forgetting. Finally, we propose future event knowledge pretraining with a modality mixture mechanism to pretrain the classifiers for events that occur in the future. Comprehensive experiments on real-world social event datasets demonstrate the superiority of our proposed TEKNet for C-SEC.

Abstract:
Prerequisite-link Prediction (PLP) aims to discover the condition relations of a specific event or a concerned variable, which is a fundamental problem in a large number of fields, such as educational data mining. Current studies on PLP usually develop graph neural networks (GNNs) to learn the representations of pairs of nodes. However, these models fail to distinguish non-isomorphic graphs and integrate multiscale structures, leading to the insufficient expressive capability of GNNs. To this end, in this paper we propose k-dimensional Weisfeiler-Leman directed GNNs, dubbed k-WediGNNs, to recognize non-isomorphic graphs via the Weisfeiler-Leman algorithm. Furthermore, we integrate the multiscale structures of a directed graph into k-WediGNNs, dubbed multiscale k-WediGNNs, from the bidirected views of in-degree and out-degree. With the Siamese network, the proposed models are extended to address the problem of PLP. In addition, the expressive power is interpreted via theoretical proofs. The experiments were conducted on four publicly available datasets for concept prerequisite relation prediction (CPRP). The results show that the proposed models achieve better performance than the state-of-the-art approaches, where our multiscale k-WediGNN achieves a new benchmark in the task of CPRP.
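
To illustrate why plain neighbourhood aggregation can fail to separate non-isomorphic graphs, the sketch below implements the classical 1-dimensional Weisfeiler-Leman colour refinement on undirected graphs. It is not the k-dimensional directed variant proposed in the paper; the function name and the choice of three refinement rounds are assumptions.

```python
from collections import Counter


def wl_colour_histogram(adj, rounds=3):
    """Run 1-WL colour refinement and return the histogram of final colours.

    adj: dict mapping each node to an iterable of its neighbours (undirected).
    Colours are kept as nested tuples so that histograms of different graphs
    can be compared directly without a shared relabelling step.
    """
    colours = {v: 0 for v in adj}
    for _ in range(rounds):
        # New colour = own colour plus the sorted multiset of neighbour colours.
        colours = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colours.values())


# A 6-cycle and two disjoint triangles are not isomorphic, yet every node has
# degree 2, so 1-WL assigns all nodes the same colour in both graphs.
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_colour_histogram(six_cycle) == wl_colour_histogram(two_triangles))
```

Differing histograms prove two graphs are non-isomorphic, while equal histograms are inconclusive; higher-dimensional variants refine colours of node tuples to narrow that gap.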

Abstract:
How to effectively diagnose and mitigate database performance anomalies remains a significant concern for modern database systems. Manually identifying the root causes of the anomalies is a labor-intensive process and significantly relies on professional experience. Meanwhile, existing work on automatic database diagnosis mainly focuses on detecting anomalous performance metrics or system log. These solutions lack the power to pinpoint detailed issues such as bad queries or problematic operators, which are indispensable for most database troubleshooting processes. In this paper, we propose OpDiag, a diagnosis framework that attributes database performance anomalies to query operators. In this framework, we first construct models offline to represent the relationship between query operators, performance metrics, and anomalies. These models can capture query plan features and support ad-hoc queries and schemas. Then, through feature attribution on these models during online diagnosis, OpDiag can effectively identify critical anomalous metrics and further trace back to suspicious queries and operators. This can provide concrete guidance for subsequent steps in anomaly mitigation. We applied OpDiag to both synthetic benchmark and real industry cases from ZTE Corporation. Empirical studies prove that OpDiag can accurately localize anomalous queries and operators, thus reducing human efforts in diagnosing and mitigating database performance anomalies.

Abstract:
In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching. Additionally, low-quality texts with excessive noise or sparse key information are unlikely to align well with relevant queries. Recent studies mainly focus on improving the sentence embedding model or retrieval process. In this work, we introduce a novel text augmentation framework for dense retrieval. This framework transforms raw documents into information-dense text formats, which supplement the original texts to effectively address the aforementioned issues without modifying embedding or retrieval methodologies. Two text representations are generated via large language models (LLMs) zero-shot prompting: question-answer pairs and element-driven events. We term this approach QAEA-DR: unifying question-answer generation and event extraction in a text augmentation framework for dense retrieval. To further enhance the quality of generated texts, a scoring-based evaluation and regeneration mechanism is introduced in LLM prompting. Our QAEA-DR model has a positive impact on dense retrieval, supported by both theoretical analysis and empirical experiments.

Abstract:
As a promising strategy to achieve generalizable graph learning tasks, graph invariant learning emphasizes identifying invariant subgraphs for stable predictions on biased unknown distribution by selecting the important edges/nodes based on their contributions to the predictive tasks (i.e., subgraph predictivity). However, the existing approaches solely relying on subgraph predictivity face a challenge: the learned invariant subgraph often contains numerous spurious nodes and shows poor connectivity, undermining the generalization power of Graph Neural Networks (GNNs). To tackle this issue, we propose a Summary graph-induced Invariant Learning (SIL) model that adopts a summary graph to leverage both the subgraph connectivity and predictivity for learning strongly connected and accurate invariant subgraphs. Specifically, SIL first learns a summary graph containing multiple strongly connected supernodes while maintaining structure consistency with the original graph. Second, the learned summary graph is disentangled into an invariant supernode and spurious counterparts to eliminate the interference of highly predictive edges and nodes. Finally, SIL identifies a potential invariant subgraph from the invariant supernode to accomplish generalization tasks. Additionally, we provide a theoretical analysis of the summary graph learning mechanism, guaranteeing that the learned summary graph is consistent with the original graph. Experimental results validate the effectiveness of the SIL model.

Abstract:
In-context learning (ICL) empowers large pre-trained language models (PLMs) to predict outcomes for unseen inputs without parameter updates. However, the efficacy of ICL heavily relies on the choice of demonstration examples. Randomly selecting from the training set frequently leads to inconsistent performance. Addressing this challenge, this study takes a novel approach by focusing on training data valuation through causal inference. Specifically, we introduce the concept of average marginal effect (AME) to quantify the contribution of individual training samples to ICL performance, encompassing both its generalization and robustness. Drawing inspiration from multiple treatment effects and randomized experiments, we initially sample diverse training subsets to construct prompts and evaluate the ICL performance based on these prompts. Subsequently, we employ Elastic Net regression to collectively estimate the AME values for all training data, considering subset compositions and inference performance. Ultimately, we prioritize samples with the highest values to prompt the inference of the test data. Across various tasks and with seven PLMs ranging in size from 0.8B to 33B, our approach consistently achieves state-of-the-art performance. Particularly, it outperforms Vanilla ICL and the best-performing baseline by an average of 14.1% and 5.2%, respectively. Moreover, prioritizing the most valuable samples for prompting leads to a significant enhancement in performance stability and robustness across various learning scenarios. Impressively, the valuable samples exhibit transferability across diverse PLMs and generalize well to out-of-distribution tasks.

Abstract:
Federated learning (FL) is an emerging paradigm that enables multiple clients to collaboratively train a machine learning (ML) model without the need to exchange their raw data. However, it relies on a centralized authority to coordinate participants’ activities. This not only interrupts the entire training task in case of a single point of failure, but also lacks an effective regulatory mechanism to prevent malicious behavior. Although blockchain, with its decentralized architecture and data immutability, has significantly advanced the development of FL, it still struggles to withstand poisoning attacks and faces limitations in computational scalability. We propose Zkfhed, a verifiable and scalable FL system that overcomes the limitations of blockchain-based FL in poisoning attacks and computational scalability. First, we propose a two-stage audit scheme based on zero-knowledge proofs (ZKPs), which verifies that the training data are extracted from trusted organizations and that computations on the data exactly follow the specified training protocols. Second, we propose homomorphic encryption delegation learning (HEDL), based on fully homomorphic encryption (FHE). It is capable of outsourcing complex computing to external computing resources without sacrificing the client's data privacy. Finally, extensive experiments on real-world datasets demonstrate that Zkfhed can effectively identify malicious clients and is highly efficient and scalable in terms of online time and communication efficiency.

Abstract:
Conversational recommender systems (CRSs) provide personalised recommendations by strategically querying attributes matching users’ preferences. However, this process suffers from confounding effects of time and user attributes, as users’ preferences naturally evolve over time and differ among similar users due to their unique attributes. These confounding effects distort user behaviors’ causal drivers, challenging CRSs in learning users’ true preferences and generalizable patterns. Recently, causal inference has provided principled tools to clarify cause-effect relations in data, offering a promising way to address such confounding effects. In this context, we introduce Causal Conversational Recommender (CCR), which applies causal inference to model the causality between user behaviors and time/user attributes, enabling deeper understanding of user behaviors’ causal drivers. First, CCR employs stratification and matching to ensure that the attribute asked per round is independent of time and user attributes, mitigating their confounding effects. Following that, we apply the Average Treatment Effect (ATE) to quantify the unbiased causal impact of each unasked attribute on user preferences, identifying the attribute with the highest ATE per round as the causal-based attribute, i.e., causal driver of user behaviour. Finally, CCR iteratively refines user preferences through feedback on causal-based attributes. Extensive experiments verified CCR's robustness and personalization.
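
As a rough illustration of how stratification and the ATE fit together, the sketch below computes a textbook stratified difference-in-means estimate. It is not CCR's actual procedure; the record layout, the function name, and the choice to weight strata by size are assumptions.

```python
from collections import defaultdict


def stratified_ate(records):
    """Estimate the average treatment effect by stratification.

    records: iterable of (stratum, treated, outcome) triples, where `stratum`
    groups units with the same confounder values (for example a time bucket
    combined with a user-attribute bucket), `treated` is a bool, and
    `outcome` is a number.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in records:
        groups[stratum][bool(treated)].append(outcome)

    weighted_sum, total = 0.0, 0
    for arms in groups.values():
        treated, control = arms[True], arms[False]
        if not treated or not control:
            continue  # a stratum without both arms gives no comparison
        effect = sum(treated) / len(treated) - sum(control) / len(control)
        size = len(treated) + len(control)
        weighted_sum += size * effect  # weight each stratum by its size
        total += size
    if total == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted_sum / total
```

Comparing treated and control units only within the same stratum is what removes the influence of the stratifying variables from the estimate.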

Abstract:
Cross-modal retrieval is a promising technique for finding semantically similar instances in other modalities when a query instance is given from one modality. However, there still exist many challenges in reducing the heterogeneous modality gap by embedding label information into discrete hash codes effectively, solving the binary optimization when generating unified hash codes, and reducing the discrepancy of data distribution efficiently during common space learning. In order to overcome the above-mentioned challenges, we propose Collaboratively Semantic alignment and Metric learning for cross-modal Hashing (CSMH) in this paper. Specifically, by a kernelization operation, CSMH first extracts the non-linear data features for each modality, which are projected into a latent subspace to align both marginal and conditional distributions simultaneously. Then, a maximum mean discrepancy-based metric strategy is customized to mitigate the distribution discrepancies among features from different modalities. Finally, semantic information obtained from the label similarity matrix is further incorporated to embed the latent semantic structure into the discriminant subspace. Experimental results of CSMH and baseline methods on four widely-used datasets show that CSMH outperforms some state-of-the-art hashing baseline methods for cross-modal retrieval on efficiency and precision.
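
The maximum mean discrepancy (MMD) mentioned above measures how far apart two feature distributions are. The sketch below computes the standard biased estimate of the squared MMD with an RBF kernel; the kernel choice, the `gamma` parameter, and the function names are assumptions, and this is not CSMH's customized metric strategy.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples.

    xs, ys: non-empty lists of equal-length feature vectors, for example
    features of two modalities after projection into a common space.
    """
    k_xx = sum(rbf_kernel(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy
```

The estimate is zero when both samples are identical and grows as the samples drift apart, which is why it can serve as a penalty on distribution discrepancy.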

Abstract:
Next POI recommendation, which has attracted much attention recently, is a complex problem due to the sparsity of check-in data and numerous sequential patterns. Recent methods based on sequential modeling have shown promising applicability for this task. However, most existing next POI recommendation studies only model users’ preferences based on their own sequences and ignore the influence of partners who visit POIs with the target user and may change with time. Inspired by dynamic graph neural networks, we propose a Group-aware Dynamic Graph Representation Learning (GDGRL) method for next POI recommendation. GDGRL connects different user sequences and specific partners via a dynamic graph structure, which contains interactions between users and POIs as well as the influence of partners. The users’ dynamic preferences are learned from a group-aware dynamic graph and a context-aware dynamic graph through dynamic graph neural networks. Finally, the next POI recommendation task is transformed into link prediction between a user node and a POI node in the dynamic graph. Extensive experiments on two real-world datasets show that GDGRL outperforms several state-of-the-art approaches.

Abstract:
Time series clustering poses a significant challenge with diverse applications across domains. A prominent drawback of existing solutions lies in their limited interpretability, often confined to presenting users with centroids. In addressing this gap, our work presents $k$-Graph, an unsupervised method explicitly crafted to augment interpretability in time series clustering. Leveraging a graph representation of time series subsequences, $k$-Graph constructs multiple graph representations based on different subsequence lengths. This feature accommodates variable-length time series without requiring users to predetermine subsequence lengths. Our experimental results reveal that $k$-Graph outperforms current state-of-the-art time series clustering algorithms in accuracy, while providing users with meaningful explanations and interpretations of the clustering outcomes.

Abstract:
As the application of knowledge graphs becomes increasingly widespread, the issue of knowledge graph incompleteness has garnered significant attention. As a classical type of non-Euclidean spatial data, knowledge graphs possess various complex structural types. However, most current knowledge graph completion models are developed within a single space, which makes it challenging to capture the inherent knowledge information embedded in the entire knowledge graph. This limitation hinders the representation learning capability of the models. To address this issue, this paper focuses on how to better extend the representation learning from a single space to Riemannian manifolds, which are capable of representing more complex structures. We propose a new knowledge graph completion model called MRME-KGC, based on multi-view Riemannian manifolds fusion. Specifically, MRME-KGC simultaneously considers the fusion of four views: two hyperbolic Riemannian spaces with negative curvature, a Euclidean Riemannian space with zero curvature, and a spherical Riemannian space with positive curvature to enhance knowledge graph modeling. Additionally, this paper proposes a contrastive learning method for Riemannian spaces to mitigate the noise and representation issues arising from multi-view Riemannian manifolds fusion. This paper presents extensive experiments on MRME-KGC across multiple datasets. The results consistently demonstrate that MRME-KGC significantly outperforms current state-of-the-art models, achieving highly competitive performance even with low-dimensional embeddings.
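
The three kinds of spaces named above differ in how distance is measured. The sketch below gives the standard geodesic distances for the Euclidean space, the unit sphere (curvature +1), and the Poincaré ball model of hyperbolic space (curvature -1). The unit curvatures and the function names are assumptions; the paper's own parameterization and fusion scheme are not shown in the abstract.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance in flat (zero-curvature) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def spherical_distance(u, v):
    """Great-circle distance between two unit vectors on the unit sphere."""
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball; both points need norm < 1."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq)))


# The same pair of points is farther apart in the hyperbolic view.
print(euclidean_distance((0.0, 0.0), (0.5, 0.0)))
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
```

Hyperbolic distance grows rapidly near the boundary of the ball, which suits tree-like hierarchies, while the sphere is bounded and suits cyclic structure.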

Abstract:
Incomplete multiview clustering (IMVC) optimally integrates complementary information within incomplete multiview data to improve clustering performance. Several one-step graph-based methods show great potential for IMVC. However, the low-rank structures of similarity graphs are neglected at the initialization stage of similarity graph construction. Moreover, further investigation into complementary information integration across incomplete multiple views is needed, particularly when considering the low-rank structures implied in high-dimensional multiview data. In this paper, we present one-step adaptive graph learning (OAGL) that adaptively performs spectral embedding fusion to achieve clustering assignments at the clustering indicator level. We first initialize affinity matrices corresponding to incomplete multiple views using sparse representation under two constraints, i.e., the sparsity constraint on each affinity matrix corresponding to an incomplete view and the degree matrix of the affinity matrix approximating an identity matrix. This approach promotes exploring complementary information across incomplete multiple views. Subsequently, we perform an alignment of the spectral block-diagonal matrices among incomplete multiple views using low-rank tensor learning theory. This facilitates consistency information exploration across incomplete multiple views. Furthermore, we present an effective alternating iterative algorithm to solve the resulting optimization problem. Extensive experiments on benchmark datasets demonstrate that the proposed OAGL method outperforms several state-of-the-art approaches.

Abstract:
Approximate membership query data structures (i.e., filters) have ubiquitous applications in databases and data mining. Cuckoo filters are emerging as the alternative to Bloom filters because they support deletions and usually have higher operation throughput and space efficiency. However, their designs are confined to a single-threaded execution paradigm and consequently cannot fully exploit the parallel processing capabilities of modern hardware. This paper presents PipeFilter, a faster and more space-efficient filter that harnesses pipeline parallelism for superior performance. PipeFilter re-architects the Cuckoo filter by partitioning its data structure into several sub-filters, each providing a candidate position for every item. This allows the filter operations, including insertion, lookup, and deletion, to be naturally distributed across several pipeline stages, each overseeing one of the sub-filters, which can further be implemented through multi-threaded execution or pipeline stages of programmable hardware to achieve significantly higher throughput. Meanwhile, PipeFilter excels in single-threaded execution thanks to a combination of unique design features, including block design, path prophet, round robin, and SIMD optimization, such that it outperforms state-of-the-art filters even when running on a single core. PipeFilter also has a competitive advantage in space utilization because it permits each item to explore more candidate positions. We implement and optimize PipeFilter on four platforms (single-core CPU, multi-core CPU, FPGA, and P4 ASIC). Experimental results demonstrate that PipeFilter surpasses all baseline methods on all four platforms. When running on a single core, it shows a 15% to 57% improvement in operation throughput and a high load factor exceeding 99%. With parallel processing on the other platforms, PipeFilter achieves 7× to 800× higher throughput than single-threaded execution.
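The partitioning idea in the abstract above can be illustrated with a minimal sketch. It assumes a fixed number of sub-filters, one slot per candidate position, and 16-bit fingerprints; the class name `PartitionedFilter` and all parameters are hypothetical. It deliberately omits cuckoo-style eviction as well as the block design, path prophet, round robin, and SIMD features the paper names, so it is not the PipeFilter implementation.

```python
import hashlib


class PartitionedFilter:
    """Toy filter split into sub-filters; each item has one slot per sub-filter."""

    def __init__(self, num_subfilters=4, slots_per_subfilter=64):
        self.k = num_subfilters
        self.m = slots_per_subfilter
        # One table per sub-filter (one per pipeline stage in a pipelined design).
        self.tables = [[None] * self.m for _ in range(self.k)]

    def _digest(self, item, salt):
        data = f"{salt}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _fingerprint(self, item):
        # Assumption: 16-bit fingerprints; None marks an empty slot.
        return self._digest(item, "fp") & 0xFFFF

    def _position(self, item, stage):
        # An independent hash per stage gives one candidate slot per sub-filter.
        return self._digest(item, f"pos{stage}") % self.m

    def insert(self, item):
        fp = self._fingerprint(item)
        for stage in range(self.k):
            pos = self._position(item, stage)
            if self.tables[stage][pos] is None:
                self.tables[stage][pos] = fp
                return True
        return False  # all candidate slots taken (no eviction in this sketch)

    def lookup(self, item):
        fp = self._fingerprint(item)
        return any(
            self.tables[stage][self._position(item, stage)] == fp
            for stage in range(self.k)
        )

    def delete(self, item):
        fp = self._fingerprint(item)
        for stage in range(self.k):
            pos = self._position(item, stage)
            if self.tables[stage][pos] == fp:
                self.tables[stage][pos] = None
                return True
        return False
```

Because each operation touches the sub-filters one after another, each loop iteration could in principle be assigned to its own thread or hardware stage, which is the property the abstract exploits.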

Abstract:
Asynchronous pipeline model parallelism with a “1F1B” (one forward, one backward) schedule generates little bubble overhead and consistently provides high throughput. However, the “1F1B” schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a. PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to approximately ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass of the “1F1B” schedule. Concretely, we first construct the weight prediction scheme based on the update rule of the optimizer used to train the deep neural network models. Then, throughout the “1F1B” pipeline training, each mini-batch is required to execute weight prediction and subsequently uses the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the “1F1B” schedule and generates high throughput, and 2) can ensure effective parameter learning regardless of the type of optimizer used. We conducted extensive experimental evaluations using nine different deep-learning models to verify the effectiveness of our proposal. The experimental results demonstrate that PipeOptim outperforms five other popular pipeline approaches: GPipe, PipeDream, PipeDream-2BW, SpecTrain, and XPipe.
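As a rough illustration of optimizer-dependent weight prediction, the sketch below assumes SGD with momentum and assumes that the most recent update direction stays constant over the next few steps. The function names and the `steps_ahead` parameter are hypothetical, and the exact prediction formulas used by PipeOptim are not given in the abstract.

```python
def sgd_momentum_step(weights, velocity, grads, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update on plain Python lists."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def predict_weights(weights, velocity, steps_ahead, lr=0.1):
    """Extrapolate weights `steps_ahead` updates into the future.

    Assumption: each of the next updates equals -lr * velocity, so the
    prediction is a straight-line extrapolation of the current update.
    """
    return [w - lr * steps_ahead * v for w, v in zip(weights, velocity)]
```

In a pipeline stage, `steps_ahead` would correspond to the number of updates expected to land before the mini-batch's gradients are applied, and the forward pass would run on the predicted weights instead of the stale ones.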

Abstract:
Probabilistic forecasting of multivariate time series is essential for various downstream tasks. Most existing approaches rely on the sequences being uniformly spaced and aligned across all variables. However, real-world multivariate time series often suffer from temporal irregularities, including nonuniform intervals and misaligned variables, which pose significant challenges for accurate forecasting. To address these challenges, we propose an end-to-end framework that models temporal irregularities while capturing the joint distribution of variables at arbitrary continuous-time points. Specifically, we introduce a dynamic conditional continuous normalizing flow to model data distributions in a non-parametric manner, accommodating the complex, non-Gaussian characteristics commonly found in real-world datasets. Then, by leveraging a carefully factorized log-likelihood objective, our approach captures both temporal and cross-sectional dependencies efficiently. Extensive experiments on a range of real-world datasets demonstrate the superiority and adaptability of our method compared to existing approaches.

Abstract:
Multi-view graph clustering (MVGC) explores pairwise correlations among all instances and comprehensively aggregates information from diverse sources with an optimal graph structure. One major issue of practical MVGC is its high time and space complexity, which prevents it from being applied to large-scale applications. As a promising solution for large-scale problems, the anchor-based strategy identifies a small set of key landmarks that serve as replacements for the entire dataset. Despite its efficiency, anchors chosen across views may be semantically unaligned, in contrast to the naturally aligned full-sample setting, which may lead to inappropriate graph fusion later on. Limited attention has been paid to this Multi-View Anchor-Unaligned Problem (MV-AUP) in the existing literature. In this paper, we first revisit existing multi-view anchor graph clustering frameworks and present the MV-AUP phenomenon. Then, we propose a novel Multi-view Corresponding Anchor Graph Alignment Fusion framework (MV-CAGAF), which solves MV-AUP with structural representation matching in multi-dimensional spaces. Further, we theoretically prove that our proposed structural matching approach can be regarded as minimizing the earth mover's distance (EMD) between the two relative anchor distributions. Based on this, we design a multi-view anchor graph fusion paradigm with correspondence alignment, which inherits linear sample complexity for scalable cross-view clustering. Our proposed MV-CAGAF achieves significant improvements with the help of the novel fusion framework on comprehensive benchmark datasets. Most importantly, the experimental results on both simulated and real-world datasets demonstrate the importance of cross-view alignment for large-scale multi-view clustering.

Abstract:
Although a variety of models have been proposed for urban spatio-temporal forecasting, most existing forecasting models are developed manually for specific tasks. By investigating the correlation between multi-order derivative and spatio-temporal data, we propose a generic yet simple plug-in structure, named TaylorS, to improve the performance and generalization of existing forecasting models. The TaylorS converts the non-linear regression problem into a multi-order non-linear approximation problem by plugging a Taylor expansion into the forecasting task. To achieve this, we design a two-step training framework, including a training step and an adjusting step. During training, we train a given forecasting model as a base model to be equipped with prior knowledge. During adjusting, we fine-tune the base model while plugging an adjustment model into the base model. The adjustment model, as a multi-order expansion, takes the multi-order derivative of data to evaluate data uncertainty for further forecasting approximation and adjustment. Extensive experimental results demonstrate that the proposed TaylorS framework can consistently improve the performance of existing state-of-the-art methods and generalize these methods to different forecasting tasks.

Abstract:
Debt collection is utilized for risk control after credit card delinquency. The existing rule-based method tends to be myopic and non-adaptive due to the delayed feedback. Reinforcement learning (RL) has an inherent advantage in dealing with such tasks and can learn policies end-to-end. However, employing RL here remains difficult because the interaction processes differ from those of standard RL and because of the notorious problem of optimistic estimations in the offline setting. To tackle these challenges, we first propose an Alternating Q-Learning (AQL) framework to adapt debt collection processes to comparable procedures in RL. Based on AQL, we further develop Adversarial Conservative Alternating Q-Learning (ACAQL) to address the issue of overoptimistic estimations. Specifically, adversarial conservative value regularization is proposed to balance optimism and conservatism on Q-values of out-of-distribution actions. Furthermore, ACAQL utilizes counterfactual action stitching to mitigate the overestimation by enhancing behavior data. Finally, we evaluate ACAQL on a real-world dataset created from Bank of Shanghai. Offline experimental results show that our approach outperforms state-of-the-art methods and effectively alleviates the optimistic estimation issue. Moreover, we conduct online A/B tests at the bank, and ACAQL achieves at least a 6% improvement in the debt recovery rate, which yields tangible economic benefits.

Abstract:
In the evolving field of urban development, precise traffic prediction is essential for optimizing traffic and mitigating congestion. While traditional graph learning-based models effectively exploit complex spatial-temporal correlations, their reliance on trivially generated graph structures or deeply intertwined adjacency learning without supervised loss significantly impedes their efficiency. This paper presents Contrastive Learning of spatial-tEmporal trAffic data Representations (CLEAR) framework, a comprehensive approach to spatial-temporal traffic data representation learning aimed at enhancing the accuracy of traffic predictions. Employing self-supervised contrastive learning, CLEAR strategically extracts discriminative embeddings from both traffic time-series and graph-structured data. The framework applies weak and strong data augmentations to facilitate subsequent exploitations of intrinsic spatial-temporal correlations that are critical for accurate prediction. Additionally, CLEAR incorporates advanced representation learning models that transmute these dynamics into compact, semantic-rich embeddings, thereby elevating downstream models’ prediction accuracy. By integrating with existing traffic predictors, CLEAR boosts predicting performance and accelerates the training process by effectively decoupling adjacency learning from correlation learning. Comprehensive experiments validate that CLEAR can robustly enhance the capabilities of existing graph learning-based traffic predictors and provide superior traffic predictions with a straightforward representation decoder. This investigation highlights the potential of contrastive representation learning in developing robust traffic data representations for traffic prediction.

Abstract:
Knowledge Tracing (KT) predicts future performance by modeling students’ historical interactions, and understanding students’ affective states can enhance the effectiveness of KT, thereby improving the quality of education. Although traditional KT values students’ cognition and learning behaviors, efficient evaluation of students’ affective states and their application in KT still require further exploration due to the non-affect-oriented nature of the data and budget constraints. To address this issue, we propose a computation-driven approach, Dynamic Affect Simulation Knowledge Tracing (DASKT), to explore the impact of various student affective states (such as frustration, concentration, boredom, and confusion) on their knowledge states. In this model, we first extract affective factors from students’ non-affect-oriented behavioral data, then use clustering and spatiotemporal sequence modeling to accurately simulate students’ dynamic affect changes when dealing with different problems. Subsequently, we incorporate affect with time-series analysis to improve the model's ability to infer knowledge states over time and space. Extensive experimental results on two public real-world educational datasets show that DASKT can achieve more reasonable knowledge states under the effect of students’ affective states. Moreover, DASKT outperforms the most advanced KT methods in predicting student performance. Our research highlights a promising avenue for future KT studies, focusing on achieving high interpretability and accuracy.

Abstract:
Proliferation of fake news has become a critical issue in today's information-driven society. Our study includes external knowledge from Wikidata which allows the model to cross-reference factual claims with established knowledge. This approach deviates from the reliance on social information to detect fake news that many state-of-the-art (SOTA) fact-checking models adopt. This paper introduces EA^2N, an Evidence-based AMR (abstract meaning representation) Attention Network for Fake News Detection. EA^2N utilizes the proposed Evidence-based Abstract Meaning Representation (WikiAMR) which incorporates knowledge using a proposed evidence-linking algorithm, pushing the boundaries of fake news detection. The proposed framework encompasses a combination of a novel language encoder and a graph encoder to detect fake news. While the language encoder effectively combines transformer-encoded textual features with affective lexical features, the graph encoder encodes semantic relations with evidence through external knowledge, referred to as WikiAMR graph. A path-aware graph learning module is designed to capture crucial semantic relationships among entities over evidence. Extensive experiments support our model's superior performance, surpassing SOTA methodologies with a difference of 2-3% in F1-score and accuracy for Politifact and Gossipcop datasets. The improvement due to the introduction of WikiAMR is found to be statistically significant with t-value less than 0.01.

Abstract:
Model-based reinforcement learning (RL) aims to learn the underlying dynamics of a given environment. The success of most existing works is built on the critical assumption that the dynamics are fixed, which is unrealistic in many open-world scenarios, such as drone delivery and online chatting, where agents may need to deal with environments with unpredictably changing dynamics (hereafter, real non-stationary environments). Therefore, learning changing dynamics in a real non-stationary environment offers both significant benefits and challenges. This paper proposes a new model-based reinforcement learning algorithm that proactively and dynamically detects possible changes and Learns these Latent and Changing Dynamics (LLCD) in a latent Markovian space for real non-stationary environments. To ensure the Markovian property of the RL model and improve computational efficiency, we employ a latent space model to learn the environment’s transition dynamics. Furthermore, we perform online change detection in the latent space to promptly identify change points in non-stationary environments. Then, we utilize the detected information to help the agent adapt to new conditions. Experiments indicate that, among other benefits, the proposed algorithm adapts to environmental changes most rapidly and accumulates rewards accordingly. This work has a strong potential to enhance environmentally suitable model-based reinforcement learning capabilities.

Abstract:
Analyzing and comprehending check-in sequences is crucial for various applications in smart cities. However, publicly available check-in datasets are often limited in scale due to privacy concerns. This poses a significant obstacle to academic research and downstream applications. Thus, it is urgent to generate realistic check-in datasets. The denoising diffusion probabilistic model (DDPM) as one of the most capable generation methods is a good choice to achieve this goal. However, generating check-in sequences using DDPM is not an easy feat. The difficulties lie in handling check-in sequences of variable lengths and capturing the correlation from check-in sequences’ distinct characteristics. This paper addresses the challenges by proposing a Spatio-Temporal Contrastive Diffusion Model (STCDM). This model introduces a novel spatio-temporal lossless encoding method that effectively encodes check-in sequences into a suitable format with equal length. Furthermore, we capture the spatio-temporal correlations with two disentangled diffusion modules to reduce the impact of the difference between spatial and temporal characteristics. Finally, we incorporate contrastive learning to enhance the relationship between diffusion modules. We generate four realistic datasets in different scenarios using STCDM and design four metrics for comparison. Experiments demonstrate that our generated datasets are more realistic and free of privacy leakage.

Abstract:
Change point detection is crucial for identifying state transitions and anomalies in dynamic systems, with applications in network security, health care, and social network analysis. Dynamic systems are represented by dynamic graphs with spatial and temporal dimensions. As objects and their relations in a dynamic graph change over time, detecting these changes is essential. Numerous methods for change point detection in dynamic graphs have been developed, but no systematic review exists. This paper addresses this gap by introducing change point detection tasks in dynamic graphs, discussing two tasks based on input data types: detection in graph snapshot series (focusing on graph topology changes) and time series on graphs (focusing on changes in graph entities with temporal dynamics). We then present related challenges and applications, provide a comprehensive taxonomy of surveyed methods, including datasets and evaluation metrics, and discuss promising research directions.

Affiliations: School of Artificial Intelligence, Beijing Normal University, Beijing, China; School of Information, Central University of Finance and Economics, Beijing, China; Zhejiang Institute of Optoelectronics, Jinhua, China; College of Oceanography and Space Informatics, China University of Petroleum (East China), Shandong, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA; Department of Computer Science, University of York, York, U.K.

Abstract:
In this work, we develop a family of Aligned Entropic Graph Kernels (AEGK) for graph classification. We commence by performing the Continuous-time Quantum Walk (CTQW) on each graph structure, and compute the Averaged Mixing Matrix (AMM) to describe how the CTQW visits all vertices from a starting vertex. More specifically, we show how this AMM allows us to compute a quantum Shannon entropy of each vertex for either un-attributed or attributed graphs. For a pair of graphs, the proposed AEGK kernels are defined by computing the kernel-based similarity between the quantum Shannon entropies of their pairwise aligned vertices. The analysis of theoretical properties reveals that the proposed AEGK kernels can not only address the shortcoming of neglecting the structural correspondence information between graphs arising in most existing R-convolution graph kernels, but also overcome the problems of neglecting the structural differences and vertex-attributed information arising in existing vertex-based matching kernels. Moreover, unlike most existing classical graph kernels that only focus on the global or local structural information of graphs, the proposed AEGK kernels can simultaneously capture both global and local structural characteristics through the quantum Shannon entropies, reflecting more precise kernel-based similarity measures between pairwise graphs. The above theoretical properties explain the effectiveness of the proposed AEGK kernels. Experimental evaluations demonstrate that the proposed kernels can outperform state-of-the-art graph kernels and deep learning models for graph classification.
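For orientation, the standard definition of the averaged mixing matrix of a CTQW can be written out as follows. The choice of Hamiltonian $H$ (for example the adjacency matrix) and the row-wise entropy $S(v)$ are assumptions made here for illustration; the abstract does not state the exact form used by the AEGK kernels.

```latex
% Spectral decomposition of the (assumed) Hamiltonian, with distinct
% eigenvalues \lambda_r and orthogonal projectors E_r onto their eigenspaces.
H = \sum_{r} \lambda_r E_r
% Averaged mixing matrix: the long-run time average of |U(t)_{vu}|^2,
% where U(t) = e^{-iHt} and \circ denotes the entrywise (Schur) product.
M = \sum_{r} E_r \circ E_r
% One natural per-vertex Shannon entropy, read off row v of M
% (an assumption; each row of M is a probability distribution over vertices).
S(v) = -\sum_{u \in V} M_{vu} \log M_{vu}
```

Since $M$ is doubly stochastic, row $v$ can be read as the time-averaged probability of finding a walk started at $v$ on each vertex $u$, which is what makes a row-wise entropy well defined.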

Abstract:
Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard of graph representation learning. Recent works focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they have to face an inevitable question: traditional pre-training strategies that aim at extracting useful information about pre-training tasks, may not extract all useful information about the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of Information Bottleneck (IB) and confirm that the forgetting phenomenon in pre-training phase may cause detrimental effects on downstream tasks. Therefore, we propose a novel Delayed Bottlenecking Pre-training (DBP) framework which maintains as much as possible mutual information between latent representations and training data during pre-training phase by suppressing the compression operation and delays the compression operation to fine-tuning phase to make sure the compression can be guided with labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both chemistry and biology domains demonstrate the effectiveness of DBP.

Abstract:
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores the fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index.

Abstract:
We study Graph Neural Networks (GNNs)-based embedding techniques for knowledge graph (KG) reasoning. For the first time, we link the path redundancy issue in the state-of-the-art path encoding-based models to the transformation error in model training, which brings us new theoretical insights into KG reasoning, as well as high efficacy in practice. On the theoretical side, we analyze the entropy of transformation error in KG paths and point out query-specific redundant paths causing entropy increases. These findings guide us to maintain the shortest paths and remove redundant paths for minimized-entropy message passing. To achieve this goal, on the practical side, we propose an efficient Graph Percolation process motivated by the percolation phenomenon in Fluid Mechanics, and design a lightweight GNN-based KG reasoning framework called Graph Percolation Embeddings (GraPE). GraPE outperforms state-of-the-art methods in both transductive and inductive reasoning tasks, while requiring fewer training parameters and less inference time.

Abstract:
Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is independent of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA.

Abstract:
Graph neural networks (GNNs) are effective machine learning models for many graph-related applications. Despite their empirical success, many research efforts focus on the theoretical limitations of GNNs, i.e., the expressive power of GNNs. Early works in this domain mainly focus on studying the graph isomorphism recognition ability of GNNs, and recent works try to leverage properties such as subgraph counting and connectivity learning to characterize the expressive power of GNNs, which are more practical and closer to real-world applications. However, no survey papers or open-source repositories comprehensively summarize and discuss models in this important direction. To fill the gap, we conduct a first survey of models for enhancing expressive power under different forms of definition. Concretely, the models are reviewed based on three categories, i.e., graph feature enhancement, graph topology enhancement, and GNN architecture enhancement.

Abstract:
Motif detection is a graph algorithm that detects certain local structures in a graph. Although network motif has been studied in graph analytics, e.g., social network and biological network, it is yet unclear whether network motif is useful for analyzing online transaction network that is generated in applications such as instant messaging and e-commerce. In an online transaction network, each vertex represents a user’s account and each edge represents a money transaction between two users. In this work, we try to analyze online transaction networks with network motifs. We design motif-based vertex embedding that integrates motif counts and centrality measurements. Furthermore, we design a distributed framework to detect motifs in large-scale online transaction networks. Our framework obtains the edge directions using a bi-directional tagging method and avoids redundant detection with a reduced view of neighboring vertices. We implement the proposed framework under the parameter server architecture. In the evaluation, we analyze different kinds of online transaction networks w.r.t the distribution of motifs and evaluate the effectiveness of motif-based embedding in downstream graph analytical tasks. The experimental results also show that our proposed motif detection framework can efficiently handle large-scale graphs.

Abstract:
Along with the rapid technological and commercial innovation on e-commerce platforms, an increasing number of frauds cause great harm to these platforms. Many frauds are conducted by organized groups of fraudsters for higher efficiency and lower costs, also known as group-based frauds. Despite the high concealment and strong destructiveness of group-based fraud, no existing research can thoroughly exploit the information within the transaction networks of e-commerce platforms for group-based fraud detection. In this work, we analyze and summarize the characteristics of group-based frauds. Based on this, we propose a novel end-to-end semi-supervised Group-based Fraud Detection Network (GFDN) to support such fraud detection in real-world applications. In addition, we introduce a module named Temporal Group Dynamics Analyzer (TGDA) that strengthens the ability to analyze temporal information on group fraudulent activity. Based on this, we built an enhanced model named TGFDN. Experimental results on large-scale e-commerce datasets from Taobao and Bitcoin trading datasets show our proposed model's superior effectiveness and efficiency for group-based fraud detection on bipartite graphs.

Abstract:
The constrained shortest path problem is a fundamental and challenging task in applications built on graphs. In this paper, we formalize and study the MinMin-MaxMax resource-constrained shortest path (MinMin-MaxMax RCSP) problem, which generalizes the well-studied MaxMax RCSP problem. The objective is to find a simple path of minimum cost between two query nodes, subject to resource constraints between minimum and maximum limits. This problem has wide applications in fields such as delay networks and transportation. However, we theoretically prove that computing the optimal solution is NP-hard. We propose a two-stage approach that involves resource-based graph reduction followed by cost-guided path generation. To reduce the cost of expensive acyclicity checking, we introduce the technique of ancestor checking based on the shortest path tree. Furthermore, we present an even faster incremental search approach that considers both the path cost and resource constraints while avoiding acyclicity checking. Extensive experiments on twenty real graphs consistently demonstrate the superiority of our proposed methods, achieving up to two orders of magnitude improvement in time efficiency over the baseline algorithms while producing high-quality solutions.
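To make the problem statement above concrete, the following sketch enumerates simple paths by depth-first search and keeps the cheapest one whose total resource lies between a lower and an upper limit. It is an exhaustive baseline only, not the two-stage or incremental approach proposed in the paper. The graph encoding, the function name, and the assumption of non-negative costs and resources (needed for the two pruning rules) are all illustrative.

```python
def constrained_shortest_path(graph, source, target, r_min, r_max):
    """Cheapest simple path whose total resource is within [r_min, r_max].

    graph: dict mapping node -> list of (neighbor, cost, resource) edges.
    Assumes non-negative costs and resources. Returns (cost, path) or None.
    """
    best_cost = float("inf")
    best_path = None

    def dfs(node, cost, resource, path, visited):
        nonlocal best_cost, best_path
        # Prune: the resource can only grow, and the cost cannot beat the best.
        if resource > r_max or cost >= best_cost:
            return
        if node == target:
            if resource >= r_min:
                best_cost, best_path = cost, list(path)
            return
        for neighbor, edge_cost, edge_resource in graph.get(node, []):
            if neighbor in visited:
                continue  # keep the path simple (no repeated nodes)
            visited.add(neighbor)
            path.append(neighbor)
            dfs(neighbor, cost + edge_cost, resource + edge_resource, path, visited)
            path.pop()
            visited.remove(neighbor)

    dfs(source, 0, 0, [source], {source})
    return None if best_path is None else (best_cost, best_path)


# Hypothetical toy graph: edges are (neighbor, cost, resource).
toy_graph = {
    "s": [("a", 1, 1), ("t", 5, 4), ("b", 2, 3)],
    "a": [("t", 1, 1)],
    "b": [("t", 1, 3)],
}
```

On `toy_graph`, the cheapest path overall is `s, a, t`, but tightening the lower resource limit forces the search onto more expensive paths, which is the effect the two-sided constraint is meant to capture.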

Abstract:
Training recommendation models on large datasets requires significant time and resources. It is therefore desirable to construct concise yet informative datasets for efficient training. Recent advances in dataset condensation show promise in addressing this problem by synthesizing small datasets. However, applying existing dataset condensation methods to recommendation has limitations: (1) they fail to generate discrete user-item interactions, and (2) they cannot preserve users’ potential preferences. To address these limitations, we propose a lightweight condensation framework tailored for recommendation (DConRec), focusing on condensing user-item historical interaction sets. Specifically, we model the discrete user-item interactions via a probabilistic approach and design a pre-augmentation module to incorporate the potential preferences of users into the condensed datasets. Because the substantial size of the datasets leads to costly optimization, we propose a lightweight policy gradient estimation to accelerate the data synthesis. Experimental results on multiple real-world datasets demonstrate the effectiveness and efficiency of our framework. In addition, we provide a theoretical analysis of the provable convergence of DConRec.

Abstract:
Similarity search finds objects that are similar to a given query object based on a similarity metric. As the amount and variety of data continue to grow, similarity search in metric spaces has gained significant attention. Metric spaces can accommodate any type of data and support flexible distance metrics, making similarity search in metric spaces beneficial for many real-world applications, such as multimedia retrieval, personalized recommendation, trajectory analytics, data mining, decision planning, and distributed servers. However, existing studies mostly focus on indexing metric spaces on a single machine, which faces efficiency and scalability limitations with increasing data volume and query amount. Recent advancements in similarity search turn towards distributed methods, while they face challenges including inefficient local data management, unbalanced workload, and low concurrent search efficiency. To this end, we propose DIMS, an efficient Distributed Index for similarity search in Metric Spaces. First, we design a novel three-stage heterogeneous partition to achieve workload balance. Then, we present an effective three-stage indexing structure to efficiently manage objects. We also develop concurrent search methods with filtering and validation techniques that support efficient distributed similarity search. Additionally, we devise a cost-based optimization model to balance communication and computation cost. Extensive experiments demonstrate that DIMS significantly outperforms existing distributed similarity search approaches.
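The filter-then-validate pattern mentioned above can be sketched for a single machine with one pivot. The triangle inequality gives `|d(q, p) - d(o, p)| <= d(q, o)`, so any object whose pivot distance differs from the query's pivot distance by more than the radius can be discarded without computing `d(q, o)`. The class name `PivotIndex`, the single-pivot layout, and the Euclidean default are assumptions for illustration; this is not the DIMS index.

```python
import math


def euclidean(a, b):
    return math.dist(a, b)


class PivotIndex:
    """Single-pivot index for range queries in a metric space."""

    def __init__(self, objects, pivot, dist=euclidean):
        self.objects = list(objects)
        self.pivot = pivot
        self.dist = dist
        # Precompute each object's distance to the pivot at build time.
        self.pivot_dists = [dist(obj, pivot) for obj in self.objects]

    def range_query(self, query, radius):
        """Return (matches, number of candidates that needed validation)."""
        query_to_pivot = self.dist(query, self.pivot)
        matches = []
        validated = 0
        for obj, obj_to_pivot in zip(self.objects, self.pivot_dists):
            # Filtering: the lower bound on d(query, obj) already exceeds radius.
            if abs(query_to_pivot - obj_to_pivot) > radius:
                continue
            # Validation: compute the real distance for surviving candidates.
            validated += 1
            if self.dist(query, obj) <= radius:
                matches.append(obj)
        return matches, validated
```

The filter never discards a true match, because it relies only on the metric axioms; its benefit is the number of exact distance computations it avoids.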

Abstract:
Map matching aims to align GPS trajectories to their actual travel routes on a road network, which is an essential pre-processing task for most trajectory-based applications. Many map matching approaches utilize the Hidden Markov Model (HMM) as their backbone. Typically, HMM treats GPS samples of a trajectory as observations and nearby road segments as hidden states. During map matching, HMM determines candidate states for each observation with a fixed searching range, and computes the most likely travel route using the Viterbi algorithm. Although HMM-based approaches can achieve high matching accuracy, they still suffer from high computation overheads. By inspecting the HMM process, we find that the computation bottleneck mainly comes from improper candidate sets, which contain many irrelevant candidates and incur unnecessary computations. In this paper, we present LiMM, a learned road network index structure for efficient map matching. LiMM improves existing HMM-based approaches in two aspects. First, we propose a novel learned index for road networks, which considers the characteristics of road data. Second, we devise an adaptive searching range mechanism to dynamically adjust the searching range for GPS samples based on their locations. As a result, LiMM can provide refined candidate sets for GPS samples and thus accelerate the map matching process. Extensive experiments are conducted with three large real-world GPS trajectory datasets. The results demonstrate that LiMM significantly reduces computation overheads, achieving an average speedup of 11.7× over baseline methods with only a slight accuracy loss of 1.8%.
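
To make the role of the candidate sets concrete, the following is a minimal Viterbi sketch over per-sample candidate lists. It is a generic illustration, not LiMM: the Gaussian-style emission score, the adjacency-based transition score, the constant `SIGMA`, and all toy data are assumptions chosen for the example.

```python
import math


def viterbi(candidates, emission_logp, transition_logp):
    """Most likely state sequence given per-step candidate sets.

    candidates[t] lists the hidden states (road segments) considered for
    observation t; both callables return log-probabilities.
    """
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (emission_logp(0, s), [s]) for s in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for s in candidates[t]:
            # Pick the predecessor that maximizes score + transition.
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + transition_logp(kv[0], s),
            )
            new_best[s] = (
                score + transition_logp(prev, s) + emission_logp(t, s),
                path + [s],
            )
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]


# Toy data (assumed): distance in metres from each GPS sample to each candidate.
dist_to_segment = [
    {"a": 5.0, "b": 20.0},
    {"a": 12.0, "b": 10.0, "c": 30.0},
    {"b": 25.0, "c": 4.0},
]
adjacent = {("a", "a"), ("a", "b"), ("b", "b"), ("b", "c"), ("c", "c")}
SIGMA = 10.0  # assumed GPS noise scale


def emission_logp(t, s):
    return -0.5 * (dist_to_segment[t][s] / SIGMA) ** 2


def transition_logp(u, v):
    return 0.0 if (u, v) in adjacent else math.log(1e-6)


candidates = [list(d) for d in dist_to_segment]
route = viterbi(candidates, emission_logp, transition_logp)
```

Each step costs on the order of the product of two consecutive candidate set sizes, which is why trimming irrelevant candidates reduces the overall matching time.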

Abstract:
Semi-supervised clustering leverages prior information in the form of constraints to achieve higher-quality clustering outcomes. However, most existing methods struggle with large-scale datasets owing to their high time and space complexity. Moreover, they have difficulty integrating various types of constraints seamlessly, which limits their applicability. In this paper, we present Scalable Semi-supervised clustering via Structural Entropy (SSSE), a novel method that handles large-scale datasets with different types of constraints from diverse sources to perform both semi-supervised partitioning and hierarchical clustering, and that is fully explainable compared to deep learning-based methods. Specifically, we design objectives based on structural entropy, integrating constraints for semi-supervised partitioning and hierarchical clustering. To achieve scalability on data size, we develop efficient algorithms based on graph sampling to reduce the time and space complexity. To achieve generalization on constraint types, we formulate a uniform view for widely used pairwise and label constraints. Extensive experiments on real-world clustering datasets at different scales demonstrate the superiority of SSSE in clustering accuracy and scalability with different constraints. Additionally, cell clustering experiments on single-cell RNA-seq datasets demonstrate the functionality of SSSE for biological data analysis.
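
As background for the structural entropy objectives, the sketch below computes the standard one-dimensional and two-dimensional structural entropy of an unweighted, undirected graph. It shows only the unconstrained quantities; the constraint terms and sampling algorithms of SSSE are not reproduced, and the function names are hypothetical. Every vertex is assumed to have at least one edge.

```python
from math import log2


def degrees(edges):
    """Vertex degrees of an undirected graph given as (u, v) pairs."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def one_dim_entropy(edges):
    """Entropy of the degree distribution d_v / vol(G), with vol(G) = 2|E|."""
    deg = degrees(edges)
    vol = 2 * len(edges)
    return -sum(d / vol * log2(d / vol) for d in deg.values())


def two_dim_entropy(edges, partition):
    """Structural entropy of `partition`, a list of disjoint vertex sets."""
    deg = degrees(edges)
    vol = 2 * len(edges)
    total = 0.0
    for module in partition:
        module_vol = sum(deg[v] for v in module)
        # Edges with exactly one endpoint inside the module.
        cut = sum((u in module) != (v in module) for u, v in edges)
        total -= sum(deg[v] / vol * log2(deg[v] / module_vol) for v in module)
        total -= cut / vol * log2(module_vol / vol)
    return total


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
flat = one_dim_entropy(edges)
clustered = two_dim_entropy(edges, [{0, 1, 2}, {3, 4, 5}])
```

Splitting the graph into its two triangles yields a lower value than the one-dimensional entropy, which is the sense in which a good partition reduces structural entropy.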

Abstract:
Effective management of business processes is crucial for organizational success. However, despite meticulous design and implementation, anomalies are inevitable and can result in inefficiencies, delays, or even significant financial losses. Numerous methods for detecting anomalies in business processes have been proposed recently. However, there is no comprehensive benchmark to evaluate these methods. Consequently, the relative merits of each method remain unclear due to differences in their experimental setup, choice of datasets and evaluation measures. In this paper, we present a systematic literature review and taxonomy of business process anomaly detection methods. Additionally, we select at least one method from each category, resulting in 16 methods that are cross-benchmarked against 32 synthetic logs and 19 real-life logs from different industry domains. Our analysis provides insights into the strengths and weaknesses of different anomaly detection methods. Ultimately, our findings can help researchers and practitioners in the field of process mining make informed decisions when selecting and applying anomaly detection methods to real-life business scenarios. Finally, some future directions are discussed in order to promote the evolution of business process anomaly detection.

Abstract:
Knowledge graph embedding is an efficient method for reasoning over known facts and inferring missing links. Existing methods are mainly triplet-based or graph-based. Triplet-based approaches learn the embedding of missing entities from a single triplet only, ignoring the fact that the knowledge graph is essentially a graph structure. Graph-based methods consider graph structure information but ignore the contextual information of nodes in the knowledge graph, making them unable to discern valuable entity (relation) information. In response to the above limitations, we propose a general graph transformer framework for knowledge graph embedding (TGformer). It is the first to use a graph transformer to build knowledge embeddings with triplet-level and graph-level structural features in both static and temporal knowledge graphs. Specifically, a context-level subgraph is constructed for each predicted triplet, which models the relation between triplets with the same entity. Afterward, we design a knowledge graph transformer network (KGTN) to fully explore multi-structural features in knowledge graphs, including triplet-level and graph-level features, helping the model to understand entities (relations) in different contexts. Finally, semantic matching is adopted to select the entity with the highest score. Experimental results on several public knowledge graph datasets show that our method can achieve state-of-the-art performance in link prediction.

Abstract:
The popularity of location-aware devices has boosted urban systems with massive volumes of anonymous trajectory data, presenting both challenges and opportunities for enhancing smart city initiatives through Trajectory-User Linking (TUL). Typically, TUL aims to match anonymous trajectories with specific users by exploring spatiotemporal patterns and insightful mobility behaviors. However, current TUL models face significant limitations due to their reliance on singular data sources and insufficient consideration of real-world scenarios. Furthermore, these models often lack evaluation in fair and comprehensive environments, hindering accurate assessment of their performance and applicability. This paper systematically investigates prevalent challenges encountered by existing TUL models, conducts a comprehensive review of state-of-the-art models, and proposes a structured framework that encompasses three core components: point-level representation learning, trajectory-level representation learning, and user linking. Through meticulously designed experiments, we examine the effectiveness and efficiency of leading TUL models in handling the complexities of real-world data, such as data imbalance, sparsity, new users, and scalability. This in-depth analysis uncovers limitations in existing methodologies and offers guidance for future advancements, contributing to the development of robust TUL solutions for urban mobility analysis and smart city technologies.

Abstract:
Multimodal Summarization aims to use multimodal data to generate accurate and concise summaries for long sentences. While previous works have achieved promising success, they have overlooked the mismatch among multimodal semantics and lacked subject information guidance for adaptive referential images. Motivated by this observation, we propose ASSM, an Adaptive Subject-focused modeling approach for multimodal summarization via Semantic Matching. The novelty of ASSM lies in two aspects. First, we propose a multimodal semantic matching module that projects multimodal inputs into a shared joint embedding semantic space to determine whether the semantics across modalities are mismatched. Second, we propose an adaptive subject-focused guide module, which adaptively references images to learn subject tokens based on the multimodal semantic matching results. With these subject tokens, we are able to focus on the subject information, providing precise guidance for summary generation. We conduct extensive experiments on two standard benchmarks and compare ASSM with 17 existing models. The experimental results regarding ROUGE, BERTScore, and MoverScore show that the proposed ASSM model outperforms all competitors, achieving state-of-the-art performance and suggesting the effectiveness of our proposal. In addition, we provide a case study to further demonstrate the usability of ASSM.

Abstract:
Time series are widely used in many classification and regression tasks. However, numerous time series contain unavoidable missing data, making it challenging to model the temporal dynamics of sequential data. Various data imputation methods have been proposed to infer missing values in time series. Although sequences recorded at fixed time intervals are presented in discrete form, they possess an inherent temporal continuity, which is ignored in most existing approaches. In this paper, we propose an end-to-end Attentive Continuous-Time Generative Adversarial Network (ACGANet) to estimate unobserved values in irregular sequences. ACGANet captures the temporal dynamics by transforming the discrete sequence into the continuous-time flow, thereby modeling the underlying distribution of the real data. Furthermore, ACGANet employs an adversarial learning strategy to alleviate the error introduced by imputed values, with the discriminator distinguishing between real and generated samples. Additionally, ACGANet introduces the log-density of hidden temporal states as an auxiliary loss to further optimize the generator. This allows the model to simultaneously focus on the overall temporal dynamics of the time series and the underlying distribution of the missing data. Extensive experiments on three publicly available real-world datasets demonstrate that ACGANet achieves state-of-the-art performance in imputing incomplete time series. Moreover, both qualitative and quantitative analyses validate the effectiveness of the proposed model.

Abstract:
With the proliferation and extensive use of the Internet of Things (IoT), it is vital to ensure the secure operation of IoT devices. However, due to limited computing power and lightweight system design, IoT devices often remain unprotected and are vulnerable to a myriad of attacks. Directly outsourcing the computationally intensive anomaly detection work to some middleboxes or cloud servers seems to solve this problem, but it can raise severe privacy concerns. To tackle these problems, we propose CryptIF, a scalable, privacy-preserving approach to detecting IoT anomalies from the cloud. By leveraging the Extended isolation forest (EIForest) and ciphertext comparison algorithms, CryptIF inspects features encrypted by fully homomorphic encryption (FHE) to detect various IoT anomalies. Furthermore, CryptIF parallelizes computing tasks by taking advantage of the single instruction, multiple data (SIMD) property of Cheon, Kim, Kim and Song (CKKS) homomorphic encryption to accelerate the detection process, thereby significantly increasing its scalability and operating efficiency. The evaluations demonstrate that CryptIF outperforms the state-of-the-art ciphertext-based anomaly detection approach in both detection accuracy and time efficiency. Additionally, CryptIF achieves comparable detection performance to plaintext-based IForest algorithms.

Abstract:
Recently, many machine learning-based approaches that effectively solve graph optimization problems have been proposed. A graph optimization problem aims to optimize (maximize or minimize) a quantity associated with a graph; examples include the Minimum Vertex Cover (MVC) and Maximum Independent Set (MIS) problems. These approaches are usually trained on graphs randomly generated with graph generators or sampled from existing datasets. However, we observe that such training graphs lead to poor testing performance if the testing graphs are not generated analogously, i.e., the generalizability of the models trained on those randomly generated training graphs is very limited. To address this critical issue, in this paper, we propose a new framework, named Learning with Iterative Graph Diversification (LIGD), which iteratively generates diversified training graphs and trains the models that solve graph optimization problems to improve their performance significantly, and we formulate a new research problem, named Diverse Graph Modification Problem (DGMP). We propose three approaches to solve DGMP by considering both the performance of the machine learning approaches and the structural properties of the training graphs. In addition, we study a practical case of DGMP, named Diverse Graph Modification Problem with XOR Diversity (DGMP-XDiv), which considers an XOR-based diversity function. We propose a polynomial-time algorithm named Structure Diversifying Modification on Edge Score (DMES) to obtain the optimal solution. We also propose DMES with Efficiency-Boosting Strategies (DMES-EB) to enhance the efficiency of DMES significantly. Experimental results on well-known problems show that our proposed approaches significantly boost the performance of both supervised and reinforcement learning approaches. They produce near-optimal results and significantly outperform the baseline approaches, such as graph augmentation and diffusion-based approaches.

Abstract:
Artificial intelligence (AI) advancements have significantly broadened its application across various sectors, simultaneously elevating concerns regarding the transparency and understandability of AI-driven decisions. Addressing these concerns, this paper embarks on an exploratory journey into Case-Based Reasoning (CBR) and Explainable Artificial Intelligence (XAI), critically examining their convergence and the potential this synergy holds for demystifying the decision-making processes of AI systems. We employ the concept of Explainable CBR (XCBR) system that leverages CBR to acquire case-based explanations or generate explanations using CBR methodologies to enhance AI decision explainability. Though the literature has few surveys on XCBR, recognizing its potential necessitates a detailed exploration of the principles for developing effective XCBR systems. We present a cycle-aligned perspective that examines how explainability functions can be embedded throughout the classical CBR phases: Retrieve, Reuse, Revise, and Retain. Drawing from a comprehensive literature review, we propose a set of six functional goals that reflect key explainability needs. These goals are mapped to six thematic categories, forming the basis of a structured XCBR taxonomy. The discussion extends to the broader challenges and prospects facing the CBR-XAI arena, setting the stage for future research directions. This paper offers design guidance and conceptual grounding for future XCBR research and system development.

Abstract:
With the development of AI, Big Data, and mobile communication, intelligent transportation has become popular in recent years. Path planning is a typical topic in intelligent transportation and has attracted significant attention from researchers. However, existing studies focus only on path planning within a single platform, which may lead to unexpected traffic congestion. Because multiple platforms provide route planning services, a plan that is optimal from the view of one platform may perform poorly in practice: several platforms may direct their users to the same roads and thereby cause congestion. Fortunately, with the rise of data sharing and cross-platform cooperation, the data silos between different platforms are gradually being broken. Based on this, we propose the Cooperative Global Path Planning (CGPP) framework to overcome the above shortcoming. CGPP allows the platform that receives a path planning request to send queries to cooperative platforms to optimize its path planning results. Such queries should be “easy” enough to answer, and the query frequency should be low. Based on this principle, we design a query decision model based on multi-agent reinforcement learning in the CGPP framework to decide the query range and query frequency, with actions and rewards designed specifically for the CGPP problem. Furthermore, we propose mechanisms to enhance query precision and reduce query overhead. Specifically, the Self-adjusting Query Area (SQA) concept allows refining query parameters, while the Query Reuse Optimization (QRO) algorithm aims to minimize the number of queries. To solve potential overestimation problems in queries, we propose a Distance-based Outer Query (DB-oq) and a Distance-Based Vehicle Count Estimation (DB-VCE) model. To address the issue that the time interval computed by the QRO algorithm might not fully adapt to dynamic traffic environments, we propose the Temporal Sequence Historical Integration for Time Interval Prediction (TSHI-TIP) algorithm. Extensive experiments on real and synthetic datasets confirm the effectiveness and efficiency of our algorithms.

Affiliations: Cyberspace Institute of Advanced Technology, Guangzhou University, and Guangdong Key Laboratory of Industrial Control System Security, and, Huangpu Research School of Guangzhou University, Guangzhou, China; Faculty of Information Science and Engineering, Ocean University of China, Qingdao, China; School of Software Engineering, Sun Yat-sen University, Zhuhai, China; School of Computer Science and Engineering, Northeastern University, Shenyang, China; China United Network Communications Group Corporation Limited, Beijing, China

Abstract:
Data privacy protection legislation around the world has increasingly enforced the “right to be forgotten” regulation, generating a surge in research interest in machine unlearning (MU), which aims to remove the impact of training data from machine learning models upon receiving revocation requests from data owners. There exist two major challenges for the performance of MU: execution efficiency and inference interference. The former requires minimizing the computational overhead for each execution of the MU mechanism, while the latter calls for reducing the execution frequency to minimize interference with normal inference services. Currently, most MU studies focus on the sample-level unlearning setting, leaving the equally important feature-level setting under-explored. Adapting the existing techniques to the latter turns out to be non-trivial. The only known feature-level work achieves an approximate unlearning guarantee, but suffers from degraded model accuracy and still leaves the inference interference challenge unsolved. We are therefore motivated to propose FELEMN, the first FEature-Level Exact Machine uNlearning method that overcomes both of the above-mentioned hurdles. For the MU execution efficiency challenge, we explore the impact of different feature partitioning strategies on the preservation of semantic relationships for maintaining model accuracy and MU efficiency. For the inference interference challenge, we propose two batching mechanisms, grounded on theoretical analysis, that combine as many individual unlearning requests as possible for joint processing while avoiding the potential privacy issues that come with wrongly postponing unlearning requests. Experiments on five real datasets show that FELEMN outperforms up-to-date competitors with up to 3× speedup for each MU execution, and a 50% runtime reduction by mitigating inference interference.

Abstract:
Graph similarity is critical in graph-related tasks such as graph retrieval, where metrics like maximum common subgraph (MCS) and graph edit distance (GED) are commonly used. However, exact computations of these metrics are known to be NP-Hard. Recent neural network-based approaches approximate the similarity score in embedding spaces to alleviate the computational burden, but they either involve expensive pairwise node comparisons or fail to effectively utilize structural and scale information of graphs. To tackle these issues, we propose a novel geometric-based graph embedding method called Graph2Region (G2R). G2R represents nodes as closed regions and recovers their adjacency patterns within graphs in the embedding space. By incorporating the node features and adjacency patterns of graphs, G2R summarizes graph regions, i.e., graph embeddings, where the shape captures the underlying graph structures and the volume reflects the graph size. Consequently, the overlap between graph regions can serve as an approximation of MCS, signifying similar node regions and adjacency patterns. We further analyze the relationship between MCS and GED and propose using disjoint parts as a proxy for GED similarity. This analysis enables concurrent computation of MCS and GED, incorporating local and global structural information. Experimental evaluation highlights G2R’s competitive performance in graph similarity computation. It achieves up to a 60.0% relative accuracy improvement over state-of-the-art methods in MCS similarity learning, while maintaining efficiency in both training and inference. Moreover, G2R showcases remarkable capability in predicting both MCS and GED similarities simultaneously, providing a holistic assessment of graph similarity.

Abstract:
In the domain of Multi-view Subspace Clustering (MSC) in Latent Embedding Space (LES), existing methods aim to capture and leverage critical multi-view information by mapping it into a low-dimensional LES. However, several aspects can be further improved: (i) Fusion Strategy: Existing methods adopt either early fusion or late fusion to integrate multi-view information, limiting the effectiveness of the fusion. (ii) Diversity: Current methods often overlook the inherent diversity in the multi-view data by focusing on a single LES. (iii) Efficiency: LES-based methods exhibit high computational complexity, with cubic time and quadratic space requirements based on the number of samples. To address these issues, we propose a novel framework called MSC-DOLES (Multi-view Subspace Clustering in Diverse Orthogonal Latent Embedding Spaces). MSC-DOLES incorporates a two-stage fusion approach that generates and learns from multiple LES to maximize cross-view diversity. Orthogonality constraints on individual LES ensure view-internal diversity, resulting in a set of Diverse Orthogonal Latent Embedding Spaces (DOLES). The DOLES are then fused into a consensus anchor graph using learnable anchors. The final clustering is induced by partitioning the obtained graph without pre-processing. We develop an eight-step optimization algorithm for MSC-DOLES, which exhibits nearly linear time and space complexities relative to the number of samples. Extensive experiments demonstrate the superiority of MSC-DOLES over state-of-the-art methods.

Abstract:
Learning graphical causal models from observational data can effectively elucidate the underlying causal mechanism behind the variables. In the context of limited datasets, modelers often incorporate prior knowledge, which is assumed to be correct, as a penalty in single-objective optimization. However, this approach struggles to adapt to complex and uncertain priors effectively. This paper introduces UpCM, which tackles the issue from a multi-objective optimization perspective. Instead of focusing exclusively on the DAG as the optimization goal, UpCM methodically evaluates the effect of uncertain priors on specific structures, merging data-driven and knowledge-driven objectives. Utilizing the MOEA/D framework, it achieves a balanced trade-off between these objectives. Furthermore, since uncertain priors may introduce erroneous constraints, resulting in PDAGs lacking consistent extensions, the minimal non-consistent extension is explored. This extension, which separately incorporates positive and negative constraints, aims to approximate the true causality of the PDAGs. Experimental results demonstrate that UpCM achieves significant structural accuracy improvements compared to baseline methods. It reduces the SHD by 7.94%, 13.23%, and 12.8% relative to PC_stable, GES, and MAHC, respectively, when incorporating uncertain priors. In downstream inference tasks, UpCM outperforms domain-expert knowledge graphs, owing to its ability to learn explainable causal relationships that balance data-driven evidence with prior knowledge.
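
For readers unfamiliar with the SHD metric reported above, the following sketch computes a structural Hamming distance between two directed graphs. Conventions differ across papers; this version assumes that a reversed edge counts once and that neither graph contains a two-node cycle. The function name `shd` and the toy graphs are illustrative, not taken from the UpCM evaluation.

```python
def shd(true_edges, learned_edges):
    """Structural Hamming distance between two directed graphs, each given
    as an iterable of (parent, child) pairs. Missing and extra edges count
    once each; a reversed edge also counts once (assumed convention)."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)

    def skeleton(edges):
        # Undirected version of the edge set.
        return {frozenset(e) for e in edges}

    # Edges present in one skeleton but not the other.
    missing_or_extra = skeleton(true_edges) ^ skeleton(learned_edges)
    # Edges present in both skeletons but with opposite orientation.
    reversed_edges = {
        (u, v)
        for (u, v) in learned_edges
        if (v, u) in true_edges and (u, v) not in true_edges
    }
    return len(missing_or_extra) + len(reversed_edges)


true_graph = {("A", "B"), ("B", "C"), ("C", "D")}
learned_graph = {("A", "B"), ("C", "B"), ("A", "D")}
distance = shd(true_graph, learned_graph)
```

Here the learned graph misses C→D, adds A→D, and reverses B→C, giving a distance of 3.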

Abstract:
An increasing number of technologies depend on the large-scale collection of individual-level data, whether for gathering statistical insights from billions of users or for training AI models. However, reliance on personal data raises privacy concerns that, in turn, limit the collection and analysis essential to these technologies. Differential Privacy (DP) has gained traction in both academia and industry, ensuring privacy by adding carefully crafted noise to data or its outputs based on a pre-defined privacy loss budget ε. As real-world implementations emerge, we can examine how DP is practically used beyond academic settings, supporting industry adoption and expanding knowledge on DP applications. Using a systematic process, we comprehensively surveyed the deployed parameters of DP configurations in both commercial and governmental implementations (n=140) and compared them to those employed in academic research. We also propose a high-level taxonomy for DP configuration that captures practical implementations of differentially-private Machine Learning (ML) and Federated Learning (FL) applications, and highlights key factors such as the privacy unit and the privacy loss budget ε. Our results show that, on average, ε values utilized in industry span a wider range than those used in academic research, with distinct configuration policies for governmental and commercial organizations. Moreover, we identified contrasting reasoning behind ε selection across deployment environments, as well as insufficient transparency in how commercial organizations report implemented DP parameters and limited support for user-oriented configuration. Finally, we discuss how the collected knowledge can be used to create methodological guidelines for the configuration of DP in real-world environments, supporting the vision of an Epsilon Registry.
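
To show how the privacy loss budget ε controls the added noise, here is a minimal sketch of the Laplace mechanism, one standard way to achieve ε-DP for a numeric query. It is illustrative only and not tied to any deployment in the survey; the function names and the example count are assumptions, and floating-point sampling like this is not suitable for production use.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value plus Laplace(0, b) noise."""
    b = laplace_scale(sensitivity, epsilon)
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    noise = b * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_value + noise


# A counting query has sensitivity 1: one person changes the count by at most 1.
rng = random.Random(0)
noisy_count = laplace_mechanism(1000, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is why the choice of ε matters so much in practice.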

Abstract:
Influence maximization (IM) aims to identify k vertices that maximize influence spread across a network. While well-studied in regular graphs, IM in hypergraphs presents unique challenges: conventional graph-based IM methods fail to capture hypergraph-specific structural properties, and existing hypergraph IM algorithms lack theoretical guarantees for time complexity and approximation quality. We address these gaps with HyperIM, a novel algorithm leveraging stratified sampling to generate random reversible reachable sets for efficient seed selection. Our key innovation lies in dual-perspective stratified sampling: assigning sampling probabilities based on vertex structural properties while applying size-adaptive sampling strategies. This approach optimizes seed selection, reduces computational costs, and provides rigorous theoretical guarantees. We further propose HyperIM_BRR, which optimizes the required number of reversible reachable sets, achieving substantial cost reduction without sacrificing accuracy. Extensive experiments on real-world hypergraphs demonstrate that our algorithms significantly outperform state-of-the-art methods, delivering faster execution times and superior influence spread.
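
Sampling-based IM methods commonly select seeds by greedy maximum coverage over the sampled sets. The sketch below illustrates that generic step only, under the assumption that the sampled sets are already available; it does not implement the stratified sampling of HyperIM, and the function name and toy sets are hypothetical.

```python
def greedy_seed_selection(sampled_sets, k):
    """Pick up to k vertices that together cover the most sampled sets.

    Returns the chosen seeds and the fraction of sets they cover, which
    serves as an estimate of relative influence spread.
    """
    if not sampled_sets:
        return [], 0.0
    uncovered = [set(s) for s in sampled_sets]
    seeds = []
    for _ in range(k):
        # Count how many still-uncovered sets each vertex appears in.
        counts = {}
        for s in uncovered:
            for v in s:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break  # every set is already covered
        # Ties are broken in favour of the smallest vertex id.
        best = max(sorted(counts), key=counts.get)
        seeds.append(best)
        uncovered = [s for s in uncovered if best not in s]
    covered_fraction = 1 - len(uncovered) / len(sampled_sets)
    return seeds, covered_fraction


sampled_sets = [{1, 2}, {2, 3}, {2}, {4}, {4, 5}, {6}]
seeds, covered_fraction = greedy_seed_selection(sampled_sets, k=2)
```

In this toy input, vertex 2 covers three sets and vertex 4 covers two of the remainder, so two seeds cover five of the six sets.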

Abstract:
Class overlap is a major factor of data complexity that hampers classifier performance, particularly in imbalanced learning scenarios. Most existing oversampling methods rely on conservative seed sample selection and decoupled synthesis strategies, which limit sample diversity and fail to effectively control overlap risk. This paper proposes a novel oversampling framework called TMACO (Class Alliance-Constrained Oversampling), which integrates data complexity considerations into both seed selection and sample generation. First, TMACO selects seed sample units using a class alliance constraint that jointly considers spatial geometry and class distribution to enhance diversity and representativeness. Second, it generates synthetic samples based on three-point units to ensure regional stability. Third, a region-level filtering mechanism is applied to prevent synthetic samples from intruding into majority class areas. Extensive experiments on benchmark and real-world datasets demonstrate that TMACO consistently improves minority class performance and overall classification accuracy compared to state-of-the-art oversampling techniques. The proposed method also offers interpretable parameter control and adapts well to varying task objectives.

Abstract:
Ensemble clustering aggregates multiple weak clusterings to achieve a more accurate and robust consensus result. The Co-Association matrix (CA matrix) based method is the mainstream ensemble clustering approach; it constructs the similarity relationships between sample pairs according to the weak clustering partitions to generate the final clustering result. However, existing methods neglect that the quality of a cluster is related to its size, i.e., a smaller cluster tends to have higher accuracy. Moreover, they do not consider the valuable dissimilarity information in the base clusterings, which can reflect the varying importance of sample pairs that are completely disconnected. To this end, we propose the Similarity and Dissimilarity Guided Co-association matrix (SDGCA) to achieve ensemble clustering. First, we introduce normalized ensemble entropy to estimate the quality of each cluster, and construct a similarity matrix based on this estimation. Then, we employ the random walk to explore high-order proximity of base clusterings to construct a dissimilarity matrix. Finally, the adversarial relationship between the similarity matrix and the dissimilarity matrix is utilized to construct a promoted CA matrix for ensemble clustering. We compared our method with 13 state-of-the-art methods across 12 datasets, and the results demonstrated the superior clustering ability and robustness of the proposed approach.
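
As a point of reference, the plain co-association matrix that such methods start from can be sketched as follows. This is the basic unweighted CA matrix, not the promoted matrix of SDGCA; the function name and the toy base clusterings are assumptions for illustration.

```python
def co_association(base_clusterings):
    """CA[i][j] = fraction of base clusterings that place samples i and j
    in the same cluster. Each base clustering is a list of cluster labels."""
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


# Three weak clusterings of four samples.
base_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
ca = co_association(base_clusterings)
```

Samples 0 and 1 share a cluster in two of the three partitions, so their entry is 2/3, while samples 0 and 3 never co-occur and get 0.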

Abstract:
Tabular data is prevalent in many fields. In practice, tabular data classification may encounter severe challenges due to class imbalance, i.e., some majority classes overwhelm minority ones. Such imbalance could lead to biased prediction tendency of trained classifiers towards majority classes. Oversampling minority classes is an essential solution due to its generality and independence of downstream tasks. Recent years have witnessed the advantages of generative adversarial networks (GANs) in synthetic data generation, favored for their ability to generate quasi-realistic samples. However, challenges arise when the size of minority classes is too small to provide sufficient information for learning real data distributions. Furthermore, the generated minority-class samples could exacerbate the class overlap problem, i.e., some generated samples unexpectedly overlap with partial majority-class samples. To address these challenges, this paper presents B2BGAN, a novel GAN-based approach for oversampling imbalanced tabular data. To capture the real data distribution in a fine-grained manner, we propose a novel backbone-to-branches neural network for the generator to fit the majority and minority classes simultaneously. The backbone network fits the whole distribution of the entire data, while each branch network grasps the distinctive characteristics of individual classes. To alleviate the class overlap problem of generated samples, we develop a prototype-guided loss function to ensure that generated samples are closer to the corresponding class prototypes. We evaluate the effectiveness of B2BGAN on six real-world datasets using six metrics. Experimental results demonstrate that our method outperforms state-of-the-art models by 5.38% in AUC and 10.19% in AP.

Abstract:
Predicting information popularity in social networks has become a central focus of network analysis. While recent advancements have been made, most existing approaches rely solely on the final cascade size as the primary supervision signal for model optimization. This narrow focus limits the model generalization ability, particularly when faced with highly heterogeneous cascades. Additionally, in real-world scenarios, obtaining detailed social relationships is challenging, complicating effective structural feature learning. To address these issues, this paper proposes a semi-supervised model called Dual Variational Cascade AutoEncoders (DVCAE), which leverages parallel structural and temporal variational autoencoders for enhanced feature learning and popularity prediction. The model first aggregates multiple cascades into a global interaction graph, enabling structural information sharing across cascades. Then, it applies sparse matrix factorization-based graph embedding and graph filtering techniques on global and local cascade graphs respectively, generating initial node embeddings that are insensitive to topological perturbations. After that, two parallel variational autoencoders are designed to generate hidden representations for structural and temporal features respectively, with two self-supervised reconstruction losses integrated into the prediction loss to enrich supervision signals. Extensive experiments conducted on three real-world datasets demonstrate that DVCAE outperforms state-of-the-art models in terms of prediction accuracy.

Abstract:
Graph neural networks (GNNs) have garnered significant attention for their competitive performance on graph-structured data. However, many existing methods are commonly constrained by the homophily assumption, making them overly reliant on the uniform neighbor propagation, which limits their ability to generalize to heterophilous graphs. Although some approaches extend aggregation to multi-hop neighbors, adapting neighborhood sizes on a per-node basis remains a significant challenge. In view of this, we propose an Evolutionary Graph Neural Network (EGNN) with adaptive structure-level aggregation and label smoothing, offering a novel solution to the aforementioned drawback. The core innovation of EGNN lies in assigning each node a personalized neighborhood structure utilizing behavior-level crossover and mutation. Specifically, we first adaptively search for the optimal structure-level neighborhoods for nodes within the solution space, leveraging the exploratory capabilities of evolutionary computation. This approach enhances the exchange of information between the target node and surrounding nodes, achieving a smooth vector representation. Subsequently, we adopt the optimal structure obtained through evolutionary search to perform label smoothing, further boosting the robustness of the framework. We conduct experiments on nine real-world networks with different homophily ratios, and the results demonstrate that EGNN can match or surpass SOTA baselines.

Abstract:
Density peaks clustering (DPC) is one of the density-based clustering algorithms and has been widely studied and applied in recent years because of its unique parameter, non-iteration and good robustness. However, it cannot effectively identify the cluster centers, and its time and space complexities are too high. To this end, this paper proposes a fast density peaks clustering algorithm based on approximate k-nearest neighbors (FDPAN). Firstly, it uses the Balanced K-means based Hierarchical K-means (BKHK) method to partition the data and quickly find the approximate k-nearest neighbors (AKNN), improving the algorithm’s efficiency on large-scale high-dimensional data. Meanwhile, three-way clustering is used to improve the neighbor search of the boundary points of the partition. Then, the local density and relative distance of DPC are recalculated by AKNN. Finally, according to the similar density chain, the connected high-density points are labeled while searching for the cluster center, and the remaining points are assigned to the clusters where their nearest higher-density points are located. Theoretical analysis and experiments on synthetic and real datasets show that FDPAN can obtain better clustering results and shorten the operation time on large-scale high-dimensional data compared with DPC and its variants.
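
For readers unfamiliar with the two quantities mentioned above, the following minimal sketch computes the local density and relative distance of the original DPC with a cutoff kernel and exact pairwise distances. It does not implement FDPAN's AKNN-based recalculation; the function name `density_peaks`, the cutoff `dc` value, and the toy points are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def density_peaks(
    points: Sequence[Tuple[float, ...]], dc: float
) -> Tuple[List[int], List[float]]:
    """Return (rho, delta) for classic DPC.

    rho[i]   = number of other points closer than the cutoff dc.
    delta[i] = distance to the nearest point of strictly higher density;
               for a point with no denser point, the largest distance
               from it to any other point.
    """
    n = len(points)
    # Exact O(n^2) distance matrix; this is the cost FDPAN tries to avoid.
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < dc)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Two hypothetical groups on a line: {0, 1, 2} and {10, 11}.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
rho, delta = density_peaks(pts, dc=1.5)
```

Points with both a large `rho` and a large `delta` are the candidate cluster centers. The full distance matrix built here is the source of the quadratic time and space cost that the abstract criticizes.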

Abstract:
The spatial density distribution collected and aggregated from users’ trajectory data is vital for location-based services like regional popularity analysis and congestion measurement. However, spatial density aggregation poses privacy concerns since trajectory data usually originate from users. Local differential privacy (LDP) addresses these concerns by allowing users to perturb their data before reporting it. Yet, LDP is vulnerable to poisoning attacks where attackers manipulate data from malicious users. Recent studies attempt to defend against such attacks in LDP-enabled frequency estimation but suffer from inaccurate data recovery due to empirical presets of malicious user proportions and inaccurate malicious data estimation. These issues worsen in spatial density aggregation, as high-dimensional trajectory data help conceal malicious information. In this work, we propose GeoRecover, a method to defend against poisoning attacks in LDP-enabled spatial density aggregation by addressing previous limitations. GeoRecover designs an adaptive model to unify these attacks. Under this model, GeoRecover estimates the proportion of malicious users using statistical differences between genuine and malicious data and learns malicious data statistics through LDP properties. This allows GeoRecover to recover accurate spatial density distribution by subtracting malicious users’ contributions. Evaluations on two real-world datasets show GeoRecover outperforms state-of-the-art methods in recovery accuracy, defense capability, and practical performance.

Abstract:
When distribution shifts occur between testing and training graph data, out-of-distribution (OOD) samples undermine the performance of graph neural networks (GNNs). To improve adaptive OOD generalization of GNNs, this paper introduces a novel generative invariant graph learning framework, named GI-Graph. It consists of four modules: subgraph extractor, generative environment subgraph augmentation, generative invariant subgraph learning, and query feedback module. The subgraph extractor decomposes a graph sample into an environment subgraph and an invariant subgraph and improves extraction accuracy through query feedback. GI-Graph uses a diffusion model to generate diverse environment subgraphs, augmenting the OOD data. By combining diffusion models, contrastive learning, and attribute prediction networks, GI-Graph also generates augmented invariant subgraphs with significant identically distributed features and consistency of labels. Experimental results demonstrate that the controllable environment subgraph and invariant subgraph augmentation effectively improve the OOD generalization capability of GI-Graph, especially in capturing invariant features and maintaining category consistency across environments. Additionally, the contrastive learning-based fine-tuning method enables GI-Graph to quickly adapt to evolving environments. This paper verifies the effectiveness of the generative invariant graph learning scheme in graph OOD generalization.

Abstract:
Graph representation learning is a fundamental research theme and can be generalized to benefit multiple downstream tasks from the node and link levels to the higher graph level. In practice, it is desirable to develop task-agnostic graph representation learning methods that are typically trained in an unsupervised manner. However, existing unsupervised graph models, represented by the variational graph auto-encoders (VGAEs), can only address node- and link-level tasks while manifesting poor generalizability on the more difficult graph-level tasks because they can only keep low-order isomorphic consistency within the subgraphs of one-hop neighborhoods. To overcome the limitations of existing methods, in this paper, we propose the Isomorphic-Consistent VGAE (IsoC-VGAE) for multi-level task-agnostic graph representation learning. We first devise an unsupervised decoding scheme to provide a theoretical guarantee of keeping the high-order isomorphic consistency within the VGAE framework. We then propose the Inverse Graph Neural Network (Inv-GNN) decoder as its intuitive realization, which trains the model via reconstructing the node embeddings and neighborhood distributions learned by the GNN encoder. Extensive experiments on multi-level graph learning tasks verify that our model achieves superior or comparable performance compared to both the state-of-the-art unsupervised methods and representative supervised methods with distinct advantages on the graph-level tasks.

Abstract:
The Conditional Expectation Function (CEF) is an optimal estimator in real space. Artificial Neural Networks (ANN), as the current state-of-the-art method, lack interpretability. Estimating CEF offers a path to achieve both accuracy and interpretability. Previous attempts to estimate CEF rely on limiting assumptions such as independence and distributional form or perform the expensive nearest neighbor search. We propose Dynamically Ordered Precise Bayes Regression (DO-PBR), a novel method to estimate CEF in discrete space. We prove that DO-PBR approaches optimality with an increasing number of samples. DO-PBR dynamically learns importance rankings for the predictors, which are region-specific, allowing the importance of a predictor to vary across the space. DO-PBR is fully interpretable and makes no assumptions on independence or the distributional form, while requiring minimal parameter setting. In addition, DO-PBR avoids the costly nearest-neighbor search by using a hierarchy of binary trees. Our experiments confirm our theoretical claims on approaching optimality and show that DO-PBR achieves substantially higher accuracy compared to ANN when given the same amount of time. Our experiments show that, on average, ANN takes 32 times longer to achieve the same level of accuracy as DO-PBR.
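
To make "estimating the CEF in discrete space" concrete, here is a minimal sketch of the plain cell-mean estimator: for each observed discrete predictor value x, E[Y | X = x] is estimated by the average response in that cell. This is only the textbook baseline, not DO-PBR (there is no dynamic predictor ordering and no tree hierarchy); `fit_cef` and the toy data are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def fit_cef(xs: Sequence[Hashable], ys: Sequence[float]) -> Dict[Hashable, float]:
    """Estimate E[Y | X = x] for every observed discrete cell x
    as the sample mean of the responses falling in that cell."""
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


# Hypothetical data: each predictor is a tuple of discrete features.
xs = [("a", 0), ("a", 0), ("b", 1)]
ys = [1.0, 3.0, 5.0]
cef = fit_cef(xs, ys)
```

Such a table is directly inspectable, which hints at why a CEF-based estimator can be interpretable. Its weakness is that cells with few or no samples give poor or missing estimates.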

Abstract:
Concept drift is the phenomenon in which the underlying data distributions and statistical properties of a target domain change over time, leading to a degradation in model performance. Consequently, production models require continuous drift detection monitoring. Most drift detection methods to date are supervised, relying on ground-truth labels. However, they are inapplicable in many real-world scenarios, as true labels are often unavailable. Although recent efforts have proposed unsupervised drift detectors, many lack the accuracy required for reliable detection or are too computationally intensive for real-time use in high-dimensional, large-scale production environments. Moreover, they often fail to characterize or explain drift effectively. To address these limitations, we propose DriftLens, an unsupervised framework for real-time concept drift detection and characterization. Designed for deep learning classifiers handling unstructured data, DriftLens leverages distribution distances in deep learning representations to enable efficient and accurate detection. Additionally, it characterizes drift by analyzing and explaining its impact on each label. Our evaluation across classifiers and data-types demonstrates that DriftLens (i) outperforms previous methods in detecting drift in 15/17 use cases; (ii) runs at least 5 times faster; (iii) produces drift curves that align closely with actual drift (correlation ≥ 0.85); (iv) effectively identifies representative drift samples as explanations.

Abstract:
Knowledge graphs often suffer from incompleteness issues, which can be alleviated through information completion. However, current state-of-the-art deep knowledge convolutional embedding models rely on external convolution kernels and conventional convolution processes, which limits the feature interaction capability of the model. This paper introduces a novel dynamic convolutional embedding model, named ConvD, which directly reshapes relation embeddings into multiple internal convolution kernels. This approach effectively enhances the feature interactions between relation embeddings and entity embeddings. Simultaneously, we incorporate a priori knowledge-optimized attention mechanism that assigns distinct contribution weights to multiple relational convolution kernels during dynamic convolution, further boosting the expressive power of the model. Extensive experiments on various datasets show that our proposed model consistently outperforms the state-of-the-art baseline methods, with average improvements ranging from 3.28% to 14.69% across all the evaluation metrics, while the number of parameters is reduced by 50.66% to 85.40% compared to other state-of-the-art models.

Abstract:
Encrypted Databases (EDBs) are essential for protecting sensitive data outsourced to public clouds, enabling diverse index-based queries over encrypted data. However, existing EDB indexes often incur high storage overhead and performance degradation, primarily due to the poor compressibility of pseudorandom encrypted values, which leads to frequent accesses to slower persistent storage as indexes outgrow main memory. We introduce ECStore, the first EDB that supports compressible and efficient indexing. Observing that EDB indexes are used solely for lookups and never decrypted, we design ECTree, a cryptographic hash-based index structure in which each node is a compressible bit-string identifier that conceals plaintext keys. ECTree enables logarithmic-time encrypted search via a novel membership testing mechanism. To address false positives arising in dynamic workloads, we introduce Directed View Check (DVC), which detects inaccuracies and avoids redundant traversals. Additionally, ECTree’s Merkle-tree-like structure supports encrypted query authentication, resisting server compromise. Extensive evaluations show that ECStore can achieve up to 94.7% lower latency and 10.5x higher throughput on popular benchmarks compared to notable EDBs.

Abstract:
Large Language Models (LLMs) are increasingly prominent in the recommendation systems domain. Existing studies usually utilize in-context learning or supervised fine-tuning on task-specific data to align LLMs with recommendation tasks. However, the substantial bias in semantic spaces between language processing tasks and recommendation tasks poses a nonnegligible challenge. Specifically, without the adequate capturing ability of collaborative information, existing modeling paradigms struggle to capture behavior patterns within community groups, leading to LLMs’ ineffectiveness in discerning implicit interaction semantics in recommendation scenarios. To address this, we consider enhancing the learning capability of language model-driven recommendation models for structured data, specifically by utilizing interaction graphs rich in collaborative semantics. We propose Graph-Aware Learning for Language Model-Driven Recommendations (GAL-Rec). GAL-Rec enhances the understanding of user-item collaborative semantics by imitating the intent of Graph Neural Networks (GNNs) to aggregate multi-hop information, thereby fully exploiting the substantial learning capacity of LLMs to independently address the complex graphs in the recommendation system. Extensive experimental results on three real-world datasets demonstrate that GAL-Rec significantly enhances the comprehension of collaborative semantics and improves recommendation performance.

Affiliations: School of Cyber Science and Technology, Beihang University, Beijing, China; Department of Mathematics, College of Science, Shantou University, Shantou, China; School of Computer Science and Engineering, Beihang University, Beijing, China; Faculty of Information Engineering and Automation, Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, China; Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA

Abstract:
With long-tailed data and complex label hierarchy, hierarchical text classification (HTC) is a challenging multi-label text classification task. Applying prompts to pre-trained language models (PLMs) has recently become a mainstream approach in HTC. However, existing prompt-based models experience a significant drop in classification performance on tail labels. Due to the imbalanced data, HTC models still face two challenges. First, text embeddings, learned for classification, often lack distinctiveness for tail categories. Second, label embeddings suffer from significant degeneration, especially for tail labels. To address these issues, in this paper, we propose a novel Hierarchical Text Classification Optimization method via Structural Entropy and SIngular Spectrum Smoothing, namely SIHTC. SIHTC contains two parts: text embedding optimization and label embedding optimization. First, based on the structural information theory, we design a tree aggregation network and construct encoding trees to minimize the structural entropy of texts under the hierarchical labels. In this manner, SIHTC injects label structural information into text embeddings, hierarchically optimizing the embedding space by enclosing the text embeddings within related ground truth labels while separating them from unrelated ground truth labels. Second, we propose a global and local singular spectrum smoothing regularization method to maximize the area under the singular value curve. In this way, SIHTC decreases representation degeneration and learns label embeddings with improved label generalization capability. Extensive experiments are conducted on three popular HTC datasets. The results show that SIHTC outperforms all baseline methods, especially with an advantage in handling tail labels, indicating the effectiveness of the above two optimizations.

Affiliations: College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Center for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore; Science and Technology and Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China; College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; Department of Computer Science, State University of New York, New Paltz, NY, USA

Abstract:
Bipartite graph clustering (BGC) has emerged as a fast-growing research topic in the clustering community. Although BGC has achieved promising scalability, most variants still suffer from the following concerns: a) Susceptibility to noisy features. They construct bipartite graphs in the raw feature space, inducing poor robustness to noisy features. b) Inflexible anchor selection strategies. They usually select anchors through heuristic sampling or constrained learning methods, degrading flexibility. c) Partial structure mining. Existing methods are mainly built upon the Linear Reconstruction Paradigm (LRP) from subspace clustering or the Locally Linear Paradigm (LLP) from manifold learning, which partially exploit linear or locally linear structures, lacking a unified perspective to integrate global complementary structures. To this end, we propose a novel model, termed Joint Robust Embedding and Structural Fusion Bipartite Graph Clustering (JetBGC), which focuses on three aspects, namely robustness, flexibility, and complementarity. Concretely, we first introduce a robust embedding learning module to extract latent representation that can reduce the impact of noisy features. Then, we optimize anchors via a constraint-free strategy that can flexibly capture data distribution. Furthermore, we revisit the consistency and specificity of LRP and LLP, and design a new unified structural fusion strategy to integrate both linear and locally linear structures from a global perspective. Therefore, JetBGC unifies robust representation learning, flexible anchor optimization, and structural bipartite graph fusion in a framework. Extensive experiments on synthetic and real-world datasets validate our effectiveness against existing baselines.

Abstract:
This study addresses the Multiple Flying Sidekicks Traveling Salesman Problem (mFSTSP), where parallel Unmanned Aerial Vehicles (UAVs), or drones, work alongside a truck to enhance delivery efficiency. Existing scheduling approaches face challenges in high computational costs and the risk of converging to local optima due to excessive exploration in unknown environments, especially in large-scale mFSTSP. This study proposes a Large Language Model Enhanced Q-Learning Approach (LLM-QL) to solve mFSTSP, which combines the local exploration advantages of Q-Learning with the global understanding of unknown environments provided by LLMs, thus improving the efficiency of path planning. A novel prompt strategy is also provided, transforming the problem modeling into a format easily understood by LLMs, guiding the algorithm’s exploration and significantly improving convergence. We also provide a proof of the convergence of LLM-QL. Experimental results demonstrate that LLM-QL achieves up to a 1.35x improvement in key performance metrics such as total completion time, algorithm runtime, and UAV utilization, compared to existing state-of-the-art methods.
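
The Q-Learning component referred to above builds on the standard tabular update rule, sketched here under simple assumptions. It shows only the textbook update, not the LLM guidance or the mFSTSP state encoding; the function name `q_update`, the dictionary-based table, and the state and action labels are hypothetical.

```python
from typing import Dict, Hashable

QTable = Dict[Hashable, Dict[Hashable, float]]


def q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> float:
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    in place and return the updated value."""
    next_values = q.get(next_state, {})
    best_next = max(next_values.values()) if next_values else 0.0
    old = q.setdefault(state, {}).get(action, 0.0)
    new = old + alpha * (reward + gamma * best_next - old)
    q[state][action] = new
    return new


# Hypothetical two-state example: the next state already has a learned value.
q: QTable = {"s1": {"left": 0.0, "right": 2.0}}
updated = q_update(q, state="s0", action="go", reward=1.0, next_state="s1")
```

Here the target is 1 + 0.9 × 2 = 2.8, so the entry moves halfway from 0 toward it.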

Abstract:
Interval prediction is crucial in decision-making processes across many domains. Although significant progress has been made in existing interval prediction methods, they still face several challenges, such as assumptions about data distribution, fixed interval widths, limitations of gradient-based optimization algorithms, crossing of upper and lower bounds, and insufficient consideration of multi-scale spatio-temporal patterns. To address these issues, we propose a Multi-Scale Deep Interval Prediction Network (MSDIPN). Specifically, a Multi-Scale Spatio-Temporal Self-Attention Mechanism is introduced to capture spatio-temporal dependencies across different spatial scales. Additionally, a Temporal Self-Attention Mechanism module is constructed to extract temporal dependencies of historical variables across varying lag phases. Then a Global Self-Attention Mechanism module is designed to address representation degradation using residual connections and self-attention mechanisms. To overcome limitations related to distributional assumptions, fixed interval widths, and crossing problems, an Improved LUBE module is developed as the output module for generating prediction intervals (PIs) of time series data. Furthermore, a gradient-based PIs loss function is designed to address the optimization issue of MSDIPN by integrating a smooth approximation function with a pinball loss function. We validate the effectiveness of the proposed algorithm using five real-world datasets, demonstrating its superiority over traditional models.
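
The pinball loss mentioned above has a compact standard form, sketched below for a single quantile level. This is the generic quantile loss only; the smooth approximation and the Improved LUBE output module of MSDIPN are not reproduced, and the function names and sample values are assumptions.

```python
from typing import Sequence


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball (quantile) loss for quantile level q in (0, 1):
    q * (y - y_hat) if y >= y_hat, else (1 - q) * (y_hat - y)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    y_true: Sequence[float], y_pred: Sequence[float], q: float
) -> float:
    """Average pinball loss over paired observations and predictions."""
    losses = [pinball_loss(t, p, q) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# With q = 0.9, under-prediction costs nine times more than over-prediction.
under = pinball_loss(10.0, 8.0, q=0.9)   # prediction too low
over = pinball_loss(8.0, 10.0, q=0.9)    # prediction too high
```

Training an upper bound with a high `q` and a lower bound with a low `q` is a common way to obtain a prediction interval without assuming a distribution. The loss is non-differentiable where the prediction equals the target, which is presumably why the abstract pairs it with a smooth approximation.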

Abstract:
The rise in instant delivery services necessitates efficient route planning in last-mile delivery scenarios, where new orders arrive dynamically and need to be integrated into existing routes. In such contexts, complete re-optimization of routes is not permitted, and node insertion into existing route sequences is the only viable option. However, many existing heuristics for node insertion, such as the Cheapest Insertion (CI) method, are myopic and often result in suboptimal solutions retrospectively. This paper presents Neuro-Ins, an initial yet novel attempt at harnessing a learning-based framework to handle the insertion of new orders for the Pickup and Delivery Problem (PDP). In contrast to CI, which considers only one node at a time for insertion, Neuro-Ins leverages an Attention-Mechanism (AM) based encoder-decoder structure to collectively consider all nodes to be inserted, thereby enhancing the quality of the eventual solution. To further improve the model’s representation of the current route, we introduce a position embedding to enrich the node feature embedding with positional information of the route. Experiments on synthetic and real-world datasets demonstrate that Neuro-Ins, trained by PPO, consistently outperforms CI without compromising computational speed, and it also surpasses the performance of state-of-the-art solution methods implemented in the industry. Our findings emphasize the importance of explicitly considering all nodes to be inserted along with the en-route nodes and their positions in the route, showcasing the efficacy of the proposed AM-based framework in optimizing the instant delivery routes.
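
The Cheapest Insertion baseline that Neuro-Ins is compared against can be sketched as follows: a single new node is placed between the pair of consecutive route nodes where it adds the least travel distance. This simplified version assumes Euclidean distances, an open route with fixed endpoints, and no pickup-before-delivery precedence check; `cheapest_insertion` and the coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def cheapest_insertion(route: List[Point], new_node: Point) -> Tuple[List[Point], float]:
    """Insert new_node between two consecutive nodes of route so that the
    added length d(a, new) + d(new, b) - d(a, b) is minimal.
    Returns the new route and the added length."""
    best_pos, best_cost = 1, math.inf
    for i in range(len(route) - 1):
        a, b = route[i], route[i + 1]
        cost = math.dist(a, new_node) + math.dist(new_node, b) - math.dist(a, b)
        if cost < best_cost:
            best_pos, best_cost = i + 1, cost
    return route[:best_pos] + [new_node] + route[best_pos:], best_cost


# Hypothetical route with three stops and one newly arrived order.
route = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
new_route, added = cheapest_insertion(route, (5.0, 1.0))
```

Because each node is inserted greedily, one at a time, an early choice can block a better joint placement of later nodes. This is the myopia the abstract attributes to CI.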

Abstract:
E-commerce content generation necessitates creating engaging and customer-centric material to endorse products and enhance user satisfaction. Existing methods depend on task-specific feature design, which requires a fine-tailored model for each task with complex data collection and pre-processing, and their generation capabilities are limited. Meanwhile, large language models have demonstrated their capabilities in diverse natural language processing tasks, solving multiple tasks in a unified process. To address the concerns in e-commerce content generation, we leverage the impressive generation performance of large language models and propose a framework to educate them as proficient promoters in various e-commerce-related tasks. Our framework involves two modules: self-educating proliferates task instructions and data by instructing the unaligned model, and multi-aspect instruction alignment educates the language model by embedding all e-commerce tasks in a unified framework. The proposed model, Promoter, can perform a batch of prediction and generation tasks, working as a smart and creative promoter that only requires a quick view of the customer profile. Extensive experiments from automatic and human perspectives indicate that Promoter achieves state-of-the-art performances in various generation tasks, bringing the productivity of large language models to e-commerce in an integrated pipeline.

Abstract:
Accurate time series forecasting is crucial for optimizing resource allocation, industrial production, and urban management, particularly with the growth of cyber-physical and IoT systems. However, limited training sample availability in fields like physics and biology poses significant challenges. Existing models struggle to capture long-term dependencies and to model diverse meta-knowledge explicitly in few-shot scenarios. To address these issues, we propose MetaGP, a meta-learning-based Gaussian process latent variable model that uses a Gaussian process kernel function to capture long-term dependencies and to maintain strong correlations in time series. We also introduce Kernel Association Search (KAS) as a novel meta-learning component to explicitly model meta-knowledge, thereby enhancing both interpretability and prediction accuracy. We study MetaGP on simulated and real-world few-shot datasets, showing that it is capable of state-of-the-art prediction accuracy. We also find that MetaGP can capture long-term dependencies and can model meta-knowledge, thereby providing valuable insights into complex time series patterns.

Abstract:
The problem of infection source detection deals with localizing the infection source in a given network. While the problem has been extensively studied in the past, researchers have mainly focused on simulated infection networks which may not be the correct reflection of the dynamics of real-world infections. More significantly, the existing methods assume that a rumor source lies at the center of an infection network (source-centrality), which is not always true in sparse real-world rumor networks. Due to the randomness of infection flow in such networks, the source may lie away from the center (source-skewness). There is also a lack of real-world infection network datasets to provide a true real-world perspective. Therefore, we revisit the source detection problem and contemplate a shift from mainstream simulations to a real-world paradigm. To this end, we generate two novel rumor network datasets, Cov19-RN and Use20-RN, based on COVID-19 and US Elections 2020 misinformation trends on Twitter (currently X). Besides, inspired by the technicalities inherent to real-world rumor networks, we propose a real-world oriented algorithm called Generalized Exoneration and Prominence based Age, GEPA, for rumor source detection. GEPA addresses the problem of source-skewness to detect rumor sources using the concept of generalized local prominence, which we introduce in this study. Our experiments show that GEPA significantly outperforms the state-of-the-art methods, producing detection rates of 73.6% against 61.5% of the closest competing method on Cov19-RN, and 61.5% against 52.6% of the closest competing method on Use20-RN. To the best of our knowledge, this study is the first such work to deal with source detection in real-world rumor networks and address the problem of source-skewness.

Abstract:
Knowledge Tracing (KT) refers to inferring the students’ knowledge mastery and predicting their future performance. KT serves as the foundation for personalized learning and enhances the effectiveness of educational interventions, becoming a crucial technology in intelligent tutoring systems. Recent approaches have demonstrated notable success by harnessing the potent representational capacities of deep learning. However, complex neural networks lead to entangled knowledge state embeddings, where the embedding dimensions are coupled, limiting their expressiveness and interpretability. In addition, the limitations of existing methods in Euclidean space result in distortions when capturing complex relationships among knowledge states. This distortion is reflected in the alteration of distances and geometric structures among knowledge states during the embedding process. To address these challenges, in this paper, we propose a hyperbolic hypergraph transformer with knowledge state Disentanglement for Knowledge Tracing, named DisenKT. We construct the students’ response sequences into a hypergraph, which is projected into the hyperbolic space to alleviate the representation distortion problem of questions and knowledge states. The embeddings of hierarchical knowledge states are refined through message passing between questions and students based on the proposed hyperbolic hypergraph transformer. Moreover, we are the first to disentangle knowledge states via a contrastive clustering auxiliary task, which enhances the expressiveness and interpretability of knowledge state embeddings. Extensive experimental results on three public datasets demonstrate that DisenKT outperforms state-of-the-art methods on student performance prediction and interpretability.

Abstract:
Malicious social bots achieve their malicious purposes by spreading misinformation and inciting social public opinion, seriously endangering social security and making their detection a critical concern. Recently, graph-based bot detection methods have achieved state-of-the-art (SOTA) performance. However, our research finds many isolated and poorly linked nodes in social networks, which graph-based methods cannot effectively detect. To address this problem, our research focuses on effectively utilizing node semantics and network structure to jointly detect sparsely linked nodes. Given the excellent performance of language models (LMs) in natural language understanding (NLU), we propose a novel social bot detection framework, LGB, which consists of two main components: a language model (LM) and a graph neural network (GNN). Specifically, the social account information is first extracted into unified user textual sequences, which are then used to perform supervised fine-tuning (SFT) of the language model to improve its ability to understand social account semantics. Next, the semantically enriched node representation is fed into the pre-trained GNN to further enhance the node representation by aggregating information from neighbors. Finally, LGB fuses the information from both modalities to improve the detection performance of sparsely linked nodes. Extensive experiments on two real-world datasets demonstrate that LGB consistently outperforms state-of-the-art baseline models by up to 10.95%. LGB is already online: https://botdetection.aminer.cn/robotmain.

Abstract:
Traffic prediction is a crucial component of data management systems, leveraging historical data to learn spatio-temporal dynamics for forecasting future traffic and enabling efficient decision-making and resource allocation. Despite efforts to develop increasingly complex architectures, existing traffic prediction models often struggle to generalize across diverse datasets and contexts, limiting their adaptability in real-world applications. In contrast to existing traffic prediction models, large language models (LLMs) progress mainly through parameter expansion and extensive pre-training while maintaining their fundamental structures. In this paper, we propose ST-LLM+, the graph enhanced spatio-temporal large language models for traffic prediction. Through incorporating a proximity-based adjacency matrix derived from the traffic network into the calibrated LLMs, ST-LLM+ captures complex spatio-temporal dependencies within the traffic network. The Partially Frozen Graph Attention (PFGA) module is designed to retain global dependencies learned during LLMs pre-training while modeling localized dependencies specific to the traffic domain. To reduce computational overhead, ST-LLM+ adopts the LoRA-augmented training strategy, allowing attention layers to be fine-tuned with fewer learnable parameters. Comprehensive experiments on real-world traffic datasets demonstrate that ST-LLM+ outperforms state-of-the-art models. In particular, ST-LLM+ also exhibits robust performance in both few-shot and zero-shot prediction scenarios. Additionally, our case study demonstrates that ST-LLM+ captures global and localized dependencies between stations, verifying its effectiveness for traffic prediction tasks.

Abstract:
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.

Abstract:
Vehicle movement is frequently captured in the form of GPS trajectories, i.e., sequences of timestamped GPS locations. Such data is widely used for various tasks such as travel-time estimation, trajectory recovery, and trajectory prediction. A universal vehicle trajectory model could be applied to different tasks, removing the need to maintain multiple specialized models, thereby reducing computational and storage costs. However, creating such a model is challenging when the integrity of trajectory features is compromised, i.e., in scenarios where only partial features are available or the trajectories are sparse. To address these challenges, we propose the Universal Vehicle Trajectory Model (UVTM), which can effectively adapt to different tasks without excessive retraining. UVTM incorporates two specialized designs. First, it divides trajectory features into three distinct domains. Each domain can be masked and generated independently to accommodate tasks with only partially available features. Second, UVTM is pre-trained by reconstructing dense, feature-complete trajectories from sparse, feature-incomplete counterparts, enabling strong performance even when the integrity of trajectory features is compromised. Experiments involving four representative trajectory-related tasks on three real-world vehicle trajectory datasets provide insight into the performance of UVTM and offer evidence that it is capable of meeting its objectives.

Abstract:
This paper studies the fair influence maximization problem with efficient algorithms. In particular, given a graph $G$, a community structure $\mathcal{C}$ consisting of disjoint communities, and a budget $k$, the problem asks to select a seed set $S$ ($|S|=k$) that maximizes the influence spread while narrowing the influence gap between different communities. This problem derives from some significant social scenarios, such as health interventions (e.g. suicide/HIV prevention) where individuals from underrepresented groups or LGBTQ communities may be disproportionately excluded from the benefits of the intervention. To depict the concept of fairness in the context of influence maximization, researchers have proposed various notions of fairness, where the welfare fairness notion that better balances fairness level and influence spread has shown promising effectiveness. However, the lack of efficient algorithms for optimizing the objective function under welfare fairness restricts its application to networks of only a few hundred nodes. In this paper, we modify the objective function of welfare fairness to maximize the exponentially weighted sum and the logarithmically weighted sum over all communities’ influenced fractions (utility). To achieve efficient algorithms with theoretical guarantees, we first introduce two unbiased estimators: one for the fractional power of the arithmetic mean and the other for the logarithm of the arithmetic mean. Then, by adapting the Reverse Influence Sampling (RIS) approach, we convert the optimization problem to a weighted maximum coverage problem. We also analyze the number of reverse reachable sets needed to approximate the fair influence at a high probability. Finally, we present an efficient algorithm that guarantees $1-1/e-\varepsilon$ (positive objective function) or $1+1/e+\varepsilon$ (negative objective function) approximation for any small $\varepsilon>0$. Experiments demonstrate that our proposed algorithm could efficiently handle large-scale networks with good performance.

Abstract:
Large language models (LLMs) have garnered unprecedented advancements across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an effective method for substantially scaling up model capacity with minimal computation overhead, gaining significant attention from academia and industry. Despite its growing prevalence, there lacks a systematic and comprehensive review of the literature on MoE. This survey seeks to bridge that gap, serving as an essential resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the MoE layer, followed by proposing a new taxonomy of MoE. Next, we overview the core designs for various MoE models including both algorithmic and systemic aspects, alongside collections of available open-source implementations, hyperparameter configurations and empirical evaluations. Furthermore, we delineate the multifaceted applications of MoE in practice, and outline some potential directions for future research.

Abstract:
Traditional spectral clustering methods struggle with scalability and robustness in large datasets due to their reliance on similarity matrices and eigenvalue decomposition. We introduce two innovative models: Rcut-based Coordinate Descent Clustering (R-CDC) and Ncut-based Doubly Stochastic Clustering (N-DSC). These models integrate graph construction and segmentation into a unified process optimized through the coordinate descent method, significantly enhancing clustering efficacy. A novel graph structure enhances robustness against noise and outliers, simplifying the clustering process and improving outcomes across diverse datasets. Our extensive experiments show that these models surpass existing spectral clustering techniques in managing large-scale data and complex structures.
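
For readers unfamiliar with the two objectives named above, the following is a minimal sketch of how the ratio cut (Rcut) and normalized cut (Ncut) of a given partition are commonly evaluated. It is not the R-CDC or N-DSC optimization procedure. The function name `cut_objectives` is hypothetical, and the definitions used here omit the factor 1/2 that some texts include.

```python
def cut_objectives(weights, labels):
    """Return (ratio_cut, normalized_cut) for a partition of a weighted graph.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    labels:  cluster label of each node.
    Assumes every cluster contains at least one edge endpoint (volume > 0).
    """
    n = len(weights)
    ratio_cut = 0.0
    normalized_cut = 0.0
    for cluster in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == cluster]
        outside = [j for j in range(n) if labels[j] != cluster]
        # Total weight of edges leaving the cluster.
        cut = sum(weights[i][j] for i in members for j in outside)
        # Volume: sum of the degrees of the cluster's nodes.
        volume = sum(weights[i][j] for i in members for j in range(n))
        ratio_cut += cut / len(members)   # Rcut balances by cluster size
        normalized_cut += cut / volume    # Ncut balances by cluster volume
    return ratio_cut, normalized_cut


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    W[u][v] = W[v][u] = 1.0

rcut, ncut = cut_objectives(W, [0, 0, 0, 1, 1, 1])
```

Both objectives reward partitions that cut few edges, but they penalize unbalanced clusters differently, which is why the two models above are formulated separately.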

Abstract:
Learning to Rank (LTR) aims to develop a ranking model from supervised data to rank a set of items using machine learning techniques. However, since the losses and ranking metrics involved in LTR are both based on ranking, they are neither continuous nor differentiable, making it challenging to optimize them using gradient descent algorithms. Various surrogate losses have been proposed to address this issue, yet their connection with ranking metrics is often loose, leading to inconsistencies between optimization objectives and evaluation metrics. In this study, we introduce NeuralLoss, a learnable and pretrained surrogate loss. By undergoing training on data structured around ranking metrics, NeuralLoss approximates these ranking metrics, aligning its optimization objectives with evaluation metrics. We employ Transformer to construct the surrogate model and ensure permutation invariance. The pretrained surrogate loss facilitates end-to-end training of ranking models using gradient descent algorithms and can approximate various ranking metrics by adjusting the training data. In this paper, we employ NeuralLoss to approximate NDCG and Recall, demonstrating its performance in both document retrieval and cross-modal retrieval tasks. Experimental results indicate that our approach achieves excellent performance and exhibits strong competitiveness across these tasks.
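
To illustrate why ranking metrics are hard to optimize directly, the sketch below computes NDCG@k from model scores. It uses one common formulation (exponential gain, logarithmic discount) as an assumption; the function names are hypothetical and this is not the NeuralLoss model itself.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a relevance list already in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k):
    """NDCG@k of the ranking induced by sorting items by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevances = [2, 1, 0]
base = ndcg_at_k([3.0, 2.0, 1.0], relevances, k=3)
# A small score change that keeps the order leaves the metric unchanged,
# so the gradient with respect to the scores is zero almost everywhere.
nudged = ndcg_at_k([3.1, 2.0, 1.0], relevances, k=3)
# A change that flips the order makes the metric jump discontinuously.
flipped = ndcg_at_k([1.0, 2.0, 3.0], relevances, k=3)
```

Because the metric depends on the scores only through the sort order, it is piecewise constant in the scores, which is the property that motivates a learned surrogate.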

Abstract:
Traffic accidents pose a significant risk to human health and property safety. To address this issue, predicting their risks has garnered growing interest. We argue that a desired prediction solution should demonstrate resilience to the complexity of traffic accidents. In particular, it should adequately consider the streaming nature of data and key related aspects, such as regional background, accurately capture both proximity and similarity while bridging the disparities, and effectively address the sparsity. However, these factors are often overlooked or difficult to incorporate. In this paper, we propose a novel streaming multi-granularity hierarchical spatio-temporal network. First, we incorporate remote sensing data, facilitating the creation of a hierarchical multi-granularity structure and the comprehension of regional background. We construct multiple high-level risk prediction tasks to enhance the model’s ability to cope with sparsity. Subsequently, to capture and bridge spatial proximity and semantic similarity, region features and the multi-view graph are encoded to distill effective representations, followed by a graph-enhanced representation alignment module that reconciles their disparities. Finally, an alternating experience replay with a dual-memory buffer is employed to accommodate streaming data scenarios. Extensive experiments on two real datasets verify the superiority of our model against the state-of-the-art methods.

Abstract:
Many question-answering problems can be approached as textual entailment tasks, where the hypotheses are formed by the question and candidate answers, and the premises are derived from an external knowledge base. However, current neural methods often lack transparency in their decision-making processes. Moreover, first-order logic methods, while systematic, struggle to integrate unstructured external knowledge. To address these limitations, we propose a neuro-symbolic reasoning framework called Final, which combines FIrst-order logic with NAtural Logic for question answering. Our framework utilizes first-order logic to systematically decompose hypotheses and natural logic to construct reasoning paths from premises to hypotheses, employing bidirectional reasoning to establish links along the reasoning path. This approach not only enhances interpretability but also effectively integrates unstructured knowledge. Our experiments on three benchmark datasets, namely QASC, WorldTree, and WikiHop, demonstrate that Final outperforms existing methods in commonsense reasoning and reading comprehension tasks, achieving state-of-the-art results. Additionally, our framework also provides transparent reasoning paths that elucidate the rationale behind the correct decisions.

Abstract:
The Kirchhoff index, which is the sum of the resistance distance between every pair of nodes in a network, is a key metric for gauging network performance, where lower values signify enhanced performance. In this paper, we study the problem of minimizing the Kirchhoff index by adding edges. We first provide a greedy algorithm for solving this problem and give an analysis of its quality based on the bounds of the submodularity ratio and the curvature. Then, we introduce a gradient-based greedy algorithm as a new paradigm to solve this problem. To reduce the computation cost, we leverage geometric properties, convex hull approximation, and approximation of the projected coordinate of each point. To further improve this algorithm, we use pre-pruning and fast update techniques, making it particularly suitable for large networks. Our proposed algorithms have nearly-linear time complexity. We provide extensive experiments on ten real networks to evaluate the quality of our algorithms. The results demonstrate that our proposed algorithms outperform the state-of-the-art methods in terms of efficiency and effectiveness. Moreover, our algorithms are scalable to large graphs with over 5 million nodes and 12 million edges.
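
As a concrete illustration of the quantity being minimized, the sketch below computes the Kirchhoff index of a small connected graph through the identity Kf = n * trace(L^+), where L^+ is the pseudoinverse of the graph Laplacian, and runs a naive exhaustive greedy edge addition. This baseline re-evaluates the index for every candidate edge and is only meant to show the objective; it is not the gradient-based, nearly-linear-time algorithm described above. The function names are hypothetical.

```python
import numpy as np


def kirchhoff_index(adj):
    """Kirchhoff index of a connected undirected graph given its adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    laplacian = np.diag(adj.sum(axis=1)) - adj
    # For a connected graph, the sum of all pairwise resistance distances
    # equals n times the trace of the Laplacian pseudoinverse.
    return float(n * np.trace(np.linalg.pinv(laplacian)))


def greedy_add_edges(adj, k):
    """Add up to k missing edges, each time picking the edge that lowers the index most."""
    adj = np.array(adj, dtype=float)
    n = adj.shape[0]
    added = []
    for _ in range(k):
        best = None
        for u in range(n):
            for v in range(u + 1, n):
                if adj[u, v] == 0:
                    adj[u, v] = adj[v, u] = 1.0      # tentatively add the edge
                    value = kirchhoff_index(adj)
                    adj[u, v] = adj[v, u] = 0.0      # undo
                    if best is None or value < best[0]:
                        best = (value, u, v)
        if best is None:                              # graph is already complete
            break
        _, u, v = best
        adj[u, v] = adj[v, u] = 1.0
        added.append((u, v))
    return added, kirchhoff_index(adj)


# Path graph 0 - 1 - 2: resistance distances are 1, 1 and 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
before = kirchhoff_index(path)
added, after = greedy_add_edges(path, k=1)
```

The exhaustive candidate scan costs a pseudoinverse per candidate edge, which makes clear why faster evaluation and pruning matter on large networks.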

Abstract:
Learned indexes, emerging as a promising alternative to traditional indexes like B+Tree, utilize machine learning models to enhance query performance and reduce memory usage. However, the widespread adoption of learned indexes is limited by their expensive training cost and the need for high accuracy of internal models. Although some studies attempt to optimize the building process of these learned indexes, existing methods are restrictive in scope and applicability. They are usually tailored to specific index types and heavily rely on pre-trained model knowledge, making deployment a challenging task. In this work, we introduce the Learned Index Optimization Framework (LIOF), a general and easily integrated solution aimed at expediting the training process and improving the accuracy of index model for one-dimensional and multi-dimensional learned indexes. The optimization of LIOF for the learned indexes is intuitive, directly providing optimized parameters for index models based on the distribution of node data. By leveraging the correlation between key distribution and node model parameters, LIOF significantly reduces the training epochs required for each node model. Initially, we introduce an optimization strategy inspired by optimization-based meta-learning to train the LIOF to generate optimized initial parameters for index node models. Subsequently, we present a data-driven encoder and a parameter-centric decoder network, which adaptively translate key distribution into a latent variable representation and decode it into optimized node model initialization. Additionally, to further utilize characteristics of key distribution, we propose a monotonic regularizer and focal loss, guiding LIOF training towards efficiency and precision. Through extensive experimentation on real-world and synthetic datasets, we demonstrate that LIOF provides substantial enhancements in both training efficiency and the predictive accuracy for learned indexes.

Abstract:
The Single-Source Personalized PageRank (SSPPR) problem is widely used in information retrieval and recommendation systems. Traditional algorithms assume full knowledge of the network, making them inapplicable to online social networks (OSNs), where the topology is unknown, and users can only explore the network step by step via APIs. The only feasible approach for SSPPR in OSNs is Monte Carlo (MC) simulation, but traditional MC methods rely on static sampling, which lacks flexibility, delays feedback, and overestimates the number of required random walks. To address these limitations, we propose PANDA (Single-Source Personalized PageRank on OSNs with Rademacher Average), a progressive sampling algorithm. PANDA iteratively samples random walks in batches, estimating accuracy dynamically using Rademacher Average from statistical learning theory. This data-dependent approach allows for early termination once the desired accuracy is met. Additionally, PANDA features a dynamic sampling schedule to optimize efficiency. Empirical studies show that PANDA significantly outperforms existing methods, achieving the same accuracy with far greater efficiency.
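
The batch-wise Monte Carlo idea can be sketched as follows. Each random walk starts at the source, stops with probability `alpha` at every step, and the fraction of walks ending at a node estimates its PPR value. The stopping rule below (stop once estimates change by less than `tol` between batches) is a simple stand-in assumed for illustration only; it carries no accuracy guarantee and is not the Rademacher-average bound used by PANDA. All names and parameter defaults are hypothetical, and `neighbors` stands in for a step-by-step API call.

```python
import random
from collections import Counter


def sample_walk(neighbors, source, alpha, rng):
    """Run one random walk with restart probability alpha; return its end node."""
    node = source
    while rng.random() >= alpha:
        nbrs = neighbors(node)       # one "API call" to explore the network
        if not nbrs:                 # assumption: a walk ends at a dangling node
            break
        node = rng.choice(nbrs)
    return node


def progressive_ppr(neighbors, source, alpha=0.2, batch_size=1000,
                    tol=0.01, max_batches=50, seed=0):
    """Estimate single-source PPR by sampling walks in batches."""
    rng = random.Random(seed)
    counts = Counter()
    total = 0
    previous = {}
    estimate = {}
    for _ in range(max_batches):
        for _ in range(batch_size):
            counts[sample_walk(neighbors, source, alpha, rng)] += 1
        total += batch_size
        estimate = {v: c / total for v, c in counts.items()}
        nodes = set(estimate) | set(previous)
        change = max(abs(estimate.get(v, 0.0) - previous.get(v, 0.0)) for v in nodes)
        # Heuristic early termination (NOT a Rademacher-average bound).
        if previous and change < tol:
            break
        previous = estimate
    return estimate, total


# Two nodes linked to each other; with alpha = 0.5 the exact PPR of the source is 2/3.
graph = {0: [1], 1: [0]}
estimate, walks_used = progressive_ppr(lambda v: graph[v], source=0, alpha=0.5)
```

The point of sampling in batches is that the number of walks is decided from the data observed so far rather than fixed in advance by a worst-case bound.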

Abstract:
The sparse portfolio optimization (SPO) problem is increasingly crucial in portfolio management, focusing on selecting a few stocks with the potential for strong market performance. However, sparse portfolio strategies often face significant short-term drawdowns during periods of market volatility. To this end, a news-driven portfolio strategy offers valuable insights to capture sudden market changes. Nevertheless, it encounters two main challenges: how to reasonably map the relationships between news and stocks and how to effectively utilize the irregular timing of news releases. To tackle the SPO problem in fluctuating markets while addressing these challenges, we propose a novel news-driven sparse portfolio strategy, named SPIN. Specifically, SPIN not only leverages industry-specific group structures existing among stocks for a more reasonable news-stock mapping but also models news sequential patterns based on our devised novel news-driven forecaster to handle the irregularity of news releases. We rigorously prove that SPIN achieves a sub-linear regret. Extensive experiments on three real-world datasets demonstrate SPIN's superiority over state-of-the-art portfolio strategies in terms of cumulative wealth and short-term drawdowns.

Abstract:
Federated learning (FL), a decentralized machine learning approach, offers great performance while alleviating autonomy and confidentiality concerns. Despite FL’s popularity, how to deal with missing values in a federated manner is not well understood. In this work, we initiate a study of federated imputation of missing values, particularly in complex scenarios, where missing data heterogeneity exists and the state-of-the-art (SOTA) approaches for federated imputation suffer from significant loss in imputation quality. We propose Cafe, a personalized FL approach for missing data imputation. Cafe is inspired by the observation that heterogeneity can induce differences in observable and missing data distribution across clients, and that these differences can be leveraged to improve the imputation quality. Cafe computes personalized weights that are automatically calibrated for the level of heterogeneity, which can remain unknown, to develop personalized imputation models for each client. An extensive empirical evaluation over a variety of settings demonstrates that Cafe matches the performance of SOTA baselines in homogeneous settings while significantly outperforming the baselines in heterogeneous settings.

Abstract:
Federated Class Incremental Learning (FCIL) has emerged as a new paradigm due to its applicability in real-world scenarios. In FCIL, clients continuously generate new data with unseen class labels and do not share local data due to privacy restrictions, and each client’s class distribution evolves dynamically and independently. However, existing work still faces two significant challenges. Firstly, current methods fail to strike a good balance between maintaining sound anti-forgetting effects over old data (stability) and ensuring good adaptability for new tasks (plasticity). Secondly, some FCIL methods overlook that the incremental data will also have a non-identical label distribution, leading to poor performance. This paper proposes CGoFed, which includes relax-constrained gradient update and cross-task gradient regularization modules. The relax-constrained gradient update prevents forgetting the knowledge about old data while quickly adapting to the new data by constraining the gradient update direction to a gradient space that minimizes interference with historical tasks. The cross-task gradient regularization also finds applicable historical models from other clients and trains a personalized global model to address the non-identical label distribution problem. The results demonstrate that CGoFed performs well in alleviating catastrophic forgetting and improves model performance by 8%-23% compared with the SOTA comparison method.

Abstract:
Leveraging Large Language Models as recommenders, referred to as LLMRec, is gaining traction and brings novel dynamics for modeling user preferences, particularly for cold-start users. However, existing LLMRec approaches primarily focus on text semantics and overlook the crucial aspect of incorporating collaborative information from user-item interactions, leading to potentially sub-optimal performance in warm-start scenarios. To ensure superior recommendations across both warm and cold scenarios, we introduce CoLLM, an innovative LLMRec approach that explicitly integrates collaborative information for recommendations. CoLLM treats collaborative information as a distinct modality, directly encoding it from well-established traditional collaborative models, and then tunes a mapping module to align this collaborative information with the LLM's input text token space for recommendations. By externally integrating traditional models, CoLLM ensures effective collaborative information modeling without modifying the LLM itself, providing the flexibility to adopt diverse collaborative information modeling mechanisms. Extensive experimentation validates that CoLLM adeptly integrates collaborative information into LLMs, resulting in enhanced recommendation performance.

Abstract:
Sequential recommendation systems are integral to discerning temporal user preferences. Yet, the task of learning from abbreviated user interaction sequences poses a notable challenge. Data augmentation has been identified as a potent strategy to enhance the informational richness of these sequences. Traditional augmentation techniques, such as item randomization, may disrupt the inherent temporal dynamics. Although recent advancements in reverse chronological pseudo-item generation have shown promise, they can introduce temporal discrepancies when assessed in a natural chronological context. In response, we introduce a sophisticated approach, Bidirectional temporal data Augmentation with pre-training (BARec). Our approach leverages bidirectional temporal augmentation and knowledge-enhanced fine-tuning to synthesize authentic pseudo-prior items that retain user preferences and capture deeper item semantic correlations, thus boosting the model’s expressive power. Our comprehensive experimental analysis on five benchmark datasets confirms the superiority of BARec across both short and elongated sequence contexts. Moreover, theoretical examination and case study offer further insight into the model’s logical processes and interpretability.

Abstract:
Multi-tenant DBMSs are used by cloud providers for their Database-as-a-Service products. They could be single-node DBMSs installed in virtual machines, SQL-on-Hadoop systems or classic parallel relational DBMSs running on top of a shared-nothing or shared-disk architecture. For a cloud provider, it is interesting to measure these systems’ capability of dealing with multi-tenant workloads, i.e., taking advantage of the statistical multiplexing to obtain economic gain while being attractive by providing a good quality of service and a low bill to the tenants. In this paper, we present MTD-DS benchmark (with MTD for Multi-Tenant parallel DBMSs and DS for Decision Support). MTD-DS extends TPC-DS by adding a multi-tenant query workload generator, a performance Service Level Objectives generator, configurable Database-as-a-Service pricing models, and new metrics to measure the potential capability of a multi-tenant parallel DBMS in obtaining the best trade-off between the provider's benefit and the tenants’ satisfaction. Example experimental results have been produced to show the relevance and the feasibility of the MTD-DS benchmark.

Abstract:
Methods based on variational Bayes theory are widely used to detect community structures in networks. In recent years, many related methods have emerged that provide valuable insights into variational Bayes theory. Remarkably, a fundamental assumption remains incomprehensible. Variational Bayes-based methods typically employ a posterior distribution that follows a Gaussian distribution to approximate the unknown prior distribution. However, the complexity and irregularity of node distributions in real-world networks prompt us to consider what characteristics of network information are suitable for the posterior distribution. Mathematically, inappropriate low- and high-frequency signals in expectation inference and variance inference can intensify the adverse effects of community distortion and ambiguity. To analyze these two phenomena and propose reasonable countermeasures, we conduct an empirical study. It is found that appropriately compressing low-frequency signals during expectation inference and amplifying high-frequency signals during variance inference are effective strategies. Based on these two strategies, this paper proposes a novel variational Bayes plug-in, namely VBPG, to boost the performance of existing variational Bayes-based community detection methods. Specifically, we modulate the frequency signals during expectation and variance inference to generate a new Gaussian distribution. This strategy improves the fitting accuracy between the posterior distribution and the unknown true distribution without altering the modules of existing methods. The comprehensive experimental results validate that methods using VBPG achieve competitive performance improvements in most cases.

Abstract:
Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer poor effectiveness due to the issues occurring in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency, being at least 5 orders of magnitude faster than cell-level solutions and 3.5× faster than existing column-level methods.

Abstract:
Learning from data streams collected sequentially over time is widespread in real-world applications. Previous methods typically assume that the data stream has a feature space with a fixed or clearly defined evolution pattern, as well as a balanced class distribution. However, in many practical scenarios, such as environmental monitoring systems, the frequency of anomalous events is significantly imbalanced compared to normal ones and the feature space dynamically changes due to ecological evolution and sensor lifespan. To alleviate this important but rarely studied problem, we propose the Adaptive Learning in Imbalanced data streams with Unpredictable feature evolution (ALIU) algorithm. As data streams with imbalanced class distribution arrive, ALIU first mitigates the model's bias for the majority class by reweighting the adaptive gradient descent magnitudes between different classes. Then, a new loss function is proposed that simultaneously focuses on misclassifications and maintains model robustness. Further, when imbalanced data streams arrive with feature evolutions, we reuse the previously learned model and update the incomplete and augmented features by adopting the adaptive gradient strategy and ensemble method, respectively. Finally, we utilize the projected technique to build a sparse yet efficient model. Based on a few common and mild assumptions, we theoretically show that ALIU satisfies a sub-linear regret bound under both convex and strongly convex loss functions and that the performance of the model can be improved with the assistance of old features. Besides, extensive experimental results further demonstrate the effectiveness of our proposed algorithm.

Abstract:
Multi-scenario and multi-task recommendation systems efficiently facilitate knowledge transfer across different scenarios and tasks. However, many existing approaches inadequately incorporate personalized information across users and scenarios. Moreover, the conversion rate (CVR) task in multi-task learning often encounters challenges like sample selection bias, resulting from systematic differences between the training and inference sample spaces, and data sparsity due to infrequent clicks. To address these issues, we propose Adaptive Entire-space Multi-scenario Multi-task Transfer Learning model (AEM$^2$TL) with four key modules: 1) Scenario-CGC (Scenario-Customized Gate Control), 2) Task-CGC (Task-Customized Gate Control), 3) Personalized Gating Network, and 4) Entire-space Supervised Multi-Task Module. AEM$^2$TL employs a multi-gate mechanism to effectively integrate shared and specific information across scenarios and tasks, enhancing prediction adaptability. To further improve task-specific personalization, it incorporates personalized prior features and applies a gating mechanism that dynamically scales the top-layer neural units. A novel post-impression behavior decomposition technique is designed to leverage all impression samples across the entire space, mitigating sample selection bias and data sparsity. Furthermore, an adaptive weighting mechanism dynamically allocates attention to tasks based on their relative importance, ensuring optimal task prioritization. Extensive experiments on one industrial and two real-world public datasets indicate the superiority of AEM$^2$TL over state-of-the-art methods.

Abstract:
Deep learning (DL) is increasingly viewed as a foundational methodology for advancing Artificial Intelligence (AI). However, its interpretability remains limited, and it often underperforms in certain fields due to its lack of human-like characteristics. Consequently, leveraging insights from Brain and Cognitive Science (BCS) to understand and advance DL has become a focal point for researchers in the DL community. However, BCS is a diverse discipline where existing studies often concentrate on cognitive theories within their respective domains. These theories are typically grounded in certain assumptions, complicating comparisons between different approaches. Therefore, this review is intended to provide a comprehensive landscape of more than 300 papers on the intersection of DL and BCS, grounded in the DL community. Unlike previous reviews that are based on sub-disciplines of Cognitive Science, this article aims to establish a unified framework encompassing all aspects of DL inspired by BCS, offering insights into the symbiotic relationship between DL and BCS. Additionally, we present a forward-looking perspective on future research directions, with the intention of inspiring further advancements in AI research.

Abstract:
Graph kernels are a significant class of tools for measuring the similarity of graph data, which is the basis of a wide range of graph learning methods. However, graph kernels often suffer from high computing overhead. With the rise of cloud computing, it is desirable to transfer the computing burden to a server with abundant computing resources to reduce the cost for local machines. Nonetheless, under the honest-but-curious cloud assumption, the server may peek at the data, raising privacy concerns. To eliminate the risk of data privacy leakage, we propose CloudRGK to securely perform the Random walk Graph Kernel (RGK), one of the most well-known graph kernels, on the cloud. We first prove that edge- and vertex-labeled graphs can be transformed into an equivalent matrix representation. Afterward, we prove that the cloud can perform the core operations in RGK on the encrypted graphs without feature information loss. Evaluations on real-world graph data demonstrate that our strategy significantly reduces the overhead of the local party to perform RGK without performance degradation. Meanwhile, it introduces only a small amount of extra computation cost. To the best of our knowledge, it is the first work towards private graph kernel computation on the cloud.
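
For readers unfamiliar with the kernel being outsourced, the following is a minimal plaintext sketch of a geometric random walk kernel on two unlabeled graphs, computed through the Kronecker (direct) product of their adjacency matrices. It does not show any encryption and is not the CloudRGK protocol; the function names, the decay `lam`, and the truncation `max_len` are assumptions for illustration.

```python
def kronecker(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def random_walk_kernel(A1, A2, lam=0.1, max_len=10):
    """Truncated geometric random walk kernel.

    Computes sum_{k=0..max_len} lam^k * 1^T (A1 (x) A2)^k 1, i.e. a
    decayed count of pairs of equal-length walks in the two graphs.
    """
    Ax = kronecker(A1, A2)           # adjacency of the direct product graph
    size = len(Ax)
    v = [1.0] * size                 # holds Ax^k applied to the all-ones vector
    total, weight = 0.0, 1.0
    for _ in range(max_len + 1):
        total += weight * sum(v)
        v = [sum(row[j] * v[j] for j in range(size)) for row in Ax]
        weight *= lam
    return total
```

The product matrix grows with the product of the two vertex counts, which is one way to see why the computation is a natural candidate for offloading.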

Abstract:
Unlike the traditional recommender systems that rely on historical data such as clicks or purchases, a conversational recommender system (CRS) aims to provide a personalized recommendation through a natural conversation. The conversational interaction facilitates capturing not only explicit preference from mentioned items but also implicit states, such as a user’s current situation and emotional states from a dialogue context. Nevertheless, existing CRSs fall short of fully exploiting a dialogue context since they primarily derive explicit user preferences from the items and item-attributes mentioned in a conversation. To address this limitation and attain a comprehensive understanding of a dialogue context, we propose CoreSense, a conversational recommender system enhanced with social commonsense knowledge. In other words, CoreSense exploits the social commonsense knowledge graph ATOMIC to capture the user’s implicit states, such as a user’s current situation and emotional states, from a dialogue context. Thus, the social commonsense knowledge-augmented CRS can provide a more appropriate recommendation from a given dialogue context. Furthermore, we enhance the collaborative filtering effect by utilizing the user’s states inferred from commonsense knowledge as an improved criterion for retrieving other dialogues of similar interests. Extensive experiments on CRS benchmark datasets show that CoreSense provides human-like recommendations and responses based on inferred user states, achieving significant performance improvements.

Abstract:
Recommendation systems typically rely on users’ historical behavior to infer their preferences. However, when new entries emerge, the system cannot make accurate predictions due to the lack of historical data. This is known as the “cold-start” problem, which not only limits the exposure of new items but also severely impacts the first experience of new users. Meta-learning has emerged as a promising approach to address this issue, but existing methods have limitations in dealing with the differences in user preferences and sparse monitoring data. To overcome these limitations, Dual enhanced Meta-learning with Adaptive Task Sampling is proposed. First, we propose an embedding enhancement strategy for cold nodes. Specifically, we map the cold-start embeddings into the warm space based on the common features shared across all nodes, and then add uniform noise to create the contrastive views. This strategy injects warm co-occurrence signals into the content of cold nodes, effectively enriching the feature space of cold nodes. Second, we introduce an adaptive task scheduler to measure the effectiveness of different meta-tasks and filter out the noise from invalid tasks. We assign different sampling probabilities to the tasks based on the learning process (gradient similarity) and the learning result (loss) of the meta-tasks. Finally, we treat the above two modules as auxiliary tasks for the main meta-model and carry out joint optimization through a multi-task learning framework. Experiments in three cold-start scenarios show that our approach outperforms the most advanced baselines, including traditional methods, HIN-based methods, and meta-learning-based methods.

Abstract:
To address the issue of label sparsity in heterogeneous graphs (HGs), heterogeneous graph few-shot learning (HGFL) has recently emerged. HGFL aims to extract meta-knowledge from source HGs with rich-labeled data and transfer it to a target HG, facilitating learning new classes with few-labeled training data and improving predictions on unlabeled testing data. Existing methods typically assume the same distribution across the source HG, training data, and testing data. However, in practice, distribution shifts in HGFL are inevitable due to (1) the scarcity of source HGs that match the target HG's distribution, and (2) the unpredictable data generation mechanism of the target HG. Such distribution shifts can degrade the performance of existing methods, leading to a novel problem of out-of-distribution (OOD) generalization in HGFL. To address this challenging problem, we propose COHF, a Causal OOD Heterogeneous graph Few-shot learning model. In COHF, we first adopt a bottom-up data generative perspective to identify the invariance principle for OOD generalization. Then, based on this principle, we design a novel variational autoencoder-based heterogeneous graph neural network (VAE-HGNN) to mitigate the impact of distribution shifts. Finally, we propose a novel meta-learning framework that incorporates VAE-HGNN to effectively transfer meta-knowledge in OOD environments. Extensive experiments on seven real-world datasets demonstrate the superior performance of COHF over the state-of-the-art methods.

Abstract:
High-dimensional and incomplete (HDI) data are frequently encountered in diverse real-world applications involving complex interactions among numerous nodes. Approaches based on latent feature analysis (LFA) have proven effective in performing representation learning in HDI data. Nevertheless, they cannot handle the high-order connectivity among nodes in HDI data well, resulting in severe accuracy loss. To address this issue, we present a novel model in this paper, namely Graph Linear Convolution Pooling Network (GLCPN). The proposed GLCPN builds on three ideas. First, it leverages simplified graph convolutions to efficiently capture high-order connectivity among nodes for learning representations of matrix factorization. Second, a simple yet effective priori convolution operator is adopted by each graph neural layer to capture node-node collaboration for aggregation. Third, a locality-enhanced pooling scheme is designed to holistically utilize multi-layer representations of the neighborhood. Therefore, GLCPN can effectively acquire the hidden information in HDI data with high efficiency. In addition, we have conducted a theoretical analysis demonstrating that the proposed GLCPN is more expressive than existing graph neural networks for HDI data. Extensive experiments have been further conducted on ten well-established HDI datasets from various applications. The experimental results demonstrate that the proposed GLCPN significantly outperforms state-of-the-art models for learning representations in HDI data, as evaluated by accuracy and efficiency metrics.

Abstract:
Few-shot classification is increasingly relevant in emerging applications, such as university course classification in intelligent education systems. University course classification helps students acquire specific skills and comprehend course purposes, and assists departments in defining training goals. However, classifying frontier courses presents challenges due to the absence of labels and descriptions. Few-shot learning addresses this by acquiring meta-knowledge. Heterogeneous graphs (HGs), rich in semantic information, introduce complexities that make few-shot learning particularly challenging. Addressing this problem, we propose a subgraph-aware convolutional few-shot classification method on HGs (HG-SCC). We first formalize the subgraph sampling strategy for HGs and different views under meta-paths. Then, a spectral-based graph convolution with an adaptive number of layers is designed for personalized node embedding. Furthermore, a high-order convolution operation with classes as nodes is designed to increase the class representation coverage. Modeling subgraph centrality, combined with node features, captures structural information, improving awareness of each sampled subgraph, thus alleviating sparsity in new class labels and enhancing classification accuracy. Euclidean distance-based and task-affected cosine similarity-based classifiers under different meta-paths are proposed, with stacking introduced to blend multiple classifiers based on subgraph features. Experimental results show that our method has high performance in course classification and also outperforms state-of-the-art methods on benchmark datasets.

Abstract:
Cold-start recommendation is a long-standing challenge when presenting potential preferred items to new users. Most empirical studies leverage side information to promote cold-start recommendation. In this work, we focus on cross-domain cold-start recommendation, which aims to provide suggestions to those non-overlapping users who have only interacted in the source domain and are viewed as new users in the target domain. Pre-training and then mapping is the common solution for cross-domain cold-start recommendation. The former learns domain-specific user preference, and the latter transfers preference knowledge from the source to the target domain. Despite their effectiveness, we argue that current mapping-based methods still have the following limitations. First, current mapping functions, whether common or personalized transfer mappings, fail to fully consider the similarity of user behavioral patterns. Second, sparse supervision signals from the limited overlapping users lead to insufficient mapping function learning for recommendation. To tackle the above limitations, we propose a novel MACDR model for cross-domain cold-start recommendation. Specifically, MACDR consists of two elaborate modules: a Prototype enhanced Mixture-Of-Experts (PMOE) based mapping function and a Preference Distribution Alignment (PDA) enhanced optimization. PMOE is designed to balance the transfer patterns of common and personalized preferences, on the basis that similar users share similar preference transfer. Furthermore, to alleviate the sparse supervision issue, PDA is designed to explore the utilization of non-overlapping users in an unsupervised manner based on the prototype distribution alignment technique. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed method.

Abstract:
The rapid growth of the internet has led to an alarming increase in the dissemination of fake news, which has had many negative effects on society. Various methods have been proposed for detecting fake news. However, these approaches suffer from several limitations. First, most existing works only consider news as separate entities and do not consider the correlations between fake news and real news. Moreover, these works are usually conducted in the Euclidean space, which is unable to capture complex relationships between news, in particular the hierarchical relationships. To tackle these issues, we introduce a novel Multi-modal Hyperbolic Representation framework (MHR) for fake news detection. Specifically, we capture the correlations between news for graph construction to arrange and analyze different news. To fully utilize the multi-modal characteristics, we first extract the textual and visual information, and then design a Lorentzian multi-modal fusion module to fuse them as the node information in the graph. By utilizing the fully hyperbolic graph neural networks, we learn the graph’s representation in hyperbolic space, followed by a detector for detecting fake news. The experimental results on three real-world datasets demonstrate that our proposed MHR model achieves state-of-the-art performance, indicating the benefits of hyperbolic representation.

Abstract:
Truth discovery endeavors to extract valuable information from multi-source data through weighted aggregation. Some studies have integrated differential privacy techniques into traditional truth discovery algorithms to protect data privacy. However, due to the neglect of outliers and limitations in budget allocation, these schemes still need improvement in the accuracy of discovery results. To solve these challenges, we propose a privacy-preserving scheme called PriPTD to achieve secure and accurate truth discovery services over crowdsourced data streams. Instead of assuming that worker weights are always stable between two neighboring timestamps, we delve deeper to consider outliers where worker weights change rapidly. Accordingly, we develop an outlier-aware weight estimation method with a time series model to capture and handle these outliers. Furthermore, to ensure data utility under a limited budget, we devise a weight-aware budget allocation algorithm. Its core idea is that timestamps with higher importance consume a larger proportion of the remaining budget. Additionally, we design a noise-aware error adjustment approach to mitigate the adverse effects of introduced noise on accuracy. Theoretical analysis and extensive experiments validate our scheme. Final comparative experiments against existing works confirm that our scheme achieves more accurate truth discovery while preserving privacy.

Abstract:
Web attacks are a major threat to cyberspace security, so web attack detection has become a critical task. Traditional supervised learning methods learn features of web attacks with large amounts of high-confidence labeled data, which are extremely expensive in the real world. Pre-trained models offer a novel solution with their ability to learn generic features on large unlabeled datasets. However, designing and deploying a pre-trained model for real-world web attack detection remains challenging. In this paper, we present a pre-trained model for web attack detection, including a pre-processing module, a pre-training module, and a deployment scheme. Our model significantly improves classification performance on several web attack detection datasets. Moreover, we deploy the model in real-world systems and show its potential for industrial applications.

Abstract:
Inferring contextual information such as demographics from historical transactions is valuable to public agencies and businesses. Existing methods are data-hungry and do not work well when the available records of transactions are sparse. We consider here specifically inference of demographic information using limited historical grocery transactions from a few random trips that a typical business or public service organization may see. We propose a novel method called DemoMotif to build a network model from heterogeneous data and identify subgraph patterns (i.e., motifs) that enable us to infer demographic attributes. We then design a novel motif context selection algorithm to find specific node combinations significant to certain demographic groups. Finally, we learn representations of households using these selected motif instances as context, and employ a standard classifier (e.g., SVM) for inference. For evaluation purposes, we use three real-world consumer datasets, spanning different regions and time periods in the U.S. We evaluate the framework for predicting three attributes: ethnicity, seniority of household heads, and presence of children. Extensive experiments and case studies demonstrate that DemoMotif is capable of inferring household demographics using only a small number (e.g., fewer than 10) of random grocery trips, significantly outperforming the state-of-the-art.

Abstract:
This paper addresses the pressing need for effective k-tips decomposition in dynamic bipartite graphs, a crucial aspect of real-time applications that analyze and mine binary relationship patterns. Recognizing the dynamic nature of these graphs, our study is the first to provide a solution for k-tips decomposition in such evolving environments. We introduce a pioneering projection-based algorithm, coupled with advanced incremental maintenance strategies for edge modifications, tailored specifically for dynamic graphs. This novel approach not only fills a significant gap in the analysis of dynamic bipartite graphs but also substantially enhances the accuracy and timeliness of data-driven decisions in critical areas like public health. Our contributions set a new benchmark in the field, paving the way for more nuanced and responsive analyses in various domains reliant on dynamic data interpretation.

Abstract:
In recommender systems, it is frequently presumed that missing ratings adhere to a missing at random (MAR) mechanism, implying the absence of ratings is independent of their potential values. However, this assumption fails to hold in real-world scenarios, where users are inclined to rate items they either strongly favor or disfavor, introducing a missing not at random (MNAR) scenario. To tackle this issue, prior researchers have utilized explicit MAR feedbacks to infer the propensities of unobserved, implicit MNAR feedbacks. Nonetheless, acquiring explicit MAR feedbacks is resource-intensive and time-consuming and may not reflect users’ true preferences. Furthermore, most methods have only been tested on synthetic or small-scale datasets, thus their applicability and effectiveness in real-world settings without MAR feedbacks remain unclear. Along these lines, we aim to predict MNAR ratings without MAR prior propensities by exploring the consistency between MAR and MNAR feedbacks and narrowing the gap between them. From the empirical study and preliminary experiment, we hypothesize that user preferences can be treated as the common prior propensity for both MAR and MNAR generative processes. In this way, we extend this hypothesis to a more general MNAR scenario: user preferences learned from MNAR can partially substitute for the prior propensities derived from MAR feedbacks for MNAR recommendation tasks. To validate our hypothesis and approach, we develop a lightweight iterative probabilistic matrix factorization framework (lightIPMF) as a practical method of our methodology, utilizing user preferences extracted from MNAR, not MAR, to estimate MNAR feedbacks. Finally, the experimental results show that modeling user preferences can effectively improve MNAR feedback estimation without MAR feedback, and our proposed lightIPMF outperforms the state-of-the-art MNAR methods in predicting MNAR feedbacks.

Abstract:
In large venues like shopping malls and airports, knowledge on the indoor populations fuels applications such as business analytics, venue management, and safety control. In this work, we provide means of modeling populations in partitions of indoor space offline and of monitoring indoor populations continuously, by using indoor positioning data. However, the low-sampling rates of indoor positioning render the data temporally and spatially sparse, which in turn renders the offline capture of indoor populations challenging. It is even more challenging to continuously monitor indoor populations, as positioning data may be missing or not ready yet at the current moment. To address these challenges, we first enable probabilistic modeling of populations in indoor space partitions as Normal distributions. Based on that, we propose two learning-based estimators for on-the-fly prediction of population distributions. Leveraging the prediction-based schemes, we provide a unified continuous query processing framework for a type of query that enables continuous monitoring of populated partitions. The framework encompasses caching and result validity mechanisms to reduce cost and maintain monitoring effectiveness. Extensive experiments on two real data sets show that the proposed estimators are able to outperform the state-of-the-art alternatives and that the query processing framework is effective and efficient.

Abstract:
Treatment effect estimation from observational data is a fundamental problem in causal inference, and its critical challenge is to address the confounding bias arising from the confounders. The effectiveness of the conventional methods proposed to solve this problem depends on the unconfoundedness assumption. In practice, however, the unconfoundedness assumption is frequently violated since we cannot guarantee that all the confounders are measured. To this end, recent studies suggest using auxiliary network architectures to mine information about unmeasured confounders in the data to relax this assumption. However, these methods cannot address the confounding bias from unmeasured confounders unrelated to the network information. Inspired by the insight that some neighboring features that influence one's treatment choice (e.g., which movie to watch) but do not affect the outcome (e.g., assessment of the movie) can be treated as instrumental variables (IVs), we propose a novel Network Instrumental Variable Regression (NetIV) framework that exploits IV information from neighborhoods to perform a two-stage regression for treatment effect estimation. Extensive experiments demonstrate that our NetIV method outperforms the state-of-the-art methods for treatment effect estimation in the presence of unmeasured confounders.

Affiliations: School of Artificial Intelligence, Optics and Electronics (iOPEN), School of Computer Science, Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, Shaanxi, China

Abstract:
Fuzzy c-means based on entropy regularization (FCER) is a commonly used machine learning algorithm that uses maximum entropy as the regularization term to realize fuzzy clustering. However, this model has many constraints and is challenging to optimize directly. During the solution process, the membership matrix and cluster centers are alternately optimized, easily converging to poor local solutions, limiting the clustering performance. In this paper, we start with the optimization model and propose an unconstrained fuzzy clustering model (UFCER) equivalent to FCER, which reduces the number of optimization variables from (n+d)×c to d×c. More importantly, there is no need to iteratively calculate the membership matrix during the optimization process. The time complexity is only linear, and the convergence speed is fast. We conduct extensive experiments on real datasets. The comparison of objective function value and clustering performance fully demonstrates that under the same initialization, our proposed algorithm can converge to smaller local minima and achieve better clustering performance.
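
How the membership matrix can drop out of the optimization is easiest to see numerically. The sketch below assumes the standard entropy-regularized objective, the sum over i and j of u_ij·||x_i − c_j||² + γ·u_ij·log u_ij with each row of U summing to 1; the exact UFCER formulation is not given in the abstract, and all function names are hypothetical. Substituting the closed-form optimal memberships leaves a log-sum-exp function of the d×c centers only.

```python
import math

def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def memberships(X, C, gamma):
    """Closed-form optimal memberships: a softmax of -distance / gamma."""
    U = []
    for x in X:
        d = [sq_dist(x, c) for c in C]
        m = min(d)                                   # shift for numerical stability
        w = [math.exp(-(dj - m) / gamma) for dj in d]
        s = sum(w)
        U.append([wj / s for wj in w])
    return U

def constrained_objective(X, C, U, gamma):
    """Objective written in terms of both memberships U and centers C."""
    total = 0.0
    for x, u in zip(X, U):
        for c, uij in zip(C, u):
            total += uij * sq_dist(x, c)
            if uij > 0:
                total += gamma * uij * math.log(uij)
    return total

def reduced_objective(X, C, gamma):
    """Same value with U eliminated: -gamma * sum_i log sum_j exp(-d_ij / gamma)."""
    total = 0.0
    for x in X:
        d = [sq_dist(x, c) for c in C]
        m = min(d)
        lse = -m / gamma + math.log(sum(math.exp(-(dj - m) / gamma) for dj in d))
        total += -gamma * lse
    return total
```

Under these assumptions the two objectives agree at the optimal memberships, so a solver can work on the centers alone and recover U once at the end.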

Abstract:
Reverse skyline query (RSQ) has been widely used in practice since it can pick out the data of interest to the query vector. To save storage resources and facilitate service provision, data owners usually outsource data to the cloud for RSQ services, which poses huge challenges to data security and privacy protection. Existing privacy-preserving RSQ schemes are either based on a two-cloud model or cannot fully protect privacy. To this end, we propose an efficient privacy-preserving reverse skyline query scheme over a single cloud (ePRSQ). Specifically, we first design a privacy-preserving inner product sign determination scheme (PIPSD), which can determine whether the inner product of two vectors satisfies a specific relation with 0 without leaking the vectors’ information. Next, we propose a privacy-preserving reverse dominance checking scheme (PRDC) based on symmetric homomorphic encryption. Finally, we achieve ePRSQ based on PIPSD and PRDC. Security analysis shows that PIPSD and PRDC are both secure in the real/ideal world model, and ePRSQ can protect the security of the dataset and the privacy of query requests and query results. Extensive experiments show that ePRSQ is efficient. Specifically, for a 3-dimensional dataset of size 1000, the computational and communication overheads of ePRSQ for a query are 79.47 s and 0.0021 MB, respectively. The efficiency is improved by 3.78× (300.58 s) and 928.57× (1.95 MB) respectively compared with PPARS, and by 61.31× (4872.55 s) and 407309× (855.35 MB) respectively compared with OPPRS.
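
As background for the query itself, here is a minimal plaintext sketch of a reverse skyline query under its usual definition: a point p is returned if no other data point is at least as close to p as the query q in every dimension and strictly closer in at least one. It involves no encryption and is not the ePRSQ scheme; the function names are hypothetical.

```python
def dynamically_dominates(a, b, origin):
    """True if a dominates b when both are measured by distance to origin."""
    da = [abs(x - o) for x, o in zip(a, origin)]
    db = [abs(x - o) for x, o in zip(b, origin)]
    return (all(x <= y for x, y in zip(da, db))
            and any(x < y for x, y in zip(da, db)))

def reverse_skyline(points, q):
    """Return the points whose dynamic skyline contains the query q."""
    result = []
    for i, p in enumerate(points):
        dominated = any(
            dynamically_dominates(other, q, p)
            for k, other in enumerate(points) if k != i
        )
        if not dominated:
            result.append(p)
    return result
```

The brute-force check is quadratic in the dataset size and compares coordinates directly, which are exactly the operations a privacy-preserving scheme has to carry out without seeing the values.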

Abstract:
Effectively and efficiently mining valuable clustering patterns is a challenging problem when handling large-scale data from diverse sources. Existing approaches adopt anchor graph learning or binary representation embedding to reduce computational complexity. Normally, anchor graph learning cannot directly obtain the clustering assignment without a post-processing stage, such as graph cut or k-means clustering. Binary representation embedding, in turn, neglects the structure information in Hamming space. To overcome these limitations, this paper proposes a novel, effective, and efficient angular reconstructive discrete embedding method with fusion similarity for multi-view clustering (AFMC) that can jointly learn the global and local structure-preserving binary representation and the clustering assignment. Specifically, we propose to use angular reconstructive error minimization to maintain the global similarity correlation of binary representations of heterogeneous features in a common Hamming space. Moreover, we design a multi-view discrete ridge regression with a fusion similarity term to handle the out-of-sample problem and preserve the local manifold structure. In addition, we propose an efficient optimization algorithm with linear computational complexity to solve the non-convex and non-smooth objective function. The experimental results demonstrate that AFMC outperforms several state-of-the-art large-scale multi-view clustering methods.

Abstract:
Graph neural networks (GNNs) for link prediction can loosely be divided into two broad categories. First, node-wise architectures pre-compute individual embeddings for each node that are later combined by a simple decoder to make predictions. While extremely efficient at inference time, model expressiveness is limited such that isomorphic nodes contributing to candidate edges may not be distinguishable, compromising accuracy. In contrast, edge-wise methods rely on the formation of edge-specific subgraph embeddings to enrich the representation of pair-wise relationships, disambiguating isomorphic nodes to improve accuracy, but with increased model complexity. To better navigate this trade-off, we propose a novel GNN architecture whereby the forward pass explicitly depends on both positive (as is typical) and negative (unique to our approach) edges to inform more flexible, yet still cheap node-wise embeddings. This is achieved by recasting the embeddings themselves as minimizers of a forward-pass-specific energy function that favors separation of positive and negative samples. Notably, this energy is distinct from the actual training loss shared by most existing link prediction models, where contrastive pairs only influence the backward pass. As demonstrated by extensive empirical evaluations, the resulting architecture retains the inference speed of node-wise models, while producing competitive accuracy with edge-wise alternatives.

Abstract:
Modern recommender systems derive predictions from an interaction graph that links users and items. To this end, many of today's state-of-the-art systems use graph neural networks (GNNs) to learn effective representations of these graphs under the assumption of homophily, i.e., the idea that similar users will sit close to each other in the graph. However, recent studies have revealed that real-world recommendation graphs are often heterophilous, i.e., dissimilar users will also often sit close to each other. One of the reasons for this heterophily is shilling attacks that obscure the inherent characteristics of the graph and make the derived recommendations less accurate as a consequence. Hence, to cope with low homophily in recommender systems, we propose a recommendation model called PGT4Rec that is based on a Partitioned Graph Transformer. The model integrates label information into the learning process, which allows discriminative neighbourhoods of users to be generated. As such, the framework can both detect shilling attacks and predict user ratings for items. Extensive experiments on real and synthetic datasets show that PGT4Rec provides not only superior performance in these two tasks but also significant robustness to a range of adversarial conditions.

Abstract:
Estimating per-flow cardinality from high-speed data streams has many applications such as anomaly detection and resource allocation. Although approximation algorithms exist for tracking the cardinality of a single flow, algorithmic challenges remain for monitoring multiple flows, especially under an unbalanced cardinality distribution: existing methods adopt a uniform sketch layout and incur a large memory footprint to achieve high accuracy. Furthermore, they are hard to implement in the compact hardware used for line-rate processing. In this paper, we propose Couper, a memory-efficient measurement framework that can estimate cardinality for multiple flows under an unbalanced cardinality distribution. We propose a two-layer structure based on the classic coupon collector's principle, where numerous mice flows are confined to the first layer and only the potential elephant flows are allowed to enter the second layer. Our two-layer structure can better fit the unbalanced cardinality distribution in practice and achieve much higher memory efficiency. We implement Couper in both software and hardware. Extensive evaluation on real-world and synthetic data traces shows more than 20× improvement in memory efficiency compared to the state of the art.
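
The coupon collector's principle mentioned above can be made concrete with a small sketch: if each distinct element hashes uniformly into one of m slots, then on average m/(m−i) further distinct elements are needed to fill a new slot once i slots are filled. The bitmap below uses that expectation as a cardinality estimate. This is only an illustration of the principle under that uniform-hashing assumption, not Couper's two-layer structure; `expected_distinct` and `TinyBitmap` are hypothetical names.

```python
import hashlib

def expected_distinct(m, j):
    """Expected number of distinct elements needed to fill j of m slots."""
    if not 0 <= j <= m:
        raise ValueError("j must be between 0 and m")
    return sum(m / (m - i) for i in range(j))

class TinyBitmap:
    """m-slot bitmap; duplicate elements always hit the same slot."""

    def __init__(self, m):
        self.m = m
        self.bits = [False] * m

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        self.bits[int.from_bytes(digest, "big") % self.m] = True

    def estimate(self):
        # Once every slot is filled the structure is saturated and the
        # estimate can no longer grow with the true cardinality.
        return expected_distinct(self.m, sum(self.bits))
```

A small bitmap of this kind saturates quickly, which suggests why a layered design would keep small flows in a compact structure and pass only large flows on to a bigger one.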

Abstract:
Graph embedding maps graph nodes to low-dimensional vectors and is widely used in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric random walk-based, and pipeline-optimized graph embedding framework, DistGER-Pipe, which scales to embed billion-edge graphs. DistGER-Pipe incrementally computes information-centric random walks to reduce redundant computations for more effective and efficient graph embedding. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. DistGER-Pipe also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Finally, DistGER-Pipe designs pipelined execution that decouples the operators in sampling and training procedures with an inter-round serial and intra-round parallel processing, attaining optimal utilization of computing resources. Experiments on real-world graphs demonstrate that compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, Pytorch-BigGraph, and DistGER, DistGER-Pipe exhibits 3.15×–1053× acceleration, 45% reduction in cross-machine communication, >10% effectiveness improvement in downstream tasks, and 38% enhancement in CPU utilization.

Abstract:
Short-term time series forecasting is pivotal in various scientific and industrial fields. Recent advancements in deep learning-based technologies have significantly improved the efficiency and accuracy of short-term time series modeling. Despite these advancements, current short-term time series forecasting methods typically emphasize modeling dependencies across time stamps but frequently overlook inter-variable dependencies, which are crucial for multivariate forecasting. To fill this gap, we propose a multi-pattern memory model that discovers various dependency patterns for short-term multivariate time series forecasting. The proposed model is structured around two key components: the short-term memory block and the long-term memory block. These blocks are distinctively characterized by their use of asymmetric convolution, each tailored to process the various spatial-temporal dependencies among data. Experimental results show that the proposed model demonstrates competitive performance against other time series forecasting methods across five benchmark datasets, likely thanks to the asymmetric structure, which can effectively extract the various underlying spatial-temporal dependencies among data.

Abstract:
Aspect-Based Multimodal Sentiment Analysis (ABMSA) aims to infer users’ sentiment polarities over individual aspects using visual, textual, and acoustic signals. Although psychological studies have shown that personality has a direct impact on people's sentiment orientations, most existing methods disregard potential personality traits while executing ABMSA tasks. To tackle this issue, a novel psychological perspective, namely people's personalities, is introduced. To the best of our knowledge, this paper is the first study to do so in this field. Different from current pipelined multi-task sentiment analysis methods, an end-to-end ABMSA method called Personality-coupled mUlti-task leaRning framEwork (PURE) is proposed, which strongly couples personality mining and ABMSA tasks in a unified architecture to avoid error propagation and enhance the overall system robustness. Specifically, an adaptive personality feature extraction method is designed to accurately model the first impression of different people's personalities. Then, a multi-task ABMSA framework is designed to strongly couple the multimodal features of aspects extracted by the interactive attention fusion network with people's personalities. Subsequently, the proposed framework optimizes them in parallel via extended Bayesian meta-learning. Finally, compared to the current best model, the classification accuracy and macro F1 score of the proposed model both show significant improvements on public datasets. In addition, PURE is transferable and can effectively couple personality modeling tasks with any other sentiment analysis methods.

Abstract:
Models based on Transformer variants have consistently demonstrated leading performance in long sequence time series forecasting. However, in some complex application scenarios, Transformers tend to capture low-frequency information in the data while overlooking high-frequency information, which often contains rich non-stationary features. This unbalanced feature extraction approach limits the model’s ability to effectively handle real-world time series data. To address this issue, we explicitly represent both low-frequency and high-frequency information and propose a model called STCNet, a data-driven scale-adaptive convolutional network that aims to extract diverse features and patterns from the data by learning features across different frequency bands in a balanced manner. Specifically, we propose an entropy-based adaptive wavelet basis selection algorithm, which can adaptively select appropriate wavelet bases based on the data distribution to achieve effective multi-frequency decomposition of complex time series. In addition, we designed a hierarchical scale-adaptive factor that allows for dynamic adjustment of feature weights according to different time scales through refined layered weight adjustment, significantly enhancing the model’s capability in handling non-stationary time series features. To further optimize the output features of the model, we introduce a test-time training mechanism, combined with a fast weight update strategy and a weight-sharing strategy to reduce the number of model parameters, effectively mitigating the risk of overfitting. Experimental results on nine datasets demonstrate that STCNet outperforms the current state-of-the-art models in both effectiveness and efficiency.

Abstract:
Concept drift arises from unpredictable data distribution shifts, degrading model performance. In evolving multiple data streams, these drifts pose greater challenges due to dynamic changes and uncertain inter-stream correlations, demanding robust accuracy and generalization. To address this issue, in this article, we propose a novel multiple data stream learning method, called the adaptive information fusion-based concept drift learning (AIF-CD) method, to adaptively handle multiple data streams with heterogeneous feature spaces and complex drift situations. First, a real-time learning method with a cooperation scheme is proposed to handle multiple data streams. Second, an information fusion-based augmentation process is designed to help enhance the learning efficiency of each stream. Next, a drift severity identification-based adaptation strategy and a process to selectively use the previous timestamps’ data are introduced to enhance learning robustness in both synchronous and asynchronous scenarios. Moreover, a detailed runtime complexity and theoretical analysis further explains the learning efficiency of our method. Our key innovation combines real-time adaptation with theoretical guarantees for complex, evolving multi-stream learning. The experiment results in various scenarios under synchronous and asynchronous settings show that the proposed method is more efficient than other benchmark methods.

Abstract:
Predicting crime hotspots in a city is a complex and critical task with significant societal implications. Numerous spatiotemporal correlations and irregularities pose substantial challenges to this endeavor. Existing methods commonly employ fixed-time granularities and sequence prediction models. However, determining appropriate time granularities is difficult, leading to inaccurate predictions for specific time windows. For example, users might ask: What are the crime hotspots during [12:00-20:00]? To address this issue, we introduce FlexiCrime, a novel event-centric framework for predicting crime hotspots with flexible time intervals. FlexiCrime incorporates a continuous-time attention network to capture correlations between crime events, which learns crime context features, representing general crime patterns across time points and locations. Furthermore, we introduce a type-aware spatiotemporal point process that learns crime-evolving features, measuring the risk of specific crime types at a given time and location by considering the frequency of past crime events. Together, the crime context and evolving features allow us to predict whether an urban area is a crime hotspot given a future time interval. To evaluate FlexiCrime’s effectiveness, we conducted experiments using real-world datasets from two cities, covering twelve crime types. The results show that our model outperforms baseline techniques in predicting crime hotspots over flexible time intervals.

Abstract:
Multimodal foundation models (MFMs) have revolutionized sequential recommender systems through advanced representation learning. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt these models, studies often prioritize parameter efficiency, neglecting GPU memory and training speed. To address this, we introduced the IISAN framework, significantly enhancing efficiency. However, IISAN was limited to symmetrical MFMs and identical text and image encoders, preventing the use of state-of-the-art Large Language Models. To overcome this, we developed IISAN-Versa, a versatile plug-and-play architecture compatible with both symmetrical and asymmetrical MFMs. IISAN-Versa employs a Decoupled PEFT structure and utilizes both intra- and inter-modal adaptation. It effectively handles asymmetry through a simple yet effective combination of group layer-dropping and dimension transformation alignment. Our research demonstrates that IISAN-Versa effectively adapts large text encoders, and we further identify a scaling effect where larger text encoders generally perform better. IISAN-Versa also demonstrates strong versatility in our defined multimodal scenarios, which include raw titles and captions generated from images and videos. Additionally, IISAN-Versa achieved state-of-the-art performance on the MicroLens public benchmark.

Abstract:
Multi-view clustering (MVC) has demonstrated impressive performance due to its ability to capture both consistency and diversity information among views. However, most existing techniques assume that all views are available in advance, making them inadequate for stream-view data, such as intelligent transportation systems and medical imaging analysis, where memory constraints or privacy concerns prevent storing all previous views. Although some methods attempt to address this issue by capturing consistency information, they often fail to effectively extract both diversity information and cross-view relationships. We argue that these limitations are inherent to incremental multi-view clustering (IMVC), as the inability to retain all previous views inevitably leads to insufficient information utilization, thereby compromising performance. To address these challenges, we propose a novel algorithm, termed Incremental Multi-View Clustering with Cross-View Correlation and Diversity (CDIMVC). Unlike existing methods that only retain consistency information, CDIMVC also preserves diversity information and utilizes similarity matrices to capture cross-view relationships. To implement this method, we develop three key modules: the dynamic view correlation analysis module (DVCAM), the knowledge extraction module (KEM), and the knowledge transfer module (KTM). When a new view arrives, DVCAM first assesses its importance and correlations to historical views. Subsequently, KEM computes its consistency and diversity information by comparing it to that in the knowledge base. Finally, KTM facilitates the effective transmission of past knowledge, preventing the loss of historical information. By integrating these modules, CDIMVC can effectively capture cross-view relationships and diversity information, facilitating efficient knowledge updating and maintenance. An alternating procedure is also designed to solve the resulting optimization problem. Experimental results show that CDIMVC outperforms state-of-the-art methods, demonstrating its effectiveness in handling stream-view data.

Abstract:
Visual Question Answering (VQA), aimed at improving AI-driven interactions and solving complex visual-linguistic tasks, has increasingly garnered attention as a pivotal research domain in both academic and industrial spheres. Despite progress in VQA, current studies still suffer from the challenge of language bias posed by spurious semantic correlations and minority class collapse, leading to semantic ambiguities and distribution shifts that hinder robust performance across challenging scenarios. To address these challenges, we propose a robust multi-space collaborative debiasing paradigm, termed “LBF-VQA”, which systematically leverages multi-space collaborative debiasing strategies to achieve language bias-free VQA, encompassing both Euclidean space debiasing (ESD) and Spherical space debiasing (SSD). By strategically introducing bias-examples and their corresponding counter-examples, the ESD strategy focuses on uncovering hidden prior correlations and the complex interactions between modality and semantics within the Euclidean space. Benefiting from the infinite contrastive and distribution debiasing learning mechanisms, the SSD strategy is devoted to effectively preventing the collapse of minority classes while enhancing the manifold representations of instance de-bias and distribution de-dependence in the Spherical space. Furthermore, we meticulously constructed a specialized medical dataset deliberately embedded with language bias to comprehensively examine the negative effects of language bias on medical VQA systems. Extensive experiments on multiple general and medical VQA benchmarks consistently verify the effectiveness and generalizability of our LBF-VQA in handling various complex VQA scenarios compared with state-of-the-art baselines.

Abstract:
A nonstandard tensor is frequently adopted to model a large-scale complex dynamic network. A Tensor Representation Learning (TRL) model enables extracting valuable knowledge from a dynamic network via learning a low-dimensional representation of a target nonstandard tensor. Nevertheless, the representation learning ability of existing TRL models is limited for a nonstandard tensor due to their inability to accurately represent the specific nature of the nonstandard tensor, i.e., mode imbalance, high dimensionality, and incompleteness. To address this issue, this study proposes a Mode-Aware Tucker Network-based Tensor Representation Learning (MTN-TRL) model built on three ideas: a) designing a mode-aware Tucker network to accurately represent the imbalanced modes of a nonstandard tensor, b) building an MTN-based, highly efficient TRL model that fuses both a data density-oriented modeling principle and an adaptive parameter learning scheme, and c) theoretically proving the MTN-TRL model’s convergence. Extensive experiments on eight nonstandard tensors generated from real-world dynamic networks demonstrate that MTN-TRL significantly outperforms state-of-the-art models in terms of representation accuracy.

Abstract:
The demand for more precise and timely urban resource allocation and management has driven the extension of urban flow prediction from short-term to long-term horizons. As the time scale expands, the issue of urban flow distribution shift becomes increasingly prominent due to various impact factors, such as weather, events, and city changes. Traditionally, comprehensively analyzing and addressing the causal relationships underlying the distribution shift caused by these factors has been challenging. In this paper, we propose that these impact factors can be partitioned into two major types, i.e., context factors and structural factors. We then present a decomposition-based model for long-term urban flow prediction from a causal perspective, named DeCau, which can discriminate between the two types of factors to effectively address the problem of urban flow distribution shift. First, we employ a decomposition module to decompose urban flow into a seasonal part and a trend part. The seasonal part contains high-frequency irregular variations caused by context factors. We devise a shared distribution estimator to approximate the unavailable prior distributions of context factors, and then apply causal intervention to mitigate the confounding impact of context factors. The distribution shift in the trend part is induced by structural factors. We design a dual causal dependency extractor to model the causality between POI distribution and urban flow, and then eliminate spurious correlations through causal adjustment. Finally, we design an end-to-end framework for long-term urban flow prediction by combining the embeddings from the two parts, enabling the model to generalize to unseen distributions. Extensive experimental results demonstrate that DeCau outperforms state-of-the-art baselines.

Affiliations: School of Computing and Artificial Intelligence, Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, Southwest Jiaotong University, Chengdu, China; School of Computing, University of Leeds, Leeds, U.K.

Abstract:
Long-term time series forecasting (LTSF) is a critical task across diverse domains. Despite significant advancements in LTSF research, we identify a performance bottleneck in existing LTSF methods caused by the inadequate modeling of Temporal Dependencies within the Target (TDT). To address this issue, we propose a novel and generic temporal modeling framework, Temporal Dependency Alignment (TDAlign), that equips existing LTSF methods with TDT learning capabilities. TDAlign introduces two key innovations: 1) a loss function that aligns the change values between adjacent time steps in the predictions with those in the target, ensuring consistency with variation patterns, and 2) an adaptive loss balancing strategy that seamlessly integrates the new loss function with existing LTSF methods without introducing additional learnable parameters. As a plug-and-play framework, TDAlign enhances existing methods with minimal computational overhead, featuring only linear time complexity and constant space complexity relative to the prediction length. Extensive experiments on six strong LTSF baselines across seven real-world datasets demonstrate the effectiveness and flexibility of TDAlign. On average, TDAlign reduces baseline prediction errors by 1.47% to 9.19% and change value errors by 4.57% to 15.78%, highlighting its substantial performance improvements.
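
The first ingredient described above, aligning change values between adjacent time steps in the prediction with those in the target, can be illustrated with a minimal sketch. The function names, the use of mean absolute error, and the fixed weight `alpha` are assumptions made for illustration only; the abstract states that the actual balancing strategy is adaptive, and it does not specify the distance measure.

```python
def change_values(series):
    """First-order differences between adjacent time steps."""
    return [b - a for a, b in zip(series, series[1:])]


def mae(xs, ys):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def combined_loss(pred, target, alpha=0.5):
    """Blend a point-wise loss with a change-value loss.

    alpha is a fixed, hypothetical weight; it stands in for the adaptive
    balancing strategy, which is not reproduced here.
    """
    point_loss = mae(pred, target)
    change_loss = mae(change_values(pred), change_values(target))
    total = alpha * point_loss + (1 - alpha) * change_loss
    return total, point_loss, change_loss


pred = [1.0, 2.0, 4.0]
target = [1.0, 3.0, 4.0]
total, point_loss, change_loss = combined_loss(pred, target)
```

In this toy case the prediction is off at only one time step, yet both change values are wrong, so the change-value term penalizes the distorted variation pattern more heavily than the point-wise term does.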

Abstract:
Temporal data analysis plays a pivotal role in applications such as weather forecasting, traffic flow management, energy consumption monitoring, and other areas of urban computing. In recent years, temporal data modeling has transitioned from traditional deep learning methods to pre-trained models. However, existing approaches often exhibit significant task-specific limitations, requiring bespoke model designs and extensive domain data for training. To address these challenges, this study introduces KPT, a novel foundation model for temporal data analysis in urban computing. By leveraging temporal competitive attention and feature interaction attention mechanisms, KPT can effectively capture global context, integrate cross-variable features precisely, and achieve universal feature learning across diverse time series tasks. Additionally, the knowledge prompt network facilitates the deep fusion of cross-layer features via an intricate interaction mechanism, enabling the model to identify and align shared temporal patterns across different time series data. These patterns are then transformed into knowledge prompts, thereby enhancing the universal feature learning capabilities of the pre-trained model. Experimental results demonstrate that KPT excels in four core temporal analysis tasks within urban computing, outperforming task-specific models. This highlights KPT’s ability to generalize across tasks and underscores its potential as a foundation model for multi-task scenarios in urban computing.

Abstract:
Graph-based multi-view clustering methods have demonstrated satisfying performance by effectively capturing relationships among data samples. However, most existing methods primarily emphasize direct pairwise relationships, neglecting the exploration of high-order correlations present within each view. To this end, a novel approach, called multiview clustering via high-order bipartite graph learning and tensor low-rank representation (HBGTLRR), is proposed. Specifically, we first construct high-order bipartite graphs to capture latent relationships and concatenate them into a tensor. By applying tensor nuclear norm (TNN) minimization, we obtain a low-rank representation that reduces noise and preserves high-order consistency. Subsequently, a consensus graph is constructed by adaptively fusing the high-order bipartite graphs with corresponding weights, and then a Laplacian low-rank constraint is imposed on it to effectively capture the intrinsic data structure. Finally, extensive experimental results show that HBGTLRR significantly outperforms existing methods, thereby validating the effectiveness of our proposed method.

Abstract:
Phishing attacks continue to pose a significant cybersecurity threat, especially as social engineering (SE) tactics become more contextually embedded and difficult to detect. To address the limitations of traditional rule-based frameworks and AI-driven classifiers, we propose PRIME, a phishing email evaluation framework that leverages large language models (LLMs) to assess manipulative intent across six interpretable criteria. Three risk scoring strategies, namely equal weighting, semantically weighted scoring, and fuzzy logic-based classification, are applied to aggregate the criterion scores into multi-level risk assessments. Qualitative comparisons with established frameworks demonstrate PRIME’s broad coverage and conceptual soundness. Quantitative experiments validate its effectiveness, with the fuzzy-based method achieving perfect recall on a phishing-only dataset and consistent performance across multiple years. An ablation study, where each criterion is removed in turn, highlights the critical role of the Context and Content dimension in detecting both explicit and subtle SE cues. By separating LLM interpretation from final decision-making, PRIME enhances transparency, robustness, and adaptability in phishing detection systems.
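
The difference between equal weighting and weighted scoring can be sketched as follows. Everything concrete here is hypothetical: the criterion names other than the Context and Content dimension, the 0-to-1 score scale, the weights, and the risk thresholds are all assumptions, and the fuzzy logic-based strategy is not covered.

```python
def aggregate(scores, weights=None):
    """Weighted mean of criterion scores; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def risk_level(score, thresholds=(0.33, 0.66)):
    """Map an aggregated score to a risk level (thresholds are assumed)."""
    if score < thresholds[0]:
        return "low"
    if score < thresholds[1]:
        return "medium"
    return "high"


# Hypothetical per-criterion scores, as an LLM might assign to one email.
scores = {
    "context_and_content": 0.9,
    "criterion_2": 0.3,
    "criterion_3": 0.3,
    "criterion_4": 0.3,
    "criterion_5": 0.3,
    "criterion_6": 0.3,
}
# Hypothetical weights that emphasize the Context and Content dimension.
weights = {name: 1.0 for name in scores}
weights["context_and_content"] = 9.0

equal_score = aggregate(scores)
weighted_score = aggregate(scores, weights)
equal_level = risk_level(equal_score)
weighted_level = risk_level(weighted_score)
```

With these made-up numbers the same email is rated "medium" under equal weighting and "high" once the dominant criterion is weighted up, which shows why the choice of aggregation strategy matters.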

Abstract:
Real-world applications have produced massive short text streams. In contrast to traditional regular texts, they are characterized by short length, few labeled data, high velocity, high volume, and dynamic data distributions, which aggravate the issues of data sparseness, missing labels, and concept drift. This poses a huge challenge for existing short text (stream) classification algorithms, which perform poorly because they typically assume that all short texts are completely labeled and pay little attention to the concept drift hidden in short text streams. Therefore, we propose a novel semi-supervised short text stream classification method based on a drift-aware incremental deep learning ensemble model. Specifically, with the sliding window mechanism, we first fuse three types of information (statistical, semantic, and structural) to address the data sparseness issue. Second, a semi-supervised incremental deep learning ensemble model based on GCN and the refined LSTM is developed to adapt to high-volume, high-velocity short text streams with missing labels. Third, a label-probability distribution-based concept drift detector is introduced to identify concept drifts. Finally, extensive experiments against eleven well-known classification methods demonstrate the effectiveness of the proposed method in handling short text streams with limited labeled data.

Abstract:
For Named Entity Recognition (NER), sequence labeling-based and span-based paradigms are quite different. Previous studies have demonstrated the clear complementary advantages of the two paradigms, but few models have tried to incorporate them into a single NER model as far as we know. In our previous work, we proposed a paradigm called Bundling Learning (BL) to explore the above issue, which bundles the two NER paradigms, enabling NER models to jointly tune their parameters by weighted summing each paradigm's training loss. However, three critical issues remain unresolved: When does BL work? Why does BL work? Can BL enhance existing state-of-the-art NER models? To address the first two issues, we design three NER models: a sequence labeling-based model – SeqNER, a span-based NER model – SpanNER, and BL-NER which bundles SeqNER and SpanNER. We draw two conclusions regarding the two issues based on the experimental results on eleven NER datasets. To investigate the third issue, we apply BL to five existing state-of-the-art NER models, including three sequence labeling-based and two span-based models. Experimental results indicate consistent NER performance gains, suggesting a feasible way to construct new state-of-the-art NER systems by applying BL to the current state-of-the-art systems. Moreover, investigation results show that BL reduces both entity boundary and type prediction errors. In addition, we compare two commonly used label tagging methods and three types of span semantic representations.

Abstract:
In real-world scenarios, most complex systems can generally be modelled as homogeneous or heterogeneous networks. Therefore, downstream tasks (e.g., node/graph classification, node clustering) based on these two types of graphs have become ubiquitous and have drawn considerable attention in recent years. Existing literature on node classification mainly focuses on either homogeneous or heterogeneous graphs, while research on effectively carrying out node classification tasks on both types of graphs simultaneously is still under-explored. To fill this gap, we propose a universal Graph Neural Network architecture based on Subgraph and Subhypergraph (SS-GNN) with a feature-enhanced strategy for node embedding on both homogeneous and heterogeneous graphs. By constructing subgraphs and subhypergraphs from same-class nodes, our model can simultaneously deal with homogeneous and heterogeneous graphs. Graph attention modules are specifically designed to embed subgraphs of same-class nodes to learn the internal topological structure and local community structure within the original graph. Additionally, to capture high-order features of the graph and enhance the embedding representations of nodes, we also utilize hypergraph attention modules to embed subhypergraphs of same-class nodes. Unlike other approaches that rely on pre-defined meta-paths, our model can be readily applied to most real-world applications without requiring any domain knowledge. Finally, we conduct extensive experiments on three homogeneous and three heterogeneous real-world graphs to demonstrate the effectiveness of SS-GNN. The experimental results for node classification and clustering tasks not only show the superior performance of our proposed model compared to state-of-the-art baselines, but also demonstrate its potentially good interpretability for graph analysis. This work may provide some insights for the study of the universality of graph foundation models.

Abstract:
Complex Query Answering (CQA) on knowledge graphs is a fundamental yet challenging task, which can be formalized as answering a subset of first-order logic queries containing logical conjunction, disjunction, negation, and existential quantifiers. Recent research reveals that Link Predictors (LPs) trained on 1-hop queries can generalize to various types of complex queries. However, existing methods neglect crucial characteristics of LPs’ outputs, including the effects of highly relevant entities and uncertainty. What’s worse, as they model logical operations by fuzzy set operations, these methods suffer from problems like inflexibility, sensitivity to noise, and inconsistency with priority in human cognition, which limits their performance, especially on queries with negation. To address these challenges, we propose an efficient fuzzy system for CQA that requires no extra training overhead and is plug-and-play with existing LP-based methods. First, we expand the output of LPs by two complementary membership functions of weak and strong relevance, which help to distinguish the target entities from highly relevant and irrelevant entities. Subsequently, we model logical operations through fuzzy rule bases and infer the final predictions via defuzzification, providing a flexible and tractable scheme for modeling logical operations. Finally, the effectiveness of the proposed fuzzy system is validated by its outstanding performance on benchmark datasets when compared to state-of-the-art methods. The source code of our proposed method is available at https://github.com/luyy9apples/FuzzSys-CQA.
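
For context, the fuzzy set operations that existing LP-based methods apply to link-predictor scores can be sketched as below. The product t-norm, its dual t-conorm, and the standard complement are one common choice and are assumed here; the entity names and scores are made up. This illustrates the baseline scheme that the abstract criticizes, not the proposed rule-base system.

```python
def fuzzy_and(a, b):
    """Conjunction as the product t-norm."""
    return a * b


def fuzzy_or(a, b):
    """Disjunction as the probabilistic sum (dual of the product t-norm)."""
    return a + b - a * b


def fuzzy_not(a):
    """Negation as the standard complement."""
    return 1.0 - a


def conjunction_with_negation(scores_q1, scores_q2):
    """Score every entity for the query 'q1 AND NOT q2'."""
    return {
        entity: fuzzy_and(scores_q1[entity], fuzzy_not(scores_q2[entity]))
        for entity in scores_q1
    }


# Hypothetical link-predictor scores in [0, 1] for two 1-hop sub-queries.
scores_q1 = {"entity_a": 0.9, "entity_b": 0.8}
scores_q2 = {"entity_a": 0.1, "entity_b": 0.9}
answers = conjunction_with_negation(scores_q1, scores_q2)
```

Here `entity_a` keeps a high score because it matches the first sub-query and not the second, while `entity_b` is suppressed by the negation. Because each operator is a fixed formula, a single noisy input score propagates directly into the result, which is the kind of rigidity the abstract points to.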

Affiliations: School of Artificial Intelligence and Information Engineering, Zhejiang University of Science and Technology, Hangzhou, China; School of Science, Zhejiang University of Science and Technology, Hangzhou, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Zhejiang Key Laboratory of Multidimensional Perception Technology, Application and Cybersecurity, Hangzhou, China; School of Computer Science and Technology, Zhejiang University, Hangzhou, China; Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China; School of Computer Science and Mathematics, Victoria University, Melbourne, VIC, Australia

Abstract:
Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by well-designed perturbations. This could lead to disastrous results on critical applications such as self-driving cars, surveillance security, and medical diagnosis. At present, adversarial training is one of the most effective defenses against adversarial examples. However, in traditional adversarial training, it is still difficult to achieve a good trade-off between clean accuracy and robustness since DNNs still learn spurious features. The intrinsic reason is that traditional adversarial training makes it difficult to fully learn core features from adversarial examples when noise and examples cannot be disentangled. In this paper, we disentangle the adversarial examples into natural and perturbed patterns by bit-plane slicing. We assume the higher bit-planes represent natural patterns and the lower bit-planes represent perturbed patterns, respectively. We propose Feature-Focusing Adversarial Training (F²AT), which differs from previous work in that it enforces the model to focus on the core features from natural patterns and reduce the impact of spurious features from perturbed patterns. The experimental results demonstrated that the clean accuracy and adversarial robustness with our F²AT can be significantly improved.
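
Bit-plane slicing itself is a standard image operation and can be sketched in a few lines. The choice of the top four bits as the "higher" planes and the flat list of 8-bit pixel values are assumptions for illustration; the abstract does not say where the split is made.

```python
def split_bit_planes(pixels, high_bits=4):
    """Split 8-bit pixel values into higher and lower bit-plane components.

    high_bits is the (assumed) number of most significant bit-planes that
    are treated as the natural pattern; the remaining planes form the
    perturbed pattern.
    """
    low_mask = (1 << (8 - high_bits)) - 1   # e.g. 0b00001111
    high_mask = 0xFF ^ low_mask             # e.g. 0b11110000
    high = [p & high_mask for p in pixels]
    low = [p & low_mask for p in pixels]
    return high, low


pixels = [200, 13, 255]
high, low = split_bit_planes(pixels)
reconstructed = [h + l for h, l in zip(high, low)]
```

The two components always add back up to the original pixel values, so the split loses no information. A small perturbation of a few intensity levels mostly changes the lower component, which is the intuition behind treating the lower planes as the perturbed pattern.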

Abstract:
Cross-domain sequential recommendation (CDSR) tackles data sparsity and cold-start issues by leveraging information from the source domain to enhance prediction accuracy in the target domain. However, the recommendation fairness issue may further deteriorate dramatically with the biased knowledge transfer of overlapped users. This paper is the first study to address and improve fairness measurement between different demographic groups in CDSR. The proposed FairCDSR employs sequence augmentation techniques to enrich the interaction histories of disadvantaged user groups, which typically have less training data. These augmented sequences are further represented by a contrastive learning method with hard negative sampling to mitigate the unfairness in recommendations. Then, to more precisely capture cross-domain preferences, a multi-interest learning approach is applied to each group across the domains. Finally, an interest-level knowledge transfer algorithm with fixed bandwidth limitations for each group is developed to extract fair and semantic cross-domain information. Extensive experiments conducted on real-world datasets demonstrate the effectiveness of FairCDSR. Compared to existing cross-domain or fair recommendation systems, FairCDSR significantly reduces recommendation disparity between advantaged and disadvantaged groups. Benefiting from a significant improvement in the recommendation accuracy of the disadvantaged group, the overall system performance can also be effectively enhanced by 5-10%.

Abstract:
The widespread use of locatable devices leads to a sharp increase in the storage of trajectory data, and redundant storage of similar trajectories wastes a large amount of storage resources. The state-of-the-art multiple trajectory compression algorithms are developed to strip the partial information of trajectory; however, these algorithms have low compression efficiency because they do not eliminate the redundancy within a single trajectory as much as possible, as well as high time overhead due to matching of reference sub-trajectories. In this study, we propose a new spatio-temporal trajectory compression technique, consisting of intra-trajectory error balancing and inter-trajectory feature point clustering. Intra-trajectory error balancing is achieved through retaining high score (an aggregated metric) trajectory points (i.e., feature points). Furthermore, inter-trajectory feature point clustering realizes the fusion of similar trajectories and extracts the commonality between trajectories. Experiments are performed on five real trajectory datasets, including two road datasets, one airline dataset, and one walking dataset. Compared with the state-of-the-art methods, our compression technique improves the compression ratio by an average of 24.9% under the same error, and reduces the time overhead by at least an order of magnitude.

Affiliations: Faculty of Computer Science, University of Information Technology, Ho Chi Minh City, Vietnam; Faculty of Information Technology, HUTECH University, Ho Chi Minh City, Vietnam; School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam; Faculty of Computer Engineering, University of Information Technology, Ho Chi Minh City, Vietnam; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Abstract:
High utility itemset mining (HUIM) is one of the most compelling problems in data mining, extending frequent itemset mining (FIM) and serving as a crucial method for analyzing customer behavior. Many HUIM algorithms have been proposed to improve execution time and memory consumption. However, most assume that the profit is fixed for each item in a database, which is unrealistic. Some algorithms address products with unstable transaction profits but remain slow because of ineffective pruning strategies. Additionally, generalizing items into categories is often neglected. To address these issues, this paper considers a more practical database type that integrates unstable profits with a taxonomy of items. The proposed algorithm, CLHUN (Cross-level High Utility Itemset Mining in a Database with Unstable and Negative Profits), combines efficient techniques such as item sorting and tighter upper bounds to prune the search space. Furthermore, it introduces strategies to eliminate unpromising items during mining and reduce the number of transaction scans. Several experiments were conducted to evaluate the algorithm’s performance. Results demonstrate that CLHUN is efficient with these techniques and strategies.

Abstract:
Recent advances in Multimodal Entity Linking leverage multimodal information to link target mentions to corresponding entities. However, existing methods uniformly adopt a “one-size-fits-all” approach, which overlooks the unique requirements of individual samples and fails to adequately balance modality-assisted disambiguation and modality-induced noise. Also, the commonly used separate large-scale visual and text pre-trained models for feature extraction do not address inter-modal heterogeneity and the high computational cost of fine-tuning. To resolve these two issues, we introduce a novel approach named Multimodal Entity Linking with Dynamic Modality Selection and Interactive Prompt Learning (DSMIP). First, we design three expert networks that utilize different subsets of modalities tailored to the task and train them individually. Specifically, for the multimodal expert network, we enhance entity and mention feature extraction by updating multimodal prompts and setting up a coupling function to realize the interaction of prompts between modalities. Subsequently, to select the best-suited expert network for each specific sample, we devise a Modality Selection Gating Network to gain the optimal one-hot selection vector by applying a specialized reparameterization technique and a two-stage training process. Experimental results on three public benchmark datasets demonstrate that the proposed DSMIP outperforms all state-of-the-art baselines.

Affiliations: School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China; China CSCEC Real Estate Company, Ltd., Shanghai, China; Network department, China Mobile Communications Corporation, Beijing, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China

Abstract:
With the prevalence of online news services, personalized news recommendation (PNR) has played an indispensable role in meeting users’ needs and mitigating information overload, with the aim of providing news articles that cater to user preferences. Despite significant progress made in the field of PNR over the past few decades, the performance of existing methods is still hindered by some limitations, such as insufficient news modeling, difficulties in effectively modeling diverse user interests, and ignorance of fine-grained matching signals. Fortunately, the emergence of large language models (LLMs) offers a promising way to strengthen news recommendation. Known for their impressive capabilities of natural language understanding and generation, LLMs have achieved breakthroughs in various natural language processing (NLP) tasks, which motivates us to integrate LLMs into news recommendation and use them to make up for existing deficiencies. In this paper, we conduct a comprehensive review of current efforts made towards utilizing LLMs for PNR, with a focus on three core modules involved in the news recommendation process, i.e., news modeling, user modeling, and accurate matching. We systematically discuss and analyze relevant works under each focus. In addition, we point out several potential research directions to provide more inspiration for future investigation in this thriving field.

Abstract:
Recently, recommender systems have achieved significant success. However, due to their openness, they remain vulnerable to malicious attacks. Additionally, natural noise in training data and issues such as data sparsity can also degrade the performance of recommender systems. Therefore, enhancing the robustness of recommender systems has become an increasingly important research topic. In this survey, we provide a comprehensive overview of the robustness of recommender systems. Based on our investigation, we categorize the robustness of recommender systems into adversarial robustness and non-adversarial robustness. For adversarial robustness, we introduce the fundamental principles and classical methods of recommender system adversarial attacks and defenses. For non-adversarial robustness, we analyze the problem from the perspectives of data sparsity, natural noise, and data imbalance. Additionally, we summarize commonly used datasets and evaluation metrics for evaluating the robustness of recommender systems. Finally, we discuss the current challenges in the field of recommender system robustness and potential future research directions. To facilitate fair and efficient evaluation of attack and defense methods in adversarial robustness, we also propose an adversarial robustness evaluation library, ShillingREC, and we conduct evaluations of basic attack models and recommendation models.

Abstract:
Session-based recommendation is gaining increasing attention due to its practical value in predicting the intents of anonymous users based on limited behaviors. Emerging efforts incorporate various side information to alleviate inherent data scarcity issues in this task, leading to impressive performance improvements. The core of side information-driven session-based recommendation is the discovery and utilization of diverse data. In this survey, we provide a comprehensive review of this task from a data-centric perspective. Specifically, this survey commences with a clear formulation of the task. This is followed by a detailed exploration of various benchmarks rich in side information that are pivotal for advancing research in this field. Afterwards, we delve into how different types of side information enhance the task, underscoring data characteristics and utility. Moreover, we discuss the usage of various side information, including data encoding, data injection, and involved techniques. A systematic review of research progress is then presented, with the taxonomy by the types of side information. Finally, we summarize the current limitations and present the future prospects of this vibrant topic.

Abstract:
Traffic forecasting plays a crucial role in establishing an Intelligent Transportation System (ITS) by providing essential insights. Existing traffic forecasting relies on the assumption that there is a hidden invariant spatial-temporal pattern in the large-scale dataset. However, traffic patterns are easily influenced by many unpredictable external factors, such as policy interventions and climate changes. Due to the dynamic nature of these exogenous factors, the traffic network’s spatial-temporal patterns also change, which degrades the performance of traffic forecasting models. Consequently, there is an urgent need to rethink the traffic forecasting model in a fast-adaptive manner. To solve this challenge, this paper proposes an Adaptive Spatio-Temporal Context Learning framework named ASTCL, which achieves the desired forecasting accuracy using traffic data collected on a daily basis from dozens of sensors. ASTCL constructs adaptive spatio-temporal contexts for target locations in the traffic network and generates dynamic sequence graphs based on semantic similarities. The adaptive contexts aggregate valuable information from available data, while the graphs reveal dynamic trends in traffic properties. Further, ASTCL introduces a joint convolution and attention mechanism to model intricate spatio-temporal relationships from multiple perspectives. Extensive experiments conducted on four real-world datasets demonstrate that ASTCL achieves remarkably fast adaptability and outperforms other state-of-the-art methods by a significant margin.

Abstract:
Fine-grained urban flow inference (FUFI) is crucial for traffic management, as it infers high-resolution urban flow maps from coarse-grained observations. Existing FUFI methods typically focus on a single city and rely on comprehensive training with large-scale datasets to achieve precise inferences. However, data availability in developing cities may be limited, posing challenges to the development of well-performing models. To address this issue, we propose cross-city fine-grained urban flow inference, which aims to transfer spatio-temporal knowledge from data-rich cities to data-scarce areas using meta-transfer learning. This paper devises a Spatio-Temporal Deviation Alignment (STDA) framework to mitigate spatio-temporal distribution deviations and urban structural deviations between multiple source cities and the target city. Furthermore, STDA presents a cross-city normalization method that adaptively combines batch and instance normalization to maintain consistency between city-variant and city-invariant features. Besides, we design an urban structure alignment module to align spatial topological differences across cities. STDA effectively reduces distribution and structural deviations among different datasets while avoiding negative transfer. Extensive experiments conducted on three real-world datasets demonstrate that STDA consistently outperforms state-of-the-art baselines.

Abstract:
Sepsis is one of the main causes of death in ICU patients, and accurate and stable early prediction is essential for clinical intervention. Existing methods mostly rely on traditional time series models (e.g., LSTM, Transformer) or clinical scoring criteria (e.g., SOFA, qSOFA), but face two major challenges: 1) spurious correlations in the data affect the robustness of the model; 2) the underlying causal relationships in the data space are not modeled. We propose a Serialized Causal Disentanglement Model (SCDM) that decouples latent variables into sepsis-related factors (u), other disease-related factors (v), and irrelevant confounders (s). Based on the MIMIC-IV v2.2 dataset (3,511 positive samples and 17,538 negative samples), SCDM takes patient clinical indicators, personal information, and clinical notes as input, and achieves an AUC of 0.765-0.928 in the prediction task 48 to 0 hours before the onset of sepsis. This performance is significantly better than that of the baseline models (e.g., Transformer's 0.662-0.910, MGP-AttTCN's 0.692-0.913). Experiments show that optimizing the time window (5 hours of continuous observation) and variable selection (45 key indicators) can improve the performance of the model. The effectiveness of causal disentanglement is verified by Grad-CAM and t-SNE visualizations, and key clinical indicators such as platelet count, lactic acid, and respiratory rate are further identified to provide interpretable decision support for doctors. Our study provides a high-precision and interpretable causal disentanglement framework for early prediction of sepsis, which is expected to promote the development of intelligent diagnosis and treatment in the ICU.

Affiliations: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; School of Science and Engineering, Chinese University of Hong Kong (Shenzhen), Shenzhen, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore; School of Engineering, Westlake University, Hangzhou, China

Abstract:
Cross-domain recommendation (CDR) aims to alleviate the data sparsity problem by leveraging the benefits of modeling two domains. However, existing research often focuses on recommendation performance while ignoring the privacy leakage issue. We find that an attacker can infer user attribute information from the knowledge (e.g., user preferences) transferred between the source and target domains. For example, in our experiments, the average inference accuracies of attack models on gender and age attributes are 0.8323 and 0.3897. The best-performing attack model achieves accuracies of 0.8847 and 0.4634, exceeding random inference by 25.10% and 64.04%. The leakage of user attribute information may therefore significantly exceed what would be expected from random inference. In this paper, we propose a novel recommendation framework named CVGAE (short for camouflaged variational graph autoencoder), which effectively models user behaviors and mitigates the risk of user attribute information leakage at the same time. Specifically, our CVGAE combines the strengths of VAEs in capturing latent features and variability with the ability of GCNs to exploit high-order relational information. Moreover, to defend against attribute inference attacks without sacrificing recommendation performance, we design a user attribute protection module that fuses user attribute-camouflaged information with knowledge transfer during cross-domain processes. We then conduct extensive experiments on three real-world datasets, and find that our CVGAE achieves strong privacy protection while making little sacrifice in recommendation accuracy.

Abstract:
Missing values pose a formidable obstacle in multivariate time series analysis. Existing imputation methods rely on entangled representations that struggle to simultaneously capture multiple orthogonal time-series patterns, leading to suboptimal performance and limited interpretability. Meanwhile, requiring the entire data span as input renders these models impractical for long time series. To address these issues, we propose TIDER and its enhanced version, AdaTIDER. TIDER employs low-rank matrix factorization and disentangled temporal representations to model intricate dynamics like trend, seasonality, and local bias. However, TIDER is limited to single-period modeling and does not explicitly capture dependencies between channels. To overcome these limitations, AdaTIDER incorporates adaptive cross-channel dependency modeling and multi-period seasonality representations. These advancements enable it to dynamically capture variable relationships and complex multi-period patterns, significantly enhancing imputation accuracy and interpretability, while maintaining TIDER’s scalability. Extensive experiments on real-world datasets validate the superiority of our models in imputation accuracy, scalability, interpretability, and robustness.

Affiliations: Key Laboratory of Smart Farming for Agricultural Animals, Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, College of Informatics, Huazhong Agricultural University, Wuhan, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; College of Computer and Information, Hohai University, Nanjing, China; Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, Shenzhen, China; Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Abstract:
With the rapid expansion of interactions across various domains such as social networks, transaction networks, and IP-IP networks, anomaly detection in dynamic graphs has become increasingly critical for mitigating potential risks. However, existing anomaly detection methods often assume noise-free dynamic graphs, overlooking the prevalence of noisy dynamic graphs in real-world applications. Specifically, noisy dynamic graphs affected by structural noises (such as spurious and missing nodes and edges) struggle to consistently provide reliable structural evidence for anomaly detection. To tackle this challenge, we propose an Evolutionary Perception Method (EPM) for identifying anomalous nodes in noisy dynamic graphs by resisting the interference of structural noises. EPM primarily consists of two components: a dynamic fitter and a filtering reviser. The dynamic fitter characterizes the interaction dynamics of nodes that remove and generate links at each period as a multiple superposition state, utilizing various link prediction algorithms to fit evolutionary mechanisms. Additionally, the filtering reviser designs evolutional entropies to quantify the evolutional uncertainty in multiple superposition states, further reconstructing the Kalman filter to optimize these entropies. Extensive experiments demonstrate that our proposed EPM outperforms state-of-the-art methods in discovering anomalous nodes in noisy dynamic graphs.

Affiliations: Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong; Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, USA; School of Computer Science and Technology, Shandong University, Jinan, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China; Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Shaanxi, China

Abstract:
Blockchain has become a popular paradigm for secure and immutable data storage. Despite its numerous applications across various fields, concerns regarding the user privacy and result integrity during data queries persist. Additionally, the need for rich query functionalities to harness the full potential of blockchain data remains an area ripe for exploration. In order to address these challenges, our paper first utilizes a framework based on the Trusted Execution Environment (TEE) and oblivious RAM technique to achieve both privacy and data integrity. To enhance the query efficiency over the entire blockchain, we then devise a two-level learned indexing methodology named TELEX within the TEE for both integer and string keys. We also propose different query processing algorithms for versatile query types, including exact queries, aggregate queries, Boolean queries, and range queries. By implementing the prototype and conducting extensive evaluation, we demonstrate the feasibility and remarkable improvement in efficiency compared to existing solutions.

Abstract:
Online platforms aggregate extensive user feedback across diverse behaviors, providing a rich source for enhancing user engagement. Traditional recommender systems, however, typically optimize for a single target behavior and represent user preferences with a single vector, limiting their ability to handle multiple important behaviors or optimization objectives. This conventional approach also struggles to capture the full spectrum of user interests, resulting in a narrow item pool during candidate generation. To address these limitations, we present Tricolore, a versatile multi-vector learning framework that uncovers connections between different behavior types for more robust candidate generation. Tricolore's adaptive multi-task structure is also customizable to specific platform needs. To manage the variability in sparsity across behavior types, we incorporate a behavior-wise multi-view fusion module that dynamically enhances learning. Moreover, a popularity-balanced strategy ensures the recommendation list balances accuracy with item popularity, fostering diversity and improving overall performance. Extensive experiments on public datasets demonstrate Tricolore's effectiveness across various recommendation scenarios, from short video platforms to e-commerce. By leveraging a shared base embedding strategy, Tricolore also significantly improves the performance for cold-start users.

Abstract:
Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for 1) tracking EVs by analyzing inference inconsistencies across training epochs, 2) calibrating decision boundaries of inaccurate models with a new alignment loss, and 3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in terms of effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.
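As a rough illustration of what "tracking inference inconsistencies across training epochs" could look like, the sketch below counts how often each unlabeled sample's predicted label changes between consecutive epochs and selects the most unstable samples. This is only an assumed proxy for epistemic variation, not the scoring used in the paper; the names `flip_counts` and `select_top_k` are hypothetical.

```python
def flip_counts(pred_history):
    """Count label changes per sample across consecutive epochs.

    pred_history[e][i] is the predicted label of sample i at epoch e.
    The flip count is an assumed stand-in for epistemic variation.
    """
    n = len(pred_history[0])
    counts = [0] * n
    for prev, cur in zip(pred_history, pred_history[1:]):
        for i in range(n):
            if prev[i] != cur[i]:
                counts[i] += 1
    return counts


def select_top_k(pred_history, k):
    """Return indices of the k samples with the most prediction flips.

    Ties are broken by the smaller sample index.
    """
    counts = flip_counts(pred_history)
    order = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return order[:k]


# Four epochs of predictions for three unlabeled samples.
history = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
print(flip_counts(history))
print(select_top_k(history, 2))
```

Samples whose predictions keep flipping are plausibly the ones sitting near a decision boundary, which is why such a count is a natural first approximation.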

Abstract:
Multivariate time series prediction has attracted wide research interest for decades. However, spatial heterogeneity and temporal evolution characteristics pose many challenges for high-dimensional time series prediction. In this paper, a novel adaptive graph convolution module is introduced to automatically learn the spatial correlation of multivariate time series, and a Koopman-based neural differential equation is proposed to simulate the nonlinear system state evolution. In detail, the correlation between multivariate time series is revealed by the cosine similarity of node embeddings to infer the potential relationship between nodes, and a spatio-temporal feature fusion module is utilized. An LSTM-based network is adopted as the Koopman operator to reveal the latent states of spatio-temporal time series, and a reversibility assumption is imposed on the Koopman operator. Furthermore, Euler-trapezoidal integration is utilized to simulate the temporal dynamics, and multiple-step prediction is carried out in the latent space from the perspective of dynamical differential equations. The proposed model can explicitly discover the spatial correlation by adaptive graph convolution and reveal the temporal dynamics by the neural differential equation, which makes the modeling more interpretable. Simulation results show its effectiveness in spatio-temporal dynamic discovery and prediction performance.
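Two ingredients named in this abstract can be sketched in a few lines: an adjacency matrix built from the cosine similarity of node embeddings, and one Euler-trapezoidal integration step. The sketch assumes that "Euler-trapezoidal" refers to the Heun predictor-corrector scheme, that negative similarities are clipped to zero, that rows are normalised to sum to one, and that embeddings are non-zero vectors; none of these choices is confirmed by the abstract, and the function names are illustrative.

```python
import math


def cosine_adjacency(embeddings):
    """Row-normalised adjacency from pairwise cosine similarity.

    Assumes every embedding is a non-zero vector. Negative similarities
    are clipped to zero (an assumption, not taken from the paper).
    """
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    n = len(embeddings)
    adj = []
    for i in range(n):
        row = []
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            row.append(max(dot / (norms[i] * norms[j]), 0.0))
        total = sum(row)  # at least 1.0 because of the self-similarity
        adj.append([v / total for v in row])
    return adj


def heun_step(f, x, h):
    """One Euler-trapezoidal (Heun) step for dx/dt = f(x)."""
    k1 = f(x)
    x_pred = [xi + h * ki for xi, ki in zip(x, k1)]  # Euler predictor
    k2 = f(x_pred)
    # Trapezoidal corrector: average the slopes at both ends of the step.
    return [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]


adj = cosine_adjacency([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(adj[0])

# Linear decay dx/dt = -x as a stand-in for the learned latent dynamics.
print(heun_step(lambda x: [-v for v in x], [1.0], 0.1))
```

In the model described above the right-hand side `f` would be a learned network acting on latent states; the linear decay here is only a placeholder that makes the step easy to check by hand.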

Abstract:
Math word problem (MWP) serves as a critical milestone for assessing the text mining ability and knowledge mastery level of models. Recent advancements have witnessed large language models (LLMs) showcasing remarkable performance on MWP. However, current LLMs still frequently exhibit logical errors, which highlights their inability to fully grasp the knowledge required for genuine step-by-step mathematical reasoning. To this end, in this paper, we propose a novel Knowledge-guided Solver (KNOS) framework that empowers LLMs to simulate human mathematical reasoning, whose core idea is to Invoke-Verify-Inject necessary knowledge to solve MWP. We draw inspiration from the dual-process theory to construct two cooperative systems: a Knowledge System and an Inference System. Specifically, the Knowledge System employs LLMs as the knowledge base and develops a novel knowledge invoker that can elicit their relevant knowledge to support the strict step-level mathematical reasoning. In the Inference System, we propose a knowledge verifier and a knowledge injector to evaluate the knowledge rationality and further guide the step-wise symbolic deduction in an interpretable manner based on human cognitive mechanism, respectively. Moreover, to tackle the potential scarcity issue of mathematics-specific knowledge in LLMs, we consider an open-book exam scenario and propose an improved version of KNOS called EKNOS. In EKNOS, we meticulously design knowledge selectors to extract the most relevant commonsense and math formulas from external knowledge sources for each reasoning step. This knowledge is utilized to assist the knowledge invoker in better stimulating LLMs’ reasoning abilities. Both KNOS and EKNOS are flexible to empower different LLMs. Our experiments with GPT3, ChatGPT, and GPT4 not only demonstrate their reasoning accuracy improvement but also show how they bring the strict step-wise interpretability of mathematical thinking.

Abstract:
The query efficiency of Spark SQL is significantly impacted by its configurations. Therefore, configuration tuning has drawn great attention, and various automatic configuration tuning methods have been proposed. However, existing methods suffer from two issues: (1) high tuning overhead: they need to repeatedly execute the workloads several times to obtain the training samples, which is time-consuming; and (2) low throughput: they need to occupy resources like CPU cores and memory for a long time, causing other Spark SQL workloads to wait, thereby reducing the overall system throughput. These issues impede the use of automatic configuration tuning methods in practical systems which have limited tuning budget and many concurrent workloads. To address these issues, this paper proposes a Low-Overhead and Flexible approach for Spark SQL configuration Tuning, dubbed LOFTune. LOFTune reduces the tuning overhead via a sample-efficient optimization framework, which is based on multi-task SQL representation learning and a multi-armed bandit. Furthermore, LOFTune solves the low throughput issue with a recommendation-sampling-decoupled tuning framework. Extensive experiments validate the effectiveness of LOFTune. In the sampling-allowed case, LOFTune can save up to 90% of the workload runs compared with the state-of-the-art methods. Besides, in the zero-sampling case, LOFTune can reduce latency by up to 41.26%.

Abstract:
Next POI (Point-of-Interest) recommendation aims to forecast users’ future movements based on their historical check-in trajectories, holding significant value in location-based services. Existing methods address trajectory data sparsity by integrating rich auxiliary information or using spatial-temporal knowledge graphs (STKGs), showing promising results. Yet, they face two main challenges: i) Due to the difficulty of transforming structured trajectory data into trajectory text describing users’ spatial-temporal mobility, the powerful reasoning ability of pre-trained language models is rarely explored to enhance recommendation performance. ii) Methods based on STKG can introduce external knowledge inconsistent with user preferences, and the resulting knowledge noise hampers the accuracy of recommendations. To this end, we propose a novel approach called STKG-PLM that integrates STKG contrastive learning and a prompt-based pre-trained language model (PLM) to enhance next POI recommendation. Specifically, we design a spatial-temporal trajectory prompt template that transforms structured trajectories into a text corpus based on STKG, serving as the input of the PLM to understand the movement patterns of users from coarse-grained and fine-grained perspectives. Additionally, we propose an STKG contrastive learning framework to mitigate the introduced knowledge noise. Extensive experiments on three real-world datasets demonstrate that STKG-PLM exhibits notable performance improvements over the state-of-the-art baseline methods.

Abstract:
The class imbalance problem can cause classifiers to be biased toward the majority class and inclined to generate incorrect predictions. While existing studies have proposed numerous oversampling methods to alleviate class imbalance by generating extra minority class samples, these methods still have some inherent weaknesses and make the generated samples less informative. This study proposes a novel over-sampling method named the Expandable Borderline Smote (EB-Smote), which can address the weaknesses of existing over-sampling methods and generate more informative synthetic samples. In EB-Smote, not only minority class but also majority class is oversampled, and the synthetic samples are generated in the area between the selected minority and majority samples, which are close to the borderlines of their respective classes. EB-Smote can generate more informative samples by expanding the borderlines of minority and majority classes toward the actual decision boundary. Based on 27 imbalanced datasets and commonly used machine learning models, the experimental results demonstrate that EB-Smote significantly outperforms the other 8 existing oversampling methods. This study can provide theoretical guidance and practical recommendations to solve the crucial class imbalance problem in classification tasks.
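The core geometric step described here, generating a synthetic sample in the area between a selected minority sample and a selected majority sample, can be illustrated with SMOTE-style linear interpolation. The sketch below is a simplified assumption rather than EB-Smote itself: it skips the borderline-sample selection entirely and keeps each synthetic point on its own class's half of the segment, a choice made here only for illustration. The function names are hypothetical.

```python
import random


def interpolate(a, b, lam):
    """Point at fraction lam of the way from a to b."""
    return [x + lam * (y - x) for x, y in zip(a, b)]


def synth_between(minority_pt, majority_pt, rng, label="minority"):
    """Draw one synthetic sample on the segment between the two points.

    The sample stays within the half of the segment closest to its own
    class (lam in [0, 0.5]); this cap is an illustrative assumption.
    """
    lam = rng.uniform(0.0, 0.5)
    if label == "minority":
        return interpolate(minority_pt, majority_pt, lam)
    return interpolate(majority_pt, minority_pt, lam)


rng = random.Random(0)
print(synth_between([0.0, 0.0], [2.0, 2.0], rng, label="minority"))
print(synth_between([0.0, 0.0], [2.0, 2.0], rng, label="majority"))
```

Generating points for both classes along the same segment is what pushes each class's borderline toward the region where the decision boundary is expected to lie.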

Abstract:
Urban spatio-temporal prediction is crucial for informed decision-making, such as traffic management, resource optimization, and emergency response. Despite remarkable breakthroughs in pretrained natural language models that enable one model to handle diverse tasks, a universal solution for spatio-temporal prediction remains challenging. Existing prediction approaches are typically tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive domain-specific training data. In this study, we introduce UniST, a universal model designed for general urban spatio-temporal prediction across a wide range of scenarios. Inspired by large language models, UniST achieves success through: (i) utilizing diverse spatio-temporal data from different scenarios, (ii) effective pre-training to capture complex spatio-temporal dynamics, (iii) knowledge-guided prompts to enhance generalization capabilities. These designs together unlock the potential of building a universal model for various scenarios. Extensive experiments on more than 20 spatio-temporal scenarios, including grid-based data and graph-based data, demonstrate UniST’s efficacy in advancing state-of-the-art performance, especially in few-shot and zero-shot prediction.

Abstract:
3D spatial data management is increasingly vital across various application scenarios, such as GIS, digital twins, human atlases, and tissue imaging. However, the inherent complexity of 3D spatial data, primarily represented by 3D geometries in real-world applications, hinders the efficient evaluation of spatial relationships through resource-intensive geometric computations. Geometric simplification algorithms have been developed to reduce the complexity of 3D representations, albeit at the cost of querying accuracy. Previous work has aimed to address precision loss by leveraging the spatial relationship between the simplified and original 3D object representations. However, this approach relied on specialized geometric simplification algorithms tailored to regions with specific criteria. In this paper, we introduce a novel approach to achieve highly efficient and accurate 3D spatial queries, incorporating geometric computation and simplification. We present a generalized progressive refinement methodology applicable to general geometric simplification algorithms, involving accurate querying of 3D geometry data using low-resolution representations and simplification extents quantified using Hausdorff distances at the facet level. Additionally, we propose techniques for calculating and storing Hausdorff distances efficiently. Extensive experimental evaluations validate the effectiveness of the proposed method which outperforms state-of-the-art systems by a factor of 4 while minimizing computational and storage overhead.
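For readers unfamiliar with the Hausdorff distance used here to quantify simplification extent, the following minimal sketch computes it between two finite 3D point sets. It works on sampled vertices only, whereas the abstract describes facet-level distances, so it illustrates the definition rather than the paper's method. It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# A simplified shape that dropped the vertex at z = 2 of the original.
simplified = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(hausdorff(simplified, original))  # 2.0
```

A bound of this kind is what lets a query run on the low-resolution representation first: if the simplified geometry deviates from the original by at most the Hausdorff distance, a distance measured on the simplified geometry is off by no more than that amount.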

Abstract:
Few-shot knowledge graph completion (FKGC) addresses the long-tail problem of relations by leveraging a few observed support entity pairs to infer unknown facts for tail-located relations. Learning the relation representation of entity pairs and evaluating the match of query and support entity pairs are the two key steps of FKGC. Existing methods learn the representation of entity pairs by either aggregating neighbors of entities or integrating relation representations in the connected paths from head to tail. However, in few-shot scenarios, the limited number of support entity pairs and insufficient structural information with a single neighborhood topology will lead to matching failure. To this end, we consider the star and ring topological information for a given entity pair: (1) Entity neighborhood, which captures multi-hop neighbors of entities; (2) Relational path, which characterizes compound relation forms. Furthermore, to effectively fuse the two kinds of heterogeneous topological information, we design the multi-aggregator and the fine-grained path correlation matching algorithm to obtain more delicate and balanced matching. Based on the proposed relational path correlation matching module, we propose the relation adaptive network to solve the few-shot temporal knowledge graph completion problem. The experimental results show that our method consistently outperforms the state-of-the-art methods.

Abstract:
Self-supervised tasks show significant advantages for node representation learning in recommender systems. The core idea of self-supervised recommender systems is to use data augmentation to generate multi-view representations. However, two key challenges are not well explored in existing self-supervised tasks: i) Restricted by the structure of the graph-based CF paradigm itself, the classical graph contrastive learning architecture ignores the global structural information on the user-item interaction graph. ii) Random graph data augmentation schemes, a key part of existing contrastive learning, can significantly deteriorate model performance. To address these challenges, we propose a new hypergraph collaborative filtering with adaptive augmentation framework (HCFAA). It captures both local and global collaborative relationships on the user-item graph through a hypergraph-enhanced joint learning architecture. In particular, the designed adaptive structure-guided model ignores the noise introduced on unimportant edges, and thus learns the critical node information on the user-item graph. Comprehensive experimental studies on the Amazon dataset show that the method is effective, which provides an optimization scheme with a new perspective for the problems of key node loss in graph data augmentation and loss of higher-order structural information in GNNs. The source code of our model is available at https://github.com/RSnewbie/RS/tree/master/HCFAA.

Abstract:
Neural architecture search (NAS) is widely used to automate the design of high-accuracy deep architectures, which are often vulnerable to adversarial attacks in practice due to the lack of adversarial robustness. Existing methods directly apply a regularized optimization process to address this critical issue, which leaves end users without an interpretable account of how the robust architecture is constructed. In this paper, we introduce a robust enhanced plugin (REP) method for differentiable NAS to search for robust neural architectures. Different from existing peer methods, REP focuses on the robust search primitives in the search space of NAS methods, and naturally has the merit of contributing to understanding how the robust architectures are progressively constructed. Specifically, we first propose an effective sampling strategy to sample robust search primitives in the search space. In addition, we also propose a probabilistic enhancement method to guarantee natural accuracy and adversarial robustness simultaneously during the search process. We conduct experiments on both convolutional neural networks and graph neural networks with widely used benchmarks against state-of-the-art methods. The results reveal that REP can achieve superiority in terms of both the adversarial robustness to popular adversarial attacks and the natural accuracy of original data. REP is flexible and can be easily used by any existing differentiable NAS methods to enhance their robustness without much additional effort.

Abstract:
Graph Neural Networks (GNNs) show great power in Knowledge Graph Completion (KGC) as they can handle non-Euclidean graph structures and do not depend on the specific shape or topology of the graph. However, many current GNN-based KGC models have difficulty in effectively capturing and utilizing the substantial structure and global semantic information in Knowledge Graphs (KGs). For more effective use of GNN for KGC, we innovatively propose the Semantic Similarity-based Interaction Graph Attention Network (SemSI-GAT) for the KGC task. In SemSI-GAT, we utilize BERT, a pre-trained language model, to learn the global semantic information and obtain semantic similarity between entities and their neighbors. Furthermore, we creatively design a novel encoder network called the interaction graph attention network and introduce a semantic similarity sampling mechanism to optimize the aggregation of interaction information between neighbors. By aggregating local features with interaction features, this network can generate more expressive structural embeddings. This network generates more expressive embeddings by fusing global semantic information, local structure features, and interaction features. The experimental evaluations demonstrate that the proposed SemSI-GAT outperforms existing state-of-the-art KGC methods on four benchmark datasets.

Abstract:
Sequential recommendation systems aim to predict the future behaviors of users based on their historical interactions. Despite the success of neural architectures like Transformer and Graph Neural Networks, these models often struggle with the inherent challenge of sparse data in accurately predicting future user behaviors. To alleviate the data sparsity problem, some methods leverage the contrastive learning to generate contrastive views, assuming the items appear discretely at the same time intervals and focusing on the sequence order. However, these approaches neglect the crucial temporal-aware collaborative patterns hidden within the user-item interactions, leading to a limited variety of contrastive pairs and less informative embeddings. The proposed framework, Temporal-aware graph contrastive learning with theoretical guarantees for sequential Recommendation (TagRec), integrates temporal-aware collaborative patterns with adaptive data augmentation to generate more informative user and item representations. TagRec employs a temporal-aware graph neural network to embed the original graph, then generates augmented graphs through the addition of interactions via latent user interest mining, the dropping of redundant interaction edges, and the perturbation of temporal information. Theoretical guarantees are provided that these augmentations enhance the graph’s utility. Extensive experiments on real-world datasets demonstrate the superiority of the proposed approach over the state-of-the-art recommendation methods.

Abstract:
The temporal point process, a stochastic process on a continuous time domain, is commonly used to model asynchronous event sequences featuring occurrence timestamps. Thanks to their strong expressivity, deep neural networks are emerging as a promising choice for capturing the patterns in asynchronous sequences in the context of temporal point processes. In this paper, we first review recent research emphasis and difficulties in modeling asynchronous event sequences with deep temporal point processes, which can be summarized into four aspects: encoding of history sequence, formulation of conditional intensity function, relational discovery of events, and learning approaches for optimization. We introduce most of the recently proposed models by dismantling them into four parts and conduct experiments by re-modularizing the first three parts with the same learning strategy for a fair empirical evaluation. Besides, we extend the history encoders and conditional intensity function family and propose a Granger causality discovery framework for exploiting the relations among multiple types of events. Because the Granger causality can be represented by the Granger causality graph, discrete graph structure learning in the framework of Variational Inference is employed to reveal latent structures of the graph. Further experiments show that the proposed framework with latent graph discovery can both capture the relations and achieve an improved fitting and predicting performance.

Abstract:
Big time series are increasingly available from an ever wider range of IoT-enabled sensors deployed in various environments. Significant insights can be gained by mining temporal patterns from these time series. Temporal pattern mining (TPM) extends traditional pattern mining by adding event time intervals into extracted patterns, making them more expressive at the expense of increased time and space complexities. Besides frequent temporal patterns (FTPs), which occur frequently in the entire dataset, another useful type of temporal pattern is the so-called rare temporal pattern (RTP), which appears rarely but with high confidence. Mining rare temporal patterns yields additional challenges. For FTP mining, the temporal information and complex relations between events already create an exponential search space. For RTP mining, the support measure is set very low, leading to a further combinatorial explosion and potentially producing too many uninteresting patterns. Thus, there is a need for a better approach to mine frequent and rare temporal patterns. This paper presents our Generalized Temporal Pattern Mining from Time Series (GTPMfTS) approach that can mine both types of patterns, with the following specific contributions: (1) The end-to-end GTPMfTS process taking time series as input and producing frequent/rare temporal patterns as output. (2) The efficient Generalized Temporal Pattern Mining (GTPM) algorithm mines frequent and rare temporal patterns using efficient data structures for fast retrieval of events and patterns during the mining process, and employs effective pruning techniques for significantly faster mining. (3) An approximate version of GTPM that uses mutual information, a measure of data correlation, to prune unpromising time series from the search space. (4) An extensive experimental evaluation of GTPM for rare temporal pattern mining (RTPM) and frequent temporal pattern mining (FTPM), showing that RTPM and FTPM significantly outperform the baselines on runtime and memory consumption, and can scale to big datasets. The approximate RTPM is up to one order of magnitude, and the approximate FTPM is up to two orders of magnitude, faster than the baselines, while retaining high accuracy.
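
The mutual-information pruning idea mentioned in the abstract above can be illustrated with a small sketch. It assumes the time series have already been converted to discrete symbol sequences of equal length; the function name `mutual_information` and the pruning threshold are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two aligned symbol sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Hypothetical symbolized series from three sensors.
kitchen = ["on", "off", "on", "off"]
toaster = ["on", "off", "on", "off"]   # moves together with `kitchen`
hallway = ["on", "on", "off", "off"]   # independent of `kitchen`

THRESHOLD = 0.5  # assumed pruning threshold, in bits
keep_toaster = mutual_information(kitchen, toaster) >= THRESHOLD
keep_hallway = mutual_information(kitchen, hallway) >= THRESHOLD
```

Under this assumed threshold, the pair (`kitchen`, `toaster`) stays in the search space while (`kitchen`, `hallway`) is pruned before any pattern mining takes place.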

Abstract:
A warehouse-distribution integration (WDI) e-commerce platform is an approach that combines warehousing and distribution processes, which is increasingly adopted in industry to enhance business efficiency. In the WDI e-commerce, one of the most important problems is to estimate the full-link delivery time for decision-making. Traditional methods designed for separate warehouse-distribution models struggle to address challenges in integrated systems. The difficulties stem from two main factors: (i) the contextual influence exerted by neighboring units within heterogeneous delivery networks, and (ii) the uncertainty in delivery times caused by dynamic and periodic temporal factors such as fluctuations in online sales volumes and the varying characteristics of different delivery units (e.g., warehouses and sorting centers). To address these challenges, we propose a novel full-link delivery time estimation framework called Heterogeneous Periodic Spatial-Temporal Graph Transformer (HPST-GT). First, we develop heterogeneous graph transformers to capture the hierarchical and diverse information of the warehouse-distribution network. Next, we design spatial-temporal transformers based on heterogeneous features to analyze the correlation between spatial and temporal information. Finally, we create a heterogeneous spatial-temporal graph prediction module to estimate full-link delivery time. Our method, evaluated on a one-month dataset from a leading e-commerce platform, surpasses current benchmarks across multiple performance metrics.

Abstract:
Privacy-preserving collaborative data analysis has been a popular research direction in recent years. Among all such analysis tasks, privacy-preserving SQL queries on multi-party databases are of particular industrial interest. Although the privacy concern can be addressed by many cryptographic tools, such as secure multi-party computation (MPC), the efficiency of executing such SQL queries is far from satisfactory, especially for high-volume databases. In particular, existing MPC-based solutions treat each SQL query as an isolated task and launch it from scratch, despite the fact that many SQL queries are run regularly and partially overlap in their functionalities. In this work, we are motivated to exploit this property to improve the efficiency of MPC-based, privacy-preserving SQL queries. We introduce a cache-like optimization mechanism. To ensure a higher cache hit rate and reduce redundant MPC operators, we present a cache structure different from that of plain databases and design a set of cache strategies. Our optimization mechanism, SMPCache, can be built upon secret-sharing-based MPC frameworks, which attract much attention from the industry. To demonstrate the utility of SMPCache, we implement it on Rosetta, an open-source MPC library, and use real-world datasets to launch extensive experiments on some basic SQL operators (e.g., Filter, Order-by, Aggregation, and Inner-Join) and some representative composite SQL queries. To give a data point, we note that SMPCache can achieve up to 3536× efficiency improvement on the TPC-DS dataset and 562× on the TPC-H dataset at a moderate storage cost. We also apply SMPCache to the basic SQL operators (Filter, Order-by, Group-by, Aggregation, and Inner-join) of the Secrecy framework, achieving up to 127.3× efficiency improvement.

Abstract:
The widespread deployment of wireless and mobile devices results in a proliferation of decentralized spatio-temporal data. Many recent proposals that target deep learning for spatio-temporal prediction assume that all data is available at a central location and suffer from so-called catastrophic forgetting, where previously learned knowledge is entirely forgotten when new data arrives. Such proposals may face data privacy concerns and may experience deteriorating prediction performance when applied in decentralized settings where data streams into the system. To bridge the gap between decentralized training and spatio-temporal prediction on streaming data, we propose a unified federated continuous learning framework, which uses a horizontal federated learning mechanism for protecting data privacy and includes a global replay buffer with synthetic spatio-temporal data generated by the previously learned global model. For each client, we fuse the current training data with synthetic spatio-temporal data using a spatio-temporal mixup mechanism to preserve historical knowledge effectively, thus avoiding catastrophic forgetting. To enable holistic representation preservation, the local models at clients each integrate a general spatio-temporal autoencoder with a spatio-temporal simple siamese network that aims to ensure prediction accuracy and avoid holistic feature loss. Extensive experiments on real data offer insight into the effectiveness of the proposed framework.
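
The fusion of current and synthetic data described above can be pictured with the standard mixup operation. The sketch below is an assumption-laden illustration, not the framework's actual spatio-temporal mixup: it uses the generic convex combination with a Beta-distributed coefficient, represents a sample as a nested list (time steps × locations), and the names `mixup` and `alpha` are hypothetical.

```python
import random

def mixup(current, synthetic, alpha=0.4, rng=random):
    """Blend a current sample with a replayed synthetic sample.

    Generic mixup: mixed = lam * current + (1 - lam) * synthetic,
    with lam drawn from Beta(alpha, alpha). Both inputs are nested lists
    of the same shape (time steps x locations).
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * c + (1.0 - lam) * s for c, s in zip(c_row, s_row)]
        for c_row, s_row in zip(current, synthetic)
    ]
    return mixed, lam

# Hypothetical 2 time steps x 3 locations readings.
current = [[10.0, 12.0, 9.0], [11.0, 13.0, 8.0]]
synthetic = [[6.0, 7.0, 5.0], [6.5, 7.5, 5.5]]

mixed, lam = mixup(current, synthetic, rng=random.Random(0))
```

Every mixed value lies between the corresponding current and synthetic readings, so the replayed data pulls each training sample toward what the previous global model had learned.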

Abstract:
A conversational recommendation system extracts the user's preferences and recommends suitable items through human-like responses. Existing methods often combine feature extraction with the Transformer model to extract user preferences and make recommendations. However, these methods have two limitations. First, they do not consider the order in which entities appear, thus affecting the extraction of user preferences. Second, the generated responses lack diversity, which affects users’ experience with the system. To this end, we propose a conversational recommendation model with User Entity focus and Multi-Granularity latent variable enhancement (UEMG). For the first issue, we design a novel neural network that utilizes Bi-GRU to capture the order in which entities appear in dialogues, leverages Transformer to capture the global dependencies of entities, and then combines them to extract user preferences. For the second issue, to improve the diversity of dialogue generation, we propose a multi-granularity latent variable mechanism, which can extract more entities from the context information and the knowledge graphs, respectively. We conducted extensive experiments on publicly available dialogue generation datasets. Experimental results demonstrate that compared to current state-of-the-art methods, UEMG achieves 9.7% improvements in recommendation performance and 23% improvements in dialogue generation.

Abstract:
Privacy concerns in recommender systems are potentially addressed due to constitutional and commercial requirements. Centralized recommendation models are susceptible to poisoning attacks, which threaten their integrity. In this context, federated learning has emerged as an optimal solution to privacy concerns. However, recent investigations proved that Federated Recommender Systems (FedRS) are also vulnerable to model poisoning attacks. Existing attack possibilities highlighted in academic literature require a large fraction of Byzantine clients to effectively influence the training process, which is unrealistic for practical systems with millions of users. Additionally, most attack models neglected the role of the defense mechanism running at the aggregation server. To this end, we propose a novel undetectable hidden attack strategy (HidAttack) for FedRS, aiming to raise the exposure ratio of targeted items with minimum Byzantine clients. To achieve this goal, we construct a cluster of baseline attacks, on top of which a bandit model is designed that intelligently infers effective poisoned gradients. It ensures a diverse pattern of poisoned gradients and therefore, Byzantine clients cannot be distinguished from benign clients by the defense mechanism. Extensive experiments demonstrate that: 1) our attack model significantly increases the target item's exposure rate covertly without compromising the recommendation accuracy and 2) the current defenses are insufficient, emphasizing the need for better security improvements against our model poisoning attack to FedRS.

Abstract:
The Domain Name System (DNS) is a critical Internet service that translates domain names into IP addresses, but it is often targeted by attackers, posing a serious security risk. Graph-based models for detecting malicious domains have shown high performance but are vulnerable to adversarial attacks. To address this issue, we propose RMD-Graph, which is characterized by its ability to resist adversarial attacks and its low dependency on labeled data. A dual denoising module is specifically designed based on two autoencoders to generate the reconstructed graph, where SVD, TOP-k and reconstruction loss are introduced to enhance the denoising capability of autoencoders. Subsequently, residual connections are employed to generate an optimized graph that retains essential information from the original graph. The reconstructed graph and the optimized graph are then utilized as two views for graph contrastive learning, thereby achieving a self-supervised representation learning task without labels. In the downstream malicious domain detection, the denoised node representations are employed for machine learning classification. Extensive experiments are conducted on publicly available DNS datasets, and the results demonstrate that RMD-Graph significantly outperforms known baseline methods, especially in adversarial scenarios.

Abstract:
Spatiotemporal trajectories are sequences of timestamped locations, which enable a variety of analyses that in turn enable important real-world applications. It is common to map trajectories to vectors, called embeddings, before subsequent analyses. Thus, the qualities of embeddings are very important. Methods for pre-training embeddings, which leverage unlabeled trajectories for training universal embeddings, have shown promising applicability across different tasks, thus attracting considerable interest. However, research progress on this topic faces two key challenges: a lack of a comprehensive overview of existing methods, resulting in several related methods not being well-recognized, and the absence of a unified pipeline, complicating the development of new methods and the analysis of methods. We present UniTE, a survey and a unified pipeline for this domain. In doing so, we present a comprehensive list of existing methods for pre-training trajectory embeddings, which includes methods that either explicitly or implicitly employ pre-training techniques. Further, we present a unified and modular pipeline with publicly available underlying code, simplifying the process of constructing and evaluating methods for pre-training trajectory embeddings. Additionally, we contribute a selection of experimental results using the proposed pipeline on real-world datasets.

Abstract:
Amidst the rapid propagation of multimodal fake news across social media platforms, the detection of fake news has emerged as a prime research pursuit. To detect increasingly meticulous fabrications, propagation paths are introduced to provide nuanced social context that enhances the basic semantic analysis of the news content. However, existing propagation-enhanced models encounter a dilemma between detection efficacy and social hazard. In this paper, we explore the novel problem of early fake news detection through the generation of propagation paths, which benefits from the extensive social context within propagation paths while mitigating potential social hazards. To address these challenges, we propose a novel Reinforced Propagation Path Generation Fake News Detection model, RPPG-Fake. Departing from conventional discriminative approaches, RPPG-Fake captures the propagation topology pattern from a heterogeneous social graph and generates the propagation paths to detect fake news effectively under a reinforcement learning paradigm. Our proposal is extensively evaluated over three popular datasets, and experimental results demonstrate the superiority of our proposal.

Abstract:
Traditional recommendation systems focus on maximizing user satisfaction by suggesting their favorite items. This user-centric approach may lead to unfair exposure distribution among the providers. Conversely, a provider-centric design might become unfair to the users. Therefore, this paper proposes a re-ranking model, FairSort, to find a trade-off solution among user-side fairness, provider-side fairness, and personalized recommendation utility. Previous works habitually treat this issue as a knapsack problem, incorporating both-side fairness as constraints. In this paper, we adopt a novel perspective, treating each recommendation list as a runway rather than a knapsack. In this perspective, each item on the runway gains a velocity and runs within a specific time, achieving re-ranking for both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for personalized recommendations by designing a Binary Search approach. This can provide more reliable recommendations compared to the conventional greedy strategy based on the knapsack problem. We further broaden the applicability of FairSort, designing two versions for online and offline recommendation scenarios. Theoretical analysis and extensive experiments on real-world datasets indicate that FairSort can ensure more reliable personalized recommendations while considering fairness for both the provider and user.

Abstract:
Feature weighting aims to assign different weights to features based on their importance in machine learning tasks. In clustering tasks, the existing methods learn feature importance based on the clustering results derived from the collaborative contribution of all features, which overlooks the independent effect of each feature. In fact, there are underlying causal relationships between features and the clustering results, and the features with high causal effects are always more crucial for clustering. Therefore, we propose an enhanced Feature Weighting method via Causal Effect for Clustering, calculating the causal effect of each feature on the clustering results for obtaining the independent contribution of each feature. Specifically, we start by identifying the causal relationships among the features and utilizing the causal relationships to generate a reasonable treatment group. Next, we compare the changes in the data distribution between the treatment and control groups to determine the causal effect of each feature. Finally, the causal effects of features are used for enhancing the clustering-driven weight learning. Moreover, we present a theory of relative order consistency in causal effect. Experimental results demonstrate that utilizing causal effect in weight learning facilitates efficient convergence and achieves superior accuracy compared to state-of-the-art clustering algorithms.

Abstract:
Individual mobility prediction holds significant importance in urban computing, supporting various applications such as place recommendations. Current studies primarily focus on frequent mobility patterns, including commuting trips to residences and workplaces. However, such studies do not accurately forecast irregular trips, that is, journeys that end at locations other than residences and workplaces. Despite their usefulness in recommendations and advertising, the stochastic, infrequent, and spontaneous nature of irregular trips makes them challenging to predict. To address the difficulty, this study proposes a web search-driven bipartite graph neural network, namely WS-BiGNN, for the individual irregular mobility prediction (IIMP) problem. Specifically, we construct bipartite graphs to represent mobility and web search records, formulating the IIMP problem as a link prediction task. First, WS-BiGNN employs user-user edges and POI-POI edges (POI: point-of-interest) to bolster information propagation within sparse bipartite graphs. Second, the temporal weighting module is created to discern the influence of past mobility and web searches on future mobility. Lastly, WS-BiGNN incorporates the search-mobility memory module, which classifies four interpretable web search-mobility patterns and harnesses them to improve prediction accuracy. We perform experiments utilizing real-world data in Tokyo from October 2019 to March 2020. The results showcase the superior performance of WS-BiGNN compared to baseline models, as supported by higher scores in Recall and NDCG. The exceptional performance and additional analysis reveal that infrequent behavior may be effectively predicted by learning search-mobility patterns at the individual level.

Abstract:
Anomalies often occur in real-world information networks/graphs, such as malevolent users in online review networks and fake news in social media. When representing such structured network data as graphs, anomalies usually appear as anomalous nodes that exhibit significantly deviated structure patterns, different attributes, or both. To date, numerous unsupervised methods have been developed to detect anomalies based on residual analysis, which assumes that anomalies will introduce larger residual errors (i.e., graph reconstruction loss). While these existing works achieved encouraging performance, in this paper, we formally prove that their employed learning objectives, i.e., MSE and cross-entropy losses, encounter significant limitations in learning the major data distributions, particularly for anomaly detection, and through our preliminary study, we reveal that the vanilla residual analysis-based methods cannot effectively investigate the rich graph structure. Building on these discoveries, we propose a novel structure-biased graph anomaly detection framework (SALAD) to capture anomalies’ divergent patterns with the assistance of a specially designed node representation augmentation approach. We further present two effective training objectives to empower SALAD to effectively capture the major structure and attribute distributions by emphasizing less on anomalies that introduce higher reconstruction errors under the encoder-decoder framework. The detection performance on eight widely-used datasets demonstrates SALAD's superiority over twelve state-of-the-art baselines. Additional ablation and case studies validate that our data augmentation method and training objectives account for the impressive performance.

Abstract:
Graph Neural Networks (GNNs) have aroused increasing research attention for their effectiveness on graph mining tasks. However, full-batch training methods based on stochastic gradient descent (SGD) require substantial resources since all gradient-required computational processes are stored in the acceleration device. The bottleneck of storage challenges the training of classic GNNs on large-scale datasets within one acceleration device. Meanwhile, message-passing based (spatial) GNN designs usually necessitate the homophily hypothesis of the graph, which easily fails on heterophilous graphs. In this paper, we propose the random walk extension for those message-passing based GNNs, enriching them with spectral powers. We prove that our random walk sampling with appropriate correction coefficients generates an unbiased approximation of the K-order polynomial filter matrix, thus promoting the neighborhood aggregation of the central nodes. Node-wise sampling strategy and historical embedding allow the classic models to be trained with mini-batches, which extends the scalability of the basic models. To show the effectiveness of our method, we conduct a thorough experimental analysis on some frequently-used benchmarks with diverse homophily and scale. The empirical results show that our model achieves significant performance improvements in comparison with the corresponding base GNNs and some state-of-the-art baselines in node classification tasks.

Abstract:
Mining multiple longest common subsequences (MLCS) from a set of sequences of length three or more over a finite alphabet (a classical NP-hard problem) is an important task in many fields, e.g., bioinformatics, computational genomics, pattern recognition, information extraction, etc. Applications in these fields often involve generating very long sequences (length ≥ 10,000), referred to as big sequences. Despite efforts in improving the time and space complexities of MLCS mining algorithms, both existing exact and approximate algorithms face challenges in handling big sequences due to the overwhelming size of their problem-solving graph model MLCS-DAG (Directed Acyclic Graph), leading to the issue of memory explosion or extremely high time complexity. To bridge the gap, this paper first proposes a new identification and deletion strategy for different classes of non-critical points in the mining of MLCS, which are the points that do not contribute to their MLCSs mining in the MLCS-DAG. It then proposes a new MLCS problem-solving graph model, namely DAG_KP (a new MLCS-DAG containing only Key Points). A novel parallel MLCS algorithm, called KP-MLCS (Key Point based MLCS), is also presented, which can mine and compress all MLCSs of big sequences effectively and efficiently. Extensive experiments on both synthetic and real-world biological sequences show that the proposed algorithm KP-MLCS drastically outperforms the existing state-of-the-art MLCS algorithms in terms of both efficiency and effectiveness.

Abstract:
Multi-view clustering is an important approach to mining the valuable information within multi-view data. In this paper, we propose a novel multi-view deep subspace clustering method based on contrastive learning and Cauchy-Schwarz (CS) divergence. Our method not only uses contrastive learning techniques and block diagonalization constraints to guide representation matrix learning, but also combines representation learning and clustering processes to achieve the interaction of representation and clustering. First, we introduce a novel loss function based on CS divergence in the clustering module to achieve the interaction of representation and clustering. Second, we propose an extension of the multiple positive and negative pair diffusion method to enhance contrastive learning. Finally, we establish the equivalence between contrastive clustering and spectral clustering with orthogonal constraints, leading to a comprehensive model optimization. We evaluate our method on six publicly available datasets and compare its performance with eight competing methods. The results demonstrate the superiority of our method over the compared multi-view clustering methods.

Abstract:
Selecting key data subsets for model training is an effective way to improve training efficiency. Existing methods generally utilize a well-trained model to evaluate samples and select crucial subsets, ignoring the fact that the sample importance changes dynamically during model training, resulting in the selected subset only being critical in a specific training epoch rather than a changing training phase. To address this issue, we attempt to evaluate the significant changes in sample importance during dynamic training and propose a novel data selection method to improve model training efficiency. Specifically, the temporal changes in sample importance are considered from three perspectives: (i) loss, the difference between the predicted labels and the true labels of samples in the current training epoch; (ii) instability, the dispersion of sample importance in the recent training phase; and (iii) inconsistency, the comparison of the changing trend in the importance of an individual sample relative to the average importance of all samples in the recent training phase. Extensive experiments demonstrate that dynamic data selection can reduce computational costs and improve model training efficiency. Additionally, we find that the difficulty level of the training task influences the data selection strategy.
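
As a rough illustration of the three perspectives above, the sketch below scores samples from a matrix of per-epoch losses. It assumes, purely for illustration, that sample importance is the per-sample loss, that instability is its standard deviation over a recent window, and that inconsistency is the absolute gap between a sample's loss trend and the average trend. The function names, the window size, and the equal weighting of the three terms are hypothetical and are not taken from the paper.

```python
import numpy as np

def selection_scores(loss_history, window=3):
    """Score samples from a (epochs, samples) array of per-sample losses."""
    recent = loss_history[-window:]               # recent training phase
    loss = loss_history[-1]                       # (i) loss in the current epoch
    instability = recent.std(axis=0)              # (ii) dispersion over the window
    trend = recent[-1] - recent[0]                # per-sample change over the window
    inconsistency = np.abs(trend - trend.mean())  # (iii) gap to the average trend
    return loss + instability + inconsistency     # assumed equal weighting

def select_subset(loss_history, keep_ratio=0.5, window=3):
    """Return the sorted indices of the highest-scoring samples."""
    scores = selection_scores(loss_history, window)
    k = max(1, int(round(keep_ratio * scores.size)))
    return np.sort(np.argsort(-scores)[:k])
```

Recomputing the scores every few epochs, rather than once with a fully trained model, is what makes the selection follow the changing training phase.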

Abstract:
Existing geometric knowledge graph embedding methods employ various relational transformations, such as translation, rotation, and projection, to model different relation patterns, aiming to enhance the expressiveness of models. In contrast to current approaches that treat the expressiveness of the model as a binary issue, we aim to delve deeper into analyzing how difficult it is for geometric knowledge graph embedding models to represent relation patterns. In this paper, we provide a theoretical analysis framework that measures the expressiveness of the model in relation patterns by quantifying the size of the solution space of linear equation systems. Additionally, we propose a mechanism for imposing relational constraints on geometric knowledge graph embedding models by setting “traps” near relational optimal solutions, which enables the model to better converge to the optimal solution. Empirically, we analyze and compare several typical knowledge graph embedding models with different geometric algebras, revealing that some models have insufficient solution space due to their design, which leads to performance weaknesses. We also demonstrate that the proposed relational constraint operations can improve the performance of certain relation patterns. The experimental results on public benchmarks and a relation-pattern-specific dataset are consistent with our theoretical analysis.

Abstract:
Autonomous index tuning (“auto-indexing” for short) has recently started being supported by cloud database service providers. Index tuners rely on the query optimizer's cost estimates to recommend indexes that can minimize the execution cost of an input workload. Such cost estimates are often erroneous, which can lead to significant query performance regression. To reduce the chance of regression, existing work primarily uses machine learning (ML) technologies to build prediction models that improve query execution cost estimation, using actual query execution telemetry as training data. However, training data collection is typically an expensive process, especially for index tuning due to the significant overhead of creating/dropping indexes. As a result, the amount of training data can be limited in auto-indexing for cloud databases. In this paper, we propose a new approach named “hybrid cost modeling” to address this challenge. The key idea is to limit the ML-based modeling effort to the leaf operators such as table scans, index scans, and index seeks, and then combine the ML-model predicted costs of the leaf operators with the optimizer's estimated costs of the other operators in the query plan. We conduct a theoretical study as well as an empirical evaluation to demonstrate the efficacy of applying hybrid cost modeling to index tuning, using both industrial benchmarks and real workloads.
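
The key idea can be sketched as a recursive walk over a query plan tree. This is a minimal illustration that assumes the plan cost is the sum of per-operator costs; the abstract does not state how the two kinds of cost are combined, so the additive rule, the `PlanNode` type, the operator names, and the `ml_leaf_cost` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Leaf operators whose cost comes from the ML model (names are illustrative).
LEAF_OPS = {"TableScan", "IndexScan", "IndexSeek"}

@dataclass
class PlanNode:
    op: str
    optimizer_cost: float  # optimizer's estimate for this operator alone
    children: List["PlanNode"] = field(default_factory=list)

def hybrid_cost(node: PlanNode, ml_leaf_cost: Callable[[PlanNode], float]) -> float:
    """Use the ML prediction for leaf operators and the optimizer elsewhere."""
    if node.op in LEAF_OPS and not node.children:
        return ml_leaf_cost(node)
    # Assumed additive combination of this operator and its subtrees.
    return node.optimizer_cost + sum(
        hybrid_cost(child, ml_leaf_cost) for child in node.children
    )
```

Restricting the learned model to leaf operators keeps its input space small, which fits the limited training data described above.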

Abstract:
When handling streaming graphs, existing graph representation learning models encounter a catastrophic forgetting problem, where previously learned knowledge of these models is easily overwritten when learning with newly incoming graphs. In response, Continual Graph Learning (CGL) emerges as a novel paradigm enabling graph representation learning from static to streaming graphs. Our prior work, Condense and Train (CaT) (Liu et al. 2023), is a replay-based CGL framework with a balanced continual learning procedure, which designs a small yet effective memory bank for replaying data by condensing incoming graphs. Although CaT alleviates the catastrophic forgetting problem, three issues remain: (1) The graph condensation algorithm derived in CaT only focuses on labelled nodes while neglecting abundant information carried by unlabelled nodes; (2) The continual training scheme of CaT overemphasises the previously learned knowledge, limiting the model capacity to learn from newly added memories; (3) Both the condensation process and replaying process of CaT are time-consuming. In this paper, we propose a PsUdo-label guided Memory bAnk (PUMA) CGL framework, extending from CaT to enhance its efficiency and effectiveness by overcoming the above-mentioned weaknesses and limits. To fully exploit the information in a graph, PUMA expands the coverage of nodes during graph condensation with both labelled and unlabelled nodes. Furthermore, a training-from-scratch strategy is proposed to upgrade the previous continual learning scheme for a balanced training between the historical and the new graphs. Besides, PUMA uses a one-time propagation and wide graph encoders to accelerate the graph condensation and the graph encoding process in the training stage to improve the efficiency of the whole framework. Extensive experiments on seven datasets for the node classification task demonstrate the state-of-the-art performance and efficiency over existing methods.

Abstract:
Graph neural networks (GNNs) are effective models for analyzing graph-structured data, but encounter challenges when training on large distributed graphs. Existing GNN training frameworks use sampling parallelism and historical embedding methods to support distributed training and enhance efficiency. However, these methods suffer from issues like stale historical embeddings, imbalanced communication messages, and redundant storage and computation costs. In this paper, we present Emma, a distributed GNN training framework that incorporates source node centric chunking for frequent updates of embeddings and balanced communication, as well as a moving message aggregation technique to boost training efficiency and reduce storage costs. Experimental results show that Emma significantly enhances training efficiency by reducing computation and communication overhead, leading to a notable speedup while maintaining convergence accuracy compared to state-of-the-art distributed GNN training methods.

Abstract:
Recently, large language models (LLMs) have made remarkable progress in table understanding, yet they remain vulnerable to structural noise (SN) and textual noise (TN). Existing methods usually employ biased denoising strategies such as structural matching and textual filtering, or overzealous denoising strategies such as introducing supplementary tasks like text-to-SQL and table-to-text to reduce these two types of noise. However, these methods either neglect one type of noise or introduce substantial external noise. Therefore, how to simultaneously mitigate the structural and textual noise without introducing extra noise and improve the performance of LLMs in table understanding is still an unresolved issue. In this paper, we rethink the bottlenecks in table understanding from the perspective of noise reduction and propose a novel dual-denoiser-reasoner model, called TabDDR, for balanced and effective denoising. Specifically, our model consists of a structural-and-textual denoiser and a task-adaptive reasoner. The former removes two types of noise via triplet alignment and planning extraction to seek an interpretable balance between breaking structural barriers and preserving structural characteristics, eliminating textual noise and retaining maximal information; the latter ensures a simple but effective reasoning process which can adapt to various downstream tasks. To highlight the presence and impact of the structural and textual noise, we construct the WTQ-SN and WTQ-TN datasets based on the WikiTableQuestion (WTQ) dataset. Extensive experiments on these self-constructed datasets and two other public datasets demonstrate that our proposed method performs better than state-of-the-art baselines.

Abstract:
Distribution shifts from external events and new entities can significantly compromise spatial-temporal prediction accuracy, potentially leading to severe outcomes like traffic accidents. Existing methods often fail under these conditions due to two main limitations: they focus on invariant patterns, missing the diversity required to capture the evolving dynamics of distribution shifts; they rely on often inaccessible future knowledge, such as spatial information of new entities, limiting their generalizability. To address these limitations, we formally define the problem of inductive spatial-temporal prediction under continuous distribution shifts and introduce the Contrastive Learning Based Inductive Graph Neural Network (COIN-GNN) as a solution. We develop a novel metric, Relation Importance (RI), to effectively select stable entities and distinct spatial relationships, forming an informative subgraph. Additionally, we construct an informative temporal memory buffer to store and review influential timestamps identified using influence functions. COIN-GNN then generates pseudo-observations for unstable and uninformative entities during these influential timestamps, simulating potential distribution shifts. By applying contrastive learning, the network learns stable and informative representations that can effectively counter distribution shifts without relying on future knowledge. Our extensive experiments on several real-world datasets—from traffic to weather—demonstrate COIN-GNN’s superior performance across different domains without requiring future knowledge.

Abstract:
Multimodal recommender systems utilize a variety of information types to model user preferences and item properties, aiding in the discovery of items that align with user interests. Rich multimodal information alleviates inherent challenges in recommendation systems, such as data sparsity and cold start problems. However, multimodal information further introduces challenges in terms of robustness and generalization capability. Regarding robustness, multimodal information magnifies the risks associated with information adjustment and inherent noise, posing severe challenges to the stability of recommendation models. For generalization capability, multimodal recommender systems are more complex and difficult to train, making it harder for models to handle data beyond the training set, posing significant challenges to model generalization capability. In this paper, we analyze the shortcomings of existing robustness and generalization capability enhancement strategies in the multimodal recommendation field. We propose a sharpness-aware minimization strategy focused on batch data (BSAM), which effectively enhances the robustness and generalization capability of multimodal recommender systems without requiring extensive hyper-parameter tuning. Furthermore, we introduce a mixed loss variant strategy (BSAM+), which accelerates convergence and achieves remarkable performance improvement. We provide rigorous theoretical proofs and conduct experiments with nine advanced models on five widely used datasets to validate the superiority of our strategies. Moreover, our strategies can be integrated with existing robust training and data augmentation strategies to achieve further improvement, providing a superior training paradigm for multimodal recommendations.

Affiliations: School of Computer Engineering, Suzhou Vocational University, Suzhou, China; School of Computer Science and Artificial Intelligence, Alibaba Cloud Big Data College, School of Software, Changzhou University, Changzhou, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; Department of Control Science and Engineering, Tongji University, Shanghai, China; School of Transportation, Southeast University, Nanjing, China

Abstract:
Traffic flow prediction is essential for intelligent transportation systems, yet privacy concerns and limited cross-regional data sharing hinder accurate modeling of global traffic patterns. This paper proposes a Federated Graph Neural Network with Equivalent Hypergraph (FGNNEH) framework to address these challenges by preserving privacy and enhancing cross-client collaboration. FGNNEH consists of two key stages. First, local traffic networks are transformed into high-dimensional hypernodes through an integrated process of backbone network extraction, kernel matrix analysis, and multilayer perceptrons. The backbone network extraction simplifies graph structures by isolating critical nodes and edges based on topological centrality, ensuring computational efficiency while retaining key spatial dependencies. Kernel matrix analysis captures complex nonlinear correlations among traffic flow features, including spatial-temporal dependencies and region-specific dynamics, enabling more effective feature representation. The multilayer perceptrons further fuse these features into robust hypernode embeddings that encapsulate both structural and traffic flow characteristics. Second, a global hypergraph construction mechanism is introduced to optimize inter-client collaboration. This mechanism employs an iterative performance feedback loop to dynamically add or remove edges between hypernodes, addressing the issue of lost inter-client connections and enabling effective cross-regional information exchange. Together, these components reconstruct a global traffic model that balances local privacy with holistic accuracy. Experiments on real-world traffic datasets, including PeMSD4, METR-LA and Guangzhou, demonstrate that FGNNEH outperforms existing methods in prediction accuracy, computational efficiency, and scalability.

Abstract:
Class imbalance, which is common in real-world classification tasks, often leads to biased models favoring majority classes. Data oversampling is a widely used strategy to address this issue. However, traditional oversampling methods often generate incorrect or redundant instances when class overlap occurs, increasing decision boundary complexity. To this end, we propose a novel Generative Oversampling approach to addressing Class Imbalance and Overlap (GOIO) in the classification of tabular data. GOIO combines a Metric-Learning-based Variational Autoencoder (MLVAE) and a Conditional Latent Diffusion Model (CLDM) to handle class imbalance and overlap effectively. The MLVAE employs a triplet-center loss to mitigate the adverse effects of class overlap by transforming the data distribution into a more separable latent feature space. Following this, the CLDM is trained with class-center feature prompting and a classifier-free guidance strategy to capture class-specific latent distributions accurately. Minority class samples are synthesized in the latent space using the CLDM and then reconstructed into the data space via the MLVAE decoder. Comprehensive experiments on 18 real-world and five synthetic datasets demonstrate that GOIO outperforms the state-of-the-art oversampling methods in F1-score, MCC, and Accuracy. Ablation studies further validate the effectiveness of the proposed contributions in addressing class imbalance and overlap.

Abstract:
Noise-tolerant feature selections are valuable for data learning; they can resort to efficient fuzzy granulations and uncertainty measures, and a fundamental model concerns weighted kernel fuzzy rough sets (WKFRSs), which consider data distributions and uncertainty. In current WKFRSs, fuzzy granulations adopt k-nearest neighbors for weighted optimization, while uncertainty measures consider single algebraic and informational views; the corresponding feature selection algorithms have made progress in noise processing, but there is still room for improvement through deeper granulation and stronger measurement. In this paper, building on WKFRSs, two types of weight-fuzzy granulations are defined by using self-adapting radius neighborhoods, and three-view uncertainty measures are comprehensively constructed from uncertainty mechanisms, so 2 × (1+1+2) = 8 heuristic feature selection algorithms are systematically established for better noise-aware learning. First, two improved factors of local density and boundary influence are proposed by general neighborhood characterization and statistical radius determination, and thus two sample weights emerge to adjust Gaussian-kernel fuzzy relations and induce two weight-fuzzy granulations. Then, the fuzzy precision and fuzzy-complementary mutual information are respectively proposed from algebraic and informational views, and the two are combined into two fused measures via arithmetic and geometric means. Furthermore, the above two-type granulations and three-view measures two-dimensionally generate 2 × (1+1+2) = 8 new heuristic selection algorithms via feature significances. Finally, data experiments validate that the constructed fuzzy granulations, uncertainty measures, and feature selections have anti-noise characteristics and corresponding robustness, while the new selection algorithms achieve better classification performance than multiple contrast algorithms.
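
The count of eight algorithms and the two fusion rules can be made concrete with a short sketch. It assumes the algebraic and informational measures are non-negative numbers on comparable scales; the function names and the labels for granulations and measures are illustrative placeholders, not the paper's notation.

```python
import math
from itertools import product

def arithmetic_fusion(algebraic, informational):
    """Fuse two measure values by their arithmetic mean."""
    return (algebraic + informational) / 2

def geometric_fusion(algebraic, informational):
    """Fuse two non-negative measure values by their geometric mean."""
    return math.sqrt(algebraic * informational)

# Two granulations times (1 algebraic + 1 informational + 2 fused) measures.
GRANULATIONS = ["density_weighted", "boundary_weighted"]
MEASURES = ["algebraic", "informational", "arithmetic_fused", "geometric_fused"]
ALGORITHMS = list(product(GRANULATIONS, MEASURES))  # 2 x 4 = 8 combinations
```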

Abstract:
Information Extraction (IE) and Text Classification (CLS) serve as the fundamental pillars of NLU, with both disciplines relying on analyzing input sequences to categorize outputs into pre-established schemas. However, there is no existing encoder-based model that can unify IE and CLS tasks from this perspective. To fully explore the foundation shared within NLU tasks, we propose a recursive method with an explicit schema instructor for universal NLU. Specifically, we first redefine true universal information extraction (UIE) with a formal formulation that covers almost all extraction schemas, including quadruples and quintuples, which remain unsolved for previous UIE models. Then, we expand the formulation to all CLS and multi-modal NLU tasks. Based on that, we introduce RexUniNLU, a universal NLU solution that employs explicit schema constraints for IE and CLS, which encompasses all IE and CLS tasks and prevents incorrect connections between schema and input sequence. To avoid interference between different schemas, we reset the position ids and attention mask matrices. Extensive experiments are conducted on IE and CLS in both English and Chinese, as well as on multi-modality tasks, revealing the effectiveness and superiority of the method.

Affiliations: Department of Strategic and Advanced Interdisciplinary Research, Pengcheng Laboratory, Shenzhen, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Key Laboratory of High Confidence Software Technologies (MOE) & School of Computer Science, Peking University, Beijing, China

Abstract:
Frequent object mining has gained considerable interest in the research community and can be split into frequent item mining and frequent set mining depending on the type of object. While existing sketch-based algorithms have made significant progress in addressing these two tasks concurrently, they also possess notable limitations. They either support only software platforms with low throughput or compromise accuracy for faster processing speed and better hardware compatibility. In this paper, we make a substantial stride towards supporting frequent object mining by designing SandwichSketch, which draws inspiration from sandwich making and proposes two techniques, double fidelity enhancement and hierarchical hot locking, to guarantee high fidelity on both tasks. We implement SandwichSketch on three platforms (CPU, Redis, and FPGA) and show that it enhances accuracy by 38.4× and 5× for the two tasks on three real-world datasets, respectively. Additionally, it supports a distributed measurement scenario with less than a 0.01% decrease in Average Relative Error (ARE) when the number of nodes increases from 1 to 16.
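
The abstract does not spell out how ARE is computed. The sketch below uses the definition common in the sketch literature, the mean of |estimate - truth| / truth over the queried objects, and should be read under that assumption; the function name and the dictionary-based interface are hypothetical.

```python
def average_relative_error(true_freq, est_freq):
    """Mean of |estimate - truth| / truth over all objects in true_freq.

    Assumes every true frequency is positive; objects missing from
    est_freq are treated as estimated at zero.
    """
    if not true_freq:
        raise ValueError("true_freq must contain at least one object")
    total = 0.0
    for key, truth in true_freq.items():
        total += abs(est_freq.get(key, 0) - truth) / truth
    return total / len(true_freq)
```

For example, with true counts `{"a": 10, "b": 4}` and estimates `{"a": 12, "b": 4}`, the relative errors are 0.2 and 0, so the ARE is 0.1.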

Abstract:
Graph neural networks (GNNs) have shown strong performance on graph-structured data but may inherit bias from training data, leading to discriminatory predictions based on sensitive attributes like gender and race. Existing fairness methods assume that training and testing data share the same distribution, but how fairness is affected under distribution shifts remains largely unexplored. To address this, we first identify theoretical factors that cause bias in graphs and explore how fairness is influenced by distribution shifts, particularly focusing on representation distances between groups in training and testing graphs. Based on this, we propose FatraGNN, which uses a graph generator to create biased graphs from different distributions and an alignment module to reduce representation distances for specific groups. This improves fairness and classification performance on unseen graphs. However, FatraGNN has limitations in generating realistic graphs and addressing group differentiation. To overcome these, we introduce AuCoGNN, which includes an automated graph generation module and a contrastive alignment mechanism. This ensures better fairness by maximizing the representation distance between the same certain groups while minimizing the representation distance between different groups. Experiments on real-world and semi-synthetic datasets demonstrate the effectiveness of both models in improving fairness and accuracy.

Abstract:
In recent years, multi-view unsupervised feature selection has gained significant interest for its ability to efficiently handle multi-view datasets while offering better interpretability. However, most existing methods face the following challenges: First, the presence of noisy features in the data significantly impacts the process of learning accurate feature importance. Second, the selected features contain redundant information due to ignored redundancy between them. Third, graph structure learning is performed on all samples, resulting in large computational and space overheads, which is not conducive to expansion to large-scale data. To address these challenges, we propose a multi-view unsupervised feature selection method based on latent semantics and anchor graph learning. Specifically, this method designs a feature-weighted orthogonal regression and subspace learning framework to suppress noise interference in the consensus latent semantics discovery and anchor graph construction process, enhance the robustness of multi-view representation learning and reduce the computation of graph construction. Meanwhile, the proposed method employs explicit redundancy mitigation mechanisms that penalize discriminative weight allocation to highly correlated features. Furthermore, the proposed method unifies feature weighting, consensus latent semantics discovery, and adaptive graph learning within a multi-layer learning framework, enabling comprehensive feature importance evaluation through interactive learning between multiple layers. Finally, an efficient iterative algorithm is designed to solve the proposed model. The superiority of the proposed algorithm is demonstrated by comparing it with seven state-of-the-art algorithms on seven public multi-view datasets.

Abstract:
Learning under feature evolution data streams has attracted widespread attention in recent years. Existing methods usually assume that the model predicts and learns from all instances in the data stream. However, when the data stream rate is faster than the model update rate, the model can only learn from some instances. Therefore, this assumption may not always hold in practical scenarios. Additionally, existing methods often update based only on the current instance, ignoring the impact of data stream changes, which further limits their application in practical data streams. This paper proposes a novel learning paradigm to solve this problem: Online Learning under Feature Evolution data streams with A Fast Rate, called OLFE-FR. Specifically, OLFE-FR introduces the concept of relative rate to adaptively determine the prediction mode and update node of the model in the data stream. Additionally, OLFE-FR proposes an adaptive learning rate adjustment strategy based on the upper bound of dynamic regret minimization. This strategy enables the model to find a suitable learning rate based on weights change induced by known data stream variations before using the instance update. Theoretical analysis and experimental results show that OLFE-FR can effectively handle feature evolution data streams with a fast rate.

Abstract:
Heterogeneous Graph Neural Networks (HGNNs) have achieved promising results in various heterogeneous graph learning tasks, owing to their superiority in capturing the intricate relationships and diverse relational semantics inherent in heterogeneous graph structures. However, the neighborhood-fetching latency incurred by structure dependency in HGNNs makes it challenging to deploy for latency-constrained applications that require fast inference. Inspired by recent GNN-to-MLP knowledge distillation frameworks, we introduce HG2M and HG2M+ to combine both HGNN’s superior performance and MLP’s efficient inference. HG2M directly trains student MLPs with node features as input and soft labels from teacher HGNNs as targets, and HG2M+ further distills reliable and heterogeneous semantic knowledge into student MLPs through reliable node distillation and reliable meta-path distillation. Experiments conducted on six heterogeneous graph datasets show that despite lacking structural dependencies, HG2Ms can still achieve competitive or even better performance than HGNNs and significantly outperform vanilla MLPs. Moreover, HG2Ms demonstrate a 379.24× speedup in inference over HGNNs on the large-scale IGB-3M-19 dataset, showcasing their ability for latency-sensitive deployments.
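
The HG2M training signal, node features in and teacher soft labels as targets, can be sketched with a generic distillation loss. The weighted sum of a cross-entropy term on labeled nodes and a KL term towards the teacher is the objective commonly used in GNN-to-MLP distillation and is assumed here rather than taken from the paper; `lam`, the function names, and the array layout are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_probs, labels, labeled_mask, lam=0.5):
    """lam * CE(student, labels) + (1 - lam) * KL(teacher || student).

    Assumes at least one node is labeled (labeled_mask has a True entry).
    """
    eps = 1e-12
    probs = softmax(student_logits)
    rows = np.flatnonzero(labeled_mask)
    # Cross-entropy on labeled nodes only.
    ce = -np.log(probs[rows, labels[rows]] + eps).mean()
    # KL divergence from teacher soft labels to student predictions, all nodes.
    kl = (teacher_probs * (np.log(teacher_probs + eps) - np.log(probs + eps))).sum(axis=1).mean()
    return lam * ce + (1 - lam) * kl
```

Because the student sees only node features, inference needs no neighborhood fetching, which is the source of the latency benefit described above.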

Abstract:
The issue of multi-set membership query is a fundamental task in the fields of distributed systems and computer networks. It entails identifying which sets, out of n sets S_0, S_1, ..., S_{n-1}, in a Multi Set Multi-Membership Querying (MS-MMQ) contain a given element q. To address this problem while minimizing space usage, probabilistic data structures, such as cuckoo filters, are commonly employed. However, existing sketch data structures struggle to effectively balance scalability and query efficiency. To address this challenge, we introduce an enhanced marked cuckoo filter (EMCF), which enhances support for MS-MMQ scenarios by incorporating set markers after the fingerprint field. Additionally, it grants horizontal scalability and collaborative search capabilities at the filter level through the inclusion of global markers. Furthermore, we have developed an optimized variant of the enhanced marked cuckoo filter (EMCF-V) for multi-set scenarios to achieve space optimization. Experimental results using real-world datasets demonstrate that EMCF exceeds CSC-CF by more than 10 times in terms of speed, and EMCF-V exhibits a 33.3% higher query efficiency than CSC-CF methods, particularly in multiset scenarios.
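
The idea of storing set markers next to a fingerprint can be illustrated with a toy structure. This is deliberately not a cuckoo filter: it has no buckets, relocation, or deletion, and it uses a plain dictionary keyed by fingerprint. The class name, fingerprint width, and hash choice are assumptions made only for illustration, not the EMCF design.

```python
import hashlib

class MarkedFingerprintTable:
    """Toy multi-set membership index: fingerprint -> bitmask of set ids."""

    def __init__(self, num_sets, fp_bits=16):
        self.num_sets = num_sets
        self.fp_bits = fp_bits
        self.table = {}

    def _fingerprint(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % (1 << self.fp_bits)

    def insert(self, item, set_id):
        if not 0 <= set_id < self.num_sets:
            raise ValueError("set_id out of range")
        fp = self._fingerprint(item)
        # Set the marker bit of this set next to the fingerprint.
        self.table[fp] = self.table.get(fp, 0) | (1 << set_id)

    def query(self, item):
        """Return ids of sets that may contain item (false positives possible)."""
        marker = self.table.get(self._fingerprint(item), 0)
        return [i for i in range(self.num_sets) if (marker >> i) & 1]
```

A single lookup answers the membership question for all n sets at once; like any fingerprint-based structure, it can report a set that does not contain the item but never misses one that does.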

Abstract:
Owing to the impressive general intelligence of large language models (LLMs), there has been a growing trend to integrate them into recommender systems to gain a more profound insight into human interests and intentions. Existing LLM-based recommender systems primarily leverage item attributes and user interaction histories in textual format, improving a single task such as rating prediction or explainable recommendation. Nevertheless, these approaches underestimate the crucial contribution of traditional collaborative signals in discerning users’ profound intentions and disregard the interrelatedness among tasks. To address these limitations, we introduce a novel framework known as CKF, specifically developed to boost multi-task recommendations via personalized collaborative knowledge fusion into LLMs. Specifically, to enhance collaborative signal integration, we develop a meta-network that creates personalized mapping bridges for each user. This enables the seamless incorporation of trained collaborative filtering embeddings into structured prompt templates, significantly boosting the LLM’s understanding of user interests. To investigate the intrinsic relationship among diverse recommendation tasks, we develop Multi-LoRA, a new parameter-efficient approach for multi-task optimization, adept at distinctly segregating task-shared and task-specific knowledge. This semantic approach forges a connection between LLMs and recommendation scenarios, while simultaneously enriching the supervisory signal through mutual knowledge transfer among various tasks. Extensive experiments and in-depth robustness analyses across four common recommendation tasks on four large public data sets substantiate the effectiveness of CKF.

Abstract:
Traffic prediction has long been a focal and pivotal area in research, witnessing significant strides from city-level to road-level predictions in recent years. With the advancement of Vehicle-to-Everything (V2X) technologies, autonomous driving, and large-scale models in the traffic domain, lane-level traffic prediction has emerged as an indispensable direction. However, further progress in this field is hindered by the absence of comprehensive and unified evaluation standards, coupled with limited public availability of data and code. In this paper, we present the first systematic classification framework for lane-level traffic prediction, offering a structured taxonomy and analysis of existing methods. We construct three representative datasets from two real-world road networks, covering both regular and irregular lane configurations, and make them publicly available to support future research. We further establish a unified spatial topology structure and prediction task formulation, and propose a simple yet effective baseline model, GraphMLP, based on graph structure and MLP networks. This unified framework enables consistent evaluation across datasets and modeling paradigms. We also reproduce previously unavailable code from existing studies and conduct extensive experiments to assess a range of models in terms of accuracy, efficiency, and applicability, providing the first benchmark that jointly considers predictive performance and training cost for lane-level traffic scenarios.

Abstract:
In time series anomaly detection (TSAD), the scarcity of labeled data poses a challenge to the development of accurate models. Unsupervised domain adaptation (UDA) offers a solution by leveraging labeled data from a related domain to detect anomalies in an unlabeled target domain. However, existing UDA methods assume consistent anomalous classes across domains. To address this limitation, we propose a novel Domain Adaptation Contrastive learning model for Anomaly Detection in multivariate time series (DACAD), combining UDA with contrastive learning. DACAD utilizes an anomaly injection mechanism that enhances generalization across unseen anomalous classes, improving adaptability and robustness. Additionally, our model employs supervised contrastive loss for the source domain and self-supervised contrastive triplet loss for the target domain, ensuring comprehensive feature representation learning and domain-invariant feature extraction. Finally, an effective Center-based Entropy Classifier (CEC) accurately learns normal boundaries in the source domain. Extensive evaluations on multiple real-world datasets and a synthetic dataset highlight DACAD’s superior performance in transferring knowledge across domains and mitigating the challenge of limited labeled data in TSAD.

Abstract:
Knowledge tracing (KT) involves utilizing historical data from students’ learning interactions to model their mastery of knowledge over time, with the aim of predicting their future performance in interactions. Recently, significant advancements have been achieved through the application of various deep learning methodologies to address the KT challenge. However, a considerable proportion of deep learning-based knowledge tracing (DLKT) approaches exhibit striking similarities in their methodologies and model designs, and even their outcomes show minimal divergence. In addition, the evaluation procedures employed in current DLKT studies are not standardized, resulting in substantial inconsistencies in the reported area under the curve (AUC) outcomes, even when the same model is analyzed on identical datasets. To address these two problems, this paper proposes a generalized DLKT framework and represents the existing DLKT models with five components, i.e., multimodal data encoder, student knowledge memory, auxiliary knowledge base, learning outcome objective, and computational efficiency and scalability. Furthermore, we develop and open source a standardized DLKT benchmark platform named pyKT, which consists of a standardized set of integrated data preprocessing procedures on 9 popular datasets across different domains, and 21 frequently compared DLKT model implementations. With pyKT, we conduct empirical and reproducible research to assess the performance of prevalent DLKT algorithms in an unbiased and clear setting over multiple data sources. Finally, we discuss the applications of KT techniques in the educational sector and their future development directions.

Abstract:
Multivariate Time Series Classification (MTSC) has important research significance and practical value. Deep learning models have achieved considerable success in addressing MTSC problems. However, a key challenge faced by existing classification models is how to effectively consider the correlations between time series instances and across channels simultaneously, as well as how to capture the dynamics of these inter-channel correlations over time. Current methods often fall short in these aspects: on one hand, they fail to fully account for the combined effects of inter-instance and inter-channel correlations; on the other hand, they largely overlook the dynamic nature of how inter-channel correlations change over time. To address these issues, we propose a novel graph neural network model, called Similarity-Aware Graph of Graphs neural networks (SAGoG), for multivariate time series classification. This model comprehensively considers the dependencies between channel-level and instance-level time series, and it dynamically learns dependency features through graph structure evolution and graph pooling layers. We conduct experiments on the UEA dataset to validate the SAGoG model, and the results demonstrate its outstanding performance in multivariate time series classification tasks.

Abstract:
Dirty data commonly exist. Simply discarding a large number of inaccurate points (as noise) could greatly affect clustering results. We argue that dirty data can be repaired and utilized as strong support in clustering. To this end, we study a novel problem of clustering and repairing over dirty data at the same time. Referring to the minimum change principle in data repairing, the objective is to find a minimum modification of inaccurate points such that the large amount of dirty data can enhance clustering. We show that the problem is NP-hard and can be formulated as an integer linear programming (ILP) problem. A constant factor approximation algorithm, GDORC, is devised based on a grid, with high efficiency. In experiments, GDORC achieves strong repairing and clustering results with low time consumption. Empirical results demonstrate that both the clustering and cleaning accuracies can be improved by our approach of repairing and utilizing the dirty data in clustering.

Abstract:
Dynamic graphs, which capture time-evolving edges between nodes, are formulated as continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. To advance dynamic graph representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs, achieving promising results. However, it typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the underlying structural evolution associated with each temporal granularity and thus compromising model effectiveness. In addition, existing works, whether specific or unified, often overlook the issue of temporal noise, compromising the model’s robustness. To better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which can scale to large dynamic graphs. Specifically, we first propose a novel Fourier Graph Attention (FGAT) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGAT is well-suited to capture the underlying structures in both CTDGs and DTDGs. We further enhance FGAT to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to the energy. Last, we leverage our proposed FGAT mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14.4% over sixteen baselines across nine dynamic graphs while exhibiting superior robustness in noisy scenarios.

Abstract:
Graphs represent interconnected structures prevalent in a myriad of real-world scenarios. Effective graph analytics, such as graph learning methods, enables users to gain profound insights from graph data, underpinning various tasks including node classification and link prediction. However, these methods often suffer from data imbalance, a common issue in graph data where certain segments possess abundant data while others are scarce, thereby leading to biased learning outcomes. This necessitates the emerging field of imbalanced learning on graphs, which aims to correct these data distribution skews for more accurate and representative learning outcomes. In this survey, we embark on a comprehensive review of the literature on imbalanced learning on graphs. We begin by providing a definitive understanding of the concept and related terminologies, establishing a strong foundational understanding for readers. Following this, we propose two comprehensive taxonomies: (1) the problem taxonomy, which describes the forms of imbalance we consider, the associated tasks, and potential solutions and (2) the technique taxonomy, which details key strategies for addressing these imbalances, and aids readers in their method selection process. Finally, we suggest prospective future directions for both problems and techniques within the sphere of imbalanced learning on graphs, fostering further innovation in this critical area.

Abstract:
Given a large graph G, a subgraph query Q finds the set of all subgraphs of G that satisfy certain conditions specified by Q. Examples of subgraph queries include finding a community containing designated members to organize an event, and subgraph matching. To overcome the weakness of existing graph-parallel systems that underutilize CPU cores when finding subgraphs, our prior system, G-thinker, adopts a novel think-like-a-task (TLAT) parallel programming model. However, G-thinker targets offline analytics and cannot support interactive online querying where users continually submit subgraph queries with different query contents. The challenges here are (i) how to maintain fairness so that queries are answered in the order in which they are received: a later query is processed only if earlier queries cannot saturate the available computation resources; (ii) how to track the progress of active queries (each with many tasks under computation) so that users can be notified as soon as a query completes; and (iii) how to maintain memory boundedness and high task concurrency as in G-thinker. In this article, we propose a novel TLAT programming framework, called G-thinkerQ, for answering online subgraph queries. G-thinkerQ inherits the memory boundedness and high task concurrency of G-thinker by organizing the tasks of each query using a “task capsule” structure, and designs a novel task-capsule list to ensure fairness among queries. A novel lineage-based mechanism is also designed to keep track of when the last task of a query is completed. Parallel counterparts of the state-of-the-art algorithms for 4 recent advanced subgraph queries are implemented on G-thinkerQ to demonstrate its CPU-scalability.

Abstract:
Negative sampling is an essential part of knowledge graph embedding, which offers significant advantages to numerous downstream tasks. There are two important kinds of negatives: hard and false negatives. Hard negatives are negatives that are difficult to distinguish from positive samples, while false negatives are positive samples that are mistakenly identified as negatives. Harnessing hard negatives effectively can make the model more discriminative, and reducing false negatives can avoid misleading the model during training. Therefore, both kinds of negatives are essential in high-quality negative sampling. However, present negative sampling methods face two shortcomings: (1) judging whether a negative is hard or false relies mainly on score functions; (2) it is difficult to balance the impact of hard and false negatives. In this paper, we adopt a bigram language model and propose a novel criterion to help verify whether negatives are hard or false, and discuss how to keep the balance between hard and false negatives. Experiments on four representative score functions and two public datasets demonstrate the effects of the proposed negative sampling method.

Abstract:
Multi-view subspace clustering (MVSC) separates the data with multiple views into multiple clusters, and each cluster corresponds to one certain subspace. Existing tensor-based MVSC methods construct self-representation subspace coefficient matrices of all views as a tensor, and introduce the tensor nuclear norm (TNN) to capture the complementary information hidden in different views. The key assumption is that the data samples of each subspace must be sufficient for subspace representation. This work proposes a nonconvex latent transformed low-rank tensor representation framework for MVSC. To deal with the insufficient sample problem, we study the latent low-rank representation in the multi-view case to supplement underlying observed samples. Moreover, we propose to use data-driven transformed TNN (TTNN), resulting from the intrinsic structure of multi-view samples, to preserve the consensus and complementary information in the transformed domain. Meanwhile, the proposed unified nonconvex low-rank tensor representation framework can better learn the high correlation among different views. To resolve the proposed nonconvex optimization model, we propose an effective algorithm under the framework of the alternating direction method of multipliers and theoretically prove that the iteration sequences converge to the critical point. Experiments on various datasets showcase outstanding performance.

Abstract:
This study was inspired by video forgery detection techniques. If the topic space at a certain time is considered as a frame image, the consecutive frame images over time can be viewed as a video. The rumor topic detection problem is then transformed into a topic video forgery detection problem, and a novel rumor detection method is proposed on this basis. First, a Topic2RGB algorithm was proposed to convert commenting users into pixel points. The algorithm views commenting users as pixel points while using game theory to mine user pro-opposition emotions as RGB information. Second, a Topic2Video algorithm was proposed to convert the topic space into video. The algorithm converts the topic space into frame images; the topic space is time-sliced, and the resulting frames are assembled into a video. Finally, the volatility of user emotional confrontation over a long period in the topic space resembles the change in frame-image characteristics in forged videos. On this basis, a topic video rumor detection method (TVRD) was proposed. The experiments indicate that the method successfully verifies the viability of topic videolization for rumor detection. Additionally, the method demonstrates the effectiveness of user emotion confrontation in the topic space for detection performance.

Abstract:
Attribute-based signature (ABS) is an attractive variation of digital signature that enables signers to sign messages with fine-grained signature predicates. In ABS, a signer is able to perform signing operations without revealing personal attributes, and verifiers can only confirm that the signature was created by someone with attributes satisfying a specific signature predicate. However, traditional ABS suffers from key exposure, and the compromise of a signer’s signature key results in invalidating all signatures from him/her. To address this problem, forward-secure ABS (FS-ABS) was introduced. Nevertheless, existing FS-ABS schemes have the shortcomings of low policy expressiveness and high computation costs, and thus are not suitable to be employed on mobile devices with limited resources. In this paper, we propose a user-friendly and expressive FS-ABS (UEFS-ABS) scheme that is proven secure in the standard model. The proposed scheme not only supports expressive signature predicates based on the linear secret sharing scheme, but also provides server-aided signature and outsourced verification functions, significantly reducing the workload of user terminals at both signature generation and verification stages. The experiments indicate that compared with the up-to-date FS-ABS scheme, our scheme reduces the computation costs for signature generation (on signers’ devices) and verification (on verifiers’ devices) by about 85% and 68%, respectively. This makes our scheme more suitable for user terminals in mobile computing scenarios.

Abstract:
Motif discovery is a critical operation for analyzing series data in many applications. Recent works demonstrate the importance of finding motifs with Dynamic Time Warping. However, existing algorithms spend most of their time computing lower bounds of Dynamic Time Warping to filter out unpromising candidates. Specifically, the time complexity for computing these lower bounds is O(L) for each pair of subsequences, where L is the length of the motif (subsequences). This paper proposes two new lower bounds, called LB_f and LB_M, both of which cost only amortized O(1) time for each pair of subsequences. On real datasets, the proposed lower bounds are at least an order of magnitude faster than the state-of-the-art lower bounds used in motif discovery while still retaining satisfactory effectiveness. Based on these faster lower bounds, this paper designs an efficient motif discovery algorithm that significantly reduces the cost of lower bounds. The experiments conducted on real datasets show the proposed algorithm is 5.6 times faster than the state-of-the-art algorithms on average.
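To make the O(L) per-pair cost concrete, the sketch below implements LB_Keogh, a widely used lower bound of Dynamic Time Warping, next to a banded DTW. This is an illustration only: LB_Keogh is not one of the bounds proposed in this abstract (LB_f and LB_M are not reproduced here), and the function names, the band width `r`, and the sample series are assumptions made for the example.

```python
from math import sqrt, inf

def dtw(a, b, r):
    """DTW distance with a Sakoe-Chiba band of half-width r (squared point costs)."""
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return sqrt(D[n][m])

def lb_keogh(q, c, r):
    """LB_Keogh lower bound of dtw(q, c, r) for two series of equal length.

    One pass over the L points of c. The envelope of q is recomputed inline
    for brevity; practical code precomputes it once per query.
    """
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r): i + r + 1]
        upper, lower = max(window), min(window)
        if ci > upper:
            total += (ci - upper) ** 2
        elif ci < lower:
            total += (lower - ci) ** 2
    return sqrt(total)

# Hypothetical sample subsequences of equal length.
q = [0, 1, 2, 3, 2, 1, 0, 0]
c = [0, 0, 1, 2, 5, 2, 1, 0]
bound = lb_keogh(q, c, r=1)
exact = dtw(q, c, r=1)
```

A candidate pair can be discarded without running DTW whenever `bound` already exceeds the best distance found so far; the loop over all L points is the per-pair cost that the abstract aims to reduce.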

Abstract:
Phasor Measurement Units (PMUs) are state-of-the-art measuring devices that capture high-resolution time-synchronized voltage and current phasor measurements in wide area monitoring systems (WAMS). Their usage for various real-time applications demands a huge amount of data collected from multiple PMUs to be transmitted from the local phasor data concentrator (PDC) to the control centre. To optimize the requirements of bandwidth to transmit the data as well as to store the data, an efficient synchrophasor data compression technique is desired. To this end, this paper presents a 3-stage data compression scheme in which Stage-1 performs the accumulation of the data matrix from the optimally placed PMUs in WAMS into the local PDC. The data is then passed through a novel Ramanujan's sum-based fault window detection algorithm to identify the fault within the PMU data matrix in Stage-2. Finally, Stage-3 proposes an enhanced graph filtering-enabled principal component analysis scheme which expands the notion of conventional PCA techniques into the graph domain to compress the data. The performance of the proposed scheme is verified on the IEEE 14-bus system and New England 39-bus system. Further, practical applicability of the proposed method is validated on field PMU data collected from EPFL campus in Switzerland.
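For readers unfamiliar with the quantity behind Stage-2, the sketch below computes Ramanujan's sum c_q(n), the sum of cos(2πan/q) over all a in 1..q that are coprime to q. It shows only the number-theoretic definition; how the paper builds its fault window detector on top of it is not described in the abstract and is not reproduced here. The function name is an assumption for the example.

```python
from math import gcd, cos, pi

def ramanujan_sum(q, n):
    """Ramanujan's sum c_q(n): sum of cos(2*pi*a*n/q) over 1 <= a <= q with gcd(a, q) == 1.

    The imaginary parts of the underlying complex exponentials cancel and the
    result is always an integer, so the floating-point sum is rounded.
    """
    total = sum(cos(2 * pi * a * n / q) for a in range(1, q + 1) if gcd(a, q) == 1)
    return round(total)

# One period of c_4(n) for n = 0..3; the sequence repeats with period q.
period_of_c4 = [ramanujan_sum(4, n) for n in range(4)]
```

Because c_q(n) is periodic in n with period q, sequences of this kind are commonly used to probe periodic structure in a signal.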

Abstract:
Secure cloud storage is a prevalent way to provide data retrieval services, where users’ data are encrypted before uploading to the cloud. To effectively perform keyword searches over the encrypted data, the approach of searchable encryption (SE) was introduced. However, the leakage of the keyword-pair result pattern to the cloud could be exploited to reconstruct the queried keywords. To mitigate such information leakages, numerous result pattern-hiding SE systems were proposed but rarely supported data sharing with expressive queries and even owner-enforced authorization. Therefore, we present a result pattern hiding and authorized SE system (AXT) supporting conjunctive queries for cloud-based data sharing. Technically, we construct an authorized label private set intersection protocol from a refined authorized public key encryption with an equality test and then combine it with an introduced asymmetric variant of oblivious cross-tag protocol. Moreover, we introduce the system and security model of AXT along with rigorous security proof. Furthermore, we conduct comparative experiments between state-of-the-art solutions and AXT on the HUAWEI Cloud platform under the widely recognized Enron dataset. The results reveal that AXT achieves practical performance while retaining authorized data sharing and result pattern hiding; specifically, the time overhead for conjunctive queries with 10 keywords is reduced by 20%.

Abstract:
Existing studies have proven that pre-trained ranking models outperform pre-trained language models when it comes to ranking tasks. To pre-train such models, researchers have utilized large-scale search logs and clicks as weak-supervised signals of query-document relevance. However, search logs are incomplete and sparse. Different users with the same intent tend to use various forms of queries. It is hard for recorded clicks to sufficiently cover diverse relevance patterns between queries and documents. Moreover, the diverse intentions of a large user base lead to long-tail distributions of search intents. Deriving sufficient relevance signals from sparse clicks of these long-tail intents poses another challenge. Therefore, there is significant potential for exploring richer relevance signals beyond direct clicks to pre-train high-quality ranking models. To tackle this problem, we develop two exploratory data augmentation strategies that consider the diversity of query forms from local and global perspectives, hence mining potential and diverse relevance signals from search logs. A generative augmentation strategy is also devised to create supplementary positive samples, to enhance the ranking ability for long-tail query intents. We leverage a multi-level pairwise ranking objective and a contrastive learning approach to enable our model to capture fine-grained relevance patterns and be robust for noisy training samples. Experimental results on a large-scale public dataset and a commercial dataset confirm that our model, namely PRADA, can yield better ranking effectiveness over existing pre-trained ranking models.

Abstract:
Extracting fine-grained features such as styles from unlabeled data is crucial for data analysis. Unsupervised methods such as variational autoencoders (VAEs) can extract styles that are usually mixed with other features. Conditional VAEs (CVAEs) can isolate styles using class labels; however, there are no established methods to extract only styles using unlabeled data. In this paper, we propose a CVAE-based method that extracts style features using only unlabeled data. The proposed model consists of a contrastive learning (CL) part that extracts style-independent features and a CVAE part that extracts style features. The CL model learns representations independent of data augmentation, which can be viewed as a perturbation in styles, in a self-supervised manner. Considering the style-independent features from the pretrained CL model as a condition, the CVAE learns to extract only styles. Additionally, we introduce a constraint based on mutual information between the CL and VAE features to prevent the CVAE from ignoring the condition. Experiments conducted using two simple datasets, MNIST and an original dataset based on Google Fonts, demonstrate that the proposed method can efficiently extract style features. Further experiments using real-world natural image datasets were also conducted to illustrate the method’s extendability.

Abstract:
We introduce a weighted and unconstrained variant of the well-known minimum k union problem: Given a bipartite graph G(U, V, E) with weights for all nodes in V, find a set S ⊆ V such that the ratio between the total weight of the nodes in S and the number of their distinct adjacent nodes in U is maximized. Our problem, which we term Heavy Nodes in a Small Neighborhood (HNSN), finds applications in marketing, team formation, and money laundering detection. For example, in the latter application, S represents bank account holders who obtain illicit money from some peers of a criminal and route it through their accounts to a target account belonging to the criminal. We prove that HNSN can be solved exactly in polynomial time via linear programming. We also develop several algorithms offering different effectiveness/efficiency trade-offs: an exact algorithm, based on node contraction, graph decomposition, and linear programming, as well as three peeling algorithms. The first peeling algorithm is a near-linear time approximation algorithm with a tight approximation ratio, the second is an iterative algorithm that converges to an optimal solution in a very small number of iterations in practice, and the third is a near-linear time greedy heuristic. In addition, we formalize a money laundering scenario involving multiple target accounts and show how our algorithms can be extended to deal with it. Our experiments on real and synthetic datasets show that our algorithms find (near-)optimal solutions, outperforming a natural baseline, and that they can detect money laundering more effectively and efficiently than two state-of-the-art methods.
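The HNSN objective stated above can be checked on a toy instance with a brute-force search. The sketch below is only meant to make the ratio concrete: it enumerates every non-empty subset of V, so it is exponential and is not one of the exact or peeling algorithms the abstract refers to. The function names, the toy graph, and the choice to skip subsets with no neighbors are assumptions made for the example.

```python
from itertools import combinations

def hnsn_ratio(S, weight, neighbors):
    """Total weight of S divided by the number of distinct neighbors of S in U."""
    covered = set()
    for v in S:
        covered |= neighbors[v]
    if not covered:
        return float("-inf")  # assumption: subsets with no neighbors are skipped
    return sum(weight[v] for v in S) / len(covered)

def hnsn_brute_force(weight, neighbors):
    """Try every non-empty subset of V and return the best (set, ratio) pair."""
    best_set, best_ratio = None, float("-inf")
    nodes = sorted(weight)
    for k in range(1, len(nodes) + 1):
        for S in combinations(nodes, k):
            ratio = hnsn_ratio(S, weight, neighbors)
            if ratio > best_ratio:
                best_set, best_ratio = set(S), ratio
    return best_set, best_ratio

# Hypothetical toy instance: V = {a, b, c}, U = {u1, u2, u3, u4}.
weight = {"a": 4.0, "b": 3.0, "c": 1.0}
neighbors = {"a": {"u1", "u2"}, "b": {"u2"}, "c": {"u3", "u4"}}
best_set, best_ratio = hnsn_brute_force(weight, neighbors)
```

On this instance the best set is {a, b}: the two nodes carry a total weight of 7 while touching only the two nodes u1 and u2, giving a ratio of 3.5, which beats taking b alone (ratio 3).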

Abstract:
Quantitative trading is a prominent field that employs time series analysis today, attracting researchers who apply machine intelligence to real-world issues like stock price movement prediction. In recent literature, various types of auxiliary data have been integrated alongside stock prices to improve prediction accuracy, such as textual news and correlational information. However, they typically rely on directly related documents or symmetric price correlations to make predictions for a particular stock (we refer to as “self-influence”). In this paper, we propose a Memory-Aware Graph Interactive Causal Network (MagicNet) that considers both temporal and spatial dependencies in financial documents and introduces causality-based correlations between multivariate stocks in a hierarchical fashion. MagicNet involves a text memory slot for each stock to retain the most influential texts over time and contains a dynamic interaction graph based on causal relationships to aggregate interactive influences asymmetrically. We believe that MagicNet leverages influential texts across stocks and explores their interrelationships through a logical structure, improving predictions on multiple stocks (we refer to as “interactive-influence”). The effectiveness of MagicNet is demonstrated through experiments on three real-world datasets, where MagicNet outperforms existing state-of-the-art models, offering an intuitive framework for understanding how texts and correlations affect future stock prices.

Abstract:
In response to the burgeoning cryptocurrency sector and its associated financial risks, there is a growing focus on detecting fraudulent activities and malicious addresses. Traditional studies are limited by their reliance on comprehensive historical data and address-wise manipulation, which are not available for early malice detection and fail to identify addresses controlled by the same fraudulent entity. We thus introduce Evolve Path Tracer, a novel solution designed for early malice detection in cryptocurrency. This system incorporates Asset Transfer Paths and corresponding path graphs in an evolve model, which effectively characterizes rapidly evolving transaction patterns. First, for the target address, the Clustering-based Path Selector weights each Asset Transfer Path by finding sibling addresses along the Asset Transfer Paths. Evolve Path Encoder LSTM and Evolve Path Graph GCN then encode the asset transfer path and path graph within a dynamic structure. Additionally, our Hierarchical Survival Predictor efficiently scales to predict the address labels, demonstrating high scalability and efficiency. We rigorously tested Evolve Path Tracer on three real-world datasets of malicious addresses, where it consistently outperformed existing state-of-the-art methods. Our extensive scalability tests further confirmed the model's robust adaptability in dynamic prediction environments, highlighting its potential as a significant tool in the realm of cryptocurrency security.

Abstract:
In distributed data stream mining, we abstract a MIMO scenario where a stream of multiple items is mined by multiple nodes. We design a framework named MimoSketch for the MIMO-specific scenario, which improves the fundamental mining tasks of item frequency estimation, item size distribution estimation, heavy hitter detection, heavy change detection, and entropy estimation. MimoSketch consists of an algorithm design and a policy to schedule items to nodes. MimoSketch's algorithm applies random counting to preserve a mathematically proven unbiasedness property, which makes it friendly to the aggregate query on multiple nodes; its memory layout is dynamically adaptive to the runtime item size distribution, which maximizes the estimation accuracy by storing more items. MimoSketch's scheduling policy balances items among nodes, avoiding nodes being overloaded or underloaded, which improves the overall mining accuracy. Our prototype and evaluation show that our algorithm can improve the accuracy of five typical mining tasks by an order of magnitude compared with the state-of-the-art solutions, and the scheduling policy further promotes the performance in MIMO scenarios.

Abstract:
In this paper, we address the problem of revenue maximization (RM) for multi-grade products in social networks by considering pricing, seed selection, and coupon distribution. Previous works on RM often focus on a single product and neglect the use of coupons for promotion. We propose a new optimization problem, Revenue Maximization of Multi-Grade Product (RMMGP), to simultaneously determine pricing, seed selection, and coupon distribution for multi-grade products with both promotional and competitive relationships between grades in order to maximize revenue through viral marketing. We prove the hardness and inapproximability of RMMGP and show that the revenue function is not monotone or submodular. To solve RMMGP, we design an approximation algorithm, namely Data-Dependent Revenue Maximization (DDRM), and propose the Pricing-Seeding-Coupon allocation (PriSCa) algorithm, which uses the concepts of Worth Receiving Probability, Pricing-Promotion Alternating Framework, and Independent/Holistic Customer-Grade Determinant sets. Our experiments on real social networks, using valuation distributions from Amazon.com, demonstrate that PriSCa and DDRM achieve on average 1.5 times higher revenue than state-of-the-art approaches. Additionally, PriSCa is efficient and scalable on large datasets.

Abstract:
Intelligent logistics relies on accurately predicting the service time, which is a part of time cost in the last-mile delivery. However, service time prediction (STP) is non-trivial given complex delivery circumstances, location heterogeneity, and skewed observations in space, which are not well-handled by existing solutions. In our prior work, we treat STP at each location as a learning task to keep the location heterogeneity, propose a prior knowledge-enhanced meta-learning to tackle skewed observations, and introduce a Transformer-based representation module to encode complex delivery circumstances. Maintaining the design principles of prior work, in this extended paper, we propose MetaSTP+. In addition to fusing the prior knowledge after the meta-learning process, MetaSTP+ also injects the prior knowledge before and during the meta-learning process to better tackle skewed observations. More specifically, MetaSTP+ completes the support set of tasks with scarce samples from other tasks based on prior knowledge and is equipped with a prior knowledge-aware historical observation encoding module to achieve those purposes accordingly. Experiments show MetaSTP+ outperforms the best baseline by 11.2% and 8.4% on two real-world datasets. Finally, an intelligent waybill assignment system based on MetaSTP+ is deployed in JD Logistics.

Abstract:
Risk scoring systems have been widely deployed in many applications, which assign risk scores to users according to their behavior sequences. Though many deep learning methods with sophisticated designs have achieved promising results, the black-box nature hinders their applications due to fairness, explainability, and compliance consideration. Rule-based systems are considered reliable in these sensitive scenarios. However, building a rule system is labor-intensive. Experts need to find informative statistics from user behavior sequences, design rules based on statistics and assign weights to each rule. In this paper, we bridge the gap between effective but black-box models and transparent rule models. We propose a two-stage framework, CauseRuDi, that distills the knowledge of black-box teacher models into rule-based student models. We design a Monte Carlo tree search-based statistics generation method that maximizes the correlation or dependence between the generated statistics and the teacher model's outputs. We formulate a sequential move game and a simultaneous move coalitional game to generate multiple statistics. Then statistics are composed into logical rules with our proposed neural logical networks by mimicking the outputs of teacher models. We evaluate CauseRuDi on three real-world public datasets and an industrial dataset to demonstrate its effectiveness.

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, China; Tsinghua University, Beijing, China; School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia; School of Computer Science and Engineering, Nanyang Technological University, Singapore; Department of Computer Science, Aalborg University, Aalborg, Denmark

Abstract:
Multivariate Time Series (MTS) analysis is crucial to understanding and managing complex systems, such as traffic and energy systems, and a variety of approaches to MTS forecasting have been proposed recently. However, we often observe inconsistent or seemingly contradictory performance findings across different studies. This hinders our understanding of the merits of different approaches and slows down progress. We address the need for means of assessing MTS forecasting proposals reliably and fairly, in turn enabling better exploitation of MTS as seen in different applications. Specifically, we first propose BasicTS+, a benchmark designed to enable fair, comprehensive, and reproducible comparison of MTS forecasting solutions. BasicTS+ establishes a unified training pipeline and reasonable settings, enabling an unbiased evaluation. Second, we identify the heterogeneity across different MTS as an important consideration and enable classification of MTS based on their temporal and spatial characteristics. Disregarding this heterogeneity is a prime reason for difficulties in selecting the most promising technical directions. Third, we apply BasicTS+ along with rich datasets to assess the capabilities of more than 30 MTS forecasting solutions. This provides readers with an overall picture of the cutting-edge research on MTS forecasting.

Abstract:
The ubiquity of Graph Neural Networks (GNNs) emphasizes the imperative to assess their resilience against node injection attacks, a type of evasion attack that impacts victim models by injecting nodes with fabricated attributes and structures. However, prevailing attacks face two primary limitations: (1) Sequential construction of attributes and structures results in suboptimal outcomes as structure information is overlooked during attribute construction and vice versa. (2) In black-box scenarios, where attackers lack access to victim model architecture and parameters, reliance on surrogate models degrades performance due to architectural discrepancies. To overcome these limitations, we introduce GZOO, a black-box node injection attack that leverages an adversarial graph generator comprising both attribute and structure sub-generators. This integration crafts optimal attributes and structures by considering their mutual information, enhancing their influence when aggregating information from injected nodes. Furthermore, GZOO proposes a zeroth-order optimization algorithm leveraging prediction results from victim models to estimate gradients for updating generator parameters, eliminating the necessity to train surrogate models. Across sixteen datasets, GZOO significantly outperforms state-of-the-art attacks, achieving remarkable effectiveness and robustness. Notably, on the Cora dataset with the GCN model, GZOO achieves an impressive 95.69% success rate, surpassing the maximum 66.01% achieved by baselines.
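
The idea of estimating gradients from query results alone can be illustrated with a generic two-point random-direction estimator. This is a minimal sketch, not GZOO's actual algorithm: the function name `zo_gradient`, the smoothing radius `mu`, and the toy objective standing in for a victim model's output are all assumptions made for illustration.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_samples=200, seed=0):
    """Estimate the gradient of a black-box function f at x using only f's values.

    Each sample draws a Gaussian direction u and uses the central difference
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) as the slope along u.
    Averaging slope * u over many directions approximates the gradient.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def toy_objective(x):
    """Hypothetical stand-in for a black-box score; its true gradient is 2*x."""
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    print(zo_gradient(toy_objective, [1.0, -2.0], num_samples=2000))
```

The estimate is noisy for small `num_samples`; more queries reduce the variance at the cost of more calls to the black-box function.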

Abstract:
In recent years, the medical industry has been generating a large amount of data. How to securely store and reliably share these medical data has been a hot research topic. Cloud storage technology can be applied to the medical industry to adapt to the rapid growth of medical data. However, cloud-based data storage and sharing systems face a series of security issues, such as whether the integrity of outsourced medical data can be guaranteed and how to prevent malicious access between different medical institutions from leaking users’ privacy. This article proposes a system that simultaneously solves the integrity auditing of medical data and secure data sharing between different medical institutions under the terminal-edge-cloud framework. Specifically, patients/doctors are treated as terminal users, medical institutions are viewed as edge nodes, and medical clouds form the central storage layer. In the process of data auditing, a third-party auditor can achieve integrity auditing of medical cloud storage data. Moreover, different medical institutions use private-set-intersection technology to share the common users’ electronic medical data, while the data of users outside the intersection set does not need to be shared. Finally, security and performance analyses show that our proposed system is provably secure and has high computational and communication efficiency.

Abstract:
This paper proposes a novel fuzzy community detection (FCD) approach, which we term ‘Label Propagation-Based Fuzzy Community (LaProFC)’, and shows that it can outperform the existing FCD approaches. While designing the proposed FCD approach, we introduce a new compound type similarity metric termed ‘proportion of common neighbors and edges-based similarity (CCS)’ to compute similarity between two neighboring nodes. By executing local exploration on graphs with modified local random walk (mLRW), the most similar neighbors of each node are identified, and based on the directions of the most similar neighbors some tentative communities are generated. Afterward, these tentative communities are corrected and stabilized by iteratively computing membership degrees of each node using a novel label propagation-based membership computation function. We also propose a novel edge-density-based technique called ‘community-weight based tie-breaking (CTB)’, which is incorporated with the membership degree computation function. We conduct extensive experiments with both real-life and synthetic datasets and demonstrate the behavior of the proposed approach. Our proposed LaProFC approach outperforms baseline approaches in terms of popular quality and accuracy metrics including modularity and normalized mutual information. Further, popular multi-criteria decision making (MCDM) tools are used to show the superiority of the proposed approach by computing the ranks of different approaches through two sets of accuracy and quality metrics. Our proposed LaProFC approach also surpasses other approaches in computation speed and asymptotic time complexity.

Abstract:
Sequential recommender systems (SRSs) are designed to suggest relevant items to users by analyzing their interaction sequences. However, SRSs often suffer from exposure bias in these sequences due to imbalanced item exposure and varied user activity levels, creating a self-reinforcing loop favoring popular items regardless of their true relevance. Most SRSs only focus on item dependencies to address exposure bias, while overlooking user-side exposure bias and the rich semantics behind interactions. These oversights result in a limited understanding of less active users’ preferences and inaccurate preference capture for less exposed items, exacerbating exposure biases. Towards this end, we propose a novel method LLM-enhanced Dual Propensity Score Estimation (LDPE), which synergistically integrates Large Language Models (LLMs) and causal inference. First, LDPE leverages LLMs’ superior ability in capturing rich semantics from textual data and then integrates collaborative information to generate debiased semantic-rich LLM-based user/item embeddings. With these debiased item/user embeddings, LDPE estimates time-aware debiased propensity scores from both the item and user sides. These dual propensity scores can fully mitigate exposure bias by considering item popularity, user activity levels, and temporal dynamics. Lastly, LDPE employs the transformer as the backbone of our method, incorporating estimated dual propensity scores for accurately predicting users’ true preferences. Extensive experiments show that our LDPE outperforms state-of-the-art baselines in terms of recommendation performance.

Abstract:
The storage location assignment problem for export containers (EC-SLAP) is crucial to the efficiency of cargo turnover in ports. Existing methods fall short in real-world applications due to the challenges of the unpredictability of container arrival sequences and the large-scale problem. We propose a new buffer-wise framework based on hierarchical reinforcement learning for EC-SLAP, aimed at optimizing container turnover efficiency. The framework comprises two processes: 1) Ranking Agent ranks containers in the buffer, reducing the uncertainty of random arrival sequences compared to the immediate assignment. 2) Assigning Agents assign storage locations in a two-step process by block and slot, diminishing the dimensionality of the large-scale discrete action space. We iteratively optimize agents by asynchronously obtaining rewards from the environment. In addition, to address the challenge of sparse rewards in long-sequence decision-making, we have developed a novel immediate reward function to enhance learning efficiency and accelerate convergence. We propose a new large-scale dataset, NZP-SLAD, collected from real-world historical data from the terminal operating system of Ningbo-Zhoushan Port and develop a realistic container terminal simulator. We conducted numerous offline simulations and tests with this dataset. The experimental results demonstrate that our proposed method achieves rapid convergence and significantly surpasses expert methods used in real-world production.

Abstract:
Multi-hop query answering over a Knowledge Graph (KG) involves traversing one or more hops from the start node to answer a query. Path-based and logic-based methods are state-of-the-art for multi-hop question answering. The former is used in link prediction tasks. The latter is for answering complex logical queries. The logical multi-hop querying technique embeds the KG and queries in the same embedding space. The existing work incorporates First Order Logic (FOL) operators, such as conjunction (∧), disjunction (∨), and negation (¬), in queries. Though current models have most of the building blocks to execute the FOL queries, they cannot use the dense information of multi-modal entities in the case of Multi-Modal Knowledge Graphs (MMKGs). We propose RConE, an embedding method to capture the multi-modal information needed to answer a query. The model first shortlists candidate (multi-modal) entities containing the answer. It then finds the solution (sub-entities) within those entities. Several existing works tackle path-based question-answering in MMKGs. However, to our knowledge, we are the first to introduce logical constructs in querying MMKGs and to answer queries that involve sub-entities of multi-modal entities as the answer. Extensive evaluation on four publicly available MMKGs indicates that RConE outperforms the current state-of-the-art.
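
The semantics of the FOL operators in a multi-hop query can be illustrated symbolically with plain set operations over a toy KG. This is only a sketch of what such queries compute, not RConE or any embedding method. The triples, the entity and relation names, and the helper names `project`, `conjunction`, `disjunction`, and `negation` are hypothetical.

```python
# Toy knowledge graph as (head, relation, tail) triples; all names are made up.
TRIPLES = {
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("carol", "works_at", "globex"),
    ("acme", "located_in", "paris"),
    ("globex", "located_in", "berlin"),
}


def project(triples, heads, relation):
    """One hop: all tails reachable from any entity in `heads` via `relation`."""
    return {t for (h, r, t) in triples if r == relation and h in heads}


def conjunction(a, b):
    """Entities that satisfy both sub-queries (set intersection)."""
    return a & b


def disjunction(a, b):
    """Entities that satisfy at least one sub-query (set union)."""
    return a | b


def negation(a, universe):
    """Entities of the given universe that do not satisfy the sub-query."""
    return universe - a


if __name__ == "__main__":
    # Two-hop query: cities hosting a company that employs alice or carol.
    employers = disjunction(
        project(TRIPLES, {"alice"}, "works_at"),
        project(TRIPLES, {"carol"}, "works_at"),
    )
    print(sorted(project(TRIPLES, employers, "located_in")))
```

Embedding-based methods replace these exact set operations with learned operators in a vector space so that queries can still be answered when the KG is incomplete.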

Abstract:
We explore Multimodal Large Language Models (MLLMs), which integrate LLMs like GPT-4 to handle multimodal data, including text, images, audio, and more. MLLMs demonstrate capabilities such as generating image captions and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs, posing potential risks to society. Selecting the appropriate modality alignment method is crucial, as improper methods might require more parameters without significant performance improvements. This paper aims to explore modality alignment methods for LLMs and their current capabilities. Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility. The study surveys existing modality alignment methods for MLLMs, categorizing them into four groups: (1) Multimodal Converter, which transforms data into a format that LLMs can understand; (2) Multimodal Perceiver, which improves how LLMs perceive different types of data; (3) Tool Learning, which leverages external tools to convert data into a common format, usually text; and (4) Data-Driven Method, which teaches LLMs to understand specific data types within datasets.

Abstract:
Multiple change point detection seeks to identify potential shifts in the data properties. While existing detection methods primarily focus on univariate and multivariate data, they often fall short in detecting variations in profile data, which represent the functional relationships between explanatory and response variables. This paper introduces a novel change point detection method tailored for profile data, employing a smooth profile decomposition (SPD) strategy that accommodates arbitrary designs and heteroscedasticity. This approach facilitates a comprehensive representation of both overall trends and fluctuations within the profiles. Furthermore, we propose an order-splitting screening estimator (OSE) to construct a detection statistic, allowing for precise estimation of change points while ensuring a significant theoretical guarantee regarding the false discovery rate (FDR). We validate the performance and robustness of the proposed method through numerical experiments and a real case study involving wind turbines.

Abstract:
Multi-view clustering (MVC) for remote sensing data has demonstrated significant potential in Earth observation, given its ability to aggregate multi-source information without relying on labels. Despite achieving compelling results through the combination of deep encoders and contrastive learning, existing algorithms still face two limitations: inadequate exploration of diverse spatial relationships, and an inability to guide the selection of sample pairs, which results in blind sampling. Both lead to suboptimal clustering performance. To tackle these challenges, we propose a sampling enhanced contrastive multi-view clustering method for remote sensing data, namely SEC-LSRM. The proposed method incorporates long- and short-range information mining to enhance clustering performance. By aggregating short-range information extracted through autoencoders and long-range information obtained via graph autoencoders, our method improves the sampling quality of positive and negative sample pairs. To render the extracted features more compact, a multi-view correlation reduction strategy is devised to filter out irrelevant information. With the extracted comprehensive features, an adaptive sampling strategy is designed to obtain high-quality positive and negative samples. Subsequently, we select positive and negative sample pairs based on these affinity matrices with idempotence and block diagonal constraints. Moreover, we integrate the optimization of these sample pairs and contrastive learning within the same framework to achieve iterative updates of both. Experiments conducted on multiple multi-view remote sensing datasets illustrate that our proposed SEC-LSRM method achieves excellent and reliable clustering performance.

Abstract:
Spatial-Temporal Graph Neural Networks (STGNNs) have been widely utilized in multivariate time series forecasting (MTSF), but they rely on the assumption of data completeness. In practice, due to factors such as natural disaster, STGNNs frequently encounter the challenge of missing data resulting from numerous malfunctioning data collectors. In this case, on the one hand, due to the presence of missing values, STGNNs easily generate incorrect spatial correlations, leading to the performance degradation. On the other hand, STGNNs require separate training of models for different missing rates, limiting their robustness. To address these challenges, we first propose two important components (interpolation attention and adaptive graph convolution), which utilize normal values to recover missing values into reliable representations and reconstruct spatial correlations. Then, we replace the fully connected layers in simple recursive units with these two components and propose Graph Interpolation Attention Recursive Network (GinAR), aiming to recursively correct spatial correlations and achieve end-to-end MTSF with missing values. Finally, we use data with different missing rates as positive and negative data pairs. By employing contrastive learning to train GinAR, we propose GinAR+ and enhance its robustness to data with different missing rates. Experiments validate the superiority of GinAR+ and our motivation.

Abstract:
Many practical time series forecasting (TSF) tasks are plagued by data limitations. To alleviate this challenge, we design a data-level augmentation framework. It involves a time series generation (TSG) module and a source data selection (Sel-src) module. TSG aims to achieve better generation results by considering both the global profile and temporal dynamics of series. However, when only a few target samples are available, the TSG module may tend to simulate the limited target samples, leading to poor generalization performance. A natural idea for this problem is to seek help from a related source domain, which can provide additional useful information for the TSG module. Here we consider a more complex situation, where the relevance between source and target domains is ambiguous. That is, irrelevant samples may exist in the source domain. Blindly using all the source data may lead to counterproductive results. To meet this challenge, the Sel-src module is designed to select effective source samples by Inter-Representation Learning (Inter-RL) and Intra-Representation Learning (Intra-RL). The effectiveness of this algorithm is demonstrated from two aspects: the quality of the augmented data and the accuracy improvement upon the augmentation.

Abstract:
Edge perturbation is a basic method to modify graph structures. It can be categorized into two veins based on its effects on the performance of graph neural networks (GNNs), i.e., graph data augmentation and attack. Surprisingly, both veins of edge perturbation methods employ the same operations, yet yield opposite effects on GNNs’ accuracy. A distinct boundary between these methods in using edge perturbation has never been clearly defined. Consequently, inappropriate perturbations may lead to undesirable outcomes, necessitating precise adjustments to achieve desired effects. Therefore, the questions “why does edge perturbation have a two-faced effect?” and “what makes edge perturbation flexible and effective?” remain unanswered. In this paper, we answer these questions by proposing a unified formulation and establishing a quantizable boundary between two categories of edge perturbation methods. Specifically, we conduct experiments to elucidate the differences and similarities between these methods and theoretically unify the workflow of these methods by casting it to one optimization problem. Then, we devise Edge Priority Detector (EPD) to generate a novel priority metric, bridging these methods in the workflow. Experiments show that EPD can flexibly perform augmentation or attack and achieve comparable or superior performance to other counterparts with less time overhead.

Abstract:
Time series forecasting, aiming to learn models from historical data and predict future values in time series, is a fundamental research topic in machine learning. However, few efforts have been devoted to addressing the confounding effects in time series data, e.g., the historical data are affected by some hidden surrounding factors (i.e., confounders), leading to biased forecasting models for future data. This paper presents a causal intervention approach to eliminate the bias that is raised by some hidden confounders. By using a causal graph, we illustrate why hidden confounders can bring bias in time series forecasting and how to tackle it. We implement causal intervention by a deep architecture that consists of two modules, a Confounders Estimation module to estimate the hidden confounders and a Debiasing module to eliminate the confounding bias in the forecasting model via sampling on confounders. We conduct comprehensive evaluations on various time series datasets. The experiment results indicate that the proposed method can reduce the negative confounding effects in time series data, and it achieves superior gains over state-of-the-art baselines for time series forecasting.

Abstract:
Contrastive Learning (CL) has emerged as a popular self-supervised representation learning paradigm that has been shown in many applications to perform similarly to traditional supervised learning methods. A key component of CL is mining the latent discriminative relationships between positive and negative samples and using them as self-supervised labels. We argue that this discriminative contrastive task is, in essence, similar to a classification task, and the “either positive or negative” hard label sampling strategies are arbitrary. To solve this problem, we explore ideas from data distillation, which considers probabilistic logit vectors as soft labels to transfer model knowledge. We attempt to abandon the classical hard sampling labels in CL and instead explore self-supervised soft labels. We adopt soft sampling labels that are extracted, without supervision, from the inherent relationships in data pairs to retain more information. We propose a new self-supervised graph learning method, Distill and Contrast (D&C), for learning representations that closely approximate natural data relationships. D&C extracts node similarities from the features and structures to derive soft sampling labels, which also eliminate noise in the data to increase robustness. Extensive experimental results on real-world datasets demonstrate the effectiveness of the proposed method.
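
The contrast between hard and soft sampling labels can be made concrete with a small loss function. The sketch below is not the D&C objective. It assumes, purely for illustration, that soft labels are obtained by a softmax over some target similarities and that the loss is the cross-entropy between those soft labels and the model's predicted similarity distribution for a single anchor. The names `softmax`, `soft_contrastive_loss`, and `temperature` are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_contrastive_loss(pred_sims, target_sims, temperature=0.5):
    """Cross-entropy between soft labels and predicted similarities for one anchor.

    pred_sims:   similarities between the anchor and each candidate, from the model.
    target_sims: similarities used to derive soft labels (assumed given here).
    If the soft labels collapse to a one-hot vector, this reduces to the usual
    hard-label (InfoNCE-style) contrastive loss for that anchor.
    """
    predicted = softmax(pred_sims, temperature)
    soft_labels = softmax(target_sims, temperature)
    return -sum(q * math.log(p) for q, p in zip(soft_labels, predicted))


if __name__ == "__main__":
    targets = [2.0, 1.0, 0.0]
    print(soft_contrastive_loss([2.0, 1.0, 0.0], targets))  # predictions agree with targets
    print(soft_contrastive_loss([0.0, 1.0, 2.0], targets))  # predictions reversed
```

Because every candidate receives a graded weight rather than a binary positive or negative label, partially similar samples still contribute to the training signal.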

Abstract:
Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs and is widely used in graph fusion-related fields. However, as the scale of knowledge graphs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating single structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single type of structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, it underestimates the impact of embedding distortion produced by noisy seed pairs on the alignment effect. To solve the above problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we design a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KG structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we design the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion. Finally, we feed the potential seed pairs optimized over the iterative rounds into the local and global awareness mechanism, obtain the final entity embedding, and perform entity alignment. Experimental results on public datasets demonstrate the excellent performance of our SE-GNN, showcasing the effectiveness of the model. Our code is publicly available at https://github.com/ShuoShan1/SE-GNN.
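
The filtering step that combines a similarity threshold with bidirectional nearest neighbors can be sketched as follows. This is a minimal illustration under stated assumptions, not the SE-GNN implementation: it assumes a precomputed similarity matrix between entities of two KGs, and the function name `mutual_nn_pairs` and the toy values are hypothetical.

```python
def mutual_nn_pairs(sim, threshold):
    """Select candidate seed pairs from a similarity matrix.

    sim[i][j] is the similarity between entity i of the first KG and entity j
    of the second KG. A pair (i, j) is kept only if j is the nearest neighbor
    of i, i is the nearest neighbor of j, and sim[i][j] reaches the threshold.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    # Nearest neighbor in KG2 for every entity of KG1.
    best_col = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    # Nearest neighbor in KG1 for every entity of KG2.
    best_row = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    return [
        (i, j)
        for i, j in enumerate(best_col)
        if best_row[j] == i and sim[i][j] >= threshold
    ]


if __name__ == "__main__":
    toy_sim = [
        [0.9, 0.2, 0.1],
        [0.3, 0.4, 0.35],
        [0.1, 0.8, 0.7],
    ]
    print(mutual_nn_pairs(toy_sim, threshold=0.5))
```

Requiring agreement in both directions discards one-sided matches, and the threshold removes mutual matches that are still too dissimilar to trust as seeds.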

Abstract:
Open information extraction (OIE) methods extract plenty of OIE triples < > from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. In order to leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.

Abstract:
Big data applications such as Artificial Intelligence (AI) and Internet of Things (IoT) have in recent years been leading to many technological breakthroughs in system modeling. However, these applications are typically data intensive, thus requiring an increasing cost of resources. In this paper, a first-of-its-kind comprehensive review of data selection methods across different engineering disciplines is given in order to analyze the effectiveness of these methods in improving the data efficiency of mathematical modeling algorithms. Eight distinct selection methods have been identified and subsequently analyzed and discussed on the basis of the relevant literature. In addition, the selection methods have been classified according to three dichotomies established by the survey. A comparative analysis of these methods was conducted along with a discussion of potentials, challenges, and future research directions for the research area. Data selection was found to be widely used in many engineering applications and has the potential to play an important role in making more sustainable Big Data applications, especially those in which transmission of data across large distances is required. Furthermore, making resource-aware decisions about the use of data has been shown to be highly effective in reducing energy costs while ensuring high performance of the model.

Abstract:
Maritime transportation, vital for nearly 90% of global trade, necessitates precise vessel trajectory prediction for safety and efficiency. Although the Automatic Identification System (AIS) provides a comprehensive data source, how to model these multi-modal and heterogeneous time-varying sequences (such as vessels’ kinetic information and ocean weather factors) poses a formidable challenge. Moreover, most existing approaches are limited by the confined scope of vessel trajectory modeling, making it impossible to consider the unique characteristics of maritime transportation system. To tackle these challenges, we propose a novel framework called AISFuser to i) encode unique maritime traffic network into graphical representations, and ii) introduce the heterogeneity into multi-modal temporal embeddings through Self-Supervised Learning (SSL). Specifically, our AISFuser is constructed by combining an attention-based graph block with a transformer network to encode information across space and time, respectively. In terms of temporal dimension, one SSL auxiliary task is also designed to enhance the heterogeneity of temporal representations and supplement the main vessel prediction task. We validate the effectiveness of the proposed AISFuser on a real-world AIS dataset. Extensive experimental results demonstrate that our method can forecast multiple attributes of vessel trajectory for over 10 hours into the future, outperforming competitive baselines.

Abstract:
Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, the question of whether LLMs can effectively address the task of logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains unanswered. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs. Second, different from previous evaluations relying only on simple metrics (e.g., accuracy), we propose fine-level evaluations in objective and subjective manners, covering both answers and explanations, including answer correctness, explain correctness, explain completeness and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases will be attributed to five error types from two dimensions, i.e., evidence selection process and reasoning process. Third, to avoid the influences of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on the in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability from six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future works.

Abstract:
Robustness is paramount for ensuring the reliability of knowledge graph models in safety-sensitive applications. While recent research has delved into adversarial attacks on static knowledge graph models, the exploration of more practical temporal knowledge graphs has been largely overlooked. To fill this gap, we present the Adaptive Temporal Perturbation Framework (ATPF), a novel adversarial attack framework aimed at probing the robustness of temporal knowledge graph (TKG) models. The general idea of ATPF is to inject perturbations into the victim model input to undermine the prediction. First, we propose the Temporal Perturbation Prioritization (TPP) algorithm, which identifies the optimal time sequence for perturbation injection before initiating attacks. Subsequently, we design the Rank-Based Edge Manipulation (RBEM) algorithm, enabling the generation of both edge addition and removal perturbations under black-box setting. With ATPF, we present two adversarial attack methods: the stringent ATPF-hard and the more lenient ATPF-soft, each imposing different perturbation constraints. Our experimental evaluations on the link prediction task for TKGs demonstrate the superior attack performance of our methods compared to baseline methods. Furthermore, we find that strategically placing a single perturbation often suffices to successfully compromise a target link.

Abstract:
Sliding window aggregation, which extracts summaries from data streams, is a core operation in streaming analysis. Though existing sliding window algorithms that perform single eviction and insertion operations can achieve a worst-case time complexity of O(1) for in-order streams, real-world data streams often involve out-of-order data and exhibit burst data characteristics, which pose performance challenges to these sliding window algorithms. To address this challenging issue, we propose Gecko, a novel sliding window aggregation algorithm that supports bulk eviction. Gecko leverages a granular-based eviction strategy for various bulk sizes, enabling efficient bulk eviction while maintaining the performance close to that of in-order stream algorithms for single evictions. For large data bulks, Gecko performs coarse-grained eviction at the chunk level, followed by fine-grained eviction using leftward binary tree aggregation (LTA) as a complementary method. Moreover, Gecko partitions data based on chunks to prevent the impacts of out-of-order data on other chunks, thereby enabling efficient handling of out-of-order data streams. We conduct extensive experiments to evaluate the performance of Gecko. Experimental results demonstrate that Gecko exhibits superior performance over other solutions, which is consistent with theoretical expectations. In real-world data scenarios, Gecko improves the average throughput of the state-of-the-art algorithm b_FiBA by 1.7 times, with a maximum improvement of up to 3.5 times. Gecko also demonstrates the best latency performance among all compared schemes.
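
For context, single-element insertion and eviction on an in-order stream can be sketched with the classic two-stacks technique. This is a baseline illustration only, not Gecko, and it supports neither bulk eviction nor out-of-order data. It also achieves O(1) amortized rather than worst-case time per operation. The class name `TwoStacksWindow` and its method names are hypothetical.

```python
class TwoStacksWindow:
    """FIFO sliding window with aggregate queries via two stacks.

    `combine` must be associative; `identity` is returned for an empty window.
    Each stack entry stores (value, running aggregate), so query() only has to
    combine the two stack tops.
    """

    def __init__(self, combine, identity):
        self.combine = combine
        self.identity = identity
        self.front = []  # older elements; the top of this stack is the oldest
        self.back = []   # newer elements; the top of this stack is the newest

    def insert(self, value):
        """Append the newest element to the window."""
        agg = self.combine(self.back[-1][1], value) if self.back else value
        self.back.append((value, agg))

    def evict(self):
        """Remove the oldest element; raises IndexError if the window is empty."""
        if not self.front:
            # Move everything from back to front, rebuilding aggregates so that
            # each front entry covers itself and all newer front entries.
            while self.back:
                value, _ = self.back.pop()
                agg = self.combine(value, self.front[-1][1]) if self.front else value
                self.front.append((value, agg))
        self.front.pop()

    def query(self):
        """Aggregate of the whole window, in oldest-to-newest order."""
        front_agg = self.front[-1][1] if self.front else self.identity
        back_agg = self.back[-1][1] if self.back else self.identity
        return self.combine(front_agg, back_agg)


if __name__ == "__main__":
    window = TwoStacksWindow(max, float("-inf"))
    for reading in (3, 1, 4):
        window.insert(reading)
    window.evict()          # drops the oldest reading, 3
    print(window.query())   # maximum of the remaining readings
```

Evicting a burst of k elements with this structure costs k separate `evict` calls, which is the kind of overhead that bulk-eviction designs aim to avoid.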

Abstract:
Sequential causal effect estimation has recently attracted increasing attention from research and industry. While the existing models have achieved many successes, there are still many limitations. Existing models usually assume the causal graphs to be sufficient, i.e., there are no latent factors, such as unmeasured confounders and instrumental variables. However, in real-world scenarios, it is hard to record all of the factors in the observational data, so the causal sufficiency assumption does not hold. Moreover, existing models mainly focus on discrete treatments rather than continuous ones. To alleviate the above problems, in this paper, we propose a novel Continuous Causal Model by explicitly capturing the Latent Factors (called C$^2$M-LF for short). Specifically, we first define a sequential causal graph by simultaneously considering the unmeasured confounders and instrumental variables. Second, we describe the independence that should be satisfied among different variables from the mutual information perspective and further propose our learning objective. Then, we reweight different samples in the continuous treatment space to optimize our model unbiasedly. Beyond the above designs, we also theoretically analyze our model’s causal identifiability and unbiasedness. Finally, we conduct extensive experiments on both simulation and real-world datasets to demonstrate the effectiveness of our proposed model.

Abstract:
Long-term user behavior sequences are a goldmine for businesses to explore users’ interests to improve Click-Through Rate (CTR). However, it is very challenging to accurately capture users’ long-term interests from their long-term behavior sequences and give quick responses from the online serving systems. To meet such requirements, existing methods “inadvertently” destroy two basic requirements in long-term sequence modeling: R1) make full use of the entire sequence to keep the information as much as possible; R2) extract information from the most relevant behaviors to keep high relevance between learned interests and current target items. The performance of online serving systems is significantly affected by incomplete and inaccurate user interest information obtained by existing methods. To this end, we propose an efficient two-stage long-term sequence modeling approach, named as EfficieNt Clustering based twO-stage interest moDEling (ENCODE), consisting of offline extraction stage and online inference stage. It not only meets the aforementioned two basic requirements but also achieves a desirable balance between online service efficiency and precision. Specifically, in the offline extraction stage, ENCODE clusters the entire behavior sequence and extracts accurate interests. To reduce the overhead of the clustering process, we design a metric learning-based dimension reduction algorithm that preserves the relative pairwise distances of behaviors in the new feature space. While in the online inference stage, ENCODE takes the off-the-shelf user interests to predict the associations with target items. Besides, to further ensure the relevance between user interests and target items, we adopt the same relevance metric throughout the whole pipeline of ENCODE. The extensive experiment and comparison with SOTA on both industrial and public datasets have demonstrated the effectiveness and efficiency of our proposed ENCODE.

Abstract:
A graph is a fundamental data model to represent various entities and their complex relationships in society and nature, such as social networks, transportation networks, financial networks, and biomedical systems. Recently, large language models (LLMs) have showcased a strong generalization ability to handle various natural language processing tasks to answer users’ arbitrary questions and generate specific-domain content. Compared with graph learning models, LLMs enjoy superior advantages in addressing the challenges of generalizing graph tasks by eliminating the need for training graph learning models and reducing the cost of manual annotation. However, LLMs are sequential models for textual data, but graphs are non-sequential topological data. It is challenging to adapt LLMs to tackle graph analytics tasks. In this survey, we conduct a comprehensive investigation of existing LLM studies on graph data, which summarizes the relevant graph analytics tasks solved by advanced LLM models and points out the existing challenges and future directions. Specifically, we study the key problems of LLM-based generative graph analytics (LLM-GGA) in terms of three categories: LLM-based graph query processing (LLM-GQP), LLM-based graph inference and learning (LLM-GIL), and graph-LLM-based applications. LLM-GQP focuses on an integration of graph analytics techniques and LLM prompts, including graph understanding and knowledge graphs and LLMs, while LLM-GIL focuses on learning and reasoning over graphs, including graph learning, graph-formed reasoning, and graph representation. We summarize the useful prompts incorporated into LLM to handle different graph downstream tasks. Moreover, we give a summary of LLM model evaluation, benchmark datasets/tasks, and a deep pro and cons analysis of the discussed LLM-GGA models. We also explore open problems and future directions in this exciting interdisciplinary research area of LLMs and graph analytics.

Abstract:
Accurately predicting the dynamics of complex networks is a challenging task. Many studies have shown that data-driven frameworks offer promising solutions to this issue. However, existing approaches still face significant limitations, particularly when network structures evolve from lower-order to higher-order networks, or when the dynamical equation of the network is governed by multiple dynamical terms, such as local self-dynamics and lower-order and higher-order coupling dynamics. To this end, we propose a universal physics-informed neural network framework capable of predicting various types of dynamics on both lower- and higher-order networks. First, the framework captures and integrates more nonlinear features through the higher-order term expansion module. Second, we design a hybrid neural network module to differentially learn each dynamical term to comprehensively capture network dynamics. Finally, a physics-informed loss function construction module is introduced to integrate differential loss with prediction loss, improving the accuracy of network dynamical prediction. Experimental results indicate that our method outperforms the state-of-the-art approaches in predicting network dynamics on both lower- and higher-order networks. Ablation studies confirm the critical role of each module. In addition, our method also performs well on real-world dynamical processes, which shows that it remains robust in real complex scenarios.

Abstract:
Causal discovery in multi-rate time series encounters greater challenges compared to regular time series. This stems from a potential problem that has not been noticed and explored in existing studies: information granularity heterogeneity, which refers to the natural difference in information granularity between fast sampling rate data (high information granularity) and slow sampling rate data (low information granularity). Such an imbalance in information granularity can hinder the modeling of forecasting relationships and induce biased causal learning. Therefore, we propose a Mutual Information-iNspired causal Discovery framework (MIND), aiming to derive rate-agnostic features with consistent information granularity to alleviate the information granularity heterogeneity problem. Technically, MIND comprises Stage 1 (pre-training) and Stage 2 (fine-tuning and causal discovery). In Stage 1, empowered by pseudo-slow sampling rate data (generated through the interleaved down sampling strategy) and mutual information, we can eliminate the influence of sampling rates and drive rate-aware encoders (RAEs) to sense key information (i.e., rate-agnostic) that remains unchanged across varying sampling rates. In Stage 2, the well-trained RAEs can extract rate-agnostic features from real multi-rate time series, thus facilitating effective modeling of forecasting relationships and yielding accurate causal discovery. Empirically, MIND realizes superior performance on various multi-rate scenarios, including four simulation datasets and one real-world dataset.

Abstract:
In large-scale service networks, Quality of Service (QoS) data is vital for tasks such as resource provisioning, real-time service recommendation, and user experience optimization; yet effectively predicting QoS often faces two key obstacles: data sparsity and highly imbalanced distributions (where most response times cluster near small values while a minority grow disproportionately large). Existing approaches typically rely on individual object (user/service) features, overlooking the phenomenon that users or services within the same region (physical or virtual) exhibit similar network states. This paper presents a Region-Aware Dual-Latent State Learning (R2SL) framework that tackles these challenges by explicitly modeling regional network latent states. Specifically, we propose to learn physical-region (city-level) and virtual-region (AS-level) latent states from historical QoS records through a joint EM–gradient descent strategy, thereby alleviating data sparsity. Furthermore, to mitigate label imbalance in QoS data, we introduce a Smooth Huber (S-Huber) loss function that appropriately reweights extreme errors, preventing the training process from being dominated by outliers. We also develop a sparsely activated mixture-of-experts module, dynamically routing regional latent features based on each prediction task’s context. Experiments on real-world QoS datasets show that R2SL substantially outperforms state-of-the-art baselines, including the newly introduced FRLN. On throughput tasks, R2SL reduces MAE by an average of 18.8% and RMSE by 12.9%, while on response time tasks, it achieves 26.4% lower MAE and 24.2% lower RMSE. These findings indicate that dual-latent state modeling, combined with a distribution-aware loss, effectively captures complex regional patterns and mitigates long-tail label effects, making R2SL a powerful and scalable framework for large-scale QoS data mining in service networks.

Abstract:
The rapid proliferation of multimedia fake news on social media has raised significant concerns in recent years. Existing studies on fake news detection predominantly adopt an instance-based paradigm, where the detector evaluates a single post to determine its veracity. Despite notable advancements achieved in this domain, we argue that the instance-based approach is misaligned with real-world deployment scenarios. In practice, detectors typically operate on servers that process incoming posts in temporal order, striving to assess their authenticity promptly. Instance-based detectors lack awareness of temporal information and contextual relationships between surrounding posts, and therefore fail to capture long-range dependencies from the timeline. To bridge this gap, we introduce a more practical stream-based multi-modal fake news detection paradigm, which assumes that social media posts arrive continuously over time and allows the utilization of previously seen posts to aid in the classification of incoming ones. To enable effective and transferable fake news detection under this novel paradigm, we propose maintaining historical knowledge as a collection of incremental high-level forgery patterns. Based on this principle, we design a novel framework called Incremental Forgery Pattern Learning and Clues Refinement (IPLCR). IPLCR incrementally learns high-level forgery patterns as the stream evolves, leveraging this knowledge to improve the detection of newly arrived posts. At the core of IPLCR is the Incremental Forgery Pattern Bank (IPB), which dynamically summarizes historical posts into a set of latent forgery patterns. IPB is designed to continuously incorporate timely knowledge and actively discard obsolete information, even during inference. When a new post arrives, IPLCR retrieves the most relevant forgery pattern knowledge from IPB and refines the clues for fake news detection. The refined clues are subsequently incorporated into IPB to enrich its knowledge base. Extensive experiments validate IPLCR’s effectiveness as a robust stream-based detector. Moreover, IPLCR addresses several critical issues relevant to industrial applications, including seamless context transfer and efficient model upgrading, making it a practical solution for real-world deployment.

Abstract:
Predicting equipment failures plays a pivotal role in minimizing maintenance costs and boosting production efficiency within the industrial sector. This paper introduces a novel approach that integrates Causal Inference with predictive modeling to enhance prediction accuracy, tackling key challenges such as noise interference, insufficient causal validation, and missing data. We first validate the causal connections identified by the Greedy Equivalence Search algorithm using conditional mutual information to strengthen the reliability of the causal graph. An information bottleneck strategy is then employed to isolate essential causal features, effectively filtering out irrelevant noise and refining the causal structure. Crucially, in the actual prediction phase, we propose a recursive causal inference-based imputation method to handle missing data, leveraging the causal graph to iteratively infer and fill gaps, thereby improving data completeness and prediction accuracy. Experimental results demonstrate that the proposed method significantly outperforms existing approaches, exhibiting superior accuracy and robustness in managing complex industrial datasets.

Abstract:
Multi-label feature selection is an effective approach to mitigate the high-dimensional feature problem in multi-label learning. Most existing multi-label feature selection methods either assume that the data is complete, or that either the features or the labels are incomplete. So far, there are few studies on multi-label data with missing features and labels. In many cases, missing features in instances of multi-label data often lead to missing labels, which is ignored by existing studies. We define this type of data as instance-dependent incomplete multi-label data. In this paper, we propose a feature selection method for instance-dependent incomplete multi-label data. Firstly, we use the positive correlations between features to reconstruct the feature space, thereby recovering missing values and enhancing non-missing values. Secondly, we use fuzzy tolerance relation to guide label recovery, and utilize fuzzy mutual implication granularity to impose structural constraint on the projection matrix. Thirdly, we achieve feature selection by eliminating the impact of incomplete instances and imposing sparse regularization on the projection matrix. Finally, we provide a convergent solution for the proposed feature selection framework. Comparative experiments with existing multi-label feature selection methods show that our method can perform effective feature selection on instance-dependent incomplete multi-label data.

Abstract:
Expert knowledge holds a pivotal role in artificial intelligence models. Constrained by the subjectivity and ignorance of human cognition, it is imperfectly reliable. Modeling and decision-making driven by such knowledge may generate large risks. To this end, it is necessary to investigate a mechanism for handling such imperfectly reliable knowledge. In this paper, the reliability of knowledge is described as expert reliability. A novel rule-based modeling framework with expert reliability is proposed correspondingly, including the following four parts: modeling, reasoning, optimization and robustness analysis. The main works are: (1) Based on the transparent knowledge representation of belief rule base (BRB), a linguistic Z-number BRB (LZ-BRB) is proposed, where the linguistic Z-number quantitatively represents expert reliability. (2) An improved evidential reasoning (ER) rule is developed to obtain the inference result of the LZ-BRB model. (3) A data-driven parameter optimization model is designed to reduce modeling errors caused by imperfectly reliable knowledge. (4) The robustness analysis of expert reliability is performed to further analyze its influence on the inference result. Finally, a fiber optic gyro (FOG) health evaluation case verifies the proposed method.

Abstract:
K-plane clustering (KPC), hyperplane clustering, and mixture regression all essentially fall within the same class of problems. This problem can be conceptualized as clustering in relatively high-dimensional K subspaces or K linear manifolds. Traditional KPC or fuzzy KPC models demonstrate a pronounced susceptibility to outliers, as they presuppose that the projection distance between data points and the plane normal vector adheres to the $L_2$ distance. Meanwhile, the assumption of infinitely extending clusters adversely affects clustering performance. To solve these problems, this paper proposes a new robust fuzzy local k-plane clustering (RFLkPC) method that combines the mixture distance of hinge loss and the $L_1$ norm. The RFLkPC model assumes that each plane cluster is bounded to a finite area, which allows it to flexibly and robustly handle plane clustering tasks with or without outliers. The corresponding model and optimization algorithms of RFLkPC are provided. A large number of experiments on simulated and real data verify the efficiency of RFLkPC compared to other related models on this topic.

Abstract:
An increasing number of web applications require cloud in-memory key-value stores to minimize latency and achieve higher throughput. They generally have diverse characteristics and constantly changing traffic volumes, which require different computational and memory resources. A serverless in-memory key-value store (IMKV) characterized by elastic resource allocation and pay-as-you-go billing could satisfy the requirements of diverse and dynamic workloads. However, we find that current serverless IMKVs fail to achieve fine-grained and prompt resource elasticity due to the limitations of their infrastructures. This paper proposes Genie, a lightweight serverless infrastructure for in-memory key-value caching with fine-grained and immediate elasticity. In Genie, a novel approach is adopted to enable dynamic and independent resource allocation to multiple tenants. It processes all arrived requests and estimates the vCPU consumption with a lightweight machine-learning approach for fine-grained billing. Moreover, Genie estimates the working set and dynamically resizes the allocated memory for hit ratio requirements. Evaluation results show that CPU estimation could be achieved every 100 microseconds without impacting system performance, and memory capacity could be adjusted by megabytes within seconds. The holistic design incurs 1%-2% performance degradation compared to our baseline. Moreover, Genie achieves an average of 58.3% CPU and 49.9% memory savings compared to AsparaDB for Memcache.

Abstract:
Federated Learning (FL) is a promising privacy-preserving machine learning paradigm that allows data owners to collaboratively train models while keeping their data localized. Despite its potential, FL faces challenges related to the trustworthiness of both clients and servers, particularly against curious or malicious adversaries. In this paper, we introduce a novel framework named Federated Learning with Low-Dimensional Update Representation and Proximity-Based defense (FLURP), designed to address privacy preservation and resistance to Byzantine attacks in distributed learning environments. FLURP employs the $\mathsf{LinfSample}$ method, enabling clients to compute the $l_\infty$ norm across sliding windows of updates, resulting in a Low-Dimensional Update Representation (LUR). Calculating the shared distance matrix among LURs, rather than updates, significantly reduces the overhead of Secure Multi-Party Computation (SMPC) by three orders of magnitude while effectively distinguishing between benign and poisoned updates. Additionally, FLURP integrates a privacy-preserving proximity-based defense mechanism utilizing optimized SMPC protocols to minimize communication rounds. Our experiments demonstrate FLURP's effectiveness in countering Byzantine adversaries with low communication and runtime overhead. FLURP offers a scalable framework for secure and reliable FL in distributed environments, facilitating its application in scenarios requiring robust data management and security.
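
The idea of taking the l-infinity norm over sliding windows of an update can be sketched as follows. This is only an illustration under stated assumptions, not the FLURP implementation: the abstract does not specify the window size, the stride, or how a trailing partial window is handled, so the function name `linf_windows`, its `window` and `stride` parameters, and the choice to cover only full windows are all hypothetical.

```python
def linf_windows(update, window, stride=None):
    """Compress a flat update vector into per-window l-infinity norms (illustrative sketch).

    Assumptions made for this example only: windows are non-overlapping unless a
    stride is given, and only full windows are used. If the update is shorter
    than one window, a single norm over the whole update is returned.
    """
    if stride is None:
        stride = window  # assumed default: non-overlapping windows
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not update:
        return []
    last_start = max(len(update) - window, 0)
    # The l-infinity norm of a window is its largest absolute entry.
    return [
        max(abs(x) for x in update[start:start + window])
        for start in range(0, last_start + 1, stride)
    ]


update = [0.1, -0.5, 0.2, 0.3, -0.05, 0.0]
lur = linf_windows(update, window=2)                        # three norms, one per window
lur_overlapping = linf_windows(update, window=3, stride=1)  # four overlapping windows
```

With a window of size w and non-overlapping windows, the representation is roughly w times shorter than the original update, which is what makes pairwise distance computation over the compressed vectors cheaper than over raw updates.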

Abstract:
Prompt tuning for pre-trained language models (PLMs) has been an effective approach for few-shot text classification. To make a prediction, a typical prompt tuning method employs a template wrapping the input text into a cloze question, and a verbalizer mapping the output embedding to labels. However, current methods typically depend on handcrafted templates and verbalizers, which require much domain-specific prior knowledge by human efforts. In this work, we investigate how to build a good human-free prompt tuning using soft prompt templates and soft verbalizers, which can be learned directly from data. To address the challenge of data scarcity, we integrate a set of trainable bases for sentence representation to transfer the contextual information into a low-dimensional space. By jointly pre-training the soft prompts and the bases using contrastive learning, the projection space can catch critical semantics at the sentence level, which could be transferred to various downstream tasks. To better bridge the gap between downstream tasks and the pre-training procedure, we formulate the few-shot classification tasks as another contrastive learning problem. We name this Jointly Pretrained Template and Verbalizer (JPTV). Extensive experiments show that this human-free prompt tuning can achieve comparable or even better performance than manual prompt tuning.

Abstract:
The excellent performance of graph neural networks (GNNs), which learn node representations by aggregating their neighborhood information, has led to their use in various graph tasks. However, GNNs are black box models, the prediction results of which are difficult to understand directly. Although node attributes are vital for making predictions, previous studies have ignored their importance for explanation. This study presents GAFExplainer, a novel GNN explainer that emphasizes node attributes via attribute augmentation and fusion embedding. The former enhances node attribute encoding for more expressive masks, while the latter preserves the discrimination of node representations across different layers. Together, these modules significantly improve explanation performance. By training the explanatory network, a global view explanation of GNN models is obtained, and reasonably explainable subgraphs are available for new graphs, thus rendering the model well-generalizable. Multiple sets of experimental results on real and synthetic datasets demonstrate that the proposed model provides valid and accurate explanations. In the visual analysis, the explanations obtained by the proposed model are more comprehensible than those in existing work. Further, the fidelity evaluation and efficiency comparison reveal that with an average performance improvement of 8.9% compared with representative baselines, GAFExplainer achieves the best fidelity metrics while maintaining computational efficiency.

Abstract:
Dynamic symmetric searchable encryption (SSE) enables clients to perform searches and updates on an encrypted database outsourced to an untrusted server while preserving the privacy of data and queries. For restricting information leakage, it is very important to limit what the server can learn about the deleted data during searches after the deletion, i.e., to satisfy backward privacy. However, previous backward privacy definitions only considered the logical deletion of keywords in documents while ignoring security risks caused by the actual deletion of documents. Moreover, existing SSE schemes often depend on heavy cryptographic primitives for achieving high-level backward privacy, which greatly degrades the end-to-end performance. To this end, we define a new backward privacy notion named BP-DEL, which restricts the information leakage of the actual deletion. Moreover, we design a hybrid index structure that provides BP-DEL for SSE schemes such that they support deletions securely. Based on the hybrid index, we propose a BP-DEL construction named LUNA and design its protocols with a trusted execution environment (TEE) to maintain the index efficiently. Finally, we implement LUNA in the MySQL database by encapsulating it in UDFs. The experimental results show that LUNA has a performance much better than previous works satisfying BP-DEL.

Abstract:
Fraud detection, a classical data mining problem in finance applications, has risen in significance amid the intensifying confrontation between fraudsters and anti-fraud forces. Recently, an increasing number of criminals are constantly expanding the scope of fraud activities to covet the property of innocent victims. However, most existing approaches require abundant historical records to mine fraud patterns from financial transaction behaviors, which makes it significantly challenging to protect minority groups, who are less involved in the modern financial market but are also under threat from fraudsters nowadays. Therefore, in this paper, we propose a novel community-enhanced multi-relation graph neural network-based model, named CMR-GNN, to address the important defects of existing fraud detection models in the tail effect situation. In particular, we first construct multiple types of relation graphs from historical transactions and then devise a clustering-based neural network module to capture diverse patterns from transaction communities. To mitigate the lack of information for tail nodes, we propose tailed-groups learning modules to aggregate features from similarly clustered subgraphs by graph convolution networks. Extensive experiments on both real-world and public datasets demonstrate that our method not only surpasses the state-of-the-art baselines but can also effectively harness information within transaction communities while mitigating the impact of tail effects.

Abstract:
Fraud detection has always been one of the primary concerns in social and economic activities and is becoming a decisive force in the booming digital economy. Graph structures formed by rich user interactions naturally serve as important clues for identifying fraudsters. While numerous graph neural network-based methods have been proposed, the diverse interactive connections within graphs and the heterophilic connections deliberately established by fraudsters to normal users as camouflage pose new research challenges. In this light, we propose H2IDE (Homophily and Heterophily Identification with Disentangled Embeddings) for accurate fraud detection in multi-relation graphs. H2IDE features an independence-constrained disentangled representation learning scheme to capture various latent behavioral patterns in graphs, along with a supervised identification task to specifically model the factor-wise heterophilic connections, both of which are proven crucial to fraud detection. We also design a relation-aware attention mechanism for hierarchical and adaptive neighborhood aggregation in H2IDE. Extensive comparative experiments with state-of-the-art baseline methods on two real-world multi-relation graphs and two large-scale homogeneous graphs demonstrate the superiority and scalability of our proposed method and highlight the key role of disentangled representation learning with homophily and heterophily identification.

Abstract:
Variable Subset Forecasting (VSF) refers to a unique scenario in multivariate time series forecasting, where available variables in the inference phase are only a subset of the variables in the training phase. VSF presents significant challenges as the entire time series may be missing, and neither inter- nor intra-variable correlations persist. Such conditions impede the effectiveness of traditional imputation methods, which primarily focus on filling in individual missing data points. Inspired by the principle of feature engineering that not all variables contribute positively to forecasting, we propose Task-Oriented Imputation for VSF (TOI-VSF), a novel framework that shifts the focus from accurate data recovery to directly supporting the downstream forecasting task. TOI-VSF incorporates a self-supervised imputation module, agnostic to the forecasting model, designed to fill in missing variables while preserving the vital characteristics and temporal patterns of time series data. Additionally, we implement a joint learning strategy for imputation and forecasting, ensuring that the imputation process is directly aligned with and beneficial to the forecasting objective. Extensive experiments across four datasets demonstrate the superiority of TOI-VSF, outperforming baseline methods by 15% on average. The code is available at https://github.com/Asteriaqq/TOI-VSF.

Abstract:
We introduce the notion of orthogonality between column sets as a means to eliminate accidental keys during data-driven key discovery. This becomes particularly important for inconsistent relations, where keys that almost hold need to be considered as well. Here we employ the classic $g_3$ metric for dirtiness, which can be applied to a variety of key semantics for incomplete relations, although the difficulty in computing the metric under different semantics varies greatly. Efficient algorithms for orthogonal mining and measuring dirtiness are proposed and evaluated on a real-world database. Additionally, we propose the notion of partial key, which turns out to be particularly useful for dealing with incomplete data in a key discovery context.
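
For readers unfamiliar with the $g_3$ metric, the sketch below computes it for a candidate key in the simplest setting: a complete relation without missing values, where $g_3$ is the minimum fraction of rows that must be removed for the column set to become a key. The incomplete-relation key semantics discussed in the abstract are not covered, and the function name `g3_key_error` and the sample rows are hypothetical.

```python
def g3_key_error(rows, key_columns):
    """Minimum fraction of rows to remove so that key_columns is a key.

    Assumes a complete relation: every row has a value for every key column.
    """
    if not rows:
        return 0.0
    # Rows that agree on all key columns violate the key; at most one row
    # per distinct projection can be kept.
    distinct = {tuple(row[c] for c in key_columns) for row in rows}
    return (len(rows) - len(distinct)) / len(rows)


rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Rome"},
]
error_id = g3_key_error(rows, ["id"])               # one of four rows must go
error_id_city = g3_key_error(rows, ["id", "city"])  # already a key
error_city = g3_key_error(rows, ["city"])           # two of four rows must go
```

A column set with a small but nonzero value, such as `["id"]` here, is the kind of key that "almost holds" in an inconsistent relation.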

Abstract:
Sparse data gathering has become a promising solution for reducing measurement costs by leveraging the inherent sparsity of data. However, most existing approaches rely on low-dimensional models such as compressive sensing or matrix completion, which are limited in capturing complex high-dimensional structures. To overcome these limitations, we propose TensorMon, a novel tensor-based sparse data gathering framework that introduces a cuboid sampling strategy to more effectively exploit multidimensional correlations. Unlike traditional entry-based or tube-based sampling, TensorMon introduces the innovative concept of cuboid sampling. We further develop a lightweight sampling scheduling algorithm and a non-iterative inference algorithm to ensure efficient measurement planning and accurate reconstruction of unmeasured data. Theoretical analysis establishes a new performance bound for our sampling strategy, which is significantly lower than those in existing literature. To validate our theoretical findings, we conduct extensive experiments on four real-world datasets: two network monitoring datasets, a city-scale crowd flow dataset, and a road traffic speed dataset. Experimental results demonstrate that TensorMon achieves substantial reductions in measurement cost, delivers high inference accuracy, and ensures rapid data recovery, highlighting its effectiveness and practicality across diverse application scenarios.

Abstract:
Inductive relation prediction aims to predict missing connections between entities unseen during training. Recent approaches adopt binary (positive or negative) training labels, which indicate whether the query relation exists between the entities, as supervision to teach models to recognize the entity-independent relation patterns in the context (enclosed subgraph or connective path). However, we argue that in this kind of method, the trained models are guided to make relation predictions by remembering whether the query relation and its contextual relational pattern co-occur more frequently in positive or negative samples. This solution could introduce two major limitations: 1) the model struggles with long-tail combinations, i.e., combinations of the query relation and a relational pattern that rarely occur during training; 2) when noisy relational patterns, which fail to provide evidence for predicting the query relation, frequently occur with the query relation in positive training samples, the model will be misled into considering the noisy relational patterns as a feature supporting the existence of the query relation. To solve these problems, we propose ToC (Thinking on Context). ToC first utilizes large language models (LLMs) to incorporate a chain of thought as an additional supervisory constraint, guiding the model to make relational predictions based on logical reasoning instead of co-occurrence frequency. Additionally, ToC employs the reasoning capabilities of LLMs to construct context-level negative samples, aiding the model in identifying and disregarding noisy relational patterns. Extensive experiments show that ToC significantly outperforms state-of-the-art methods across three widely used datasets in multiple inductive settings.

Abstract:
Information popularity prediction is important yet challenging in various domains, including viral marketing and news recommendations. The key to accurately predicting information popularity lies in subtly modeling the underlying temporal information diffusion process behind observed events of an information cascade, such as the retweets of a tweet. To this end, most existing methods either adopt recurrent networks to capture the temporal dynamics from the first to the last observed event or develop a statistical model based on self-exciting point processes to make predictions. However, information diffusion is intrinsically a complex continuous-time process with irregularly observed discrete events, which is oversimplified using recurrent networks as they fail to capture the irregular time intervals between events, or using self-exciting point processes as they lack flexibility to capture the complex diffusion process. Against this background, we propose ConCat, modeling the Continuous-time dynamics of Cascades for information popularity prediction. On the one hand, it leverages neural Ordinary Differential Equations (ODEs) to model irregular events of a cascade in continuous time based on the cascade graph and sequential event information. On the other hand, it considers cascade events as neural temporal point processes (TPPs) parameterized by a conditional intensity function which can also benefit the popularity prediction task. We conduct extensive experiments to evaluate ConCat on three real-world datasets. Results show that ConCat achieves superior performance compared to state-of-the-art baselines, yielding 2.3%-33.2% improvement over the best-performing baselines across the three datasets.
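For readers unfamiliar with the self-exciting point processes mentioned above, a common textbook instance is the Hawkes process with an exponential kernel, whose conditional intensity is shown below. This is a generic illustration only, not ConCat's parameterization; the symbols \mu (baseline rate), \alpha (excitation strength), \beta (decay rate), and t_i (time of the i-th observed event) are assumptions introduced here.

```latex
\lambda(t \mid \mathcal{H}_t) \;=\; \mu \;+\; \sum_{t_i < t} \alpha \, e^{-\beta (t - t_i)},
\qquad \mu > 0,\ \alpha \ge 0,\ \beta > 0
```

Each past event raises the intensity by \alpha, and that contribution decays exponentially at rate \beta. The fixed kernel shape is one example of the limited flexibility that the abstract attributes to this model family.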

Abstract:
Label Distribution Learning (LDL) offers a promising solution to label ambiguity by employing Label Distributions (LDs) instead of traditional logical labels. However, acquiring LDs for real-world data is both expensive and challenging. To address this issue, Label Enhancement (LE) techniques have been proposed to derive LDs from readily available logical labels. While much of the prior work has focused on enhancing LE for better recovery performance, the ultimate objective remains improving LDL’s overall effectiveness. In this paper, we introduce a novel LE method, Topological Label Enhancement via Optimal Transport (TLEOT), which integrates Optimal Transport (OT) theory with topological space analysis. This method goes beyond improving LE, targeting the enhancement of LDL performance by aligning the feature and label distributions within a unified topological framework. Additionally, we present two innovative topological techniques designed to further improve LDL. Extensive experimental evaluations on real-world datasets demonstrate that TLEOT consistently outperforms nine state-of-the-art methods in predictive tasks. Furthermore, the proposed topological techniques significantly enhance LDL’s performance, validating their practical utility in real-world applications.

Abstract:
A multi-modal knowledge graph (MKG) includes triplets that consist of entities and relations, together with multi-modal auxiliary data. In recent years, multi-hop multi-modal knowledge graph reasoning (MMKGR) based on reinforcement learning (RL) has received extensive attention because it addresses the intrinsic incompleteness of MKG in an interpretable manner. However, its performance is limited by empirically designed rewards and sparse relations. In addition, this method has been designed for the transductive setting where test entities have been seen during training, and it works poorly in the inductive setting where test entities do not appear in the training set. To overcome these issues, we propose TMR (Topology-aware Multi-hop Reasoning), which can conduct MKG reasoning under inductive and transductive settings. Specifically, TMR mainly consists of two components. (1) The topology-aware inductive representation captures information from the directed relations of unseen entities, and aggregates query-related topology features in an attentive manner to generate fine-grained entity-independent features. (2) After completing multi-modal feature fusion, the relation-augmented adaptive RL conducts multi-hop reasoning by eliminating manual rewards and dynamically adding actions. Finally, we construct new MKG datasets with different scales for inductive reasoning evaluation. Experimental results demonstrate that TMR outperforms state-of-the-art MKGR methods under both inductive and transductive settings.

Abstract:
Multi-label feature selection can effectively solve the curse of dimensionality problem in multi-label learning. Existing multi-label feature selection methods mostly handle multi-label data without missing features. However, in practical applications, multi-label data with missing features exist widely, and most existing multi-label feature selection methods are not directly applicable. Therefore, we propose a feature selection method for multi-label data with missing features. First, we propose a method to extract implicit label information from the feature space to replenish the binary label information. Second, we learn the positive correlation between features to construct a feature correlation recovery matrix to recover missing features. Finally, we design a sparse model-based multi-label feature selection method for processing multi-label data with missing features and prove the convergence of this method. Comparative experiments with existing feature selection methods demonstrate the effectiveness of our method.

Abstract:
Researchers confront significant challenges when classifying and modeling data due to the growing system complexity, data volume, and requirement for accurate and reliable models. Information granules, a fundamental component of Granular Computing (GrC), play a crucial role in human cognition. In this study, we develop a granular classifier called DBCGM, which delivers models with improved accuracy and efficiency in Big Data classification. In particular, DBCGM achieves this goal through the following four steps. First, we construct a 1-D index for each point and then utilize a context-based data bisection method to obtain non-overlapping subsets. These disjoint subsets enhance both the quality and efficiency of Big Data clustering and make it possible to process the entire dataset simultaneously. Next, we propose a cascade weighted clustering (CWC) algorithm to generate numeric prototypes from the obtained subsets. Then, following the principle of justifiable granularity (PJG), the numeric prototypes are refined into information granules. Finally, the classifier can be regarded as the weighted sum of all the values of the spatial relationship between the input instance and the information granules. We evaluate the performance of DBCGM in terms of accuracy, V-measure, and execution time. We compare DBCGM with benchmark classifiers and three Big Data granulating methods. Experimental results on both synthetic and public datasets show that DBCGM outperforms the existing methods. In particular, compared with the state-of-the-art method, DBCGM reduces the running time by an average of 2.55%, and improves the V-measure and accuracy by an average of 5.71% and 2.32%, respectively.

Abstract:
Detecting sentiment or emotion from user-generated texts has been intensively studied in natural language understanding, especially via neural models based on text representation. However, how the final text sentiment is determined by the neural text representation has not yet been thoroughly explained. Consequently, in this paper, we propose CogLign, which injects neural cognition derived from Electroencephalogram (EEG) signals into a neural text sentiment analysis model. It aims to learn the activation of brain regions stimulated by different sentiments, so as to guide CogLign to determine text sentiment in a brain-like way. Specifically, on the one hand, subjects watch videos with different sentiments while their EEG signals are monitored to construct the brain connectivity pattern as a brain graph (BG), yielding a more pronounced sentiment response in brain region activation for neural cognition. On the other hand, we interpret the video plots (or video semantics) along the timeline into text, where the entire video-interpreted text is strictly bound to the whole EEG signal sequence, segment by segment, via a fixed-size time window. Then, entities and relations are extracted from the video-interpreted text to construct a knowledge graph (KG) depicting the text semantics. Next, a mapping from entities (or nodes) in the KG to EEG electrodes (or nodes) in the BG, which are further traced back to different brain regions, is learned via cognition alignment between the EEG-derived BG and the text-derived KG. In this way, by aligning the neural cognition from the brain graph with the semantic cognition from the knowledge graph, our proposed framework CogLign not only achieves the overall best sentiment analysis performance on the video-interpreted text, but also detects brain connectivity patterns in different sentiments that are more consistent with prior conclusions on brain region sentiment preference, revealing competitive interpretability of text sentiment determination.

Abstract:
Translating users’ natural language queries (NL) into SQL queries (i.e., Text-to-SQL, a.k.a. NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of Text-to-SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of Text-to-SQL techniques powered by LLMs, covering its entire lifecycle from the following four aspects: (1) Model: Text-to-SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL with database schema and instances; (2) Data: From the collection of training data, data synthesis due to training data scarcity, to Text-to-SQL benchmarks; (3) Evaluation: Evaluating Text-to-SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing Text-to-SQL errors to find the root cause and guiding Text-to-SQL models to evolve. Moreover, we offer a rule of thumb for developing Text-to-SQL solutions. Finally, we discuss the research challenges and open problems of Text-to-SQL in the LLMs era.

Abstract:
The task of incomplete multi-view clustering (IMvC) aims to partition multi-view data with a lack of completeness into different clusters. The incompleteness can be typically categorized into the case of instance-missing and view-unaligned MvC. However, prior methods either consider each of them or struggle to pursue consistent latent representations among views. In this paper, we propose two forms of contrastive learning paradigms to jointly handle both cases for IMvC. Specifically, we design an instance-oriented contrastive (IOC) learning strategy to achieve intra-class consistency. As negative samples within different datasets can exhibit diverse distributions, we formulate a parameterized boundary for IOC learning to flexibly deal with such differing data modes. To preserve inter-view consistency, we further devise category-oriented contrastive (COC) learning such that data from different views can be seamlessly integrated into a combined semantic space. We also recover the missing instances with the learned latent representations in a reconstructing manner for realigning the incomplete multi-view data to facilitate clustering. Our approach unifies the solution to both incomplete cases into one formulation. To demonstrate the effectiveness of our model, we conduct four types of MvC tasks on six benchmark multi-view datasets and compare our method against state-of-the-art IMvC methods. Extensive experiments show that our method achieves state-of-the-art performance, quantitatively and qualitatively.

Abstract:
Discovering hidden key users of leading topics plays an important role in opinion control and risk prevention. To address the dynamic nature of key users’ intentions and related problems, a key user discovery model based on behavioral intentions and implicit relationships is proposed. First, to address the dynamic nature of key users’ intentions, the dynamic latent Dirichlet allocation method is introduced. This approach effectively mines topic evolution in text data, uncovering dynamic behavioral themes of key users and analyzing evolutionary relationships between topics. Meanwhile, incremental learning is introduced to quantify the dynamic behavioral intentions of key users more precisely. Second, a random walk strategy based on user interaction degree and propagation depth is designed to address the hidden nature of user relationships. The strategy introduces the user interaction degree, designed according to social cognition theory, and the propagation depth, designed according to propagation chain theory, to better explore the hidden user interaction relationships. Finally, for the timeliness of key user identification, considering the advantage of dynamic evolution for real-time interaction, dynamic evolution is introduced to effectively analyze the dynamic structure of topic networks, and an attention mechanism is introduced to improve the adaptivity of the model. Experiments verify the existence of hidden key users who drive the promotion behind guided public opinion, and show that the proposed model is more effective in tracing hidden key users in topics.

Abstract:
Transforming data into data assets and enabling their trading and circulation is an inevitable trend in the development of the global digital economy. Aiming at the release of data value and the development of its transaction process, the concept of an integrated score of data is proposed by combining an integrated quality index containing four dimensions with data quantity. On this basis, data assets are priced according to the principle of profit maximization by constructing a nonlinear programming model. Three types of pricing models are distinguished according to the heterogeneity of consumers’ utility sensitivity, and the consumers’ willingness to pay is adjusted based on business parameters using the FAHP system. The proposed model is verified with data on China's carbon emissions as the original data, combined with the KNN machine learning algorithm and a series of simulation analyses. In addition, multiple sets of heterogeneous data are tested. The results show that the quality, quantity and utility of data have an important impact on the pricing of data assets, and that it is necessary to distinguish the utility sensitivity of consumers as well as take business parameters into consideration. The proposed model can also provide a decision-making reference for data platforms.

Abstract:
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks, their evaluation metrics, and the sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance on each task and how it can be improved. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper aims to provide valuable insights and directions for future research in chart understanding leveraging large foundation models.

Abstract:
Change point detection (CPD) is a valuable technique in time series (TS) analysis, which allows for the automatic detection of abrupt variations within the TS. It is often useful in applications such as fault, anomaly, and intrusion detection systems. However, the inherent unpredictability and fluctuations in many real-time data sources pose a challenge for existing CPD techniques, leading to inconsistent performance across diverse real-time TS with varying characteristics. To address this challenge, we have developed a novel and robust online CPD algorithm constructed from the principle of discriminant analysis and based upon a newly proposed between-class average and variance evaluation approach, termed B-CAVE. Our B-CAVE algorithm features a unique change point measure, which has only one tunable parameter (i.e., the window size) in its computational process. We have also proposed a new evaluation metric that integrates time delay and the false alarm error to effectively compare the performance of different CPD methods in the literature. To validate the effectiveness of our method, we conducted experiments using both synthetic and real datasets, demonstrating the superior performance of the B-CAVE algorithm over other prominent existing techniques.
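To make the idea of a discriminant-analysis-style change measure with a single window-size parameter more concrete, here is a minimal Python sketch. It is not the B-CAVE measure, whose exact definition the abstract does not give. It uses a generic Fisher-style ratio (squared difference of window means over the sum of window variances), and the function name `fisher_change_scores`, the `eps` stabilizer, and the toy series are all assumptions made for illustration.

```python
from statistics import fmean, pvariance


def fisher_change_scores(series, window, eps=1e-12):
    """Score each index t by a Fisher-style separation between the
    `window` samples before t and the `window` samples from t onward.

    Generic illustration only; this is NOT the B-CAVE measure.
    """
    scores = [0.0] * len(series)
    for t in range(window, len(series) - window + 1):
        left = series[t - window:t]      # samples just before t
        right = series[t:t + window]     # samples from t onward
        between = (fmean(left) - fmean(right)) ** 2   # between-class spread
        within = pvariance(left) + pvariance(right)   # within-class spread
        scores[t] = between / (within + eps)          # eps avoids division by zero
    return scores


# Toy series (made up): the mean shifts from about 0 to about 5 at index 10.
toy_series = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
              5.2, 4.9, 5.1, 5.0, 4.8, 5.3, 5.1, 4.9, 5.0, 5.2]
toy_scores = fisher_change_scores(toy_series, window=5)
detected_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(detected_index)
```

Because the score at index t uses the `window` samples from t onward, a streaming version of this sketch could only report a change after those samples arrive. That detection delay is the kind of cost that an evaluation metric combining time delay and false alarm error is meant to capture.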