AAAI 2025 collected by Wang

Abstract: A *citizens' assembly* is a group of people who are randomly selected to represent a larger population in a deliberation. While this approach has successfully strengthened democracy, it has certain limitations that suggest the need for assemblies to form and associate more organically. In response, we propose *federated assemblies*, where assemblies are interconnected, and each parent assembly is selected from members of its child assemblies. The main technical challenge is to develop random selection algorithms that meet new representation constraints inherent in this hierarchical structure. We design and analyze several algorithms that provide different representation guarantees under various assumptions on the structure of the underlying graph.

Abstract: Firstorder linear temporal logic (FOLTL) is a flexible and expressive formalism capable of naturally describing complex behaviors and properties. Although the logic is in general highly undecidable, the idea of using it as a specification language for the verification of complex infinite-state systems is appealing. However, a missing piece, which has proved to be an invaluable tool in dealing with other temporal logics, is an automaton model capable of capturing the logic. In this paper we address this issue, by defining and studying such a model, which we call first-order automaton. We define this very general class of automata, and the corresponding notion of regular first-order language (of finite words), showing their closure under most language-theoretic operations. We show how they can capture any FOLTL formula over finite words, over any signature and theory, and provide sufficient conditions for the semi-decidability of their non-emptiness problem. Then, to show the usefulness of the formalism, we prove the decidability of monodic FOLTL, a classic result known in the literature, with a simpler and direct proof.

Abstract: Large amounts of missing data are becoming increasingly ubiquitous in modern highdimensional datasets. Unfortunately, classical completion methods like low-rank, high-rank, or deep matrix completion (LRMC/HRMC/DMC) are often unable to handle real data that does not fall under their respective models. Here we propose a novel completion strategy that generalizes all these models. The main idea is to find a Union of Subspaces (UoS) that can fit a non-linear embedding of the original data, and complete the data according to this latent UoS. This embedding is obtained through a novel pseudo-completion layer in a deep architecture, and the UoS structure is identified in closed-form through an intermediate clustering layer. Our design reduces the exponential memory requirements that are typically induced by uneven patterns of missing data. We give exact details of our architecture, model, loss functions, and training strategy. Our experiments on over 10 real datasets show that our method consistently outperforms the state-of-the-art accuracy by more than a staggering 40%.

Abstract: While AI systems are capable of reading texts and seeing images, they typically perceive surface information explicitly conveyed with limited abilities to comprehend hidden messages (e.g., a doubleedged remark). We propose the novel task of advertisement understanding: given an advertisement, which can be a text, an image, or a video, the goal is to identify the persuasion strategies used and determine the (possibly hidden) messages conveyed. Efforts on this task could enhance machine comprehension capabilities, and provide users with increased situation awareness w.r.t. the advertised message and thus possibly enable mindful decision making. We believe that this task presents long-term challenges to AI researchers and that successful understanding of ads could bring machine understanding one important step closer to human understanding.

Abstract: Rapid advancements in medical image segmentation performance have been significantly driven by the development of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models follow discriminative pixelwise classification learning paradigm and often have limited ability to generalize across diverse medical imaging datasets. In this manuscript, we introduce Generative Medical Segmentation (GMS), a novel generative approach to perform image segmentation. GMS employs a robust pre-trained vision foundation model to extract latent representations for images and corresponding ground truth masks, followed by a lightweight model that learns a mapping function from the image to the mask in the latent space. Once trained, the model can generate estimated segmentation masks using the pre-trained vision foundation model to decode the predicted latent mask representation back into image space. The design of GMS leads to fewer trainable parameters in the model, reducing the risk of overfitting and enhancing its generalization capability. Our experimental analysis across five open-source datasets in different medical imaging domains demonstrates GMS outperforms existing discriminative and generative segmentation models. Furthermore, GMS is able to generalize well across datasets of the same imaging modality from different centers. Our experiments suggest GMS offers a scalable and effective solution for medical image segmentation.

Abstract: Contemporary deep learning, characterized by the training of cumbersome neural networks on massive datasets, confronts substantial computational hurdles. To alleviate heavy data storage burdens on limited hardware resources, numerous dataset compression methods such as dataset distillation (DD) and coreset selection have emerged to obtain a compact but informative dataset through synthesis or selection for efficient training. However, DD involves an expensive optimization procedure and exhibits limited generalization across unseen architectures, while coreset selection is limited by its low data keep ratio and reliance on heuristics, hindering its practicality and feasibility. To address these limitations, we introduce a newly versatile framework for dataset compression, namely Adaptive Dataset Quantization (ADQ). Specifically, we first identify the suboptimal performance of naive Dataset Quantization (DQ), which relies on uniform sampling and overlooks the varying importance of each generated bin. Subsequently, we propose a novel adaptive sampling strategy through the evaluation of generated bins' representativeness score, diversity score and importance score, where the former two scores are quantified by the texture level and contrastive learning-based techniques, respectively. Extensive experiments demonstrate that our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results.

Abstract: Multimodal hashing methods have gained popularity due to their fast speed and low storage requirements. Among them, the supervised methods demonstrate better performance by utilizing labels as supervisory signals compared with unsupervised methods. Currently, for almost all supervised multi-modal hashing methods, there is a hidden assumption that training sets have no noisy labels. However, labels are often annotated incorrectly due to manual labeling in real-world scenarios, which will greatly harm the retrieval performance. To address this issue, we first discover a significant distribution consistency pattern through experiments, i.e., the 1-0 distribution of the presence or absence of each category in the label is consistent with the high-low distribution of similarity scores of the hash codes relative to category centers. Then, inspired by this pattern, we propose a novel Distribution-Consistency-Guided Multi-modal Hashing (DCGMH), which aims to filter and reconstruct noisy labels to enhance retrieval performance. Specifically, the proposed method first randomly initializes several category centers, each representing the region's centroid of its respective category, which are used to compute the high-low distribution of similarity scores; Noisy and clean labels are then separately filtered out via the discovered distribution consistency pattern to mitigate the impact of noisy labels; Subsequently, a correction strategy, which is indirectly designed via the distribution consistency pattern, is applied to the filtered noisy labels, correcting high-confidence ones while treating low-confidence ones as unlabeled for unsupervised learning, thereby further enhancing the model’s performance. Extensive experiments on three widely used datasets demonstrate the superiority of the proposed method compared to state-of-the-art baselines in multi-modal retrieval tasks.

Abstract: Collaborative reasoning enhances recommendation performance by combining the strengths of symbolic learning and deep neural learning. However, current collaborative reasoning models rely on parameterized networks to simulate logical operations within the reasoning process, which (1) do not comply with all axiomatic principles of classical logic and (2) limit the model's generalizability. To address these limitations, a Fuzzy logic approach tailored for Collaborative Reasoning (FuzzCR) is proposed in this work, aiming to augment the recommendation system with cognitive abilities. Specifically, this method redefines the sequential recommendation task as a logical query answering process to facilitate a more structured and logical progression of reasoning. Moreover, learningfree fuzzy logical operations are implemented for the designed reasoning process. Taking advantage of the inherent properties of fuzzy logic, these logical operations satisfy fundamental logical rules and ensure complete reasoning. After training, these operations can be applied to flexible reasoning processes, rather than being confined to fixed computation graphs, thereby exhibiting good generalizability. Extensive experiments conducted on publicly available datasets demonstrate the superiority of this method in solving the sequential recommendation task.

Abstract: We study temporal fair division, whereby a set of agents are allocated a (possibly different) set of goods on each day for a period of days. We study this setting, as well as a number of its special cases formed by the restrictions to two agents, same goods on each day, identical preferences, or combinations thereof, and chart out the landscape of achieving two types of fairness guarantees simultaneously: fairness on each day (per day) and fairness over time (up to each day, or the weaker version, overall). In the most general setting, we prove that there always exists an allocation that is stochasticallydominant envy-free up to one good (SD-EF1) per day and proportional up to one good (PROP1) overall, and when all the agents have identical preferences, we show that SD-EF1 per day and SD-EF1 overall can be guaranteed. For the case of two agents, we prove that SD-EF1 per day and EF1 up to each day can be guaranteed using an envy balancing technique. We provide counterexamples for other combinations that establish our results as among the best guarantees possible, but also leave open some tantalizing questions.

Abstract: Rent division is the wellstudied problem of fairly assigning rooms and dividing rent among a set of roommates within a single apartment. A shortcoming of existing solutions is that renters are assumed to be considering apartments in isolation, whereas in reality, renters can choose among multiple apartments. In this paper, we generalize the rent division problem to the multi-apartment setting, where the goal is to both fairly choose an apartment among a set of alternatives and fairly assign rooms and rents within the chosen apartment. Our main contribution is a generalization of envy-freeness called *negotiated envy-freeness*. We show that a solution satisfying negotiated envy-freeness is guaranteed to exist and that it is possible to optimize over all negotiated envy-free solutions in polynomial time. We also define an even stronger fairness notion called *universal envy-freeness* and study its existence when values are drawn randomly.

Abstract: A core tension in the study of plurality elections is the clash between the classic HotellingDowns model, which predicts that two office-seeking candidates should cater to the median voter, and the empirical observation that democracies often have two major parties with divergent policies. Motivated in part by this tension, we introduce a dynamic model of candidate positioning based on a simple bounded rationality heuristic: candidates imitate the policy of previous winners. The resulting model is closely connected to evolutionary replicator dynamics. For uniformly-distributed voters, we prove in our model that with k = 2, 3, or 4 candidates per election, any symmetric candidate distribution converges over time to the center. With k ≥ 5 candidates per election, however, we prove that the candidate distribution does not converge to the center and provide an even stronger non-convergence result in a special case with no extreme candidates. Our conclusions are qualitatively unchanged if a small fraction of candidates are not winner-copiers and are instead positioned uniformly at random in each election. Beyond our theoretical analysis, we illustrate our results in extensive simulations; for five or more candidates, we find a tendency towards the emergence of two clusters, a mechanism suggestive of Duverger's Law, the empirical finding that plurality leads to two-party systems. Our simulations also explore several variations of the model, where we find the same general pattern: convergence to the center with four or fewer candidates, but not with five or more. Finally, we discuss the relationship between our replicator dynamics model and prior work on strategic equilibria of candidate positioning games.

Abstract: It has been shown recently that physicsbased simulation significantly enhances the disassembly capabilities of real-world assemblies with diverse 3D shapes and stringent motion constraints. However, the efficiency suffers when tackling intricate disassembly tasks that require numerous simulations and increased simulation time. In this work, we propose a State-Based Disassembly Planning (SBDP) approach, prioritizing physics-based simulation with translational motion over rotational motion to facilitate autonomy, reducing dependency on human input, while storing intermediate motion states to improve search scalability. We introduce two novel evaluation functions derived from new Directional Blocking Graphs (DBGs) enriched with state information to scale up the search. Our experiments show that SBDP with new evaluation functions and DBGs constraints outperforms the state-of-the-art in disassembly planning in terms of success rate and computational efficiency over benchmark datasets consisting of thousands of physically valid industrial assemblies.

Abstract: In reinforcement learning, especially in sparsereward domains, many environment steps are required to observe reward information. In order to increase the frequency of such observations, "potential-based reward shaping" (PBRS) has been proposed as a method of providing a more dense reward signal while leaving the optimal policy invariant. However, the required potential function must be carefully designed with task-dependent knowledge to not deter training performance. In this work, we propose a bootstrapped method of reward shaping, termed BS-RS, in which the agent's current estimate of the state-value function acts as the potential function for PBRS. We provide convergence proofs for the tabular setting, give insights into training dynamics for deep RL, and show that the proposed method improves training speed in the Atari suite.

Abstract: We study estimator selection and hyperparameter tuning in off-policy evaluation. Although cross-validation is the most popular method for model selection in supervised learning, off-policy evaluation relies mostly on theory, which provides only limited guidance to practitioners. We show how to use cross-validation for off-policy evaluation. This challenges a popular belief that cross-validation in off-policy evaluation is not feasible. We evaluate our method empirically and show that it addresses a variety of use cases.

Abstract: Graph unlearning technology has become increasingly important since the advent of the `right to be forgotten' and the growing concerns about the privacy and security of artificial intelligence. Graph unlearning aims to quickly eliminate the effects of specific data on graph neural networks (GNNs). However, most existing deterministic graph unlearning frameworks follow a balanced partitionsubmodel training-aggregation paradigm, resulting in a lack of structural information between subgraph neighborhoods and redundant unlearning parameter calculations. To address this issue, we propose a novel Graph Structure Mapping Unlearning paradigm (GSMU) and a novel method based on it named Community-centric Graph Eraser (CGE). CGE maps community subgraphs to nodes, thereby enabling the reconstruction of a node-level unlearning operation within a reduced mapped graph. CGE makes the exponential reduction of both the amount of training data and the number of unlearning parameters. Extensive experiments conducted on five real-world datasets and three widely used GNN backbones have verified the high performance and efficiency of our CGE method, highlighting its potential in the field of graph unlearning.

Abstract: In Semisupervised learning(SSL), we always accept cluster assumption, assuming features in different high-density regions belong to other categories. However, it is always ignored by existing algorithms and needs mathematical explanations. This paper first proposes a theorem to statistically explain cluster assumption and prove that the probability density can significantly help to use the prior fully. A Probability-Density-Aware Measure(PM) is proposed based on the theorem to discern the similarity between neighbor points. The PM is deployed to improve Label Propagation and a new pseudo-labeling algorithm, the Probability-Density-Aware Label Propagation(PMLP), is proposed. We also prove that traditional first-order similarity pseudo-labeling could be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP's superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.

Abstract: Large Language Models (LLMs) have demonstrated humanlike instruction-following abilities, particularly those exceeding 100 billion parameters. The combined capability of some smaller, resource-friendly LLMs can address most of the instructions that larger LLMs excel at. In this work, we explore how to route the best-performing LLM for each instruction to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance. To learn from capability instructions, we introduce a new end-to-end framework called Model Selection with Aptitude Test (Model-SAT), which generates positive and negative samples based on what different models perform well or struggle with. Model-SAT uses a model capability encoder that extends its model representation to a lightweight LLM. Our experiments show that Model-SAT understands the performance dimensions of candidate models and provides the probabilities of their capability to handle various instructions. Additionally, during deployment, a new model can quickly infer its aptitude test results across 50 tasks, each with 20 shots. Model-SAT performs state-of-the-art model routing without candidate inference and in real-world new model-released scenarios.

Abstract: Face deidentification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, consisting of over 40,000 photo-realistic patient faces. MeMa is re-generated from massive real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy, while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task, by integrating data-driven medical semantic priors into the DeID procedure. Despite its conciseness and simplicity, our approach substantially outperforms previous ones.

Abstract: Knowledge Tracing (KT) is crucial in education assessment, which focuses on depicting students' learning states and assessing students' mastery of subjects. With the rise of modern online learning platforms, particularly massive open online courses (MOOCs), an abundance of interaction data has greatly advanced the development of the KT technology. Previous research commonly adopts deterministic representation to capture students' knowledge states, which neglects the uncertainty during student interactions and thus fails to model the true knowledge state in learning process. In light of this, we propose an UncertaintyAware Knowledge Tracing model (UKT) which employs stochastic distribution embeddings to represent the uncertainty in student interactions, with a Wasserstein self-attention mechanism designed to capture the transition of state distribution in student learning behaviors. Additionally, we introduce the aleatory uncertainty-aware contrastive learning loss, which strengthens the model's robustness towards different types of uncertainties. Extensive experiments on six real-world datasets demonstrate that UKT not only significantly surpasses existing deep learning-based models in KT prediction, but also shows unique advantages in handling the uncertainty of student interactions.

Abstract: This paper addresses the challenges of computational accountability in autonomous systems, particularly in Autonomous Vehicles (AVs), where safety and efficiency often conflict. We begin by examining current approaches such as cost minimization, reward maximization, humancentered approaches, and ethical frameworks, noting their limitations addressing these challenges. Foreseeability is a central concept in tort law that limits the accountability and legal liability of an actor to a reasonable scope. Yet, current data-driven methods to determine foreseeability are rigid, ignore uncertainty, and depend on simulation data. In this work, we advocate for a new computational approach to establish foreseeability of autonomous systems based on the legal “BPL” formula. We provide open research challenges, using fully autonomous vehicles as a motivating example, and call for researchers to help autonomous systems make accountable decisions in safety-critical scenarios.

Abstract: The future of Artificial Intelligence demands a paradigm shift towards multisensory perception—to systems that can digest ongoing multisensory observations, that can discover structure in unlabeled raw sensory data, and that can intelligently fuse useful information from different sensory modalities for decision making. While we humans perceive the world by looking, listening, touching, smelling, and tasting, traditional form of machine intelligence mostly focuses on a single sensory modality, particularly vision. Therefore, my research, which I call multisensory machine intelligence, aims to empower machines to emulate and enhance human capabilities in seeing, hearing, and feeling, ultimately enabling them to comprehensively perceive, understand, and interact with the multisensory world. In my AAAI25 new faculty highlight talk, I will present my research that studies two important aspects of the multisensory world: 1) multisensory objects, and 2) multisensory space. In both aspects, I will talk about how we design systems to reliably capture multisensory data from real-world objects and space, how we effectively model them with differentiable simulation algorithms that build a unified multisensory representation to virtualize real objects, and how we explore creative cross-modal/multi-modal applications with sight, sound, and touch in vision, graphics, and robotics. In the end, I will briefly conclude with my future plans.

Abstract: As the use of autonomous mobile robots expands into dynamic and complex environments, the need for them to provide understandable explanations for their actions becomes crucial. This thesis addresses the challenge of developing explainability for robot navigation by leveraging a hybrid model that combines machine learning techniques with symbolic reasoning methods. Furthermore, the thesis explores the modeling of human explanation preferences and the impact of different explanation attributes on explanation recipients' understanding, satisfaction, and trust. The goal is to integrate different explanation aspects and approaches into a unified framework to support explainable navigation in robotics.

Abstract: This paper outlines a proposal regarding the use of machine learning, specifically a longshort term model, to increase the military’s effectiveness and safety protocols. The approach is to collect data from weapons training and apply it to a model that can distinguish between weapon activities. By training the model on a dataset that consists of several common weapons activities, we hope to improve commanders' understanding of their troop's performance and readiness. The evaluation will consist of examining the loss of the model, its accuracy, and analyzing activities it frequently confused. This work will extend the current research in soldier activity recognition by introducing weapon activity recognition.

Abstract: We present EvalAssist, a framework that simplifies the LLMas-a-judge workflow. The system provides an online criteria development environment, where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. A library of LLM based evaluators is made available that incorporates various algorithmic innovations such as token-probability based judgement, positional bias checking, and certainty estimation that help to engender trust in the evaluation process. We have computed extensive benchmarks and also deployed the system internally in our organization with several hundreds of users.

Abstract: Uncertainty quantification remains a difficult challenge in reinforcement learning. Several algorithms exist that successfully quantify uncertainty in a practical setting. However it is unclear whether these algorithms are theoretically sound and can be expected to converge. Furthermore, they seem to treat the uncertainty in the target parameters in different ways. In this work, we unify several practical algorithms into one theoretical framework by defining a new Bellman operator on distributions, and show that this Bellman operator is a contraction. We highlight use cases of our framework by analyzing an existing Bayesian Qlearning algorithm, and also introduce a novel uncertainty-aware variant of PPO that adaptively sets its clipping hyperparameter.

Abstract: Recent conceptbased interpretable models have succeeded in providing meaningful explanations by pre-defined concept sets. However, the dependency on the pre-defined concepts restricts the application because of the limited number of concepts for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then predict a final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both the target task performance and the explanation quality, we train XBMs through the target task loss with the regularization penalizing the explanation decoder via the distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets.

Abstract: This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3DGS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it a new property towards multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation through adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field.

Abstract: Referring MultiObject Tracking (RMOT) is an important topic in the current tracking field. Its task form is to guide the tracker to track objects that match the language description. Current research mainly focuses on referring multi-object tracking under single-view, which refers to a view sequence or multiple unrelated view sequences. However, in the single-view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of the invisible appearances of objects in RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method.

Abstract: In highstakes domains such as healthcare, finance, and law, the need for explainable AI is critical. Traditional methods for generating attribution maps, including white-box approaches relying on gradients and black-box techniques that perturb inputs, face challenges like gradient vanishing, blurred attributions, and computational inefficiencies. To overcome these limitations, we introduce a novel approach that leverages diffusion models within the framework of Information Bottleneck (IB) theory. By utilizing the Gaussian noise from diffusion models, we connect the information bottleneck with the Minimum Mean Squared Error (MMSE) from classical information theory, enabling precise calculation of mutual information. This connection leads to a new loss function that minimizes the Signal-to-Noise Ratio (SNR), facilitating efficient optimization and producing high-resolution, pixel-level attribution maps. Our method achieves greater clarity and accuracy in attributions than existing techniques, requiring significantly fewer pixel values to reach the necessary predictive confidence. This work demonstrates the power of diffusion models in advancing explainable AI, particularly in identifying critical input features with high precision.

Abstract: We present a novel, trainingfree approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive frames of video by identifying common objects and detecting new or missing objects. Specifically, our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of consecutive frames. Furthermore, we focus on the content gap and style gap between two input images in change detection, and address both issues by proposing adaptive content threshold and style bridging layers, respectively. Finally, we extend our approach to video, leveraging rich temporal information to enhance the performance of scene change detection. We compare our approach and baseline through various experiments. While existing train-based baseline tend to specialize only in the trained domain, our method shows consistent performance across various domains, proving the competitiveness of our approach.

Abstract: Video corpus grounding (VCG), which aims to retrieve relevant video moments from a video corpus, has attracted significant attention in the multimedia research community. However, the existing VCG setting primarily focuses on matching textual descriptions with videos and ignores the distinct visual identities in the videos, thus resulting in inaccurate understanding of video content and deteriorated retrieval performances. To address this limitation, we introduce a novel task, IdentityText Video Corpus Grounding (ITVCG), which simultaneously utilize textual descriptions and visual identities as queries. As such, ITVCG benefits in enabling more accurate video corpus grounding with visual identities, as well as providing users with more flexible options to locate relevant frames based on either textual descriptions or textual descriptions and visual identities. To conduct evaluations regarding the novel ITVCG task, we propose the TVR-IT dataset, comprising 463 identity images from 6 TV shows, with 68,840 out of 72,840 queries containing at least one identity image. Furthermore, we propose Video-Locator, the first model designed for the ITVCG task. Our proposed Video-Locator integrates video-identity-text alignment and multi-modal fine-grained fusion components, facilitating a video large language model (Video LLM) to jointly understand textual descriptions, visual identities, as well as videos. Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG.

Abstract: Neural implicit representations have shown remarkable abilities in jointly modeling geometry, color, and camera poses in simultaneous localization and mapping (SLAM). Current methods use coordinates, positional encodings, or other geometry features as input to query neural implicit functions for signed distances and color which produce rendering errors to drive the optimization in overfitting image observations. However, due to the run time efficiency requirement in SLAM systems, we are merely allowed to conduct optimization on each frame in few iterations, which is far from enough for neural networks to overfit these queries. The underfitting usually results in severe drifts in camera tracking and artifacts in reconstruction. To resolve this issue, we propose query quantized neural SLAM which uses quantized queries to reduce variations of input for much easier and faster overfitting a frame. To this end, we quantize a query into a discrete representation with a set of codes, and only allow neural networks to observe a finite number of variations. This allows neural networks to become increasingly familiar with these codes after overfitting more and more previous frames. Moreover, we also introduce novel initialization, losses, and argumentation to stabilize the optimization with significant uncertainty in the early optimization stage, constrain the optimization space, and estimate camera poses more accurately. We justify the effectiveness of each design and report visual and numerical comparisons on widely used benchmarks to show our superiority over the latest methods in both reconstruction and camera tracking.

MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, National University of Singapore, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: In this paper, we tackle the task of blurry video superresolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is 2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net.

Abstract: Image rescaling (IR) seeks to determine the optimal lowresolution (LR) representation of a high-resolution (HR) image to reconstruct a high-quality super-resolution (SR) image. Typically, HR images with resolutions exceeding 2K possess rich information that is unevenly distributed across the image. Traditional image rescaling methods often fall short because they focus solely on the overall scaling rate, ignoring the varying amounts of information in different parts of the image. To address this limitation, we propose a Block-Based Multi-Scale Image Rescaling Framework (BBMR), tailored for IR tasks involving HR images of 2K resolution and higher. BBMR consists of two main components: the Downscaling Module and the Upscaling Module. In the Downscaling Module, the HR image is segmented into sub-blocks of equal size, with each sub-block receiving a dynamically allocated scaling rate while maintaining a constant overall scaling rate. For the Upscaling Module, we introduce the Joint Super-Resolution method (JointSR), which performs SR on these sub-blocks with varying scaling rates and effectively eliminates blocking artifacts. Experimental results demonstrate that BBMR significantly enhances the SR image quality in the of 2K and 4K test dataset compared to initial network image rescaling methods.

Abstract: Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to realworld scenarios. Thus, developing a new video segmentation dataset aimed at tracking multigranularity segmentation target in the video scene is necessary. In this work, we aim to generate multi-granularity video segmentation dataset that is annotated for both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present memory-based mask propagation model (MMPM), trained and evaluated on MUG-VOS dataset, which leads to the best performance among the existing video object segmentation methods and Segment SAM-based video segmentation methods.

Abstract: Existing virtual tryon (VTON) methods provide only limited user control over garment attributes and generally overlook essential factors such as face, pose, and scene context. To address these limitations, we introduce the virtual dressing (VD) task, which aims to synthesize freely editable human images conditioned on fixed garments and optional user-defined inputs. We further propose a comprehensive affinity metric index (CAMI) to quantify the consistency between generated outputs and reference garments. We present IMAGDressing-v1, which leverages a garment-specific U-Net to integrate semantic features from CLIP and texture features from a VAE. To incorporate these garment features into a frozen denoising U-Net for flexible text-driven scene control, we employ a hybrid attention mechanism composed of frozen self-attention and trainable cross-attention layers. IMAGDressing-v1 seamlessly integrates with extension modules, such as ControlNet and IP-Adapter, enabling enhanced diversity and controllability. To alleviate data constraints, we introduce the Interactive Garment Pairing (IGPair) dataset, comprising over 300,000 garment–image pairs and a standardized data assembly pipeline. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in controlled human image synthesis. The code and model will be available at https://github.com/muzishen/IMAGDressing.

Abstract: Tactile sensation plays a crucial role in the development of multimodal large models and embodied intelligence. To collect tactile data with minimal cost as possible, a series of studies have attempted to generate tactile images by vision-to-touch image translation. However, compared to text modality, visual modality-driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail from two granularities: object-level (tactile texture, tactile shape), and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build the text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the dual grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method.

Abstract: With the beneft of explicit objectoriented reasoning capabilities of scene graphs, scene graph-to-image generation has made remarkable advancements in comprehending object coherence and interactive relations. Recent state-of-the-arts typically predict the scene layouts as an intermediate representation of a scene graph before synthesizing the image. Nevertheless, transforming a scene graph into an exact layout may restrict its representation capabilities, leading to discrepancies in interactive relationships (such as standing on, wearing, or covering) between the generated image and the input scene graph. In this paper, we propose a Scene Graph-Grounded Image Generation (SGG-IG) method to mitigate the above issues. Specifcally, to enhance the scene graph representation, we design a masked auto-encoder module and a relation embedding learning module to integrate structural knowledge and contextual information of the scene graph with a mask self-supervised manner. Subsequently, to bridge the scene graph with visual content, we introduce a spatial constraint and image-scene alignment constraint to capture the fne-grained visual correlation between the scene graph symbol representation and the corresponding image representation, thereby generating semantically consistent and high-quality images. Extensive experiments demonstrate the effectiveness of the method both quantitatively and qualitatively.

Abstract: Model counting is the task of counting the number of satisfying assignments of a Boolean formula. Since counting is intractable in general, most applications use (ε, δ)approximations, where the output is within a (1+ε)-factor of the count with probability at least 1-δ. Many demanding applications make thousands of counting queries, and the state-of-the-art approximate counter, ApproxMC, makes hundreds of calls to SAT solvers to answer a single approximate counting query. The sheer number of SAT calls, poses a significant challenge to the existing approaches. In this work, we propose an approximation scheme, ApproxMC7, that is tailored to such demanding applications with low time limits. Compared to ApproxMC, ApproxMC7 makes 14× fewer SAT calls while providing the same guarantees as ApproxMC in the constant-factor regime. In an evaluation over 2,247 instances, ApproxMC7 solved 271 more and achieved a 2× speedup against ApproxMC.

Abstract: This study proposes a new change detection method that leverages hubness. Hubness is a phenomenon that occurs in highdimensional spaces, where certain special data points, known as hub data, tend to be closer to other data points. Hubness is known to degrade the accuracy of methods based on nearest neighbor search. Therefore, many studies in the past have focused on reducing hubness to improve accuracy. In contrast, this study utilizes hubness to detect changes. Specifically, if there is no change, suppressing the hubness occurring in the two datasets obtained by dividing the time series data will result in a uniform data distribution. However, if there is a change, even if we try to reduce the hubness in the two datasets obtained by dividing the time series data before and after the change, the hubness will not be reduced, and the data distribution will not become uniform. We use this finding to detect changes. Experiments with synthetic data show that the proposed method achieves accuracy comparable to or exceeding that of existing methods. Additionally, the proposed method achieves good accuracy with real-world data from hydraulic systems and gas sensors, along with excellent runtime performance.

Abstract: In Tennenholtz’s program equilibrium, players of a game submit programs to play on their behalf. Each program receives the other programs’ source code and outputs an action. This can model interactions involving AI agents, mutually transparent institutions, or commitments. Tennenholtz 2004 (https://doi.org/10.1016/j.geb.2004.02.002) proves a folk theorem for program games, but the equilibria constructed are very brittle. We therefore consider simulationbased programs – i.e., programs that work by running opponents’ programs. These are relatively robust (in particular, two programs that act the same are treated the same) and are more practical than proof-based approaches. Oesterheld’s (2019, https://doi.org/10.1007/s11238-018-9679-3) epsilon-Grounded-pi-Bot is such an approach. Unfortunately, it is not generally applicable to games of three or more players, and only allows for a limited range of equilibria in two player games. In this paper, we propose a generalisation to Oesterheld’s (2019) epsilon-Grounded-pi-Bot. We prove a folk theorem for our programs in a setting with access to a shared source of randomness. We then characterise their equilibria in a setting without shared randomness. Both with and without shared randomness, we achieve a much wider range of equilibria than Oesterheld’s (2019) epsilon-Grounded-pi-Bot. Finally, we explore the limits of simulation-based program equilibrium, showing that the Tennenholtz folk theorem cannot be attained by simulation-based programs without access to shared randomness.

Abstract: We study fair mechanisms for the classic job scheduling problem on unrelated machines with the objective of minimizing the makespan. This problem is equivalent to minimizing the egalitarian social cost in the fair division of chores. The two prevalent fairness notions in the fair division literature are envyfreeness and proportionality. Prior work has established that no envy-free mechanism can provide better than an Ω(log m / log log m)-approximation to the optimal makespan, where m is the number of machines, even when payments to the machines are allowed. In strong contrast to this impossibility, our main result demonstrates that there exists a proportional mechanism (with payments) that achieves a 3/2-approximation to the optimal makespan, and this ratio is tight. To prove this result, we provide a full characterization of allocation functions that can be made proportional with payments. Furthermore, we show that for instances with normalized costs, there exists a proportional mechanism that achieves the optimal makespan. We conclude with important directions for future research concerning other fairness notions, including relaxations of envy-freeness. Notably, we show that the technique leading to the impossibility result for envy-freeness does not extend to its relaxations.

Abstract: Common knowledge/belief in rationality is the traditional standard assumption in analysing interaction among agents. This paper proposes a graphbased language for capturing significantly more complicated structures of higher-order beliefs that agents might have about the rationality of the other agents. The two main contributions are a solution concept that captures the reasoning process based on a given belief structure and an efficient algorithm for compressing any belief structure into a unique minimal form.

Abstract: HumanAI collaboration has the potential to transform various domains by leveraging the complementary strengths of human experts and Artificial Intelligence (AI) systems. However, unobserved confounding can undermine the effectiveness of this collaboration, leading to biased and unreliable outcomes. In this paper, we propose a novel solution to address unobserved confounding in human-AI collaboration by employing the marginal sensitivity model (MSM). Our approach combines domain expertise with AI-driven statistical modeling to account for potential confounders that may otherwise remain hidden. We present a deferral collaboration framework for incorporating the MSM into policy learning from observational data, enabling the system to control for the influence of unobserved confounding factors. In addition, we propose a personalized deferral collaboration system to leverage the diverse expertise of different human decision-makers. By adjusting for potential biases, our proposed solution enhances the robustness and reliability of collaborative outcomes. The empirical and theoretical analyses demonstrate the efficacy of our approach in mitigating unobserved confounding and improving the overall performance of human-AI collaborations.

Abstract: List Update is a fundamental problem in online algorithms, with a wellknown 2-competitive algorithm that moves every requested element to the front. Randomization can slightly improve the competitive ratio to 1.6, but not beyond 1.5. However, practical inputs are not adversarial and one hopes to do better, particularly when additional information from a machine learning oracle is available. With access to predictions, the goal is to incur only a slight overhead compared to the prediction's accuracy, avoiding significant costs in case of substantial deviation. We propose a (1+epsilon)-smooth randomized algorithm, offering robustness of O(1/epsilon^4). This guarantees that the algorithm never exceeds a cost greater than 1+epsilon times the prediction cost, while maintaining a bound within O(1/epsilon^4) of the optimal cost for every possible sequence. In cases where no paid swaps are permitted for the prediction, we can improve robustness to O(1/epsilon^2) while retaining 1+epsilon smoothness. We complement these findings by demonstrating a lower bound of 1/epsilon on the robustness for deterministic algorithms and log(1/epsilon) for randomized ones. Finally, the experiments we have made show that our algorithms perform better than the standard competitive algorithms for this problem

Abstract: Considering the ubiquitous phenomenon of missing views in multiview data, incomplete multi-view learning is a crucial task in many applications. Existing methods usually follow an impute-then-predict strategy for handling this problem. However, they often assume that the view-missing patterns are uniformly random in multi-view data, which does not agree with real-world scenarios. In practice, view-missing patterns often vary across different classes. For example, in the medical field, patients with rare diseases would take more examinations than those with common diseases; in the financial field, high-risk customers tend to receive evaluations from more views than ordinary ones. Hence, we often observe that data-rich classes suffer limited views while data-poor classes suffer limited samples. Previous methods would typically fail due to such biased view-missing patterns. This motivates us to delve into a new biased incomplete multi-view learning problem. To this end, we develop a Reliable Incomplete Multi-view Learning (RIML) method. RIML is a simple yet effective learning-free imputation framework that goes beyond the conventional approaches by considering information from all classes, rather than just relying on individual views or within-class samples. Specifically, we utilize an inter-class association matrix that allows data-poor classes to refer the knowledge from data-rich classes. This enables the construction of more reliable view-specific distributions, from which we perform multiple samplings to recover missing views. Additionally, to obtain a reliable multi-view representation for downstream tasks, we develop an enhanced focal loss with a category-aware marginal term to learn a more distinguishable feature space. Experiments on five multi-view datasets demonstrate that RIML significantly outperforms existing methods in both accuracy and robustness.

Abstract: Sequential problems are ubiquitous in AI, such as in reinforcement learning or natural language processing. Stateof-the-art deep sequential models, like transformers, excel in these settings but fail to guarantee the satisfaction of constraints necessary for trustworthy deployment. In contrast, neurosymbolic AI (NeSy) provides a sound formalism to enforce constraints in deep probabilistic models but scales exponentially on sequential problems. To overcome these limitations, we introduce relational neurosymbolic Markov models (NeSy-MMs), a new class of end-to-end differentiable sequential models that integrate and provably satisfy relational logical constraints. We propose a strategy for inference and learning that scales on sequential settings, and that combines approximate Bayesian inference, automated reasoning, and gradient estimation. Our experiments show that NeSy-MMs can solve problems beyond the current state-of-the-art in neurosymbolic AI and still provide strong guarantees with respect to desired properties. Moreover, we show that our models are more interpretable and that constraints can be adapted at test time to out-of-distribution scenarios.

School of Automation, Southeast University, China Key Laboratory of Measurement and Control of CSE, Ministry of Education, China, School of Automation, Southeast University, China Key Laboratory of Measurement and Control of CSE, Ministry of Education, China, NLPR, MAIS, CASIA, China, School of Automation, Southeast University, China, School of Automation, Southeast University, China, School of Electronic and Information Engineering, Suzhou University of Science and Technology, China, School of Electronic and Information Engineering, Suzhou University of Science and Technology, China Suzhou Key Laboratory of Intelligent Low Carbon Technology Application, China Jiangsu Industrial Intelligent Low Carbon Technology Engineering Center, China, School of Automation, Southeast University, China

Abstract: Multilabel class-incremental learning (MLCIL) is essential for real-world multi-label applications, allowing models to learn new labels while retaining previously learned knowledge continuously. However, recent MLCIL approaches can only achieve suboptimal performance due to the oversight of the positive-negative imbalance problem, which manifests at both the label and loss levels because of the task-level partial label issue. The imbalance at the label level arises from the substantial absence of negative labels, while the imbalance at the loss level stems from the asymmetric contributions of the positive and negative loss parts to the optimization. To address the issue above, we propose a Rebalance framework for both the Loss and Label levels (RebLL), which integrates two key modules: asymmetric knowledge distillation (AKD) and online relabeling (OR). AKD is proposed to rebalance at the loss level by emphasizing the negative label learning in classification loss and down-weighting the contribution of overconfident predictions in distillation loss. OR is designed for label rebalance, which restores the original class distribution in memory by online relabeling the missing classes. Our comprehensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate that this rebalancing strategy significantly improves performance, achieving new state-of-the-art results even with a vanilla CNN backbone.

College of Computer and Data Science, Fuzhou University, Fuzhou, China Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China, College of Computer and Data Science, Fuzhou University, Fuzhou, China Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China, College of Computer and Data Science, Fuzhou University, Fuzhou, China Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology, Jinan, China, College of Computer and Data Science, Fuzhou University, Fuzhou, China Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China, College of Computer and Data Science, Fuzhou University, Fuzhou, China Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou, China

Abstract: Multiview learning methods leverage multiple data sources to enhance perception by mining correlations across views, typically relying on predefined categories. However, deploying these models in real-world scenarios presents two primary openness challenges. 1) Lack of Interpretability: The integration mechanisms of multi-view data in existing black-box models remain poorly explained; 2) Insufficient Generalization: Most models are not adapted to multi-view scenarios involving unknown categories. To address these challenges, we propose OpenViewer, an openness-aware multi-view learning framework with theoretical support. This framework begins with a Pseudo-Unknown Sample Generation Mechanism to efficiently simulate open multi-view environments and previously adapt to potential unknown samples. Subsequently, we introduce an Expression-Enhanced Deep Unfolding Network to intuitively promote interpretability by systematically constructing functional prior-mapping modules and effectively providing a more transparent integration mechanism for multi-view data. Additionally, we establish a Perception-Augmented Open-Set Training Regime to significantly enhance generalization by precisely boosting confidences for known categories and carefully suppressing inappropriate confidences for unknown ones. Experimental results demonstrate that OpenViewer effectively addresses openness challenges while ensuring recognition performance for both known and unknown samples.

School of Software, Shandong University, Jinan 250101, China, School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China, School of Software, Shandong University, Jinan 250101, China, School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Co., Ltd, Jinan, China, School of Software, Shandong University, Jinan 250101, China

Abstract: Online crossmodal hashing has gained increasing interest due to its ability to encode streaming data and update hash functions simultaneously. Existing online methods often assume either fully supervised or completely unsupervised settings. However, they overlook the prevalent and challenging scenario of semi-supervised cross-modal streaming data, where diverse data types, including labeled/unlabeled, paired/unpaired, and multi-modal, are intertwined. To address this issue, we propose Semi-Supervised Online Cross-modal Hashing (SSOCH). It presents an alignment-free pseudo-labeling strategy that extracts semantic information from unlabeled streaming data without relying on pairing relations. Furthermore, we design an online tri-consistent preserving scheme, integrating pseudo-labeled data regularization, discriminative label embedding, and fine-grained similarity preservation. This scheme fully explores consistency across data annotation, modalities, and streaming chunks, improving the model's adaptiveness in these challenging scenarios. Extensive experiments on benchmark datasets demonstrate the superiority of SSOCH under various scenarios, highlighting the importance of semi-supervised learning for online cross-modal hashing.

Abstract: We introduce a biologically plausible RL framework for solving tasks in partially observable Markov decision processes (POMDPs). The proposed algorithm combines three integral parts: (1) A MetaRL architecture, resembling the mammalian basal ganglia; (2) A biologically plausible reinforcement learning algorithm, exploiting temporal difference learning and eligibility traces to train the policy and the value-function; (3) An online automatic differentiation algorithm for computing the gradients with respect to parameters of a shared recurrent network backbone. Our experimental results show that the method is capable of solving a diverse set of partially observable reinforcement learning tasks. The algorithm we call real-time recurrent reinforcement learning (RTRRL) serves as a model of learning in biological neural networks, mimicking reward pathways in the basal ganglia.

Abstract: SHAP scores represent the proposed use of the wellknown Shapley values in eXplainable Artificial Intelligence (XAI). Recent work has shown that the exact computation of SHAP scores can produce unsatisfactory results. Concretely, for some ML models, SHAP scores will mislead with respect to relative feature influence. To address these limitations, recently proposed alternatives exploit different axiomatic aggregations, all of which are defined in terms of abductive explanations. However, the proposed axiomatic aggregations are not Shapley values. This paper investigates how SHAP scores can be modified so as to extend axiomatic aggregations to the case of Shapley values in XAI. More importantly, the proposed new definition of SHAP scores avoids all the known cases where unsatisfactory results have been identified. The paper also characterizes the complexity of computing the novel definition of SHAP scores, highlighting families of classifiers for which computing these scores is tractable. Furthermore, the paper proposes modifications to the existing implementations of SHAP scores. These modifications eliminate some of the known limitations of SHAP scores, and have negligible impact in terms of performance.

School of Cyberspace Security, Hainan University, Haikou, China, School of Computer Science and Technology, Hainan University, Haikou, China Hainan Blockchain Technology Engineering Research Center, Haikou, China, School of Computer Science and Technology, Hainan University, Haikou, China, School of Computer Science and Technology, Hainan University, Haikou, China Hainan Blockchain Technology Engineering Research Center, Haikou, China, School of Computer Science and Technology, Hainan University, Haikou, China, School of Computer, National University of Defense Technology, Changsha, China

Abstract: Federated graph learning (FGL), which excels in analyzing nonIID graphs as well as protecting data privacy, has recently emerged as a hot topic. Existing FGL methods usually train the client model using labeled data and then collaboratively learn a global model without sharing their local graph data. However, in real-world scenarios, the lack of data annotations impedes the negotiation of multi-source information at the server, leading to sub-optimal feedback to the clients. To address this issue, we propose a novel unsupervised learning framework called Federated Graph-level Clustering Network (FedGCN), which collects the topology-oriented features of non-IID graphs from clients to generate global consensus representations through multi-source clustering structure sharing. Specifically, in the client, we first preserve the prototype features of each cluster from the structure-oriented embedding through clustering and then upload the learned multiple prototypes that are hard to be reconstructed into the raw graph data. In the server, we generate consensus prototypes from multiple condensed structure-oriented signals through Gaussian estimation, which are subsequently transferred to each client to promote the great encoding capacity of the local model for better clustering. Extensive experiments across multiple non-IID graph datasets have demonstrated the effectiveness and superiority of FedGCN against its competitors.

Abstract: Designing expressive generative models that support exact and efficient inference is a core question in probabilistic ML. Probabilistic circuits (PCs) offer a framework where this tractabilityvs-expressiveness trade-off can be analyzed theoretically. Recently, squared PCs encoding subtractive mixtures via negative parameters have emerged as tractable models that can be exponentially more expressive than monotonic PCs, i.e., PCs with positive parameters only. In this paper we provide a more precise theoretical characterization of the expressiveness relationships among these models. First, we prove that squared PCs can be less expressive than monotonic ones. Second, we formalize a novel class of PCs – sum of squares PCs – that can be exponentially more expressive than both squared and monotonic PCs. Around sum of squares PCs, we build an expressiveness hierarchy that allows us to precisely unify and separate different tractable model classes such as Born Machines and PSD models, and other recently introduced tractable probabilistic models by using complex parameters. Finally, we empirically show the effectiveness of sum of squares circuits in performing distribution estimation.

Abstract: Label polysemy, where an instance can be associated with multiple labels, is common in realworld tasks. LDL (label distribution learning) is an effective learning paradigm for handling label polysemy, where each instance is associated with a label distribution. Although numerous LDL algorithms have been proposed and achieved satisfactory performance on most existing datasets, they are typically trained directly on the collected label distributions which often lack quality guarantees in real-world tasks due to annotator subjectivity and algorithm assumptions. Consequently, direct learning from such uncertain label distributions can lead to unpredictable generalization performance. To address this problem, we propose an adaptive-grained label distribution learning framework whose main idea is to extract relatively reliable supervision information from unreliable label distributions, and thus the label distribution learning task can be decomposed into three subtasks: coarsening label distributions, learning coarse-grained labels and refining coarse-grained labels. In this framework, we design an adaptive label coarsening algorithm to extract an optimal coarsen-grained labels and a label refining function to enhance the coarse-grained label into the final label distributions. Finally, we conduct extensive experiments on real-world datasets to demonstrate the advantages of our proposal.

Abstract: Existing causal learning algorithms focus on microlevel causal discovery, confronting significant challenges in identifying the influence of macro systems, composed of micro-level variables, on other variables. This difficulty arises because the causal relationships in macro systems are often mediated through micro-level causal interactions, which can lead to erroneous causal discovery or omission when dispersed. To address this issue, we propose the Emergence-inspired Multi-granularity Causal learning (EMCausal) method. Inspired by the emerging phenomena of aggregating micro level variables into macro level representations, EMCausal introduces a progressive mapping encoder to simulate the process, thus capturing the causal relationships driven by these macro entities. Next, it introduces a causal consistency constraint to collaboratively reconstruct micro variables using macro-level representations, enabling the learning of a multi-granular causal structure. Experimental results on both synthetic and real datasets demonstrate that EMCausal can identify causal graphs under the influence of causal emergence, outperforming competitive baselines in term of accuracy and robustness.

Abstract: This study challenges strictly guaranteeing ``dissipativity'' of a dynamical system represented by neural networks learned from given timeseries data. Dissipativity is a crucial indicator for dynamical systems that generalizes stability and input-output stability, known to be valid across various systems including robotics, biological systems, and molecular dynamics. By analytically proving the general solution to the nonlinear Kalman–Yakubovich–Popov (KYP) lemma, which is the necessary and sufficient condition for dissipativity, we propose a differentiable projection that transforms any dynamics represented by neural networks into dissipative ones and a learning method for the transformed dynamics. Utilizing the generality of dissipativity, our method strictly guarantee stability, input-output stability, and energy conservation of trained dynamical systems. Finally, we demonstrate the robustness of our method against out-of-domain input through applications to robotic arms and fluid dynamics.

Abstract: Proxybased metric learning has enhanced semantic similarity with class representatives and exhibited noteworthy performance in deep metric learning (DML) tasks. While these methods alleviate computational demands by learning instance-to-class relationships rather than instance-to-instance relationships, they often limit features to be class-specific, thereby degrading generalization performance for unseen class. In this paper, we introduce a novel perspective called Disentangled Deep Metric Learning (DDML), grounded in the framework of information bottleneck, which applies class-agnostic regularization to existing DML methods. Unlike conventional NormSoftmax methods, which primarily emphasize distinct class-specific features, our DDML enables a diverse feature representation by seamlessly transitioning between class-specific features with the aid of class-agnostic features. It smooths decision boundaries, allowing unseen classes to have stable semantic representations in the embedding space. To achieve this, we learn disentangled representations of both class-specific and class-agnostic features in the context of DML. Empirical results demonstrate that our method addresses the limitations of conventional approaches. Our method easily integrates into existing proxy-based algorithms, consistently delivering improved performance.

Abstract: Metalearning, or "learning to learn," aims to enable models to quickly adapt to new tasks with minimal data. While traditional methods like Model-Agnostic Meta-Learning (MAML) optimize parameters in Euclidean space, they often struggle to capture complex learning dynamics, particularly in few-shot learning scenarios. To address this limitation, we propose Stiefel-MAML, which integrates Riemannian geometry by optimizing within the Stiefel manifold, a space that naturally enforces orthogonality constraints. By leveraging the geometric structure of the Stiefel manifold, we improve parameter expressiveness and enable more efficient optimization through Riemannian gradient calculations and retraction operations. We also introduce a novel kernel-based loss function defined on the Stiefel manifold, further enhancing the model’s ability to explore the parameter space. Experimental results on benchmark datasets—including Omniglot, Mini-ImageNet, FC-100, and CUB—demonstrate that Stiefel-MAML consistently outperforms traditional MAML, achieving superior performance across various few-shot learning tasks. Our findings highlight the potential of Riemannian geometry to enhance meta-learning, paving the way for future research on optimizing over different geometric structures.

Abstract: Counterfactual data augmentation (CDA) is a method for controlling information or biases in training datasets by generating a complementary dataset with typically opposing biases. Prior work often either relies on handcrafted rules or algorithmic CDA methods which can leave unwanted information in the augmented dataset. In this work, we show iterative CDA (ICDA) with initial, high-noise interventions can converge to a state with significantly lower noise. Our ICDA procedure produces a dataset where one target signal in the training dataset maintains high mutual information with a corresponding label and the information of spurious signals are reduced. We show training on the augmented datasets produces rationales on documents that better align with human annotation. Our experiments include six human produced datasets and two large-language model generated datasets.

Abstract: Federated Survival Analysis (FSA) is an emerging Federated Learning (FL) paradigm that enables training survival models on decentralized data while preserving privacy. However, existing FSA approaches largely overlook the potential risk of bias in predictions arising from demographic and censoring disparities across clients' datasets, which impacts the fairness and performance of federated survival models, especially for underrepresented groups. To address this gap, we introduce FairFSA, a novel FSA framework that adapts existing fair survival models to the federated setting. FairFSA jointly trains survival models using distributionally robust optimization, penalizing worstcase errors across subpopulations that exceed a specified probability threshold. Partially observed survival outcomes in clients are reconstructed with federated pseudo values (FPV) before model training to address censoring. Furthermore, we design a weight aggregation strategy by enhancing the FedAvg algorithm with a fairness-aware concordance index-based aggregation method to foster equitable performance distribution across clients. To the best of our knowledge, this is the first work to study and integrate fairness into Federated Survival Analysis. Comprehensive experiments on distributed non-IID datasets demonstrate FairFSA's superiority in fairness and accuracy over state-of-the-art FSA methods, establishing it as a robust FSA approach capable of handling censoring while providing equitable and accurate survival predictions for all subjects.

College of Computer Science, Sichuan University, Chengdu 610065, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China, College of Computer Science, Sichuan University, Chengdu 610065, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China, Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China, Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, NJ 07030, USA, College of Computer Science, Sichuan University, Chengdu 610065, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China

Abstract: Previous multiview contrastive learning methods typically operate at two scales: instance-level and cluster-level. The former generally constructs positive and negative pairs based on the correspondence between samples and view instances. These methods aim to bring positive pairs closer and push negative pairs further apart in the latent space. This kind of approaches has the drawback of inevitably introducing false negatives in an unsupervised setting, leading to reduced model discriminability. The latter usually involves calculating cluster assignments for samples under each view and maximizing view consensus by reducing distribution discrepancies through methods like optimizing the KL divergence between different view distributions or maximizing mutual information. However, clusters represent a macro structure that overlooks the local structure within the sample set, and the relationships between clusters across different views cannot be explicitly measured. To overcome the shortcomings of these two types of methods, we propose a method named Multi-view Granular-ball Contrastive Clustering (MGBCC). This method segments the sample set into coarse-grained granular balls, and establishes associations between intra-view and cross-view granular balls. These associations are reinforced in a shared latent space, thereby achieving multi-granularity contrastive learning. Granular balls lie between instances and clusters, naturally preserving the local topological structure of the sample set. We conduct extensive experiments to validate the effectiveness of the proposed method.

Abstract: As machine learning model modification techniques are extensively employed to obtain wellperforming models at reduced costs, several studies have emerged to determine the presence of a modification relationship (i.e., lineage) between models. However, these methods are not robust to high-impact modification techniques and none of them have addressed the measurement of lineage closeness, which quantifies the degrees of modification. In this work, we visualize the changes in model decision boundaries resulting from different modification techniques and conclude that differences in decision boundaries serve as a precise metric of lineage closeness. Building upon this insight, we propose a modification-type agnostic and task-agnostic method to measure model lineage closeness by calculating mean adversarial distances from data points to decision boundaries and matching rate of data points, with data points selected through an efficient sampling method to reduce computational overhead. Moreover, we propose a novel indirect measurement approach to support lineage closeness measurement for models with different tasks. Finally, comprehensive experiments show that our design achieves an impressive 97% accuracy in lineage determination, and can precisely measure model lineage closeness for different modifications.

Abstract: Datafree knowledge distillation aims to learn a compact student network from a pre-trained large teacher network without using the original training data of the teacher network. Existing collection-based and generation-based methods train student networks by collecting massive real examples and generating synthetic examples, respectively. However, they inevitably become weak in practical scenarios due to the difficulties in gathering or emulating sufficient real-world data. To solve this problem, we propose a novel method called Hybrid Data-Free Distillation (HiDFD), which leverages only a small amount of collected data as well as generates sufficient examples for training student networks. Our HiDFD comprises two primary modules, i.e., the teacher-guided generation and student distillation. The teacher-guided generation module guides a Generative Adversarial Network (GAN) by the teacher network to produce high-quality synthetic examples from very few real-world collected examples. Specifically, we design a feature integration mechanism to prevent the GAN from overfitting and facilitate the reliable representation learning from the teacher network. Meanwhile, we drive a category frequency smoothing technique via the teacher network to balance the generative training of each category. In the student distillation module, we explore a data inflation strategy to properly utilize a blend of real and synthetic data to train the student network via a classifier-sharing-based feature alignment technique. Intensive experiments across multiple benchmarks demonstrate that our HiDFD can achieve state-of-the-art performance using 120 times less collected data than existing methods.

Abstract: Federated bilevel optimization (FBO) has garnered significant attention lately, driven by its promising applications in metalearning and hyperparameter optimization. Existing algorithms generally aim to approximate the gradient of the upper-level objective function (hypergradient) in the federated setting. However, because of the nonlinearity of the hypergradient and client drift, they often involve complicated computations. These computations, like multiple optimization sub-loops and second-order derivative evaluations, end up with significant memory consumption and high computational costs. In this paper, we propose a computationally and memory-efficient FBO algorithm named MemFBO. MemFBO features a fully single-loop structure with all involved variables updated simultaneously, and uses only first-order gradient information for all local updates. We show that MemFBO exhibits a linear convergence speedup with milder assumptions in both partial and full client participation scenarios. We further implement MemFBO in a novel FBO application for federated data cleaning. Our experiments, conducted on this application and federated hyper-representation, demonstrate the effectiveness of the proposed algorithm.

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

Abstract: Multiinstance partial-label learning (MIPL) is a paradigm where each training example is encapsulated as a multi-instance bag associated with the candidate label set, which includes one true label and several false positives. Current MIPL algorithms typically assume that all instances are independent, thereby neglecting the dependencies and heterogeneity inherent in MIPL data. Moreover, these algorithms often prove to be excessively time-consuming when dealing with complex datasets, significantly limiting the practical application of MIPL. In this paper, we propose FastMIPL, a framework that employs mixed-effects model to explicitly capture the dependencies and heterogeneity among instances and bags. FastMIPL is able to learn from MIPL data both effectively and efficiently by utilizing the predefined dependencies modeling module and leveraging the posterior predictive probability disambiguation strategy. Experiments show that the performance of FastMIPL is highly competitive to state-of-the-art methods, while significantly reducing computational time in benchmark and the real-world datasets.

Abstract: Machine unlearning without access to real data distribution is challenging. The existing method based on datafree distillation achieved unlearning by filtering out synthetic samples containing forgetting information but struggled to distill the retaining-related knowledge efficiently. In this work, we analyze that such a problem is due to over-filtering, which reduces the synthesized retaining-related information. We propose a novel method, Inhibited Synthetic PostFilter (ISPF), to tackle this challenge from two perspectives: First, the Inhibited Synthetic, by reducing the synthesized forgetting information; Second, the PostFilter, by fully utilizing the retaining-related information in synthesized samples. Experimental results demonstrate that the proposed ISPF effectively tackles the challenge and outperforms existing methods.

Abstract: Output uncertainty indicates whether the probabilistic properties of the overall distribution reflect objective characteristics of the model output. Unlike most loss functions and metrics in machine learning, uncertainty pertains to individual samples, but validating it on individual samples is unfeasible. When validated collectively, it cannot fully represent individual sample properties, posing a challenge in assessing and calibrating model confidence in a limited data set. Hence, it is crucial to consider confidence calibration characteristics. To counter the adverse effects of the gradual amplification of the classifier output amplitude in supervised learning, we introduce a postprocessing parametric calibration method, ρ-Norm Scaling, which expands the calibrator expression and mitigates overconfidence due to excessive amplitude while preserving accuracy. Moreover, calibrator optimization based bin-level calibration error often results in the loss of significant instance-level information. Therefore, we include probability distribution regularization, which incorporates a priori information that the instance-level uncertainty distribution after calibration should resemble the distribution before calibration. Experimental results demonstrate the substantial enhancement in the post-processing calibrator for uncertainty calibration with our proposed method.

Abstract: Causal inconsistency arises when the underlying causal graphs captured by generative models like Normalizing Flows are inconsistent with those specified in causal models like Struct Causal Models. This inconsistency can cause unwanted issues including unfairness. Prior works to achieve causal consistency inevitably compromise the expressiveness of their models by disallowing hidden layers. In this work, we introduce a new approach: Causally Consistent Normalizing Flow (CCNF). To the best of our knowledge, CCNF is the first causally consistent generative model that can approximate any distribution with multiple layers. CCNF relies on two novel constructs: a sequential representation of SCMs and partial causal transformations. These constructs allow CCNF to inherently maintain causal consistency without sacrificing expressiveness. CCNF can handle all forms of causal inference tasks, including interventions and counterfactuals. Through experiments, we show that CCNF outperforms current approaches in causal inference. We also empirically validate the practical utility of CCNF by applying it to realworld datasets and show how CCNF addresses challenges like unfairness effectively.

Abstract: Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could both yield better usage of existing models as well as to make architectural improvements to produce new variants. We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.

Abstract: Scheduling is adopted in various domains to assign jobs to resources, such that an objective is optimized. While schedules enable the analysis of the underlying system, publishing them also incurs a privacy risk. Recently, privacy attacks on schedules have been proposed, which may reveal sensitive information on the jobs by solving an inverse scheduling problem. In this work, we study the protection against such attacks. We formulate the problem of privacyand-utility preservation of schedules, which bounds both, the privacy leakage and the loss in the utility of the schedule due to obfuscation. We address the problem based on a set of perturbation functions for schedules, study their instantiations for standard scheduling problems, and implement privacy-and-utility-aware publishing of a schedule using constraint programming. Experiments with synthetic and real-world schedules demonstrate the feasibility, robustness, and effectiveness of our mechanism.

Abstract: Most image retrieval research prioritizes improving predictive performance, often overlooking situations where the reliability of predictions is equally important. The gap between model performance and reliability requirements highlights the need for a systematic approach to analyze and address the risks associated with image retrieval. Uncertainty quantification technique can be applied to mitigate this issue by assessing uncertainty for retrieval sets, but it provides only a heuristic estimate of uncertainty rather than a guarantee. To address these limitations, we present Risk Controlled Image Retrieval (RCIR), which generates retrieval sets with coverage guarantee, i.e., retrieval sets that are guaranteed to contain the true nearest neighbors with a predefined probability. RCIR can be easily integrated with existing uncertaintyaware image retrieval systems, agnostic to data distribution and model selection. To the best of our knowledge, this is the first work that provides coverage guarantees to image retrieval. The validity and efficiency of RCIR are demonstrated on four real-world datasets: CAR-196, CUB-200, Pittsburgh, and ChestX-Det.

Abstract: Common methods for aligning alreadycapable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

Abstract: We introduce for the first time a neuralcertificate framework for continuous-time stochastic dynamical systems. Autonomous learning systems in the physical world demand continuous-time reasoning, yet existing learnable certificates for probabilistic verification assume discretization of the time continuum. Inspired by the success of training neural Lyapunov certificates for deterministic continuous-time systems and neural supermartingale certificates for stochastic discrete-time systems, we propose a framework that bridges the gap between continuous-time and probabilistic neural certification for dynamical systems under complex requirements. Our method combines machine learning and symbolic reasoning to produce formally certified bounds on the probabilities that a nonlinear system satisfies specifications of reachability, avoidance, and persistence. We present both the theoretical justification and the algorithmic implementation of our framework and showcase its efficacy on popular benchmarks.

Abstract: It is often of interest to learn a contextsensitive decision policy, such as in contextual multi-armed bandit processes. To quantify the efficiency of a machine learning algorithm for such settings, probably approximately correct (PAC) bounds, which bound the number of samples required, or cumulative regret guarantees, are typically used. However, real-world settings often have limited resources for experimentation, and decisions/interventions may differ in the amount of resources required (e.g., money or time). Therefore, it is of interest to consider how to design an experiment strategy that reduces the experimental budget needed to learn a near-optimal contextual policy. Unlike reinforcement learning or bandit approaches that embed costs into the reward function, we focus on reducing resource use in learning a near-optimal policy without resource constraints. We introduce two resource-aware algorithms for the contextual bandit setting and prove their soundness. Simulations based on real-world datasets demonstrate that our algorithms significantly reduce the resources needed to learn a near-optimal decision policy compared to previous resource-unaware methods.

Authors:Todd W. Neller, Rasika Bhalerao, Eun Kyung Ko, Vishodana Thamotharan, Lisa Zhang, Sonya Allin, Mahdi Haghifam, Michael Pawliuk, Rutwa Engineer, Florian Shkurti, Cunyan Ma, Daniella DiPaola, Cynthia Breazeal, Loreto Alonzi, Brian Wright, Ali Rivera, Kristin Fasiang, Duri Long, Shruthi Chockkalingam, Giulia Toti, Evan Shieh, Princewill Okoroafor, Thema Monroe-White, Mustafa Haiderbhai, Carolyn Quinlan, Ashwin R. Bharadwaj, Anio Zhang, Rajagopal Venkatesaramani, Sarah Wharton, John Masla, Lydia Guterman, Mary Cate Gustafson-Quiett, Christina Bosch, Samar Abu Hegley, Calvin Macatantan, Eric Klopfer, Hal Abelson, Shira Wein, Mercy Wairimu Gachoka, Li-Hsin Chang, Maryam Mirzaei, Mohammad Mahdi Ajallooeian

Gettysburg College, Northeastern University, National Louis University, National Louis University, University of Toronto, York University, Northeastern University, University of Toronto, University of Toronto, University of Toronto, Brown University, Massachusetts Institute of Technology, Massachusetts Institute of Technology, University of Virginia, University of Virginia, University of Virginia, Northwestern University, Northwestern University, University of British Columbi, University of British Columbi, Young Data Scientists League, Cornell University, George Mason University, University of Toronto, University of Toronto, Northeastern University, Northeastern University, Northeastern University, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Amherst College, University of Eastern Finland, University of Turku, NorQuest College, NorQuest College

Abstract: The Model AI Assignments session seeks to gather and disseminate the best assignment designs of the Artificial Intelligence (AI) Education community. Recognizing that assignments form the core of student learning experience, we here present abstracts of thirteen AI assignments from the 2025 session that are easily adoptable, playfully engaging, and flexible for a variety of instructor needs. Assignment specifications and supporting resources may be found at http://modelai.gettysburg.edu

Abstract: The global expansion of Artificial Intelligence (AI) has highlighted significant challenges in inclusivity and representation, particularly for underrepresented communities. Current AI systems often fail to accommodate diverse linguistic and cultural contexts, resulting in biases in name pronunciation, language preservation, and communication. This research proposes a framework for advancing inclusivity in AI through Natural Language Processing (NLP) and Reinforcement Learning (RL). The envisioned system could integrate with home assistants like Siri and Alexa, enabling realtime interactions in local languages while maintaining cultural relevance. Key proposed features include accurate pronunciation of names, conversational capabilities in underrepresented languages, and an interactive platform where users can learn their language, history, and cultural heritage. By leveraging transformer-based models and adaptive RL frameworks, this research aims to explore solutions that bridge the gap in AI inclusivity for low-resource languages and culturally diverse populations.

Abstract: Diffusion Models (DMs) offer robust tools for addressing uncertainty and enhancing adaptability in robotics. This work explores their application to trajectory generation, 3D image synthesis, and interpretable scene understanding. For trajectory planning, we propose using colored Gaussian noise to improve robustness and temporal coherence. In 3D image generation, Transfer Entropy enhances information flow between textual and visual modalities for more coherent outputs. Partial Information Decomposition (PID) is leveraged to improve model interpretability and efficiency in scene generation. Rigorous evaluation will assess trajectory quality, robustness, and realworld transferability, aiming to advance autonomous decision-making and scene understanding in robotics.

Abstract: SelfKnowledge Distillation (SKD) leverages the student's own knowledge to create a virtual teacher for distillation when the pre-trained bulky teacher is not available. Whilst existing SKD approaches demonstrate gorgeous efficiency in single-label learning, to directly apply them to multi-label learning would suffer from dramatic degradation due to the following inherent imbalance: \textit{targets with unified labels but multifarious visual scales are crammed into one image, resulting in biased learning of major targets and disequilibrium of precision-recall}. To address this issue, this paper proposes a novel SKD method for multi-label learning named Multi-label Self-knowledge Distillation (MSKD), incorporating three Spatial Decoupling mechanisms (i.e. Locality-SD (L-SD), Reconstruction-SD (R-SD), and Step-SD (S-SD)). L-SD exploits relational dark knowledge from regional outputs to amplify the model's perception of visual details. R-SD reconstructs global semantics by integrating regional outputs from local patches and leverages it to guide the model. S-SD aligns outputs of the same input at different steps, aiming to find a synthetical optimizing direction and avoid the overconfidence. In addition, MSKD combines our tailored loss named MBD for balanced distillation. Exhaustive experiments demonstrate that MSKD not only outperforms previous approaches but also effectively mitigates biased learning and equips the model with more robustness.

The Hong Kong Polytechnic University Center for Artificial Intelligence and Robotics, HKISI-CAS, ETH Zurich, Center for Artificial Intelligence and Robotics, HKISI-CAS, Center for Artificial Intelligence and Robotics, HKISI-CAS Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, The Hong Kong Polytechnic University, Center for Artificial Intelligence and Robotics, HKISI-CAS, Center for Artificial Intelligence and Robotics, HKISI-CAS Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences

Abstract: Removing reflection from a single image is challenging due to the absence of general reflection priors. Although existing methods incorporate extensive user guidance for satisfactory performance, they often lack the flexibility to adapt user guidance in different modalities, and dense user interactions further limit their practicality. To alleviate these problems, this paper presents FIRM, a novel framework for Flexible Interactive image Reflection reMoval with various forms of guidance, where users can provide sparse visual guidance (e.g., points, boxes, or strokes) or text descriptions for better reflection removal. Firstly, we design a novel user guidance conversion module (UGC) to transform different forms of guidance into unified contrastive masks. The contrastive masks provide explicit cues for identifying reflection and transmission layers in blended images. Secondly, we devise a contrastive maskguided reflection removal network that comprises a newly proposed contrastive guidance interaction block (CGIB). This block leverages a unique cross-attention mechanism that merges contrastive masks with image features, allowing for precise layer separation. The proposed framework requires only 10% of the guidance time needed by previous interactive methods, which makes a step-change in flexibility. Extensive results on public real-world reflection removal datasets validate that our method demonstrates state-of-the-art reflection removal performance.

Abstract: Openvocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with the open-world mask proposal network, encouraged to propose all potential instance class-agnostic masks by the contrastive instance margin loss. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by the training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame’s instance tracking token. The experimental results demonstrate the proposed InstFormer achieve state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieves competitive performance in fully supervised VIS task.

Abstract: In this work, we introduce Modulated Flows (ModFlows), a novel approach for color transfer between images based on rectified flows. The primary goal of the color transfer is to adjust the colors of a target image to match the color distribution of a reference image. Our technique is based on optimal transport and executes color transfer as an invertible transformation within the RGB color space. The ModFlows utilizes the bijective property of flows, enabling us to introduce a common intermediate color distribution and build a dataset of rectified flows. We train an encoder on this dataset to predict the weights of a rectified model for new images. After training on a set of optimal transport plans, our approach can generate plans for new pairs of distributions without additional finetuning. We additionally show that the trained encoder provides an image embedding, associated only with its color style. The presented method is capable of processing 4K images and achieves the state-of-the-art performance in terms of content and style similarity.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China Institute of Artificial Intelligence, Xiamen University, Fujian, China, National University of Singapore, Singapore, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Contemporary Amperex Technology Co., Limited (CATL), Fujian, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

Abstract: Openvocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ two-stage or single-stage framework. The two-stage framework involves cropping the image multiple times using masks generated by a mask generator, followed by feature extraction, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both methods incur substantial computational overhead, thereby hindering the efficiency of model inference. To fill the gap in efficiency, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatialaware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce a Two-way Dynamic Embedding Experts (TDEE), which efficiently utilizes the spatial awareness capabilities of ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework towards efficiency, which runs faster and achieves competitive performance compared with state-of-the-art methods. Specifically, with COCO training only, EOV-Seg achieves 24.5 PQ, 32.1 mIoU, and 11.6 FPS on the ADE20K dataset and the inference time of EOV-Seg is 4-19 times faster than state-of-the-art methods. Especially, equipped with ResNet50 backbone, EOV-Seg runs 23.8 FPS with only 71M parameters on a single RTX 3090 GPU.

Abstract: There are two manifestations of classification fairness. One is the preference for head classes with more instances due to the longtail (LT) distribution of training data. The other is the clever Hans (CH) effect, where non-discriminative features are mistakenly used for classification. In this paper, we find that using category-agnostic zero-valued data can simultaneously reveal both types of unfairness. Based on this, we propose a zero uniformity training (ZUT) framework to optimize classification fairness. The ZUT framework inputs category-agnostic zero-valued data into the model in parallel and uses zero uniformity loss (ZUL) to optimize classification fairness. The ZUL loss mitigates bias towards specific classes by unifying the classification features corresponding to zero-valued data. The ZUT framework is compatible with various classification-based tasks. Experiments show that the ZUT framework can improve the performance of multiple state-of-the-art methods in image classification, person re-identification, and semantic segmentation.

Abstract: Evaluations of largescale recognition methods typically focus on overall performance. While this approach is common, it often fails to provide insights into performance across individual classes, which can lead to fairness issues and misrepresentation. Addressing these gaps is crucial for accurately assessing how well methods handle novel or unseen classes and ensuring a fair evaluation. To address fairness in Open-Set Recognition (OSR), we demonstrate that per-class performance can vary dramatically. We introduce Gaussian Hypothesis Open Set Technique (GHOST), a novel hyperparameter-free algorithm that models deep features using class-wise multivariate Gaussian distributions with diagonal covariance matrices. We apply Z-score normalization to logits to mitigate the impact of feature magnitudes that deviate from the model’s expectations, thereby reducing the likelihood of the network assigning a high score to an unknown sample. We evaluate GHOST across multiple ImageNet-1K pre-trained deep networks and test it with four different unknown datasets. Using standard metrics such as AUOSCR, AUROC and FPR95, we achieve statistically significant improvements, advancing the state-of-the-art in large-scale OSR. Source code is provided online.

Abstract: Human mesh recovery (HMR) is crucial in many computer vision applications; from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an illposed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods.

Abstract: Intentbased grasp generation inherently involves challenges such as manipulation ambiguity and modality gaps. To address these, we propose a novel Retrieval-Augmented Grasp Generation model (RAGG). Our key insight is that when humans manipulate new objects, they initially mimic the interaction patterns observed in similar objects, then progressively adjust hand-object contact. Consequently, we develop RAGG as a two-stage approach, encompassing retrieval-guided generation and structurally stable grasp refinement. In the first stage, we propose a Retrieval-Augmented Diffusion Model (ReDim), which identifies the most relevant interaction instance from a knowledge base to explicitly guide grasp generation, thereby mitigating ambiguity and bridging modality gaps to ensure semantically correct manipulation. In the second stage, we introduce a Progressive Refinement Network (PRN) with Kolmogorov-Arnold Network (KAN) layers to refine the generated coarse grasp, employing a Structural Similarity Index loss to constrain the spatial relationship between the hand and the object, thus ensuring the stability of the grasp. Extensive experiments on the OakInk and GRAB benchmarks demonstrate that RAGG achieves superior results compared to state-of-the-art approach, indicating not only better physical feasibility and controllability but also strong generalization and interpretability for unseen objects.

Abstract: Knowledge distillation transfers "dark knowledge" from a large teacher model to a smaller student model, yielding a highly efficient network. To improve network's generalization ability, existing works use a larger temperature coefficient for knowledge distillation. Nevertheless, these methods may lower the target category's confidence and lead to ambiguous recognition of similar samples. To mitigate this issue, some studies introduce intrabatch distillation to reduce prediction discrepancy. However, these methods overlook the inconsistency between background information and the target category, which may increase prediction bias due to noise disturbance. Additionally, label imbalance from random sampling and batch size can undermine network generalization reliability. To tackle these challenges, we propose a simple yet effective Intra-class Knowledge Distillation (IKD) method that facilitates knowledge sharing within the same class to ensure consistent predictions. First, we initialize the matrix and the vector to store logits and class counts provided by the teacher, respectively. Then, in the first epoch, we calculate the sum of logits and sample counts per class and perform KD to prevent knowledge omission. Finally, in subsequent training, we update the matrix to obtain the average logits and compute the KL divergence between the student's output and the updated matrix according to the label index. This process ensures intra-class consistency and improves the student's performance. Furthermore, this method theoretically reduces prediction bias by ensuring intra-class consistency. Extensive experiments on the CIFAR-100, ImageNet-1K, and Tiny-ImageNet datasets validate the superiority of IKD.

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. Department of Content Security, Kuaishou Technology, Beijing, China., School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. Department of Computer Science, City University of Hong Kong, Hong Kong, China., Department of Content Security, Kuaishou Technology, Beijing, China., School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China., SeetaCloud, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China., Department of Computer Science, City University of Hong Kong, Hong Kong, China.

Abstract: Editing videos with textual guidance has garnered popularity due to its streamlined process which mandates users to solely edit the text prompt corresponding to the source video. Recent studies have explored and exploited largescale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, they may still suffer from some limitations such as mislocated objects, incorrect number of objects. Therefore, the controllability of video editing remains a formidable challenge. In this paper, we aim to challenge the above limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specially, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose a Re-Attentional Diffusion (RAD) to refocus the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video. In particular, to faithfully preserve the invariant region content with less border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy to mitigate the intrinsic sampling errors w.r.t the invariant regions at each denoising timestep and constrain the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.

Abstract: 360° images have wide applications in fields such as virtual reality and user experience design. Our goal is to adjust these images to guide users' visual attention. To achieve this, we present a novel task: target scanpathguided 360° image enhancement, which aims to enhance 360° images based on user-specified target scanpaths. We develop a Progressive Scanpath-Guided Enhancement Method (PSEM) to address this problem through three stages. In the first stage, we propose a Time-Alignment and Spatial Similarity Clustering (TASSC) algorithm that accounts for the spherical nature of 360° images and the temporal dependency of scanpaths to generate representative scanpaths. In the second stage, we learn the differences between the source and the target scanpaths and select the objects to be edited based on these differences. Particularly, we propose a Dual-Stream Scanpath Difference Encoder (DSDE) embedded into the Segment Anything Model (SAM) network for object mask generation. Finally, we employ a Stable Diffusion network fine-tuned with LoRA technology to produce the final enhanced image. Additionally, we design special loss functions to supervise the training of the second and third stages. Experimental results have demonstrated the effectiveness of our approach for scanpath-guided 360° image enhancement.

Abstract: CutMix is a data augmentation strategy that cuts and pastes image patches to mixup training data. Existing methods pick either random or salient areas which are often inconsistent to labels, thus misguiding the training model. By our knowledge, we integrate human gaze to guide cutmix for the first time. Since human attention is driven by both highlevel recognition and low-level clues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention which balances top-down and bottom-up attention. The proposed TdATttenMix then picks the patches and adjust the label mixing ratio that focuses on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on the human gaze and use this metric to investigate the issue of image-label inconsistency.

Abstract: A new trend in the computer vision community is to capture objects of interest following flexible human command represented by a natural language prompt. However, the progress of using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired promptinstance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands nuScenes dataset by constructing a total of 40,147 language descriptions, each referring to an average of 7.4 object tracklets. Based on the object-text pairs from the new benchmark, we formulate a novel prompt-based driving task, \ie, employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide some new insights for the self-driving community.

Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Abstract: Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed taskspecific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.

Abstract: Generation of 3D human motion holds significant importance in the creative industry. While recent notable advances have been made in generating common motions, existing methods struggle to generate diverse and rare motions due to the complexity of motions and limited training data. This work introduces ReMoGPT, a unified motionlanguage generative model that solves a wide range of motion-related tasks by incorporating a multi-modal retrieval mechanism into the generation process to address the limitations of existing models, namely diversity and generalizability. We propose to focus on body-part-level motion features to enable fine-grained text-motion retrieval and locate suitable references from the database to conduct generation. Then, the motion-language generative model is trained with prompt-based question-and-answer tasks designed for different motion-relevant problems. We incorporate the retrieved samples into the prompt, and then perform instruction tuning of the motion-language model, to learn from task feedback and produce promising results with the help of fine-grained multi-modal retrieval. Extensive experiments validate the efficacy of ReMoGPT, showcasing its superiority over existing state-of-the-art methods. The framework performs well on multiple motion tasks, including motion retrieval, generation, and captioning.

Abstract: Zeroshot anomaly detection (ZSAD) aims to identify anomalies in new classes of images, and it’s vital in industry and other fields. Most current methods are based on the multimodal models CLIP and SAM, which have prior knowledge to assist model training, but they are highly dependent on the input of the prompts and their accuracy. We found that some diffusion model-based anomaly detection methods generate a large amount of semantic information and are very valuable for the ZSAD task. Therefore, we propose a diffusion model based zero-shot anomaly detection method, DZAD, and no additional prompt input is required. First, we propose the first diffusion-based zero-shot anomaly detection framework, which uses the proposed multi-timestep noise features extraction method to achieve anomaly detection in the denoising process of a latent space diffusion model with a semantic-guided (SG) network. Second, based on the detection results, we proposed a two-branch feature extractor for anomaly maps at different scales. Third, based on the difference between the anomaly detection task and other general image detection tasks, we propose a noise feature weight function for the diffusion model in the zero-shot anomaly detection task. Comparing with 7 recently state-of-the-art (SOTA) methods on MVTec AD and VisA datasets and analysis of the role of each component in ablation studies. The experiments demonstrate the validity of the method beyond the existing methods.

Abstract: Shadow is a phenomenon that degenerates image quality and decreases the performance of downstream vision algorithms. Despite the fact that current image shadow removal methods have achieved promising progress, many of them require an externally obtained shadow mask as a necessary part of the input data, which not only introduces additional workload but also leads to degenerated performance near the shadow boundary due to the inaccuracy of the mask. Some of them do not require the shadow mask, however, they need to simultaneously consider the restoration of the brightness and color information along with the preservation of the texture and structure information inside the shadow region without external clues, which poses highly illposedness and makes the results prone to artifacts. In this paper, we propose Pol-ShaRe, the first Polarization-guided image Shadow Removal solution, to remove shadow in a mask-free manner with fewer artifacts. Specifically, it consists of a two-stage pipeline to relieve the ill-posedness and a neural network tailored to the pipeline to suppress the artifacts. Experimental results show that our Pol-ShaRe achieves state-of-the-art performance on both synthetic and real-world images.

Abstract: In eXplainable Constraint Solving (XCS), it is common to extract a Minimal Unsatisfiable Subset (MUS) from a set of unsatisfiable constraints. This helps explain to a user why a constraint specification does not admit a solution. Finding MUSes can be computationally expensive for highly symmetric problems, as many combinations of constraints need to be considered. In the traditional context of solving satisfaction problems, symmetry has been well studied, and effective ways to detect and exploit symmetries during the search exist. However, in the setting of finding MUSes of unsatisfiable constraint programs, symmetries are understudied. In this paper, we take inspiration from existing symmetryhandling techniques and adapt well-known MUS-computation methods to exploit symmetries in the specification, speeding-up overall computation time. Our results display a significant reduction of runtime for our adapted algorithms compared to the baseline on symmetric problems.

Abstract: Microvideo popularity prediction (MVPP) plays a crucial role in various downstream applications. Recently, multimodal methods that integrate multiple modalities to predict the popularity have exhibited impressive performance. However, these methods face several unresolved issues: (1) limited contextual information and (2) incomplete modal semantics. Incorporating relevant videos and performing full fine-tuning on pre-trained models typically achieves powerful capabilities in addressing these issues. However, this paradigm is not optimal due to its weak transferability and scarce downstream data. Inspired by prompt learning, we propose ICPF, a novel In-Context Prompt-augmented Framework to enhance popularity prediction. ICPF maintains a model-agnostic design, facilitating seamless integration with various multimodal fusion models. Specifically, the multi-branch retriever first retrieves similar modal content through within-modality similarities. Next, in-context prompt generator extracts semantic prior features from retrieved videos and generates in-context prompts, enriching pre-trained models with valuable contextual knowledge. Finally, knowledge-augmented predictor captures complementary features including modal semantics and popularity information. Extensive experiments conducted on three real-world datasets demonstrate the superiority of ICPF compared to 14 competitive baselines.

Abstract: Multimodal Knowledge Graph Completion (KGC), which aims to enrich knowledge graph embeddings by incorporating images and text as supplementary information alongside triplets, is an significant task in learning KGs. Existing multi-modal KGC methods mainly focus on modalitylevel fusion, neglecting the importance of modeling the complex structures, such as hierarchical and circular patterns. To address this, we propose a Mixed-Curvature multi-modal Knowledge Graph Completion method (MCKGC) that embeds the information into three single-curvature spaces, including hyperbolic space, hyperspherical space, and Euclidean space, and incorporates multi-modal information into a mixed space. Specifically, MCKGC consists of Modality Information Mixed-Curvature Module (MIMCM) and Progressive Fusion Module (PFM). To improve the expressive ability for different modalities, MIMCM introduces multi-modal information into three single-curvature spaces for interaction. Then, to extract useful information from different modalities and capture the complex structure from the geometric information, PFM implements a progressive fusion strategy by utilizing modality-level and space-level gates to adaptively incorporate the information from different spaces. Extensive experiments on three widely used benchmarks demonstrate the effectiveness of our method.

College of Computer Science, Sichuan University, Chengdu, China, School of Computer Science, State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab, Peking University, Beijing, China, College of Mathematics, Sichuan University, Chengdu, China, School of Computer Science, State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab, Peking University, Beijing, China, School of Computer Science, State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab, Peking University, Beijing, China, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA, Huawei Hisilicon, Shanghai, China, School of Computing and Information Technology, Great Bay University, Dongguan, China, School of Computer Science, State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab, Peking University, Beijing, China

Abstract: This paper studies the problem of classimbalanced graph classification, which aims at effectively classifying the graph categories in scenarios with imbalanced class distributions. While graph neural networks (GNNs) have achieved remarkable success, their modeling ability on imbalanced graph-structured data remains suboptimal, which typically leads to predictions biased towards the majority classes. On the other hand, existing class-imbalanced learning methods in vision may overlook the rich graph semantic substructures of the majority classes and excessively emphasize learning from the minority classes. To address these challenges, we propose a simple yet powerful approach called C3GNN that integrates the idea of clustering into contrastive learning to enhance class-imbalanced graph classification. Technically, C3GNN clusters graphs from each majority class into multiple subclasses, with sizes comparable to the minority class, mitigating class imbalance. It also employs the Mixup technique to generate synthetic samples, enriching the semantic diversity of each subclass. Furthermore, supervised contrastive learning is used to hierarchically learn effective graph representations, enabling the model to thoroughly explore semantic substructures in majority classes while avoiding excessive focus on minority classes. Extensive experiments on real-world graph benchmark datasets verify the superior performance of our proposed method against competitive baselines.

Abstract: Interactive Recommendation (IR) has gained significant attention recently for its capability to quickly capture dynamic interest and optimize both short and long term objectives. IR agents are typically implemented through Deep Reinforcement Learning (DRL), because DRL is inherently compatible with the dynamic nature of IR. However, DRL is currently not perfect for IR. Due to the large action space and sample inefficiency problem, training DRL recommender agents is challenging. The key point is that useful features cannot be extracted as highquality representations for the recommender agent to optimize its policy. To tackle this problem, we propose Contrastive Representation for Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level preference ranking features from explicit interaction, and leverages the features to enhance users’ representation. Specifically, the CRIR provides representation through one representation network, and refines it through our proposed Preference Ranking Contrastive Learning (PRCL). The key insight of PRCL is that it can perform contrastive learning without relying on computations involving high-level representations or large potential action sets. Furthermore, we also propose a data exploiting mechanism and an agent training mechanism to better adapt CRIR to the DRL backbone. Extensive experiments have been carried out to show our method's superior improvement on the sample efficiency while training an DRL-based IR agent.

Abstract: Exclusion is an important and universal linguistic skill that humans use to express what they do not want. There is little research on exclusionary retrieval, where users express what they do not want to be part of the results produced for their queries. We investigate the scenario of exclusionary retrieval in document retrieval for the first time. We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The evaluation benchmark includes 3,452 highquality exclusionary queries, each of which has been manually annotated. The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document. We conduct detailed experiments and analyses, obtaining three main observations: (i) existing retrieval models with different architectures struggle to comprehend exclusionary queries effectively; (ii) although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; and (iii) generative retrieval models have a natural advantage in handling exclusionary queries.

Abstract: Graph anomaly detection is crucial for identifying anomalous nodes within graphs and addressing applications like financial fraud detection and social spam detection. Recent spectral graph neural network methods advance graph anomaly detection by focusing on anomalies that notably affect the distribution of graph spectral energy. Such spectrumbased methods rely on two steps: graph wavelet extraction and feature fusion. However, both steps are hand-designed, capturing incomprehensive anomaly information of wavelet-specific features and resulting in their inconsistent feature fusion. To address these problems, we propose a dynamic spectral graph anomaly detection framework DSGAD to adaptively capture comprehensive anomaly information and perform consistent feature fusion. DSGAD introduces dynamic wavelets, consisting of trainable wavelets to adaptively learn anomalous patterns and capture wavelet-specific features with comprehensive anomaly information. Furthermore, the consistent fusion of wavelet-specific features achieves dynamic fusion by combining wavelet-specific feature extraction with energy difference and channel convolution fusion using location correlation. Experimental results on four datasets substantiate the efficacy of our DSGAD method, surpassing state-of-the-art methods in both homogeneous and heterogeneous graphs.

Abstract: We study the fair allocation of indivisible goods among a group of agents, aiming to limit the envy between any two agents. The central open problem in this literature, which has proven to be extremely challenging, is regarding the existence of an EFX allocation, i.e., an allocation such that any envy from some agent i toward another agent j would vanish if we were to remove any single good from the bundle allocated to j. Prior work has shown that when the agents’ valuations are additive, which has been the main focus of prior works, an EFX allocation is guaranteed to exist for all instances involving up to three agents. Subsequent work extended this guarantee to more general valuations, like nicecancelable and MMS-feasible. However, the existence of EFX allocations for instances involving four agents remains open, even for additive valuations. We contribute to this literature by focusing on EF2X, a relaxation of EFX which requires that any envy toward some agent would vanish if any two of the goods allocated to that agent were to be removed. Our main result shows that EF2X allocations exist for any instance with four agents, even for the class of cancelable valuations, which is more general than additive. Our proof is constructive, proposing an algorithm that computes such an allocation in pseudo-polynomial time. Furthermore, for instances involving three agents we provide an algorithm that computes an EF2X allocation in polynomial time, in contrast to EFX for which the fastest known algorithm for three agents is only pseudo-polynomial.

Abstract: We introduce a model of fair division with market values, where indivisible goods must be partitioned among agents with (additive) subjective valuations, and each good additionally has a market value. The market valuation can be viewed as a separate additive valuation that holds identically across all the agents. We seek allocations that are simultaneously fair with respect to the subjective valuations and under the market valuation. We show that an allocation that satisfies stochasticallydominant envy-freeness up to one good (SD-EF1) with respect to both the subjective valuations and the market valuation does not always exist, but the weaker guarantee of EF1 with respect to the subjective valuations along with SD-EF1 with respect to the market valuation can be guaranteed. We also study a number of other guarantees such as Pareto optimality, EFX, and MMS. In addition, we explore non-additive valuations and extend our model to cake-cutting. Along the way, we identify several tantalizing open questions.

Abstract: Fairness and efficiency have become the pillars of modern fair division research, but prior work on achieving both simultaneously is largely limited to the unconstrained setting. We study fair and efficient allocations of indivisible goods under additive valuations and various types of allocation feasibility constraints, and demonstrate the unreasonable effectiveness of the maximum Nash welfare (MNW) solution in this previously uncharted territory. Our main result is that MNW allocations are 1/2envy-free up to one good (EF1) and Pareto optimal under the broad family of (arbitrary) matroid constraints. We extend these guarantees to complete MNW allocations for base-orderable matroid constraints, and to a family of non-matroid constraints (which includes balancedness). We establish tightness of our results by providing counterexamples for the satisfiability of certain stronger desiderata, but show an improved result for the special case of goods with copies (Gafni et al. 2023). Finally, we also establish novel best-of-both-worlds guarantees for goods with copies and balancedness.

Abstract: Majority illusion is a phenomenon in social networks wherein the decision by the majority of the network is not the same as one's personal social circle's majority, leading to an incorrect perception of the majority in a large network. We present polynomialtime algorithms which completely eliminate majority illusion by altering as few connections in the network as possible. Eliminating majority illusion ensures each neighbourhood in the network has at least a 1/2-fraction of the majority winner. This result is surprising as partially eliminating majority illusion is NP-hard. We generalize the majority illusion problem to an arbitrary fraction p and show that the problem of ensuring all neighbourhoods in the network contain at least a p-fraction of nodes consistent with a given preference is NP-hard, for nearly all values of p.

Abstract: We study a model of temporal voting where there is a fixed time horizon, and at each round the voters report their preferences over the available candidates and a single candidate is selected. Prior work has adapted popular notions of justified representation as well as voting rules that provide strong representation guarantees from the multiwinner election setting to this model. In our work, we focus on the complexity of verifying whether a given outcome offers proportional representation. We show that in the temporal setting verification is strictly harder than in multiwinner voting, but identify natural special cases that enable efficient algorithms.

Abstract: In this paper, we consider the problem of fair division of indivisible goods, where the allocation of goods impacts society. Specifically, we introduce a second valuation function for each agent, which determines the social impact of allocating a good to the agent. Such impact is considered desirable for the society the higher, the better. Our goal is to understand how to allocate goods fairly from the agents' perspective while maintaining society as happy as possible. To this end, we measure the impact on society using the utilitarian social welfare, and provide both possibility and impossibility results. Our findings reveal that achieving good approximations, better than linear in the number of agents, is not possible while ensuring fairness to the agents. These impossibility results can be attributed to the fact that agents are completely unconscious of their social impact. Consequently, we explore scenarios where agents are socially aware, by introducing related fairness notions, and demonstrate that an appropriate definition of fairness is compatible with the social objective.

Abstract: Beginning with Witkowski et al. (2023), recent work on forecasting competitions has addressed incentive problems with the common winnertake-all mechanism. Frongillo et al. (2021) propose a competition mechanism based on Multiplicative Weights, an online learning algorithm. They show that their mechanism selects an epsilon-optimal forecaster with high probability using only O(log(n)/epsilon^2) events. These works, together with all prior work on this problem thus far, assume that events are independent. We prove the first accuracy and approximate truthfulness guarantees for forecasting competitions with correlated events. To quantify correlation, we introduce a notion of block correlation, which allows each event to be strongly correlated with up to b others and weakly correlated with the rest. We show that under distributions with this correlation, the Multiplicative Weights mechanism retains its epsilon-optimal guarantee using O(b^2 log(n)/epsilon^2) events. Our proof involves a novel concentration bound for correlated random variables which may be of broader interest.

Abstract: ChatGPT has established Generative AI (GenAI) as a significant technological advancement. However, GenAI's intricate relationship with competing platforms and its downstream impact on users remains underexplored. This paper initiates the study of GenAI's long-term social impact resulting from the weakening network effect of human-based platforms like Stack Overflow. First, we study GenAI's revenue-maximization optimization problem. We develop an approximately optimal solution and show that the optimal solution has a non-cyclic structure. Then, we analyze the social impact, showing that GenAI could be socially harmful. Specifically, we present an analog to Braess's paradox in which all users would be better off without GenAI. Finally, we develop necessary and sufficient conditions for a regulator with incomplete information to ensure that GenAI is socially beneficial.

Abstract: Individual human decisionmakers may benefit from different forms of support to improve decision outcomes, but when will each form of support yield better outcomes? In this work, we posit that personalizing access to decision support tools can be an effective mechanism for instantiating the appropriate use of AI assistance. Specifically, we propose the general problem of learning a decision support policy that, for a given input, chooses which form of support to provide to decision-makers for whom we initially have no prior information. We develop Modiste, an interactive tool to learn personalized decision support policies. Modiste leverages stochastic contextual bandit techniques to personalize a decision support policy for each decision-maker. In our computational experiments, we characterize the expertise profiles of decision-makers for whom personalized policies will outperform offline policies, including population-wide baselines. Our experiments include realistic forms of support (e.g., expert consensus and predictions from a large language model) on vision and language tasks. Our human subject experiments add nuance to and bolster our computational experiments, demonstrating the practical utility of personalized policies when real users benefit from accessing support across tasks.

Abstract: We study the problem of realizing strategies for an LTLf goal specification while ensuring that at least an LTLf backup specification is satisfied in case of unreliability of certain input variables. We formally define the problem and characterize its worstcase complexity as 2EXPTIME-complete, like standard LTLf synthesis. Then we devise three different solution techniques: one based on direct automata manipulation, which is 2EXPTIME, one disregarding unreliable input variables by adopting a belief construction, which is 3EXPTIME, and one leveraging second-order quantified LTLf (QLTLf), which is 2EXPTIME and allows for a direct encoding into monadic second-order logic, which in turn is worst-case nonelementary. We prove their correctness and evaluate them against each other empirically. Interestingly, theoretical worst-case bounds do not translate into observed performance; the MSO technique performs best, followed by belief construction and direct automata manipulation. As a byproduct of our study, we provide a general synthesis procedure for arbitrary QLTLf specifications.

Abstract: The card game Hanabi has recently gained popularity as a benchmark for handling epistemic reasoning in AI systems. However it has until now mostly been approached through the lens of machine learning rather than formal logical analysis. This is mostly due to the fact that modeling Hanabi in the standard epistemic logic DEL is untractable. In this paper we take a different approach to formalizing Hanabi, using the simple epistemic logic ELO as a starting point. We generalize common knowledge in EL-O to arbitrary groups of agents and show how to overcome some of the limitations EL-O places on agent reasoning by introducing a special reasoning action. Analyzing our formalization of Hanabi finally leads us to introduce an alternative semantics for our generalization of EL-O in which models are finite and satisfiability checking is NP-complete, and which is enough to fully describe the evolution of knowledge in a game of Hanabi.

Abstract: Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either roworiented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Soufflé use advanced datastructures with locking to perform better on multi-core CPUs. The advent of modern datacenter GPUs, such as the NVIDIA H100 with its ability to run over 16k threads simultaneously and high memory bandwidth, has reopened the debate on which storage layout is more effective. This paper presents the first column-oriented Datalog engines tailored to the strengths of modern GPUs. We present VFLog, a CUDA-based Datalog runtime library with a column-oriented GPU datastructure that supports all necessary relational algebra operations. Our results demonstrate over 200x performance gains over SOTA CPU-based column-oriented Datalog engines and a 2.5x speedup over GPU Datalog engines in various workloads, including KRR.

Abstract: Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. Prior work has shown that given a distribution P defined as the marginal distribution of a Bayes net, it is NPhard to decide whether there is a parameter-bounded Bayes net that represents P. They called this problem LEARN. In this work, we extend the NP-hardness result of LEARN and prove the NP-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution P, represented by some parameter-bounded Bayes net, thereby generalizing a degree-bounded sample complexity literature result.

Abstract: Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their "trainingafter-tuning" framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both the client and server sides. Compared with prior tuning methods, FedPop employs an online "tuning-while-training" framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets, including full-sized Non-IID ImageNet-1K, demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP-tuning methods in FL.

Abstract: We present *generative clustering* (GC) for clustering a set of documents, X, by using texts Y generated by large language models (LLMs) instead of by clustering the original documents X. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an informationtheoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC outperforms any previous clustering method, often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.

Abstract: Existing Multiple Kernel Clustering (MKC) algorithms commonly utilize the Nyström method to handle largescale datasets. However, most of them employ uniform sampling for kernel matrix approximation, hence failing to accurately capture the underlying data structure, leading to large approximation errors. Additionally, they often use the same landmark points for all kernel matrix approximations, reducing kernel diversity. Moreover, in scenarios where approximate kernel matrices emerge over time, these methods require storing historical kernel information and recalculating, resulting in inefficient resource utilization. To address these issues, we propose a novel MKC algorithm, termed Incremental Nyström-based Multiple Kernel Clustering (INMKC). Specifically, leverage score sampling is utilized to reduce kernel approximation errors and enhance kernel diversity. Furthermore, we employ a consensus clustering structure that aligns with the newly emerged base kernel matrix for updates, avoiding recalculating previous kernel matrices, thus saving substantial computational resources. Additionally, we tackle the challenge of aligning incremental approximate kernels with different landmark points. Extensive experiments on the proposed INMKC demonstrate its effectiveness and efficiency compared to state-of-the-art methods.

Abstract: Graph neural networks(GNNs) have been demonstrated to depend on whether the node effective information is sufficiently passing. Discrete curvature (Ricci curvature) is used to study graph connectivity and information propagation efficiency with a geometric perspective, and has been raised in recent years to explore the efficient messagepassing structure of GNNs. However, most empirical studies are based on directly observed graph structures or heuristic topological assumptions, and lack in-depth exploration of underlying optimal information transport structures for downstream tasks. We suggest that graph curvature optimization is more in-depth and essential than directly rewiring or learning for graph structure with richer message-passing characterization and better information transport interpretability. From both graph geometry and information theory perspectives, we propose the novel Discrete Curvature Graph Information Bottleneck (CurvGIB) framework to optimize the information transport structure and learn better node representations simultaneously. CurvGIB advances the Variational Information Bottleneck (VIB) principle for Ricci curvature optimization to learn the optimal information transport pattern for specific downstream tasks. The learned Ricci curvature is used to refine the optimal transport structure of the graph, and the node representation is fully and efficiently learned. Moreover, for the computational complexity of Ricci curvature differentiation, we combine Ricci flow and VIB to deduce a curvature optimization approximation to form a tractable IB objective function. Extensive experiments on various datasets demonstrate the superior effectiveness and interpretability of CurvGIB.

Abstract: Federated feature selection (FFS) is a promising field for selecting informative features while preserving data privacy in federated learning (FL) settings. Existing FFS methods focus on capturing the correlations between features and labels. They struggle to achieve satisfactory performance in the face of data distribution heterogeneity among FL clients, and cannot address the outof-distribution (OOD) problem that arises when a significant portion of clients do not actively participate in FL training. To address these limitations, we propose Federated Causally Invariant Feature Learning (FedCIFL), a novel approach for learning causally invariant features in a privacy-preserving manner. We design a sample reweighting strategy to eliminate spurious correlations introduced by selection bias and iteratively estimate the federated causal effect between each feature and the labels (with the remaining features initially treated as confounders). By iteratively refining the confounding feature set to identify the true confounders, FedCIFL mitigates the impact of limited local data on the accuracy of federated causal effect estimation. Theoretical analysis proves the correctness of FedCIFL under reasonable assumptions. Extensive experiments on synthetic and real-world datasets demonstrate the superiority of FedCIFL against eight state-of-the-art baselines, beating the best-performing approach by 3.19%, 9.07% and 2.65% in terms of average test Accuracy, RMSE and F1 score, respectively. It is a first-of-its-kind FFS approach capable of handling Non-IID and OOD data simultaneously. The source code is available at https://github.com/Xianjie-Guo/FedCIFL.

Abstract: We present Diffusion Model Patching (DMP), a simple method to boost the performance of pretrained diffusion models that have already reached convergence, with a negligible increase in parameters. DMP inserts a small, learnable set of prompts into the model's input space while keeping the original model frozen. The effectiveness of DMP is not merely due to the addition of parameters but stems from its dynamic gating mechanism, which selects and combines a subset of learnable prompts at every step of the generative process (i.e., reverse denoising steps). This strategy, which we term "mixture-of-prompts'', enables the model to draw on the distinct expertise of each prompt, essentially "patching'' the model's functionality at every step with minimal yet specialized parameters. Uniquely, DMP enhances the model by further training on the original dataset already used for training, even in a scenario where significant improvements are typically not expected due to model convergence. Experiments show that DMP significantly enhances the converged FID of DiT-L/2 on FFHQ by 10.38%, achieved with only a 1.43% parameter increase and 50K additional training iterations.

Abstract: With the advancement of pretrained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Different from existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise) , which quantitatively analyzes the impact of noise. It therefore implies a new scheme to learn beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate Pi-noise towards visual and linguistic modalities. Then, we propose Positive-incentive Noise Injector (PiNI), which can fine-tune CLIP via injecting noise into both visual and text encoders. Since the proposed method can learn the distribution of beneficial noise, we can obtain more diverse embeddings of vision and language to better align these two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures of PiNI. The evaluation across 11 datasets demonstrates its effectiveness.

Abstract: Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constraint of each latent variable, omitting the structural information between latent variables. In this paper, we propose a novel structural entropyguided probabilistic coding model, named SEPC. Specifically, we incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Besides, as traditional structural information theory is not well-suited for regression tasks, we propose a probabilistic encoding tree, transferring regression tasks to classification tasks while diminishing the influence of the transformation. Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise.

State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Computing Science and Artificial Intelligence College, Suzhou City University, College of Computer Science, Zhejiang University of Technology, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Abstract: The applicability of drug molecules in various clinical scenarios is significantly influenced by a diverse range of molecular properties. By leveraging selfsupervised conditions such as atom attributes and interatomic bonds, existing advanced molecular foundation models can generate expressive representations of these molecules. However, such models often overlook the fixed association patterns within molecules that influence physiological or chemical properties. In this paper, we introduce a novel association pattern-aware message passing method, which can serve as an effective yet general plug-and-play plugin, thereby enhancing the atom representations generated by molecular foundation models without requiring additional pretraining. Additionally, molecular property-specific pattern libraries are constructed to collect the generated interpretable common patterns that bind to these properties. Extensive experiments conducted on 11 benchmark molecular property prediction tasks across 8 advanced molecular foundation models demonstrate significant superiority of the proposed method, with performance improvements of up to approximately 20%. Furthermore, a property-specific pattern library is tailored for blood-brain barrier penetration, which has undergone corresponding mechanistic validation.

Abstract: Solutions of symbolic regression problems are expressions that are composed of input variables and operators from a finite set of function symbols. One measure for evaluating symbolic regression algorithms is their ability to recover formulae, up to symbolic equivalence, from finite samples. Not unexpectedly, the recovery problem becomes harder when the formula gets more complex, that is, when the number of variables and operators gets larger. Variables in naturally occurring symbolic formulas often appear only in fixed combinations. This can be exploited in symbolic regression by substituting one new variable for the combination, effectively reducing the number of variables. However, finding valid substitutions is challenging. Here, we address this challenge by searching over the expression space of small substitutions and testing for validity. The validity test is reduced to a test of functional dependence. The resulting iterative dimension reduction procedure can be used with any symbolic regression approach. We show that it reliably identifies valid substitutions and significantly boosts the performance of different types of stateof-the-art symbolic regression algorithms.

Abstract: In this work, we extend the concept of the pmean welfare objective from social choice theory to study p-mean regret in stochastic multi-armed bandit problems. The p-mean regret, defined as the difference between the optimal mean among the arms and the p-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter p. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel p-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer et al. (2002). Under mild assumptions, we prove that our algorithm achieves a p-mean regret bound of Otilde( sqrt( k / T^{1/(2|p|)} ) ) for all p <= -1, where k represents the number of arms and T the time horizon. When -1< p < 0, we achieve a regret bound of Otilde( sqrt( k^{1.5} / T^{1/2} ) ). For the range 0 < p <= 1, we achieve a p-mean regret scaling as Otilde( sqrt( k / T ) ), which matches the previously established lower bound up to logarithmic factors. This result stems from the fact that the p-mean regret of any algorithm is at least its average cumulative regret for p <= 1. In the case of Nash regret (the limit as p approaches zero), our unified approach differs from prior work of Barman et al. (2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.

Abstract: Clustering ensemble has been a popular research topic in data science due to its ability to improve the robustness of the single clustering method. Many clustering ensemble methods have been proposed, most of which can be categorized into clusteringview and sample-view methods. The clustering-view method is generally efficient, but it could be affected by the unreliability that existed in base clustering results. The sample-view method shows good performance, while the construction of the pairwise sample relation is time-consuming. In this paper, the clustering ensemble is formulated as a k-HyperEdge Medoids discovery problem and a clustering ensemble method based on k-HyperEdge Medoids that considers the characteristics of the above two types of clustering ensemble methods is proposed. In the method, a set of hyperedges is selected from the clustering view efficiently, then the hyperedges are diffused and adjusted from the sample view guided by a hyperedge loss function to construct an effective k-HyperEdge Medoid set. The loss function is mainly reduced by assigning samples to the hyperedge with the highest degree of belonging. Theoretical analyses show that the solution can approximate the optimal, the assignment method can gradually reduce the loss function, and the estimation of the belonging degree is statistically reasonable. Experiments on artificial data show the working mechanism of the proposed method. The convergence of the method is verified by experimental analysis of twenty data sets. The effectiveness and efficiency of the proposed method are also verified on these data, with nine representative clustering ensemble algorithms as reference.

Key Laboratory of Multimedia Trusted Perception and Effcient Computing, Ministry of Education of China, Xiamen University, China School of Informatics, Xiamen University, China, Hangzhou Research Institute, School of Artificial Intelligence, Beihang University, China Nanchang Institute of Technology, China, School of Engineering Science, University of Chinese Academy of Sciences, China, Key Laboratory of Multimedia Trusted Perception and Effcient Computing, Ministry of Education of China, Xiamen University, China School of Informatics, Xiamen University, China, Ningbo Institute of Digital Twin, Eastern Institute of Technology, China, Key Laboratory of Multimedia Trusted Perception and Effcient Computing, Ministry of Education of China, Xiamen University, China School of Informatics, Xiamen University, China Key Laboratory of Oracle Bone Inscriptions Information Processing, Ministry of Education of China, Anyang Normal University, China

Abstract: Convolutional neural networks (CNNs) have been playing a dominant role in computer vision. However, the existing approaches of using local window modeling in popular CNNs lack flexibility and hinder their ability to capture longrange dependencies of objects in an image. To overcome these limitations, we propose a novel CNN architecture, termed Dynamic Clustering Convolutional Neural Network (DCCNeXt). The proposed DCCNeXt takes a unique approach by employing global clustering to group image patches with similar semantics into clusters that are then convolved using the shared convolution kernels. To address the high computational complexity of global clustering, the feature vectors from each patch's subspace are extracted for efficient clustering, which makes the proposed model widely compatible with the downstream vision tasks. The extensive experiments of image classification, object detection, instance segmentation, and semantic segmentation on the benchmark datasets demonstrate that the proposed DCCNeXt outperforms the mainstream Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Vision Multi-layer Perceptrons (MLPs), Vision Graph Neural Networks (GNNs), and Vision Mambas. We anticipate that this study will provide a new perspective and a promising avenue for the design of convolutional neural networks.

Abstract: Knowledge distillation (KD) aims to improve the performance of lightweight student networks under the guidance of pretrained teachers. However, the large capacity gap between teachers and students limits the distillation gains. Previous methods addressing this problem have two weaknesses. First, most of them decrease the performance of pre-trained teachers, hindering students from achieving comparable performance. Second, these methods fail to dynamically adjust the transferred knowledge to be compatible with the representation ability of students, which is less effective in bridging the capacity gap. In this paper, we propose Adaptive Dual Guidance Knowledge Distillation (ADG-KD), which retains the guidance of the pre-trained teacher and uses the teacher's bidirectional optimization route guiding the student to alleviate the capacity gap problem. Specifically, ADG-KD introduces an initialized teacher, which has an identical structure to the pre-trained teacher and is optimized through the bidirectional supervision from both the pre-trained teacher and student. In this way, we construct the teacher's bidirectional optimization route to provide the students with an easy-to-hard and compatible knowledge sequence. ADG-KD trains the students under the proposed dual guidance approaches and automatically determines their importance weights, making the transferred knowledge better compatible with the representation ability of students. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate the effectiveness of our method.

Abstract: Differential equations are widely used to describe complex dynamical systems with evolving parameters in nature and engineering. Effectively learning a family of maps from the parameter function to the system dynamics is of great significance. In this study, we propose a novel learning framework of symbolic continuousdepth neural networks, termed Symbolic Neural Ordinary Differential Equations (SNODEs), to effectively and accurately learn the underlying dynamics of complex systems. Specifically, our learning framework comprises three stages: initially, pre-training a predefined symbolic neural network via a gradient flow matching strategy; subsequently, fine-tuning this network using Neural ODEs; and finally, constructing a general neural network to capture residuals. In this process, we apply the SNODEs framework to partial differential equation systems through Fourier analysis, achieving resolution-invariant modeling. Moreover, this framework integrates the strengths of symbolism and connectionism, boasting a universal approximation theorem while significantly enhancing interpretability and extrapolation capabilities relative to state-of-the-art baseline methods. We demonstrate this through experiments on several representative complex systems. Therefore, our framework can be further applied to a wide range of scientific problems, such as system bifurcation and control, reconstruction and forecasting, as well as the discovery of new equations.

Abstract: In multiagent reinforcement learning, centralized training with decentralized execution (CTDE) methods typically assumes that agents make decisions based on their local observations independently, which may not lead to a correlated joint policy with coordination. Coordination can be explicitly encouraged during training and individual policies can be trained to imitate the correlated joint policy. However, this may lead to an asymmetric learning failure due to the observation mismatch between the joint and individual policies. Inspired by the concept of correlated equilibrium, we introduce a strategy modification called AgentMixer that allows agents to correlate their policies. AgentMixer combines individual partially observable policies into a joint fully observable policy non-linearly. To enable decentralized execution, we introduce Individual-Global-Consistency to guarantee mode consistency during joint training of the centralized and decentralized policies and prove that AgentMixer converges to an ϵ-approximate Correlated Equilibrium. In the Multi-Agent MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey benchmarks, AgentMixer outperforms or matches state-of-the-art methods.

Abstract: Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for listdecodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest clients, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Assuming the variance of gradient noise in stochastic gradient descent is bounded, we prove a convergence theorem of our method for strongly convex and smooth losses. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

Abstract: In this paper, we investigate a variant of the classical stochastic Multiarmed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting models faithfully many real world scenarios such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent’s cost); or a user's time spent on a web page given a choice of content (where delay serves as the agent’s reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as ∑i:Δi > 0(log T)/Δi + d*, where T is the maximal number of steps, Δi are the sub-optimality gaps and d* is the minimal expected delay amongst arms. For the case that delays serves as rewards, we show optimal regret of ∑i:Δi > 0(log T)/Δi + d̄, where d̄ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as ∑i:Δi > 0(log T)/Δi+ D, where D is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.

Abstract: The human brain is a complex system, and understanding its mechanisms has been a longstanding challenge in neuroscience. The study of the functional connectome, which maps the functional connections between different brain regions, has provided valuable insights through various advanced analysis techniques developed over the years. Similarly, neural networks, inspired by the brain's architecture, have achieved notable success in diverse applications but are often noted for their lack of interpretability. In this paper, we propose a novel approach that bridges neural networks and human brain functions by leveraging brain-inspired techniques. Our approach, grounded in the insights from the functional connectome, offers scalable ways to characterize topology of large neural networks using stable statistical and machine learning techniques. Our empirical analysis demonstrates its capability to enhance the interpretability of neural networks, providing a deeper understanding of their underlying mechanisms.

Abstract: Formal XAI is an emerging field that focuses on providing explanations with mathematical guarantees for the decisions made by machine learning models. A significant amount of work in this area is centered on the computation of ``sufficient reasons''. Given a model M and an input instance x, a sufficient reason for the decision x is a subset S of the features of x such that for any instance z that has the same values as x for every feature in S, it holds that M(x) = M(z). Intuitively, this means that the features in S are sufficient to fully justify the classification of x by M. For sufficient reasons to be useful in practice, they should be as small as possible, and a natural way to reduce the size of sufficient reasons is to consider a probabilistic relaxation; the probability of M(x) = M(z) must be at least some value delta in (0,1], where z is a random instance that coincides with x on the features in S. Computing small deltasufficient reasons (delta-SRs) is known to be a theoretically hard problem; even over decision trees — traditionally deemed simple and interpretable models — strong inapproximability results make the efficient computation of small delta-SRs unlikely. We propose the notion of (delta, epsilon)-SR, a simple relaxation of delta-SRs, and show that this kind of explanations can be computed efficiently over linear models.

Abstract: Deep neural networks (DNNs) have achieved significant success across various tasks, but ensuring reliable uncertainty estimates, known as model calibration, is crucial for their safe and effective deployment. Modern DNNs often suffer from overconfidence, leading to miscalibration. We propose a novel posthoc calibration method called feature clipping (FC) to address this issue. FC involves clipping feature values to a specified threshold, effectively increasing entropy in high calibration error samples while maintaining the information in low calibration error samples. This process reduces the overconfidence in predictions, improving the overall calibration of the model. Our extensive experiments on datasets such as CIFAR-10, CIFAR-100, and ImageNet, and models including CNNs and transformers, demonstrate that FC consistently enhances calibration performance. Additionally, we provide a theoretical analysis that validates the effectiveness of our method. As the first calibration technique based on feature modification, feature clipping offers a novel approach to improving model calibration, showing significant improvements over both post-hoc and train-time calibration methods and pioneering a new avenue for feature-based model calibration.

Abstract: In this paper, we address a practical distributed Bayesian learning problem with asynchronous measurements and predictions due to diverse computational conditions. To this end, asynchronous distributed Gaussian process (AsyncDGP) regression is proposed, which is the first effective online distributed Gaussian processes (GPs) approach to improve the prediction accuracy in realtime learning tasks. By leveraging the devised evaluation criterion and established prediction error bounds, AsyncDGP enables the distinction of contributions of each model for prediction ensembling using aggregation strategy. Furthermore, we extend its utility to dynamic systems by introducing a learning-based control law, ensuring guaranteed control performance in safety-critical applications. Additionally, a networked online learning simulation platform for distributed GPs, namely online GP gym (GPgym), is introduced for testing the performance of learning and control of dynamical systems. Numerical simulations within GPgym across regression tasks with real-world data sets and dynamical control scenarios demonstrate the effectiveness and applicability of AsyncDGP.

School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, School of Computer Science and Technology, University of Science and Technology of China, School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence

Abstract: The task of musicto-lyric generation aims to create lyrics that can be sung in harmony with the music while capturing the music’s intrinsic meaning. Previous efforts in this area have struggled to effectively handle both the structural and semantic alignments of music and lyrics, often relying on rigid, manually crafted rules or overlooking the semantic essence of music, which deviates from the natural lyric-writing process of humans. In this paper, we bridge the structural and semantic gap between music and lyrics by proposing an end-to-end model for music-driven lyric generation. Our model aims at generating well-formatted lyrics based solely on the music while capturing its inherent semantic essence. In the music processing phase, we introduce a hierarchical music information extractor, which operates at both the song and sentence levels. The song-level extractor focuses on discerning the overall semantic content of the music, such as themes and emotions. Simultaneously, the sentence-level extractor captures the local semantic and structural details from note sequences. Additionally, we propose a lyric length predictor that determines the optimal length for the generated lyrics. During the lyric generation phase, the information gathered by the above modules is integrated, providing essential guidance for the downstream lyric generation module to produce coherent and meaningful lyrics. Experimental results on objective and subjective benchmarks demonstrate the capabilities of our proposed model in capturing semantics and generating well-formatted lyrics.

Abstract: As functional data assumes a central role in contemporary data analysis, the search for meaningful dimension reduction becomes critical due to its inherent infinitedimensional structure. Traditional methods, such as Functional Principal Component Analysis (FPCA), adeptly explore the overarching structures within the functional data. However, these methods may not sufficiently identify low-dimensional representations that are specific or enriched in a foreground dataset (case or treatment group) relative to a background dataset (control group). This limitation becomes critical in scenarios where the foreground dataset, such as a specific treatment group in biomedical applications, contains unique patterns or trends that are not as pronounced in the background dataset. Addressing this gap, we propose Contrastive Functional Principal Component Analysis (CFPCA), a method designed to spotlight low-dimensional structures unique to or enriched in the foreground dataset relative to the background counterpart. We supplement our method with theoretical guarantees on CFPCA estimates supported by multiple simulations. Through a series of applications, CFPCA successfully identifies these foreground-specific structures, thereby revealing distinct patterns and trends that traditional FPCA overlooks.

Abstract: Classical neural ODEs trained with explicit methods are intrinsically limited by stability, crippling their efficiency and robustness for stiff learning problems that are common in graph learning and scientific machine learning. We present a semiimplicit neural ODE approach that exploits the partitionable structure of the underlying dynamics. Our technique leads to an implicit neural network with significant computational advantages over existing approaches because of enhanced stability and efficient linear solves during time integration. We show that our approach outperforms existing approaches on a variety of applications including graph classification and learning complex dynamical systems. We also demonstrate that our approach can train challenging neural ODEs where both explicit methods and fully implicit methods are intractable.

Abstract: Visual language models like Contrastive LanguageImage Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs.

Abstract: Existing knowledge distillation (KD) methods have demonstrated their ability to achieve student network performance on par with their teachers. However, the knowledge gap between the teacher and student remains significant and may hinder the effectiveness of the distillation process. In this work, we introduce the structure of Neural Collapse (NC) into the KD framework. NC typically occurs in the final phase of training, resulting in a graceful geometric structure where the lastlayer features form a simplex equiangular tight frame. We hypothesize that NC can alleviate the knowledge gap in distillation, thereby enhancing student performance. This paper begins with an empirical analysis to bridge the connection between KD and NC. Through this analysis, we establish that transferring the teacher's NC structure to the student benefits the distillation process. Therefore, instead of merely transferring instance-level logits or features, as done by existing distillation methods, we encourage students to learn the teacher's NC structure. We propose the new distillation paradigm termed Neural Collapse-inspired Knowledge Distillation (NCKD). Comprehensive experiments demonstrate that NCKD is simple yet effective, improving the generalization of all distilled student models and achieving state-of-the-art accuracy performance.

Abstract: Adaptive optimizers such as Adam and RMSProp have gained attraction in complex neural networks, including generative adversarial networks (GANs) and Transformers, thanks to their stable performance and fast convergence compared to nonadaptive optimizers. A frequently overlooked limitation of adaptive optimizers is that adjusting the learning rate of each dimension individually would ignore the knowledge of the whole loss landscape, resulting in slow updates of parameters, invalidating the learning rate adjustment strategy and eventually leading to widespread insufficient convergence of parameters. In this paper, we propose HVAdam, a novel optimizer that associates all dimensions of the parameters to find a new parameter update direction, leading to a refined parameter update strategy for an increased convergence rate. We validated HVAdam in extensive experiments, showing its faster convergence, higher accuracy, and more stable performance on image classification, image generation, and natural language processing tasks. Particularly, HVAdam achieves a significant improvement on GANs compared with other state-of-the-art methods, especially in Wasserstein-GAN (WGAN) and its improved version with gradient penalty (WGAN-GP).

Abstract: This paper studies the prediction task of tensoron-tensor regression in which both covariates and responses are multi-dimensional arrays (a.k.a., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focused on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employed black-box deep learning algorithms that failed to utilize the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin with summarizing and extracting useful predictive information (represented by the ``factor tensor'') from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as input of a temporal convolutional neural network. The proposed methods effectively handle nonlinearity between complex data structures, and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our proposed methods exploit the underlying latent factor structure to enhance the prediction, and in the meantime, drastically reduce the data dimensionality that speeds up the computation. The empirical performances of our proposed methods are demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our proposed algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods.

College of Computer Science and Technology, Zhejiang University State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University Alibaba Cloud Computing, College of Computer Science and Technology, Zhejiang University State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University, School of Artificial Intelligence and Data Science, University of Science and Technology of China, Alibaba Cloud Computing, Alibaba Cloud Computing, AI Thrust, Information Hub, HKUST(Guangzhou), State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Alibaba Cloud Computing

Abstract: Multimodality pre-training paradigm that aligns protein sequences and biological descriptions has learned general protein representations and achieved promising performance in various downstream applications. However, these works were still unable to replicate the extraordinary success of language-supervised visual foundation models due to the ineffective usage of aligned protein-text paired data and the lack of an effective function-informed pre-training paradigm. To address these issues, this paper curates a large-scale protein-text paired dataset called ProtAnno with a property-driven sampling strategy, and introduces a novel function-informed protein pre-training paradigm. Specifically, the sampling strategy determines selecting probability based on the sample confidence and property coverage, balancing the data quality and data quantity in face of large-scale noisy data. Furthermore, motivated by significance of the protein specific functional mechanism, the proposed paradigm explicitly model protein static and dynamic functional segments by two segment-wise pre-training objectives, injecting fine-grained information in a function-informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi-modality foundation model that comprehensively represents function-aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.

Abstract: We introduce a novel method for assessing agent teamwork based on their spatial coordination. Our approach models the influence of spatial proximity on team formation and sustained spatial dominance over adversaries using a Multiagent Markov Decision Process. We develop an algorithm to derive efficient teamwork strategies by combining Monte Carlo Tree Search and linear programming. When applied to team defence in football (soccer) using real-world data, our approach reduces opponent threat by 21%, outperforming optimised individual behaviour by 6%. Additionally, our model enhances the predictive accuracy of future attack locations and provides deeper insights compared to existing teamwork models that do not explicitly consider the spatial dynamics of teamwork.

Abstract: Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and evaluate its relationship with natural languages (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT's potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability exhibits pragmatic EC, which, although challenging to interpret, remains suitable for translation. This research marks the first attempt, to our knowledge, to translate EC without the aid of parallel data.

Abstract: Traditional multiagent path finding (MAPF) methods try to compute entire collision free start-goal paths, with several algorithms offering completeness guarantees. However, computing partial paths offers significant advantages including faster planning, adaptability to changes, and enabling decentralized planning. Methods that compute partial paths employ a "windowed" approach and only try to find collision free paths for a limited timestep horizon. While this improves flexibility, this adaptation introduces incompleteness; all existing windowed approaches can become stuck in deadlock or livelock. Our main contribution is to introduce our framework, WinC-MAPF, for Windowed MAPF that enables completeness. Our framework leverages heuristic update insights from single-agent real-time heuristic search algorithms and agent independence ideas from MAPF algorithms. We also develop Single-Step Conflict Based Search (SS-CBS), an instantiation of this framework using a novel modification to CBS. We show how SS-CBS, which only plans a single step and updates heuristics, can effectively solve tough scenarios where existing windowed approaches fail.

Abstract: Large Language Models (LLMs) have recently achieved impressive results in complex reasoning tasks through Chain of Thought (CoT) prompting. However, most existing CoT methods rely on using the same prompts, whether manually designed or automatically generated, to handle the entire dataset. This onesize-fits-all approach may fail to meet the specific needs arising from the diversities within a single dataset. To solve this problem, we propose the Clustered Distance-Weighted Chain of Thought (CDW-CoT) method, which dynamically constructs prompts tailored to the characteristics of each data instance by integrating clustering and prompt optimization techniques. Our method employs clustering algorithms to categorize the dataset into distinct groups, from which a candidate pool of prompts is selected to reflect the inherent diversity within the dataset. For each cluster, CDW-CoT trains the optimal prompt probability distribution tailored to their specific characteristics. Finally, it dynamically constructs a unique prompt probability distribution for each test instance, based on its proximity to cluster centers, from which prompts are selected for reasoning. CDW-CoT consistently outperforms traditional CoT methods across six datasets, including commonsense, symbolic, and mathematical reasoning tasks. Specifically, when compared to manual CoT, CDW-CoT achieves an average accuracy improvement of 25.34% on LLaMA2 (13B) and 15.72% on LLaMA3 (8B).

Abstract: Codeswitching is a linguistic phenomenon in which different languages are used interactively during conversation. It poses significant performance challenges to natural language processing (NLP) tasks due to the often monolingual nature of the underlying system. We focus on sentence-level semantic associations between the different code-switching expressions. And we propose an innovative task-free semantic learning method based on the semantic property. Specifically, there are many different ways of languages switching for a sentence with the same meaning. We refine this into a semantic computational method by designing the loss of semantic invariant constraint during the model optimization. In this work, we conduct thorough experiments on speech recognition, speech translation, and language modeling tasks. The experimental results fully demonstrate that the proposed method can widely improve the performance of code-switching related tasks.

Abstract: In recent years, Textto-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

Abstract: With the widespread adoption of AI systems, many of the decisions once made by humans are now delegated to automated systems. Recent works in the literature demonstrate that these automated systems, when used in socially sensitive domains, may exhibit discriminatory behavior based on sensitive characteristics such as gender, sex, religion, or race. In light of this, various notions of fairness and methods to quantify discrimination have been proposed, also leading to the development of numerous approaches for constructing fair predictors. At the same time, imposing fairness constraints may decrease the utility of the decisionmaker, highlighting a tension between fairness and utility. This tension is also recognized in legal frameworks, for instance in the disparate impact doctrine of Title VII of the Civil Rights Act of 1964 -- in which specific attention is given to considerations of \textit{business necessity} -- possibly allowing the usage of proxy variables associated with the sensitive attribute in case a high-enough utility cannot be achieved without them. In this work, we analyze the tension between fairness and accuracy from a causal lens for the first time. We introduce the notion of a path-specific excess loss (PSEL) that captures how much the predictor's loss increases when a causal fairness constraint is enforced. We then show that the total excess loss (TEL), defined as the difference between the loss of predictor fair along all causal pathways vs. an unconstrained predictor, can be decomposed into a sum of more local PSELs. At the same time, enforcing a causal constraint often reduces the disparity between demographic groups. Thus, we introduce a quantity that summarizes the fairness-utility trade-off, called the causal fairness/utility ratio, defined as the ratio of the reduction in discrimination vs. the excess in the loss from constraining a causal pathway. This quantity is particularly suitable for comparing the fairness-utility trade-off across different causal pathways. Finally, as our approach requires causally-constrained fair predictors, we introduce a new neural approach for causally-constrained fair learning. Our approach is evaluated across multiple real-world datasets, providing new insights into the tension between fairness and accuracy.

Abstract: We present a methodology based on interactive theorem proving that facilitates the development of verified implementations of algorithms for solving factored Markov Decision Processes. As a case study, we formally verify an algorithm for approximate policy iteration in the proof assistant Isabelle/HOL. We show how the verified algorithm can be refined to an executable, verified implementation. Our evaluation on benchmark problems shows that it is practical. As part of the development, we build verified software to certify linear programming solutions. We discuss the verification process and the modifications we made to the algorithm during formalization.

Abstract: Classical planning asks for a sequence of operators reaching a given goal. While the most common case is to compute a plan, many scenarios require more than that. However, quantitative reasoning on the plan space remains mostly unexplored. A fundamental problem is to count plans, which relates to the conditional probability on the plan space. Indeed, qualitative and quantitative approaches are wellestablished in various other areas of automated reasoning. We present the first study to quantitative and qualitative reasoning on the plan space. In particular, we focus on polynomially bounded plans. On the theoretical side, we study its complexity, which gives rise to rich reasoning modes. Since counting is hard in general, we introduce the easier notion of facets, which enables understanding the significance of operators. On the practical side, we implement quantitative reasoning for planning. Thereby, we transform a planning task into a propositional formula and use knowledge compilation to count different plans. This framework scales well to large plan spaces, while enabling rich reasoning capabilities such as learning pruning functions and explainable planning.

Abstract: Reasoning with counterfactuals is one of the hallmarks of human cognition, involved in various tasks such as explanation, credit assignment, blame, and responsibility. Counterfactual quantities that are not identifiable in the general nonparametric case may be identified under shape constraints on the functional mechanisms, such as monotonicity. One prominent example of such an approach is the celebrated result by Angrist and Imbens on identifying the Local Average Treatment Effect (LATE) in the instrumental variable setting. In this paper, we study the identification problem of more general settings under monotonicity constraints. We begin by proving the monotonicity reduction lemma, which simplifies counterfactual queries using monotonicity assumptions and facilitates the reduction of a larger class of these queries to interventional quantities. We then extend the existing identification results on Probabilities of Causation (PoCs) and LATE to a broader set of queries and graphs. Finally, we develop an algorithm, M-ID, for identifying arbitrary counterfactual queries from combinations of observational and experimental data, which takes as input a causal diagram with monotonicity constraints. We show that M-ID subsumes the previously known identification results in the literature. We demonstrate the applicability of our results using synthetic and real data.

Abstract: In this paper, we propose a class of faster double adaptive gradient methods to solve nonconvex finitesum optimization problems possibly with nonsmooth regularization by simultaneously using adaptive learning rate and adaptive mini-batch size. Specifically, we first propose a double adaptive stochastic gradient method (i.e., 2AdaSGD), and prove that our 2AdaSGD obtains a low stochastic first-order oracle (SFO) complexity for finding a stationary solution under the population smoothness condition. Furthermore, we propose a variance reduced double adaptive stochastic gradient method (i.e., 2AdaSPIDER), and prove that our 2AdaSPIDER obtains an optimal SFO complexity under the average smoothness condition, which is lower than the SFO complexity of the existing double adaptive gradient algorithms. In particular, we introduce a new stochastic gradient mapping to adaptively adjust mini-batch size in our stochastic gradient methods. We conduct some numerical experiments to verify efficiency of our proposed methods.

Abstract: A wide variety of goals could cause an AI to disable its off switch because ``you can’t fetch the coffee if you’re dead.'' Prior theoretical work on this shutdown problem assumes that humans know everything that AIs do. In practice, however, humans have only limited information. Moreover, in many of the settings where the shutdown problem is most concerning, AIs might have vast amounts of private information. To capture these differences in knowledge, we introduce the Partially Observable OffSwitch Game (POSG), a game-theoretic model of the shutdown problem with asymmetric information. Unlike in the fully observable case, we find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown. As expected, increasing the amount of communication or information available always increases (or leaves unchanged) the agents' expected common payoff. But counterintuitively, introducing bounded communication can make the AI defer to the human less in optimal play even though communication mitigates information asymmetry. Thus, designing safe artificial agents in the presence of asymmetric information requires careful consideration of the tradeoffs between maximizing payoffs (potentially myopically) and maintaining AIs’ incentives to defer to humans.

Abstract: The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond finalanswer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs validity and redundancy to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. We explore different design options for the LLM-based evaluators and empirically demonstrate that ReasonEval, when instantiated with base models possessing strong mathematical knowledge and trained with high-quality labeled data, consistently outperforms baseline methods in the meta-evaluation datasets. We also highlight the strong generalization capabilities of ReasonEval. By utilizing ReasonEval to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We open-source the best-performing model, meta-evaluation script, and all evaluation results to facilitate future research.

Abstract: Community engagement plays a critical role in antipoaching efforts, yet existing mathematical models aimed at enhancing this engagement often overlook direct participation by community members as alternative patrollers. Unlike professional rangers, community members typically lack flexibility and experience, resulting in new challenges in optimizing patrol resource allocation. To address this gap, we propose a novel game-theoretic model for community-participated patrol, where a conservation agency strategically deploys both professional rangers and community members to safeguard wildlife against a best-responding poacher. In addition to a mixed-integer linear program formulation, we introduce a Two-Dimensional Binary Search algorithm and a novel Hybrid Waterfilling algorithm to efficiently solve the game in polynomial time. Through extensive experiments and a detailed case study focused on a protected tiger habitat in Northeast China, we demonstrate the effectiveness of our algorithms and the practical applicability of our model.

Abstract: Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this presentation, I will introduce a family of machine learning tasks that learns language structures through grounding, where distant supervision from other data sources (i.e., grounds), including but not limited to different modalities (e.g., vision), execution results of programs, and other languages, are used to guide the learning of language structures. I will demonstrate the potential of this task formulation, advocate for its adoption through three schemes, and discuss the possibility of the general language learning problem through grounding.

Abstract: Representation learning constructs lowdimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing desiderata like non-spuriousness and demonstrating their practical utility.

Abstract: My research in AI for Science revolves around the development and application of knowledge graphs (KG) and large language models (LLM) for scientific discovery. Leveraging my expertise in AI, I extensively explore disciplinary knowledge, construct knowledge graphs, and develop pretrained large models for chemical and biological research. The overarching goal is to better capture correlations and patterns between substances by incorporating explicit and implicit knowledge bases into pre-trained large models. I have published in top AI journals and conferences, including Nature Machine Intelligence, NeurIPS, AAAI, ICML, and ICLR, and received several prestigious awards such as the Excellent Prize of the Tencent Rhino-Bird Project (2024) and the Great Britain-China Educational Trust (2020). My research has garnered wide recognition, with over 6000 Google Scholar citations and GitHub repositories of my work on knowledge graph-enhanced molecular and protein learning receiving hundreds of stars. By pushing the boundaries of AI for scientific discovery, I aspire to contribute to significant advancements that address pressing global challenges. I am eager to present and share my work at AAAI’s New Faculty Highlight program and engage with fellow researchers at the forefront of AI.

Abstract: Recent advances in visionlanguage models have shown remarkable potential, yet creating scalable systems that can effectively understand and generate across modalities remains challenging. This talk will present our contributions to advancing scalable vision-language systems, focusing on three key themes: (1) efficient vision-language understanding, including our work on temporal perceiving video-language pre-training and knowledge-enhanced zero-shot retrieval; (2) scalable generation frameworks, encompassing our innovations in zero-shot captioning and co-speech gesture generation; and (3) practical applications and deployments of these technologies. We will discuss how these advances have enabled both better performance and improved efficiency in real-world scenarios, and explore future directions for scalable multimodal systems.

Abstract: Explainable reinforcement learning (xRL) provides explanations for ``blackbox" decision making systems. However, most work in xRL is based on single-agent settings instead of the more complex multi-agent reinforcement learning (MARL). Several different types of post-hoc explanations must be provided to increase understanding of both centralized and decentralized MARL systems. For centralized MARL, this research develops methods to generate global policy summaries, query-based explanations, and temporal explanations. For decentralized MARL, this research develops global policy summaries and query-based explanations.

Abstract: The main goal of FewShot learning algorithms is to enable learning from small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). In this paper, we propose a novel framework for Bayesian MAML called BH-MAML, which employs Hypernetworks for weight updates. It learns the universal weights point-wise, but a probabilistic structure is added when adapted for specific tasks. In such a framework, we can use simple Gaussian distributions or more complicated posteriors induced by Continuous Normalizing Flows.

Abstract: A vast amount of textual data is added to the internet daily, making utilization and interpretation of textual data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious time. Although many transformer models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces several novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.

Abstract: Machine unlearning is becoming increasingly important as deep models become more prevalent, particularly when there are frequent requests to remove the influence of specific training data due to privacy concerns or erroneous sensing signals. Spatialtemporal Graph Neural Networks, in particular, have been widely adopted in real-world applications that demand efficient unlearning, yet research in this area remains in its early stages. In this paper, we introduce STEPS, a framework specifically designed to address the challenges of spatio-temporal graph unlearning. Our results demonstrate that STEPS not only ensures data continuity and integrity but also significantly reduces the time required for unlearning, while minimizing the accuracy loss in the new model compared to a model with 0% unlearning.

Abstract: Conformal prediction (CP) has gained prominence as a popular technique for uncertainty quantification in deep neural networks (DNNs), providing statistically rigorous uncertainty sets. However, existing CP methods fail to clarify the origins of predictive uncertainties. While neuronlevel interpretability has been effective in revealing the internal mechanisms of DNNs, explaining CP at the neuron level remains unexplored. Nonetheless, generating neuron explanations for CP is challenging due to the discrete and non-differentiable characteristics of CP, and the labor-intensive process of semantic annotation. To address these limitations, this paper proposes a novel neuron explanation approach for CP by identifying neurons crucial for understanding predictive uncertainties and automatically generating semantic explanations. The effectiveness of the proposed method is validated through both qualitative and quantitative experiments.

Abstract: Causal Inference (CI) plays a crucial role in building unbiased recommender systems. However, most current CIbased debiasing methods only pay attention on either popularity bias or conformity bias. This paper presents a Disentangled Counterfactual Reasoning framework to alleviate dual biases in recommendation, so called DCR. Concretely, we consider the impact of both item popularity and user conformity during training, and separate their indirect effects by disentangling user and item embeddings into biased and unbiased components. In the inference stage, we perform counterfactual reasoning to simultaneously mitigate the indirect and direct effects of bias factors. Experimental results demonstrate the effectiveness of our DCR.

Abstract: VisionLanguage Models (VLMs) bridge the gap between visual and textual data, enabling multimodal tasks like Visual Question Answering (VQA). Leveraging this capability, Medical VQA systems have the potential to transform clinical decision-making by allowing healthcare providers to query medical images—such as X-rays, MRIs, and CT scans—and receive rapid, informed responses, thereby speeding up diagnoses and treatment planning. In this work, we introduce Falcon Med-VQA, a generative VQA system meticulously designed to interpret visual and textual medical data and generate free-form answers to medical questions. By leveraging a vision language model and a dynamic model selection mechanism, Falcon Med-VQA ensures relevance and precision in its responses. The system is equipped with an intuitive user interface that displays top answers with Confidence Scores (CF), enhances explainability through medical terminology extraction, and offers attention map visualizations for improved interpretability. Our experiments demonstrate that Falcon Med-VQA achieves comparable performance against specialized models and outperforms recent generative approaches in a key benchmark.

Abstract: The complexity of the shipping industry, dynamic operational drivers, and diverse data sources present significant scalability challenges for digital twins. Agentic Large Language Models (LLMs) augmented with external tools offer a promising solution to accelerate digital twin adoption. Using pretrained knowledge and reasoning capabilities, these LLMs autonomously select optimal tools and data streams for user-specific queries, enabling language to serve as a universal interface between digital twins and various stakeholders, from technicians to fleet managers. This interface facilitates real-time decision making and insight generation across multiple operational workflows. In this demonstration, we present an interactive agentic digital twin designed to enhance scalability, flexibility, and efficiency in managing the extensive and intricate decision-making requirements of the shipping industry. We showcase the transformative potential of agentic LLMs in reducing complexity and improving the practical application of digital twins, ultimately enabling more efficient operations in real-world settings.

Abstract: Conditioning image generation facilitates seamless editing and the creation of photorealistic images. However, conditioning on noisy or Outof-Distribution (OoD) images poses significant challenges, particularly in balancing fidelity to the input and realism of the output. We introduce Confident Ordinary Differential Editing (CODE), a novel approach for image synthesis that effectively handles OoD guidance images. Utilizing a diffusion model as a generative prior, CODE enhances images through score-based updates along the probability-flow Ordinary Differential Equation (ODE) trajectory. This method requires no task-specific training, handcrafted modules, or assumptions, and is compatible with any diffusion model. Positioned at the intersection of conditional image generation and blind image restoration, CODE operates in a fully blind manner, relying solely on a pre-trained generative model. Our method introduces an alternative approach to blind restoration: instead of targeting a specific ground truth image based on assumptions about the underlying corruption, CODE aims to increase the likelihood of the input image while maintaining fidelity. This results in the most probable in-distribution image around the input. Our contributions are twofold. First, CODE introduces a novel editing method based on ODE providing enhanced control, realism, and fidelity compared to SDE-based counterpart. Second, we introduce a confidence interval-based clipping method, which improves CODE’s effectiveness by allowing it to disregard certain pixels or information, thus enhancing the restoration process in a blind manner. Experimental results demonstrate CODE’s effectiveness over existing methods, particularly in scenarios involving severe degradation or OoD inputs.

Abstract: Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a DisentangledLanguage-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions.

Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, Chengdu, China Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, Complex Laboratory of New Finance and Economics, Southwest University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China Kash Institute of Electronics and Information Industry, China, Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, Chengdu, China, Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Complex Laboratory of New Finance and Economics, Southwest University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China Kash Institute of Electronics and Information Industry, China

Abstract: Personalized federated learning (PFL) is a new paradigm to address the statistical heterogeneity problem in federated learning. Most existing PFL methods focus on leveraging global and local information such as model interpolation or parameter decoupling. However, these methods often overlook the generalization potential during local client learning. From a local optimization perspective, we propose a simple and general PFL method, Federated learning with Flexible SharpnessAware Minimization (FedFSA). Specifically, we emphasize the importance of applying a larger perturbation to critical layers of the local model when using the Sharpness-Aware Minimization (SAM) optimizer. Then, we design a metric, perturbation sensitivity, to estimate the layer-wise sharpness of each local model. Based on this metric, FedFSA can flexibly select the layers with the highest sharpness to employ larger perturbation. Extensive experiments are conducted on four datasets with two types of statistical heterogeneity for image classification. The results show that FedFSA outperforms seven state-of-the-art baselines by up to 8.26% in test accuracy. Besides, FedFSA can be applied to different model architectures and easily integrated into other federated learning methods, achieving a 4.45% improvement.

Abstract: The optimizationbased meta-learning approach is gaining increased traction because of its unique ability to quickly adapt to a new task using only small amounts of data. However, existing optimization-based meta-learning approaches, such as MAML, ANIL and their variants, generally employ backpropagation for upper-level gradient estimation, which requires using historical lower-level parameters/gradients and thus increases computational and memory overhead in each iteration. In this paper, we propose a meta-learning algorithm that can avoid using historical parameters/gradients and significantly reduce memory costs in each iteration compared to existing optimization-based meta-learning approaches. In addition to memory reduction, we prove that our proposed algorithm converges sublinearly with the iteration number of upper-level optimization, and the convergence error decays sublinearly with the batch size of sampled tasks. In the specific case in terms of deterministic meta-learning, we also prove that our proposed algorithm converges to an exact solution. Moreover, we quantify the computational complexity of the algorithm, which matches existing convergence results on meta-learning even without using any historical parameters/gradients. Experimental results on meta-learning benchmarks confirm the efficacy of our proposed algorithm.

Abstract: Traditional recommendation system focus more on the correlations between users and items (useritem relationships), while research on user-user relationships has received significant attention these years, which is also known as social recommendation. Graph-based models have achieved a great success in this task by utilizing the complex topological information of the social networks. However, these models still face the insufficient expressive and overfitting problems. Counterfactual approaches are proven effective as information augmentation strategies towards above issues in various scenarios, but not fully utilized in social recommendations. To this end, we propose a novel social recommendation method, termed SR-GCA, via a plug-and-play Graph-Level Counterfactual Augmentation mechanism. Specifically, we first generate counterfactual social and item links by constructing a counterfactual matrix for data aug- mentation. Then, we employ a supervised learning strategy to refine data both factual and counterfactual links. Thirdly, we enhance representations learning between users via an alignment and self-supervised optimization techniques. Extensive experiments demonstrate the promising capacity of our model from five aspects, including superiority, effectively, transfer- ability, complexity, sensitively. In particular, the transferability is well-proven by extending our GCA module to three typical social recommendation models.

Abstract: Quantum computing promises to revolutionize various fields, yet the execution of quantum programs necessitates an effective compilation process. This involves strategically mapping quantum circuits onto the physical qubits of a quantum processor. The qubits' arrangement, or topology, is pivotal to the circuit's performance, a factor that often defies traditional heuristic or manual optimization methods due to its complexity. In this study, we introduce a novel approach leveraging reinforcement learning to dynamically tailor qubit topologies to the unique specifications of individual quantum circuits, guiding algorithmdriven quantum processor topology design for reducing the depth of mapped circuit, which is particularly critical for the output accuracy on noisy quantum processors. Our method marks a significant departure from previous methods that have been constrained to mapping circuits onto a fixed processor topology. Experiments demonstrate that we have achieved notable enhancements in circuit performance, with a minimum of 20% reduction in circuit depth in 60% of the cases examined, and a maximum enhancement of up to 46%. Furthermore, the pronounced benefits of our approach in reducing circuit depth become increasingly evident as the scale of the quantum circuits increases, exhibiting the scalability of our method in terms of problem size. This work advances the co-design of quantum processor architecture and algorithm mapping, offering a promising avenue for future research and development in the field.

Abstract: Adaptive learning, also known as adaptive teaching, relies on learning path recommendations that sequentially suggest personalized learning items (such as lectures and exercises) to meet the unique needs of each learner. Despite the extensive research in this field, previous approaches have primarily modeled the interaction sequences between learners and items using simple indexing, leading to three issues: (1) The utilization of information from both learners and items is not sufficient. For instance, these models are unable to leverage the semantic information contained within the textual content of the items. (2) Models need to be retrained on different datasets separately, which makes it difficult to adapt to the continuously expanding item pool in online educational scenarios. (3) The existing recommendation paradigm based on trained reinforcement learning frameworks, suffers from unstable recommendation performance in sparse learning logs. To address these challenges, we propose a generalized Generative Agent for Adaptive Learning (GenAL), which integrates educational tools with LLMs' semantic understanding to enable effective and generalizable learning path recommendations across diverse data distributions. Specifically, our framework consists of two components: the Global Thinking Agent, which updates the learner profile and reflects on recommendation outcomes based on the learner's historical learning records. The other is the Local Teaching Agent, which recommends items using educational prior knowledge. Leveraging the LLM's robust semantic understanding, our framework does not rely on item indexing but instead extracts relevant information from the textual content. We evaluated our approach on three realworld datasets, and the experimental results demonstrate that our GenAL not only consistently outperforms all baselines but also exhibits strong generalization ability.

Abstract: Graph neural networks (GNNs) have gained considerable attention in recent years for traffic flow prediction due to their ability to learn spatiotemporal pattern representations through a graph-based message-passing framework. Although GNNs have shown great promise in handling traffic datasets, their deployment in real-life applications has been hindered by scalability constraints arising from high-order message passing. Additionally, the over-smoothing problem of GNNs may lead to indistinguishable region representations as the number of layers increases, resulting in performance degradation. To address these challenges, we propose a new knowledge distillation paradigm termed LightST that transfers spatial and temporal knowledge from a high-capacity teacher to a lightweight student. Specifically, we introduce a spatio-temporal knowledge distillation framework that helps student MLPs capture graph-structured global spatio-temporal patterns while alleviating the over-smoothing effect with adaptive knowledge distillation. Extensive experiments verify that LightST significantly speeds up traffic flow predictions by 5X to 40X compared to state-of-the-art spatio-temporal GNNs, all while maintaining superior accuracy.

Abstract: Spatially Resolved Transcriptomics (SRT) has become an indispensable tool in various fields, including tumor microenvironment identification, neurobiology, and the study of complex tissue architecture. However, the accuracy of these insights is often compromised by noise in spatial transcriptomics data due to technical limitations. While recent advancements in denoising methods have shown some promise, they frequently fall short by neglecting spatial features, overlooking the variability in noise levels among genes, and relying heavily on external histological images for supplementary information. In our study, we propose DUSTED, a DualAttention Enhanced Spatial Transcriptomics Denoiser, designed to address these challenges. Built on a graph autoencoder framework, DUSTED utilizes gene channel attention and graph attention mechanisms to simultaneously consider spatial features and noise variability in gene expression data. Additionally, it integrates the negative binomial distribution with or without zero-inflation, ensuring a more accurate fit for gene expression distributions. Benchmark tests using simulated datasets demonstrate that DUSTED outperforms existing methods. Furthermore, in real-world applications with the HOCWTA and DLPFC datasets, DUSTED excels in enhancing the correlation between gene and protein expression, recovering spatial gene expression patterns, and improving clustering results. These improvements underscore its potential impact on advancing our understanding of tumor microenvironments, neural tissue organization, and other biologically significant areas.

Abstract: Functional decomposition is the process of breaking down a function f into a composition f=g(f_1,...,f_k) of simpler functions f_1,...,f_k belonging to some class F. This fundamental notion can be used to model applications arising in a wide variety of contexts, ranging from machine learning to formal language theory. In this work, we study functional decomposition by leveraging on the notion of functional reconfiguration. In this setting, constraints are imposed not only on the factor functions f_1,...,f_k but also on the intermediate functions arising during the composition process. We introduce a symbolic framework to address functional reconfiguration and decomposition problems. In our framework, functions arising during the reconfiguration process are represented symbolically, using ordered binary decision diagrams (OBDDs). The function g used to specify the reconfiguration process is represented by a Boolean circuit C. Finally, the function class F is represented by a secondorder finite automaton A. Our main result states that functional reconfiguration, and hence functional decomposition, can be solved in fixed-parameter linear time when parameterized by the width of the input OBDD, by structural parameters associated with the reconfiguration circuit C, and by the size of the second-order finite automaton A.

Abstract: Understanding people's social interactions in complex realworld scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.

Abstract: Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision and can help raise awareness of changes that make it possible to better understand the city and its residents. Traditionally, the field of USCD has used supervised methods with small scale datasets. This constrains methods when applied to new cities, as it requires labourintensive labeling processes and forces a priori definitions of relevant change. In this paper we introduce AC-1M the largest USCD dataset by far of over 1.1M images, together with EMPLACE, a self-supervising method to train a Vision Transformer using our adaptive triplet loss. We show EMPLACE outperforms SOTA methods both as a pre-training method for linear fine-tuning as well as a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices - which in turn is indicative of inequity.

Abstract: Two prominent challenges in explainability research involve 1) the nuanced evaluation of explanations and 2) the modeling of missing information through baseline representations. The existing literature introduces diverse evaluation metrics, each scrutinizing the quality of explanations through distinct lenses. Additionally, various baseline representations have been proposed, each modeling the notion of missingness differently. Yet, a consensus on the ultimate evaluation metric and baseline representation remains elusive. This work acknowledges the diversity in explanation metrics and baselines, demonstrating that different metrics exhibit preferences for distinct explanation maps resulting from the utilization of different baseline representations and distributions. To address the diversity in metrics and accommodate the variety of baseline representations in a unified manner, we propose Baseline ExplorationExploitation (BEE) - a path-integration method that introduces randomness to the integration process by modeling the baseline as a learned random tensor. This tensor follows a learned mixture of baseline distributions optimized through a contextual exploration-exploitation procedure to enhance performance on the specific metric of interest. By resampling the baseline from the learned distribution, BEE generates a comprehensive set of explanation maps, facilitating the selection of the best-performing explanation map in this broad set for the given metric. Extensive evaluations across various model architectures showcase the superior performance of BEE in comparison to state-of-the-art explanation methods on a variety of objective evaluation metrics.

Abstract: In the field of graphic design, automating the integration of design elements into a cohesive multilayered artwork not only boosts productivity but also paves the way for the democratization of graphic design. One existing practice is Graphic Layout Generation (GLG), which aims to layout sequential design elements. It has been constrained by the necessity for a predefined correct sequence of layers, thus limiting creative potential and increasing user workload. In this paper, we present Hierarchical Layout Generation (HLG) as a more flexible and pragmatic setup, which creates graphic composition from any-ordered sets of design elements. To tackle the HLG task, we introduce Graphist, the first layout generation model based on large multimodal models. Graphist efficiently reframes the HLG as a sequence generation problem, utilizing RGB-A images as input, outputs a JSON draft protocol, indicating the coordinates, size, and order of each element. We develop multiple evaluation metrics for HLG. Graphist outperforms prior arts and establishes a strong baseline for this field.

Abstract: In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pretrained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.

Abstract: Despite recent advances in UNetbased image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patch merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit in various editing scenarios, highlighting the potential of diffusion transformers for image editing.

Abstract: Neural video compression has recently demonstrated significant potential to compete with conventional video codecs in terms of ratequality performance. These learned video codecs are however associated with various issues related to decoding complexity (for autoencoder-based methods) and/or system delays (for implicit neural representation (INR) based models), which currently prevent them from being deployed in practical applications. In this paper, targeting a practical neural video codec, we propose a novel INR-based coding framework, PNVC, which innovatively combines autoencoder-based and overfitted solutions. Our approach benefits from several design innovations, including a new structural reparameterization-based architecture, hierarchical quality control, modulation-based entropy modeling, and scale-aware positional embedding. Supporting both low delay (LD) and random access (RA) configurations, PNVC outperforms existing INR-based codecs, achieving nearly 35%+ BD-rate savings against HEVC HM 18.0 (LD) - almost 10% more compared to one of the state-of-the-art INR-based codecs, HiNeRV and 5% more over VTM 20.0 (LD), while maintaining 20+ FPS decoding speeds for 1080p content. This represents an important step forward for INR-based video coding, moving it towards practical deployment.

Abstract: This work addresses the task of generalized class discovery (GCD) in instance segmentation. The goal is to discover novel classes and obtain a model capable of segmenting instances of both known and novel categories, given labeled and unlabeled data. Since the real world contains numerous objects with longtailed distributions, the instance distribution for each class is inherently imbalanced. To address the imbalanced distributions, we propose an instance-wise temperature assignment (ITA) method for contrastive learning and class-wise reliability criteria for pseudo-labels. The ITA method relaxes instance discrimination for samples belonging to head classes to enhance GCD. The reliability criteria are to avoid excluding most pseudo-labels for tail classes when training an instance segmentation network using pseudo-labels from GCD. Additionally, we propose dynamically adjusting the criteria to leverage diverse samples in the early stages while relying only on reliable pseudo-labels in the later stages. We also introduce an efficient soft attention module to encode object-specific representations for GCD. Finally, we evaluate our proposed method by conducting experiments on two settings: COCO$_{half}$ + LVIS and LVIS + Visual Genome. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods.

Abstract: Neural Radiance Field (NeRF)based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end joint optimization variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates through the utilization of predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.

Abstract: Pansharpening aims to combine a highresolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Although pansharpening in the frequency domain offers clear advantages, most existing methods either continue to operate solely in the spatial domain or fail to fully exploit the benefits of the frequency domain. To address this issue, we innovatively propose Multi-Frequency Fusion Attention (MFFA), which leverages wavelet transforms to cleanly separate frequencies and enable lossless reconstruction across different frequency domains. Then, we generate Frequency-Query, Spatial-Key, and Fusion-Value based on the physical meanings represented by different features, which enables a more effective capture of specific information in the frequency domain. Additionally, we focus on the preservation of frequency features across different operations. On a broader level, our network employs a wavelet pyramid to progressively fuse information across multiple scales. Compared to previous frequency domain approaches, our network better prevents confusion and loss of different frequency features during the fusion process. Quantitative and qualitative experiments on multiple datasets demonstrate that our method outperforms existing approaches and shows significant generalization capabilities for real-world scenarios.

Abstract: Temporal Action Detection (TAD) is fundamental yet challenging for realworld video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, it has recently been identified that the attention collapse in self-attention causes the performance degradation of DETR for TAD. Building upon previous research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes predictions to restore the collapse and align the cross- and self-attention with predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.

Abstract: Growing customer demand for smart solutions in robotics and augmented reality has attracted considerable attention to 3D object detection from point clouds. Yet, existing indoor datasets taken individually are too small and insufficiently diverse to train a powerful and general 3D object detection model. In the meantime, more general approaches utilizing foundation models are still inferior in quality to those based on supervised training for a specific task. In this work, we propose UniDet3D, a simple yet effective 3D object detection model, which is trained on a mixture of indoor datasets and is capable to work in various indoor environments. By unifying different label spaces, UniDet3D enables learning a strong representation across multiple datasets through a supervised joint training scheme. The proposed network architecture is built upon a vanilla transformer encoder, making it easy to run, customize and extend the prediction pipeline for practical use. Extensive experiments demonstrate that UniDet3D obtains significant gains over existing 3D object detection methods in 6 indoor benchmarks: ScanNet (+1.1 mAP50), S3DIS (+9.1 mAP50), ARKitScenes (+19.4 mAP25), MultiScan (+14.3 mAP50), 3RScan (+3.2 mAP50), and ScanNet++ (+2.7 mAP50).

Abstract: Similar to Language or Image LLMs, VideoLLMs are also plagued by hallucination issues. Hallucinations in videos not only manifest in the spatial dimension regarding the perception of the existence of visual objects (static) but also the temporal dimension influencing the perception of actions and events (dynamic). This paper introduces the concept of Motion Hallucination for the first time, exploring the hallucination phenomena caused by insufficient motion perception capabilities in VideoLMMs, as well as how to detect, evaluate, and mitigate the hallucination. To this end, we propose the first benchmark for assessing motion hallucination MHBench, which consists of 1,200 videos of 20 different action categories. By constructing a collection of adversarial triplet types of videos (original/antonym/incomplete), we achieve a comprehensive evaluation of motion hallucination. Furthermore, we present a Motion Contrastive Decoding (MotionCD) method, which employs bidirectional motion elimination between the original video and its reverse playback to construct an amateur model that removes the influence of motion while preserving visual information, thereby effectively suppressing motion hallucination. Extensive experiments on MHBench reveal that current stateof-the-art VideoLLMs significantly suffer from motion hallucination, while the introduction of MotionCD can effectively mitigate this issue, achieving up to a 15.1% performance improvement. We hope this work will guide future efforts in avoiding and mitigating hallucinations in VideoLLMs.

Abstract: Most techniques approach the problem of image forgery localization as a binary segmentation task, training neural networks to label original areas as 0 and forged areas as 1. In contrast, we tackle this issue from a more fundamental perspective by partitioning images according to their originating sources. To this end, we propose Segment Any Forged Image Region (SAFIRE), which solves forgery localization using point prompting. Each point on an image is used to segment the source region containing itself. This allows us to partition images into multiple source regions, a capability achieved for the first time. Additionally, rather than memorizing certain forgery traces, SAFIRE naturally focuses on uniform characteristics within each source region. This approach leads to more stable and effective learning, achieving superior performance in both the new task and the traditional binary forgery localization.

Abstract: Imagegoal navigation aims to steer an agent towards the goal location specified by an image. Most prior methods tackle this task by learning a navigation policy, which extracts visual features of goal and observation images, compares their similarity and predicts actions. However, if the agent is in a different room from the goal image, it's extremely challenging to identify their similarity and infer the likely goal location, which may result in the agent wandering around. Intuitively, when humans carry out this task, they may roughly compare the current observation with the goal image, having an approximate concept of whether they are in the same room before executing the actions. Inspired by this intuition, we try to imitate human behaviour and propose a Room Expert Guided Image-Goal Navigation model~(REGNav) to equip the agent with the ability to analyze whether goal and observation images are taken in the same room. Specifically, we first pre-train a room expert with an unsupervised learning technique on the self-collected unlabelled room images. The expert can extract the hidden room style information of goal and observation images and predict their relationship about whether they belong to the same room. In addition, two different fusion approaches are explored to efficiently guide the agent navigation with the room relation knowledge. Extensive experiments show that our REGNav surpasses prior state-of-the-art works on three popular benchmarks.

Abstract: Existing semisupervised learning (SSL) approaches follow the idealized closed-world assumption, neglecting the challenges present in realistic medical scenarios, such as open-set distribution and imbalanced class distribution. Although some methods in natural domains attempt to address the open-set problem, they are insufficient for medical domains, where intertwined challenges like class imbalance and small inter-class lesion discrepancies persist. Thus, this paper presents a novel self-recalibrated semantic training framework, which is tailored for SSL in medical imaging by ingeniously harvesting realistic unlabeled samples. Inspired by the observation that certain open-set samples share some similar disease-related representations with in-distribution samples, we first propose an informative sample selection strategy that identifies high-value samples to serve as augmentations, thereby effectively enriching the semantics of known categories. Furthermore, we adopt a compact semantic clustering strategy to address the semantic confusion raised by the above newly introduced open-set semantics. Moreover, to mitigate the interference of class imbalance in open-set SSL, we introduce a less biased dual-balanced classifier with similarity pseudo-label regularization and category-customized regularization. Extensive experiments on a variety of medical image datasets demonstrate the superior performance of our proposed method over state-of-the-art Closed-set and Open-set SSL methods.

Abstract: The labeling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We present that the cooperation between learned semantics from indicated known regions and proper assumed matting rules can help infer alpha values at transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) at each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, the alpha values can be propagated from learned known regions to unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known loss and the proposed DDC loss. Experiments on AM2K and P3M-10K dataset show that our paradigm achieves comparable performance with the fine-label-supervised baseline, while sometimes offers even more satisfying results than human-labeled ground truth.

Abstract: 3D panoptic scene understanding seeks to create novel view images with 3Dconsistent panoptic segmentation, which is crucial for many vision and robotics applications. Mainstream methods (e.g., Panoptic Lifting) directly use machine-generated 2D panoptic segmentation masks as training labels. However, these generated masks often exhibit multi-view inconsistencies, leading to ambiguities during the optimization process. To address this, we present Multi-view Consistent 3D Panoptic Scene Understanding (MVC-PSU), featuring two key components: 1) Probabilistic Semantic Aligner, which associates semantic information of corresponding pixels across multiple views by probabilistic alignment to ensure that predicted panoptic segmentation masks are consistent across different views. 2) Geometric Consistency Enforcer, which uses multi-view projection and monocular depth consistency to ensure that the geometry of the reconstructed scene is accurate and consistent across different views. Experimental results demonstrate that the proposed MVC-PSU surpasses state-of-the-art methods on the ScanNet, Replica, and HyperSim datasets.

Abstract: Human motion generative models have enabled promising applications, but the ability of textto-motion (T2M) models to produce realistic motions raises security concerns if exploited maliciously. Despite growing interest in T2M, limited research focus on safeguarding these models against adversarial attacks, with existing work on text-to-image models proving insufficient for the unique motion domain. In the paper, we propose ALERT-Motion, an autonomous framework that leverages large language models (LLMs) to generate targeted adversarial attacks against black-box T2M models. Unlike prior methods that modify prompts through predefined rules, ALERT-Motion uses the knowledge of LLMs of human motion to autonomously generate subtle yet powerful adversarial text descriptions. It comprises two key modules: an adaptive dispatching module that constructs an LLM-based agent to iteratively refine and search for adversarial prompts; and a multimodal information contrastive module that extracts semantically relevant motion information to guide the agent's search. Through this LLM-driven approach, ALERT-Motion produces adversarial prompts querying victim models to produce outputs closely matching targeted motions, while avoiding obvious perturbations. Evaluations across popular T2M models demonstrate ALERT-Motion's superiority over previous methods, achieving higher attack success rates with stealthier adversarial prompts. This pioneering work on T2M adversarial attacks highlights the urgency of developing defensive measures as motion generation technology advances, urging further research into safe and responsible deployment.

Abstract: Recent visionlanguage generative models still frequently produce outputs misaligned with their inputs, evidenced by object hallucination in captioning and prompt misalignment in the text-to-image generation model. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models in a zero-shot manner or fine-tuned models with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. We also propose F-CLIPScore, which aggregates misaligned attributions with a global alignment score. We evaluate our method on various dense misalignment detection benchmarks, covering various image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength to detect entity-level objects, intangible objects, and attributes that can not be easily detected for existing works. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach.

Abstract: Face parsing refers to the semantic segmentation of human faces into key facial regions such as eyes, nose, hair, etc. It serves as a prerequisite for various advanced applications, including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes are called longtail classes, which are overshadowed by more frequently occurring classes known as head classes. Existing methods, primarily CNN-based, tend to be dominated by head classes during training, resulting in suboptimal representation for long-tail classes. Previous works have largely overlooked the problem of poor segmentation performance of long-tail classes. To address this issue, we propose SegFace, a simple and efficient approach that uses a lightweight transformer-based model which utilizes learnable class-specific tokens. The transformer decoder leverages class-specific tokens, allowing each token to focus on its corresponding class, thereby enabling independent modeling of each class. The proposed approach improves the performance of long-tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low-compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset.

Abstract: The study of enhancing model robustness against adversarial examples has become increasingly critical in the security of deep learning, leading to the development of numerous adversarial defense techniques. While these defense methods have shown promise in mitigating the impact of adversarial perturbations, evaluating their effectiveness remains a critical challenge. The recently introduced AutoAttack technique has been recognized as a standardized method for assessing model robustness. However, the computational demands of the AutoAttack method significantly limits its applicability, underscoring the urgent need for efficient evaluation techniques. To address this challenge, we propose a novel and efficient evaluation framework based on strategic constraint relaxation. Our key insight is that temporarily expanding the adversarial perturbation bounds during the attack process can help discover more effective adversarial examples. Based on this insight, we develop the Constraint Relaxation Attack (CR Attack) method, which systematically relaxes and resets perturbation constraints during optimization. Extensive experiments on 105 robust models show that CR Attack outperforms AutoAttack in both attack success rate and efficiency, reducing forward and backward propagation time by 38.3× and 15.9× respectively. Through comprehensive analysis, we validate that the constraint relaxation mechanism is crucial for the method's effectiveness.

Abstract: From image to video understanding, the capabilities of Multimodal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks are relatively short, which makes them inadequate for effectively evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark to assess the ability of MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video QA formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully automated annotation pipeline using GPT-4o is designed, requiring only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k QAs. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of QAs. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.

Abstract: Point cloud salient object detection has attracted the attention of researchers in recent years. Since existing works do not fully utilize the geometry context of 3D objects, blurry boundaries are generated when segmenting objects with complex backgrounds. In this paper, we propose a geometryaware 3D salient object detection network that explicitly clusters points into superpoints to enhance the geometric boundaries of objects, thereby segmenting complete objects with clear boundaries. Specifically, we first propose a simple yet effective superpoint partition module to cluster points into superpoints. In order to improve the quality of superpoints, we present a point cloud class-agnostic loss to learn discriminative point features for clustering superpoints from the object. After obtaining superpoints, we then propose a geometry enhancement module that utilizes superpoint-point attention to aggregate geometric information into point features for predicting the salient map of the object with clear boundaries. Extensive experiments show that our method achieves new state-of-the-art performance on the PCSOD dataset.

Abstract: Small lesions play a critical role in early disease diagnosis and intervention of severe infections. Popular models often face challenges in segmenting small lesions, as it occupies only a minor portion of an image, while downsampling operations may inevitably lose focus on local features of small lesions. To tackle the challenges, we propose a Small-Size-Sensitive Mamba (S³-Mamba), which promotes the sensitivity to small lesions across three dimensions: channel, spatial, and training strategy. Specifically, an Enhanced Visual State Space block is designed to focus on small lesions through multiple residual connections to preserve local features, and selectively amplify important details while suppressing irrelevant ones through channel-wise attention. A Tensor-based Cross-feature Multi-scale Attention is designed to integrate input image features and intermediate-layer features with edge features and exploit the attentive support of features across multiple scales, thereby retaining spatial details of small lesions at various granularities. Finally, we introduce a novel regularized curriculum learning to automatically assess lesion size and sample difficulty, and gradually focus from easy samples to hard ones like small lesions. Extensive experiments on three medical image segmentation datasets show the superiority of our S³-Mamba, especially in segmenting small lesions.

School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University State Key Laboratory of Communication Content Cognition, School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University, School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University, School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University, College of Computer Science and Technology, Zhejiang University of Technology SGIT AI Lab, State Grid Corporation of China, SGIT AI Lab, State Grid Corporation of China, School of Computer Science and Technology, MOEKLINNS, Xi’an Jiaotong University, China Telecom Corporation Ltd. Data&AI Technology Company, Baidu Inc

Abstract: Textto-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned optimizing stage and a consistent sampling stage. In the optimizing stage, we innovate a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. While in the sampling stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a specified benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.

Abstract: Pixel tracking in singleview video sequences has recently emerged as a significant area of research. While previous work has primarily concentrated on tracking within a given video, we propose to expand pixel correspondence estimation into multi-view scenarios. The central concept involves utilizing a canonical space that preserves a universal 3D representation across different views and timesteps. This model allows for precise tracking of points even through prolonged occlusions and significant deformations in appearance between views. Moreover, we show that our model, through the use of an efficient training strategy incorporating distillation loss, is capable of performing incremental pixel tracking, a process often seen as complex in test-time optimization techniques. Comprehensive experiments validate the method's ability to accurately establish point correspondences across cameras. Furthermore, our method achieves promising results of multi-view pixel tracking without requiring the entire video sequences to be provided at once.

Abstract: Model extraction attacks are one type of inferencetime attacks that approximate the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model's outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called attack as defense which modifies the model's output to be poisonous such that any malicious users that attempt to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while remaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.

Abstract: With the successful transition of Transformers from natural language processing (NLP) to computer vision (CV) domains, Vision Transformers (ViTs) have achieved stateof-the-art performance in many CV tasks. However, backdoor attacks, a significant threat in deep learning, also pose a risk to the security of ViT models. Recently, several backdoor attack methods targeting the patch-level self-attention mechanism in ViTs have been proposed, but they are relatively naive in terms of stealthiness and robustness against defensive measures, lacking in-depth investigation. In this paper, we explore the crucial role of attention-level imperceptibility in backdoor attacks for ViTs and propose an Attention-Imperceptible Backdoor Attacks on Vision Transformers (AIBA). In AIBA, a constrained adversarial perturbation is used as the trigger to achieve visual imperceptibility. Additionally, the trigger is designed to seamlessly implant into the focal areas of the image, ensuring that the trigger receives enough attention from the model without causing anomalies at the attention level. During the backdoor learning process, we designed an efficient constrained bi-level optimization training strategy at the mini-batch level to implant an effective backdoor in the victim model using the imperceptible trigger. We evaluated the effectiveness of the proposed AIBA across multiple datasets and ViT benchmarks and explored the robustness of AIBA against current ViT-specific defense methods. The experimental results demonstrate that our backdoor attack method can successfully implant a powerful and stealthy backdoor into ViTs.

Abstract: Deep denoising models require extensive realworld training data, which is challenging to acquire. Current noise synthesis techniques struggle to accurately model complex noise distributions. We propose a novel Realistic Noise Synthesis Diffusor (RNSD) method using diffusion models to address these challenges. By encoding camera settings into a time-aware camera-conditioned affine modulation (TCCAM), RNSD generates more realistic noise distributions under various camera conditions. Additionally, RNSD integrates a multi-scale content-aware module (MCAM), enabling the generation of structured noise with spatial correlations across multiple frequencies. We also introduce Deep Image Prior Sampling (DIPS), a learnable sampling sequence based on depth image prior, which significantly accelerates the sampling process while maintaining the high quality of synthesized noise. Extensive experiments demonstrate that our RNSD method significantly outperforms existing techniques in synthesizing realistic noise under multiple metrics and improving image denoising performance.

School of Artificial Intelligence, Anhui University, Hefei, China Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei, China Information Materials and Intelligent Sensing Laboratory of Anhui Province, Hefei, China, School of Artificial Intelligence, Anhui University, Hefei, China, School of Computer Science and Technology, Anhui University, Hefei, China, School of Artificial Intelligence, Anhui University, Hefei, China Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei, China Information Materials and Intelligent Sensing Laboratory of Anhui Province, Hefei, China, iFLYTEK CO.LTD., Hefei, China, iFLYTEK CO.LTD., Hefei, China, iFLYTEK CO.LTD., Hefei, China

Abstract: Existing Transformerbased RGBT trackers achieve remarkable performance benefits by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and search-template correlation. Nevertheless, the independent search-template correlation calculations are prone to be affected by low-quality data, which might result in contradictory and ambiguous correlation weights. It not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which innovatively integrates inter-modality interaction into the search-template correlation computation within typical attention mechanism, for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed correlation modulated enhancement module, which can modify inaccurate correlation weights by seeking the consensus between modalities. Such kind of design unifies self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates redundant computation introduced by extra cross-attention scheme. In addition, we design a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.

Abstract: Event cameras, which capture pixellevel brightness changes asynchronously, provide rich motion information that is often missed during traditional frame-based camera exposures, thereby offering fresh perspectives for motion deblurring. Although current approaches incorporate event intensity, they neglect essential spatial motion information. Unlike their CNN architectures, Transformers excel in modeling long-range dependencies but struggle with establishing relevant non-local connections in sparse events and fail to highlight significant interactions in dense images. To address these limitations, we introduce a Motion-Adaptive Transformer network (MAT) that utilizes spatial motion information to forge robust global connections. The core design is an Adaptive Motion Mask Predictor (AMMP) that identifies key motion regions, guiding the Motion-Sparse Attention (MSA) to eliminate irrelevant event tokens and enabling the Motion-Aware Attention (MAA) to focus on relevant ones, thereby enhancing long-range dependency modeling. Additionally, we elaborately design a Cross-Modal Intensity Gating mechanism that efficiently merges intensity data across modalities while minimizing parameter use. The learnable Expansion-Controlled Spatial Gating further optimizes the transmission of event features. Comprehensive testing confirms that our approach sets a new benchmark in image deblurring, surpassing previous methods by up to 0.60dB on the GoPro dataset, 1.04dB on the HS-ERGB dataset, and achieving an average improvement of 0.52dB across two real-world datasets.

Abstract: In the domain of spacetime video super-resolution, it is typically challenging to handle complex motions (including large and nonlinear motions) and varying illumination scenes due to the lack of inter-frame information. Leveraging the dense temporal information provided by event signals offers a promising solution. Traditional event-based methods typically rely on multiple images, using motion estimation and compensation, which can introduce errors. Accumulated errors from multiple frames often lead to artifacts and blurriness in the output. To mitigate these issues, we propose EvSTVSR, a method that uses fewer adjacent frames and integrates dense temporal information from events to guide alignment. Additionally, we introduce a coordinate-based feature fusion upsampling module to achieve spatial super-resolution. Experimental results demonstrate that our method not only outperforms existing RGB-based approaches but also excels in handling large motion scenarios.

Abstract: In the field of classagnostic counting (CAC), counting only objects of interest that are similar to exemplars in multi-class scenarios has been a challenging task. To address this challenge, recent research has proposed the extract-and-match paradigm based on the vision transformer (ViT) architecture. However, although this paradigm can improve the accuracy of exemplar-similar object identification, it overly emphasizes the role of the ViT structure. To address this shortcoming, this work introduces a more generalized prompt-before-extract paradigm on top of the extract-and-match paradigm and designs a pure convolutional neural network (CNN) model named PBECount. In addition, an innovative loss function, a post-processing strategy, and a dynamic threshold method are proposed to enhance the detection performance of the proposed model when the probability maps are used as ground truth during model training. The experimental results on the FSC-147 and CARPK datasets demonstrate that the proposed PBECount can identify whether unknown class objects are similar to exemplars and outperform the state-of-the-art CAC methods in terms of accuracy and generalization.

Abstract: Cooperatively utilizing both egovehicle and infrastructure sensor data via V2X communication has emerged as a promising approach for advanced autonomous driving. However, current research mainly focuses on improving individual modules, rather than taking end-to-end learning to optimize final planning performance, resulting in underutilized data potential. In this paper, we introduce UniV2X, a pioneering cooperative autonomous driving framework that seamlessly integrates all key driving modules across diverse views into a unified network. We propose a sparse-dense hybrid data transmission and fusion mechanism for effective vehicle-infrastructure cooperation, offering three advantages: 1) Effective for simultaneously enhancing agent perception, online mapping, and occupancy prediction, ultimately improving planning performance. 2) Transmission-friendly for practical and limited communication conditions. 3) Reliable data fusion with interpretability of this hybrid data. We implement UniV2X, as well as reproducing several benchmark methods, on the challenging DAIR-V2X, the real-world cooperative driving dataset. Experimental results demonstrate the effectiveness of UniV2X in significantly enhancing planning performance, as well as all intermediate output performance.

Abstract: Textbased person search aims at locating a person described by natural language in uncropped scene images. Recent works for TBPS mainly focus on aligning multi-granularity vision and language representations, neglecting a key discrepancy between training and inference where the former learns to unify vision and language features where the visual side covers all clues described by language, yet the latter matches image-text pairs where the images may capture only part of the described clues due to perturbations such as occlusions, background clutters and misaligned boundaries. To alleviate this issue, we present ViPer: a Visual Perturbation network that learns to match language descriptions with perturbed visual clues. On top of a CLIP-driven baseline, we design three visual perturbation modules: (1) Spatial ViPer that varies person proposals and produces visual features with misaligned boundaries, (2) Attentive ViPer that estimates visual attention on the fly and manipulates attentive visual tokens within a proposal to produce global features under visual perturbations, and (3) Fine-grained ViPer that learns to recover masked visual clues from detailed language descriptions to encourage matching language features with perturbed visual features at the fine granularity. This overall framework thus simulates real-world scenarios at the training stage to minimize the discrepancy and improve the generalization ability of the model. Experimental results demonstrate that the proposed method clearly surpasses previous TBPS methods on the PRW-TBPS and CUHK-SYSU-TBPS datasets.

University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences State Key Laboratory of Multimodal Artificial Intelligence Systems New Laboratory of Pattern Recognition, Centre for Artificial Intelligence and Robotics, Centre for Artificial Intelligence and Robotics, China University of Geoscience Beijing, University of Chinese Academy of Sciences, Shandong University, University of Science and Technology Beijing, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences State Key Laboratory of Multimodal Artificial Intelligence Systems New Laboratory of Pattern Recognition Centre for Artificial Intelligence and Robotics

Abstract: Developing comprehensive explicit world models is crucial for understanding and simulating realworld scenarios. Recently, Procedural Controllable Generation (PCG) has gained significant attention in large-scale scene generation by enabling the creation of scalable, high-quality assets. However, PCG faces challenges such as limited modular diversity, high expertise requirements, and challenges in managing the diverse elements and structures in complex scenes. In this paper, we introduce a large-scale scene generation framework, SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions. Specifically, the proposed method comprises two components, PCGHub and PCGPlanner. The former encompasses an extensive collection of accessible procedural assets and thousands of hand-craft API documents to perform as a standard protocol for PCG controller. The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions. Extensive experiments demonstrated the capability of our method in controllable large-scale scene generation, including nature scenes and unbounded cities, as well as scene editing such as asset placement and season translation.

Abstract: Learned image lossy compression techniques have surpassed traditional methods in both subjective vision and quantitative evaluation. However, current models are only applicable to threechannel image formats, limiting their practical application due to the diversity and complexity of image formats. We propose a high-performance learned image compression model for general image formats. We first introduce a transfer method to unify any-channel image formats, enhancing the applicability of neural networks. This method's effectiveness is demonstrated through image information entropy and image homomorphism theory. Then, we introduce an adaptive attention residual block into the entropy model to give it better generalization ability. Meanwhile, we propose an evenly grouped cross-channel context module for progressive preview image decoding. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) in the field of learned image compression in terms of PSNR and MS-SSIM. This work extends the applicability of learned image compression techniques to more practical production environments.

Abstract: This paper introduces a SATbased technique that calculates a compact and complete symmetry-break for finite model finding, with the focus on structures with a single binary operation (magmas). Classes of algebraic structures are typically described as first-order logic formulas and the concrete algebras are models of these formulas. Such models include an enormous number of isomorphic, i.e. symmetric, algebras. A complete symmetry-break is a formula that has as models, exactly one canonical representative from each equivalence class of algebras. Thus, we enable answering questions about properties of the models so that computation and search are restricted to the set of canonical representations. For instance, we can answer the question: How many non-isomorphic semigroups are there of size n? Such questions can be answered by counting the satisfying assignments of a SAT formula, which already filters out non-isomorphic models. The introduced technique enables us calculating numbers of algebraic structures not present in the literature and going beyond the possibilities of pure enumeration approaches.

Abstract: The fundamental caching problem in networks asks to find an allocation of contents to a network of caches with the aim of maximizing the cache hit rate. Despite the problem's importance to a variety of research areas including not only content delivery, but also edge intelligence and inference - and the extensive body of work on empirical aspects of caching, very little is known about the exact boundaries of tractability for the problem beyond its general NP-hardness. We close this gap by performing a comprehensive complexity-theoretic analysis of the problem through the lens of the parameterized complexity paradigm, which is designed to provide more precise statements regarding algorithmic tractability than classical complexity. Our results include algorithmic lower and upper bounds which together establish the conditions under which the caching problem becomes tractable.

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities in the domain of program synthesis. This level of performance is not, however, universal across all tasks, all LLMs and all prompting styles. There are many areas where one LLM dominates, one prompting style dominates, or where calling a symbolic solver is a better choice than an LLM. A key challenge for the user then, is to identify not only when an LLM is the right choice of solver, and the appropriate LLM to call for a given synthesis task, but also the right way to call it. A nonexpert user who makes the wrong choice, incurs a cost both in terms of results (number of tasks solved, and the time it takes to solve them) and financial cost, if using a closed-source language model via a commercial API. We frame this choice as an online learning problem. We use a multi-armed bandit algorithm to select which symbolic solver, or LLM and prompt combination to deploy in order to maximize a given reward function (which may prioritize solving time, number of synthesis tasks solved, or financial cost of solving). We implement an instance of this approach, called \name, and evaluate it on synthesis queries from the literature in ranking function synthesis, from the syntax-guided synthesis competition, and fresh, unseen queries generated from SMT problems. Cyanea solves 37.2 % more queries than the best single solver and achieves results within 4 % of the virtual best solver.

Abstract: Constraint Acquisition (CA) aims to widen the use of constraint programming by assisting users in the modeling process. However, most CA methods suffer from a significant drawback: they learn a single set of individual constraints for a specific problem instance, but cannot generalize these constraints to the parameterized constraint specifications of the problem. In this paper, we address this limitation by proposing GenCon, a novel approach to learn parameterized constraint models capable of modeling varying instances of the same problem. To achieve this generalization, we make use of statistical learning techniques at the level of individual constraints. Specifically, we propose to train a classifier to predict, for any possible constraint and parameterization, whether the constraint belongs to the problem. We then show how, for some classes of classifiers, we can extract decision rules to construct interpretable constraint specifications. This enables the generation of ground constraints for any parameter instantiation. Additionally, we present a generateand-test approach that can be used with any classifier, to generate the ground constraints on the fly. Our empirical results demonstrate that our approach achieves high accuracy and is robust to noise in the input instances.

Abstract: Longer historical behaviors often improve recommendation accuracy but bring efficient problems. As sequences get longer, the following two main challenges have not been addressed: (1) efficient modeling under increasing sequence length and (2) interest drifting within historical items. In this paper, we propose Iterative Sparse Attention for Longsequence Recommendation (ISA) with Sparse Attention Layer and Iterative Attention Layer to efficiently capture sequential pattern and expand the receptive field of each historical items. We take the pioneering step to address the efficient and interest drifting challenges for the long-sequence recommendation simultaneously. The theoretical analysis illustrates that our proposed iterative method can approximate full attention efficiently. Experiments on two real-world datasets show the superiority of our proposed method against state-of-the-art baselines.

Abstract: Existing sequential recommendation models are mostly based on sequential models, which can be misled by inconsistent items in the local sequence. This study proposes GlobalDiff, a plugand-play framework to enhance the performance of sequential models by utilizing a diffusion model to restore the global non-sequential data structure of the item universe and compensate for the local sequential context. Several novel techniques are proposed, including training construction, guided reverse approximator, and inference ensemble, to seamlessly integrate the diffusion model with the sequential model. Extensive experiments on various datasets demonstrate that GlobalDiff can enhance advanced sequential models by an average improvement of 9.67%.

Abstract: In today’s informationrich era, users rely heavily on recommender systems to identify relevant content. Graph structures, renowned for their ability to model intricate user-content relationships, have become essential to these systems. However, the accuracy of recommendations hinges critically on the quality of node representations within these graphs. Personalized recommendations strive to enhance uniqueness by maximizing the dissimilarity between representations (known as uniformity) while simultaneously ensuring that the representations align closely with the content users engage with (dubbed as alignment). Nevertheless, balancing these conflicting objectives remains a challenge for optimal recommendation performance. To tackle these challenges, we propose an innovative approach called SIURec, which differs significantly from previous studies. Rather than relying on manual weight selection between uniformity and alignment and optimizing uniformity solely on the final representation, SIURec adopts an adaptive adjustment method that learns the optimal weight between uniformity and alignment automatically. By optimizing uniformity at every convolutional layer, SIURec captures users’ sub-interests more effectively, ultimately leading to improved recommendation accuracy. Experimental results on four datasets demonstrate that SIURec achieves superior learning of uniformity (with an average improvement of 4.26% in accuracy compared to eleven SOTA methods) and exhibits robustness across different hyperparameter settings.

Abstract: We study the problem of robust influence maximization in dynamic diffusion networks. In line with recent works, we consider the scenario where the network can undergo insertion and removal of nodes and edges, in discrete time steps, and the influence weights are determined by the features of the corresponding nodes and a global hyperparameter. Given this, our goal is to find, at every time step, the seed set maximizing the worstcase influence spread across all possible values of the hyperparameter. We propose an approximate solution using multiplicative weight updates and a greedy algorithm, with theoretical quality guarantees. Our experiments validate the effectiveness and efficiency of the proposed methods.

Abstract: Multimodal recommendation (MMRec) aims to integrate multimodal information of items to address the inherent data sparsity issue in collaborativebased recommendation. Traditional MMRec methods typically capture the structure-level item representations from the observed user behaviors within the multimodal graph, overlooking the potential impact of negative instances for personalized preference understanding. In light of the outstanding generative ability and step-by-step inference characteristic of Diffusion Models (DMs), we propose a Curriculum Conditioned Diffusion framework for Multimodal Recommendation (CCDRec), which precisely excavates the modality-aware distribution-level correlation among multi-modalities and elegantly integrates the reverse phase of DMs into negative sampling to highlight the most suitable instances in a curricular manner. Specifically, CCDRec proposes the Diffusion-controlled Multimodal Aligning module (DMA) to align multimodal knowledge with collaborative signals by capturing the fine-grained relationships among multi-modalities in the probabilistic distribution space. Furthermore, CCDRec designs the Negative-sensitive Diffusive Inferring module (NDI) to progressively synthesize the negative sample pool with diverse hardness to support the following knowledge-aware negative sampling. To gradually ramp up the training complexity, CCDRec further introduces a Curricular Negative Sampler (CNS) to tally the curriculum learning paradigm with the reverse phase of DMA, thereby adaptively sampling the gold-standard negative instances to enhance optimization. Extensive experiments on three datasets with four diverse backbones demonstrate the effectiveness and robustness of our CCDRec. The visualization analyses also clarify the underlying mechanism of our DMA in multimodal representation alignment and CNS in curricular negative discovery. The code and the corresponding dataset will be uploaded in the Appendix.

Abstract: Graph classification is a pivotal challenge in machine learning, especially within the realm of graphbased data, given its importance in numerous real-world applications such as social network analysis, recommendation systems, and bioinformatics. Despite its significance, graph classification faces several hurdles, including adapting to diverse prediction tasks, training across multiple target domains, and handling small-sample prediction scenarios. Current methods often tackle these challenges individually, leading to fragmented solutions that lack a holistic approach to the overarching problem. In this paper, we propose an algorithm aimed at addressing the aforementioned challenges. By incorporating insights from various types of tasks, our method aims to enhance adaptability, scalability, and generalizability in graph classification. Motivated by the recognition that the underlying subgraph plays a crucial role in GNN prediction, while the remainder is task-irrelevant, we introduce the Core Knowledge Learning (CKL) framework for graph adaptation and scalability learning. CKL comprises several key modules, including the core subgraph knowledge submodule, graph domain adaptation module, and few-shot learning module for downstream tasks. Each module is tailored to tackle specific challenges in graph classification, such as domain shift, label inconsistencies, and data scarcity. By learning the core subgraph of the entire graph, we focus on the most pertinent features for task relevance. Consequently, our method offers benefits such as improved model performance, increased domain adaptability, and enhanced robustness to domain variations. Experimental results demonstrate significant performance enhancements achieved by our method compared to state-of-the-art approaches. Specifically, our method achieves notable improvements in accuracy and generalization across various datasets and evaluation metrics, underscoring its effectiveness in addressing the challenges of graph classification.

Abstract: User simulators can rapidly generate a large volume of timely user behavior data, providing a testing platform for reinforcement learningbased recommender systems, thus accelerating their iteration and optimization. However, prevalent user simulators generally suffer from significant limitations, including the opacity of user preference modeling and the incapability of evaluating simulation accuracy. In this paper, we introduce an LLM-powered user simulator to simulate user engagement with items in an explicit manner, thereby enhancing the efficiency and effectiveness of reinforcement learning-based recommender systems training. Specifically, we identify the explicit logic of user preferences, leverage LLMs to analyze item characteristics and distill user sentiments, and design a logical model to imitate real human engagement. By integrating a statistical model, we further enhance the reliability of the simulation, proposing an ensemble model that synergizes logical and statistical insights for user interaction simulations. Capitalizing on the extensive knowledge and semantic generation capabilities of LLMs, our user simulator faithfully emulates user behaviors and preferences, yielding high-fidelity training data that enrich the training of recommendation algorithms. We establish quantifying and qualifying experiments on five datasets to validate the simulator's effectiveness and stability across various recommendation scenarios.

Abstract: The takeaway recommendation system aims to recommend users' future takeaway purchases based on their historical purchase behaviors, thereby improving user satisfaction and boosting merchant sales. Existing methods focus on incorporating auxiliary information or leveraging knowledge graphs to alleviate the sparsity issue of user purchase sequences. However, two main challenges limit the performance of these approaches: (1) capturing dynamic user preferences on complex geospatial information and (2) efficiently integrating spatialtemporal knowledge from both graphs and sequence data with low computational costs. In this paper, we propose a novel spatial-temporal knowledge distillation model for takeaway recommendation (STKDRec) based on the two-stage training process. Specifically, during the first pre-training stage, a spatial-temporal knowledge graph (STKG) encoder is trained to extract high-order spatial-temporal dependencies and collaborative associations from the STKG. During the second spatial-temporal knowledge distillation (STKD) stage, a spatial-temporal Transformer (ST-Transformer) is employed to comprehensively model dynamic user preferences on various types of fine-grained geospatial information from a sequential perspective. Furthermore, the STKD strategy is introduced to transfer graph-based spatial-temporal knowledge to the ST-Transformer, facilitating the adaptive fusion of rich knowledge derived from both the STKG and sequence data while reducing computational overhead. Extensive experiments on three real-world datasets show that STKDRec significantly outperforms the state-of-the-art baselines.

Abstract: In this paper, we consider the classic fair division problem of allocating m divisible items to n agents with linear valuations over the items. We define novel notions of fair shares from the perspective of individual agents via the cakecutting process. These shares generalize the notion of proportionality by taking into account the valuations of other agents via constraints capturing envy. We study what fraction (approximation) of these shares are achievable in the worst case, and present tight and non-trivial approximation bounds as a function of n and m. In particular, we show a tight approximation bound of Θ(√n) for various notions of such shares. We show this bound via a novel application of dual fitting, which may be of independent interest. We also present a bound of O(m^(2/3)) for a strict notion of share, with an almost matching lower bound. We further develop weaker notions of shares whose approximation bounds interpolate smoothly between proportionality and the shares described above. We finally present empirical results showing that our definitions lead to more reasonable shares than the standard fair share notion of proportionality.

Abstract: Participatory budgeting (PB) is an increasingly popular tool for democratically allocating limited budgets to publicgood projects. In PB, constituents vote on their preferred projects via ballots, and then an aggregation rule selects a set of projects whose total cost fits within the budget. Recent work studies how to design PB ballots and aggregation rules that yield low-distortion outcomes (informally, outcomes with high social welfare). Existing distortion bounds, however, rely on strong assumptions that restrict voters' latent utilities. We prove that low distortion PB outcomes can be achieved by dropping these assumptions and instead leveraging the established idea that voters can be public-spirited: they may consider others' interests alongside their own when voting. Flanigan, Procaccia, and Wang (2023) prove that in public-spirited single-winner voting (the special case of PB where exactly one project can be funded) with ranking ballots, deterministic aggregation rules can achieve constant distortion. Our first contribution is to extend this analysis to PB; there, we prove that the best distortion permitted by deterministic rules with ranking ballots grows linearly in the number of projects. We find that this impossibility --- a problem in practice, where m is often large --- holds for other known ballots as well. Our second contribution is the design of a new PB ballot format that breaks this linear distortion barrier. This ballot asks voters to rank a predetermined set of entire feasible bundles of projects. We design multiple protocols for implementing these ballots, each striking a different trade-off between the number of bundles voters must rank and the distortion: with m bundles, we get sublinear distortion; with polynomial bundles, we get logarithmic distortion; and with pseudopolynomial bundles, we get constant distortion.

Abstract: The predominant setting in classic auction theory considers bidders as utility maximizers (UMs), who aim to maximize quasilinear utility functions. Recent autobidding strategies in online advertising have sparked interest in auction design with value maximizers (VMs), who aim to maximize the total value obtained. In this work, we investigate revenue-maximizing auction design for selling a single item to a mix of UMs and VMs. Crucially, we assume the UM/VM type is private information of a bidder. This shift to a multi-parameter domain complicates the design of incentive compatible mechanisms. Under this setting, we first characterize the optimal auction structure for auctions with a single bidder. We observe that the optimal auction moves gradually from a first-price auction to a Myerson auction as the probability of the bidder being a UM increases from 0 to 1. We also extend our study to multi-bidder setting and present an algorithm for deriving the optimal lookahead auction with multiple mixed types of bidders.

Abstract: An important but very demanding - property in collective decision-making is strategyproofness, which requires that voters cannot benefit from submitting insincere preferences. Gibbard (1977) has shown that only rather unattractive rules are strategyproof, even when allowing for randomization. However, Gibbard's theorem is based on a rather strong interpretation of strategyproofness, which deems a manipulation successful if it increases the voter's expected utility for at least one utility function consistent with his ordinal preferences. In this paper, we study weak strategyproofness, which deems a manipulation successful if it increases the voter's expected utility for all utility functions consistent with his ordinal preferences. We show how to systematically design attractive, weakly strategyproof social decision schemes (SDSs) and explore their limitations for both strict and weak preferences. In particular, for strict preferences, we show that there are weakly strategyproof SDSs that are either ex post efficient or Condorcet-consistent, while neither even-chance SDSs nor pairwise SDSs satisfy both properties and weak strategyproofness at the same time. By contrast, for the case of weak preferences, we discuss two sweeping impossibility results that preclude the existence of appealing weakly strategyproof SDSs.

Abstract: We study the explorationexploitation trade-off for large multiplayer coordination games where players strategise via Q-Learning, a common learning framework in multi-agent reinforcement learning. Q-Learning is known to have two shortcomings, namely non-convergence and potential equilibrium selection problems, when there are multiple fixed points, called Quantal Response Equilibria (QRE). Furthermore, whilst QRE have full support for finite games, it is not clear how Q-Learning behaves as the game becomes large. In this paper, we characterise the critical exploration rate that guarantees convergence to a unique fixed point, addressing the two shortcomings above. Using a generating-functional method, we show that this rate increases with the number of players and the alignment of their payoffs. For many-player coordination games with perfectly aligned payoffs, this exploration rate is roughly twice that of p-player zero-sum games. As for large games, we provide a structural result for QRE, which suggests that as the game size increases, Q-Learning converges to a QRE near the boundary of the simplex of the action space, a phenomenon we term asymptotic extinction, where a constant fraction of the actions are played with zero probability at a rate o(1/N) for an N -action game.

Abstract: In the recently introduced model of fair partitioning of friends, there is a set of agents located on the vertices of an underlying graph that indicates the friendships between the agents. The task is to partition the graph into k balancedsized groups, keeping in mind that the value of an agent for a group is equal to the number of edges they have in that group. The goal is to construct partitions that are "fair", i.e., no agent would like to replace an agent in a different group. We generalize the standard model by considering utilities for the agents that are beyond binary and additive. Having this as our foundation, our contribution is threefold: (a) we adapt several fairness notions that have been developed in the fair division literature to our setting; (b) we give several existence guarantees supported by polynomial-time algorithms; (c) we initiate the study of the computational (and parameterized) complexity of the model and provide an almost complete landscape of the (in)tractability frontier for our fairness concepts.

Abstract: Two prominent objectives in social choice are utilitarian maximizing the sum of agents' utilities, and leximin - maximizing the smallest agent's utility, then the second-smallest, etc. Utilitarianism is typically computationally easier to attain but is generally viewed as less fair. This paper presents a general reduction scheme that, given a utilitarian solver, produces a distribution over states (deterministic outcomes) that is leximin in expectation. Importantly, the scheme is robust in the sense that, given an approximate utilitarian solver, it produces a lottery that is approximately-leximin (in expectation) - with the same approximation factor. We apply our scheme to several social choice problems: stochastic allocations of indivisible goods, giveaway lotteries, and fair lotteries for participatory budgeting.

Abstract: By classic results in social choice theory, any reasonable preferential voting method sometimes gives individuals an incentive to report an insincere preference. The extent to which different voting methods are more or less resistant to such strategic manipulation has become a key consideration for comparing voting methods. Here we measure resistance to manipulation by whether neural networks of varying sizes can learn to profitably manipulate a given voting method in expectation, given different types of limited information about how other voters will vote. We trained over 100,000 neural networks of 26 sizes to manipulate against 8 different voting methods, under 6 types of limited information, in committeesized elections with 5-21 voters and 3-6 candidates. We find that some voting methods, such as Borda, are highly manipulable by networks with limited information, while others, such as Instant Runoff, are not, despite being quite profitably manipulated by an ideal manipulator with full information. For the three probability models for elections that we use, the overall least manipulable of the 8 methods we study are Condorcet methods, namely Minimax and Split Cycle.

Abstract: We initiate the study of matching roommates and rooms wherein the preferences of agents over other agents and rooms are complementary and represented by Leontief utilities. In this setting, 2n agents must be paired up and assigned to n rooms. Each agent has cardinal valuations over the rooms as well as compatibility values over all other agents. Under Leontief preferences, an agent’s utility for a matching is the minimum of the two values. We focus on the tradeoff between maximizing utilitarian social welfare and strategyproofness. Our main result shows that—in a stark contrast to the additive case— under binary Leontief utilities, there exist strategyproof mechanisms that maximize the social welfare. We further devise a strategyproof mechanism that implements such a welfare maximizing algorithm and is parameterized by the number of agents. Along the way, we highlight several possibility and impossibility results, and give upper bounds and lower bounds for welfare with or without strategyproofness.

Abstract: Envyfreeness up to any good (EFX) is a popular and important fairness property in the fair allocation of indivisible goods, of which its existence in general is still an open question. In this work, we investigate the problem of determining the minimum number of EFX allocations for a given instance, arguing that this approach may yield valuable insights into the existence and computation of EFX allocations. We focus on restricted instances where the number of goods slightly exceeds the number of agents, and extend our analysis to weighted EFX (WEFX) and a novel variant of EFX for general monotone valuations, termed EFX+. In doing so, we identify the transition threshold for the existence of allocations satisfying these fairness notions. Notably, we resolve open problems regarding WEFX by proving polynomial-time computability under binary additive valuations, and establishing the first constant-factor approximation for two agents.

Abstract: In the kcommittee election problem, we wish to aggregate the preferences of n agents over a set of alternatives and select a committee of k alternatives that minimizes the cost incurred by the agents. While we typically assume that agent preferences are captured by a cardinal utility function, in many contexts we only have access to ordinal information, namely the agents' rankings over the outcomes. As preference rankings are not as expressive as cardinal utilities, a loss of efficiency is inevitable, and is quantified by the notion of distortion. We study the problem of electing a k-committee that minimizes the sum of the \ell-largest costs incurred by the agents, when agents and candidates are embedded in a metric space. This problem is called the \ell-centrum problem and captures both the utilitarian and egalitarian objectives. When k >= 2, it is not possible to compute a bounded-distortion committee using purely ordinal information. We develop the first algorithms (that we call mechanisms) for the \ell-centrum problem (when k >= 2), which achieve O(1)-distortion while eliciting only a very limited amount of cardinal information via value queries. We obtain two types of query-complexity guarantees: O(log k log n) queries per agent, and O(k^2 log^2 n) queries in total (while achieving O(1)-distortion in both cases). En route, we give a simple adaptive-sampling algorithm for the \ell-centrum k-clustering problem.

Abstract: Ensuring annotator quality in training and evaluation data is a key piece of machine learning in NLP. Tasks such as sentiment analysis and offensive speech detection are intrinsically subjective, creating a challenging scenario for traditional quality assessment approaches because it is hard to distinguish disagreement due to poor work from that due to differences of opinions between sincere annotators. With the goal of increasing diverse perspectives in annotation while ensuring consistency, we propose ARTICLE, an incontext learning (ICL) framework to estimate annotation quality through self-consistency. We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality.

Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences

Abstract: Social media platforms like X(Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing significant challenges to platform regulation. To better understand and counter these threats, we innovatively design BotSim, a malicious social botnet simulation powered by LLM. BotSim mimics the information dissemination patterns of realworld social networks, creating a virtual environment composed of intelligent agent bots and real human users. In the temporal simulation constructed by BotSim, these advanced agent bots autonomously engage in social interactions such as posting and commenting, effectively modeling scenarios of information flow and user interaction. Building on the BotSim framework, we construct a highly human-like, LLM-driven bot dataset called BotSim-24 and benchmark multiple bot detection strategies against it. The experimental results indicate that detection methods effective on traditional bot datasets perform worse on BotSim-24, highlighting the urgent need for new detection strategies to address the cybersecurity threats posed by these advanced bots.

Abstract: We extend belief revision theory from propositional logic to the modal logic S5. Our first contribution takes the form of three new postulates (M1M3) that go beyond the AGM ones and capture the idea of minimal change in the presence of modalities. Concerning the construction of modal revision operations, we work with set pseudo-distances, i.e., distances between sets of points that may violate the triangle-inequality. Our second contribution is the identification of three axioms (A3-A5) that go beyond the standard axioms of metrics. Loosely speaking, our main result states the following: if a pseudo-distance satisfies certain axioms, then the induced revision operation satisfies (M1-M3). We investigate three pseudo-distances from the literature (Dhaus, Dinj, Dsum), and the three induced revision operations (*Haus, *Inj, *Sum). Using our main result, we show that only *Sum satisfies (M1-M3) all together. As a last contribution, we revisit a major criticism of AGM operations, namely that the revisions of (p ∧ q) and (p ∧ (p → q)) are identical. We show that the problem disappears if instead of material implication we use the modal operator of strict implication that can be defined in S5.

Abstract: Knowledge refactoring compresses logic programs by replacing them with new rules. Current approaches struggle to scale to large programs. To overcome this limitation, we introduce a constrained optimisation refactoring approach. Our first key idea is to encode the problem with decision variables based on literals rather than rules. Our second key idea is to focus on linear invented rules. Our empirical results on multiple domains show that our approach can refactor programs quicker and with more compression than the previous stateof-the-art approach, sometimes by 60%.

Abstract: In recent years, Machine Learning (ML) models have achieved remarkable success in various domains. However, these models also tend to demonstrate unsafe behaviors, precluding their deployment in safetycritical systems. To cope with this issue, ample research focuses on developing methods that guarantee the safe behaviour of a given ML model. A prominent example is shielding which incorporates an ex- ternal component (a “shield”) that blocks unwanted behavior. Despite significant progress, shielding suffers from a main setback: it is currently geared towards properties encoded solely in propositional logics (e.g., LTL) and is unsuitable for richer logics. This, in turn, limits the widespread applicability of shielding in many real-world systems. In this work, we address this gap, and extend shielding to LTL modulo theories, by building upon recent advances in reactive synthesis modulo theories. This allowed us to develop a novel approach for generating shields conforming to complex safety specifications in these more expressive, logics. We evaluated our shields and demonstrate their ability to handle rich data with temporal dynamics. To the best of our knowledge, this is the first approach for synthesizing shields for such expressivity.

Abstract: Querying temporal data has recently gained traction in several artificial intelligence applications. As operational domains of intelligent agents are constantly being expanded, there is a strong need for representing domain knowledge. This comes in the form of ontologies, which are predominantly expressed in description logics and enrich timestamped data to temporal knowledge bases. For modeling highly complex system environments, expressive description logics are often the formalism of choice. Querying such temporal knowledge bases is a challenging task, but recently a first practical solution has been put forward. We propose a novel approach to the query answering problem based on two well-known rewriting rules from temporal logic. After a careful theoretical analysis of our algorithm, we show in a practical evaluation on several benchmarks that it outperforms state of the art, sometimes by orders of magnitude. Based on our findings, we also propose a fragment of temporal conjunctive queries which guides users towards well-performing queries.

Abstract: Exponentialfamily harmoniums (EFHs) generalize the restricted Boltzmann machine beyond Bernoulli random variables to other exponential families. Here we show how to extend the EFH beyond standard exponential families (Poisson, Gaussian, etc.), by allowing the sufficient statistics for the hidden units to be arbitrary functions of the observed data, parameterized by deep neural networks. This rules out the standard sampling scheme, block Gibbs sampling, so we replace it with a form of Langevin dynamics within Gibbs, inspired by a recent method for training Gaussian restricted Boltzmann machines (GRBMs). With Gibbs-Langevin, the GRBM can successfully model small datasets like MNIST and CelebA-32, but struggles with CIFAR-10, and cannot scale to larger images because it lacks convolutions. In contrast, our neural-network EFHs (NN-EFHs) generate high-quality samples from CIFAR-10 and scale well to CelebA-HQ. On these datasets, the NN-EFH achieves FID scores that are 25--50% lower than a standard energy-based model with a similar neural-network architecture and the same number of parameters; and competitive with noise-conditional score networks, which utilize more complex neural networks (U-nets) and require considerably more sampling steps.

Abstract: Metric magnitude of a point cloud is a measure of its ``size." It has been adapted to various mathematical contexts and recent work suggests that it can enhance machine learning and optimization algorithms. But its usability is limited due to the computational cost when the dataset is large or when the computation must be carried out repeatedly (e.g. in model training). In this paper, we study the magnitude computation problem, and show efficient ways of approximating it. We show that it can be cast as a convex optimization problem, but not as a submodular optimization. The paper describes two new algorithms an iterative approximation algorithm that converges fast and is accurate in practice, and a subset selection method that makes the computation even faster. It has previously been proposed that the magnitude of model sequences generated during stochastic gradient descent is correlated to the generalization gap. Extension of this result using our more scalable algorithms shows that longer sequences bear higher correlations. We also describe new applications of magnitude in machine learning -- as an effective regularizer for neural network training, and as a novel clustering criterion.

Abstract: Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints. However, in such cases it is often easy to check whether these constraints are satisfied or violated. Recent works have shown that LLMs can benefit from such ``corrective feedback''. Here we claim that this skill of LLMs can be significantly enhanced via training. We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. Furthermore, CORGI's interactive framework enables metalearning, allowing the LLM to better generalize to guided interaction in new tasks. Our results clearly show that conversational optimization, when combined with reinforcement learning, significantly improves the effectiveness of LLMs in controlled generation contexts.

Abstract: AudioVisual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound. Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability. To address this, we propose the Modality Correspondence Alignment (MoCA) framework, which seamlessly integrates off-the-shelf foundation models like DINO, SAM, and ImageBind. Our approach leverages existing knowledge within these models and optimizes their joint usage for multimodal associations. Our approach relies on estimating positive and negative image pairs in the feature space. For pixel-level association, we introduce an audio-visual adapter and a novel {pixel matching aggregation} strategy within the image-level contrastive learning framework. This allows for a flexible connection between object appearance and audio signal at the pixel level, with tolerance to imaging variations such as translation and rotation. Extensive experiments on the AVSBench (single and multi-object splits) and AVSS datasets demonstrate that MoCA outperforms unsupervised baseline approaches and some supervised counterparts, particularly in complex scenarios with multiple auditory objects. In terms of mIoU, MoCA achieves a substantial improvement over baselines in both the AVSBench (S4: +17.24%, MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges.

Abstract: Learning lowdimensional numerical representations from symbolic data, e.g., embedding the nodes of a graph into a geometric space, is an important concept in machine learning. While embedding into Euclidean space is common, recent observations indicate that hyperbolic geometry is better suited to represent hierarchical information and heterogeneous data (e.g., graphs with a scale-free degree distribution). Despite their potential for more accurate representations, hyperbolic embeddings also have downsides like being more difficult to compute and harder to use in downstream tasks. We propose embedding into a weighted space, which is closely related to hyperbolic geometry but mathematically simpler. We provide the embedding algorithm WEmbed and demonstrate, based on generated as well as over 2000 real-world graphs, that our weighted embeddings heavily outperform state-of-the-art Euclidean embeddings for heterogeneous graphs while using fewer dimensions. The running time of WEmbed and embedding quality for the remaining instances is on par with state-of-the-art Euclidean embedders.

Abstract: Counterfactual explanations are one of the prominent eXplainable Artificial Intelligence (XAI) techniques, and suggest changes to input data that could alter predictions, leading to more favourable outcomes. Existing counterfactual methods do not readily apply to temporal domains, such as that of process mining, where data take the form of traces of activities that must obey to temporal background knowledge expressing which dynamics are possible and which not. Specifically, counterfactuals generated offthe-shelf may violate the background knowledge, leading to inconsistent explanations. This work tackles this challenge by introducing a novel approach for generating temporally constrained counterfactuals, guaranteed to comply by design with background knowledge expressed in Linear Temporal Logic on process traces (LTLp). We do so by infusing automata-theoretic techniques for LTLp inside a genetic algorithm for counterfactual generation. The empirical evaluation shows that the generated counterfactuals are temporally meaningful and more interpretable for applications involving temporal dependencies.

School of Computer Science and Technology, Zhejiang Normal University, China Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China School of Computer Science and Technology, Zhejiang Normal University, China School of Information Engineering, Huzhou University, China, Zhejiang Institute of Optoelectronics, China Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China, School of Computer Science and Technology, Zhejiang Normal University, China Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, China

Abstract: The outof-distribution (OOD) detection on graph-structured data is crucial for deploying graph neural networks securely in open-world scenarios. However, existing methods have overlooked the prevalent scenario of multi-label classification in real-world applications. In this work, we investigate the unexplored issue of OOD detection within multi-label node classification tasks. We propose ML-GOOD, a simple yet sufficient approach that utilizes an energy function to gauge the OOD score for each label. We further develop a strategy for amalgamating multiple label energies, allowing for the comprehensive utilization of label information to tackle the primary challenges encountered in multi-label scenarios. Extensive experimentation conducted on seven diverse sets of real-world multi-label graph datasets, encompassing cross-domain scenarios. The results show that the AUROC of ML-GOOD is improved by 5.26% in intra-domain and 6.54% in cross-domain compared to the previous methods. These empirical validations not only affirm the robustness of our methodology but also illuminate new avenues for further exploration within this burgeoning field of research.

School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China Information Technology and Data Management Department of China Mobile Communications Group Zhejiang Co., Ltd, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

Abstract: The learnware paradigm aims to establish a learnware dock system of numerous welltrained machine learning models, enabling users to reuse existing helpful models for their tasks instead of starting from scratch. Each learnware in the system is a well-established model submitted by its developer, associated with a specification generated by the learnware dock system. The specification characterizes the specialty of the corresponding model, enabling it to be identified accurately for new task requirements. Existing specification generation methods are mostly based on the Reduced Kernel Mean Embedding (RKME) technique, which uses the Maximum Mean Discrepancy (MMD) in the Reproducing Kernel Hilbert Space (RKHS) to seek a reduced set that characterizes the model's capabilities. However, existing RKME-based methods mainly utilize feature information to generate specifications by assuming the existence of the ground-truth labeling function, while leaving the label information, which is capable of providing rich semantic characterization, untouched. Furthermore, the quality of the generated specifications heavily relies on the choice of the kernels, which makes it prohibitive to adapt to all real-world scenarios. In this paper, to overcome the above limitations, we propose a novel specification approach named LANE, i.e., Label-Aware Neural Embedding. In LANE, the neural embedding space is utilized to replace the RKHS, effectively circumventing the step of kernel selection and thereby addressing the dependency on kernels in existing RKME-based specification methods. More importantly, LANE uses the label information as additional supervision to enhance the generation process, resulting in specifications of superior quality. Extensive experiments demonstrate the effectiveness and superiority of the proposed LANE approach in the learnware paradigm.

Abstract: In reallife scenarios, a Reinforcement Learning (RL) agent aiming to maximize their reward, must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

Abstract: The use of neural differential equation models in machine learning applications has gained significant traction in recent years. In particular, fractional differential equations (FDEs) have emerged as a powerful tool for capturing complex dynamics in various domains. While existing models have primarily focused on constantorder fractional derivatives, variable-order fractional operators offer a more flexible and expressive framework for modeling complex memory patterns. In this work, we introduce the Neural Variable-Order Fractional Differential Equation network (NvoFDE), a novel neural network framework that integrates variable-order fractional derivatives with learnable neural networks. Our framework allows for the modeling of adaptive derivative orders dependent on hidden features, capturing more complex feature-updating dynamics and providing enhanced flexibility. We conduct extensive experiments across multiple graph datasets to validate the effectiveness of our approach. Our results demonstrate that NvoFDE outperforms traditional constant-order fractional and integer models across a range of tasks, showcasing its superior adaptability and performance.

Abstract: Given an edgecolored graph, the goal of the proportional fair matching problem is to find a maximum weight matching while ensuring proportional representation (with respect to the number of edges) of each color. The colors may correspond to demographic groups or other protected traits where we seek to ensure roughly equal representation from each group. It is known that, assuming ETH, it is impossible to approximate the problem with ℓ colors in time subexponential in ℓ even on unweighted path graphs. Further, even determining the existence of a non-empty matching satisfying proportionality is NP-Hard. To overcome this hardness, we relax the stringent proportional fairness constraints to a probabilistic notion. We introduce a notion we call δ-PʀᴏʙᴀʙʟʏFᴀɪʀ where we ensure proportionality up to a factor of at most (1 ± δ) for some small δ > 0 with high probability. The violation δ can be brought arbitrarily close to 0 for some good instances with large values of matching size. We propose and analyze simple and fast algorithms for bipartite graphs that achieve constant-factor approximation guarantees, and return a δ-PʀᴏʙᴀʙʟʏFᴀɪʀ matching.

Abstract: In recent years, Graph Neural Networks (GNNs) have been utilized for various applications ranging from drug discovery to network design and social networks. In many applications, it is impossible to observe some properties of the graph directly; instead, noisy and indirect measurements of these properties are available. These scenarios are coined as Graph Inverse Problems (GRIPs). In this work, we introduce a framework leveraging GNNs to solve GRIPs. The framework is based on a combination of likelihood and prior terms, which are used to find a solution that fits the data while adhering to learned prior information. Specifically, we propose to combine recent deep learning techniques that were developed for inverse problems, together with GNN architectures, to formulate and solve GRIPs. We study our approach on a number of representative problems that demonstrate the effectiveness of the framework.

Abstract: Reservoir Computing (RC) models, a subclass of recurrent neural networks, are distinguished by their fixed, nontrainable input layer and dynamically coupled reservoir, with only the static readout layer being trained. This design circumvents the issues associated with backpropagating error signals through time, thereby enhancing both stability and training efficiency. RC models have been successfully applied across a broad range of application domains. Crucially, they have been demonstrated to be universal approximators of time-invariant dynamic filters with fading memory, under various settings of approximation norms and input driving sources. Simple Cycle Reservoirs (SCR) represent a specialized class of RC models with a highly constrained reservoir architecture, characterized by uniform ring connectivity and binary input-to-reservoir weights with an aperiodic sign pattern. For linear reservoirs, given the reservoir size, the reservoir construction has only one degree of freedom -- the reservoir cycle weight. Such architectures are particularly amenable to hardware implementations without significant performance degradation in many practical tasks. In this study we endow these observations with solid theoretical foundations by proving that SCRs operating in real domain are universal approximators of time-invariant dynamic filters with fading memory. Our results supplement recent research showing that SCRs in the complex domain can approximate, to arbitrary precision, any unrestricted linear reservoir with a non-linear readout. We furthermore introduce a novel method to drastically reduce the number of SCR units, making such highly constrained architectures natural candidates for low-complexity hardware implementations. Our findings are supported by empirical studies on real-world time series datasets.

Abstract: Hypergraphs are powerful mathematical structures that can model complex, highorder relationships in various domains, including social networks, bioinformatics, and recommender systems. However, generating realistic and diverse hypergraphs remains challenging due to their inherent complexity and lack of effective generative models. In this paper, we introduce a diffusion-based Hypergraph Generation (HYGENE) method that addresses these challenges through a progressive local expansion approach. HYGENE works on the bipartite representation of hypergraphs, starting with a single pair of connected nodes and iteratively expanding it to form the target hypergraph. At each step, nodes and hyperedges are added in a localized manner using a denoising diffusion process, which allows for the construction of the global structure before refining local details. Our experiments demonstrated the effectiveness of HYGENE, proving its ability to closely mimic a variety of properties in hypergraphs. To the best of our knowledge, this is the first attempt to employ diffusion models for hypergraph generation.

Abstract: The strength of multimodal learning lies in its ability to integrate information from various sources, providing rich and comprehensive insights. However, in realworld scenarios, multi-modal systems often face the challenge of dynamic modality contributions, the dominance of different modalities may change with the environments, leading to suboptimal performance in multimodal learning. Current methods mainly enhance weak modalities to balance multimodal representation bias, which inevitably optimizes from a partialmodality perspective, easily leading to performance descending for dominant modalities. To address this problem, we propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Moreover, we provide an in-depth analysis that optimizing certain modalities could cause information loss and prevent leveraging the full advantages of multimodal data. By exploring the dominance and narrowing the contribution gaps between modalities, we have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.

Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China, Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China, School of Statistics, Remin University of China, School of Statistics, Remin University of China, Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China, Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China, Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China, Key Lab of Data Engineering and Knowledge Engineering of MOE Renmin University of China School of Information, Renmin University of China

Abstract: Clustering traditionally aims to reveal a natural grouping structure within unlabeled data. However, this structure may not always align with users' preferences. In this paper, we propose a personalized clustering method that explicitly performs targeted representation learning by interacting with users via modicum task information (e.g., mustlink or cannot-link pairs) to guide the clustering direction. We query users with the most informative pairs, i.e., those pairs most hard to cluster and those most easy to miscluster, to facilitate the representation learning in terms of the clustering preference. Moreover, by exploiting attention mechanism, the targeted representation is learned and augmented. By leveraging the targeted representation and constrained contrastive loss as well, personalized clustering is obtained. Theoretically, we verify that the risk of personalized clustering is tightly bounded, guaranteeing that active queries to users do mitigate the clustering risk. Experimentally, extensive results show that our method performs well across different clustering tasks and datasets, even when only a limited number of queries are available.

Abstract: Outof-distribution (OOD) detection is a crucial task for deploying deep learning models in the wild. One of the major challenges is that well-trained deep models tend to perform over-confidence on unseen test data. Recent research attempts to leverage real or synthetic outliers to mitigate the issue, which may significantly increase computational costs and be biased toward specific outlier characteristics. In this paper, we propose a simple yet effective framework, Prototypical Outlier Proxy (POP), which introduces virtual OOD prototypes to reshape the decision boundaries between ID and OOD data. Specifically, we transform the learnable classifier into a fixed one and augment it with a set of prototypical weight vectors. Then, we introduce a hierarchical similarity boundary loss to impose adaptive penalties depending on the degree of misclassification. Extensive experiments across various benchmarks demonstrate the effectiveness of POP. Notably, POP achieves average FPR95 reductions of 7.70%, 6.30%, and 5.42% over the second-best methods on CIFAR-10, CIFAR-100, and ImageNet-200, respectively. Moreover, compared to the recent method NPOS, which relies on outlier synthesis, POP trains 7.2 times faster and performs inference 19.5 times faster.

Abstract: Partial label learning (PLL) allows each instance to be annotated with a set of candidate labels, but only one is the groundtruth label. Although the state-of-the-art (SOTA) PLL models have shown competitive performance, they cannot get rid of the negative influence from the noisy false-positive labels during the training process. This leads to a large extent of uncertainty of PLL models’ prediction, and it becomes unreliable to trust a PLL model’s performance only by its prediction accuracy. To bridge this gap, we develop a new framework to quantify the uncertainty for PLL models with valid confidence guarantee, which is named as Conformal Prediction for Partial Label Learning (CP-PLL). This framework can be implemented on top of any PLL method to quantify their predictive confidence in terms of average prediction set size with a use-specified error rate or coverage/confidence level (i.e., probability). We prove that the coverage guarantee in PLL still holds, that is, the ground-truth label can be covered in the constructed prediction set with the user pre-defined error rate α when we use the noisy calibration data to carlibrate the PLL models, which yields to a probability interval of [1- α, 1- α + 1/n+1 + ε]. Extensive experiments are conducted on SOTA PLL methods and benchmark datasets to verify the effectiveness of the proposed framework.

Abstract: Derivativefree optimization algorithms play an important role in scientific and engineering design optimization problems, especially when derivative information is not accessible. In this paper, we study the framework of sequential classification-based derivative-free optimization algorithms. By introducing learning theoretic concept hypothesis-target shattering rate, we revisit the computational complexity upper bound of SRACOS Inspired by the revisited upper bound, we propose an algorithm named RACE-CARS, which adds a random region-shrinking step compared with SRACOS. We further establish theorems showing the acceleration by region shrinking. Experiments on the synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and putting forward an empirical hyperparameter tuning guidance.

Abstract: Medical image segmentation provides useful information about the shape and size of organs, which is beneficial for improving diagnosis, analysis, and treatment. Despite traditional deep learningbased models can extract domain-specific knowledge, they face a generalization bottleneck due to the limited embedded knowledge scope. Vision foundation models have been demonstrated to be effective in extracting generalizable knowledge, but they cannot extract domain-specific knowledge without fine-tuning. In this work, we propose a novel multi-view evidential learning-based framework, which can extract both domain-specific and generalizable knowledge from multi-view features by combining the advantages of traditional and vision foundation models. Specifically, a novel multi-view state space model (MV-SSM) is designed to extract task-related knowledge while removing redundant information within multi-view features. The proposed MV-SSM utilizes Mamba, a state space model, to model cross-view contextual dependencies between domain-specific and generalizable features. Additionally, evidential learning is adopted to quantify the segmentation uncertainty of the model for boundary. In special, variational Dirichlet is introduced to characterize the distribution of the result probabilities, parameterized with collected evidence to quantify uncertainty. As a result, the model can reduce the segmentation uncertainties of boundaries by optimizing the parameters of the Dirichlet distribution. Experimental results on three datasets show that our method obtains superior segmentation performance.

School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China, Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada, Research Centre for Analytical Instrumentation, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China, Research Centre for Analytical Instrumentation, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China

Abstract: Infrared (IR) spectroscopy is a fundamental technique in analytical chemistry. Recently, deep learning (DL) has drawn great interest as the modeling method of infrared spectral data. However, unlike vision or language tasks, IR spectral data modeling is faced with the problem of calibration transfer and has distinctive characteristics. Introducing the prior knowledge of IR spectroscopy could guide the DL methods to learn representations aligned with the domaininvariant characteristics of spectra, and thus improve the performance. Despite such potential, there is a notable absence of DL methods that incorporate such inductive bias. To this end, we propose Analytical-Chemistry-Informed Transformer (ACT) with two modules informed by the field knowledge in analytical chemistry. First, ACT includes learnable spectral processing inspired by chemometrics, which comprises spectral pre-processing, tokenization, and post-processing. Second, a straightforward yet effective representation learning mechanism, namely spectral-attention, is incorporated into ACT. Spectral-attention utilizes the intra-spectral and inter-spectral correlations to extract spectral representations. Empirical results show that ACT has achieved competitive results in 9 analytical tasks covering applications across pharmacy, chemistry, and agriculture. Compared with existing networks, ACT reduces the root mean square error of prediction (RMSEP) by more than 20% in calibration transfer tasks. These results indicate that DL methods in IR spectroscopy could benefit from the integration of prior knowledge in analytical chemistry.

Abstract: This paper introduces state abstraction for twoplayer zero-sum Markov games (TZMGs) where the payoffs for the two players are determined by the state representing the environment and their respective actions, with state transitions following a Markov decision processes. For example, in games like soccer, the value of actions changes according to the state of play, we should describe them as Markov games. In TZMGs, the more the number of states becomes, the more difficult computing the equilibrium becomes. Therefore, we abstract the states of TZMGs and examine the performance. State abstraction reduces the number of states by treating multiple different states as a single state, and there is a substantial body of research on finding optimal policies for Markov decision processes using state abstraction. This study extends the state abstraction for MDPs to Markov games. In this case, the game with state abstraction may yield different equilibrium solutions from those of the ground game. To evaluate the equilibrium solutions of the game with state abstraction, we derived bounds on duality gap, which represents the distance from the equilibrium solutions of the ground game. Finally, we demonstrate our state abstraction with Markov Soccer, compute equilibrium policies, and examine the results.

Abstract: Textto-audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.

Abstract: CrossDomain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.

Abstract: We consider the finitehorizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step h in dynamic programming (DP) algorithms. To learn this, it is sufficient to evaluate the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically hard than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution shift challenges. We show benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning.

Abstract: The transformer architecture has catalyzed revolutionary advances in language modeling. However, recent architectural recipes, such as statespace models, have bridged the performance gap. Motivated by this, we examine the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks. CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer. Through CAT, we show that the locality of the convolution synergizes with the global view of the attention. Unlike comparable architectures, such as Mamba or transformer, CAT can provably solve the associative recall (AR) and copying tasks using a single layer while also enjoying guaranteed length generalization. We also establish computational tradeoffs between convolution and attention by characterizing how convolution can mitigate the need for full attention by summarizing the context window and creating salient summary tokens to attend. Evaluations on real datasets corroborate our findings and demonstrate that CAT and its variations indeed enhance the language modeling performance.

School of Software Engineering, Huazhong University of Science and Technology, Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Information and Communication Technology, Griffith University

Abstract: Convolutionbased unlearnable examples (UEs) employ class-wise multiplicative convolutional noise to training samples, severely compromising model performance. This fire-new type of UEs have successfully countered all defense mechanisms against UEs. The failure of such defenses can be attributed to the absence of norm constraints on convolutional noise, leading to severe blurring of image features. To address this, we first design an Edge Pixel-based Detector (EPD) to identify convolution-based UEs. Upon detection of them, we propose the first defense scheme against convolution-based UEs, COrrupting these samples via random matrix multiplication by employing bilinear INterpolation (COIN) such that disrupting the distribution of class-wise multiplicative noise. To evaluate the generalization of our proposed COIN, we newly design two convolution-based UEs called VUDA and HUDA to expand the scope of convolution-based UEs. Extensive experiments demonstrate the effectiveness of detection scheme EPD and that our defense COIN outperforms 11 state-of-the-art (SOTA) defenses, achieving a significant improvement on the CIFAR and ImageNet datasets.

Abstract: Partial multilabel learning (PML) aims to train a classifier on dataset whose instances are over-annotated with not only relevant labels but also irrelevant labels, which is common when datasets are collected from crowd-sourcing platform. Existing works primarily approach it from a curriculum learning perspective, leveraging the memorization effect to disambiguate noisy labels and produce robust predictions. However, these methods are based on non-adaptive weighting functions and lack theoretical guidance for optimal weighting. To overcome these issues, a calibrated disambiguation model named PML-CD is proposed. We firstly formulate the optimal weighting function for curriculum-based disambiguation, which is equivalent to the calibration of the model's predicted confidences, thus provide a guidance for curriculum designing. To obtain the optimal weighting function from PML dataset during the training, a transferable calibrator is designed, which takes the histogram of positive samples' confidences as input, and outputs the optimal curriculum weighting for training. Prototype alignment regularization is also proposed to promote the model's performance. Experiments conducted on Pascal VOC, MS-COCO, NUS-WIDE and CUB have verified that our method outperforms existing state-of-the-art PML methods.

Abstract: Local causal discovery is crucial for revealing the causal relationships between specific variables from data. Existing local causal discovery algorithms are designed under the assumption of causal sufficiency, which states that there are no latent common causes for two or more of the observed variables in data. However, the assumption of causal sufficiency is often violated in practice. To address this issue, we first propose the local Maximal Ancestral Graph (MAG), referred to as LocalMAG, to describe the local causal relationships of the target variable in the MAG. Then, we propose a local causal discovery algorithm without the assumption of causal sufficiency, called LatentLCD, to learn the LocalMAG. Specifically, LatentLCD first uses the traditional parents and children discovery algorithm to identify the local causal skeleton that includes latent variables and verifies it theoretically. It then identifies bidirectional edges by determining whether both the target variable and its adjacent variables are colliders, thereby identifying latent variables in the local structure of the target variable. Extensive experiments on synthetic datasets have validated that the proposed LatentLCD algorithm significantly outperforms the stateof-the-art methods.

Abstract: To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with manyfold fewer samples outperforms onpolicy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.

Abstract: Learning control policy from continuous action space by visual observations is a fundamental and challenging task in reinforcement learning (RL). An essential problem is how to accurately map the highdimensional images to the optimal actions by the policy network. Traditional decision-making modules output actions solely based on the current observation, while the distributions of optimal actions are dependent on specific tasks and cannot be known priorly, which increases the learning difficulty. To make the learning easier, we analyze the action characteristics in several control tasks, and propose Reinforcement Learning with Residual Action (ResAct) to explicitly model the adjustments of actions based on the differences between adjacent observations, rather than learning actions directly from observations. The method just redefines the output of the policy network, and doesn’t introduce any prior assumption to constrain or simplify the vanilla control problem. Extensive experiments on DeepMind Control Suite and CARLA demonstrate that the method could improve different RL baselines significantly, and achieve state-of-the-art performance.

Abstract: Multiview tensor clustering (MVTC) has gained much attention for its effectiveness in capturing global high-order correlations across views. However, current MVTC methods suffer from two limitations: 1) adopting a two-stage process to learn the latent features for clustering, and 2) either ignoring local similarities within views or treating local similarities and global high-order correlations equally. In this paper, we propose a smooth low-rank MVTC (SLR-MVTC) method, which aims to extract latent features that are smooth within each view and low-rank across views, enhancing clustering performance. Specifically, we first learn latent features from each view using orthogonal projection and then construct the latent feature tensor by concatenation and rotation. Then, we introduce a new smooth tensor nuclear norm to depict the low-rank components of the low-frequency parts in the feature tensor. Benefiting from the fast Fourier transform along the sample dimension, the obtained low-frequency components effectively capture local smoothness within views, while their low-rank parts further explore global correlations across views. Experimental results on six multi-view datasets demonstrate that SLR-MVTC outperforms state-of-the-art algorithms in terms of clustering performance and CPU time.

Abstract: Uncertainty quantification is essential in decisionmaking, especially when joint distributions of random variables are involved. While conformal prediction provides distribution-free prediction sets with valid coverage guarantees, it traditionally focuses on single predictions. This paper introduces novel conformal prediction methods for estimating the sum or average of unknown labels over specific index sets. We develop conformal prediction intervals for single target to the prediction interval for sum of multiple targets. Under permutation invariant assumptions, we prove the validity of our proposed method. We also apply our algorithms on class average estimation and path cost prediction tasks, and we show that our method outperforms existing conformalized approaches as well as non-conformal approaches.

Abstract: This paper introduces Conformal Thresholded Intervals (CTI), a novel conformal regression method that aims to produce the smallest possible prediction set with guaranteed coverage. Unlike existing methods that rely on nested conformal frameworks and full conditional distribution estimation, CTI estimates the conditional probability density for a new response to fall into each interquantile interval using offthe-shelf multi-output quantile regression. By leveraging the inverse relationship between interval length and probability density, CTI constructs prediction sets by thresholding the estimated conditional interquantile intervals based on their length. The optimal threshold is determined using a calibration set to ensure marginal coverage, effectively balancing the trade-off between prediction set size and coverage. CTI's approach is computationally efficient and avoids the complexity of estimating the full conditional distribution. The method is theoretically grounded, with provable guarantees for marginal coverage and achieving the smallest prediction size given by Neyman-Pearson . Extensive experimental results demonstrate that CTI achieves superior performance compared to state-of-the-art conformal regression methods across various datasets, consistently producing smaller prediction sets while maintaining the desired coverage level. The proposed method offers a simple yet effective solution for reliable uncertainty quantification in regression tasks, making it an attractive choice for practitioners seeking accurate and efficient conformal prediction.

Abstract: Algebraic model counting unifies many inference tasks on logic formulas by exploiting semirings. Rather than focusing on inference, we consider learning, especially in statisticalrelational and neurosymbolic AI, which combine logical, probabilistic and neural representations. Concretely, we show that the very same semiring perspective of algebraic model counting also applies to learning. This allows us to unify various learning algorithms by generalizing gradients and backpropagation to different semirings. Furthermore, we show how cancellation and ordering properties of a semiring can be exploited for more memory-efficient backpropagation. This allows us to obtain some interesting variations of state-of-the-art gradient-based optimisation methods for probabilistic logical models. We also discuss why algebraic model counting on tractable circuits does not lead to more efficient second-order optimization. Empirically, our algebraic backpropagation exhibits considerable speed-ups as compared to existing approaches.

Abstract: We study a hinted heterogeneous multiagent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the M agents has a unique reward distribution over K arms, and in T rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, which is an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving O(M^4 K) regret with O(M K log T) adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding O(M^3 K^2) regret with O(M^3 K log T) hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.

Abstract: A recent work introduces the problem of learning set functions from data generated by a socalled optimal subset oracle. Their approach approximates the underlying utility function with an energy-based model, whose parameters are estimated via mean-field variational inference. This approximation reduces to fixed point iterations; however, as the number of iterations increases, automatic differentiation quickly becomes computationally prohibitive due to the size of the Jacobians that are stacked during backpropagation. We address this challenge with implicit differentiation and examine the convergence conditions for the fixed-point iterations. We empirically demonstrate the efficiency of our method on synthetic and real-world subset selection applications including product recommendation, set anomaly detection and compound selection tasks.

Abstract: In performative Reinforcement Learning (RL), an agent faces a policydependent environment: the reward and transition functions depend on the agent's policy. Prior work on performative RL has studied the convergence of repeated retraining approaches to a performatively stable policy. In the finite sample regime, these approaches repeatedly solve for a saddle point of a convex-concave objective, which estimates the Lagrangian of a regularized version of the reinforcement learning problem. In this paper, we aim to extend such repeated retraining approaches, enabling them to operate under corrupted data. More specifically, we consider Huber's ε-contamination model, where an ε fraction of data points is corrupted by arbitrary adversarial noise. We propose a repeated retraining approach based on convex-concave optimization under corrupted gradients and a novel problem-specific robust mean estimator for the gradients. We prove that our approach exhibits last-iterate convergence to an approximately stable policy, with the approximation error linear in √ε. We experimentally demonstrate the importance of accounting for corruption in performative reinforcement learning.

Abstract: Diffusionbased molecular graph generative models have achieved significant success in template-free, single-step retrosynthesis prediction. However, these models typically generate reactants from scratch, often overlooking the fact that the scaffold of a product molecule typically remains unchanged during chemical reactions. To leverage this useful observation, we introduce a retrieval-augmented molecular graph generation framework. Our framework comprises three key components: a retrieval component that identifies similar molecules for the given product, an integration component that learns valuable clues from these molecules about which part of the product should remain unchanged, and a base generative model that is prompted by these clues to generate the corresponding reactants. We explore various design choices for critical and under-explored aspects of this framework and instantiate it as the Retrieval-Augmented RetroBridge (RARB). RARB demonstrates state-of-the-art performance on standard benchmarks, achieving a 14.8% relative improvement in top-1 accuracy over its base generative model, highlighting the effectiveness of retrieval augmentation. Additionally, RARB excels in handling out-of-distribution molecules, and its advantages remain significant even with smaller models or fewer denoising steps. These strengths make RARB highly valuable for real-world retrosynthesis applications, where extrapolation to novel molecules and high-throughput prediction are essential.

Abstract: While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, numerous efforts have emerged in adversarial defense. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range. However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, thereby excluding the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains.

Abstract: Multiple signal modalities, such as vision and sounds, are naturally present in realworld phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoder (VAE), to for multimodal representation learning especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.

School of Advanced Interdisciplinary Sciences, UCAS, Beijing 100049, China Academy of Mathematics and Systems Science, CAS, Beijing 100190, China, Academy of Mathematics and Systems Science, CAS, Beijing 100190, China University of Chinese Academy of Sciences, Beijing 101408, China, University of Chinese Academy of Sciences, Beijing 101408, China, Academy of Mathematics and Systems Science, CAS, Beijing 100190, China University of Chinese Academy of Sciences, Beijing 101408, China, Institute of Software, CAS, Beijing 100190, China State Key Laboratory of Computer Science, Academy of Mathematics and Systems Science, CAS, Beijing 100190, China University of Chinese Academy of Sciences, Beijing 101408, China

Abstract: The KolmogorovArnold Network (KAN) is a new network architecture known for its high accuracy in several tasks such as function fitting and PDE solving. The superior expressive capability of KAN arises from the Kolmogorov-Arnold representation theorem and learnable spline functions. However, the computation of spline functions involves multiple iterations, which renders KAN significantly slower than MLP, thereby increasing the cost associated with model training and deployment. The authors of KAN also noted that "the biggest bottleneck of KANs lies in their slow training. KANs are usually 10x slower than MLPs, given the same number of parameters." To address this issue, we propose a novel MLP-type neural network PowerMLP that employs simpler non-iterative spline function representation, offering approximately the same training time as MLP while theoretically demonstrating stronger expressive power than KAN. Furthermore, we compare the FLOPs of KAN and PowerMLP, quantifying the faster computation speed of PowerMLP. Our comprehensive experiments demonstrate that PowerMLP generally achieves higher accuracy and a training speed about 40 times faster than KAN in various tasks.

Abstract: We show that current SOTA methods for privately and fairly training models are unreliable in many practical scenarios. Specifically, we (1) introduce a new type of adversarial attack that seeks to introduce unfairness into private model training, and (2) demonstrate that the use of methods for training on private data that are robust to adversarial attacks often leads to unfair models, regardless of the use of fairnessenhancing training methods. This leads to a dilemma when attempting to train fair models on private data: either (A) we use a robust training method which may introduce unfairness to the model itself, or (B) we train models which are vulnerable to adversarial attacks that introduce unfairness. This paper highlights flaws in robust learning methods when training fair models, yielding a new perspective for the design of robust and private learning systems.

Abstract: AIsynthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited by predefined vocoders and sensitivity to factors like background noise and speaker identity. In this work, we introduce an innovative disentanglement framework aimed at extracting domain-agnostic artifact features related to vocoders. Utilizing these features, we enhance model learning in a flat loss landscape, enabling escape from suboptimal solutions and improving generalization. Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.

Abstract: Decision makers may suffer from uncertainty induced by limited data. This may be mitigated by accounting for epistemic uncertainty, which is however challenging to estimate efficiently for large neural networks. To this extent we investigate Delta Variances, a family of algorithms for epistemic uncertainty quantification, that is computationally efficient and convenient to implement. It can be applied to neural networks and more general functions composed of neural networks. As an example we consider a weather simulator with a neuralnetwork-based step function inside - here Delta Variances empirically obtain competitive results at the cost of a single gradient computation. The approach is convenient as it requires no changes to the neural network architecture or training procedure. We discuss multiple ways to derive Delta Variances theoretically noting that special cases recover popular techniques and present a unified perspective on multiple related methods. Finally we observe that this general perspective gives rise to a natural extension and empirically show its benefit.

Abstract: Classic algorithms for stochastic bandits typically use hyperparameters that govern their critical properties such as the tradeoff between exploration and exploitation. Tuning these hyperparameters is a problem of great practical significance. However this is a challenging problem and in certain cases is information theoretically impossible. To address this challenge, we consider a practically relevant transfer learning setting where one has access to offline data collected from several bandit problems (tasks) coming from an unknown distribution over the tasks. Our aim is to use this offline data to set the hyperparameters for a new task drawn from the unknown distribution. We provide bounds on the inter-task (number of tasks) and intra-task (number of arm pulls for each task) sample complexity for learning near-optimal hyperparameters on unseen tasks drawn from the distribution. Our results apply to several classic algorithms, including tuning the exploration parameters in UCB and LinUCB and the noise parameter in GP-UCB. Our experiments indicate the significance and effectiveness of transfer of hyperparameters from offline problems in online learning with stochastic bandit feedback.

Abstract: Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoderonly transformer-based autoregressive models through structural weight pruning to improve the model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. Besides, we further propose another compensation algorithm to recover the pruned model for better performance. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations of graphstructured data, demonstrating remarkable performance across various tasks. Recognizing their importance, there has been extensive research focused on explaining GNN predictions, aiming to enhance their interpretability and trustworthiness. However, GNNs and their explainers face a notable challenge: graphs are primarily designed to model pair-wise relationships between nodes, which can make it tough to capture higher-order, multi-node interactions. This characteristic can pose difficulties for existing explainers in fully representing multi-node relationships. To address this gap, we present Framework For Higher-Order Representations In Graph Explanations (FORGE), a framework that enables graph explainers to capture such interactions by incorporating higher-order structures, resulting in more accurate and faithful explanations. Extensive evaluation shows that on average real-world datasets from the GraphXAI benchmark and synthetic datasets across various graph explainers, FORGE improves average explanation accuracy by 1.9x and 2.25x, respectively. We perform ablation studies to confirm the importance of higher-order relations in improving explanations, while our scalability analysis demonstrates FORGE's efficacy on large graphs.

Abstract: Pretrained Vision Mamba~(Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods.

National University of Defense Technology Centre for Frontier AI Research, Agency for Science, Technology and Research Institute of High Performance Computing, Agency for Science, Technology and Research, Centre for Frontier AI Research, Agency for Science, Technology and Research Institute of High Performance Computing, Agency for Science, Technology and Research, Intelligent Game and Decision Lab, National University of Defense Technology, Zhejiang Normal University, National University of Defense Technology, National University of Defense Technology, Centre for Frontier AI Research, Agency for Science, Technology and Research Institute of High Performance Computing, Agency for Science, Technology and Research

Abstract: Anchor selection or learning has become a critical component in largescale multi-view clustering. Existing anchor-based methods, which either select-then-fix or initialize-then-optimize with orthogonality, yield promising performance. However, these methods still suffer from instability of initialization or insufficient depiction of data distribution. Moreover, the desired properties of anchors in multi-view clustering remain unspecified. To address these issues, this paper first formalizes the desired characteristics of anchors, namely Diversity, Balance and Compactness. We then devise and mathematically validate anchors that satisfy these properties by maximizing the Mahalanobis distance between anchors. Furthermore, we introduce a novel method called Max-Mahalanobis Anchors Guidance for multi-view Clustering (MAGIC), which guides the cross-view representations to progressively align with our well-defined anchors. This process yields highly discriminative and compact representations, significantly enhancing the performance of multi-view clustering. Experimental results show that our meticulously designed strategy significantly outperforms existing anchor-based methods in enhancing anchor efficacy, leading to substantial improvement in multi-view clustering performance.

Institute for Artificial Intelligence, Peking University State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China, Institute of automation, Chinese academy of science, Chinese Academy of Sciences, Institute for Artificial Intelligence, Peking University, State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China, State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China Institute for Artificial Intelligence, Peking University State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China, State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China, Institute for Artificial Intelligence, Peking University State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China

Abstract: Differentiable environments have heralded new possibilities for learning control policies by offering rich differentiable information that facilitates gradientbased methods. In comparison to prevailing model-free reinforcement learning approaches, model-based reinforcement learning (MBRL) methods exhibit the potential to effectively harness the power of differentiable information for recovering the underlying physical dynamics. However, this presents two primary challenges: effectively utilizing differentiable information to 1) construct models with more accurate dynamic prediction and 2) enhance the stability of policy training. In this paper, we propose a Differentiable Information Enhanced MBRL method, MB-MIX, to address both challenges. Firstly, we adopt a Sobolev model training approach that penalizes incorrect model gradient outputs, enhancing prediction accuracy and yielding more precise models that faithfully capture system dynamics. Secondly, we introduce mixing lengths of truncated learning windows to reduce the variance in policy gradient estimation, resulting in improved stability during policy learning. To validate the effectiveness of our approach in differentiable environments, we provide theoretical analysis and empirical results. Notably, our approach outperforms previous model-based and model-free methods, in multiple challenging tasks involving controllable rigid robots such as humanoid robots' motion control and deformable object manipulation.

Abstract: Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge in ensuring fairness. Existing fairness techniques that address label bias typically involve modifying models and intervening in the training process, but these lack flexibility for largescale datasets. To address this limitation, we introduce a data selection method designed to efficiently and flexibly mitigate label bias, tailored to more practical needs. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, which is a common requirement in previous methods. Without altering the classifier's architecture, our modality-agnostic method effectively selects appropriate training data and has proven efficient and effective in handling label bias and improving fairness across diverse datasets in experimental evaluations.

Abstract: In multidimensional classification (MDC), the classifier chain approach is based on a chain structure to model dependencies between class spaces. However, current research on constructing a chain order is usually based on a greedy criterion or random generation, which is highly likely to lead to an incorrect chain order and fit incorrect class dependencies. Moreover, existing classifier chain-based approaches do not consider the misleading effects of irrelevant input features on the classifiers. To fill the above gap, a classifier chain-based approach incorporating evolutionary chain order optimization and feature selection (ECCO) is proposed. Specifically, this approach designs a meta-heuristic algorithm to optimize the chain order of multiple classifiers. Simultaneously, the approach selects dimension-specific feature combinations that are more conducive to class prediction of each dimension. These strategies enhance the class prediction capability of the constructed MDC model. Comparative experiments on 14 real datasets validate that ECCO outperforms 7 state-of-the-art MDC approaches.

Abstract: Over the last decade, graph neural networks (GNNs) have made significant progress in numerous graph machine learning tasks. In realworld applications, where domain shifts occur and labels are often unavailable for a new target domain, graph domain adaptation (GDA) approaches have been proposed to facilitate knowledge transfer from the source domain to the target domain. Previous efforts in tackling distribution shifts across domains have mainly focused on aligning the node embedding distributions generated by the GNNs in the source and target domains. However, as the core part of GDA approaches, the impact of the underlying GNN architecture has received limited attention. In this work, we explore this orthogonal direction, i.e., how to facilitate GDA with architectural enhancement. In particular, we consider a class of GNNs that are designed explicitly based on optimization problems, namely unfolded GNNs (UGNNs), whose training process can be represented as bi-level optimization. Empirical and theoretical analyses demonstrate that when transferring from the source domain to the target domain, the lower-level objective value generated by the UGNNs significantly increases, resulting in an increase in the upper-level objective as well. Motivated by this observation, we propose a simple yet effective strategy called cascaded propagation (CP), which is guaranteed to decrease the lower-level objective value. The CP strategy is widely applicable to general UGNNs, and we evaluate its efficacy with three representative UGNN architectures. Extensive experiments on five real-world datasets demonstrate that the UGNNs integrated with CP outperform state-of-the-art GDA baselines.

Abstract: We introduce TCAMDiff, a novel 3D medical image generation model that reduces the memory requirements to encode and generate high-resolution 3D data. This model utilizes a decoder-only autoencoder method to learn triplane representation from dense volume and leverages generalization operations to prevent overfitting. Subsequently, it uses a triplane-aware cross-attention diffusion model to learn and integrate these features effectively. Furthermore, the features generated by the diffusion model can be rapidly transformed into 3D volumes using a pre-trained decoder module. Our experiments on three different scales of medical datasets, BrainTumour 128x128x128, Pancreas 256x256x256, and Colon 512x512x512, demonstrated outstanding results. We utilized MSE and SSIM to evaluate reconstruction quality and leveraged the Wasserstein Generative Adversarial Network (W-GAN) critic to assess generative quality. Comparisons to existing approaches show that our method gives better reconstruction and generation results than other encoder-decoder methods with similar-sized latent spaces.

Abstract: Scorebased generative models can effectively learn the distribution of data by estimating the gradient of the distribution. Due to the multi-step denoising characteristic, researchers have recently considered combining score-based generative models with the gradient boosting algorithm, a multi-step supervised learning algorithm, to solve supervised learning tasks. However, existing generative model algorithms are often limited by the stochastic nature of the models and the long inference time, impacting prediction performances. Therefore, we propose a Supervised Score-based Model (SSM), which can be viewed as a gradient boosting algorithm combining score matching. We provide a theoretical analysis of learning and sampling for SSM to balance inference time and prediction accuracy. Via the ablation experiment in selected examples, we demonstrate the outstanding performances of the proposed techniques. Additionally, we compare our model with other probabilistic models, including Natural Gradient Boosting (NGboost), Classification and Regression Diffusion Models (CARD), Diffusion Boosted Trees (DBT), and non-probabilistic gradient boosting models. The experimental results show that our model outperforms existing models in both accuracy and inference time.

State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, Department of Computer Science, University of California, Los Angeles, CA, USA, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China

Abstract: Testtime adaptation aims to adapt a well-trained model to potential distribution shifts at test time using only unlabeled test data, without access to the original training data. While previous efforts mainly focus on a single modality, test-time distribution shift in the multi-modal setting is more complex and calls for new solutions. This paper tackles the problem of multi-modal test-time adaptation by proposing a novel method named Attention Bootstrapping with Principal Entropy Minimization (ABPEM). We observe that test-time distribution shift causes misalignment across modalities, leading to a large gap between intra-modality discrepancies (measured by self-attention) and inter-modality discrepancies (measured by cross-attention). We name this the attention gap. This attention gap widens with more severe distribution shifts, hindering effective modality fusion. To mitigate this attention gap and encourage better modality fusion, we propose attention bootstrapping that promotes cross-attention with the guidance of self-attention. Moreover, to reduce the gradient noise in the commonly-used entropy minimization, we adopt principal entropy minimization, a refinement of entropy minimization that reduces gradient noise by focusing on the principal parts of entropy, excluding less reliable gradient information. Extensive experiments on the benchmarks validate the effectiveness of the proposed ABPEM in comparison with competing baselines.

Abstract: Tabular data plays a vital role in various realworld scenarios and finds extensive applications. Although recent deep tabular models have shown remarkable success, they still struggle to handle data distribution shifts, leading to performance degradation when testing distributions change. To remedy this, a robust tabular model must adapt to generalize to unknown distributions during testing. In this paper, we investigate the problem of fully test-time adaptation (FTTA) for tabular data, where the model is adapted using only the testing data. We identify three key challenges: the existence of label and covariate distribution shifts, the lack of effective data augmentation, and the sensitivity of adaptation, which render existing FTTA methods ineffective for tabular data. To this end, we propose the Fully Test-time Adaptation for Tabular data, namely FTAT, which enables FTTA methods to robustly optimize the label distribution of predictions, adapt to shifted covariate distributions, and dynamically adapt the model for various tasks and models. We conduct comprehensive experiments on six benchmark datasets, which are evaluated using three metrics. The experimental results demonstrate that FTAT outperforms state-of-the-art methods by a margin.

Abstract: Federated reinforcement learning (FRL) has emerged as a promising paradigm, enabling multiple agents to collaborate and learn a shared policy adaptable across heterogeneous environments. Among the various reinforcement learning (RL) algorithms, the actorcritic (AC) algorithm stands out for its low variance and high sample efficiency. However, little to nothing is known theoretically about AC in a federated manner, especially each agent interacts with a potentially different environment. The lack of such results is attributed to various technical challenges: a two-level structure illustrating the coupling effect between the actor and the critic, heterogeneous environments, Markovian sampling and multiple local updates. In response, we study Single-Loop Federated Actor Critic (SFAC) where agents perform AC learning in a two-level federated manner while interacting with heterogeneous environments. We then provide bounds on the convergence error of SFAC. The results show that the convergence error asymptotically converges to a near-stationary point, with the extent proportional to environment heterogeneity. Moreover, the sample complexity exhibits a linear speed-up through the federation of agents. We evaluate the performance of SFAC through numerical experiments using common RL benchmarks, which demonstrate its effectiveness.

Abstract: Cooperative MultiAgent Reinforcement Learning (MARL) has drawn increasing interest in recent works due to its significant achievements. However, there are still some challenges impeding the learning of optimal cooperative policies, such as insufficient exploration. Prior works typically adopt mutual information-based methods to encourage exploration. However, this category of methods does not necessarily encourage agents to fully explore the joint behavior space. To address this limitation, we propose a novel objective based on learning a representation function with a Lipschitz constraint to maximize the traveled distances in the joint behavior space, encouraging agents to learn joint behaviors with large variations and leading to sufficient exploration. We further implement our method on top of QMIX. We demonstrate the effectiveness of our method by conducting experiments on the LBF, SMAC, and SMACv2 benchmarks. Our method outperforms previous methods in terms of final performance and state-action space exploration.

Abstract: This paper introduces text2midi, an endto-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo.

Abstract: Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) Existing approaches rely on automatic alignment; therefore, they are prone to errors caused by small deviations between alignment targets; (2) There is insufficient data to train music error detection models, resulting in overreliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, our model can handle multiple instruments compared with existing transcription methods repurposed for music error detection.

Abstract: The performance of Large Language Models (LLMs) is intrinsically linked to the quality of its training data. Although several studies have proposed methods for highquality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data from the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-trained corpus. We propose a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of the text. Based on this, we propose a comprehensive knowledge scorer to select data with intensive knowledge, which can also be utilized for domain-specific high-knowledge data selection by restricting knowledge elements to the specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance in knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.

Abstract: The autoregressive decoding paradigm endows large language models (LLMs) with superior language generation capabilities; however, its stepby-step decoding process inherently limits decoding speed. To mitigate these constraints, the prevalent “draft and validation” strategy enables parallel validation of candidate drafts, allowing LLMs to decode multiple tokens simultaneously during one model forward propagation. However, existing methodologies for obtaining drafts often incur additional overhead in communication or training process, or statistical biases from the corpus. To this end, we propose an innovative draft generation and maintenance approach that leverages the capabilities of LLM itself. Specifically, we extend the autoregressive decoding paradigm to a multi-branch drafting procedure, which can efficiently generate draft sequences without any additional models or training process, while preserving the quality of the generated content by maintaining LLM parameters. Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves around 2 times improvement of end-to-end throughput compared to the autoregressive decoding strategy.

Abstract: Speech watermarking techniques can proactively mitigate the potential harmful consequences of instant voice cloning techniques. These techniques involve the insertion of signals into speech that are imperceptible to humans but can be detected by algorithms. Previous approaches typically embed watermark messages into continuous space. However, intuitively, embedding watermark information into robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into discrete latent space with a vectorquantized autoencoder and inject watermarks by changing the modular arithmetic relation of discrete IDs. To ensure the imperceptibility of watermarks, we also propose a manipulator model to select the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility, simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, indicating its encoding capacity.

State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center School of Computer Science and Artificial Intelligence, Hefei Normal University

Abstract: Instructiontuned Code Large Language Models (Code LLMs) have excelled in diverse code-related tasks, such as program synthesis, automatic program repair, and code explanation. To collect training datasets for instruction-tuning, a popular method involves having models autonomously generate instructions and corresponding responses. However, the direct generation of responses does not ensure functional correctness, a crucial requirement for generating responses to code instructions. To overcome this, we present Verification-Based Self-Play (VERSE), aiming to enhance model proficiency in generating correct responses. VERSE establishes a robust verification framework that covers various code instructions. Employing VERSE, Code LLMs engage in self-play to generate instructions and corresponding verifications. They evaluate execution results and self-consistency as verification outcomes, using them as scores to rank generated data for self-training. Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness.

Abstract: Dialogue serves as the most natural manner of humancomputer interaction (HCI). Recent advancements in speech language models (SLM), have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies—early fusion, middle fusion, and late fusion—are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM’s robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM’s capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

Abstract: Attention mechanisms have played a crucial role in the success of Transformer models, as seen in platforms like ChatGPT. However, since they compute attentions from relationships between only one or two object types, they fail to effectively capture multiobject relationships in real-world scenarios, resulting in low prediction accuracy. In fact, they cannot calculate attention weights among diverse object types, such as the `comments,' `replies,' and `subjects' that naturally constitute conversations on platforms like Reddit or X, representing relationships simultaneously observed in real-world contexts. To overcome this limitation, we introduce the Tensorized Attention Model (TAM), which uses the Tucker decomposition to calculate attention weights across various object types and seamlessly integrates them into the Transformer models. Evaluations show that TAM significantly outperforms existing encoder methods, and its integration into the LoRA adapter for Llama2 enhances fine-tuning accuracy.

Abstract: Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. In fact, there are differences in semantic complexity across images. Images with richer semantics can tell vivid and engaging stories and offer a wide range of application scenarios. For example, the Cookie Theft picture is such a kind of image and is widely used to assess human language and cognitive abilities due to its higher semantic complexity. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for them. However, such images are scarce, highlighting the need for a greater number of them. For instance, there is a need for more images like Cookie Theft to cater to people from different cultural backgrounds and eras. Assessing semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image will be the first step of mining or generating more images with rich semantics, and benefit human cognitive assessment, Artificial Intelligence, and various other applications. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach.

Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China, Guangdong Neusoft University, Foshan, China, Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China, School of Software Engineering, South China University of Technology, Guangzhou, China, Department of Computer Science, University of Copenhagen, Denmark, College of Computer Science, Chongqing University, China, Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China

Abstract: Visual question generation (VQG) aims to generate questions from images automatically. While existing studies primarily focus on the quality of generated questions, such as fluency and relevance, the difficulty of the questions is also a crucial factor in assessing their quality. Question difficulty directly impacts the effectiveness of VQG systems in applications like education and humancomputer interaction, where appropriately challenging questions can stimulate learning interest and improve interaction experiences. However, accurately defining and controlling question difficulty is a challenging task due to its multidimensional and subjective nature. In this paper, we propose a new definition of the difficulty of questions, i.e., being positively correlated with the number of reasoning steps required to answer a question. For our definition, we construct a corresponding dataset and propose a benchmark as a foundation for future research. Our benchmark is designed to progressively increase the reasoning steps involved in generating questions. Specifically, we first extract the relationships among objects in the image to form a reasoning chain, then gradually increase the difficulty by rewriting the generated question to include more reasoning sub-chains. Experimental results on our constructed dataset show that our benchmark significantly outperforms existing baselines in controlling the reasoning chains of generated questions, producing questions with varying difficulty levels.

Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, China Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), China, Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, China Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), China, Ping An Property & Casualty Insurance Company of China, Ltd, Institute of Computing Technology, Chinese Academy of Sciences

Abstract: Model editing is a novel research topic in large language models (LLMs), aimed at efficiently handling various knowledge editing tasks. Since irrelevant knowledge is difficult to measure, existing editing methods often lack explicit ways to preserve it, especially for editing methods based on the finetuning paradigm. They generally control the locality performance of model editing by constraining the range of changes in model parameters. However, their performance improvements are not always ideal, and may even lead to a decrease in the editing reliability. In this paper, we try to explore effective editing locality control methods based on the relationship between the stored knowledge and the strongly associated model components. Based on the discovery of ``knowledge neurons'' and enough experimental results, we further explore the potential characteristics between knowledge and model components, confirm and point out: (1) only 1% neurons have significant contributions to specific knowledge storage, and (2) these targeted neurons often have a high overlap for knowledge with similar relational descriptions, which means that knowledge with similar relationships may be severely affected when these targeted neurons are modified. Based on these findings, we propose Targeted Neurons Fine-tuning with Data Augmentation (TNF-DA), which performs data augmentation based on the relational representation of edited knowledge to improve editing locality. By freezing most of the model parameters and only fine-tuning the highly contributing neurons corresponding to the edited knowledge, we obtain desirable results in terms of generalization and specificity compared with previous fine-tuning-based methods. Extensive experiments have demonstrated the superior editing performance achieved by our proposed method.

Provable Responsible AI and Data Analytics (PRADA) Lab King Abdullah University of Science and Technology Institute of Information Engineering, Chinese Academy of Sciences, Provable Responsible AI and Data Analytics (PRADA) Lab King Abdullah University of Science and Technology SDAIA-KAUST, King Abdullah University of Science and Technology University of Auckland, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Institute of Information Engineering, Chinese Academy of Sciences, Provable Responsible AI and Data Analytics (PRADA) Lab King Abdullah University of Science and Technology SDAIA-KAUST

Abstract: In this paper, we address the limitations of existing textto-image diffusion models in generating demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, resulting in biased image generation. To overcome this challenge, we propose Fair Mapping, a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model by controlling the prompt to achieve fair image generation. One key advantage of our approach is its high efficiency. It only requires updating an additional linear network with few parameters at a low computational cost. By developing a linear network that maps conditioning embeddings into a debiased space, we enable the generation of relatively balanced demographic results based on the specified text condition. With comprehensive experiments on face image generation, we show that our method significantly improves image generation fairness with almost the same image quality compared to conventional diffusion models when prompted with descriptions related to humans. By effectively addressing the issue of implicit language bias, our method produces more fair and diverse image outputs.

Abstract: Reliable failure detection holds paramount importance in safetycritical applications. Yet, neural networks are known to produce overconfident predictions for misclassified samples. As a result, it remains a problematic matter as existing confidence score functions rely on category-level signals, the logits, to detect failures. This research introduces an innovative strategy, leveraging human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model's confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activation to the input image. Without bells and whistles, our method is able to significantly reduce the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9.0% on EuroSAT.

Abstract: In a temporal graph the edge set dynamically changes over time according to a set of timelabels associated with each edge that indicates at which time-step the edge is available. Two vertices are connected if there is a path connecting them in which the edges are traversed in increasing order of their labels. We study the problem of scheduling the availability time of the edges of a temporal graph in such a way that all pairs of vertices are connected within a given maximum allowed time a and the overall number of labels is minimum. The problem, called Minimum Aged Labeling (MAL), has several applications in logistics, distribution scheduling, and information spreading in social networks, where carefully choosing the time-labels can significantly reduce infrastructure costs, fuel consumption, or greenhouse gases. Problem MAL has previously been proved to be NP-complete on undirected graphs and APX-hard on directed graphs. In this paper, we extend our knowledge on the complexity and approximability of MAL in several directions. We first show that the problem cannot be approximated within a factor better than O(log n) when a >= 2, unless P = NP, and a factor better than 2^[log^(1-ε) n] when a >= 3, unless NP is contained in DTIME(2^(polylog(n))), where n is the number of vertices in the graph. Then we give a set of approximation algorithms that, under some conditions, almost match these lower-bounds. In particular, we show that the approximation depends on a relation between a and the diameter of the input graph. We further establish a connection with a foundational optimization problem on static graphs called Diameter Constrained Spanning Subgraph (DCSS) and show that our hardness results also apply to DCSS.

Abstract: The application of graph neural networks (GNNs) to learn heuristic functions in classical planning is gaining traction. Despite the variety of methods proposed in the literature to encode classical planning tasks for GNNs, a comparative study evaluating their relative performances has been lacking. Moreover, some encodings have been assessed solely for their expressiveness rather than practical effectiveness in planning. This paper provides an extensive comparative analysis of existing encodings. Our results indicate that the smallest encoding based on Gaifman graphs, not yet applied in planning, outperforms the rest due to its fast evaluation times and the informativeness of the resulting heuristic. The overall coverage measured on the IPC almost reaches that of the stateof-the-art planner LAMA while exhibiting rather complementary strengths across different domains.

Abstract: Formulating a realworld problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (dis- counted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the bias-variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.

School of Computer Science and Engineering, South China University of Technology Institute for Super Robotics (Huangpu), School of Computer Science and Engineering, South China University of Technology Guangdong Provincial Key Laboratory of Multimodal Big Data Intelligent Analysis Peng Cheng Laboratory of Shenzhen, Institute for Super Robotics (Huangpu) Key Laboratory of Large-Model Embodied-Intelligent Humanoid Robot, School of College of Mathematics and Informatics, South China Agricultural University, School of Computer Science and Engineering, South China University of Technology Institute for Super Robotics (Huangpu), Institute for Super Robotics (Huangpu) Shien-Ming Wu School of Intelligent Engineering, South China University of Technology School of Automation Science and Engineering, South China University of Technology

Abstract: Languageconditioned robotic manipulation in unstructured environments presents significant challenges for intelligent robotic systems. However, due to partial observation or imprecise action prediction, failure may be unavoidable for learned policies. Moreover, operational failures can lead to the robotic arm entering an untrained state, potentially causing destructive results. Consequently, the ability to detect and self-correct failures is crucial for the development of practical robotic systems. To address this challenge, we propose a foresight-driven failure detection and self-correction module for robot manipulation. By leveraging 3D Gaussian Splatting, we represent the current scene with multiple Gaussians. Subsequently, we train a prediction network to forecast the Gaussian representation of future scenes conditioned on planned actions. Failure is detected when the predicted future significantly deviates from the real observation after action execution. In such cases, the end-effector rolls back to the previous action to avoid an untrained state. Integrating this approach with the PerACT framework, we develop a self-correcting robot manipulation policy. Evaluations on ten RLBench tasks with 166 variations demonstrate the superior performance of the proposed method, which outperforms state-of-the-art methods by 12.0% success rate on average.

Abstract: Multiagent pathfinding MAPF is a problem where multiple autonomous agents must find paths to their respective destinations without colliding. Decisional MAPF on undirected graphs can be solved in polynomial time; Several optimization MAPF variants however are NP-complete. The directed graph variant (diMAPF) is more complex, with its decisional version already being NP-complete. This paper examines the computational approximability of optimal MAPF problems (i.e., minimizing makespan for agent travel distance and maximizing the total number of agents reaching their goals), providing a first set of several inapproximability results for these problems. The results reveal an inherent limitation in approximating optimal solutions for MAPFs, provide a deeper understanding regarding their computational intractability, thus offer foundational references for future research.

Abstract: Probabilities of causation (PoC) offer valuable insights for informed decisionmaking. This paper introduces novel variants of PoC-controlled direct, natural direct, and natural indirect probability of necessity and sufficiency (PNS). These metrics quantify the necessity and sufficiency of a treatment for producing an outcome, accounting for different causal pathways. We develop identification theorems for these new PoC measures, allowing for their estimation from observational data. We demonstrate the practical application of our results through an analysis of a real-world psychology dataset.

Abstract: We initiate the study of tree structures in the context of scenariobased robust optimization. Specifically, we study Binary Search Trees (BSTs) and Huffman coding, two fundamental techniques for efficiently managing and encoding data based on a known set of frequencies of keys. Given a number of distinct scenarios, each defined by a frequency distribution over the keys, our objective is to compute a single tree of best-possible performance, relative to any scenario. We consider, as performance metrics, the competitive ratio, which compares multiplicatively the cost of the solution to the tree of least cost among all scenarios, as well as the regret, which induces a similar, but additive comparison. For BSTs, we show that the problem is NP-hard across both metrics. We also obtain an optimal competitive ratio that is logarithmic in the number of scenarios. For Huffman Trees, we likewise prove NP-hardness, and we present an algorithm with logarithmic regret, which we prove to be near-optimal by showing a corresponding lower bound. Last, we give a polynomial-time algorithm for computing Pareto-optimal BSTs with respect to their regret, assuming scenarios defined by uniform distributions over the keys. This setting captures, in particular, the first study of fairness in the context of data structures. We provide an experimental evaluation of all algorithms. To this end, we also provide mixed integer linear program formulation for computing optimal trees.

Abstract: The emergence of powerful LLMs has led to a paradigm shift in Natural Language Understanding and Natural Language Generation. But the properties that make LLMs so valuable for these tasks creativity, ability to produce fluent speech, and ability to quickly and effectively abstract information from large corpora -- also present new challenges to evaluating their outputs. The rush to market has led teams to fall back on quick, cost-effective automatic evaluations which offer value, but do not obviate the need for human judgments in model training and evaluation. We argue that when end users need to agree with the decisions made by ML models -- e.g. in toxicity detection or in extraction of main points for summarization -- models should be trained and tested on data that represent the preferences of those users. This paper primarily discusses the role of human feedback in labeling and judgment tasks for model training and evaluation. We first propose methods for disentangling noise from signal in labeling tasks. We show that noise in labeling disagreement can be minimized by adhering to proven methodological best practices, while signal in labeling disagreement can be maximized to play an integral role in model training and evaluation tasks. We illustrate best practices by providing a case study in which two guardrail classifiers are evaluated, using human judgments to align final model behavior to user preferences. We aim for this paper to provide researchers and professionals with guidelines to integrating human judgments into their ML and generative AI evaluation toolkit when working toward achieving accurate and unbiased features that align with users’ needs and expectations.

Abstract: Recent years have witnessed rapid advancements in the safety alignments of large language models (LLMs). Methods such as supervised instruction finetuning (SFT) and reinforcement learning with human feedback (RLHF) have thus emerged as vital components in constructing LLMs. While these methods achieve robust and fine-grained alignment to human values, their practical application is still hindered by high annotation costs and incomplete human alignments. Besides, the intrinsic human values within training corpora have not been fully exploited. To address these issues, we propose ISAAC (Intrinsically Supervised Alignments by Assessing Corpus), a primary and coarse-grained safety alignment strategy for LLMs. ISAAC only relies on a prior assumption about the text corpus, and does not require preferences in RLHF or human responses selection in SFT. Specifically, it assumes a long-tail distribution of text corpus and employs a specialized sampling strategy to automatically sample high-quality responses. Theoretically, we prove that this strategy can improve the safety of LLMs under our assumptions. Empirically, our evaluations on mainstream LLMs show that ISAAC achieves a safety score comparable to current SFT solutions. Moreover, we conduct experiments on ISAAC for some RLHF-based LLMs, where we find that ISAAC can even improve the safety of these models under specific safety domains. These findings demonstrate that ISAAC can provide preliminary alignment to LLMs, thereby reducing the construction costs of existing human-feedback-based methods.

Abstract: When LLMs are deployed in sensitive, humanfacing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as ``Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space or other single character token to the end of a model's input. In a study of a variety of open-source models, we demonstrate that this simple perturbation is able to cause the majority of models to generate harmful outputs with very high probability. We further find that both Claude and GPT-3.5 demonstrate the same behavior. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models answer in lists or other formatted responses, overriding training signals to refuse unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.

Abstract: This paper addresses the problem of preference learning, which aims to align robot behaviors through learning userspecific preferences (e.g. “good pull-over location”) from visual demonstrations. Despite its similarity to learning factual concepts (e.g. “red door”), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a novel framework called SYNAPSE, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited data. SYNAPSE represents preferences as neuro-symbolic programs – facilitating inspection of individual parts for alignment – in a domain-specific language (DSL) that operates over images and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We perform extensive evaluations on various preferential concepts as well as user case studies demonstrating its ability to align well with dissimilar user preferences. Our method significantly outperforms baselines, especially when it comes to out-of-distribution generalization. We show the importance of the design choices in the framework through multiple ablation studies.

Abstract: Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if defenses can improve KataGo's worstcase performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that though some of these defenses protect against previously discovered attacks, none withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even for superhuman systems in narrow domains like Go.

Abstract: I am a person and so are you. Philosophically we sometimes grant personhood to nonhuman animals, and entities such as sovereign states or corporations can legally be considered persons. But when, if ever, should we ascribe personhood to AI systems? In this paper, we outline necessary conditions for AI personhood, focusing on agency, theory-of-mind, and self-awareness. We discuss evidence from the machine learn- ing literature regarding the extent to which contemporary AI systems, such as language models, satisfy these conditions, finding the evidence surprisingly inconclusive. If AI systems can be considered persons, then typical fram- ings of AI alignment may be incomplete. Whereas agency has been discussed at length in the literature, other aspects of per- sonhood have been relatively neglected. AI agents are often assumed to pursue fixed goals, but AI persons may be self- aware enough to reflect on their aims, values, and positions in the world and thereby induce their goals to change. We highlight open research directions to advance the understand- ing of AI personhood and its relevance to alignment. Finally, we reflect on the ethical considerations surrounding the treat- ment of AI systems. If AI systems are persons, then seeking control and alignment may be ethically untenable.

Abstract: Given that AI systems are set to play a pivotal role in future decisionmaking processes, their trustworthiness and reliability are of critical concern. Due to their scale and complexity, modern AI systems resist direct interpretation, and alternative ways are needed to establish trust in those systems, and determine how well they align with human values. We argue that good measures of the information processing similarities between AI and humans, may be able to achieve these same ends. While Representational alignment (RA) approaches measure similarity between the internal states of two systems, the associated data can be expensive and difficult to collect for human systems. In contrast, Behavioural alignment (BA) comparisons are cheaper and easier, but questions remain as to their sensitivity and reliability. We propose two new behavioural alignment metrics misclassification agreement which measures the similarity between the errors of two systems on the same instances, and class-level error similarity which measures the similarity between the error distributions of two systems. We show that our metrics correlate well with RA metrics, and provide complementary information to another BA metric, within a range of domains, and set the scene for a new approach to value alignment.

Abstract: Water temperature can vary substantially even across short distances within the same subwatershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, ≤ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we have broken free from the mindset that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.

Abstract: Deep learning models commonly benefit from data augmentation techniques to diversify the set of training images. When working with satellite imagery, it is common for practitioners to apply a limited set of transformations developed for natural images (e.g., flip and rotate) to expand the training set without overly modifying the satellite images. There are many techniques for natural image data augmentation, but given the differences between the two domains, it is not clear whether data augmentation methods developed for natural images are well suited for satellite imagery. This paper presents an extensive experimental study on three classification and three regression tasks over four satellite image datasets. We compare common computer vision data augmentation techniques and propose three novel satellitespecific data augmentation strategies. Across tasks and datasets, we find that geometric transformations are beneficial for satellite imagery while color transformations generally are not. Additionally, our novel Sat-SlideMix, Sat-CutMix, and Sat-Trivial methods all exhibit strong performance across all tasks and datasets.

Abstract: Accurate precipitation forecasting is crucial for early warnings of disasters, such as floods and landslides. Traditional forecasts rely on groundbased radar systems, which are space-constrained and have high maintenance costs. Consequently, most developing countries depend on a global numerical model with low resolution, instead of operating their own radar systems. To mitigate this gap, we propose the Neural Precipitation Model (NPM), which uses global-scale geostationary satellite imagery. NPM predicts precipitation for up to six hours, with an update every hour. We input three key channels to discriminate rain clouds: infrared radiation (at a wavelength of 10.5 µm), upper- (6.3 µm), and lower- (7.3 µm) level water vapor channels. Additionally, NPM introduces positional encoders to capture seasonal and temporal patterns, reflecting variations in precipitation. Our experimental results demonstrate that NPM can predict rainfall in real-time with a resolution of 2 km.

Abstract: Drought is one of the most destructive and expensive natural disasters, severely impacting natural resources and risks by depleting water resources and diminishing agricultural yields. Under climate change, accurately predicting drought is critical for mitigating droughtinduced risks. However, the intricate interplay among the physical and biological drivers that regulate droughts limits the predictability and understanding of drought, particularly at a subseasonal to seasonal (S2S) time scale. While deep learning has demonstrated the potential to address climate forecasting challenges, its application to drought prediction has received relatively less attention. In this work, we propose a new dataset, DroughtSet, which integrates relevant predictive features and three drought indices from multiple remote sensing and reanalysis datasets across the contiguous United States (CONUS). DroughtSet specifically provides the machine learning community with a new real-world dataset to benchmark drought prediction models and more generally, time-series forecasting methods. Furthermore, we propose a spatial-temporal model SPDrought to predict and interpret S2S droughts. Our model learns from the spatial and temporal information of physical and biological features to predict three types of droughts simultaneously. Multiple strategies are employed to quantify the importance of physical and biological features for drought prediction. Our results provide insights for researchers to better understand the predictability and sensitivity of drought to biological and physical conditions. We aim to contribute to the climate field by proposing a new tool to predict and understand the occurrence of droughts and provide the AI community with a new benchmark to study deep learning applications in climate science.

Abstract: Humanin-the-loop (HIL) systems have emerged as a promising approach for combining the strengths of data-driven machine learning models with the contextual understanding of human experts. However, a deeper look into several of these systems reveals that calling them HIL would be a misnomer, as they are quite the opposite, namely AI-in-the-loop (AI2L) systems: the human is in control of the system, while the AI is there to support the human. We argue that existing evaluation methods often overemphasize the machine (learning) component's performance, neglecting the human expert's critical role. Consequently, we propose an AI2L perspective, which recognizes that the human expert is an active participant in the system, significantly influencing its overall performance. By adopting an AI2L approach, we can develop more comprehensive systems that faithfully model the intricate interplay between the human and machine components, leading to more effective and robust AI systems.

Abstract: In this paper, I review approaches for acquiring hierarchical knowledge to improve the effectiveness of planning systems. First I note some benefits of such hierarchical content and the advantages of learning over manual construction. After this, I consider alternative paradigms for encoding and acquiring plan expertise before turning to hierarchical task networks. I specify the inputs to HTN learners and three subproblems they must address: identifying hierarchical structure, unifying method heads, and finding method conditions. Finally, I pose seven challenges the community should pursue so that techniques for learning HTNs can reach their full potential.

Abstract: The ability to reason at multiple levels of temporal abstraction is a fundamental aspect of intelligence. In reinforcement learning (RL), this attribute is often modelled through temporally extended courses of actions called options. In this talk, I will introduce a general framework for option discovery, which uses the agent's representation to discover useful options. By leveraging these options to generate a rich stream of experience, the agent can improve its representations and learn more effectively. This representationdriven option discovery approach creates a virtuous cycle of refinement, continuously improving both the representation and options, and it is particularly effective for problems where agents need to operate at varying levels of abstraction to succeed.

Abstract: Machine learning techniques are notably vulnerable to natural or adversarial perturbations, which can lead to catastrophic failures with significant economic, ethical, and societal risks. In this New Faculty Highlight Talk, I will showcase my research on harnessing robust statistics to build robust and trustworthy AI systems. Specifically, I will highlight my research breakthroughs in graph learning (GNNs), large language models (LLMs), deep equilibrium models (DEQs), and general deep representation learning. These breakthroughs stem from a unified and principled robust statistics framework that incorporates robustness as the core inductive bias in deep learning architecture. This approach has enabled significant improvements in intrinsic robustness and generalization, even in complex and challenging environments. My research demonstrates the transformative potential of harnessing robust statistics in enhancing the robustness and trustworthiness of AI systems. Looking forward, I will continue to push this frontier by advocating the design of robustnessinformed neural networks across various areas.

Abstract: Artificial Intelligence (AI) has revolutionized fields like computer vision and natural language processing, yet its impact on robotics remains limited by challenges in longhorizon decision-making and complex physical interactions. My research pioneers robot learning algorithms that exploit (predict, perceive, plan, and reason about) physical interaction as a core component of artificial intelligence, pushing beyond passive solutions in domains such as perception, navigation, and manipulation. By leveraging techniques in imitation learning and hierarchical reinforcement learning, my work empowers robots to learn from human demonstrations, navigate interactively in real-world environments, and gather information through purposeful interactions. In my talk, I will explain how these advances are critical for robots to become useful helpers in human environments, opening the door to the next generation of household robots. I will present several AI algorithmic innovations to integrate physical interactions in computation procedures and outline the path toward developing continually learning robots capable of operating autonomously in unstructured human environments, enhancing their utility as adaptable and intelligent assistants.

Abstract: We describe an application that uses large language models to generate structured documents related to industrial equipment, specifically focusing on Failure Modes and Effects Analysis (FMEAs). Our novel application uses techniques in structured document generation, incontext learning, and ensembling to create high-quality structured content that subject matter experts supervise through a user-centric interface that presents FMEA entities as UI elements. Novel evaluation metrics for structured document generation are also proposed. Our empirical results, based on 71 asset evaluations, demonstrate the individual and combined contributions of these techniques, with an overall effectiveness that varies between a recall of 0.669 and a precision of 0.91. Qualitative feedback from target users validates the practicality of the described approach to seamlessly integrate expert supervision with generative AI in a labour-saving workflow.

Abstract: We present ECLAIR (Enhanced CLArification for Interactive Responses), a novel unified and endto-end framework for interactive disambiguation in enterprise AI assistants. ECLAIR generates clarification questions for ambiguous user queries and resolves ambiguity based on the user's response. We introduce a generalized architecture capable of integrating ambiguity information from multiple downstream agents, enhancing context-awareness in resolving ambiguities and allowing enterprise specific definition of agents. We further define agents within our system that provide domain-specific grounding information. We conduct experiments comparing ECLAIR to few-shot prompting techniques and demonstrate ECLAIR's superior performance in clarification question generation and ambiguity resolution.

Abstract: This doctoral dissertation establishes and addresses frontiers in graph generation. I first apply a Graph Neural Network (GNN) model on social network data, a new domain, to establish what frontiers exist for graph generators. I establish that GNN models are currently limited in the diversity of feature sets that they can produce, the variety of graph structure types they can generate, and highly limited in the size of generated graphs. Further, I find that the quality metrics available for graph generation are aggregatebased and un-expressive. To address the issue of scale I propose Hierarchical Generation of Graphs (HiGGs), a framework for producing graphs orders of magnitude larger than is possible with a single model. As a step towards more expressive metrics I develop Topology only Pre-training (ToP), a pre-training framework for graph models that is capable of representing multiple domains of graphs simultaneously, without relying on tertiary models in downstream applications. The next stage of research will adapt ToP as a model based metric for graph generators.

Abstract: Standards, or expertdefined preferences, are documented guidelines describing strict specifications for text-based content such as books, manuals, and reports. These guidelines are curated, defined, and continuously improved by domain experts in various fields, such as education, policy, and healthcare, and are used for maintaining quality. In my dissertation, I focus on evaluating and teaching large language models (LLMs) to capture standards to improve generation quality across diverse language generation tasks. I draw motivation from my preliminary published works, where I explored how open and commercial LLMs can learn complex constraints from standards in education and language assessment to produce classroom-ready narrative content. In this proposal, I also discuss the technical novelty, impact, and target contributions and highlight how this line of work can be scaled and generalized for other domains where standards are also used as a reference of quality.

Abstract: We now live in a world where we can reach people directly through social media, without relying on traditional media such as television and radio. On the other hand, social media platforms collect vast amounts of data and create very specific profiles of different users through targeted advertising. Various interest groups, including politicians, advertisers, and stakeholders, utilize these platforms to target potential users to advance their interests by adapting their messaging. This process, known as microtargeting, relies on datadriven techniques that exploit the rich information collected by social networks about their users. Microtargeting is a double-edged sword. It enhances the relevance and efficiency of targeted content, can influence people to take action based on personal beliefs. This could be great, increasing the relevance based on users to help guide people in making better health decisions and offering them opportunities for career growth. On the other hand, it can influence people to make decisions against their own interests, foster echo chambers, and increase polarization. My research is motivated by the fact that some of these risks can be mitigated by providing transparency, identifying conflicting or harmful messaging choices, and indicating bias introduced in messaging in a nuanced way. I provide computational frameworks to analyze microtargeting patterns, which will help policymakers make better decisions. This is crucial for promoting healthy public discourse in the digital age and maintaining a cohesive society.

Abstract: Mobility data from smartphones, connected cars, and GPS devices are widely used for tasks such as transportation mode classification and suspicious movement detection. Time series research, a closely related field, focuses more on classification methods. Yet, Mobility Data analysis faces unique challenges like geographic transferability and limited public data due to privacy issues. My PhD work focuses on developing reusable, interpretable MD representations. I created Trajectory Interval Forest and later Geolet, a shapeletbased transformation to improve MD classification across geographic regions. Ongoing research explores improving geographic transferability and event-based trajectory clustering.

Abstract: Efficient humanagent collaboration requires understanding each other’s capabilities and establishing appropriate reliance. My thesis focuses on optimizing performance in mixed-initiative settings, where humans and agents dynamically contribute to decisions and actions. I first explore key factors shaping human reliance on decision-support agents, then examine how agents can model this reliance to initiate actions. My proposed work aims to enable agents to jointly provide decision and action support in multi-objective tasks, using bi-directional communication to enhance collaboration.

Abstract: My thesis primarily focuses on hyperspectral image generation from frequency spectrums for downstream computer vision tasks. Hyper-spectral images are images with more than three channels commonly created by special hyper-spectral cameras or from frequency spectrums of various sensing applications such as radargrams or distributed acoustic sensing (DAS) systems. The range of frequencies considered in a frequency spectrum is typically too large to map one frequency to one image channel, i.e. we generally consider a frequency spectrum of 2500 Hz. Frequencies need to be binned together in frequency bands where each band forms one image channel. Usually, frequency bands are created either by expert knowledge or trial-and-error. I research how filters can be trained to automatically select frequencies and bin them into frequency bands. My aim is to represent a variety of signal information and decrease noise. Signal representation is optimised for object detection on time-sequenced images with a set number of image channels. The object detection task consists of localising and classifying events in the generated hyper-spectral images. Events are typically types of intrusions, structural changes, or defined actions and structures, e.g. someone climbing a fence. Events and noise often share at least some frequencies and vary between application types.

Abstract: While machine learning (ML) models of today have the potential to be useful in many societal applications, they also harbor the potential for great harm, be it perpetuating biases or compromising privacy. To prevent these harms, many (evolving) regulatory guardrails have been put in place; for instance European Union's GDPR and Biden's Executive Order which demand explainability, privacy, fairness and so on from models deployed in societal applications. Yet, most technical solutions in the Trustworthy ML literature which claim to meet these regulatory requirements are brittle and often fail at the task in hand. To this end, my research aims to make the field of Trustworthy ML reliable using mainstay concepts of Measurement, Mitigation and Maintenance. With these concepts, I develop endto-end solutions for trustworthy ML by (1) exploring the limitations of existing approaches and (2) providing principled novel solutions exploiting interconnections with cryptography.

Abstract: Continual reinforcement learning (CRL) is the study of optimal strategies for maximizing rewards in sequential environments that change over time. This is particularly crucial in domains such as robotics, where the operational environment is inherently dynamic and subject to continual change. Nevertheless, research in this area has thus far concentrated on offpolicy algorithms with replay buffers that are capable of amortizing the impact of distribution shifts. Such an approach is not feasible with on-policy reinforcement learning algorithms that learn solely from the data obtained from the current policy. In this paper, we examine the performance of proximal policy optimization (PPO), a prevalent on-policy reinforcement learning (RL) algorithm, in a classical CRL benchmark. Our findings suggest that the current methods are suboptimal in terms of average performance. Nevertheless, they demonstrate encouraging competitive outcomes with respect to forward transfer and forgetting metrics. This highlights the need for further research into continual on-policy reinforcement learning. The source code is available at https://github.com/Teddy298/continualworld-ppo.

Abstract: Recent advancements in single image superresolution have been predominantly driven by token mixers and transformer architectures. WaveMixSR utilized the WaveMix architecture, employing a two-dimensional discrete wavelet transform for spatial token mixing, achieving superior performance in super-resolution tasks with remarkable resource efficiency. In this work, we present an enhanced version of the WaveMixSR architecture by (1) replacing the traditional transpose convolution layer with a pixel shuffle operation and (2) implementing a multistage design for higher resolution tasks (4x). Our experiments demonstrate that our enhanced model -- WaveMixSR-V2 -- outperforms other architectures in multiple super-resolution tasks, achieving state-of-the-art for the BSD100 dataset, while also consuming fewer resources and exhibiting higher parameter efficiency and throughput.

Abstract: Topone recommendation with anonymous user behaviors, also known as session-based recommendation (SBR), faces challenges of top-one ranking and short anonymous sequences. To this end, we propose a novel objective that combines (1) a reciprocal rank loss to directly optimize the benchmark metric of top-one recommendation, with (2) a listwise contrastive loss to handle short sequences through listwise augmented consistency regularization. Empirical studies demonstrate that optimizing the proposed objective significantly improves the performance of existing SBR baselines.

Abstract: Attempting to align AI capabilities and value structures by means of value elicitation from humans, such as through Reinforcement Learning from Human Feedback (RLHF), is a computational challenge that raises both psychological and philosophical questions. Adopting an evolutionary perspective on the emergence of value structures in humans and machine learning systems can offer a bridge between qualitative and quantitative aspects of alignment. Here, evolutionary dynamics are applied to a gametheoretic model of RLHF. This allows for formal reasoning about the process and capabilities that result from alignment training, even where quantitative benchmarks cannot be clearly defined. A simple parametrized game model of RLHF, subject to replicator dynamics, shows how the success of the training method is sensitive to bias in human judgments. Under ideal conditions, RHLF training leads to aligned behavior. If the choice pattern of the human judge is biased, the training instead incentivizes misalignment. This application shows that evolutionary analyses can contribute to improving the prospects for safety and support successful cooperation between humans and AI systems in deployment.

Abstract: Due to increasing privacy regulations and regulatory compliance, Machine Unlearning (MU) has become essential. The goal of unlearning is to remove information related to a specific class from a model. Traditional approaches achieve exact unlearning by retraining the model on the remaining dataset, but incur high computational costs. This has driven the development of more efficient unlearning techniques, including model sparsification techniques, which boost computational efficiency, but degrade the model’s performance on the remaining classes. To mitigate these issues, we propose a novel method, PruneLoRA which introduces a new MU paradigm, termed prune first, then adapt, then unlearn. LoRA reduces the need for largescale parameter updates by applying low-rank updates to the model. We leverage LoRA to selectively modify a subset of the pruned model’s parameters, thereby reducing the computational cost, memory requirements and improving the model’s ability to retain performance on the remaining classes. Experimental Results across various metrics showcase that our method outperforms other approximate MU methods and bridges the gap between exact and approximate unlearning. Our code is available at https://github.com/vlgiitr/LoRA-Unlearn.

Abstract: Conformal Prediction (CP) is an uncertainty quantification framework that provides prediction sets with a userspecified probability to include the true class in the prediction set. This guarantee on the user-specified probability is known as marginal coverage. Marginal coverage refers to the probability that the true label is included in the prediction set, averaged over all test samples. However, this can lead to inconsistent coverage across different classes, constraining its suitability for high-stakes applications such as pathological workflows. This study implements a Classwise CP method applied to two cancer datasets to achieve class-conditional coverage which ensures that each class has a user-specified probability of being included in the prediction set when it is the true label. Our results demonstrate the effectiveness of this approach through a significant reduction in the average class coverage gap compared to the Baseline CP method.

Abstract: Counterfactual explanations in Explainable AI (XAI) identify which features to change to alter an outcome, but existing methods adjust only the features of a single agent. We present a new approach to reevaluating rankings that is based on predictions of future features of the other agents in a ranking system. It uses an algorithm that provides a more realistic counterfactual explanation of changing the ranking of a particular agent. Computer experiments demonstrated that the proposed algorithm can capture the time variation of the entire ranking system in the inference results.

Abstract: Semantic segmentation of marine environments is essential for autonomous navigation of unmanned surface vessels (USVs) as well as the detection of environmental hazards such as oil spills. To tackle the challenges of accurate environmental perception, we propose a lightweight semantic segmentation network, LAqua (Laplacians for Aquatic Segmentation), which leverages Laplacian pyramids to enhance edge detection in marine imagery. Our method drastically reduces computational requirements while maintaining high accuracy in generating semantic masks for marine environments. We evaluate LAqua on two distinct datasets: one focused on detecting oil spills in port environments and another on environmental segmentation for USVs. Results show that LAqua not only performs well across varied marine settings but also achieves comparable or superior segmentation accuracy with far fewer parameters than other models. This efficiency highlights LAqua's potential for applications in realtime detection for marine environments.

Abstract: Neural networks have emerged as powerful tools across various applications, yet their decisionmaking process often remains opaque, leading to them being perceived as "black boxes." This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby provide valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we encode the conditioning label information into vectors and intermediate matrices and further minimize the cosine similarity between features of the generated images.

Abstract: In this paper, we present a new image similarity search algorithm designed to enhance traditional information retrieval(IR) by adding an image search capability. Our approach uses a quadtree data structure to organize image data, significantly reducing search space and improving retrieval efficiency. We describe an indexing strategy and two query algorithms that can be implemented in any IR system. We tested our method on a 70K material microscopy image dataset, achieving a 25 times improvement in retrieval speed with only a 20% reduction in ranking accuracy.

Abstract: This paper proposes MAFT, a novel multimodal automated factchecking system capable of handling content in any combination of text, images, videos, and audio. The core idea behind our system is the textualization of multimodal content using various machine learning techniques. MAFT comprehensively analyzes this textualized content along with external information collected via web APIs by large language models (LLMs). MAFT generates interpretable fact-checking reports that include not only verification results but also a detailed verification process. With its adaptability and ability to automatically verify multimodal content, MAFT contributes to the fight against the spread of multimodal misinformation.

Abstract: Graph neural networks (GNNs) have recently been adapted to temporal settings, often employing temporal versions of the messagepassing mechanism known from GNNs. We divide temporal message passing mechanisms from literature into two main types: global and local, and establish Weisfeiler-Leman characterisations for both. This allows us to formally analyse expressive power of temporal message-passing models. We show that global and local temporal message-passing mechanisms have incomparable expressive power when applied to arbitrary temporal graphs. However, the local mechanism is strictly more expressive than the global mechanism when applied to colour-persistent temporal graphs, whose node colours are initially the same in all time points. Our theoretical findings are supported by experimental evidence, underlining practical implications of our analysis.

Abstract: Dimension reduction (DR) algorithms have proven to be extremely useful for gaining insight into largescale high-dimensional datasets, particularly finding clusters in transcriptomic data. The initial phase of these DR methods often involves converting the original high-dimensional data into a graph. In this graph, each edge represents the similarity or dissimilarity between pairs of data points. However, this graph is frequently suboptimal due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data. This problem is exacerbated as the dataset size increases. If we reduce the size of the dataset by selecting points for a specific sections of the embeddings, the clusters observed through DR are more separable since the extracted subgraphs are more reliable. In this paper, we introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address this challenge. By dynamically extracting subgraphs and updating the graph on-the-fly, LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or combine. We demonstrate the benefits of LocalMAP through a case study on biological datasets, highlighting its utility in helping users more accurately identify clusters for real-world problems.

Abstract: Multiinstance multi-label classification (MIML) is a fundamental task in machine learning, where each data sample comprises a bag containing several instances and multiple binary labels. Despite its wide applications, the data collection process involves matching multiple instances and labels, typically resulting in high annotation costs. In this paper, we study a novel yet practical crowdsourced multi-instance multi-label classification (CMIML) setup, where labels are collected from multiple crowd sources. To address this problem, we first propose a novel data generation process for CMIML, i.e., cross-label transition, where cross-label annotation error is more likely to appear rather than previous single-label transition assumption, due to the inherent similarity of localized instances from different classes. Then, we formally define the cross-label transition by cross-label transition matrices which are dependent across classes. Subsequently, we establish the first unbiased risk estimator for CMIML and further improve it through aggregation techniques, along with a rigorous generalization error bound. We also provide a practical implementation of cross-label transition matrix estimation. Comprehensive experiments on six benchmark datasets under various scenarios demonstrate that our algorithm outperforms the baselines by a large margin, validating its effectiveness in handling the CMIML problem.

Abstract: Integrating AI in healthcare can greatly improve patient care and system efficiency. However, the lack of explainability in AI systems (XAI) hinders their clinical adoption, especially in multimodal decisionmaking that combines various data sources. The majority of existing XAI methods focus on unimodal models, which fail to capture cross-modal interactions that are crucial for understanding the combined impact of multiple data sources. Existing methods for quantifying cross-modal interactions are limited to two modalities, rely on labelled data, and depend on model performance, which is problematic in healthcare, where XAI must handle multiple data sources and provide individualised explanations. This paper introduces InterSHAP, a cross-modal interaction score that addresses the limitations of existing approaches. InterSHAP uses the Shapley interaction index to precisely separate and quantify the contributions of the individual modalities and their interactions without approximations. By integrating an open-source implementation with the SHAP package, we enhance reproducibility and ease of use. We show that InterSHAP accurately measures the presence of cross-modal interactions, can handle multiple modalities, and provides detailed explanations at a local level for individual data points. Furthermore, we apply InterSHAP to real medical multimodal datasets, and demonstrate its practical applicability for individualised explanations.

Abstract: Motivated by the settings where sensing the entire tensor is infeasible, this paper proposes a novel tensor compressed sensing model, where measurements are only obtained from sensing each lateral slice via mutually independent matrices. Leveraging the low tubal rank structure, we reparameterize the unknown tensor ?* using two compact tensor factors and formulate the recovery problem as a nonconvex minimization problem. To solve the problem, we first propose an alternating minimization algorithm, termed AltPGD-Min, that iteratively optimizes the two factors using a projected gradient descent and an exact minimization step, respectively. Despite nonconvexity, we prove that Alt-PGD-Min achieves ϵ-accuracy recovery with ?(?²log1/?) iteration complexity and ?(?⁶rn₃logn₃(?²r(n₁+n₂)+n₁log1/ε)) sample complexity, where ? denotes tensor condition number of ?*. To further accelerate the convergence, especially when the tensor is ill-conditioned with large ?, we prove Alt-ScalePGD-Min that preconditions the gradient update using an approximate Hessian that can be computed efficiently. We show that Alt-ScalePGD-Min achieves ? independent iteration complexity ?(log1/ε) and improves the sample complexity to ?(?⁴rn₃log n₃(?⁴ r(n₁ + n₂)+n₁log 1/ε)). Experiments validate the effectiveness of the proposed methods.

Abstract: This paper aims to recover a multisubspace matrix from permuted data: given a matrix, in which the columns are drawn from a union of low-dimensional subspaces and some columns are corrupted by permutations on their entries, recover the original matrix. The task has numerous practical applications such as data cleaning, integration, and de-anonymization, but it remains challenging and cannot be well addressed by existing techniques such as robust principal component analysis because of the presence of multiple subspaces and the permutations on the elements of vectors. To solve the challenge, we develop a novel four-stage algorithm pipeline including outlier identification, subspace reconstruction, outlier classification, and unsupervised sensing for permuted vector recovery. Particularly, we provide theoretical guarantees for the outlier classification step, ensuring reliable multi-subspace matrix recovery. Our pipeline is compared with state-of-the-art competitors on multiple benchmarks and shows superior performance.

Abstract: In recent years, multiview learning has aroused extensive research passion. Most existing multi-view learning methods often rely on well-annotations to improve decision accuracy. However, noise labels are ubiquitous in multi-view data due to imperfect annotations. To deal with this problem, we propose a novel noisy label calibration method (NLC) for multi-view classification to resist the negative impact of noisy labels. Specifically, to capture consensus information from multiple views, we employ max-margin rank loss to reduce the heterogeneous gap. Subsequently, we evaluate the confidence scores to enrich predictions associated with noise instances according to all reliable neighbors. Further, we propose Label Noise Detection (LND) to separate multi-view data into a clean or noisy subset, and propose Label Calibration Learning (LCL) to correct noisy instances. Finally, we adopt the cross-entropy loss to achieve multi-view classification. Extensive experiments on six datasets validate that our method outperforms eight state-of-the-art methods.

Abstract: This paper investigates the connection between neural networks and sufficient dimension reduction (SDR), demonstrating that neural networks inherently perform SDR in regression tasks under appropriate rank regularizations. Specifically, the weights in the first layer span the central mean subspace. We establish the statistical consistency of the neural networkbased estimator for the central mean subspace, underscoring the suitability of neural networks in addressing SDR-related challenges. Numerical experiments further validate our theoretical findings, and highlight the underlying capability of neural networks to facilitate SDR compared to the existing methods. Additionally, we discuss an extension to unravel the central subspace, broadening the scope of our investigation.

Abstract: Label correction methods are popular for their simple architecture in learning with noisy labels. However, they suffer severely from false label correction and achieve subpar performance compared with stateof-the-art methods. In this paper, we revisit the label correction methods through theoretical analysis of gradient scaling and demonstrate that the sample-wise dynamic and class-wise uniformity of interpolation weight prevents memorization of the mislabeled samples. We then propose DULC, a simple yet effective label correction method that uses the normalized Jensen-Shannon divergence (JSD) metric as the interpolation weight to promote sample-wise dynamic and class-wise uniformity. Additionally, we provide theoretical evidence that sharpening predictions in label correction facilitates the memorization of true class, and we achieve it by employing the augmentation strategy along with the sharpening function. Extensive experiments on CIFAR-10, CIFAR-100, TinyImageNet, WebVision and Clothing1M datasets demonstrate substantial improvements over state-of-the-art methods.

Abstract: Parameterefficient fine-tuning (PEFT) has been widely employed for domain adaptation, with LoRA being one of the most prominent methods due to its simplicity and effectiveness. However, in multi-task learning (MTL) scenarios, LoRA tends to obscure the distinction between tasks by projecting sparse high-dimensional features from different tasks into the same dense low-dimensional intrinsic space. This leads to task interference and suboptimal performance for LoRA and its variants. To tackle this challenge, we propose MTL-LoRA, which retains the advantages of low-rank adaptation while significantly enhancing MTL capabilities. MTL-LoRA augments LoRA by incorporating additional task-adaptive parameters that differentiate task-specific information and capture shared knowledge across various tasks within low-dimensional spaces. This approach enables pretrained models to jointly adapt to different target domains with a limited number of trainable parameters. Comprehensive experimental results, including evaluations on public academic benchmarks for natural language understanding, commonsense reasoning, and image-text understanding, as well as real-world industrial text Ads relevance datasets, demonstrate that MTL-LoRA outperforms LoRA and its various variants with comparable or even fewer learnable parameters in MTL setting.

Abstract: Rumor detection is critical as the spread of misinformation on social media threatens social stability. The propagation structure has garnered attention for its ability to capture discriminative information, such as crowd stance, which has led to the development of enhanced detection methods. However, these detectors are vulnerable to attacks that can manipulate results and evade detection, potentially disrupting public order or influencing public opinion. While adversarial attacks on rumor detectors have been studied, the use of backdoor attacks—an evasive and powerful method—remains unexplored due to the challenges in applying them to propagation trees. In this paper, we introduce the first backdoor attack framework against propagationbased rumor detectors, designed to maintain overall detector performance while enabling targeted attacks on specific rumors. We propose an adaptive discrete trigger generator that injects trigger nodes into critical nodes, creating evasive, transferable attacks. Extensive experiments on three real-world rumor datasets demonstrate that our framework effectively undermines the performance of propagation-based rumor detectors and is transferable across different architectures.

Abstract: Trajectory generation has garnered significant attention from researchers in the field of spatiotemporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

Abstract: Hallucination detection has attracted considerable interest due to the tendency of language models to generate texts that contain hallucinations. Most existing methods start with specific local details directly extracted from text, then aggregate to form the final conclusion. However, this direct extraction approach ignores the global context, leading to isolated details, and is prone to missed or overdetections. In this paper, we present a global-to-local approach for hallucination detection (G2LDetect), which considers the global information of the text before identifying local details. We first construct a global representation of the text by transforming it into a hierarchical tree structure. Afterward, we obtain specific local details from the global tree representation using path-wise identification and perform detection on them. This global-to-local detection process ensures that local details are context-aware and complete, thus making more accurate and reliable detection results. Experimental results show that our global-to-local method outperforms existing methods, especially for longer texts.

Abstract: Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called Gode, which accounts for the duallevel structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. Gode integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, Gode effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, Gode surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.

Abstract: Recent face forgery detection methods based on disentangled representation learning utilize paired images for crossreconstruction, aiming to extract forgery-relevant attributes and forgery-irrelevant content. However, there still exist the following issues that may comprise the detector performance: 1) using information-dense images as the decoupling targets increases the decoupling difficulty; 2) the extracted attribute features are reconstruction-irrelevant rather than forgery-relevant, and single-scale forgery representation decoupling cannot capture sufficient discriminative information; 3) the generalization performance of decoupled attribute features is poor as the detector focuses on learning specific artifact types in the training set. To address these issues, we propose a novel disentangled representation learning framework for deepfake detection. First, we extract features by partitioning the dense information within the image, focusing independently on texture, color, or edges. These features are then used as the decoupling targets rather than the images themselves, which could mitigate the decoupling difficulty. Second, we extend reconstruction loss from image-level to feature-level, thus extending the forgery representation decoupling from single-scale to multi-scale. Third, we propose a critical forgetting mechanism that forces the detector to forget the most salient features during training, which correspond to specific forgery artifact types in the training set. Extensive experimental results validate the efficacy of the proposed method.

School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R.China Kash Institute of Electronics and Information Industry, Kash, P.R.China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, Sichuan, P.R.China

Abstract: Deep learning has been widely applied to various aspects of computer vision, but the emergence of adversarial attacks raises concerns about its reliability. Adversarial training (AT) is one of the most effective defense methods, which incorporates adversarial examples into the training data. However, AT is typically employed in a discriminative learning manner, i.e., learning the mapping (conditional probability) from samples to labels, it essentially reinforces this mapping without considering the underlying data distribution. It is notable that adversarial examples often deviate from the distribution of normal (clean) samples. Therefore, building upon existing adversarial defense schemes, we propose to further exploit the distribution of normal samples, partly from the generative learning perspective, resulting in a novel robustness enhancement paradigm. We train a simple autoencoder (AE) autoregressively on normal samples to learn their prior distribution, effectively serving as an image manifold. This AE is then used as a manifold projection operator to incorporate the distribution information of normal samples. Specifically, we organically integrate the pretrained AE into the training process of both AT and adversarial distillation (AD), a method aiming at improving the robustness of small models with low capacity. Since the AE captures the distribution of normal samples, it can adaptively pull adversarial examples closer to the normal sample manifold, weakening the attack strength of adversarial samples and easing the learning of mappings from adversarial samples to correct labels. From the Pearson correlation coefficient (PCC) between the statistics on normal and adversarial examples, it’s validated that the AE indeed pulls adversarial samples closer to normal samples. Extensive experiments illustrate that our proposed adversarial defense paradigm significantly improves the robustness compared with previous stateof-the-art AT and AD methods.

Abstract: Recently, multidomain fake news detection has garnered increasing attention in academia. In particular, the integration of multimodal information into multi-domain fake news detection has emerged as a highly promising research direction. However, this field faces three main challenges: (1) Inaccurate domain identification, where predefined explicit identifiers fail to adapt to the inherent complexity of data; (2) Imbalanced multi-domain data distribution, which may induce negative transfer effects; and (3) Variable multi-domain modal contributions, indicating domain-specific differences in how various modalities influence news veracity assessments. To address these issues, we propose the Domain-Aware Multi-Modal Multi-View Fake News Detection (DAMMFND) framework. DAMMFND effectively extracts more accurate domain information through Domain Disentanglement, while simultaneously mitigating negative transfer between domains. Furthermore, DAMMFND introduces a Domain-Aware Multi-View Discriminator and a Domain-Enhanced Multi-view Decision Layer, which accurately quantify the contribution of domain information to multimodal, multi-view decision-making processes. Extensive experiments conducted on two real-world datasets demonstrate that the proposed model outperforms state-of-the-art baselines.

Abstract: The ability to estimate temporal relationships is critical for both animals and artificial agents. Cognitive science and neuroscience provide remarkable insights into behavioral and neural aspects of temporal credit assignment. In particular, scale invariance of learning dynamics, observed in behavior and supported by neural data, is one of the key principles that governs animal perception: proportional rescaling of temporal relationships does not alter the overall learning efficiency. Here we integrate a computational neuroscience model of scale invariant memory into deep reinforcement learning (RL) agents. We first provide a theoretical analysis and then demonstrate through experiments that such agents can learn robustly across a wide range of temporal scales, unlike agents built with commonly used recurrent memory architectures such as LSTM. This result illustrates that incorporating computational principles from neuroscience and cognitive science into deep neural networks can enhance adaptability to complex temporal dynamics, mirroring some of the core properties of human learning.

Abstract: The rapid advancement in selfsupervised representation learning has highlighted its potential to leverage unlabeled data for learning rich visual representations. However, the existing techniques, particularly those employing different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture variations in the real world. This constrains the diversity and quality of samples, which leads to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for SSL. Our extensive experimental results on various joint-embedding SSL techniques demonstrate that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks. This research demonstrates that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploring the potential of synthetic data. This development paves the way for more robust and versatile representation learning techniques.

Abstract: Highresolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing dual-branch based vanilla invertible blocks, process high-frequency and low-frequency information separately, often relying on specific distributions to model high-frequency components. However, processing the low-frequency component directly in the RGB domain introduces channel redundancy, limiting the efficiency of image reconstruction. To address these challenges, we propose a plug-and-play tri-branch invertible block (T-InvBlocks) that decomposes the low- frequency branch into luminance (Y) and chrominance (CbCr) components, reducing redundancy and enhancing feature processing. Additionally, we adopt an all-zero mapping strategy for high-frequency components during upscaling, focusing essential rescaling information within the LR image. Our T-InvBlocks can be seamlessly integrated into existing rescaling models, improving performance in both general rescaling tasks and scenarios involving lossy compression. Extensive experiments confirm that our method advances the state of the art in HR image reconstruction.

Abstract: We present a lightingaware image editing pipeline that, given a portrait image and a text prompt, performs single image relighting. Our model modifies the lighting and color of both the foreground and background to align with the provided text description. The unbounded nature in creativeness of a text allows us to describe the lighting of a scene with any sensory features including temperature, emotion, smell, time, and so on. However, the modeling of such mapping between the unbounded text and lighting is extremely challenging due to the lack of dataset where there exists no scalable data that provides large pairs of text and relighting, and therefore, current text-driven image editing models does not generalize to lighting-specific use cases. We overcome this problem by introducing a novel data synthesis pipeline: First, diverse and creative text prompts that describe the scenes with various lighting are automatically generated under a crafted hierarchy using a large language model (e.g., ChatGPT). A text-guided image generation model creates a lighting image that best matches the text. As a condition of the lighting images, we perform image-based relighting for both foreground and background using a single portrait image or a set of OLAT (One-Light-at-A-Time) images captured from lightstage system. Particularly for the background relighting, we represent the lighting image as a set of point lights and transfer them to other background images. A generative diffusion model learns the synthesized large-scale data with auxiliary task augmentation (e.g., portrait delighting and light positioning) to correlate the latent text and lighting distribution for text-guided portrait relighting. In our experiment, we demonstrate that our model outperforms existing text-guided image generation models, showing high-quality portrait relighting results with a strong generalization to unconstrained scenes.

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples (AEs) that mislead the model while appearing benign to human observers. A critical concern is the transferability of AEs, which enables blackbox attacks without direct access to the target model. However, many previous attacks have failed to explain the intrinsic mechanism of adversarial transferability, lacking a unified and representative metric for transferability as well. In this paper, we rethink the property of transferable AEs and develop a novel metric to measure transferability from the perspective of generalization. Building on insights from this metric, we analyze the generalization of AEs across models with different architectures and prove that we can find a local perturbation to mitigate the gap between surrogate and target models. We further establish the inner connections between model smoothness and flat local maxima, both of which contribute to the transferability of AEs. Further, we propose a new adversarial attack algorithm, Adversarial Weight Tuning (AWT), which adaptively adjusts the parameters of the surrogate model using generated AEs to optimize the flat local maxima and model smoothness simultaneously, without the need for extra data. AWT is a data-free tuning method that combines gradient-based and model-related attack methods to enhance the transferability of AEs. Extensive experiments on a variety of models with different architectures on ImageNet demonstrate that AWT yields superior performance over other attacks, with an average increase of nearly 5% and 10% attack success rates on CNN-based and Transformer-based models, respectively, compared to state-of-the-art attacks.

Abstract: This paper considers the problem of MultiHop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement. Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.

Abstract: The Segment Anything Model (SAM) is a powerful foundation model for image segmentation, showing robust zeroshot generalization through prompt engineering. However, relying on manual prompts is impractical for real-world applications, particularly in scenarios where rapid prompt provision and resource efficiency are crucial. In this paper, we propose the Automation of Prompts for SAM (AoP-SAM), a novel approach that learns to generate essential prompts in optimal locations automatically. AoP-SAM enhances SAM’s efficiency and usability by eliminating manual input, making it better suited for real-world tasks. Our approach employs a lightweight yet efficient Prompt Predictor model that detects key entities across images and identifies the optimal regions for placing prompt candidates. This method leverages SAM’s image embeddings, preserving its zero-shot generalization capabilities without requiring fine-tuning. Additionally, we introduce a test-time instance-level Adaptive Sampling and Filtering mechanism that generates prompts in a coarse-to-fine manner. This notably enhances both prompt and mask generation efficiency by reducing computational overhead and minimizing redundant mask refinements. Evaluations of three datasets demonstrate that AoP-SAM substantially improves both prompt generation efficiency and mask generation accuracy, making SAM more effective for automated segmentation tasks.

Abstract: Single image supersolution (SR) aims to restore a high-resolution (HR) image from a degraded low-resolution (LR) image. However, existing SR models still face a significant domain gap between synthetic and real-world datasets due to the mismatched degradation distributions, hindering SR models from achieving optimal results. In this paper, we propose an unsupervised diffusion-based degradation modeling framework (UDDM) to effectively capture real-world degradation distributions. Specifically, given unpaired LR and HR images, a diffusion-based degradation module (DDM) first models the degradation distribution by diffusing real-world LR images to downsampled LR images, which does not require HR images. It then applies reverse diffusion to generate real-world LR images from extremely downsampled HR images. This approach allows DDM to model and generate real-world degradation distributions without requiring paired data, by using extreme downsampling to link unpaired LR and HR images. Additionally, we introduce a physics-based dynamic degradation module (P-DDM) that adaptively models content-aware degradation, ensuring both content and structural accuracy. Finally, the LR images generated by DDM and P-DDM are adaptively weighted to produce the final LR images, which are paired with the given HR images for training the SR network. Extensive experiments across multiple real-world datasets demonstrate that our framework achieves state-of-the-art performance in both qualitative and quantitative comparison.

State Key Laboratory of Integrated Services Networks, Xidian University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, State Key Laboratory of Integrated Services Networks, Xidian University, Huawei Noah's Ark Lab, Consumer Business Group, Huawei, State Key Laboratory of Integrated Services Networks, Xidian University, State Key Laboratory of Integrated Services Networks, Xidian University, Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Huawei Noah's Ark Lab

Abstract: Recent advances indicate that diffusion model holds great promise in image superresolution. While latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super resolution (DiT-SR) that achieves the visual quality of prior-based methods, but through a training-from-scratch manner. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocate the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of diffusion transformer in image super resolution.

Abstract: Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances framelevel quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.

Abstract: Recent advances in textto-image diffusion models spurred research on personalization, i.e., a customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects’ positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is attributable to overfitting to reference images, which worsens if only a single reference image is available. We propose DynASyn, an effective multi-subject personalization from a single reference image addressing these challenges. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity. We adopt an SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn is capable of synthesizing highly realistic images of subjects with novel contexts and dynamic interactions with the surroundings, and outperforms baseline methods in both quantitative and qualitative aspects.

Abstract: Pretrained visuallanguage models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that \modelname effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.

Abstract: While novel gradientbased attacks are continuously proposed to improve the optimization of adversarial examples, each is shown to outperform its predecessors using different experimental setups, implementations, and computational budgets, leading to biased and unfair comparisons. In this work, we overcome this issue by proposing AttackBench, i.e., an attack evaluation framework that evaluates the effectiveness of each attack (along with its different library implementations) under the same maximum available computational budget. To this end, we (i) define a novel optimality metric that quantifies how close each attack is to the optimal solution (empirically estimated by ensembling all attacks), and (ii) limit the maximum number of forward and backward queries that each attack can execute on the target model. Our extensive experimental analysis compares more than 100 attack implementations over 800 different configurations, considering both CIFAR-10 and ImageNet models, and shows that only a few attack implementations outperform all the remaining approaches. These findings suggest that novel defenses should be evaluated against different attacks than those normally used in the literature to avoid overly-optimistic robustness evaluations. We release AttackBench as a publicly-available benchmark that will be continuously updated with new attack implementations to maintain an up-to-date ranking of the best gradient-based attacks. We release AttackBench as a publicly available benchmark, including a continuously updated leaderboard and source code to maintain an up-to-date ranking of the best gradient-based attacks.

Abstract: Multitask visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C3VG, which significantly outperforms state-of-the-art REC and RIS methods by a substantial margin.

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, Shenzhen International Graduate School, Tsinghua University, China, Shenzhen International Graduate School, Tsinghua University, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, Shenzhen International Graduate School, Tsinghua University, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Abstract: Creating group choreography from music is crucial in cultural entertainment and virtual reality, with a focus on generating harmonious movements. Despite growing interest, recent approaches often struggle with two major challenges: multidancer collisions and single-dancer foot sliding. To address these challenges, we propose a Trajectory-Controllable Diffusion (TCDiff) framework, which leverages non-overlapping trajectories to ensure coherent and aesthetically pleasing dance movements. To mitigate collisions, we introduce a Dance-Trajectory Navigator that generates collision-free trajectories for multiple dancers, utilizing a distance-consistency loss to maintain optimal spacing. Furthermore, to reduce foot sliding, we present a footwork adaptor that adjusts trajectory displacement between frames, supported by a relative forward-kinematic loss to further reinforce the correlation between movements and trajectories. Experiments demonstrate our method's superiority.

Abstract: Seamlessly moving objects within a scene is a common requirement for image editing, but it is still a challenge for existing editing methods. Especially for realworld images, the occlusion situation further increases the difficulty. The main difficulty is that the occluded portion needs to be completed before movement can proceed. To leverage the real-world knowledge embedded in the pre-trained diffusion models, we propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM. The proposed DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously. The de-occlusion branch utilizes a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object. Concurrently, the movement branch employs latent optimization to place the completed object in the target location and adopts local text-conditioned guidance to integrate the object into new surroundings appropriately. Extensive evaluations across various metrics demonstrate the superior performance of our method, which is further validated by a comprehensive user study.

Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China Hangzhou Institute of Technology, Xidian University, Hangzhou, China, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China, RIKEN AIP, Tokyo103-0027, Japan The University of Tokyo, Japan, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China, Key Laboratory of Analog Integrated Circuits and Systems (Ministry of Education) School of Integrated Circuits, Xidian University, Xi’an 710071, China

Abstract: Event Cameras offer appealing advantages, including power efficiency and ultralow latency, driving forward advancements in edge applications. In order to leverage mature frame-based algorithms, most approaches typically compute dense, image-like representations from sparse, asynchronous events. However, they are often unable to capture comprehensive information or are computationally intensive, which hinders the edge deployment of event-based vision. Meanwhile, pillar-based paradigms have been proven to be efficient and well established for dense representations of sparse data. Hence, from a novel pillar-based perspective, we present EventPillars, an efficient, comprehensive framework for dense event representations. To summarize, it (i) incorporates the Temporal Event Range to describe an intact temporal distribution, (ii) Activates the Event Polarities to explicitly record the scene dynamics, (iii) enhances the target awareness by a spatial attention prior from Normalized Event Density, (iv) can be plug-and-played into different downstream tasks. Extensive experiments show that our EventPillars records a new state-of-the-art precision on object recognition and detection datasets with surprisingly 9.2× and 4.5× lower computation and storage consumption. This brings a new insight into dense event representations and is promising to boost the edge deployment of event-based vision.

Abstract: Deep Unfolding Networks (DUNs), with their outstanding performance and partial interpretability, have revitalized the field of pansharpening. However, the current DUNs for pan-sharpening rely entirely on implicit deep priors, ignoring the intrinsic physical prior knowledge of multispectral image (MS) and panchromatic image (PAN) to guide the reconstruction process. Moreover, these methods often depend on single-scale prior features, failing to adequately capture multiscale information, resulting in spatial and spectral distortions in detail. In this paper, we introduce a spatial-spectral prior-aware framework for pan-sharpening, called SSPF, which formulates a constrained minimization problem integrating MS and PAN prior knowledge based on spatial and spectral domains. We further develop SSPF into a lightweight deep unfolding network, called SSUN-Net, which provides more efficient prior feature extraction and requires less computational cost. Additionally, we augment SSUN-Net's capabilities by integrating a customized multi-scale prior structure (MPS). MPS imposes constraints on the solution space at various scales through regularization, which markedly enhances the reconstruction of intricate details. Extensive experiments demonstrate the significant advantages of our proposed SSUN-Net over the current SOTA methods.

School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, China. Baidu Inc., School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, China., School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, China., Baidu Inc., Baidu Inc.

Abstract: ControlNet has significantly advanced controllable image generation by integrating dense conditions (such as depth and canny edges) with textto-image diffusion models. However, ControlNet's integration requires an additional amount nearly equal to half of the base diffusion model's parameters, making it inefficient. To address this, we introduce Simple-ControlNet, an efficient and streamlined network for controllable text-to-image generation. It employs a single-scale projection layer to incorporate condition information into the denoising U-Net. It is supplemented by Low-Rank Adapter (LoRA) parameters to facilitate condition learning. Impressively, Simple-ControlNet requires fewer than 3 million parameters for the control mechanism, substantially less than the 300 million needed by ControlNet. Our extensive experiments confirm that Simple-ControlNet matches and surpasses ControlNet's performance across a broad range of tasks and base diffusion models, showcasing its utility and efficiency.

Abstract: Textdriven Human-Object Interaction (Text-to-HOI) generation is an emerging field with applications in animation, video games, virtual reality, and robotics. A key challenge in HOI generation is maintaining interaction consistency in long sequences. Existing Text-to-Motion-based approaches, such as discrete motion tokenization, cannot be directly applied to HOI generation due to limited data in this domain and the complexity of the modality. To address the problem of interaction consistency in long sequences, we propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Specifically, we introduce a Contrastive Variational Autoencoder (cVAE) to learn a physically plausible space of continuous HOI tokens, thereby ensuring that generated human-object motions are realistic and natural. For generating sequences autoregressively, we develop a Mamba-based context encoder to capture and maintain consistent sequential actions. Additionally, we implement an MLP-based denoiser to generate the subsequent token conditioned on the encoded context. Our model has been evaluated on the OMOMO and BEHAVE datasets, where it outperforms existing state-of-the-art methods in terms of both performance and inference speed. This makes ARDHOI a robust and efficient solution for text-driven HOI tasks.

Abstract: Using images captured by cameras with different light spectrum sensitivities, training a unified model for crossspectral scene representation is challenging. Recent advances have shown the possibility of jointly optimizing cross-spectral relative poses and neural radiance fields using normalized cross-device coordinates. However, such method suffers from cross-spectral misalignment when collecting data asynchronously from devices and lacks the capability to render in real-time or handle large scenes. We address these issues by proposing cross-spectral Gaussian Splatting with spatial occupancy consistency, strictly aligns cross-spectral scene representation by sharing explicit Gaussian surfaces across spectra and separately optimizing each view's extrinsic using a matching-optimizing pose estimation method. Additionally, to address field-of-view differences in cross-spectral cameras, we improve the adaptive densify controller to fill non-overlapping areas. Comprehensive experiments demonstrate that SOC-GS achieves superior performance in novel view synthesis and real-time cross-spectral rendering.

Abstract: The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zeroshot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs like category names or reference expressions. Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and inject comprehensive semantic information for recognition. In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects. Consequently, Sambor enables the open-vocabulary detector to equally focus on generalizing both localization and classification sub-tasks. Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.

Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China, Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China, Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China, Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China, Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China, Institute of Medical Technology, Peking University, Beijing, China Department of Biomedical Engineering, Peking University, Beijing, China National Biomedical Imaging Center, Peking University, Beijing, China Institute of Biomedical Engineering, Peking University Shenzhen Graduate School, Shenzhen, China

Abstract: Concept Bottleneck Models (CBMs) offer inherent interpretability by initially translating images into humancomprehensible concepts, followed by a linear combination of these concepts for classification. However, the annotation of concepts for visual recognition tasks requires extensive expert knowledge and labor, constraining the broad adoption of CBMs. Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks, with multimodal models like CLIP subsequently mapping image features into the concept feature space for classification. Despite this, the concepts produced by language models can be verbose and may introduce non-visual attributes, which hurts accuracy and interpretability. In this study, we investigate to avoid these issues by constructing CBMs directly from multimodal models. To this end, we adopt common words as base concept vocabulary and leverage auxiliary unlabeled images to construct a Vision-to-Concept (V2C) tokenizer that can explicitly quantize images into their most relevant visual concepts, thus creating a vision-oriented concept bottleneck tightly coupled with the multimodal model. This leads to our V2C-CBM which is training efficient and interpretable with high accuracy. Our V2C-CBM has matched or outperformed LLM-supervised CBMs on various visual classification benchmarks, validating the efficacy of our approach.

Abstract: Current outof-distribution (OOD) detection methods typically assume balanced in-distribution (ID) data, while most real-world data follow a long-tailed distribution. Previous approaches to long-tailed OOD detection often involve balancing the ID data by reducing the semantics of head classes. However, this reduction can severely affect the classification accuracy of ID data. The main challenge of this task lies in the severe lack of features for tail classes, leading to confusion with OOD data. To tackle this issue, we introduce a novel Prioritizing Attention to Tail (PATT) method using augmentation instead of reduction. Our main intuition involves using a mixture of von Mises-Fisher (vMF) distributions to model the ID data and a temperature scaling module to boost the confidence of ID data. This enables us to generate infinite contrastive pairs, implicitly enhancing the semantics of ID classes while promoting differentiation between ID and OOD data. To further strengthen the detection of OOD data without compromising the classification performance of ID data, we propose feature calibration during the inference phase. By extracting an attention weight from the training set that prioritizes the tail classes and reduces the confidence in OOD data, we improve the OOD detection capability. Extensive experiments verified that our method outperforms the current state-of-the-art methods on various benchmarks.

Abstract: Prompt tuning (PT) has emerged as a key to unlocking the power of visuallanguage models like CLIP for various downstream tasks. Predominant approaches learn a small set of task-relevant soft prompts by solving an image-class matching problem. Nevertheless, by optimizing merely with respect to class names, they face challenges in learning high performant prompts capable of capturing fine-grained, diverse characteristics of each class, and tends to overfit potentially biased distribution of base classes. In this work, we propose PTinCAS to tackle prompt tuning in a compact attribute space, driven by the premise that attributes offer detailed class interpretations and can facilitate transfer across related categories. Particularly, PTinCAS is grounded in two innovative designs. First, we create a compact attribute space by properly prompting large language models to generate factual descriptions about categories, which are subsequently clustered to form a concise attribute vocabulary. Second, we leverage attributes as a source of supervision in PT to transfer the inherent common sense knowledge in attributes to soft prompts. An object-aware visual prompting mechanism is developed to effortlessly highlight intended regions in the original image, which guides the model towards learning visual attributes associated with object regions rather than the background. We show that PTinCAS not only improves few-shot generalizability compared to existing PT methods, but also provides some level of inherent explainability that helps us understand why a class name is determined based on the attributes activated in an image.

Abstract: Despite the recent progress, existing frame interpolation methods still struggle with processing extremely high resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patchbased cascaded pixel diffusion model for high resolution frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that require both global context for a coarse solution and detailed context for high resolution output. However, contrary to prior work on cascaded diffusion models which perform diffusion on increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model, solving both frame interpolation (base model’s task) and spatial up-sampling, saving training cost as well. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, and HiFI significantly outperforms other baselines.

Abstract: Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from largescale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.

Abstract: Learningbased multi-view stereo methods aim to predict depth maps for reconstructing dense point clouds. These methods rely on regularization to reduce redundancy in the cost volume. However, existing methods have limitations: CNN-based regularization is restricted to local receptive fields, while Transformer-based regularization struggles with handling depth discontinuities. These limitations often result in inaccurate depth maps with significant noise, particularly noticeable in the boundary and background regions. In this paper, we propose a Recurrent Regularization Transformer for Multi-View Stereo (RRT-MVS), which addresses these limitations by regularizing the cost volume separately for depth and spatial dimensions. Specifically, we introduce Recurrent Self-Attention (R-SA) to aggregate global matching costs within and across the cost maps and filter out noisy feature correlations. Additionally, we present Depth Residual Attention (DRA) to aggregate depth correlations within the cost volume and a Positional Adapter (PA) to enhance 3D positional awareness in each 2D cost map, further augmenting the effectiveness of R-SA. Experimental results demonstrate that RRT-MVS achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets. Notably, RRT-MVS ranks first on both the Tanks-and-Temples intermediate and advanced benchmarks among all published methods.

Abstract: For largescale point cloud processing, resampling takes the important role of controlling the point number and density while keeping the geometric consistency. However, current methods cannot balance such different requirements. Particularly with large-scale point clouds, classical methods often struggle with decreased efficiency and accuracy. To address such issues, we propose a weighted Poisson-disk (WPD) resampling method to improve the usability and efficiency for the processing. We first design an initial Poisson resampling with a voxel-based estimation strategy. It is able to estimate a more accurate radius of the Poisson-disk while maintaining high efficiency. Then, we design a weighted tangent smoothing step to further optimize the Voronoi diagram for each point. At the same time, sharp features are detected and kept in the optimized results with isotropic property. Finally, we achieve a resampling copy from the original point cloud with the specified point number, uniform density, and high-quality geometric consistency. Experiments show that our method significantly improves the performance of large-scale point cloud resampling for different applications, and provides a highly practical solution.

Abstract: Realtime Monte Carlo (MC) ray tracing with low sampling rates demands a denoising algorithm that adeptly balances the trade-off between quality and efficiency. Previous works have paid much attention on designing delicate denoising architecture while ignoring model compression. In this work, we present a render-aware knowledge distillation (RAKD) framework, specifically designed for Monte Carlo denoising. We meticulously delineate the Knowledge Distillation (KD) process within RAKD, emphasizing three pivotal techniques: the strategic incorporation of an auxiliary unlabeled dataset, the integration of adversarial learning through generative adversarial network (GAN), and the application of parameter transfer for robust model initialization. These approaches are harmoniously combined to distill knowledge effectively, enabling our student model to adeptly strike a balance between preserving high-frequency details and reducing low-frequency noise. Finally, our results demonstrate that RAKD achieves state-of-the-art quality while upholding real-time performance, successfully tackling the computational constraints faced by resource-limited devices.

Abstract: Although existing Sparsely Annotated Object Detection (SAOD) approches have made progress in handling sparsely annotated environments in multispectral domain, where only some pedestrians are annotated, they still have the following limitations: (i) they lack considerations for improving the quality of pseudolabels for missing annotations, and (ii) they rely on fixed ground truth annotations, which leads to learning only a limited range of pedestrian visual appearances in the multispectral domain. To address these issues, we propose a novel framework called Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). For limitation (i), we introduce Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) module. Utilizing multispectral knowledge, these modules ensure the generation of high-quality pseudo-labels and enable effective learning by increasing weights for high-quality pseudo-labels based on modality characteristics. To address limitation (ii), we propose an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from ground-truth and dynamically integrates high-quality pseudo-labels with the ground-truth, facilitating a more diverse learning pool of pedestrians. Extensive experimental results demonstrate that our SAMPD significantly enhances performance in sparsely annotated environments within the multispectral domain.

Abstract: Propagationbased video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.

Abstract: Deep generative models are proficient in generating realistic data but struggle with producing rare samples in low density regions due to their scarcity of training datasets and the mode collapse problem. While recent methods aim to improve the fidelity of generated samples, they often reduce diversity and coverage by ignoring rare and novel samples. This study proposes a novel approach for generating diverse rare samples from highresolution image datasets with pretrained GANs. Our method employs gradient-based optimization of latent vectors within a multi-objective framework and utilizes normalizing flows for density estimation on the feature space. This enables the generation of diverse rare images, with controllable parameters for rarity, diversity, and similarity to a reference image. We demonstrate the effectiveness of our approach both qualitatively and quantitatively across various datasets and GANs without retraining or fine-tuning the pretrained GANs.

Abstract: The remarkable achievements of Large Language Models (LLMs) have captivated the attention of both academia and industry, transcending their initial role in dialogue generation. To expand the usage scenarios of LLM, some works enhance the effectiveness and capabilities of the model by introducing more external information, which is called the agent paradigm. Based on this idea, we propose a new method that integrates the agent paradigm into outof-distribution (OOD) detection task, aiming to improve its robustness and adaptability. Our proposed method, Concept Matching with Agent (CMA), employs neutral prompts as agents to augment the CLIP-based OOD detection process. These agents function as dynamic observers and communication hubs, interacting with both In-distribution (ID) labels and data inputs to form vector triangle relationships. This triangular framework offers a more nuanced approach than the traditional binary relationship, allowing for better separation and identification of ID and OOD inputs. Our extensive experimental results showcase the superior performance of CMA over both zero-shot and training-required methods in a diverse array of real-world scenarios.

Abstract: Video Frame Interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works have employed generative models for improved perceptual quality. However, they require complex training and large computational costs for pixel space modeling. In this paper, we introduce disentangled Motion Modeling (MoMo), a diffusionbased approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose a disentangled two-stage training process. In the initial stage, frame synthesis and flow models are trained to generate accurate frames and flows optimal for synthesis. In the subsequent stage, we introduce a motion diffusion model, which incorporates our novel U-Net architecture specifically designed for optical flow, to generate bi-directional flows between frames. By learning the simpler low-frequency representation of motions, MoMo achieves superior perceptual quality with reduced computational demands compared to the generative modeling methods on the pixel space. MoMo surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI.

Abstract: This paper focuses on face stylization with a single artistic target. Existing works for this task often fail to retain the source content while achieving geometry variation. Here, we present a novel StyO model, i.e., Stylize the face in only Oneshot, to solve the above problem. In particular, StyO exploits a disentanglement and recombination strategy. It first disentangles the content and style of source and target images into identifiers, which are then recombined in a cross manner to derive the stylized face image. In this way, StyO decomposes complex images into independent and specific attributes, and simplifies one-shot face stylization as the combination of different attributes from input images, thus producing results better matching face geometry of target image and content of source one. StyO is implemented with latent diffusion models (LDM) and composed of two key modules: 1) Identifier Disentanglement Learner (IDL) for disentanglement phase. It represents identifiers as contrastive text prompts, i.e. positive and negative descriptions. And it introduces a novel triple reconstruction loss to fine-tune the pre-trained LDM for encoding style and content into corresponding identifiers; 2) Fine-graind Content Controller (FCC) for recombination phase. It recombines disentangled identifiers from IDL to form an augmented text prompt for generating stylized faces. In addition, FCC also constrains the cross-attention maps of latent and text features to preserve source face details in results. The extensive evaluation shows that StyO produces high-quality images on numerous paintings of various styles and outperforms the current state-of-the-art.

Abstract: Compositional generalization is the capability of a model to understand novel compositions composed of seen concepts. There are multiple levels of novel compositions including phrasephrase level, phrase-word level, and word-word level. Existing methods achieve promising compositional generalization, but the consistency of compositional generalization across multiple levels of novel compositions remains unexplored. The consistency refers to that a model should generalize to a phrase-phrase level novel composition, and phrase-word/word-word level novel compositions that can be derived from it simultaneously. In this paper, we propose a meta-learning based framework, for achieving consistent compositional generalization across multiple levels. The basic idea is to progressively learn compositions from simple to complex for consistency. Specifically, we divide the original training set into multiple validation sets based on compositional complexity, and introduce multiple meta-weight-nets to generate sample weights for samples in different validation sets. To fit the validation sets in order of increasing compositional complexity, we optimize the parameters of each meta-weight-net independently and sequentially in a multilevel optimization manner. We build a GQA-CCG dataset to quantitatively evaluate the consistency. Experimental results on visual question answering and temporal video grounding, demonstrate the effectiveness of the proposed framework.

School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology, ReLER Lab, CCAI, Zhejiang University, ReLER Lab, CCAI, Zhejiang University, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: MicroAction Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify the ambiguous sample, categorizing them into distinct sets of ambiguous samples of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model's capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify prediction by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches.

Abstract: Federated Learning (FL) enables collaborative learning from distributed data while preserving the privacy of participating clients. While supervised federated learning with labeled data has made notable strides and achieved success, federated semisupervised learning (FSSL) lags in its progress. Existing works for FSSL heavily rely on fully-labeled clients, while ignoring the distribution of pseudo-labels generated from skewed unlabeled data. In this work, we offer empirical and theoretical insights into the challenges encountered when applying conventional semi-supervised algorithms in the federated regime. Specifically, we highlight how the inherent data heterogeneity in FSSL can exacerbate issues within the pseudo-labeling process. Motivated by these observations, we propose federated learning with progressive distribution matching (FedPDM) to regularize the distribution of pseudo-labels, aiming to progressively reshape it to align with the ground-truth distribution. The matching problem could be formulated as an optimal transport (OT) problem and efficiently solved by Sinkhorn-Knopp iteration. Through extensive experiments, we demonstrate the superiority of FedPDM on a variety of models and datasets compared with prior arts for FSSL.

Abstract: Masked autoencoder (MAE) shows that severe augmentation during training produces robust representations for highlevel tasks. This paper brings the MAE-like framework to nighttime image enhancement, demonstrating that severe augmentation during training produces strong network priors that are resilient to real-world night haze degradations. We propose a novel nighttime image dehazing method with self-prior learning. Our main novelty lies in the design of severe augmentation, which allows our model to learn robust priors. Unlike MAE that uses masking, we leverage two key challenging factors of nighttime images as augmentation: light effects and noise. During training, we intentionally degrade clear images by blending them with light effects as well as by adding noise, and subsequently restore the clear images. This enables our model to learn clear background priors. By increasing the noise values to approach as high as the pixel intensity values of the glow and light effect blended images, our augmentation becomes severe, resulting in stronger priors. While our self-prior learning is considerably effective in suppressing glow and revealing details of background scenes, in some cases, there are still some undesired artifacts that remain, particularly in the forms of over-suppression. To address these artifacts, we propose a self-refinement module based on the semi-supervised teacher-student framework. Our NightHaze, especially our MAE-like self-prior learning, shows that models trained with severe augmentation effectively improve the visibility of input haze images, approaching the clarity of clear nighttime images. Extensive experiments demonstrate that our NightHaze achieves state-of-the-art performance, outperforming existing nighttime image dehazing methods by a substantial margin of 15.5% for MUSIQ and 23.5% for ClipIQA.

Abstract: Visualtextual correlations in the attention maps derived from text-to-image diffusion models are proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to push inner-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.

Abstract: In recent years, there has been a growing demand to stylize a given 3D scene to align with the artistic style of reference images for creative purposes. While 3D Gaussian Splatting (GS) has emerged as a promising and efficient method for realistic 3D scene modeling, there remains a challenge in adapting it to stylize 3D GS to match with multiple styles through automatic local style transfer or manual designation, while maintaining memory efficiency for stylization training. In this paper, we introduce a novel 3D GS stylization solution termed MultiStyleGS to tackle these challenges. In particular, we employ a bipartite matching mechanism to automatically identify correspondences between the style images and the local regions of the rendered images. To facilitate local style transfer, we introduce a novel semantic style loss function that employs a segmentation network to apply distinct styles to various objects of the scene and propose a local-global feature matching to enhance the multi-view consistency. Furthermore, this technique can achieve memory-efficient training, more texture details and better color match. To better assign a robust semantic label to each Gaussian, we propose several techniques to regularize the segmentation network. As demonstrated by our comprehensive experiments, our approach outperforms existing ones in producing plausible stylization results and offering flexible editing.

Institute of Information Science, Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory of the Ministry of Education, Institute of Information Science, Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory of the Ministry of Education MT Lab, Meitu Inc., Institute of Information Science, Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory of the Ministry of Education MT Lab, Meitu Inc., MT Lab, Meitu Inc., MT Lab, Meitu Inc., MT Lab, Meitu Inc., Institute of Information Science, Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory of the Ministry of Education Pengcheng Laboratory, Shenzhen, China, Institute of Information Science, Beijing Jiaotong University Visual Intelligence + X International Joint Laboratory of the Ministry of Education Pengcheng Laboratory, Shenzhen, China

Abstract: Transformerbased models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stages of attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of 4872 × 6017. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.

Abstract: Spatiotemporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the truth that open-world objects are diverse, that is, they usually provide limited and predefined object classes. Therefore, we introduce a new open-world benchmark: Grounding Interacted Objects (GIO) including 1,098 interacted objects class and 290K interacted object boxes annotation. Accordingly, an object grounding task is proposed expecting vision systems to discover interacted objects. Even though today’s detectors and grounding methods have succeeded greatly, they perform unsatisfactorily in localizing diverse and rare objects in GIO. This profoundly reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority in extensive experiments compared to current baselines.

Abstract: Textbased person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to person domain is now a emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges still remain during fine-tuning: (i) Previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting.(ii) Existing parameter-efficient transfer learning (PETL) for TPR lacks of fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MOE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to MLP layers in both vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features more finely. To promote the router to exploit domain information effectively and alleviate the routing imbalance, Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.

Abstract: Vehicleto-everything (V2X) collaborative perception has recently gained increasing attention in autonomous driving due to its ability to enhance scene understanding by integrating information from other collaborators, e.g. vehicles or infrastructure. Existing algorithms usually share deep features to achieve a trade-off between accuracy and bandwidth. However, most of these methods require joint training of all agents, which results in privacy leakage and is impractical and unacceptable in the real world. Sharing prediction results seems to be a direct solution, but its performance is suboptimal and sensitive to localization noise and communication delay. In this paper, we propose a privacy-preserving collaborative perception framework, where each agent is separately trained with its own dataset and the ego vehicle needs to integrate with completely unknown collaborators. Specifically, we propose MSD, a multi-scale feature fusion method combined with deformable attention, to better fuse features of different agents. We also propose a plug-in domain adapter to align the features from unknown collaborators to ego-domain. Extensive experiments on the challenging DAIR-V2X and V2V4Real demonstrate that: 1) MSD achieves remarkable performance, outperforming others by at least 2.8% and 6.7% in AP0.7 on DAIR-V2X and V2V4Real, respectively; 2) After domain adaptation, it significantly outperforms the No Fusion, Late Fusion scenarios and can approach or even surpass the performance of joint training. We truly achieves privacy-preserving collaboration, providing a new paradigm for the study of collaborative perception, which is crucial for practical applications.

Abstract: Largescale text-guided image diffusion models have demonstrated remarkable results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-inpainting approach managed to preserve generation diversity, but often resulted in noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that takes advantage of pre-trained diffusion models. We introduce a local attention reweighing mechanism in the self-attention layers to guide the model in focusing on spatial-correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques in terms of texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

Abstract: We propose DiffShadow, a global-guided diffusion model for high-quality shadow removal. Previous transformer-based approaches can utilize global information to relate shadow and non-shadow regions but are limited in their synthesis ability and recover images with obvious boundaries. In contrast, diffusion-based methods can generate better content but they are not exempt from issues related to inconsistent illumination. In this work, we combine the advantages of diffusion models and global guidance to realize shadow-free restoration. Specifically, we propose a parallel UNets architecture: 1) the local branch performs the patch-based noise estimation in the diffusion process, and 2) the global branch recovers the low-resolution shadow-free images. A Reweight Cross Attention (RCA) module is designed to integrate global contextual information of non-shadow regions into the local branch. We further design a Global-guided Sampling Strategy (GSS) that mitigates patch boundary issues and ensures consistent illumination across shaded and unshaded regions in the recovered image. Comprehensive experiments on three publicly standard datasets ISTD, ISTD+, and SRD have demonstrated the effectiveness of Diff-Shadow. Compared to state-of-the-art methods, our method achieves a significant improvement in terms of PSNR, increasing from 32.33dB to 33.69dB on the ISTD dataset.

Abstract: The goal of image change captioning is to capture the content differences between two images and describe them in natural language. The key is how to learn stable content changes from noise such as viewpoint and image structure. However, current work mostly focuses on identifying changes, and the influence of global noise leads to unstable recognition of global features. In order to tackle this problem, we propose a Selfsupervised Global-Part Alignment (SSGPA) network and revisit the image change captioning task by enhancing the construction process of overall image global features, enabling the model to integrate global changes such as viewpoint into local changes, and to detect and describe changes in the image through alignment. Concretely, we first design a Global-Part Transport Alignment mechanism to enhance global features and learn stable content changes through a self-supervised method of optimal transport. Further, we design a Change Fusion Adapter with pre-trained vision-language model to enhance the similar parts features of paired images, thereby enhancing global features, and expanding content changes. Extensive experiments show our method achieves the state-of-the-art results on four datasets.

Abstract: Highquality, high-resolution medical imaging is essential for clinical care. Raman-based biomedical optical imaging uses non-ionizing infrared radiation to evaluate human tissues in real time and is used for early cancer detection, brain tumor diagnosis, and intraoperative tissue analysis. Unfortunately, optical imaging is vulnerable to image degradation due to laser scattering and absorption, which can result in diagnostic errors and misguided treatment. Restoration of optical images is a challenging computer vision task because the sources of image degradation are multi-factorial, stochastic, and tissue-dependent, preventing a straightforward method to obtain paired low-quality/high-quality data. Here, we present Restorative Step-Calibrated Diffusion (RSCD): an unpaired diffusion-based image restoration method that uses a step calibrator model to dynamically determine the number of steps required to complete the reverse diffusion process for image restoration. RSCD outperforms other widely used unpaired image restoration methods on both image quality and perceptual evaluation metrics for restoring optical images. Medical imaging experts consistently prefer images restored using RSCD in blinded comparison experiments and report minimal to no hallucinations. Finally, we show that RSCD improves performance on downstream clinical imaging tasks, including automated brain tumor diagnosis and deep tissue imaging.

Abstract: The key to semisupervised semantic segmentation lies in how to fully exploit a large amount of unlabeled data to improve the model’s generalization performance. Most methods are lured into the trap of taking each class independently (i.e., class-independent consistency) and neglecting the fact that there exist semantic dependencies among classes. In this paper, we analyze the bottlenecks of class-independent consistency inherent in previous methods and offer a fresh perspective of cooperative game theory to explicitly encourage class-consensus alignment (i.e., class-consensus consistency between the teacher (weak augmented view) and student network (strong augmented view). We formulate classes as players in an cooperative game to model their interpretable consensus and shed light on the possibility of closer collaboration between consensus themselves and consistency regularization, yielding more comprehensive and effective supervision signals. To this end, we carefully design the class-consensus consistency without introducing any external knowledge to model class structure information which renders better interpretability, and further, prepend relaxed class-consensus consistency (RCC) to unlock the potential of modeling class consensus by relaxing the strict alignment of direct class consensus values to ranking alignment. Extensive experimental results on multiple benchmarks demonstrate that RCC performs favorably against state-of-the-art methods. Particularly in the low-data regimes, RCC achieves significant improvements.

Abstract: Diffusion models (DMs) have demonstrated great potential in the field of adversarial robustness, where DMbased defense methods can achieve superior defense capability without adversarial training. However, they all require huge computational costs due to the usage of large-scale pre-trained DMs, making it difficult to conduct full evaluation under strong attacks and compare with traditional CNN-based methods. Simply reducing the network size and timesteps in DMs could significantly harm the image generation quality, which invalidates previous frameworks. To alleviate this issue, we redesign the diffusion framework from generating high-quality images to predicting distinguishable image labels. Specifically, we employ an image translation framework to learn many-to-one mapping from input samples to designed orthogonal image labels. Based on this framework, we introduce an efficient Image-to-Image diffusion classifier with a pruned U-Net structure and reduced diffusion timesteps. Besides the framework, we redesign the optimization objective of DMs to fit the target of image classification, where a new classification loss is incorporated in the DM-based image translation framework to distinguish the generated label from those of other classes. We conduct sufficient evaluations of the proposed classifier under various attacks on popular benchmarks. Extensive experiments show that our method achieves better adversarial robustness with fewer computational costs than DM-based and CNN-based methods.

Abstract: Testtime prompt tuning (TPT) aims to adjust the vision-language models (e.g., CLIP) with learnable prompts during the inference phase. However, previous works overlooked that pre-trained models as a service (MaaS) have become a noticeable trend due to their commercial usage and potential risk of misuse. In the context of MaaS, users can only design prompts in inputs and query the black-box vision-language models through inference APIs, rendering the previous paradigm of utilizing gradient for prompt tuning is infeasible. In this paper, we propose black-box test-time prompt tuning (B²TPT), a novel framework that addresses the challenge of optimizing prompts without gradients in an unsupervised manner. Specifically, B²TPT designs a consistent or confident (CoC) pseudo-labeling strategy to generate high-quality pseudo-labels from the outputs. Subsequently, we propose to optimize low-dimensional intrinsic prompts using a derivative-free evolution algorithm and to project them onto the original text and vision prompts. This strategy addresses the gradient-free challenge while reducing complexity. Extensive experiments across 15 datasets demonstrate the superiority of B²TPT. The results show that B²TPT not only outperforms CLIP's zero-shot inference at test time, but also surpasses other gradient-based TPT methods.

Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional UNets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua2SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua2SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-α, PixArt-Σ, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.

Abstract: We propose iMoT, an innovative Transformerbased inertial odometry method that retrieves cross-modal information from motion and rotation modalities for accurate positional estimation. Unlike prior work, during the encoding of the motion context, we introduce Progressive Series Decoupler at the beginning of each encoder layer to stand out critical motion events inherent in acceleration and angular velocity signals. To better aggregate cross-modal interactions, we present Adaptive Positional Encoding, which dynamically modifies positional embeddings for temporal discrepancies between different modalities. During decoding, we introduce a small set of learnable query motion particles as priors to model motion uncertainties within velocity segments. Each query motion particle is intended to draw cross-modal features dedicated to a specific motion mode, all taken together allowing the model to refine its understanding of motion dynamics effectively. Lastly, we design a dynamic scoring mechanism to stabilize iMoT's optimization by considering all aligned motion particles at the final decoding step, ensuring robust and accurate velocity segment estimation. Extensive evaluations on various inertial datasets demonstrate that iMoT significantly outperforms state-of-the-art methods in delivering superior robustness and accuracy in trajectory reconstruction.

Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of visionlanguage learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced and consequently leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.

Abstract: Recently, learningbased Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that the regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model's attention on the regions with high-frequency details. Extensive experiments validate that the SS-UIE technique outperforms state-of-the-art UIE methods while requiring cheaper computational and memory costs.

Abstract: Rainy images suffer from quality degradation due to the synergistic effect of rain streaks and accumulation. The rain streaks are anisotropic and show a specific directional arrangement, while the rain accumulation is isotropic and shows a consistent concentration distribution in local regions. This distribution difference makes unified representation learning for rain streaks and accumulation challenging, which may lead to structure distortion and contrast degradation in the deraining results. To address this problem, a centralsurrounding mechanism inspired Synergistic Convolution (SC) is proposed to extract rain streaks and accumulation features simultaneously. Specifically, the SC consists of two parallel novel convolutions: Central-Surrounding Difference Convolution (CSD) and Central-Surrounding Addition Convolution (CSA). In CSD, the difference operation between central and surrounding pixels is injected into the feature extraction process of convolution to perceive the direction distribution of rain streaks. In CSA, the addition operation between central and surrounding pixels is injected into the feature extraction process of convolution to facilitate the modeling of rain accumulation properties. The SC can be used as a general unit to substitute Vanilla Convolution (VC) in current de-raining networks to boost performance. To reduce computational costs, CSA and CSD in SC are merged into a single VC kernel by our parameter equivalent transformation before inferencing. Evaluations of twelve de-raining methods on nine public datasets demonstrate that our proposed SC can comprehensively improve the performance of twelve de-raining networks under various rainy conditions without changing the original network structure or introducing extra computational costs. Even for the current SOTA methods, SC can further achieve SOTA++ performance. The source codes will be publicly available.

Abstract: The scene flow estimation methods make significant progress by estimating pixelwise 3D motion on implicitly learning a motion embedding using an end-to-end differentiable optimization framework. However, the motion embedding learned implicitly is insufficient for grouping pixels into rigid object in challenging regions, such as occlusion and inconsistent multi-view geometric properties. To address this issue, we propose a novel method for estimating scene flow called OAMaskFlow, which has three novelties. Firstly, we propose the concept of occlusion-aware motion (OAM) mask and generate the ground truth annotation through the photo-metric and geometry consistency. Secondly, we propose to supervise the motion embedding with the OAM mask to learn informative and reliable motion representation of the scene. Finally, a 3D motion propagation module is proposed to propagate high-quality 3D motion from reliable pixels to the challenging occluded regions. Experiments show that our proposed OAMaskFlow has reduced the EPE3D metric by 21.0% on the FlyingThings3D dataset and decreased SF-all metric by 24.3% on the KITTI scene flow benchmark than the baseline method RAFT-3D. Furthermore, we apply our proposed OAM mask in simultaneous localization and mapping (SLAM) to improve a state-of-the-art method DROID-SLAM. In comparison, the ATE metric has decreased by 65.7% and 58.3% on the TartanAir monocular and stereo datasets respectively.

Abstract: Capturing images under different color temperatures can result in color casts, causing the color presented in photos to differ from what is perceived by the human eye. Correcting these color temperature shifts to achieve White Balance (WB) is a challenging task, requiring the identification of variations in color tones from diverse light sources and the removal of color casts. The advent of deep neural networks has significantly advanced the progress of WB methods, evolving from simply identifying the scene illumination color to directly producing a colorcorrected image from the color-shifted input. To better map color distributions and scene information from the input to the WB image, we propose HVDualformer, an end-to-end histogram-vision dual transformer architecture that can rectify color temperature features from WB color histograms and exploit them to adjust image features to yield accurate WB results. Extensive experimental results on public benchmark datasets demonstrate that the proposed model performs favorably against state-of-the-art methods.

Abstract: Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to nonexperts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also enables competitive or superior local image manipulations compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.

Abstract: Video summarization aims to eliminate visual redundancy while retaining key parts of video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators when annotating the same video. In this paper, we introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction, and generates summaries by iterative denoising. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability. Moreover, to facilitate training DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising process. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.

Abstract: Multiobject tracking is a challenging vision task that requires simultaneous reasoning about object detection and object association. Conventional solutions use frame as the basic unit and typically rely on a motion predictor that exploits the appearance features to associate detected candidates, leading to insufficient adaptability to long-term associations. In this study, we propose a section-based multi-object tracking approach that integrates a temporal coherent Object Flow Tracker (OFTrack), capable of achieving simultaneous multi-frame tracking by treating multiple consecutive frames as the basic processing unit, denoted as a “section”. Our OFTrack boosts the optical flow to the object flow by employing object perception and section-based motion estimation strategies. Object perception adopts object-aware sampling and scale-aware correlation to enable precise target discrimination. Motion estimation models the correlation of different objects in multi-frames via specialized temporal-spatial attention to achieve robust association in very long videos. Additionally, to address the oscillation of unpredictable trajectories in multi-frame estimation, we have designed temporal coherent enhancement including the trajectory masking pre-training and the smoothing constraint on trajectory curves. Comprehensive experiments on several widely used benchmarks demonstrate the superior performance of our approach.

Abstract: In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the JensenShannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.

Abstract: The domain of nonline-of-sight (NLOS) imaging is advancing rapidly, offering the capability to reveal occluded scenes that are not directly visible. However, contemporary NLOS systems face several significant challenges: (1) The computational and storage requirements are profound due to the inherent three-dimensional grid data structure, which restricts practical application. (2) The simultaneous reconstruction of albedo and depth information requires a delicate balance using hyperparameters in the loss function, rendering the concurrent reconstruction of texture and depth information difficult. This paper introduces the innovative methodology, DG-NLOS, which integrates an albedo-focused reconstruction branch dedicated to albedo information recovery and a depth-focused reconstruction branch that extracts geometrical structure, to overcome these obstacles. The dual-branch framework segregates content delivery to the respective reconstructions, thereby enhancing the quality of the retrieved data. To our knowledge, we are the first to employ the GNN as a fundamental component to transform dense NLOS grid data into sparse structural features for efficient reconstruction. Comprehensive experiments demonstrate that our method attains the highest level of performance among existing methods across synthetic and real data.

School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation of Chinese Academy of Sciences, Mohamed bin Zayed University of Artificial Intelligence, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation of Chinese Academy of Sciences, Electrical and Computer Engineering Department, Carnegie Mellon University, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation of Chinese Academy of Sciences, Tencent Robotics X, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation of Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation of Chinese Academy of Sciences

Abstract: Aerial VisionDialog Navigation (AVDN) is a new task that requires drones to navigate to a target location based on human-robot dialog history. This paper focuses on the critical fine-grained cross-modal alignment problem in AVDN, requiring the drone to align language entities with visual landmarks in top-down views. To achieve this, we first construct a Fine-Grained AVDN (FG-AVDN) dataset via a semi-automatic annotation pipeline, providing diverse multimodal annotations at the entity-landmark level. Based on this, a novel Fine-grained Entity-Landmark Alignment (FELA) method is proposed to learn the cross-modal alignment explicitly. Concretely, FELA first boosts the drone's visual understanding with a precise semantic grid representation, which captures the environmental semantics and spatial structure simultaneously. Subsequently, to learn the entity-landmark alignment, we devise cross-modal auxiliary tasks from three perspectives, including grounding, captioning, and contrastive learning. Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance with 3.2% SR and 4.9% GP improvements over prior arts. Code and dataset will be publicly available.

Beijing Key Lab. of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing Key Lab. of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, The Chinese University of Hong Kong, The University of Hong Kong SenseTime Research and Tetras.AI, Beijing Key Lab. of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Peng Cheng Laboratory

Abstract: Segmentation of ultrahigh resolution (UHR) images is a critical task with numerous applications, yet it poses significant challenges due to high spatial resolution and rich fine details. Recent approaches adopt a dual-branch architecture, where a global branch learns long-range contextual information and a local branch captures fine details. However, they struggle to handle the conflict between global and local information while adding significant extra computational cost. Inspired by the human visual system's ability to rapidly orient attention to important areas with fine details and filter out irrelevant information, we propose a novel UHR segmentation method called Boundary-enhanced Patch-merging Transformer (BPT). BPT consists of two key components: (1) Patch-Merging Transformer (PMT) for dynamically allocating tokens to informative regions to acquire global and local representations, and (2) Boundary-Enhanced Module (BEM) that leverages boundary information to enrich fine details. Extensive experiments on multiple UHR image segmentation benchmarks demonstrate that our BPT outperforms previous state-of-the-art methods without introducing extra computational overhead.

Abstract: Assessing the quality of artificial intelligencegenerated images (AIGIs) plays a crucial role in their application in real-world scenarios. However, traditional image quality assessment (IQA) algorithms primarily focus on low-level visual perception, while existing IQA works on AIGIs overemphasize the generated content itself, neglecting its effectiveness in real-world applications. To bridge this gap, we propose AIGI-VC, a quality assessment database for AI-Generated Images in Visual Communication, which studies the communicability of AIGIs in the advertising field from the perspectives of information clarity and emotional interaction. The dataset consists of 2,500 images spanning 14 advertisement topics and 8 emotion types. It provides coarse-grained human preference annotations and fine-grained preference descriptions, benchmarking the abilities of IQA methods in preference prediction, interpretation, and reasoning. We conduct an empirical study of existing representative IQA methods and large multi-modal models on the AIGI-VC dataset, uncovering their strengths and weaknesses.

Abstract: In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chainof-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationship among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, involves a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and presents validity in retaining the language model’s generation ability. Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions.

Abstract: Semantic Scene Completion (SSC) aims to reconstruct a 3D voxel representation occupied by semantic classes based on ordinary inputs such as 2D RGB images, depth maps, or point clouds. Given the costeffective and promising applications in autonomous driving, camera-based SSC has attracted considerable attention to developing various approaches. However, current methods mainly focus on precise 2D-to-3D projection while overlooking the challenge of completing invisible regions, leading to numerous false negatives and suboptimal SSC performance. To address this issue, we propose a novel architecture, Memory-augmented Re-completion (MARE), designed to enhance completion capability. Our MARE model encapsulates regional relationships by incorporating a memory bank that stores vital region-tokens while two protocols concerning diversity and age are adopted to optimize the bank adversarially. Additionally, we introduce a Re-completion pipeline incorporated with an Information Spreading module to progressively complete the invisible regions while bridging the scale gap between region-level and voxel-level information. Extensive experiments conducted on the SSCBench-KITTI-360 and SemanticKITTI datasets validate the effectiveness of our approach.

Abstract: Incontext segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.

Abstract: The goal of imagebased virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets.

Abstract: We introduce MMMixing, a multi-modal mixing alignment framework for 3D understanding. MM-Mixing applies mixing-based methods to multi-modal data, preserving and optimizing cross-modal connections while enhancing diversity and improving alignment across modalities. Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder. The first stage employs feature-level mixing with contrastive learning to align 3D features with their corresponding modalities. The second stage incorporates both feature-level and input-level mixing, introducing mixed point cloud inputs to further refine 3D feature representations. MM-Mixing enhances intermodality relationships, promotes generalization, and ensures feature consistency while providing diverse and realistic training samples. We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks.

Abstract: Recovering 4D world from monocular video is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multiview videos, known camera parameters, or static scenes. In this paper, we relax all these constraints and tackle a highly ambitious but practical task: With only one monocular video without camera parameters, we aim to recover the dynamic 3D world alongside the camera poses. To solve this, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video to a 4D scene, as a flow of 3D Gaussians through space and time. GFlow starts by segmenting the video into still and moving parts, then alternates between optimizing camera poses and the dynamics of the 3D Gaussian points. This method ensures consistency among adjacent points and smooth transitions between frames. Since dynamic scenes always continually introduce new visual content, we present prior-driven initialization and pixel-wise densification strategy for Gaussian points to integrate new content. By combining all those techniques, GFlow transcends the boundaries of 4D recovery from causal videos; it naturally enables tracking of points and segmentation of moving objects across frames. Additionally, GFlow estimates the camera poses for each frame, enabling novel view synthesis by changing camera pose. This capability facilitates extensive scene-level or object-level editing, highlighting GFlow's versatility and effectiveness.

Abstract: Current Siamese and Transformer trackers commonly use various subtask branches like regression and classification to predict object states. Despite the demonstrated success, these subtask branches might introduce location and scale offsets due to discrepancies and misalignment in the respective predictions. To address this, we propose a novel generative tracker, MIMTrack, which defines tracking as a Masked Image Modeling (MIM) process combined with incontext learning (ICL). MIMTrack begins with building the visual prompt image, which consists of a template, a search area, and two target images associated with them. The target image transforms the bounding box into a unified RGB image space as other tracking image. All states prediction are naturally aligned by pixels generation of search target image. In light of this, we perform a MIM process within the visual prompt to reconstruct a masked search target image using the context from other parts. MIM with ICL makes use of implicit cross-relations between template and search area. A singlestream generative framework reduces the offset in the estimation. Furthermore, a latent memory module is introduced as a plugin to enhance pixel generation by leveraging various target appearances over time. The advanced performance observed on leading benchmark datasets highlights the simplicity and effectiveness of our MIMTrack framework.

Abstract: Crossdomain talking head generation, such as animating a static cartoon animal photo with real human video, is crucial for personalized content creation. However, prior works typically rely on domain-specific frameworks and paired videos, limiting its utility and complicating its architecture with additional motion alignment modules. Addressing these shortcomings, we propose Anytalk, a unified framework that eliminates the need for paired data and learns a shared motion representation across different domains. The motion is represented by canonical 3D keypoints extracted using an unsupervised 3D keypoint detector. Further, we propose an expression consistency loss to improve the accuracy of facial dynamics in video generation. Additionally, we present AniTalk, a comprehensive dataset designed for advanced multi-modal cross-domain generation. Our experiments demonstrate that Anytalk excels at generating high-quality, multi-modal talking head videos, showcasing remarkable generalization capabilities across diverse domains.

Abstract: Image superresolution (SR) is essential for bridging the gap between modern hardware and real-time computer graphics (CG) applications. It reduces CG workload by allowing low-resolution rendering, with original quality restored later via mathematical operations or machine learning. However, recent learning-based SR methods often rely on complex models, demanding high computational resources and undermining the benefits of reduced rendering workload. Our qualitative and quantitative analysis of the SR process and rendering reveals that readily accessible rendering information can significantly enhance neural network design by serving as additional features. To capitalize on this, we propose CGSR, an optimization framework designed for lightweight real-time super-resolution. CGSR utilizes rendering information to boost both network extensibility and efficiency. It utilizes progressively available rendering information from the pipeline, which arrives earlier than the rendered frame, enabling pre-processing and masking of latency. These features are then integrated into a selected SR network backbone to form a CG-enhanced network. This network is further optimized and refined into a CG-optimized version using neural architecture search (NAS). To improve runtime performance, CGSR also employs rendering-aware hybrid pruning, which dynamically prunes the network based on temporal rendering data. Evaluation results show that CGSR significantly reduces parameter size, multi-add operations, and inference time while maintaining high SR quality across various backbone SR networks.

Abstract: Group Equivariant Convolution (GConv) can capture rotational equivariance from original data. It assumes uniform and strict rotational equivariance across all features as the transformations under the specific group. However, the presentation or distribution of realworld data rarely conforms to strict rotational equivariance, commonly referred to as Rotational Symmetry-Breaking (RSB) in the system or dataset, making GConv unable to adapt effectively to this phenomenon. Motivated by this, we propose a simple but highly effective method to address this problem, which utilizes a set of learnable biases called G-Biases under the group order to break strict group constraints and then achieve a Relaxed Rotational Equivariant Convolution (RREConv). To validate the efficiency of RREConv, we conduct extensive ablation experiments on the discrete rotational group Cn. Experiments demonstrate that the proposed RREConv-based methods achieve excellent performance compared to existing GConv-based methods in both classification and 2D object detection tasks on the natural image datasets.

Abstract: Recent methods, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated remarkable capabilities in novel view synthesis. However, despite their success in producing highquality images for viewpoints similar to those seen during training, they struggle when generating detailed images from viewpoints that significantly deviate from the training set, particularly in close-up views. The primary challenge stems from the lack of specific training data for close-up views, leading to the inability of current methods to render these views accurately. To address this issue, we introduce a novel pseudo-label-based learning strategy. This approach leverages pseudo-labels derived from existing training data to provide targeted supervision across a wide range of close-up viewpoints. Recognizing the absence of benchmarks for this specific challenge, we also present a new dataset designed to assess the effectiveness of both current and future methods in this area. Our extensive experiments demonstrate the efficacy of our approach.

Abstract: Automatic Radiology Report Generation (RRG) is an important topic for alleviating the substantial workload of radiologists. Existing RRG approaches rely on supervised regression based on different architectures or additional knowledge injection, while the generated report may not align optimally with radiologists’ preferences. Especially, since the preferences of radiologists are inherently heterogeneous and multidimensional, e.g., some may prioritize report fluency, while others emphasize clinical accuracy. To address this problem, we propose a new RRG method via Multi-objective Preference Optimization (MPO) to align the pre-trained RRG model with multiple human preferences, which can be formulated by multi-dimensional reward functions and optimized by multi-objective reinforcement learning (RL). Specifically, we use a preference vector to represent the weight of preferences and use it as a condition for the RRG model. Then, a linearly weighed reward is obtained via a dot product between the preference vector and multi-dimensional reward. Next, the RRG model is optimized to align with the preference vector by optimizing such a reward via RL. In the training stage, we randomly sample diverse preference vectors from the preference space and align the model by optimizing the weighted multi-objective rewards, which leads to an optimal policy on the entire preference space. When inference, our model can generate reports aligned with specific preferences without further fine-tuning. Extensive experiments on two public datasets show the proposed method can generate reports that cater to different preferences in a single model and achieve state-of-the-art performance.

Abstract: Neural Radiance Fields (NeRF) with hybrid representations have shown impressive capabilities for novel view synthesis, delivering high efficiency. Nonetheless, their performance significantly drops with sparse input views. Various regularization strategies have been devised to address these challenges. However, these strategies either require additional rendering costs or involve complex pipeline designs, leading to a loss of training efficiency. Although FreeNeRF has introduced an efficient frequency annealing strategy, its operation on frequency positional encoding is incompatible with the efficient hybrid representations. In this paper, we introduce an accurate and efficient fewshot neural rendering method named Spatial Annealing regularized NeRF (SANeRF), which adopts the pre-filtering design of a hybrid representation. We initially establish the analytical formulation of the frequency band limit for a hybrid architecture by deducing its filtering process. Based on this analysis, we propose a universal form of frequency annealing in the spatial domain, which can be implemented by modulating the sampling kernel to exponentially shrink from an initial one with a narrow grid tangent kernel spectrum. This methodology is crucial for stabilizing the early stages of the training phase and significantly contributes to enhancing the subsequent process of detail refinement. Our extensive experiments reveal that, by adding merely one line of code, SANeRF delivers superior rendering quality and much faster reconstruction speed compared to current few-shot neural rendering methods. Notably, SANeRF outperforms FreeNeRF on the Blender dataset, achieving 700X faster reconstruction speed.

Abstract: Transformerbased networks have set new benchmarks in light field super-resolution (SR), but adapting them to capture both global and local spatial-angular correlations efficiently remains challenging. Moreover, many methods fail to account for geometric details like occlusions, leading to performance drops. To tackle these issues, we introduce OHT. This hybrid network leverages occlusion maps through an occlusion-embedded mix layer. It combines the strengths of convolutional networks and Transformers via spatial-angular separable convolution (SASep-Conv) and angular self-attention (ASA). SASep-Conv offers a lightweight alternative to 3D convolution for capturing spatial-angular correlations, while the ASA mechanism applies 3D self-attention across the angular dimension. These designs allow OHT to capture global angular correlations effectively. Extensive experiments on multiple datasets demonstrate OHT's superior performance.

Key Laboratory of Education Blockchain and Intelligent Technology Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China, Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China, Key Laboratory of Education Blockchain and Intelligent Technology Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China

Abstract: How to make a good tradeoff between performance and computational cost is crucial for a tracker. However, current famous methods typically focus on complicated and time-consuming learning that combining temporal and appearance information by input more and more images (or features). Consequently, these methods not only increase the model's computational source and learning burden but also introduce much useless and potentially interfering information. To alleviate the above issues, we propose a simple yet robust tracker that separates temporal information learning from appearance modeling and extracts temporal relations from a set of representative tokens rather than several images (or features). Specifically, we introduce one track token for each frame to collect the target's appearance information in the backbone. Then, we design a mamba-based Temporal Module for track tokens to be aware of context by interacting with other track tokens within a sliding window. This module consists of a mamba layer with autoregressive characteristic and a cross-attention layer with strong global perception ability, ensuring sufficient interaction for track tokens to perceive the appearance changes and movement trends of the target. Finally, track tokens serve as a guidance to adjust the appearance feature for the final prediction in the head. Experiments show our method is effective and achieves competitive performance on multiple benchmarks at a real-time speed.

Abstract: Following the recent popularity of vision language models, several attempts, e.g., parameterefficient fine-tuning (PEFT), have been made to extend them to different downstream tasks. Previous PEFT works motivate their methods from the view of introducing new parameters for adaptation but still need to learn this part of weight from scratch, i.e., random initialization. In this paper, we present a novel strategy that incorporates the potential of prompts, e.g., vision features, to facilitate the initial parameter space adapting to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed as FATE, which injects informative features from the vision encoder into language encoder's parameters space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we can directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE exhibits superior generalization performance on 11 datasets with a very small amount of extra parameters and computation.

Abstract: Transformerbased methods have achieved remarkable performance in event-based object detection, owing to the global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To mitigate computation cost, some researchers propose window attention based sparsification strategies to discard unimportant regions, which sacrifices the global modeling ability and results in suboptimal performance. To achieve better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, a Spatio-Temporal Continuity Assessment module is proposed to measure the information content of tokens and discard uninformative ones by leveraging the spatiotemporal distribution differences between activity and noise events. Based on the assessment results, an Information-Prioritized Local Scan strategy is designed to shorten the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, a Global Channel Interaction module is proposed to aggregate channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.

Abstract: Generating sketches that accurately reflect the content of reference images presents numerous challenges. Current methods either require paired training data or fail to accommodate a wider range and diversity of sketch styles. While pretrained diffusion models have shown strong text-based control capabilities for reference-based content sketch generation, state-of-the-art methods still struggle with reference-based sketch generation for given content. The main difficulties lie in (1) balancing content preservation with style enhancement, and (2) representing content image textures at varying levels of abstraction to approximate the reference sketch style. In this paper, we propose a method (Ref2Sketch-SA) that transforms a given content image into a sketch based on a reference sketch. The core strategies include (1) using DDIM Inversion to enhance structural consistency in the sketch generation of content images; (2) injecting noise into the input image during the denoising process to produce a sketch that retains content attributes while aligning with, yet differing in texture from, the reference. Our model demonstrates superior performance across multiple evaluation metrics, including user style preference.

Abstract: We introduce RealPortrait, a framework based on Diffusion Transformers (DiT), designed to generate highly expressive and visually appealing portrait animations. Given a static portrait image, our method can transfer complex facial expressions and head pose movements extracted from a driving video onto the portrait, transforming it into a lifelike video. Specifically, we exploit the robust spatialtemporal modeling capabilities of DiT, enabling the generation of portrait videos that maintain high-fidelity visual details and ensure temporal coherence. In contrast to conventional image-to-video generation frameworks that necessitate a separate reference network, we incorporate an efficient reference attention within the DiT backbone, thereby obviating the computational overhead and achieving superior reference appearance preservation. Concurrently, we integrate a parallel ControlNet to precisely regulate intricate facial expressions and head poses. Diverging from prior methods that utilize explicit sparse motion representations, such as facial landmarks or 3DMM coefficients, we adopt a dense implicit motion representation as the control guidance. This implicit motion representation excels in capturing nuanced emotional facial expressions and subtle non-rigid dynamics of the lips. To further enhance the generalization capability of the model, we augment the training dataset by incorporating a substantial volume of facial image data through random crop augmentation. This strategy ensures the model's robustness across a wide variety of facial appearances and expressions. Empirical evaluations demonstrate that RealPortrait excels in generating portrait animations with highly-realistic quality and exceptional temporal coherence in appearance retention.

Abstract: Recently, deep learning based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a predefined semantic class set, thus needing additional image annotation and model training when adapting to new classes. More importantly, they are unable to segment arbitrary semantic classes. In this work, we introduce OpenVocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes. In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models and versatile capabilities of general vision-language models. Technically, GSNet consists of a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature Fusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both special models and general models in dual streams. Then, with the guidance of variable vocabularies, QGFF integrates specialist and generalist features, enabling them to complement each other. Finally, RIPD is proposed to aggregate multi-source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin, and our proposed LandDiscover50K improves the performance of OVRSISS methods. The dataset and method will be publicly available.

Abstract: Previous research has shown that constraining the gradient of loss function w.r.t. modelpredicted probabilities can enhance the model robustness against noisy labels. These methods typically specify a fixed optimal threshold for gradient clipping through validation data to obtain the desired robustness against noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisy-labeled samples at different stages of training, significantly limiting the model capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This approach allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide statistical analysis to certify the noise-tolerance ability of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach.

Abstract: Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs have a huge computational cost due to their inherent reliance on multihead self-attention (MHSA), prompting efforts to accelerate ViTs for practical applications. To this end, recent works aim to reduce the number of tokens, mainly focusing on how to effectively prune or merge them. Nevertheless, since ViT tokens are generated from non-overlapping grid patches, they usually do not convey sufficient semantics, making it incompatible with efficient ViTs. To address this, we propose ImagePiece, a novel re-tokenization strategy for Vision Transformers. Following the MaxMatch strategy of NLP tokenization, ImagePiece groups semantically insufficient yet locally coherent tokens until they convey meaning. This simple retokenization is highly compatible with previous token reduction methods, being able to drastically narrow down relevant tokens, enhancing the inference speed of DeiT-S by 54% (nearly 1.5x faster) while achieving a 0.39% improvement in ImageNet classification accuracy. For hyper-speed inference scenarios (with 251% acceleration), our approach surpasses other baselines by an accuracy over 8%.

Abstract: We propose actionagnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention in AAPL supervision. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on the variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance.

Abstract: Recently, patch deformationbased methods have demonstrated significant strength in multi-view stereo by adaptively expanding the reception field of patches to help reconstruct textureless areas. However, such methods mainly concentrate on searching for pixels without matching ambiguity (i.e., reliable pixels) when constructing deformed patches, while neglecting the deformation instability caused by unexpected edge-skipping, resulting in potential matching distortions. Addressing this, we propose MSP-MVS, a method introducing multi-granularity segmentation prior for edge-confined patch deformation. Specifically, to avoid unexpected edge-skipping, we first aggregate and further refine multi-granularity depth edges gained from Semantic-SAM as prior to guide patch deformation within depth-continuous (i.e., homogeneous) areas. Moreover, to address attention imbalance caused by edge-confined patch deformation, we implement adaptive equidistribution and disassemble-clustering of correlative reliable pixels (i.e., anchors), thereby promoting attention-consistent patch deformation. Finally, to prevent deformed patches from falling into local-minimum matching costs caused by the fixed sampling pattern, we introduce disparity-sampling synergistic 3D optimization to help identify global-minimum matching costs. Evaluations on ETH3D and Tanks & Temples benchmarks prove our method obtains state-of-the-art performance with remarkable generalization.

Abstract: Machine learning model bias can arise from dataset composition: correlated sensitive features can distort the downstream classification model's decision boundary and lead to performance differences along these features. Existing debiasing works tackle the most prominent bias features, such as colors of digits or background of animals. However, real-world datasets often include a large number of feature correlations that intrinsically manifest in the data as common sense information. Such spurious visual cues can further reduce model robustness. Thus, domain practitioners desire a comprehensive understanding of correlations and the flexibility to address relevant biases. To this end, we propose a novel framework to extract comprehensive biases in image datasets based on textual descriptions, a common sense-rich modality. Specifically, features are constructed by clustering noun phrase embeddings with similar semantics. The presence of each feature across the dataset is inferred, and their co-occurrence statistics are measured, with spurious correlations optionally examined by a human-in-the-loop module. Downstream experiments show that our method uncovers novel model biases in multiple image benchmark datasets. Furthermore, the discovered bias can be mitigated by simple data re-weighting to de-correlate the features, outperforming state-of-the-art unsupervised bias mitigation methods.

School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China, School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Qilu University of Technology (Shandong Academy of Sciences), Shandong, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, School of Artificial Intelligence, Beijing Normal University, Beijing, China MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China, College of Computer Science and Technology, Hengyang Normal University, Hunan, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: As immersive experiences become increasingly popular, panoramic video has garnered significant attention in both research and applications. The high cost associated with capturing panoramic video underscores the need for efficient promptbased generation methods. Although recent text-to-video (T2V) diffusion techniques have shown potential in standard video generation, they face challenges when applied to panoramic videos due to substantial differences in content and motion patterns. In this paper, we propose PanoDiT, a framework that utilizes the Diffusion Transformer (DiT) architecture to generate panoramic videos from text descriptions. Unlike traditional methods that rely on UNet-based denoising, our method leverages a transformer architecture for denoising, incorporating both temporal and global attention mechanisms. This ensures coherent frame generation and smooth motion transitions, offering distinct advantages in long-horizon generation tasks. To further enhance motion and consistency in the generated videos, we introduce DTM-LoRA and two panoramic-specific losses. Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and user study, with code is available in the supplementary material.

Abstract: Learned image compression (LIC) has achieved stateof-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible to mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model finetune while transmitting fewer than 2% of the parameters.

Abstract: Current hair transfer methods struggle to handle diverse and intricate hairstyles, limiting their applicability in realworld scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named Stable-Hair, which robustly transfers a wide range of real-world hairstyles to user-provided faces for virtual hair try-on. To achieve this goal, our Stable-Hair framework is designed as a two-stage pipeline. In the first stage, we train a Bald Converter alongside stable diffusion to remove hair from the user-provided face images, resulting in bald images. In the second stage, we specifically designed a Hair Extractor and a Latent IdentityNet to transfer the target hairstyle with highly detailed and high-fidelity to the bald image. The Hair Extractor is trained to encode reference images with the desired hairstyles, while the Latent IdentityNet ensures consistency in identity and background. To minimize color deviations between source images and transfer results, we introduce a novel Latent ControlNet architecture, which functions as both the Bald Converter and Latent IdentityNet. After training on our curated triplet dataset, our method accurately transfers highly detailed and high-fidelity hairstyles to the source images. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing hair transfer methods.

Abstract: Image manipulation localization (IML) is a critical technique in media forensics, focusing on identifying tampered regions within manipulated images. Most existing IML methods require extensive training on labeled datasets with both imagelevel and pixel-level annotations. These methods often struggle with new manipulation types and exhibit low generalizability. In this work, we propose a training-free IML approach using diffusion models. Our method adaptively selects an appropriate number of diffusion timesteps for each input image in the forward process and performs both conditional and unconditional reconstructions in the backward process without relying on external conditions. By comparing these reconstructions, we generate a localization map highlighting regions of manipulation based on inconsistencies. Extensive experiments were conducted using sixteen state-of-the-art (SoTA) methods across six IML datasets. The results demonstrate that our training-free method outperforms SoTA unsupervised and weakly-supervised techniques. Furthermore, our method competes effectively against fully-supervised methods on novel (unseen) manipulation types.

Abstract: Deep hashing has been widely used for largescale approximate nearest neighbor search due to its storage and search efficiency. However, existing deep hashing methods predominantly rely on abundant training data, leaving the more challenging scenario of low-resource adaptation for deep hashing relatively underexplored. This setting involves adapting pre-trained models to downstream tasks with only an extremely small number of training samples available. Our preliminary benchmarks reveal that current methods suffer significant performance degradation due to the distribution shift caused by limited training samples. To address these challenges, we introduce Class-Calibration LoRA (CLoRA), a novel plug-and-play approach that dynamically constructs low-rank adaptation matrices by leveraging class-level textual knowledge embeddings. CLoRA effectively incorporates prior class knowledge as anchors, enabling parameter-efficient fine-tuning while maintaining the original data distribution. Furthermore, we propose Knowledge-Guided Discrete Optimization (KIDDO), a framework to utilize class knowledge to compensate for the scarcity of visual information and enhance the discriminability of hash codes. Extensive experiments demonstrate that our proposed method, Knowledge- Anchored Low-Resource Adaptation Hashing (KALAHash), significantly boosts retrieval performance and achieves a 4× data efficiency in low-resource scenarios.

Abstract: Diffusion models have exhibited impressive prowess in the textto-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling specific local region according to user-defined image conditions, while the remaining regions are only conditioned by the original text prompt. However, it is non-trivial to achieve it. The naive manner of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores which lack the strongest response to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies are operated at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.

Abstract: Person reidentification (Re-ID) is crucial for intelligent surveillance systems, facilitating the identification of individuals across multiple camera views. While significant advancements have been made for daytime scenarios, ensuring reliable Re-ID performance during nighttime remains a significant challenge. Given the cost and limited accessibility of infrared cameras, we investigate a critical question: Can RGB cameras be effectively utilized for accurate Re-ID during nighttime? To address this, we introduce NightReID, a large-scale RGB Re-ID dataset collected from a real-world nighttime surveillance system. NightReID includes 1,500 identities and over 53,000 images, capturing diverse scenes with complex lighting and adverse weather conditions. This rich dataset provides a valuable benchmark for advancing nighttime Re-ID research. Moreover, we propose the Enhancement, Denoising, and Alignment (EDA) framework with two novel modules to enhance nighttime Re-ID performance. First, an unsupervised Image Enhancement and Denoising (IED) method is designed to improve the quality of nighttime images, preserving critical details while removing noise without requiring paired ground truth. Second, we introduce Data Distribution Alignment (DDA) through statistical priors, aligning the distributions between pre-training data and nighttime data to mitigate domain shift. Extensive experiments on multiple nighttime Re-ID datasets demonstrate the significance of NightReID and validate the efficacy, flexibility, and applicability of the EDA framework.

Abstract: Model immunization is an emerging direction that aims to mitigate the potential risk of misuse associated with opensourced models and advancing adaptation methods. The idea is to make the released models' weights difficult to fine-tune on certain harmful applications, hence the name "immunized". Recent work on model immunization focuses on the single-concept setting. However, in real-world situations, models need to be immunized against multiple concepts. To address this gap, we propose an immunization algorithm that, simultaneously, learns a single "difficult initialization" for adaptation methods over a set of concepts. We achieve this by incorporating a differentiable merging layer that combines a set of model weights adapted over multiple concepts. In our experiments, we demonstrate the effectiveness of multi-concept immunization by generalizing prior work's experiment setup of re-learning and personalization adaptation to multiple concepts.

Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541004, China

Abstract: The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel SelfSupervised Tracking framework, named SSTrack, designed to eliminate the need of box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, TrackingNet datasets, respectively.

Abstract: Although significant progress has been made in enhancing visibility, retrieving texture details, and mitigating noise in LowLight (LL) images, the challenge persists in applying current Low-Light Image Enhancement (LLIE) methods to real-world scenarios, primarily due to the diverse illumination conditions encountered. Furthermore, the quest for generating enhancements that are visually realistic and attractive remains an underexplored realm. In response to these challenges, we present a novel LLIE framework with the guidance of Generative Perceptual Priors (GPP-LLIE) derived from vision-language models (VLMs). Specifically, we first propose a pipeline that guides VLMs to assess multiple visual attributes of the LL image and quantify the assessment to output the global and local perceptual priors. Subsequently, to incorporate these generative perceptual priors to benefit LLIE, we introduce a transformer-based backbone in the diffusion process, and develop a new layer normalization (GPP-LN) and an attention mechanism (LPP-Attn) guided by global and local perceptual priors. Extensive experiments demonstrate that our model outperforms current SOTA methods on paired LL datasets and exhibits superior generalization on real-world data.

Abstract: Compositional visual question answering (Compositional VQA) needs to provide an answer to a compositional question, which requires the model to have advanced capabilities of multimodal semantic understanding and logical reasoning. However, current VQA models mainly concentrate on enriching the visual representations of images and neglect the redundancy in the enriched information to bring some negative impacts. To enhance the value and availability of semantic features, we propose a novel core-to-global reasoning (CTGR) model for compositional VQA. The model first extracts both global features and core features from image and question through a feature embedding module. Then, to enhance the value of semantic features, we propose an information filtering module to align visual features and text features at the core semantic level and to filter out the redundancy carried by image and question features at the global semantic level, which can further strengthen cross-modal correlations. Besides, we design a novel core-to-global reasoning mechanism for multimodal fusion, which integrates content features from core learning and context features from global features for accurate answer predictions. Finally, extensive experimental results on GQA, GQA-sub, VQA2.0 and Visual7W demonstrate the effectiveness and superiority of CTGR.

Abstract: We propose a dynamic Computed Tomography (CT) reconstruction framework called STNF4D (SpatioTemporalaware Neural Fields). First, we represent the 4D scene using four orthogonal volumes and compress these volumes into more compact hash grids. Compared to the plane decomposition method, this method enhances the model's capacity while keeping the representation compact and efficient. However, in densely predicted high-resolution dynamic CT scenes, the lack of constraints and hash conflicts in the hash grid features lead to obvious dot-like artifact and blurring in the reconstructed images. To address these issues, we propose the Spatiotemporal Transformer (ST-Former) that guides the model in selecting and optimizing features by sensing the spatiotemporal information in different hash grids, significantly improving the quality of reconstructed images. We conducted experiments on medical and industrial datasets covering various motion types, sampling modes, and reconstruction resolutions. Experimental results show that our method outperforms the second-best by 5.99 dB and 4.11 dB in medical and industrial scenes, respectively.

Abstract: This paper introduces a novel exemplarbased framework for reading Chinese texts in natural scene or document images. We present the Deep Exemplar-based Chinese Text Recognizer, which is structured to first identify candidate characters as exemplars from each text-line, and subsequently recognize them by retrieving analogous exemplars from a database. With text-line level annotations, we design the exemplar discovery network to simultaneously recognize texts and capture individual character positions in a weak-supervision manner. The exemplar retrieval module is then crafted to identify the most similar exemplar and propagate the corresponding character label. This enables us to effectively rectify the misrecognized characters and boost the performance of scene text recognition. Experiments on four scenarios of Chinese texts demonstrate the effectiveness of our proposed framework.

Abstract: Animation line inbetweening is a crucial step in animation production aimed at enhancing animation fluidity by predicting intermediate line arts between two key frames. However, existing methods face challenges in effectively addressing sparse pixels and significant motion in line art key frames. In literature, Chamfer Distance (CD) is commonly adopted for evaluating inbetweening performance. Despite achieving favorable CD values, existing methods often generate interpolated frames with line disconnections, especially for scenarios involving large motion. Motivated by this observation, we propose a simple yet effective interpolation method for animation line inbetweening that adopts thinplate spline-based transformation to estimate coarse motion more accurately by modeling the keypoint correspondence between two key frames, particularly for large motion scenarios. Building upon the coarse estimation, a motion refine module is employed to further enhance motion details before final frame interpolation using a simple UNet model. Furthermore, to more accurately assess the performance of animation line inbetweening, we refine the CD metric and introduce a novel metric termed Weighted Chamfer Distance, which demonstrates a higher consistency with visual perception quality. Additionally, we incorporate Earth Mover's Distance and conduct user study to provide a more comprehensive evaluation. Our method outperforms existing approaches by delivering high-quality interpolation results with enhanced fluidity.

Abstract: A constraint programming (CP) solver that implements proof logging will output a machinecheckable certificate of correctness alongside any result it obtains. This is useful for trusting claims of unsatisfiability or optimality, as well as for debugging and auditing solver implementations. Proofs can be constructed by having the solver log justifications for each inference it makes, and previous work has shown that many standard CP reasoning techniques can be efficiently justified using a pseudo-Boolean (PB) proof format. This paper extends PB justifications to propagators enforcing bounds consistency on multiplication and division constraints. We show that even though the proof system and checker operate only on linear inequalities over 0-1 variables, non-linear reasoning over bounded domains can be efficiently expressed as a sequence of PB proof steps. Additionally, we demonstrate that bespoke proof logging for bounds-consistency algorithms offers a clear advantage over constructing justifications by brute force.

Abstract: Hard combinatorial optimization problems, often mapped to Ising models, promise potential solutions with quantum advantage but are constrained by limited qubit counts in nearterm devices. We present an innovative quantum-inspired framework that dynamically compresses large Ising models to fit available quantum hardware of different sizes. Thus, we aim to bridge the gap between large-scale optimization and current hardware capabilities. Our method leverages a physics-inspired GNN architecture to capture complex interactions in Ising models and accurately predict alignments among neighboring spins (aka qubits) at ground states. By progressively merging such aligned spins, we can reduce the model size while preserving the underlying optimization structure. It also provides a natural trade-off between the solution quality and size reduction, meeting different hardware constraints of quantum computing devices. Extensive numerical studies on Ising instances of diverse topologies show that our method can reduce instance size at multiple levels with virtually no losses in solution quality on the latest D-wave quantum annealers.

Abstract: Model counting is a fundamental task that involves determining the number of satisfying assignments to a logical formula, typically in conjunctive normal form (CNF). While CNF model counting has received extensive attention over recent decades, interest in PseudoBoolean (PB) model counting is just emerging partly due to the greater flexibility of PB formulas. As such, we observed feature gaps in existing PB counters such as a lack of support for projected and incremental settings, which could hinder adoption. In this work, our main contribution is the introduction of the PB model counter PBCount2, the first exact PB model counter with support for projected and incremental model counting. Our counter, PBCount2, uses our Least Occurrence Weighted Min Degree (LOW-MD) computation ordering heuristic to support projected model counting and a cache mechanism to enable incremental model counting. In our evaluations, PBCount2 completed at least 1.40x the number of benchmarks of competing methods for projected model counting and at least 1.18x of competing methods in incremental model counting.

Abstract: Recommendation systems (RS) play a crucial role in assisting decisionmaking but often suffer from either a lack of credibility or unfairness problems. A few recommendation models have endeavored to address the problem from only one aspect, and approaches to solving both problems remain to be explored. This paper aims to construct a generalized fairness-based recommendation framework that can also provide the credibility of recommendation models. Generally, we propose a reliable and fair recommendation framework called Conformalized User Group Fairness (CUGF) based on the inspiration of conformal prediction. Specifically, we construct dynamic prediction sets that are guaranteed to cover the true item with a user pre-specified probability to ensure credibility while designing novel fairness metrics based on empirical risks to guarantee the fairness of users across different groups. Furthermore, we design a novel CUGF Algorithm to optimize the parameter γ that dominates the prediction sets and also the fairness. Besides, we conduct extensive experiments by applying CUGF on top of various recommendation models and representative datasets to validate its effectiveness with respect to recommendation performance (in terms of average set size) and fairness (in terms of the two defined fairness metrics), the results of which demonstrate the validity of the proposed framework.

Abstract: Multimedia event extraction aims to jointly extract event structural knowledge from multiple modalities, thus improving the comprehension and utilization of events in the growing multimedia content (e.g., multimedia news). A key challenge in multimedia event extraction is to establish crossmodal correlations during training without multimedia event annotations. Considering the complexity and cost of annotation across modalities, the multimedia event extraction task only provides parallel annotated data for evaluation. Previous works attempt to learn implicit correlations directly from unlabeled image-text pairs, but do not yield substantially better performance for event-centric tasks. To address this problem, we propose a cross-modal multi-task learning framework X-MTL to establish cross-modal correlations at the task level, which can simultaneously address four key tasks of multimedia event extraction: trigger detection, argument extraction, verb classification, and role classification. Specifically, to process inputs from different modalities and tasks, we utilize two separate modality-specific encoders and a modality-shared encoder to learn joint task representations, and introduce textual and visual prompt learning methods to enrich and unify task inputs. To resolve task conflict in cross-modal multi-task learning, we propose a pseudo label based knowledge distillation method, combined with dynamic weight adjustment method, which can effectively lift the performance to surpass the separately-trained models. On the Multimedia Event Extraction benchmark M2E2, experimental results show that X-MTL surpasses the current state-of-the-art (SOTA) methods by 4.1% for multimedia event mention and 8.2% for multimedia argument role.

Abstract: Traffic prediction provides vital support for urban traffic management and has received extensive research interest. By virtue of the ability to effectively learn spatial and temporal dependencies from a global view, Transformers have achieved superior performance in longterm traffic prediction. However, existing methods usually underrate the complex spatio-temporal entanglement in long-range sequences. Compared with purely temporal entanglement, spatio-temporal data emphasizes the entangled dynamics under the restrictions of traffic networks, which brings additional difficulties. Moreover, the computational costs of spatio-temporal Transformers scale quadratically as the sequence length grows, limiting their applications on long-range and large-scale scenarios. To address these problems, we propose a decomposed spatio-temporal Mamba (DST-Mamba) for traffic prediction. We aim to apply temporal decomposition to the entangled sequences and obtain the seasonal and trend parts. Shifting from the temporal view to the spatial view, we leverage Mamba, a state space model with near-linear complexity, to capture seasonal variations in a node-centric manner. Meanwhile, multi-scale trend information is extracted and aggregated by simple linear layers. Such combination equips DST-Mamba with superior capability to model long-range spatio-temporal dependencies while remaining efficient compared with Transformers. Experimental results across five real-world datasets demonstrate that DST-Mamba can capture both local fluctuations and global trends within traffic patterns, achieving state-of-the-art performance with favorable efficiency.

Abstract: Multiinterest recommendation constantly aspires to an oracle individual preference modeling approach, that satisfies the diverse and dynamic properties. Fueled by the deep learning technology, existing neural network (NN)-based recommender systems employ single-point or multi-point interest representation strategy to realize preference modeling,and boost the recommendation performance with a remarkable margin. However, as parameterized approximate functions, NN-based methods remain deficiencies with respect to the adaptability towards distinctive preference patterns cross different users and the calibration over the individual current intent. In this paper, we revisit multi-interest recommendation with the lens of stochastic process and Bayesian inference. Specifically, we propose to learn a distribution over functions to depict the individual diverse preferences rather than a unified function to approximate preference. Subsequently, the recommendation is encouraged with the uncertainty estimation which conforms to the dynamic shifting intent. Along these lines, we establish the connection between multi-interest recommendation and neural processes by proposing NP-Rec, which realizes the flexible multiple interests modeling and uncertainty estimation, simultaneously. Empirical study on 4 real world datasets demonstrates that our NP-Rec attains superior recommendation performances to several state-of-the-art baselines, where the average improvement achieves up to 13.94%.

State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center School of Computer Science and Artificial Intelligence, Hefei Normal University

Abstract: Dense retrieval has emerged as the leading approach in information retrieval, aiming to find semantically relevant documents based on natural language queries. Given that a single document can be retrieved by multiple distinct queries, existing methods aim to represent a document with multiple vectors. Each vector is aligned with a different query to model the manyto-one relationship between queries and documents. However, these multiple vector-based approaches encounter challenges such as Increased Storage, Vector Collapse, and Search Efficiency. To address these issues, we introduce the Distribution-Driven Dense Retrieval framework (DDR). Specifically, we use vectors to represent queries and distributions to represent documents. This approach not only captures the relationships between multiple queries corresponding to the same document but also avoids the need to use multiple vectors to represent the document. Furthermore, to ensure search efficiency for DDR, we propose a dot product-based computation method to calculate the similarity between documents represented by distributions and queries represented by vectors. This allows for seamless integration with existing approximate nearest neighbor (ANN) search algorithms for efficient search. Finally, we conduct extensive experiments on real-world datasets, which demonstrate that our method significantly outperforms traditional dense retrieval methods.

Abstract: The growing trend of sharing short videos on social media platforms, where users capture and share moments from their daily lives, has led to an increase in research efforts focused on microvideo recommendations. However, conventional methods oversimplify the modeling of skip behavior, categorizing interactions solely as positive or negative based on whether skipping occurs. This study was motivated by the importance of the first few seconds of micro-videos, leading to a refinement of signals into three distinct categories: highly positive, less positive, and negative. Specifically, we classify skip interactions occurring within a short time as negatives, while those occurring after a delay are categorized as less positive. The proposed dual-level graph and hierarchical ranking loss are designed to effectively learn these fine-grained interactions. Our experiments demonstrated that the proposed method outperformed three conventional methods across eight evaluation measures on two public datasets.

Abstract: Bundle recommendation aims to improve user experience by suggesting complementary items that users are likely to purchase together. Although recent advances in recommendation systems have shown promise, there are still significant challenges: i) The dynamic nature of user preferences and interactions introduces noise that can distort the effectiveness of recommendations. ii) Existing methods frequently exhibit limited robustness when addressing the sparsity of user interactions with bundles in realworld scenarios. To tackle these issues, we introduce a disentangled contrastive bundle recommendation (DCBR) framework with conditional diffusion. First, we propose a conditional bundle diffusion model for denoising the user-bundle interaction graph, introducing a bundle latent consistency constraint during the optimization process to mitigate the degradation of original interaction information. Subsequently, we design a triple-view denoised graph learning module to obtain effective representations from multiple views. Furthermore, we present a dual-level disentangled contrastive learning paradigm, which addresses the latent relationships at two levels: between views (inter-view) and within each view (intra-view). By maximizing the consistency between positive samples in these contrastive views, we generate disentangled contrastive signals, overcoming interaction sparsity and alleviating noise issues. Our experimental evaluations on three benchmark datasets reveal that DCBR significantly outperforms state-of-the-art methods.

Abstract: Signed Graph Neural Networks (SGNNs) have been shown to be effective in analyzing complex patterns in realworld situations where positive and negative links coexist. However, SGNN models suffer from poor explainability, which limit their adoptions in critical scenarios that require understanding the rationale behind predictions. To the best of our knowledge, there is currently no research work on the explainability of the SGNN models. Our goal is to address the explainability of decision-making for the downstream task of link sign prediction specific to signed graph neural networks. Since post-hoc explanations are not derived directly from the models, they may be biased and misrepresent the true explanations. Therefore, in this paper we introduce a Self-Explainable Signed Graph transformer (SE-SGformer) framework, which can not only outputs explainable information while ensuring high prediction accuracy. Specifically, we propose a new Transformer architecture for signed graphs and theoretically demonstrate that using positional encoding based on signed random walks has greater expressive power than current SGNN methods and other positional encoding graph Transformer-based approaches. We construct a novel explainable decision process by discovering the K-nearest (farthest) positive (negative) neighbors of a node to replace the neural network-based decoder for predicting edge signs. These K positive (negative) neighbors represent crucial information about the formation of positive (negative) edges between nodes and thus can serve as important explanatory information in the decision-making process. We conducted experiments on several real-world datasets to validate the effectiveness of SE-SGformer, which outperforms the state-of-the-art methods by improving 2.2% prediction accuracy and 73.1% explainablity accuracy in the best-case scenario.

Abstract: Sequential Recommender Systems (SRS) has stood out as a highly promising technique in numerous domains due to its impressive capability of capturing complex user preferences. Current SRS have employed transformerbased models to give the next-item prediction. Nevertheless, its quadratic computational complexity has often resulted in notable inefficiencies, posing a significant obstacle to real-time recommendation processes. Recently, Mamba has demonstrated its exceptional effectiveness in time series prediction, delivering substantial improvements in both efficiency and effectiveness. However, directly applying Mamba to SRS poses certain challenges. Its unidirectional structure may impede the ability to capture contextual information in user-item interactions, while its instability in state estimation may hinder the ability to capture short-term patterns in interaction sequences. To address these issues, we propose a novel framework called Selective Gated Mamba for Sequential Recommendation (SIGMA). By introducing the Partially Flipped Mamba (PF-Mamba), we construct a special bi-directional structure to address the context modeling challenge. Then, to consolidate PF-Mamba's performance, we employed an input-dependent Dense Selective Gate (DS Gate) to allocate the weights of the two directions and further filter the sequential information. Moreover, for short sequence modeling, we devise a Feature Extract GRU (FE-GRU) to capture the short-term dependencies. Experimental results demonstrate that SIGMA significantly outperforms existing baselines across five real-world datasets. Our implementation code is available in Supplementary Material to ease reproducibility.

Abstract: Previous ad auctions predominantly relied on rulebased mechanisms, which selected winning advertisements (ads) at the ad-level and subsequently combined them into page views (PVs), leading to suboptimal allocations in multi-round auctions. This limitation stems from the significant computational burden required to design ranking score rules and select winning ad sets, as well as the inability to fully capture contextual information within PVs during ad-level selection. In this paper, we propose a key-performance-indicator (KPI) based auction mechanism that selects winning PVs at the PV-level, modeling the ad allocation as a constrained optimization problem. This approach enables us to address both short-term and long-term KPIs while leveraging the comprehensive contextual information available within PVs. Based on this framework, we design GenAuction, a generative auction mechanism utilizing a Generator-Evaluator architecture powered by Transformer algorithms. The Generator swiftly generates multiple candidate PVs, while the Evaluator selects the optimal PVs based on contextual information, adhering to the objectives and KPIs of multi-round auctions. We conduct extensive experiments using real-world data and online A/B tests to validate that GenAuction efficiently handles multi-objective allocation tasks, demonstrating its efficacy and potential for real-world application.

Abstract: We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the topk tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.

Abstract: Sequential recommendation systems aim to predict the next item based on users' historical interactions. While traditional methods focus on learning feature representations or user preferences, they often struggle with detecting subtle demand shifts in short sequences, especially when these shifts are obscured by noise or biases. To address these issues, we propose CoDeR (Counterfactual Demand Reasoning), a novel framework designed to handle demand shifts in sequential recommendations with greater precision. CoDeR features two key modules: (1) the User Demand Extraction module, which utilizes selfattention mechanisms and demand graphs to identify and model demand shifts from minimal user interactions; and (2) the Counterfactual Demand Reasoning module, which employs causal effect analysis and backdoor adjustment techniques to distinguish true demand shifts from noisy or biased signals. Our approach represents the first application of counterfactual reasoning to sequential recommendation systems. Comprehensive experiments on three real-world datasets demonstrate that CoDeR significantly outperforms existing baselines.

Abstract: Pointof-Interest (POI) recommendation aims to predict users' future locations based on their historical check-ins. Despite the success of recent deep learning approaches in capturing POI semantics and user behavior, they continue to face the persistent problem of data sparsity and incompleteness. In this paper, we introduce Multi-Objective Adversarial Imitation Recommender (MOAIR), a novel framework that integrates Generative Adversarial Imitation Learning with multi-objective to address this issue. MOAIR effectively captures user behavior patterns and spatial-temporal contextual information via graph-enhanced self-supervised state encoder and overcomes data sparsity by robustly learning from limited data and generating diverse samples. By accommodating diverse user patterns in the training data, the framework also mitigates the typical mode-collapse issue in generative adversarial learning and thus enhances the overall performance. MOAIR employs a multi-objective imitation learning architecture where the imitation learning agent (IL agent) explores the POI space and receives multifaceted reward signals. Utilizing the Paralleled Proximal Policy Optimization (3PO) framework to optimize multi-objectives, the IL agent ensures efficient and stable policy updates. Additionally, to address the issue of high noise in POI recommendation scenarios, we use a novel generative way to define our policy net and incorporate a variational bottleneck for regularization to enhance the stability of adversarial learning. Comprehensive experiments reveal the superior performance for MOAIR compared to other baseline approaches, especially with sparse training data.

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering,Wuhan University, Cyberspace Security Laboratory, School of Network and Information Security, Xidian University, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering,Wuhan University School of Cyber Science and Engineering, Wuhan University, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering,Wuhan University, The School of Computer Science and Engineering, Nanyang Technological University, Singapore, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering,Wuhan University

Abstract: Fatigue is a critical factor contributing to accidents in industries such as safety monitoring and engineering construction. Fatigue exhibits dynamic complexity and nonstationary characteristics, so there are many intermediate states of short-term variation between alert and fatigue. Capturing and learning the signs of these intermediate states is essential for accurate fatigue assessment. However, current fatigue detection methods primarily rely on coarse-grained labels, typically spanning minutes to hours, and commonly treat alert and fatigue as two distinctly separate distributions, overlooking the expression of intermediate states and oversimplifying the rich distribution information of fatigue types and levels, thereby limiting detection effectiveness. To address these, this paper explores a refined representation of fatigue in terms of three dimensions: time, type, and level, and proposes a Multi-Dimensional Fine-Grained Modeling for Fatigue Detection (MDFG). This introduces the SmallLoss to extract trustworthy samples, utilizes clustering to identify diverse subtypes under alert and fatigued states, and establishes base class sets in each state. Subsequently, a complete base class set containing intermediate state bases is constructed using the base class synthesis method, which achieves the expression of intermediate fatigue states from absence to presence. Finally, fatigue levels are quantified based on the matching between samples and the complete base class set. Moreover, to cope with the complex variability of fatigue states, MDFG employs meta-learning for training. MDFG achieves an Average accuracy improvement of 10.0% and 12.1% on two real datasets compared to methods that do not consider fine-grained information. Extensive experiments demonstrate that the MDFG exhibits superior robustness and stability among current fatigue detection methods.

State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China

Abstract: Sequential recommendation aims to predict the next item a user is likely to interact with based on their historical interaction sequence. Capturing user intent is crucial in this process, as each interaction is typically driven by specific intentions (e.g., buying skincare products for skin maintenance, buying makeup for cosmetic purposes, etc.). However, users often have multiple, dynamically changing intents, making it challenging for models to accurately learn these intents when relying on the entire historical sequence as input. To address this, we propose a novel framework called Intent Oriented Contrastive Learning for Sequential Recommendation (IOCLRec). This framework begins by segmenting users’ sequential behaviors into multiple subsequences, which represent the coarsegrained intents of users at different points in their interaction history. These subsequences form the basis for the three contrastive learning modules within IOCLRec. The fine-grained intent contrastive learning module uncovers detailed intent representations, while the single-intent and multi-intent contrastive learning modules utilize intent-oriented data augmentation operators to capture the diverse intents of users. These three modules work synergistically, driving comprehensive performance optimization in intricate sequential recommendation scenarios. Our method has been extensively evaluated on four public datasets, demonstrating superior effectiveness.

Abstract: With the development of federated learning techniques and the increased need for user privacy protection, the federated recommendation has become a new recommendation paradigm. However, most existing works focus on userlevel federated recommendation, leaving platform-level federated recommendation largely unexplored. A significant challenge in platform-level federated recommendation scenarios is severe label skew. Users behave in various ways on different platforms, bringing up the rating and item bias problem. In this work, we propose FREIB (Federated Recommendation with Explicitly Encoding Item Bias). The core idea is explicitly encoding item bias during federated learning, addressing the problem of fuzzy item bias, and achieving consistent representation in label skew scenarios. We achieve this by utilizing global knowledge guidance to model common rating patterns and by aligning feature prototypes to enhance item encoding at the same rating level. Extensive experiments conducted on three public datasets demonstrate the superiority of our method over several state-of-the-art approaches.

Abstract: Coldstart sequential recommendation, where user interaction histories are sparse or minimal, remains a significant challenge in recommendation systems. Current meta-learning-based approaches rely heavily on the interaction histories of regular users to construct meta-tasks, aiming to acquire prior knowledge for cold-start adaptation. However, these methods often fail to account for preference discrepancies between regular and cold-start users, leading to biased preference modeling and suboptimal recommendations. To address this issue, we propose a novel counterfactual task-augmented meta-learning method for cold-start sequential recommendations. Our approach intervenes in user interaction histories to create counterfactual sequences that simulate potential but unrealized user behaviors, establishing counterfactual tasks within a meta-learning framework. Additionally, we aggregate meta-path neighbors to uncover latent relationships between items, enabling more detailed and accurate modeling of user preferences. Moreover, by integrating real and counterfactual task losses, we jointly optimize the model through a combination of global and local updates, enhancing its adaptability to cold-start scenarios. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art techniques, achieving superior results in cold-start sequential recommendation tasks.

Abstract: As multimedia information proliferates, multimodal recommendation systems have garnered significant attention. These systems leverage multimodal information to alleviate the data sparsity issue inherent in recommendation systems, thereby enhancing the accuracy of recommendations. Due to the natural semantic disparities among multimodal features, recent research has primarily focused on crossmodal alignment using self-supervised learning to bridge these gaps. However, aligning different modal features might result in the loss of valuable interaction information, distancing them from ID embeddings. It is crucial to recognize that the primary goal of multimodal recommendation is to predict user preferences, not merely to understand multimodal content. To this end, we propose a new Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR) method, which effectively reduces the gap among modalities while retaining interaction information. Specifically, MENTOR begins by extracting representations from each modality using both heterogeneous user-item and homogeneous item-item graphs. It then employs a multilevel cross-modal alignment task, guided by ID embeddings, to align modalities across multiple levels while retaining historical interaction information. To balance effectiveness and efficiency, we further propose an optional general feature enhancement task that bolsters the general features from both structure and feature perspectives, thus enhancing the robustness of our model.

Abstract: The development and evaluation of graph neural networks (GNNs) generally follow the independent and identically distributed (i.i.d.) assumption. Yet this assumption is often untenable in practice due to the uncontrollable data generation mechanism. In particular, when the data distribution shows a significant shift, most GNNs would fail to produce reliable predictions and may even make decisions randomly. One of the most promising solutions to improve the model generalization is to pick out causal invariant parts in the input graph. Nonetheless, we observe a significant distribution gap between the causal parts learned by existing methods and the groundtruth, leading to undesirable performance. In response to the above issues, this paper presents GPro, a model that learns graph causal invariance with progressive inference. Specifically, the complicated graph causal invariant learning is decomposed into multiple intermediate inference steps from easy to hard, and the perception of GPro is continuously strengthened through a progressive inference process to extract causal features that are stable to distribution shifts. We also enlarge the training distribution by creating counterfactual samples to enhance the capability of the GPro in capturing the causal invariant parts. Extensive experiments demonstrate that our proposed GPro outperforms the state-of-the-art methods by 4.91% on average. For datasets with more severe distribution shifts, the performance improvement can be up to 6.86%.

Abstract: Graph condensation (GC), which reduces the size of a largescale graph by synthesizing a small-scale condensed graph as its substitution, has benefited various graph learning tasks. However, existing GC methods rely on centralized data storage, which is unfeasible for real-world decentralized data distribution, and overlook data holders' privacy-preserving requirements. To bridge this gap, we propose and study the novel problem of federated graph condensation (FGC) for graph neural networks (GNNs). Specifically, we first propose a general framework for FGC, where we decouple the typical gradient matching process for GC into client-side gradient calculation and server-side gradient matching, integrating knowledge from multiple clients' subgraphs into one smaller condensed graph. Nevertheless, our empirical studies show that under the federated setting, the condensed graph will consistently leak data membership privacy, i.e., the condensed graph during federated training can be utilized to steal training data under the membership inference attack (MIA). To tackle this issue, we innovatively incorporate information bottleneck principles into the FGC, which only needs to extract partial node features in one local pre-training step and utilize the features during federated training. Theoretical and experimental analyses demonstrate that our framework consistently protects membership privacy during training. Meanwhile, it can achieve comparable and even superior performance against existing centralized GC and federated graph learning (FGL) methods.

Abstract: In this paper, we introduce an innovative hierarchical joint sourcechannel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.

Abstract: Trajectoryuser linking (TUL) aims to match anonymous trajectories to the most likely users who generated them, offering benefits for a wide range of real-world spatio-temporal applications. However, existing TUL methods are limited by high model complexity and poor learning of the effective representations of trajectories, rendering them ineffective in handling large-scale user trajectory data.In this work, we propose a novel Scalable Trajectory-User Linking with dual-stream representation networks for large-scale TUL problem, named ScaleTUL Specifically, ScaleTUL generates two views using temporal and spatial augmentations to exploit supervised contrastive learning framework to effectively capture the irregularities of trajectories. In each view, a dual-stream trajectory encoder consisting of a long-term encoder and a short-term encoder is designed to learn the unified representations of trajectories that fuses different temporal-spatial dependencies. Then, a TUL layer is used to associate the trajectories with the corresponding users in the representation space using a two-stage training model.Experimental results on check-in mobility datasets from three real-world cities and the nationwide U.S. demonstrate the superiority of ScaleTUL over state-of-the-art baselines for large-scale TUL tasks.

Abstract: In search scenarios, user experience can be hindered by erroneous queries due to typos, voice errors, or knowledge gaps. Therefore, query correction is crucial for search engines. Current correction models, usually small models trained on specific data, often struggle with queries beyond their training scope or those requiring contextual understanding. While the advent of Large Language Models (LLMs) offers a potential solution, they are still limited by their pretraining data and inference cost, particularly for complex queries, making them not always effective for query correction. To tackle these, we propose Trigger3, a large-small model collaboration framework that integrates the traditional correction model and LLM for query correction, capable of adaptively choosing the appropriate correction method based on the query and the correction results from the traditional correction model and LLM. Trigger3 first employs a correction trigger to filter out correct queries. Incorrect queries are then corrected by the traditional correction model. If this fails, an LLM trigger is activated to call the LLM for correction. Finally, for queries that no model can correct, a fallback trigger decides to return the original query. Extensive experiments demonstrate Trigger3 outperforms correction baselines while maintaining efficiency.

School of Computer Science and Engineering, Nanjing University of Science and Technology, China Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, China Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, China Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China, Hong Kong Baptist University, China, Sydney AI Centre, The University of Sydney, Sydney, School of Computer Science and Engineering, Nanjing University of Science and Technology, China Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China, Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China

Abstract: Outof-distribution (OOD) detection aims to identify the test examples that do not belong to the distribution of training data. The distance-based methods, which identify OOD examples based on their distances from the centroids of in-distribution (ID) examples, have demonstrated promising OOD detection performance. However, the objectives utilized in prior approaches are typically designed for classification and thus might not yield sufficient discriminative power to distinguish between ID and OOD examples. Therefore, this paper proposes a prototype-based contrastive learning framework for OOD detection, which is termed provable Discriminative Hyperspherical Embedding (DHE). The proposed framework provides a theoretical analysis of inter-class dispersion, which is proved to be fundamental in reducing the false positive rate (FPR) on OOD examples. Based on this, we devise an angular spread loss to achieve the maximal dispersion of the prototypes of different classes prior to training. Subsequently, a prototype-enhanced contrastive loss is introduced to align embeddings of ID examples closely with their corresponding prototypes. In our proposed DHE, the maximal prototype dispersion is theoretically proved, thereby avoiding the pitfalls of local optima commonly encountered by most existing methods. Experimental results demonstrate the effectiveness of our proposed DHE, which showcases a remarkable reduction in FPR95 (i.e., 5.37% on CIFAR-100) and more than doubling the computational efficiency when compared with the state-of-the-art methods.

Abstract: While Nash equilibria are guaranteed to exist, they may exhibit dense support, making them difficult to understand and execute in some applications. In this paper, we study ksparse commitments in games where one player is restricted to mixed strategies with support size at most k. Finding k-sparse commitments is known to be computationally hard. We start by showing several structural properties of k-sparse solutions, including that the optimal support may vary dramatically as k increases. These results suggest that naive greedy or double-oracle-based approaches are unlikely to yield practical algorithms. We then develop a simple approach based on mixed integer linear programs (MILPs) for zero-sum games, general-sum Stackelberg games, and various forms of structured sparsity. We also propose practical algorithms for cases where one or both players have large (i.e., practically innumerable) action sets, utilizing a combination of MILPs and incremental strategy generation. We evaluate our methods on synthetic and real-world scenarios based on security applications. In both settings, we observe that even for small support sizes, we can obtain more than 90% of the true Nash value while maintaining a reasonable runtime, demonstrating the significance of our formulation and algorithms.

Abstract: Present bias, the tendency to overvalue immediate rewards while undervaluing future ones, is a wellknown barrier to achieving long-term goals. As artificial intelligence and behavioral economics increasingly focus on this phenomenon, the need for robust mathematical models to predict behavior and guide effective interventions has become crucial. However, existing models are constrained by their reliance on the discreteness of time and limited discount functions. This study introduces a novel continuous-time mathematical model for agents influenced by present bias. Using the variational principle, we model human behavior, where individuals repeatedly act according to a sequence of states that minimize their perceived cost. Our model not only retains analytical tractability but also accommodates various discount functions. Using this model, we consider intervention optimization problems under exponential and hyperbolic discounting and theoretically derive optimal intervention strategies, offering new insights into managing present-biased behavior.

Abstract: We study the fundamental problem of fairly dividing a set of indivisible items among agents with (general) monotone valuations. The notion of envyfreeness up to any item (EFX) is considered to be one of the most fascinating fairness concepts in this line of work. Unfortunately, despite significant efforts, existence of EFX allocations is a major open problem in fair division, thereby making the study of approximations and relaxations of EFX a natural line of research. Recently, Caragiannis et al. [2023] introduced a promising relaxation of EFX, called epistemic EFX (EEFX). An allocation is EEFX, if for every agent, it is possible to shuffle the items in the remaining bundles so that she becomes ``EFX-satisfied''. Caragiannis et al. [2023] prove existence and polynomial-time computability of EEFX allocations for additive valuations. A natural question asks what happens when we consider valuations more general than additive? We address this important open question and answer it affirmatively by establishing the existence of EEFX allocations for an arbitrary number of agents with general monotone valuations. To the best of our knowledge, besides EF1, EEFX is the only known relaxation of EFX to have such strong existential guarantees. Furthermore, we complement our existential result by proving computational and information-theoretic lower bounds. We prove that even for an arbitrary number of (more than one) agents with identical submodular valuations, it is PLS-hard to compute EEFX allocations and it requires exponentially-many value queries to do so.

Abstract: We study the problem of computing fair divisions of a set of indivisible goods among agents with additive valuations. For the past many decades, the literature has explored various notions of fairness, that can be primarily seen as either having envybased or share-based lens. For the discrete setting of resource-allocation problems, envy-free up to any good (EFX) and maximin share (MMS) are widely considered as the flag-bearers of fairness notions in the above two categories, thereby capturing different aspects of fairness herein. Due to lack of existence results of these notions and the fact that a good approximation of EFX or MMS does not imply particularly strong guarantees of the other, it becomes important to understand the compatibility of EFX and MMS allocations with one another. In this work, we identify a novel way to simultaneously achieve MMS guarantees with EFX/EF1 notions of fairness, while beating the best known approximation factors by Chaudhury et al. and Amanatidis et al. Our main contribution is to constructively prove the existence of (i) a partial allocation that is both 2/3-MMS and EFX, and (ii) a complete allocation that is both 2/3-MMS and EF1. Our algorithms run in pseudo-polynomial time if the approximation factor for MMS is relaxed to 2/3 - e for any constant e>0 and in polynomial time if, in addition, the EFX (or EF1) guarantee is relaxed to (1-d)-EFX (or (1-d)-EF1) for any constant d>0. In particular, we improve from the best approximation factor known prior to our work by Chaudhury et al., which computes partial allocations that are 1/2-MMS and EFX in pseudo-polynomial time.

Abstract: Today we rely on networks that are created and maintained by smart devices. For such networks, there is no governing central authority but instead the network structure is shaped by the decisions of selfish intelligent agents. A key property of such communication networks is that they should be easy to navigate for routing data. For this, a common approach is greedy routing, where every device simply routes data to a neighbor that is closer to the respective destination. Networks of intelligent agents can be analyzed via a gametheoretic approach and in the last decades many variants of network creation games have been proposed and analyzed. In this paper we present the first game-theoretic network creation model that incorporates greedy routing, i.e., the strategic agents in our model are embedded in some metric space and strive for creating a network among themselves where all-pairs greedy routing is enabled. Besides this, the agents optimize their connection quality within the created network by aiming for greedy routing paths with low stretch. For our model, we analyze the existence of (approximate)-equilibria and the computational hardness in different underlying metric spaces. E.g., we characterize the set of equilibria in 1-2-metrics and tree metrics and show that Nash equilibria always exist. For Euclidean space, the setting which is most relevant in practice, we prove that equilibria are not guaranteed to exist but that the well-known Θ-graph construction yields networks having a low stretch that are game-theoretically almost stable. For general metric spaces, we show that approximate equilibria exist where the approximation factor depends on the cost of maintaining any link.

Abstract: Imperfectrecall games—in which players may forget previously acquired information—have found many practical applications, ranging from game abstractions to team games and testing AI agents. In this paper, we quantify the utility gain by endowing a player with perfect recall, which we call the value of recall (VoR). While VoR can be unbounded in general, we parameterize it in terms of various game properties, namely the structure of chance nodes and the degree of absentmindedness (the number of successive times a player enters the same information set). Further, we identify several pathologies that arise with VoR, and show how to circumvent them. We also study the complexity of computing VoR, and how to optimally apportion partial recall. Finally, we connect VoR to other previously studied concepts in game theory, including the price of anarchy. We use that connection in conjunction with the celebrated smoothness framework to characterize VoR in a broad class of games.

Abstract: Serial dictatorship is a simple mechanism for coordinating agents in solving combinatorial optimization problems according to their preferences. The most representative such problem is onesided matching, in which a set of n agents have values for a set of n items, and the objective is to compute a matching of the agents to the items of maximum total value (a.k.a., social welfare). Following the recent framework of Caragiannis and Rathi (2023), we consider a model in which the agent-item values are not available upfront but become known by querying agent sequences. In particular, when the agents are asked to act in a sequence, they respond by picking their favorite item that has not been picked by agents who acted before and reveal their value for it. Can we compute an agent sequence that induces a social welfare-optimal matching? We answer this question affirmatively and present an algorithm that uses polynomial number (specifically, O(n^5) of queries). This solves the main open problem stated by Caragiannis and Rathi (2023). Our analysis uses a potential function argument that measures progress towards learning the underlying edge-weight information. Furthermore, the algorithm has a truthful implementation by adapting the paradigm of VCG payments.

Abstract: Manmade and natural disruptions such as planned constructions on roads, suspensions of bridges, and blocked roads by trees/mudslides/floods can often create obstacles that separate two connected regions. As a result, the traveling and reachability of agents from their respective regions to other regions can be affected. To minimize the impact of the obstacles and maintain agent accessibility, we initiate the problem of constructing a new pathway (e.g., a detour or new bridge) connecting the regions disconnected by obstacles from the mechanism design perspective. In the problem, each agent in their region has a private location and is required to access the other region. The cost of an agent is the distance from their location to the other region via the pathway. Our goal is to design strategyproof mechanisms that elicit truthful locations from the agents and approximately optimize the social or maximum cost of agents by determining locations in the regions for building a pathway. We provide a characterization of all strategyproof and anonymous mechanisms. For the social and maximum costs, we provide upper and lower bounds on the approximation ratios of strategyproof mechanisms.

Abstract: Knockout tournaments, also known as singleelimination or cup tournaments, are a popular form of sports competitions. In the standard probabilistic setting, for each pairing of players, one of the players wins the game with a certain (a priory known) probability. Due to their competitive nature, tournaments are prone to manipulation. We investigate the computational problem of determining whether, for a given tournament, a coalition has a manipulation strategy that increases the winning probability of a designated player above a given threshold. More precisely, in every round of the tournament, coalition players can strategically decide which games to throw based on the advancement of other players to the current round. We call this setting adaptive constructive coalition manipulation. To the best of our knowledge, while coalition manipulation has been studied in the literature, this is the first work to introduce adaptiveness to this context. We show that the above problem is hard for every complexity class in the polynomial hierarchy. On the algorithmic side, we show that the problem is solvable in polynomial time when the coalition size is a constant. Furthermore, we show that the problem is fixed-parameter tractable when parameterized by the coalition size and the size of a minimum player set that must include at least one player from each non-deterministic game. Lastly, we investigate a generalized setting where the tournament tree can be imbalanced.

Abstract: Social media platforms are responsible for collecting and disseminating vast quantities of content. Recently, however, they have also begun enlisting users in helping annotate this content for example, to provide context or label disinformation. However, users may act strategically, sometimes reflecting biases (e.g. political) about the "right" label. How can social media platforms design their systems to use human time most efficiently? Historically, competition over multiple items has been explored in the Colonel Blotto game setting. However, they were originally designed to model two centrally-controlled armies competing over zero-sum "items", a specific scenario with limited modern-day application. In this work, we propose and study Private Blotto game, a variant with the key difference that individual agents act independently, without being coordinated by a central "Colonel". We completely characterize the Nash stability of this game and how this impacts the amount of "misallocated effort" of users on unimportant items. We show that the outcome function (aggregating multiple labels on a single item) has a critical impact, and specifically contrast a majority rule outcome (the median) as compared to a smoother outcome function (mean). In general, for median outcomes we show that instances without stable arrangements only occur for relatively few numbers of agents, but stable arrangements may have very high levels of misallocated effort. For mean outcome functions, we show that unstable arrangements can occur even for arbitrarily large numbers of agents, but when stable arrangements exist, they always have low misallocated effort. We conclude by discussing implications our results have for motivating examples in social media platforms and political competition.

Abstract: The correlation of values commonly exists in auctions, which can be further exploited to improve revenue. However, the complex correlation structure makes it hard to manually design the optimal auction mechanism. Datadriven auction mechanisms, powered by machine learning, enable to design auctions directly from historical auction data, without relying on specific value distributions. In this work, we synthesize the learning-based auction and the characteristics of strategy-proofness in the correlated value setting, and propose a new auction mechanism, namely Conditional Auction Net (CAN). The CAN can encode the correlation of values into the rank score of each bidder, and further adjust the allocation rule to approach the optimal revenue. The property of strategy-proofness is guaranteed by encoding the game theoretical condition into the neural network structure. Furthermore, all operations in the designed auctions are differentiable to enable an end-to-end training paradigm. We also present CAN can provide a large solution space to adequately encode the correlation of values. Experimental results demonstrate that the proposed auction mechanism can represent almost any strategy-proof auction mechanism, and outperforms the auction mechanisms wildly used in the correlated value settings.

Abstract: In perpetual voting, multiple decisions are made at different moments in time. Taking the history of previous decisions into account allows us to satisfy properties such as proportionality over periods of time. In this paper, we consider the following question: is there a perpetual approval voting method that guarantees that no voter is dissatisfied too many times? We identify a sufficient condition on voter behavior which we call 'bounded conflicts' condition---under which a sublinear growth of dissatisfaction is possible. We provide a tight upper bound on the growth of dissatisfaction under bounded conflicts, using techniques from Kolmogorov complexity. We also observe that the approval voting with binary choices mimics the machine learning setting of prediction with expert advice. This allows us to present a voting method with sublinear guarantees on dissatisfaction under bounded conflicts, based on the standard techniques from prediction with expert advice.

Abstract: We formulate the problem of fair and efficient completion of indivisible goods, defined as follows: Given a partial allocation of indivisible goods among agents, does there exist an allocation of the remaining goods (i.e., a completion) that satisfies fairness and economic efficiency guarantees of interest? We study the computational complexity of the completion problem for prominent fairness and efficiency notions such as envyfreeness up to one good (EF1), proportionality up to one good (Prop1), maximin share (MMS), and Pareto optimality (PO), and focus on the class of additive valuations as well as its subclasses such as binary additive and lexicographic valuations. We find that while the completion problem is significantly harder than the standard fair division problem (wherein the initial partial allocation is empty), the consideration of restricted preferences facilitates positive algorithmic results for threshold-based fairness notions (Prop1 and MMS). On the other hand, the completion problem remains computationally intractable for envy-based notions such as EF1 and EF1+PO even under restricted preferences.

Abstract: We study a Bayesian persuasion problem with externalities. In this model, a principal sends signals to inform multiple agents about the state of the world. Simultaneously, due to the existence of externalities in the agents' utilities, the principal also acts as a correlation device to correlate the agents' actions. We consider the setting where the agents are categorized into a small number of types. Agents of the same type share identical utility functions and are treated equitably in the utility functions of both other agents and the principal. We study the problem of computing optimal signaling strategies for the principal, under three different types of signaling channels: public, private, and semiprivate. Our results include revelation-principle-style characterizations of optimal signaling strategies, linear programming formulations, and analysis of in/tractability of the optimization problems. It is demonstrated that when the maximum number of deviating agents is bounded by a constant, our LP-based formulations compute optimal signaling strategies in polynomial time. Otherwise, the problems are NP-hard.

Abstract: Issue salience is a major determinant in voters' decisions. Candidates and political parties campaign to shift salience to their advantage a process termed priming. We study the dynamics, strategies and equilibria of campaign spending for voter priming in multi-issue multi-party settings. We consider both parliamentary elections, where parties aim to maximize their share of votes, and various settings for presidential elections, where the winner takes all. For parliamentary elections, we show that pure equilibrium spending always exists and can be computed in time linear in the number of voters. For two parties and all settings, a spending equilibrium exists such that each party invests only in a single issue, and an equilibrium can be computed in time that is polynomial in the number of issues and linear in the number of voters. We also show that in most presidential settings no equilibrium exists. Additional properties of optimal campaign strategies are also studied.

Abstract: Decoding visual information from human brain activity has seen remarkable advancements in recent research. However, the diversity in cortical parcellation and fMRI patterns across individuals has prompted the development of deep learning models tailored to each subject. The personalization limits the broader applicability of brain visual decoding in realworld scenarios. To address this issue, we introduce Wills Aligner, a novel approach designed to achieve multi-subject collaborative brain visual decoding. Wills Aligner begins by aligning the fMRI data from different subjects at the anatomical level. It then employs delicate mixture-of-brain-expert adapters and a meta-learning strategy to account for individual fMRI pattern differences. Additionally, Wills Aligner leverages the semantic relation of visual stimuli to guide the learning of inter-subject commonality, enabling visual decoding for each subject to draw insights from other subjects' data. We rigorously evaluate our Wills Aligner across various visual decoding tasks, including classification, cross-modal retrieval, and image reconstruction. The experimental results demonstrate that Wills Aligner achieves promising performance.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University, Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University

Abstract: Multiphysics simulation aims to predict and understand interactions between multiple physical phenomena, aiding in comprehending natural processes and guiding engineering design. The system of Partial Differential Equations (PDEs) is crucial for representing these physical fields, and solving these PDEs is fundamental to such simulations. However, current methods primarily yield numerical outputs, limiting interpretability and generalizability. We introduce TNNGP, a hybrid genetic programming algorithm that integrates traditional numerical methods with deep learning to derive approximate symbolic expressions for multiple unknown functions within a system of PDEs. T-NNGP initially obtains numerical solutions using traditional methods, then generates candidate symbolic expressions via deep reinforcement learning, and finally optimizes these expressions using genetic programming. Furthermore, a universal decoupling strategy guides the search direction and addresses coupling problems, thereby accelerating the search process. Experimental results on three types of PDEs demonstrate that our method can reliably obtain human-understandable symbolic expressions that fit both the PDEs and the numerical solutions from traditional methods. This work advances multiphysics simulation by enhancing our ability to derive approximate symbolic solutions for PDEs, thereby improving our understanding of complex physical phenomena.

Abstract: Deep learning models have recently shown great success in classifying epileptic patients using EEG recordings. Unfortunately, classificationbased methods lack a sound mechanism to detect the onset of seizure events. In this work, we propose a two-stage framework, SODor, that explicitly models seizure onset through a novel task formulation of subsequence clustering. Given an EEG sequence, the framework first learns a set of second-level embeddings with label supervision. It then employs model-based clustering to explicitly capture long-term temporal dependencies in EEG sequences and identify meaningful subsequences. Epochs within a subsequence share a common cluster assignment (normal or seizure), with cluster or state transitions representing successful onset detections. Extensive experiments on three datasets demonstrate that our method can correct misclassifications, achieving 5%-11% classification improvements over other baselines and accurately detecting seizure onsets.

Abstract: Human multimodal language understanding (MLU) is an indispensable component of expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviours. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem due to data distribution discrepancies among subjects. Concretely, MLU models are easily misled by distinct subjects with different expression customs and characteristics in the training data to learn subjectspecific spurious correlations, limiting performance and generalizability across new subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MLU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module to disentangle the impact of subjects acting as unobserved confounders and achieve model training via true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module.

Abstract: The ability to autonomously explore and resolve tasks with minimal human guidance is crucial for the selfdevelopment of embodied intelligence. Although reinforcement learning methods can largely ease human effort, it's challenging to design reward functions for real-world tasks, especially for high-dimensional robotic control, due to complex relationships among joints and tasks. Recent advancements large language models (LLMs) enable automatic reward function design. However, approaches evaluate reward functions by re-training policies from scratch placing an undue burden on the reward function, expecting it to be effective throughout the whole policy improvement process. We argue for a more practical strategy in robotic autonomy, focusing on refining existing policies with policy-dependent reward functions rather than a universal one. To this end, we propose a novel reward-policy co-evolution framework where the reward function and the learned policy benefit from each other's progressive on-the-fly improvements, resulting in more efficient and higher-performing skill acquisition. Specifically, the reward evolution process translates the robot's previous best reward function, descriptions of tasks and environment into text inputs. These inputs are used to query LLMs to generate a dynamic amount of reward function candidates, ensuring continuous improvement at each round of evolution. For policy evolution, our method generates new policy populations by hybridizing historically optimal and random policies. Through an improved Bayesian optimization, our approach efficiently and robustly identifies the most capable and plastic reward-policy combination, which then proceeds to the next round of co-evolution. Despite using less data, our approach demonstrates an average normalized improvement of 95.3\% across various high-dimensional robotic skill learning tasks.

Abstract: Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To eliminate this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collisionfree navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect 20k navigation episodes across 117 scenes in the iGibson simulator to support the training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge.

School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China, School of Software, Beijing Jiaotong University, Beijing, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China

Abstract: Reasoning future unknowable facts on temporal knowledge graphs (TKGs) is a challenging task, holding significant academic and practical values for various fields. Existing studies exploring explainable reasoning concentrate on modeling comprehensible temporal paths relevant to the query. Yet, these pathbased methods primarily focus on local temporal paths appearing in recent times, failing to capture the complex temporal paths in TKG and resulting in the loss of longer historical relations related to the query. Motivated by the Dual Process Theory in cognitive science, we propose a Cognitive Temporal Knowledge Extrapolation framework (CognTKE), which introduces a novel temporal cognitive relation directed graph (TCR-Digraph) and performs interpretable global shallow reasoning and local deep reasoning over the TCR-Digraph. Specifically, the proposed TCR-Digraph is constituted by retrieving significant local and global historical temporal relation paths associated with the query. In addition, CognTKE presents the global shallow reasoner and the local deep reasoner to perform global one-hop temporal relation reasoning (System 1) and local complex multi-hop path reasoning (System 2) over the TCR-Digraph, respectively. The experimental results on four benchmark datasets demonstrate that CognTKE achieves significant improvement in accuracy compared to the state-of-the-art baselines and delivers excellent zero-shot reasoning ability.

Abstract: Inconsistency is a common problem in knowledge, and so there is a need to analyse it. Inconsistency measures assess its severity, but there is a more basic question: "where is the inconsistency?". Typically, not all subsets of a knowledgebase are causing the inconsistency, and minimal inconsistent sets have been the standard way to localise the germane ones, even though there are shortcomings in some scenarios. Recently, ⋆conflicts were proposed as a more suitable definition to localise inconsistency when considering a method to repair it. But in general there is no way to tell what is a sensible definition to capture the germane conflicts. This work provides a set of desirable properties to assess definitions for germane conflicts. Also, a new conflict definition, based on substitution, is presented and evaluated via the proposed properties, and the related computational complexity is analysed.

Abstract: We introduce a novel language for reasoning about agents' cognitive attitudes of both epistemic and motivational type. We interpret it by means of a computationally grounded semantics using belief bases. Our language includes five types of modal operators for implicit belief, complete attraction, complete repulsion, realistic attraction and realistic repulsion. We give an axiomatization and show that our operators are not mutually expressible and that they can be combined to represent a large variety of psychological concepts including ambivalence, indifference, being motivated, being demotivated and preference. We present a dynamic extension of the language that supports reasoning about the effects of belief change operations. Finally, we provide a succinct formulation of model checking for our languages and a PSPACE model checking algorithm relying on a reduction into TQBF. We present some experimental results for the implemented algorithm on computation time in a concrete example.

Abstract: Reasoning about the causes behind observations is crucial to the formalization of rationality. While extensive research has been conducted on root cause analysis, most studies have predominantly focused on deterministic settings. In this paper, we investigate causation in more realistic nondeterministic domains, where the agent does not have any control on and may not know the choices that are made by the environment. We build on recent preliminary work on actual causation in the nondeterministic situation calculus to formalize more sophisticated forms of reasoning about actual causes in such domains. We investigate the notions of “Certainly Causes” and “Possibly Causes” that enable the representation of actual cause for agent actions in these domains. We then show how regression in the situation calculus can be extended to reason about such notions of actual causes.

Abstract: We study the firstorder definability of progression for situation calculus action theories with a focus on the iterability of progression. Progression, the task of updating a knowledge base according to actions' effects so that proper information is retained, is notoriously challenging as it in general requires second-order logic. Exceptions where progression is first-order like local-effect actions and normal actions impose certain syntax constraints on action theories to eliminate second-order quantifiers in the progressed knowledge base. Unfortunately, the progressed result might not satisfy the constraints again, making it impossible to apply first-order progression iteratively. In this paper, we first lift the existing result on first-order progression for normal actions by allowing disjunctions in the knowledge base. As a result, we obtain an action theory whose type is called disjunctive normal, which is iteratively first-order progressable. Second, we propose a new class of action theories, called PANACK, that strictly subsumes the disjunctive normal ones, and we show that it remains iteratively first-order progressable as well.

Abstract: Temporal pattern matching tasks require the detection of situations of interest based on streams of symbolic events. The RunTime Event Calculus (RTEC) is a formal framework that represents situations of interest as time-varying properties called 'fluents'. Temporal patterns often express 'Boolean combinations' of situations; RTEC features two types of fluents that may model such patterns: 'simple' and 'statically determined'. A simple fluent representation, however, is exponentially larger and more expensive to reason with than the corresponding statically determined fluent one. We formally identify the class of simple fluent definitions that can be translated into statically determined fluent definitions. Moreover, we present a compiler for the translation, and a reproducible empirical evaluation on real applications.

Abstract: There has been considerable work on reasoning about the strategic ability of agents under imperfect information. However, existing logics such as Probabilistic Strategy Logic are unable to express properties relating to information transparency. Information transparency concerns the extent to which agents' behaviours and actions are observable by other agents. Reasoning about information transparency is useful in many domains including security, privacy, and decisionmaking. In this paper, we present a formal framework for reasoning about information transparency properties in stochastic multi-agent systems. We extend Probabilistic Strategy Logic with new observability operators that capture the degree of observability of temporal properties by agents. We show that the model checking problem for the resulting logic is decidable.

Abstract: In this paper, we introduce a syntactic framework for analyzing and handling inconsistencies in propositional bases. Our approach focuses on examining the relationships between variable occurrences within conflicts. We propose two dual concepts: Minimal Inconsistency Relation (MIR) and Maximal Consistency Relation (MCR). Each MIR is a minimal equivalence relation on variable occurrences that results in inconsistency, while each MCR is a maximal equivalence relation designed to prevent inconsistency. Notably, MIRs capture conflicts overlooked by minimal inconsistent subsets. Using MCRs, we develop a series of nonexplosive inference relations. The main strategy involves restoring consistency by modifying the propositional base according to each MCR, followed by employing the classical inference relation to derive conclusions. Additionally, we propose an unusual semantics that assigns truth values to variable occurrences instead of the variables themselves. The associated inference relations are established through Boolean interpretations compatible with the occurrence-based models.

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China, Department of Computer Science, University of Oxford, UK School of Electronic Engineering and Computer Science, Queen Mary University of London, UK, Department of Computer Science, University of Oxford, UK, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China

Abstract: DatalogMTL is a powerful rulebased language for temporal reasoning. Due to its high expressive power and flexible modeling capabilities, it is suitable for a wide range of applications, including tasks from industrial and financial sectors. However, due its high computational complexity, practical reasoning in DatalogMTL is highly challenging. To address this difficulty, we introduce a new reasoning method for DatalogMTL which exploits the magic sets technique—a rewriting approach developed for (non-temporal) Datalog to simulate top-down evaluation with bottom-up reasoning. We have implemented this approach and evaluated it on publicly available benchmarks, showing that the proposed approach significantly and consistently outperformed state-of-the-art reasoning techniques.

College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China, College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China, School of Computer, National University of Defense Technology, Changsha, China, College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China, School of Computer, National University of Defense Technology, Changsha, China, School of Computer, National University of Defense Technology, Changsha, China, College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China

Abstract: Textbased knowledge graph completion methods take advantage of pre-trained language models (PLM) to enhance intrinsic semantic connections of raw triplets with detailed text descriptions. Typical methods in this branch map an input query (textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, respectively, and then maximize the probability of valid triples. These methods are gaining promising performance and increasing attention for the rapid development of large language models. According to the property of the language models, the more related and specific context information the input query provides, the more discriminative the resultant embedding will be. In this paper, through observation and validation, we find a neglected fact that the relation-aware neighbors of the head entities in queries could act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, in our method, to provide a reference of what might the target entity be like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching. The results of our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that by integrating our relation-aware anchor enhancement strategy, the performance of current leading methods can be notably enhanced without substantial modifications.

Abstract: There has been a longstanding dispute over which formalism is the best for representing knowledge in AI. The wellknown “declarative vs. procedural controversy” is concerned with the choice of utilizing declarations or procedures as the primary mode of knowledge representation. The ongoing debate between symbolic AI and connectionist AI also revolves around the question of whether knowledge should be represented implicitly (e.g., as parametric knowledge in deep learning and large language models) or explicitly (e.g., as logical theories in traditional knowledge representation and reasoning). To address these issues, we propose a general framework to capture various knowledge representation formalisms in which we are interested. Within the framework, we find a family of universal knowledge representation formalisms, and prove that all universal formalisms are recursively isomorphic. Moreover, we show that all pairwise intertranslatable formalisms that admit the padding property are also recursively isomorphic. These imply that, up to an offline compilation, all universal (or natural and equally expressive) representation formalisms are in fact the same, which thus provides a partial answer to the aforementioned dispute.

Abstract: Physical commonsense is an essential aspect of human cognition, involving an intuitive understanding of the physical properties and interactions of everyday objects and materials. Though physical commonsense reasoning should inherently be a multisensory task, integrating both video and audio signals, existing physical audiovisual commonsense reasoning (PACR) models predominantly rely on visual information. This reliance leads to spurious correlations and undermines the models’ reasoning and generalization abilities. To counteract this, we introduce a modelagnostic Counterfactual Physical Audiovisual Commonsense Reasoning (CF-PACR) framework aimed at mitigating visual bias-induced spurious effects. Specifically, we construct a traditional PACR model using both audio and visual information as the factual reasoning model. Subsequently, in the counterfactual reasoning model, we isolate visual information to estimate direct effects. Finally, we subtract the direct effects from the total effects across modalities to derive indirect effects, thereby mitigating visual biases. Extensive experiments validate the effectiveness and generalizability of CF-PACR in alleviating the spurious correlations between visual modality and model predictions.

Abstract: Explainable AI has received significant attention in recent years. Machine learning models often operate as black boxes, lacking explainability and transparency while supporting decisionmaking processes. Local post-hoc explainability queries attempt to answer why individual inputs are classified in a certain way by a given model. While there has been important work on counterfactual explanations, less attention has been devoted to semifactual ones. In this paper, we focus on local post-hoc explainability queries within the semifactual `even-if' thinking and their computational complexity among different classes of models, and show that both linear and tree-based models are strictly more interpretable than neural networks. After this, we introduce a preference-based framework enabling users to personalize explanations based on their preferences, both in the case of semifactuals and counterfactuals, enhancing interpretability and user-centricity. Finally, we explore the complexity of several interpretability problems in the proposed preference-based framework and provide algorithms for polynomial cases.

Abstract: We consider the contextual combinatorial bandit setting where in each round, the learning agent, e.g., a recommender system, selects a subset of "arms,'' e.g., products, and observes rewards for both the individual base arms, which are a function of known features (called "context''), and the super arm (the subset of arms), which is a function of the base arm rewards. The agent's goal is to simultaneously learn the unknown reward functions and choose the highestreward arms. For example, the "reward'' may represent a user's probability of clicking on one of the recommended products. Conventional bandit models, however, employ restrictive reward function models in order to obtain performance guarantees. We make use of deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves sublinear regret in the number of rounds. Experiments on real world recommendation datasets show that NeUClust achieves better regret and reward than other contextual combinatorial and neural bandit algorithms.

Abstract: We present a significant advancement in the field of Langevin Monte Carlo (LMC) methods by introducing the Inexact Proximal Langevin Algorithm (IPLA). This novel algorithm broadens the scope of problems that LMC can effectively address while maintaining controlled computational costs. IPLA extends LMC's applicability to potentials that are convex, strongly convex in the tails, and exhibit polynomial growth, beyond the conventional Lsmoothness assumption. Moreover, we extend LMC's applicability to super-quadratic potentials and offer improved convergence rates over existing algorithms. Additionally, we provide bounds on all moments of the Markov chain generated by IPLA, enhancing its analytical robustness.

Abstract: As AIbased decision-makers increasingly influence human lives, it is a growing concern that their decisions may be unfair or biased with respect to people's protected attributes, such as gender and race. Most existing bias prevention measures provide probabilistic fairness guarantees in the long run, and it is possible that the decisions are biased on any decision sequence of fixed length. We introduce *fairness shielding*, where a symbolic decision-maker---the fairness shield---continuously monitors the sequence of decisions of another deployed black-box decision-maker, and makes interventions so that a given fairness criterion is met while the total intervention costs are minimized. We present four different algorithms for computing fairness shields, among which one guarantees fairness over fixed horizons, and three guarantee fairness periodically after fixed intervals. Given a distribution over future decisions and their intervention costs, our algorithms solve different instances of bounded-horizon optimal control problems with different levels of computational costs and optimality guarantees. Our empirical evaluation demonstrates the effectiveness of these shields in ensuring fairness while maintaining cost efficiency across various scenarios.

Abstract: Training a generalpurpose time series foundation models with robust generalization capabilities across diverse applications from scratch is still an open challenge. Efforts are primarily focused on fusing cross-domain time series datasets to extract shared subsequences as tokens for training models on Transformer architecture. However, due to significant statistical heterogeneity across domains, this cross-domain fusing approach doesn't work effectively as the same as fusing texts and images. To tackle this challenge, this paper proposes a novel federated learning approach to address the heterogeneity in time series foundation models training, namely FFTS. Specifically, each data-holding organization is treated as an independent client in a collaborative learning framework with federated settings, and then many client-specific local models will be trained to preserve the unique characteristics per dataset. Moreover, a new regularization mechanism will be applied to both client-side and server-side, thus to align the shared knowledge across heterogeneous datasets from different domains. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed federated learning approach. The newly learned time series foundation models achieve superior generalization capabilities on cross-domain time series analysis tasks, including forecasting, imputation, and anomaly detection.

Abstract: The strategy of selecting ``most informative'' hard samples in active learning has proven a boon for alleviating the challenges of fewshot learning and costly data annotation in deep learning. However, this very preference towards hard samples engenders bias issues, thereby impeding the full potential of active learning. It has witnessed an increasing trend to mitigate this stubborn problem, yet most neglect the quantification of bias itself and the direct rectification of dynamically evolving biases. Revisiting the bias issue, this paper presents an active learning approach based on the Variational Gradient Rectifier (VaGeRy). First, we employ variational methods to quantify bias at the level of latent state representations. Then, harnessing historical training dynamics, we introduce Uncertainty Consistency Regularization and Fluctuation Restriction, which asynchronously iterate to rectify gradient backpropagation. Extensive experiments demonstrate that our proposed methodology effectively counteracts bias phenomena in a majority of active learning scenarios

Abstract: Deep neural networks (DNNs) have substantially achieved high predictive accuracy in many vision tasks. However, we find that they are poorly calibrated for crack recognition tasks, as these DNNs tend to produce both underconfident and over-confident predictions in such safety-critical applications, thereby limiting their practical use in real-world scenarios. To address this issue, we propose a novel attack-inspired calibration loss (AICL) that explicitly regularizes class probabilities to be better confidence estimation. Specifically, we first propose the attack-inspired correctness estimation method (ACE) that aims to estimate the correctness degree of each sample via adversarial attacks. Then, we propose Correctness-aware Distribution Guidance, which starts from a distribution perspective that enforces the ordinal ranking of the predicted confidence referring to the estimated correctness degree. The proposed method can be conveniently implemented on top of any DNNs-based crack recognition model by serving as a plug-and-play loss function. To address the limited availability of related benchmarks, we collect a fully annotated dataset, namely, Bridge2024, which involves inconsistent cracks and noisy backgrounds in real-world bridges. Our AICL outperforms the state-of-art calibration methods on various benchmark datasets including CRACK2019, SDNET2018, and our BRIDGE2024.

Abstract: Variational autoencoder performs well in community detection on static networks, but it is difficult to directly extend to continuous dynamic networks. The main reason is that traditional methods mainly rely on adjacency structures to complete the inference and generation processes. However, continuous dynamic networks cannot be described by this structure because the inherent timeliness and causality information of the network would be lost. To address this issue, we propose a novel variational autoencoder, CTVAE, for community detection in continuous dynamic networks, along with its scalable variant, CT-CAVAE. By conceptualizing node interactions as event streams and adopting the Hawkes process to capture temporal dynamics and causality, and incorporating them into the inference process, CT-VAE can effectively extend the traditional inference approach to continuous dynamic networks. Additionally, in the generation phase, CT-VAE combines pseudo-labeling and compact constraint strategies to facilitate the reconstruction process of non-adjacent structures. For the scalable variant, CT-CAVAE, end-to-end community detection is achieved by cleverly combining Gaussian mixture distribution. Extensive experimental results demonstrate that the proposed CT-VAE and CT-CAVAE achieve more favorable performance compared with the state-of-the-art baselines.

Abstract: Dimension reduction techniques typically seek an embedding of a highdimensional point cloud into a low-dimensional Euclidean space which optimally preserves the geometry of the input data. Based on expert knowledge, one may instead wish to embed the data into some other manifold or metric space in order to better reflect the geometry or topology of the point cloud. We propose a general method for manifold-valued multidimensional scaling based on concepts from optimal transport. In particular, we establish theoretical connections between the recently introduced semi-relaxed Gromov-Wasserstein (srGW) framework and multidimensional scaling by solving the Monge problem in this setting. We also derive novel connections between srGW distance and Gromov-Hausdorff distance. We apply our computational framework to analyze ensembles of political redistricting plans for states with two Congressional districts, achieving an effective visualization of the ensemble as a distribution on a circle which can be used to characterize typical neutral plans, and to flag outliers.

Abstract: In many realworld applications, data is inherently decentralized, necessitating data analysis methods that prioritize privacy while delivering interpretable results. Federated Non-Negative Matrix Factorization (FedNMF) meets this requirement by factorizing latent components from distributed data that cannot be freely shared among clients. A significant challenge in FedNMF arises when clients converge on different solutions due to prolonged independent optimization, leading to drift and incoherent models. While Federated Learning (FL) typically mitigates drift through frequent synchronizations and strong regularization, it often overlooks critical properties of Non-Negative Matrix Factorization, such as permutation invariance. As a result, solutions from FedNMF clients may be misidentified by FL drift as distinct, despite being equivalent. Using an alignment-aware drift, we create coherence through proximal optimization and barycenter aggregation for FedNMF. We analyze the computational complexity of our approach, provide efficient heuristics, and ensure the convergence of our algorithms. On a diverse set of real-world and synthetic datasets, we demonstrate the effectiveness of our methods.

Abstract: Identifying informative components in binary data is an essential task in many application areas, including life sciences, social sciences, and recommendation systems. Boolean matrix factorization (BMF) is a family of methods that performs this task by factorizing the data into dense factor matrices. In realworld settings, the data is often distributed across stakeholders and required to stay private, prohibiting the straightforward application of BMF. To adapt BMF to this context, we approach the problem from a federated-learning perspective, building on a state-of-the-art continuous binary matrix factorization relaxation to BMF that enables efficient gradient-based optimization. Our approach only needs to share the relaxed component matrices, which are aggregated centrally using a proximal operator that regularizes for binary outcomes. We show the convergence of our federated proximal gradient descent algorithm and provide differential privacy guarantees. Our extensive empirical evaluation shows that our algorithm outperforms, in quality and efficacy, federation schemes of state-of-the-art BMF methods on a diverse set of real-world and synthetic data.

Abstract: Unsupervised Skill Discovery aims at learning diverse skills without any extrinsic rewards and leverage them as prior for learning a variety of downstream tasks. Existing approaches to unsupervised reinforcement learning typically involve discovering skills through empowermentdriven techniques or by maximizing entropy to encourage exploration. However, this mutual information objective often results in either static skills that discourage exploration or maximise coverage at the expense of non-discriminable skills. Instead of focusing only on maximizing bounds on f-divergence, we combine it with Integral Probability Metrics to maximize the distance between distributions to promote behavioural diversity and enforce disentanglement. Our method, Hilbert Unsupervised Skill Discovery (HUSD), provides an additional objective that seeks to obtain exploration and separability of state-skill pairs by maximizing the Maximum Mean Discrepancy between the joint distribution of skills and states and the product of their marginals in Reproducing Kernel Hilbert Space. Our results on Unsupervised RL Benchmark show that HUSD outperforms previous exploration algorithms on state-based tasks.

Abstract: This paper proposes a novel kmedoids approximation algorithm to handle large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on the estimation of the k-medoids objective. A single batch of size m << n provides the estimation, which reduces the required memory size and the number of pairwise dissimilarities computations to O(mn), instead of O(n^2) compared to most k-medoids baselines. We obtain theoretical results highlighting that a batch of size m = O(log(n)) is sufficient to guarantee, with strong probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm provides similar performances as state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.

Abstract: Federated Learning (FL) enables multiple clients, such as mobile phones and IoT devices, to collaboratively train a global machine learning model while keeping their data localized. However, recent studies have revealed that the training phase of FL is vulnerable to reconstruction attacks, such as attribute inference attacks (AIA), where adversaries exploit exchanged messages and auxiliary public information to uncover sensitive attributes of targeted clients. While these attacks have been extensively studied in the context of classification tasks, their impact on regression tasks remains largely unexplored. In this paper, we address this gap by proposing novel modelbased AIAs specifically designed for regression tasks in FL environments. Our approach considers scenarios where adversaries can either eavesdrop on exchanged messages or directly interfere with the training process. We benchmark our proposed attacks against state-of-the-art methods using real-world datasets. The results demonstrate a significant increase in reconstruction accuracy, particularly in heterogeneous client datasets, a common scenario in FL. The efficacy of our model-based AIAs makes them better candidates for empirically quantifying privacy leakage for federated regression tasks.

Abstract: Recent advancements in Zeroshot Neural Architecture Search (NAS) highlight the ability of zero-cost proxies in identifying superior architecture. However, we identify a critical issue with current zero-cost proxies: they aggregate node-wise zero-cost statistics without considering that not all nodes in a neural network equally impact performance estimation. Our observations reveal that node-wise zero-cost statistics significantly vary in their contributions to performance, with each node exhibiting a degree of uncertainty. Based on this insight, we introduce a novel method called Parametric Zero-Cost Proxies (ParZC) framework to enhance the adaptability of zero-cost proxies through parameterization. To address the node indiscrimination, we propose a Mixer Architecture with Bayesian Network (MABN) to explore the node-wise zero-cost statistics and estimate node-specific uncertainty. Moreover, we propose DiffKendall as a loss function to improve ranking consistency. Comprehensive experiments on NAS-Bench-101, 201, and NDS demonstrate the superiority of our proposed ParZC compared to existing zero-shot NAS methods. Additionally, we demonstrate the versatility and adaptability of ParZC on Vision Transformer search space.

Abstract: Deployment of Large Language Models (LLMs) has major computational costs, due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Posttraining Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. Then, each layer is calibrated using its layer-wise Hessian to update the weights towards minimizing the quantization error. The Hessian is also used for detecting the most salient weights to quantization. Such PTQ approaches are prone to accuracy drop in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms the state-of-the-art baselines such as SpQR and BiLLM, especially, at extreme low-precision (2-bit and binary) quantization.

Abstract: Causal discovery is essential across various scientific fields to uncover causal structures within data. Traditional methods relying on observational data have limitations due to confounding variables. This paper presents an optimizationbased approach using integer programming (IP) to design minimal intervention sets that ensure causal structure identifiability. Our method provides exact and modular solutions, adaptable to different experimental settings and constraints. We demonstrate its effectiveness through comparative analysis across different settings demonstrating its applicability and robustness.

Institute of Information Engineering, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences

Abstract: The combination of selfsupervised learning and adversarial training (AT) can significantly improve the adversarial robustness of self-supervised models. However, the robustness of self-supervised adversarial training (self-AT) still lags behind that of state-of-the-art (SOTA) supervised AT (sup-AT), even though the performance of current self-supervised learning models has already matched or even surpassed that of SOTA supervised learning models. This issue raises concerns about the secure application of self-supervised learning models. The inclusion of adversarial training turns self-AT into a challenging joint optimization problem, and recent studies have shown that the data augmentation methods necessary for constructing positive pairs in self-supervised learning negatively impact the robustness improvement in self-AT. Inspired by this, we propose 3SAT, a simple self-supervised adversarial training framework. 3SAT conducts adversarial training on original, unaugmented samples, reducing the difficulty of optimizing the adversarial training subproblem and fundamentally eliminating the negative impact of data augmentation on robustness improvement. Additionally, 3SAT introduces a dynamic training objective scheduling strategy to address the issue of model training collapse during the joint optimization process when using original samples directly. 3SAT is not only structurally simple and computationally efficient, reducing self-AT training time by half, but it also improves the SOTA self-AT robustness accuracy by 16.19\% and standard accuracy by 11.41\% under Auto-Attack on the CIFAR-10 dataset. Even more impressively, 3SAT surpasses the SOTA sup-AT method in robust accuracy by a significant margin of 11.25\%. This marks the first time that self-AT has outperformed SOTA sup-AT in robustness, indicating that self-AT is a superior method for improving model robustness.

Abstract: Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on cost constraints at each time step (derived from global cost constraints) and this can result in either overly conservative policies or violation of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. To be specific, we first partition the precollected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains high reward and safe trajectories, and undesirable set contains unsafe trajectories and low-reward safe trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of a min-max objective that is employed in existing methods. Theoretically, we also show our approach’s strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.

Abstract: The recent past has seen an increasing interest in Heterogeneous Graph Neural Networks (HGNNs), since many realworld graphs are heterogeneous in nature, from citation graphs to email graphs. However, existing methods ignore a tree hierarchy among metapaths, naturally constituted by different node types and relation types. In this paper, we present HetTree, a novel HGNN that models both the graph structure and heterogeneous aspects in a scalable and effective manner. Specifically, HetTree builds a semantic tree data structure to capture the hierarchy among metapaths. To effectively encode the semantic tree, HetTree uses a novel subtree attention mechanism to emphasize metapaths that are more helpful in encoding parent-child relationships. Moreover, HetTree proposes carefully matching pre-computed features and labels correspondingly, constituting a complete metapath representation. Our evaluation of HetTree on a variety of real-world datasets demonstrates that it outperforms all existing baselines on open benchmarks and efficiently scales to large real-world graphs with millions of nodes and edges.

University of Science and Technology of China NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences, Anhui University, NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, University of Science and Technology of China, NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Nanjing University NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences

Abstract: Label skews, characterized by disparities in local label distribution across clients, pose a significant challenge in federated learning. As minority classes suffer from worse accuracy due to overfitting on local imbalanced data, prior methods often incorporate classbalanced learning techniques during local training. Although these methods improve the mean accuracy across all classes, we observe that vacant classes—referring to categories absent from a client's data distribution—remain poorly recognized. Besides, there is still a gap in the accuracy of local models on minority classes compared to the global model. This paper introduces FedVLS, a novel approach to label-skewed federated learning that integrates both vacant-class distillation and logit suppression simultaneously. Specifically, vacant-class distillation leverages knowledge distillation during local training on each client to retain essential information related to vacant classes from the global model. Moreover, logit suppression directly penalizes network logits for non-label classes, effectively addressing misclassifications in minority classes that may be biased toward majority classes. Extensive experiments validate the efficacy of FedVLS, demonstrating superior performance compared to previous state-of-the-art (SOTA) methods across diverse datasets with varying degrees of label skews.

Abstract: Consistency regularization and pseudolabeling have significantly advanced semi-supervised learning (SSL). Prior works have effectively employed Mixup for consistency regularization in SSL. However, our findings indicate that applying Mixup for consistency regularization may degrade SSL performance by compromising the purity of artificial labels. Moreover, most pseudo-labeling based methods utilize thresholding strategy to exclude low-confidence data, aiming to mitigate confirmation bias; however, this approach limits the utility of unlabeled samples. To address these challenges, we propose RegMixMatch, a novel framework that optimizes the use of Mixup with both high- and low-confidence samples in SSL. First, we introduce semi-supervised RegMixup, which effectively addresses reduced artificial labels purity by using both mixed samples and clean samples for training. Second, we develop a class-aware Mixup technique that integrates information from the top-2 predicted classes into low-confidence samples and their artificial labels, reducing the confirmation bias associated with these samples and enhancing their effective utilization. Experimental results demonstrate that RegMixMatch achieves state-of-the-art performance across various SSL benchmarks.

Abstract: Domain adaptive object detection (DAOD) aims to generalize an object detector trained on labeled sourcedomain data to a target domain without annotations, the core principle of which is source-target feature alignment. Typically, existing approaches employ adversarial learning to align the distributions of the source and target domains as a whole, barely considering the varying significance of distinct regions, say instances under different circumstances and foreground vs background areas, during feature alignment. To overcome the shortcoming, we investigate a differential feature alignment strategy. Specifically, a prediction-discrepancy feedback instance alignment module (dubbed PDFA) is designed to adaptively assign higher weights to instances of higher teacher-student detection discrepancy, effectively handling heavier domain-specific information. Additionally, an uncertainty-based foreground-oriented image alignment module (UFOA) is proposed to explicitly guide the model to focus more on regions of interest. Extensive experiments on widely-used DAOD datasets together with ablation studies are conducted to demonstrate the efficacy of our proposed method and reveal its superiority over other SOTA alternatives.

Abstract: Semisupervised learning (SSL) is a fundamental task in machine learning, empowering models to extract valuable insights from datasets with limited labeled samples and a large amount of unlabeled data. Although pseudo-labeling is a widely used approach for SSL that generates pseudo-labels for unlabeled data and leverages them as ground truth labels for training, traditional pseudo-labeling techniques often face challenges that significantly decrease the quality of pseudo-labels and hence the overall model performance. In this paper, we propose a novel Bi-level Optimization method for Pseudo-label Learning (BOPL) to boost semi-supervised training. It treats pseudo-labels as latent variables, and optimizes the model parameters and pseudo-labels jointly within a bi-level optimization framework. By enabling direct optimization over the pseudo-labels towards maximizing the prediction model performance, the method is expected to produce high-quality pseudo-labels. To evaluate the effectiveness of the proposed approach, we conduct extensive experiments on multiple SSL benchmarks. The experimental results show the proposed BOPL outperforms the state-of-the-art SSL techniques.

Abstract: In this paper, we present generalization bounds for the unsupervised risk in the Deep Contrastive Representation Learning framework, which employs deep neural networks as representation functions. We approach this problem from two angles. On the one hand, we derive a parametercounting bound that scales with the overall size of the neural networks. On the other hand, we provide a norm-based bound that scales with the norms of neural networks' weight matrices. Ignoring logarithmic factors, the bounds are independent of the size of the tuples provided for contrastive learning. To the best of our knowledge, this property is only shared by one other work, which employed a different proof strategy and suffers from very strong exponential dependence on the depth of the network which is due to a use of the peeling technique. Our results circumvent this by leveraging powerful results on covering numbers with respect to uniform norms over samples. In addition, we utilize loss augmentation techniques to further reduce the dependency on matrix norms and the implicit dependence on network depth. In fact, our techniques allow us to produce many bounds for the contrastive learning setting with similar architectural dependencies as in the study of the sample complexity of ordinary loss functions, thereby bridging the gap between the learning theories of contrastive learning and DNNs.

Abstract: Token compression techniques, such as token merging and pruning, are essential for alleviating the substantial computational burden caused by the proliferation of tokens within attention mechanisms. However, current methods often rely on tokento-token distances or similarity metrics to evaluate token importance, which is inadequate in the context of modern promptable designs and frameworks that are gaining prominence. To address this limitation, we introduce a novel and effective merging strategy called “Multimodal Promptable Token Merging” (MPTM). The proposed method leverages a multimodal, prompt-centric methodology, assessing the proximity between tokens of each input modality and the multimodal prompt to efficiently eliminate redundant tokens while preserving those rich in information. Extensive experiments demonstrate that MPTM significantly reduces computational costs without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% in object detection tasks. Additionally, when applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility.

Abstract: With the emerging of huge amount of unlabeled data, unsupervised outof-distribution (OOD) detection is vital for ensuring the reliability of graph neural networks (GNNs) by identifying OOD samples from in-distribution (ID) ones during testing, where encountering novel or unknown data is inevitable. Existing methods often suffer from compromised performance due to redundant information in graph structures, which impairs their ability to effectively differentiate between ID and OOD data. To address this challenge, we propose SEGO, an unsupervised framework that integrates structural entropy into OOD detection regarding graph classification. Specifically, within the architecture of contrastive learning, SEGO introduces an anchor view in the form of coding tree by minimizing structural entropy. The obtained coding tree effectively removes redundant information from graphs while preserving essential structural information, enabling the capture of distinct graph patterns between ID and OOD samples. Furthermore, we present a multi-grained contrastive learning scheme at local, global, and tree levels using triplet views, where coding trees with essential information serve as the anchor view. Extensive experiments on real-world datasets validate the effectiveness of SEGO, demonstrating superior performance over state-of-the-art baselines in OOD detection. Specifically, our method achieves the best performance on 9 out of 10 dataset pairs, with an average improvement of 3.7% on OOD detection datasets, significantly surpassing the best competitor by 10.8% on the FreeSolv/ToxCast dataset pair.

Abstract: Diffusion models have achieved remarkable success in the image and video generation tasks. Nevertheless, they often require a large amount of memory and time overhead during inference, due to the complex network architecture and considerable number of timesteps for iterative diffusion. Recently, the posttraining quantization (PTQ) technique has proved a promising way to reduce the inference cost by quantizing the float-point operations to low-bit ones. However, most of them fail to tackle with the large variations in the distribution of activations across distinct channels and timesteps, as well as the inconsistent of input between quantization and inference on diffusion models, thus leaving much room for improvement. To address the above issues, we propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM). Specifically, we develop a timestep-channel joint reparameterization (TCR) module to balance the activation range along both the timesteps and channels, facilitating the successive reconstruction procedure. Subsequently, we employ a dynamically adaptive quantization (DAQ) module that mitigate the quantization error by selecting an optimal quantizer for each post-Softmax layers according to their specific types of distributions. Moreover, we present a progressively aligned reconstruction (PAR) strategy to mitigate the bias caused by the input mismatch. Extensive experiments on various benchmarks and distinct diffusion models demonstrate that the proposed method substantially outperforms the state-of-the-art approaches in most cases, especially yielding comparable FID metrics to the full precision model on CIFAR-10 in the W6A6 setting, while enabling generating available images in the W4A4 settings.

Abstract: Multiview classification (MVC) faces inherent challenges due to domain gaps and inconsistencies across different views, often resulting in uncertainties during the fusion process. While Evidential Deep Learning (EDL) has been effective in addressing view uncertainty, existing methods predominantly rely on the Dempster-Shafer combination rule, which is sensitive to conflicting evidence and often neglects the critical role of neighborhood structures within multi-view data. To address these limitations, we propose a Trusted Unified Feature-NEighborhood Dynamics (TUNED) model for robust MVC. This method effectively integrates local and global feature-neighborhood (F-N) structures for robust decision-making. Specifically, we begin by extracting local F-N structures within each view. To further mitigate potential uncertainties and conflicts in multi-view fusion, we employ a selective Markov random field that adaptively manages cross-view neighborhood dependencies. Additionally, we employ a shared parameterized evidence extractor that learns global consensus conditioned on local F-N structures, thereby enhancing the global integration of multi-view features. Experiments on benchmark datasets show that our method improves accuracy and robustness over existing approaches, particularly in scenarios with high uncertainty and conflicting views.

Abstract: We present an option discovery algorithm that accelerates planning by minimizing the shortest distance between any two states in the MDP. The proposed algorithm produces options that approximately minimize planning time in the multigoal setting: it is shown to be a worst case (4-alpha, 2)-approximation of the optimal option set, where alpha is the approximation ratio of the k-medians with penalties subroutine. We then present a variation, "Fast Average Options", with improved run-time and describe a general means of producing similar algorithms based on selection of a k-medians subroutine. We empirically evaluate our method on four discrete and two continuous control planning domains and show that it outperforms other leading option discovery algorithms.

Abstract: Posttraining quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily target at crafting quantizers to deal with peculiar activations characterized by ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To handle these, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low-rank compensation mechanism, wherein learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.

Abstract: Solving systems of ordinary differential equations (ODEs) is essential when it comes to understanding the behavior of dynamical systems. Yet, automated solving remains challenging, in particular for nonlinear systems. Computer algebra systems (CASs) provide support for solving ODEs by first simplifying them, in particular through the use of Lie point symmetries. Finding these symmetries is, however, itself a difficult problem for CASs. Recent works in symbolic regression have shown promising results for recovering symbolic expressions from data. Here, we adapt searchbased symbolic regression to the task of finding generators of Lie point symmetries. With this approach, we can find symmetries of ODEs that existing CASs cannot find.

Abstract: Sequence models have demonstrated remarkable success in behavioral planning by leveraging previously collected demonstrations. However, solving multitask missions remains a significant challenge, particularly when the planner must adapt to unseen constraints and tasks, such as discovering goals and unlocking doors. Such behavioral planning problems are challenging to solve due to: a) agents failing to adapt beyond the single task learned through their reward function, and b) inability to generalize to new environments, e.g., those with walls and locked doors, when trained only in planar environments. Consequently, state-of-the-art decision-making methods are limited to missions where the required tasks are well-represented in the training demonstrations and can be solved within a short (temporal) planning horizon. To address this, we propose GenPlan: a stochastic and adaptive planner that leverages discrete-flow models for generative sequence modeling, enabling sample-efficient exploration and exploitation. This framework relies on an iterative denoising procedure to generate a sequence of goals and actions. This approach captures multi-modal action distributions and facilitates goal and task discovery, thereby generalizing to out-of-distribution tasks and environments, i.e., missions not part of the training data. We demonstrate the effectiveness of our method through multiple simulation environments. Notably, GenPlan outperforms state-of-the-art methods by over 10% on adaptive planning tasks, where the agent adapts to multi-task missions while leveraging demonstrations from single-goal-reaching tasks.

Abstract: In privacypreserving distributed learning environments, data stored on local clients cannot be shared with other clients or servers. We consider a new active learning problem setup for these environments, where the server aims to build a centralized model by distributing labeling budgets across different clients. Our algorithm identifies which clients and their data points warrant annotation by estimating the global impact of the resulting labels. We evaluate this impact by embedding the clients into the manifold of learner parameters, formed by the task learner's predictions on unlabeled data, and diffusing the reduction in predictive uncertainties caused by labeling. The algorithm effectively selects clients with high estimated impact while achieving diversity in client selection, all without accessing local client data. In experiments, our approach demonstrates substantial advancements when compared to adaptations of existing active learning algorithms.

Abstract: Attribute classification is crucial for identifying specific characteristics within image regions. VisionLanguage Models (VLMs) have been effective in zero-shot tasks by leveraging their general knowledge from large-scale datasets. Recent studies demonstrate that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor utilization of the relationship between seen and unseen attributes makes the model lack generalizability. Additionally, attribute classification generally involves many attributes, making maintaining the model’s scalability difficult. To address these issues, we propose Super-class guided transFormer (SugaFormer), a novel framework that leverages super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, utilizing common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that utilize VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns model’s features with VLMs using super-class guided prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments demonstrate that SugaFormer achieves state-of-the-art performance across three widely-used attribute classification benchmarks under zero-shot, and cross-dataset transfer settings.

Abstract: In continuous domains, reinforcement learning policies are often based on Gaussian distributions for their generality. However, the unbounded support of Gaussian policy can cause a bias toward sampling boundary actions in many continuous control tasks that impose action limits due to physical constraints. This "boundary action bias'' can negatively impact training in algorithms like Proximal Policy Optimization. Despite this, it has been overlooked in many existing research and applications. In this paper, we revisit this issue by presenting illustrative explanations and analysis from the sampling point of view. Then, we introduce a truncated Gaussian policy with inherent bounds as a minimal alternative to mitigate the bias. However, we find that the plain truncated Gaussian policy may lay the counterbias, preferring interior actions: to balance the bias, we ultimately propose a scale-adjusted truncated Gaussian policy, where the distribution scale shrinks if the location is near the boundaries. This property makes boundary actions deterministic more than in plain truncated Gaussian, but still less than in original Gaussian. Extensive empirical studies and comparisons on various continuous control tasks demonstrate that the truncated Gaussian policies significantly reduce the rate of boundary action usage, while scale-adjusted ones successfully balance the bias and counter-bias. It generally outperforms the Gaussian policy and shows competitive results compared to other approaches designed to counteract the bias.

Abstract: To facilitate understanding of users' diverse queries against the backend databases in web applications, researchers have introduced Text-to-SQL (Text2SQL) models that can generate well-structured SQL queries from users' query texts in natural language. As the Text2SQL model decouples the user queries with the back-end databases, it inherently mitigates the SQL injection risk posed by inserting users' input into pre-written SQL queries. However, what security risks to web applications may be posed by Text2SQL models remains an open question. In this paper, we present a new attack framework, named Autoregression-based Injection Attacks (AIA), to evaluate the security risks of Text2SQL models. In particular, AIA makes target models generate attack payloads by constructing specific inputs and adjusting the input auto-regressively. Our evaluation demonstrates that AIA can cause Text2SQL models to generate target output by adversarial inputs with success rates of over 70% in most scenarios. The generated adversarial input has certain transferability in target Text2SQL models. Additionally, practice experiments show that AIA can make Text2SQL models extract user lists from databases and even delete data in databases directly.

Abstract: The common matrix completion methods minimize the rank of the matrix to be completed in addition to the Hamming loss between the incomplete and completed matrices. The rank of matrix measures the linear relation among the vectors of matrix, which may introduce ambiguity for data recovery. To cope with this issue, we extend multilabel ranking loss into matrix completion, and employ multi-label ranking loss minimization (MLRM) in this paper to exploit the relative correlation among matrix vectors. In MLRM, the original incomplete matrix is converted into a pairwise ranking matrix, and the approximation on this newly generated matrix can be viewed as a surrogate of multi-label ranking loss to replace the Hamming loss pattern in the existing methods. Extensive experiments demonstrate that MLRM outperforms the state-of-the-art matrix completion methods in varies of applications, including movie recommendation, drug-target interaction prediction and multi-label learning.

Abstract: Recent advancements in Neural Combinatorial Optimization (NCO) have shown promise in solving routing problems like the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) without handcrafted designs. Research in this domain has explored two primary categories of methods: iterative and noniterative. While non-iterative methods struggle to generate near-optimal solutions directly, iterative methods simplify the task by learning local search steps. However, existing iterative methods are often limited by restricted neighborhood searches, leading to suboptimal results. To address this limitation, we propose a novel approach that extends the search to larger neighborhoods by learning a destroy-and-repair strategy. Specifically, we introduce a Destroy-and-Repair framework based on Hyper-Graphs (DRHG). This framework reduces consecutive intact edges to hyper-edges, allowing the model to pay more attention to the destroyed part and decrease the complexity of encoding all nodes. Experiments demonstrate that DRHG achieves state-of-the-art performance on TSP with up to 10,000 nodes and shows strong generalization to real-world TSPLib and CVRPLib problems.

Abstract: FewShot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from a limited set of training samples without forgetting knowledge of previously learned classes. Conventional FSCIL methods typically build a robust feature extractor during the base training session with abundant training samples and subsequently freeze this extractor, only fine-tuning the classifier in subsequent incremental phases. However, current strategies primarily focus on preventing catastrophic forgetting, considering only the relationship between novel and base classes, without paying attention to the specific decision spaces of each class. To address this challenge, we propose a plug-and-play Adaptive Decision Boundary Strategy (ADBS), which is compatible with most FSCIL methods. Specifically, we assign a specific decision boundary to each class and adaptively adjust these boundaries during training to optimally refine the decision spaces for the classes in each session. Furthermore, to amplify the distinctiveness between classes, we employ a novel inter-class constraint loss that optimizes the decision boundaries and prototypes for each class. Extensive experiments on three benchmarks, namely CIFAR100, miniImageNet, and CUB200, demonstrate that incorporating our ADBS method with existing FSCIL techniques significantly improves performance, achieving overall state-of-the-art results.

Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University Zhejiang Institute of Optoelectronics, School of Computer Science and Technology, Zhejiang Normal University Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, School of Computer Science and Technology, Zhejiang Normal University Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, Department of Mathematics, City University of Hong Kong, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, School of Artificial Intelligence, and Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education, Beijing Normal University, Department of Computer Science and Technology, Cambridge University

Abstract: Hypergraphs provide a flexible framework for modeling highorder (complex) interactions among multiple entities, extending beyond traditional pairwise correlations in graph structures. However, deep hypergraph neural networks (HGNNs) often face the challenge of oversmoothing with increasing depth, similar to issues in graph neural networks (GNNs). While oversmoothing in GNNs has been extensively studied, its implications in relation to hypergraphs are less explored. This paper addresses this gap by first theoretically exploring the reasons behind oversmoothing in deep HGNNs. Our novel insights suggest that a spectral-based hypergraph convolution, equipped with both low-pass and high-pass filters, can potentially mitigate these effects. Motivated by these findings, we introduce FrameHGNN, a framework that utilizes framelet-based hypergraph convolutions integrating tight framelet transforms with both low-pass and high-pass components, as well as the commonly used strategies in designing deep GNN architecture: initial residual and identity mappings. The experiment results on diverse benchmark datasets demonstrate that FrameHGNN outperforms several state-of-the-art models, effectively reducing oversmoothing while improving predictive accuracy. Our contributions not only advance the theoretical understanding of deep hypergraph learning but also provide a practical spectral-based approach for HGNNs, emphasizing the design of multifrequency channels.

School of Computer Science and Engineering, University of Electronic Science and Technology of China, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, China, Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, China, College of Computer Science, Chongqing University, China, University of Sheffield, University of Surrey

Abstract: Unmanned Aerial Vehicle Object Detection (UAVOD) presents unique challenges due to varying altitudes, dynamic backgrounds, and the small size of objects. Traditional detection methods often struggle with these challenges, as they typically rely on visual feature only and fail to extract the semantic relations between the objects. To address these limitations, we propose a novel approach named SelfPrompting Analogical Reasoning (SPAR). Our method utilizes the vision-language model (CLIP) to generate context-aware prompts based on image feature, providing rich semantic information that guides analogical reasoning. SPAR includes two main modules: self-prompting and analogical reasoning. Self-prompting module based on learnable description and CLIP-text encoder generates context-aware prompt by combining specific image feature; then an objectness prompt score map is produced by computing the similarity between pixel-level features and context-aware prompt. With this score map, multi-scale image features are enhanced and pixel-level features are chosen for graph construction. While for analogical reasoning module, graph nodes consists of category-level prompt nodes and pixel-level image feature nodes. Analogical inference is based graph convolution. Under the guidance of category-level nodes, different-scale object features have been enhanced, which helps achieve more accurate detection of challenging objects. Extensive experiments illustrate that SPAR outperforms traditional methods, offering a more robust and accurate solution for UAVOD.

Abstract: We introduce a neural network conformal prediction method for time series that enhances adaptivity in nonstationary environments. Our approach acts as a neural controller designed to achieve desired target coverage, leveraging auxiliary multi-view data with neural network encoders in an end-to-end manner to further enhance adaptivity. Additionally, our model is designed to enhance the consistency of prediction intervals in different quantiles by integrating monotonicity constraints and leverages data from related tasks to boost few-shot learning performance. Using real-world datasets from epidemics, electric demand, weather, and others, we empirically demonstrate significant improvements in coverage and probabilistic accuracy, and find that our method is the only one that combines good calibration with consistency in prediction intervals.

Abstract: Diffusion models have shown promising ability in generating highquality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.

Abstract: In this paper, we present a novel diffusion modelbased monaural speech enhancement method. Our approach incorporates the separate estimation of speech spectra's magnitude and phase in two diffusion networks. Throughout the diffusion process, noise clips from real-world noise interferences are added gradually to the clean speech spectra and a noise-aware reverse process is proposed to learn how to generate both clean speech spectra and noise spectra. Furthermore, to fully leverage the intrinsic relationship between magnitude and phase, we introduce a complex-cycle-consistent (CCC) mechanism that uses the estimated magnitude to map the phase, and vice versa. We implement this algorithm within a phase-aware speech enhancement diffusion model (SEDM). We conduct extensive experiments on public datasets to demonstrate the effectiveness of our method, highlighting the significant benefits of exploiting the intrinsic relationship between phase and magnitude information to enhance speech. The comparison to conventional diffusion models demonstrates the superiority of SEDM.

Abstract: Crossmodal hashing provides an efficient solution for retrieval tasks across various modalities, such as images and text. However, most existing methods are deterministic models, which overlook the reliability associated with the retrieved results. This omission renders them unreliable for determining matches between data pairs based solely on Hamming distance. To bridge the gap, in this paper, we propose a novel method called Deep Evidential Cross-modal Hashing (DECH). This method equips hashing models with the ability to quantify the reliability level of the association between a query sample and each corresponding retrieved sample, bringing a new dimension of reliability to the cross-modal retrieval process. To achieve this, our method addresses two key challenges: i) To leverage evidential theory in guiding the model to learn hash codes, we design a novel evidence acquisition module to collect evidence and place the evidence captured by hash codes on a Beta distribution to derive a binomial opinion. Unlike existing evidential learning approaches that rely on classifiers, our method collects evidence directly through hash codes. ii) To tackle the task-oriented challenge, we first introduce a method to update the derived binomial opinion, allowing it to present the uncertainty caused by conflicting evidence. Following this manner, we present a strategy to precisely evaluate the reliability level of retrieved results, culminating in performance improvement. We validate the efficacy of our DECH through extensive experimentation on four benchmark datasets. The experimental results demonstrate our superior performance compared to 12 state-of-the-art methods.

Abstract: Group Fairnessaware Continual Learning (GFCL) aims to eradicate discriminatory predictions against certain demographic groups in a sequence of diverse learning tasks. This paper explores an even more challenging GFCL problem – how to sustain a fair classifier across a sequence of tasks with covariate shifts and unlabeled data. We propose the MacFRL solution, with its key idea to optimize the sequence of learning tasks. We hypothesize that high-confident learning can be enabled in the optimized task sequence, where the classifier learns from a set of prioritized tasks to glean knowledge, thereby becoming more capable to handle the tasks with substantial distribution shifts that were originally deferred. Theoretical and empirical studies substantiate that MacFRL excels among its GFCL competitors in terms of prediction accuracy and group fair-ness metrics.

Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China, Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China

Abstract: The ubiquitous and unavoidable label noise brings great challenges to the generalization performance of learning methods.Label noise correction aims to detect and correct label noise in the data, which is one of the most potential methods to address this challenge.Current methods for label noise filtering that utilize primitive features primarily concentrate on identifying noise, which often limits their capacity to adaptively learn features crucial for specific tasks, thereby resulting in a higher rate of noise identification within the noise recognition process. On the other hand, deep neural networks, endowed with robust feature extraction capabilities, typically exhibit lower noise identification, as they are prone to fitting noise patterns during the recognition process, potentially undermining their overall efficacy. Moreover, Fuzzy Learning Machine (FLM) excels not only in feature extraction but also in noise tolerance, adeptly navigating data uncertainties. FLM enhances the accuracy of the labels by calculating the membership degrees of samples across categories and determining their fuzzy memberships. The introduction of a twostage FLM-based framework, which employs a secondary learning mechanism for precise noise filtering and correction, has shown substantial improvements in noise correction across various large-scale noisy datasets, thereby significantly enhancing samples' quality and boosting the generalization capabilities of classifiers.

Abstract: Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graph, which encodes these dependencies, by assuming that the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for highdimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (L-SING), estimates the graph by using flexible classes of transport maps to represent the conditional distribution for each variable. We show that L-SING includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings by comparing it to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.

Abstract: In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of nonMarkovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of Õ((HOAT)^(1/2) + H²O²A^(3/2) + H(T)^(1/2)), where H is the time horizon, O is the number of observations, A is the number of actions, and T is the number of time steps. This result improves over the best-known bound, Õ(H(OAT)^(1/2)), for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When T ≥ H³O³A² and OA ≥ H, our regret bound leads to a regret of Õ((HOAT)^(1/2)), which matches the established lower bound of Ω((HOAT)^(1/2)) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments.

Abstract: Outof-distribution (OOD) generalization in Graph Neural Networks (GNNs) has gained significant attention due to its critical importance in graph-based predictions in real-world scenarios. Existing methods primarily focus on extracting a single causal subgraph from the input graph to achieve generalizable predictions. However, relying on a single subgraph can lead to susceptibility to spurious correlations and is insufficient for learning invariant patterns behind graph data. Moreover, in many real-world applications, such as molecular property prediction, multiple critical subgraphs may influence the target label property. To address these challenges, we propose a novel framework, SubGraph Aggregation(SuGAr), designed to learn a diverse set of subgraphs that are crucial for OOD generalization on graphs. Specifically, SuGAr employs a tailored subgraph sampler and diversity regularizer to extract a diverse set of invariant subgraphs. These invariant subgraphs are then aggregated by averaging their representations, which enriches the subgraph signals and enhances coverage of the underlying causal structures, thereby improving OOD generalization. Extensive experiments on both synthetic and real-world datasets demonstrate that SuGAr outperforms state-of-the-art methods, achieving up to a 24% improvement in OOD generalization on graphs. To the best of our knowledge, this is the first work to study graph OOD generalization by learning multiple invariant subgraphs.

Abstract: This paper studies the problem of solving the system of nonlinear equations. We propose the Gramreduced Levenberg--Marquardt method, which reuses the Gram matrix. Our method has a global convergence guarantee without relying on any step of line-search or solving sub-problems. We show that our method takes a smaller computational complexity than existing Levenberg--Marquardt methods to find the stationary point of the square norm of the equations. We also show that the proposed method enjoys a local superlinear convergence rate under the non-degenerate assumption. Experiments are conducted on real-world applications in scientific computing and machine learning, which validate the efficiency of our method.

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, School of Artificial Intelligence, Nanjing University, Nanjing 210023, China

Abstract: Multiobjective decision-making problems have emerged in numerous real-world scenarios, such as video games, navigation and robotics. Considering the clear advantages of Reinforcement Learning (RL) in optimizing decision-making processes, researchers have delved into the development of Multi-Objective RL (MORL) methods for solving multi-objective decision problems. However, previous methods either cannot obtain the entire Pareto front, or employ only a single policy network for all the preferences over multiple objectives, which may not produce personalized solutions for each preference. To address these limitations, we propose a novel decomposition-based framework for MORL, Pareto Set Learning for MORL (PSL-MORL), that harnesses the generation capability of hypernetwork to produce the parameters of the policy network for each decomposition weight, generating relatively distinct policies for various scalarized subproblems with high efficiency. PSL-MORL is a general framework, which is compatible for any RL algorithm. The theoretical result guarantees the superiority of the model capacity of PSL-MORL and the optimality of the obtained policy network. Through extensive experiments on diverse benchmarks, we demonstrate the effectiveness of PSL-MORL in achieving dense coverage of the Pareto front, significantly outperforming state-of-the-art MORL methods in both the hypervolume and sparsity indicators.

Abstract: Optimizing spectral graph neural networks (GNNs) remains a critical challenge in the field, yet the underlying processes are not well understood. In this paper, we investigate the inherent differences between graph convolution parameters and feature transformation parameters in spectral GNNs and their impact on the optimization landscape. Our analysis reveals that these differences contribute to a poorly conditioned problem, resulting in suboptimal performance. To address this issue, we introduce the concept of the block condition number of the Hessian matrix, which characterizes the difficulty of poorly conditioned problems in spectral GNN optimization. We then propose an asymmetric learning approach, dynamically preconditioning gradients during training to alleviate poorly conditioned problems. Theoretically, we demonstrate that asymmetric learning can reduce block condition numbers, facilitating easier optimization. Extensive experiments on eighteen benchmark datasets show that asymmetric learning consistently improves the performance of spectral GNNs for both heterophilic and homophilic graphs. This improvement is especially notable for heterophilic graphs, where the optimization process is generally more complex than for homophilic graphs.

Abstract: Multiview classification based on evidence theory aims to enhance result reliability by effectively quantifying prediction uncertainty at the evidence level, particularly when dealing with low-quality views. However, these methods face limitations in real-world applications due to the sensitivity of estimated uncertainty to view distribution, leading to two main issues: 1) difficulty in making clear judgments about whether to trust predictions based on vague uncertainty scores, and 2) the potential negative impact of integrating information from low-quality views on multi-view classification performance. Both limitations compromise the reliability of multi-view decisions. To address these challenges, we introduce an adaptive rejection mechanism based on estimated uncertainty, which is free of data distribution constraints. By integrating this adaptive rejection mechanism into the fusion of multiple views, our method not only indicates whether predictions should be adopted or rejected at the view level but also enhances classification performance by minimizing the impact of unreliable information. The effectiveness of our method is demonstrated through comprehensive theoretical analysis and empirical experiments on various multi-view datasets, establishing its superiority in enhancing the reliability of multi-view classification.

Abstract: We consider a general nonstochastic online pricing bandit setting in a procurement scenario where a buyer with a budget wants to procure items from a fixed set of sellers to maximize the buyer's reward by dynamically offering purchasing prices to the sellers, where the sellers' costs and values at each time period can change arbitrarily and the sellers determine whether to accept the offered prices to sell the items. This setting models online pricing scenarios of procuring resources or services in multi-agent systems. We first consider the offline setting when sellers' costs and values are known in advance and investigate the best fixed-price policy in hindsight. We show that it has a tight approximation guarantee with respect to the offline optimal solutions. In the general online setting, we propose an online pricing policy, Granularity-based Pricing (GAP), which exploits underlying side-information from the feedback graph when the budget is given as the input. We show that GAP achieves an upper bound of O(n{v_{max}}{c_{min}}sqrt{B/c_{min}}ln B) on the alpha-regret where n, v_{max}, c_{min}, and B are the number, the maximum value, the minimum cost of sellers, and the budget, respectively. We then extend it to the unknown budget case by developing a variant of GAP, namely Doubling-GAP, and show its alpha-regret is at most O(n{v_{max}}{c_{min}}sqrt{B/c_{min}}ln2 B). We also provide an alpha-regret lower bound Omega(v_{max}sqrt{Bn/c_{min}}) of any online policy that is tight up to sub-linear terms. We conduct simulation experiments to show that the proposed policy outperforms the baseline algorithms.

Abstract: With the advancement of graph representation learning, selfsupervised graph contrastive learning (GCL) has emerged as a key technique in the field. In GCL, positive and negative samples are generated through data augmentation. While recent works have introduced model-based methods to enhance positive graph augmentations, they often overlook the importance of negative samples, relying instead on rule-based methods that can fail to capture meaningful graph patterns. To address this issue, we propose a novel model-based adversarial contrastive graph augmentation (ACGA) method that automatically generates both positive graph samples with minimal sufficient information and hard negative graph samples. Additionally, we provide a theoretical framework to analyze the process of positive and negative graph augmentation in self-supervised GCL. We evaluate our ACGA method through extensive experiments on representative benchmark datasets, and the results demonstrate that ACGA outperforms state-of-the-art baselines.

Wuhan University of Science and Technology Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, China, Wuhan University of Science and Technology, Zhejiang Normal University, Wuhan university of Science and Technology, Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China, Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China

Abstract: Federated Learning (FL) mitigates privacy leakage in decentralized machine learning by allowing multiple clients to train collaboratively locally. However, dynamic mobile networks with high mobility, intermittent connectivity, and bandwidth limitation severely hinder model updates to the cloud server. Although previous studies have typically addressed user mobility issue through task reassignment or predictive modeling, frequent migrations may result in high communication overhead. Addressing this challenge involves not only dealing with resource constraints, but also finding ways to mitigate the challenges posed by user migrations. We therefore propose a intertemporal incentive framework, FedCross, which ensures the continuity of FL tasks by migrating interrupted training tasks to feasible mobile devices. FedCross comprises two distinct stages: Specifically, in Stage 1, we address the task allocation problem across regions under resource constraints by employing a multiobjective migration algorithm to quantify the optimal task receivers. Moreover, we adopt evolutionary game theory to capture the dynamic decision-making of users, forecasting the evolution of user proportions across different regions to mitigate frequent migrations. In Stage 2, we utilize a procurement auction mechanism to allocate rewards among base stations, ensuring that those providing high-quality models receive optimal compensation. This approach incentivizes sustained user participation, thereby ensuring the overall feasibility of FedCross. Finally, experimental results validate the theoretical soundness of FedCross and demonstrate its significant reduction in communication overhead.

Abstract: Concept drift, characterized by unpredictable changes in data distribution over time, poses significant challenges to machine learning models in streaming data scenarios. Although error ratebased concept drift detectors are widely used, they often fail to identify drift in the early stages when the data distribution changes but error rates remain constant. This paper introduces the Prediction Uncertainty Index (PU-index), derived from the prediction uncertainty of the classifier, as a superior alternative to the error rate for drift detection. Our theoretical analysis demonstrates that: (1) The PU-index can detect drift even when error rates remain stable. (2) Any change in the error rate will lead to a corresponding change in the PU-index. These properties make the PU-index a more sensitive and robust indicator for drift detection compared to existing methods. We also propose a PU-index-based Drift Detector (PUDD) that employs a novel Adaptive PU-index Bucketing algorithm for detecting drift. Empirical evaluations on both synthetic and real-world datasets demonstrate PUDD’s efficacy in detecting drift in structured and image data.

Abstract: Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather than performance, while the latter requires several times more computing and storage resources to support inference. In this paper, we set the goal to explicitly improve these incompatible parameters by leveraging the complementary strengths of different models, thereby directly enhancing the models without any additional parameters. Specifically, we propose Compatibilityaware Knowledge Integration (CKI), which consists of Parameter Compatibility Assessment and Parameter Splicing, which are used to evaluate the knowledge content of multiple models and integrate the knowledge into one model, respectively. The integrated model can be used directly for inference or for further fine-tuning. Extensive experiments on various recommendation and language datasets show that CKI can effectively optimize incompatible parameters under multiple tasks and settings to break through the training limit of the original model without increasing the inference cost.

Abstract: Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we derive model confidence from the distribution of multiple randomly sampled generations, using three measures of consistency. We extensively evaluate eleven open and closedsource models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches in terms of calibration error. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency can potentially enhance model performance. Finally, we offer guidance on choosing suitable consistency metrics for calibration, tailored to model characteristics such as the exposure to instruction-tuning and RLHF.

Abstract: Screenshooting robust watermarking is an effective means of preventing screen content leakage from unauthorized camera shooting, as it can trace the leaked source through the watermark extraction thereby providing an effective deterrent. However, current screen-shooting resilient watermarking schemes rely on the image's contours to synchronize and then extract the watermark. While in practical applications, it's common for only a portion of the image to be captured, resulting in a limited performance of the previous watermarking schemes. To address this problem, we propose the RoPaSS: a robust watermarking scheme for partial screen-shooting scenarios, which effectively constructs symmetric characteristics on the embedding watermark to handle the sticky re-synchronization issue. Specifically, RoPaSS consists of a watermark encoder, a decoder, and three estimators, which are trained in two stages. In the first training stage, RoPaSS integrates the flipping operation into the watermark encoder and decoder training to increase the redundancy of watermark messages and artificially guide the generation of symmetric watermarks. In the second stage, estimators utilize the watermark symmetry as an additional reference to estimate the restoration parameters to resynchronize the partially captured watermarked image. Experiments have demonstrated the excellent performance of RoPaSS in partial screen-shooting traceability, with extraction accuracy of above 93% in frontal shooting and above 86% in 30° shooting even if only 50% of the image content is captured.

Abstract: Graph neural networks for hyperbolic space has emerged as a powerful tool for embedding datasets exhibiting a highly nonEuclidean latent anatomy e.g., graphs with hierarchical structures. While several Hyperbolic Graph Neural Networks (Hy-GNNs) have been developed to enhance the representation of hierarchical datasets, they remain susceptible to noise and adversarial attacks, posing serious risks in critical applications. The absence of robust Hy-GNN frameworks underscores a pressing problem. This research addresses this challenge by introducing HyperDefender—a robust and flexible approach designed to fortify Hy-GNNs against adversarial attacks and noises. HyperDefender aims to secure the reliability of applications that depend on the integrity of hierarchical graph-structured data in real-world scenarios. Experimental results demonstrate that HyperDefender significantly improves node classification accuracy across various attacks, effectively mitigating the performance degradation typically observed in Hy-GNNs when the hierarchy in original datasets is compromised.

Abstract: Understanding causality is challenging and often complicated by changing causal relationships over time and across environments. Climate patterns, for example, shift over time with recurring seasonal trends, while also depending on geographical characteristics such as ecosystem variability. Existing methods for discovering causal graphs from time series either assume stationarity, do not permit both temporal and spatial distribution changes, or are unaware of locations with the same causal relationships. In this work, we therefore unify the three tasks of causal graph discovery in the nonstationary multi-context setting, of reconstructing temporal regimes, and of partitioning datasets and time intervals into those where invariant causal relationships hold. To construct a consistent score that forms the basis of our method, we employ the Minimum Description Length principle. Our resulting algorithm SPACETIME simultaneously accounts for heterogeneity across space and non-stationarity over time. Given multiple time series, it discovers regime changepoints and a temporal causal graph using non-parametric functional modeling and kernelized discrepancy testing. We also show that our method provides insights into real-world phenomena such as river-runoff measured at different catchments and biosphere-atmosphere interactions across ecosystems.

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China Information Technology and Data Management Department of China Mobile Communications Group Zhejiang Co., Ltd, Lenovo Research, Lenovo Group Ltd., Beijing, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

Abstract: Multilabel metric learning, as an extension of metric learning to multi-label scenarios, aims to learn better similarity metrics for objects with rich semantics. Existing multi-label metric learning approaches employ the common assumption of equal labeling-importance, i.e., all associated labels are considered relevant to the training instance, while there is no differentiation in the relative importance of their semantics. However, this common assumption does not reflect the fact that the importance of each relevant label is generally different, even though such importance information is not directly accessible from the training examples. In this paper, we claim that it is beneficial to leverage the implicit Relative LabelingImportance (RLI) information to facilitate multi-label metric learning. Specifically, the manifold structure within the feature space is exploited by local linear reconstruction, and then the RLIs are recovered by transferring such structure to the label space. Subsequently, a discrimiative multi-label metric learning framework is introduced to align the predictive modeling outputs with the recovered RLIs, under which instances with similar RLI are implicitly pulled closer to each other, while those with dissimilar RLI are pushed further apart. Comprehensive experiments on benchmark multi-label datasets validate the superiority of our proposed approach in learning effective similarity metrics between multi-label examples.

Abstract: eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, the scopes of most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohortbased explanation, is less explored. Cohort explanations offer insights into the explainee's behavior on a specific group or cohort of instances, enabling a deeper understanding of model decisions within a defined context. In this paper, we discuss the unique challenges and opportunities associated with measuring cohort explanations, define their desired properties, and create a generalized framework for generating cohort explanations based on supervised clustering.

Abstract: Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by classspecific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions.

Abstract: We propose SMMF (SquareMatricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of the widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of an arbitrary rank (shape) of the first and second momentum tensors during optimization, based on the proposed square-matricization and one-time single matrix factorization. From this, it becomes effectively applicable to any rank (shape) of momentum tensors, i.e., bias, matrix, and any rank-d tensors, prevalent in various deep model architectures, such as CNNs (high rank) and Transformers (low rank), in contrast to existing memory-efficient optimizers that applies only to a particular (rank-2) momentum tensor, e.g., linear layers. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiment, SMMF takes up to 96% less memory compared to state-of-the-art memoryefficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks.

Abstract: We propose a neurosymbolic architecture aimed at boosting the performance of any Language Model (LM) for SQL query generation. This approach leverages symbolic reasoning to guide the LM's exploration of the search space by considering multiple paths, symbolically evaluating choices at each decision point to choose the next step, with the added novel ability to backtrack. A key innovation is the use of symbolic checks on both partially and fully generated SQL queries, enabling early truncation of unsuccessful search paths. Input consists of textual requirements on the desired query, along with optional example tuples to be selected by the query. Experiments on Xander, our opensource implementation, show it both reduces runtime and increases accuracy of the generated SQL. A specific result is an LM using Xander outperforming a four-times-larger LM.

Abstract: This paper explores how to enhance existing masked timeseries modeling by randomly dropping sub-sequence level patches of time series. On this basis, a simple yet effective method named DropPatch is proposed, which has two remarkable advantages: 1) It improves the pre-training efficiency by a square-level advantage; 2) It provides additional advantages for modeling in scenarios such as in-domain, cross-domain, few-shot learning and cold start. This paper conducts comprehensive experiments to verify the effectiveness of the method and analyze its internal mechanism. Empirically, DropPatch strengthens the attention mechanism, reduces information redundancy and serves as an efficient means of data augmentation. Theoretically, it is proved that DropPatch slows down the rate at which the Transformer representations collapse into the rank-1 linear subspace by randomly dropping patches, thus optimizing the quality of the learned representations.

Abstract: Black box optimization (BBO) focuses on optimizing unknown functions in highdimensional spaces. In many applications, sampling the unknown function is expensive, imposing a tight sample budget.Ongoing work is making progress on reducing the sample budget by learning the shape/structure of the function, known as kernel learning. We propose a new method to learn the kernel of a Gaussian Process. Our idea is to create a continuous kernel space in the latent space of a variational autoencoder, and run an auxiliary optimization to identify the best kernel. Results show that the proposed method, Kernel Optimized Blackbox Optimization (KOBO), outperforms state of the art by estimating the optimal at considerably lower sample budgets. Results hold not only across synthetic benchmark functions but also in real applications. We show that a hearing aid may be personalized with fewer audio queries to the user, or a generative model could converge to desirable images from limited user ratings.

Abstract: The learnware paradigm aims to establish a learnware market such that users can build their own models by reusing appropriate existing models in the market without starting from scratch. It is often the case that a single model is insufficient to fully satisfy the user's requirement. Meanwhile, offering multiple models can lead to higher costs for users alongside an increase in hardware resource demands. To address this challenge, this paper proposes the ''Sliceand-Pack'' (S&P) framework to empower the market to provide users with only the required model fragments without having to offer entire abilities of all involved models. Our framework first slices a set of models into small fragments and subsequently packs selected fragments according to user's specific requirement. In the slicing stage, we extract units layer by layer and connect these units to create numerous fragments. In the packing stage, an encoder-decoder mechanism is employed to assemble these fragments. These processes are conducted within data-limited constraints due to privacy concerns. Extensive experiments validate the effectiveness of our framework.

Abstract: Vertical Federated Learning (VFL) involves multiple clients collaborating to train a global model, with distributed features of shared samples. While it becomes a critical privacypreserving learning paradigm, its security can be significantly compromised by backdoor attacks, where a malicious client injects a target backdoor by manipulating local data. Existing attack methods in VFL rely on the assumption that the malicious client can obtain additional knowledge about task labels, which is not applicable in VFL. In this work, we investigate a new backdoor attack paradigm in VFL, Label-Free Backdoor Attacks (LFBA), which does not require any additional task label information and is feasible in VFL settings. Specifically, while existing methods assume access to task labels or target-class samples, we demonstrate that the gradients of local embeddings reflect the semantic information of labels. It can be utilized to construct the target poison sample set. Besides, we uncover that backdoor triggers tend to be ignored and under-fitted due to the learning of original features, which hinders backdoor task optimization. To address this, we propose selectively switching poison samples to disrupt feature learning, promoting backdoor task learning while maintaining accuracy on clean data. Extensive experiments demonstrate the effectiveness of our method in various settings.

Abstract: We study the problem of designing replicationproof bandit mechanisms when agents strategically register or replicate their own arms to maximize their payoff. Specifically, we consider Bayesian agents who only know the distribution from which their own arms' mean rewards are sampled, unlike the original setting of by Shin, Lee, and Ok AISTATS'22. Interestingly, with Bayesian agents in stark contrast to the previous work, analyzing the replication-proofness of an algorithm becomes significantly complicated even in a single-agent setting. We provide sufficient and necessary conditions for an algorithm to be replication-proof in the single-agent setting, and present an algorithm that satisfies these properties. These results center around several analytical theorems that focus on comparing the expected regret of multiple bandit instances, and therefore might be of independent interest since they have not been studied before to the best of our knowledge. We expand this result to the multi-agent setting, and provide a replication-proof algorithm for any problem instance. We finalize our result by proving its sublinear regret upper bound which matches that of Shin, Lee, and Ok AISTATS'22.

Abstract: Federated Learning (FL) has pioneered the idea of "share wisdom not raw data" to enable collaborative learning over decentralized data. FL achieves this goal by averaging model parameters instead of centralizing data. However, representing "wisdom" in the form of model parameters has its own limitations including the requirement for uniform model architectures across clients and communication overhead proportional to model size. In this work we introduce CoDream a framework for representing "wisdom" in data space instead of model parameters. Here, clients collaboratively optimize random inputs based on their locally trained models and aggregate gradients of their inputs. Our proposed approach overcomes the aforementioned limitations and comes with additional benefits such as adaptive optimization and interpretable representation of knowledge. We empirically demonstrate the effectiveness of Co-Dream and compare its performance with existing techniques.

Abstract: When trained with severely imbalanced data, deep neural networks often struggle to accurately recognize classes with few samples. Previous studies in longtailed recognition have attempted to rebalance biased learning using known sample distributions, primarily addressing different classification difficulties at the class level. However, these approaches often overlook the instance difficulty variation within each class. In this paper, we propose a difficulty-aware balancing margin (DBM) loss, which considers both class imbalance and instance difficulty. DBM loss comprises two components: a class-wise margin to mitigate learning bias caused by imbalanced class frequencies, and an instance-wise margin assigned to hard positive samples based on their individual difficulty. DBM loss improves class discriminativity by assigning larger margins to more difficult samples. Our method effortlessly combine with existing approaches and consistently improves performance across various long-tailed recognition benchmarks.

Department of Information Convergence Engineering, Pusan National University, Korea, Department of Information Convergence Engineering, Pusan National University, Korea, Department of Information Convergence Engineering, Pusan National University, Korea, Department of Information Convergence Engineering, Pusan National University, Korea, Department of Information Convergence Engineering, Pusan National University, Korea School of Biomedical Convergence Engineering, Pusan National University, Korea Center for Artificial Intelligence Research, Pusan National University, Korea

Abstract: Foundation models, serving as pretrained fundamental bases for a variety of downstream tasks, try to learn versatile, rich, and generalizable representations that can be quickly adopted through finetuning or even in a zero-shot manner for specific applications. Foundation models for molecular representation are no exception. Various pretext tasks have been proposed for pretraining molecular representations, but these approaches have focused on only single or partial properties. Molecules are complicated and require different perspectives depending on purposes: insights from local- or global-level, 2D-topology or 3D-spatial arrangement, and low- or high-level semantics. We propose Multi-level mOlecule gRaph prE-train (MORE) to consider these multiple aspects of molecules simultaneously. Experimental results demonstrate that our proposed method effectively learns comprehensive representations by showing outstanding performance in both linear probing and full fine-tuning. Notably, in quantification experiments of forgetting the pretrained models, MORE consistently exhibits minimal and stable parameter changes with the smallest performance gap, whereas other methods show substantial and inconsistent fluctuations with larger gaps. The effectiveness of individual pretext tasks varies depending on the problems being solved, which again highlights the need for a multi-level perspective. Scalability experiments reveal steady improvements of MORE as the dataset size increases, suggesting potential gains with larger datasets as well.

Abstract: Catastrophic forgetting poses a significant challenge for graph neural networks in continuously updating their knowledge base with data streams. To address this issue, much of the research has focused on nodelevel continual learning using parameter regularization or rehearsal-based strategies, while little attention given to graph-level tasks. Furthermore, current paradigms for continual graph learning may inadvertently capture spurious correlations for specific tasks through shortcuts, thereby exacerbating the forgetting of previous knowledge when new tasks are introduced. To tackle these challenges, we propose a novel paradigm, Rationale Learning GNN (RL-GNN), for graph-level continual graph learning. Specifically, we harness the invariant learning principle to incorporate environmental interventions into both the current and historical distributions, aiming to uncover rationales by minimizing empirical risk across all environments. The rationale serves as the sole factor guiding the learning process. Therefore, continual graph learning is redefined as capturing these invariant rationales within task sequences, alleviating catastrophic forgetting caused by spurious features. Extensive experiments on real-world datasets with varying task lengths demonstrate the effectiveness of our RL-GNN in continuous knowledge assimilation and reduction of catastrophic forgetting.

Abstract: Offline reinforcement learning (RL) methods aim to learn optimal policies with access only to trajectories in a fixed dataset. Policy constraint methods formulate policy learning as an optimization problem that balances maximizing reward with minimizing deviation from the behavior policy. Closed form solutions to this problem can be derived as weighted behavioral cloning objectives that, in theory, must compute an intractable partition function. Reinforcement learning has gained popularity in language modeling to align models with human preferences; some recent works consider paired completions that are ranked by a preference model following which the likelihood of the preferred completion is directly increased. We adapt this approach of paired comparison. By reformulating the pairedsample optimization problem, we fit the maximum-mode of the Q function while maximizing behavioral consistency of policy actions. This yields our algorithm, Behavior Preference Regression for offline RL (BPR). We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. BPR demonstrates state-of-the-art performance over all domains. Our on-policy experiments suggest that BPR takes advantage of the stability of on-policy value functions with minimal performance degradation on Locomotion datasets.

Abstract: Optimizing riskaverse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. Compared with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.

Abstract: Auctionbased Federated Learning (AFL) has gained significant research interest due to its ability to incentivize data owners (DOs) to participate in FL model training tasks of data consumers (DCs) through economic mechanisms via the auctioneer. One of the critical research issues in AFL is decision support for the auctioneer. Existing approaches are based on the simplified assumption of a single, monopolistic AFL marketplace, which is unrealistic in real-world scenarios where multiple AFL marketplaces can co-exist and compete for the same pool of DOs. In this paper, we relax this assumption and frame the AFL auctioneer decision support problem from the perspective of helping them attract participants in a competitive AFL marketplace environment while safeguarding profit. To achieve this objective, we propose the Auctioneer Revenue Allocation Strategy for AFL (ARAS-AFL). We design the concepts of the attractiveness and competitiveness from the perspective of autioneer reputation. Based on the Lyapunov optimization, ARAS-AFL helps individual AFL auctioneer achieve the dual objective of balancing the reputation management costs and its own profit by designing a dynamic revenue allocation strategy. It takes into account both the auctioneer’s revenue and the changes in the number of participants on the AFL marketplace. Through extensive experiments on widely used benchmark datasets, ARAS-AFL demonstrates superior performance compared to state-of-the-art approaches. It outperforms the best baseline by 49.06%, 98.69%, 10.32%, and 4.77% in terms of total revenue, number of data owners, public reputation and accuracy of federated learning models, respectively.

Abstract: Conditional independence (CI) testing is a fundamental task in modern statistics and machine learning. The conditional randomization test (CRT) was recently introduced to test whether two random variables, X and Y, are conditionally independent given a potentially highdimensional set of random variables, Z. The CRT operates exceptionally well under the assumption that the conditional distribution X|Z is known. However, since this distribution is typically unknown in practice, accurately approximating it becomes crucial. In this paper, we propose using conditional diffusion models (CDMs) to learn the distribution of X|Z. Theoretically and empirically, it is shown that CDMs closely approximate the true conditional distribution. Furthermore, CDMs offer a more accurate approximation of X|Z compared to GANs, potentially leading to a CRT that performs better than those based on GANs. To accommodate complex dependency structures, we utilize a computationally efficient classifier-based conditional mutual information (CMI) estimator as our test statistic. The proposed testing procedure performs effectively without requiring assumptions about specific distribution forms or feature dependencies, and is capable of handling mixed-type conditioning sets that include both continuous and discrete variables. Theoretical analysis shows that our proposed test achieves a valid control of the type I error. A series of experiments on synthetic data demonstrates that our new test effectively controls both type-I and type-II errors, even in high dimensional scenarios.

Abstract: In this work, we present an incontext policy adaptation (ICPAD) framework designed for long-horizon multi-task environments, exploring diffusion-based skill learning techniques in cross-domain settings. The framework enables rapid adaptation of skill-based reinforcement learning policies to diverse target domains, especially under stringent constraints on no model updates and only limited target domain data. Specifically, the framework employs a cross-domain skill diffusion scheme, where domain-agnostic prototype skills and a domain-grounded skill adapter are learned jointly and effectively from an offline dataset through cross-domain consistent diffusion processes. The prototype skills act as primitives for common behavior representations of long-horizon policies, serving as a lingua franca to bridge different domains. Furthermore, to enhance the in-context adaptation performance, we develop a dynamic domain prompting scheme that guides the diffusion-based skill adapter toward better alignment with the target domain. Through experiments with robotic manipulation in Metaworld and autonomous driving in CARLA, we show that our ICPAD framework achieves superior policy adaptation performance under limited target domain data conditions for various cross-domain configurations including differences in environment dynamics, agent embodiment, and task horizon.

Abstract: In recent years, large language models (LLMs) have developed rapidly and revolutionized natural language processing. However, high storage overhead and computing costs limit LLM deployment in resourceconstrained environments. Quantization algorithms can effectively compress LLMs and accelerate inference, but they lead to loss in precision, especially in low-bit scenarios. In this paper, we find that the discarded weight values caused by quantization in fact contain treasures to improve LLMs' accuracy. To excavate those hidden treasures, we construct search spaces around these discarded weights and those weights within the search space can seamlessly be incorporated into the original quantization weights. To determine which weights should be merged, we design a plug-and-play weight compensation framework to capture global information and keep the weights with the highest potential benefits. Our framework can be combined with various LLM quantization algorithms to achieve higher precision without additional inference overhead. We validate the effectiveness of our approach on widely used benchmark datasets for LLMs.

Abstract: Heterogeneity is a fundamental and challenging issue in federated learning, especially for the graph data due to the complex relationships among the graph nodes. To deal with the heterogeneity, lots of existing methods perform the weighted federation based on their calculated similarities between pairwise clients (i.e., subgraphs). However, their intersubgraph similarities estimated with the outputs of local models are less reliable, because the final outputs of local models may not comprehensively represent the real distribution of subgraph data. In addition, they ignore the critical intra-heterogeneity which usually exists within each subgraph itself. To address these issues, we propose a novel Federated learning method by integrally modeling the Inter-Intra Heterogeneity (FedIIH). For the inter-subgraph relationship, we propose a novel hierarchical variational model to infer the whole distribution of subgraph data in a multi-level form, so that we can accurately characterize the inter-subgraph similarities with the global perspective. For the intra-heterogeneity, we disentangle the subgraph into multiple latent factors and partition the model parameters into multiple parts, where each part corresponds to a single latent factor. Our FedIIH not only properly computes the distribution similarities between subgraphs, but also learns disentangled representations that are robust to irrelevant factors within subgraphs, so that it successfully considers the inter- and intra- heterogeneity simultaneously. Extensive experiments on six homophilic and five heterophilic graph datasets in both non-overlapping and overlapping settings demonstrate the effectiveness of our method when compared with eight state-of-the-art methods. Specifically, FedIIH averagely outperforms the second-best method by a large margin of 5.79% on all heterophilic datasets.

Abstract: Deep learning has achieved remarkable success in supervised image classification tasks, which relies on a large number of labeled samples for each class. Recently, zeroshot learning has garnered significant attention, which aims to recognize unseen classes using only training samples from seen classes. To bridge the gap between images and classes, class semantic attributes are introduced, making the alignment between image and class semantic attributes critical to zero-shot learning. However, existing methods often struggle to accurately focus on the image regions corresponding to individual class semantic attributes and tend to overlook the relations between different regions of an image, leading to poor alignment. To address these challenges, we propose a class semantic attribute perception guided zero-shot learning method. Specifically, we achieve coarse-grained perception of class semantic attributes across the entire image through contrastive semantic learning. Additionally, we attain fine-grained perception of individual class semantic attributes within image regions via region partitioning-based attribute alignment, which fully considers the relations between different regions of an image. By integrating these two processes into a unified network, we achieve multi-grained class semantic attribute perception, thereby enhancing the alignment between images and class semantic attributes. We validate the effectiveness of the proposed method on zero-shot learning benchmark data sets.

Abstract: Achieving robust performance in visionlanguage tasks requires strong multimodal alignment, where textual and visual data interact seamlessly. Existing frameworks often combine contrastive learning with image captioning to unify visual and textual representations. However, reliance on global representations and unidirectional information flow from images to text limits their ability to reconstruct visual content accurately from textual descriptions. To address this limitation, we propose BiMAC, a novel framework that enables bidirectional interactions between images and text at both global and local levels. BiMAC employs advanced components to simultaneously reconstruct visual content from textual cues and generate textual descriptions guided by visual features. By integrating a text-region alignment mechanism, BiMAC identifies and selects relevant image patches for precise cross-modal interaction, reducing information noise and enhancing mapping accuracy. BiMAC achieves state-of-the-art performance across diverse vision-language tasks, including image-text retrieval, captioning, and classification.

Abstract: As data and computational resources continue to expand, incorporating a variety of knowledge during the pretraining phase enhances large models, providing them with strong zero-shot capabilities. Due to the alignment of modal features by visual language models, zero-shot image captioning no longer necessitates pre-training on paired image-text labeled data, enabling accurate text description generation for images not encountered before. While recent research focuses on methods utilizing entity retrieval as anchors to bridge the gap between different modalities, these approaches often fall short of thoroughly analyzing the impact of entity retrieval recall on the zero-shot generation capabilities. To address this issue, we propose MERCap, a zero-shot image captioning method employing Multi-type Entity representation Retrieval. More specifically, we first approximate image representation using the CLIP representation of text and Gaussian noise to address the modality gap. Then, we train a GPT-2 decoder to reconstruct text using entities as hard prompts and CLIP representations as soft prompts. Additionally, we construct a domain-specific entity set, assigning multiple representations to each entity and refining their representation vectors through contrastive learning. During inference, we retrieve entities and input them into the decoder to generate corresponding captions. Extensive experiments validate that our approach is efficient, achieving a new state-of-the-art level in cross-domain captioning and demonstrating strong competitiveness in in-domain captioning compared to existing methods.

Abstract: Graph Neural Networks (GNNs) have recently gained widespread attention as a successful tool for analyzing graphstructured data. However, imperfect graph structure with noisy links lacks enough robustness and may damage graph representations, therefore limiting the GNNs' performance in practical tasks. Moreover, existing generative architectures fail to fit discriminative graph-related tasks. To tackle these issues, we introduce an unsupervised method based on a joint of generative training and discriminative training to learn graph structure and representation, aiming to improve the discriminative performance of generative models. We propose an Energy-based Contrastive Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as ECL-GSR. To our knowledge, this is the first work to combine energy-based models with contrastive learning for GSR. Specifically, we leverage ECL to approximate the joint distribution of sample pairs, which increases the similarity between representations of positive pairs while reducing the similarity between negative ones. Refined structure is produced by augmenting and removing edges according to the similarity metrics among node representations. Extensive experiments demonstrate that ECL-GSR outperforms the state-of-the-art on eight benchmark datasets in node classification. ECL-GSR achieves faster training with fewer samples and memories against the leading baseline, highlighting its simplicity and efficiency in downstream tasks.

Abstract: The Large VisionLanguage Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.

Abstract: Lately, the practice of utilizing taskspecific fine-tuning has been implemented to improve the performance of large language models (LLM) in subsequent tasks. Through the integration of diverse LLMs, the overall competency of LLMs is significantly boosted. Nevertheless, traditional ensemble methods are notably memory-intensive, necessitating the simultaneous loading of all specialized models into GPU memory. To address the inefficiency, model merging strategies have emerged, merging all LLMs into one model to reduce the memory footprint during inference. Despite these advances, model merging often leads to parameter conflicts and performance decline as the number of experts increases. Previous methods to mitigate these conflicts include post-pruning and partial merging. However, both approaches have limitations, particularly in terms of performance and storage efficiency when merged experts increase. To address these challenges, we introduce Channel Merging, a novel strategy designed to minimize parameter conflicts while enhancing storage efficiency. This method initially clusters and merges channel parameters based on their similarity to form several groups offline. By ensuring that only highly similar parameters are merged within each group, it significantly reduces parameter conflicts. During inference, we can instantly look up the expert parameters from the merged groups, preserving specialized knowledge. Our experiments demonstrate that Channel Merging consistently delivers high performance, matching unmerged models in tasks like English and Chinese reasoning, mathematical reasoning, and code generation. Moreover, it obtains results comparable to model ensemble with just 53% parameters when used with a task-specific router.

Abstract: In Continual Learning (CL), while existing work primarily focuses on the multiclass classification task, there has been limited research on Multi-Label Learning (MLL). In practice, MLL datasets are often class-imbalanced, making it inherently challenging, a problem that is even more acute in CL. Due to its sensitivity to imbalance, Macro-AUC is an appropriate and widely used measure in MLL. However, there is no research to optimize Macro-AUC in MLCL specifically. To fill this gap, in this paper, we propose a new memory replay-based method to tackle the imbalance issue for Macro-AUC-oriented MLCL. Specifically, inspired by recent theory work, we propose a new Reweighted Label-Distribution-Aware Margin (RLDAM) loss. Furthermore, to be compatible with the RLDAM loss, a new memory-updating strategy named Weight Retain Updating (WRU) is proposed to maintain the numbers of positive and negative instances of the original dataset in memory. Theoretically, we provide superior generalization analyses of the RLDAM-based algorithm in terms of Macro-AUC, separately in batch MLL and MLCL settings. This is the first work to offer theoretical generalization analyses in MLCL to our knowledge. Finally, a series of experimental results illustrate the effectiveness of our method over several baselines.

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, College of Computer Science, Sichuan University, School of Computer and Control Engineering, Yantai University, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, College of Computer, National University of Defense Technology, College of Computer Science and Technology, Zhejiang University, Department of Control Science and Intelligence Engineering, Nanjing University, School of Economics and Management, Beijing Jiaotong University, School of Computer Science and Engineering, University of Electronic Science and Technology of China

Abstract: Incomplete multiview clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. By performing contrastive learning on a limited set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches.

Abstract: In deep regression, capturing the relationship among continuous labels in feature space is a fundamental challenge that has attracted increasing interest. Addressing this issue can prevent models from converging to suboptimal solutions across various regression tasks, leading to improved performance, especially for imbalanced regression and under limited sample sizes. However, existing approaches often rely on orderaware representation learning or distance-based weighting. In this paper, we hypothesize a linear negative correlation between label distances and representation similarities in regression tasks. To implement this, we propose an angle-compensated contrastive regularizer for deep regression, which adjusts the cosine distance between anchor and negative samples within the contrastive learning framework. Our method offers a plug-and-play compatible solution that extends most existing contrastive learning methods for regression tasks. Extensive experiments and theoretical analysis demonstrate that our proposed angle-compensated contrastive regularizer not only achieves competitive regression performance but also excels in data efficiency and effectiveness on imbalanced datasets.

Abstract: Graph similarity computation (GSC) is to calculate the similarity between one pair of graphs, which is a fundamental problem with fruitful applications in the graph community. In GSC, graph edit distance (GED) and maximum common subgraph (MCS) are the two most adopted similarity metrics, both of which are NPhard to compute. Instead of calculating the exact values, state-of-the-art solutions resort to leveraging graph neural networks (GNNs) to learn data-driven models for the estimation of GED and MCS. Most of them are built on components involving node-level interactions crossing graphs, which engender vast computation overhead but are of little avail in effectiveness. Motivated by this, in the paper, we present GraSP, a simple yet effective GSC approach for GED and MCS prediction. More concretely, GraSP achieves high result efficacy through several key instruments: enhanced node features via positional encoding and a GNN model augmented by a gating mechanism, residual connections, as well as multi-scale pooling. Theoretically, GraSP can surpass the 1-WL test, indicating its high expressiveness. Empirically, extensive experiments comparing GraSP against 10 competitors on multiple widely adopted benchmark datasets showcase the superiority of GraSP over prior arts in terms of both effectiveness and efficiency.

Abstract: Probabilistic truncation has been widely used in a broad range of privacypreserving machine learning (PPML) platforms, such as EdaBits (Crypto 20), ABY 2.0 (Usenix 21), Crypten (NIPS 21), Piranha-Falcon (Usenix 22), and Bicoptor (S&P 23), etc. In this work, we examine the problems of common probabilistic truncation protocols in PPML, and propose solutions from the perspectives of accuracy and efficiency. With regard to accuracy, we found the recommended precision parameters in many existing works are incorrect, leading to extremely low inference accuracy. We conducted a thorough analysis of their open-source code and found that their errors were mainly caused by simplified implementation; more specifically, random numbers are not correctly sampled in probabilistic truncation protocols. Based on this, we provide a detailed theoretical analysis to validate our views. With regard to efficiency, we identify limitations in the state-of-the-art secure comparison, Bicoptor’s (S&P 2023) DReLU protocol, which relies on the probabilistic truncation and is heavily constrained by the security parameter to eliminate errors, significantly impacting its performance. To address these challenges, we introduce a non-interactive deterministic truncation technique, replacing the original probabilistic truncation. Additionally, we propose a new technique for speeding up the ReLU/DReLU evaluation, which can be applied to the other non-linear functions as well. When the input size of DReLU is reduced to 7 bits, we can speed up approximately 5x the ReLU protocols w.r.t. ABY3, ABY2.0, EdaBits, and Bicoptor without compromising model accuracy. The improved protocol can complete a ReLU evaluation within 2 rounds and 704 bits overall communication when the input/output is secretly shared over the 64-bit ring, which yields a 92% communication reduction on original Bicoptor. Compared to existing PPML platforms with GPU acceleration, our benchmark indicates a 10x improvement in the DReLU protocol, and a 6x improvement in the ReLU protocol over Piranha-Falcon and a 3.7x improvement over Bicoptor. As a result, the overall PPML model inference could be sped up by 3-4 times.

Abstract: This work presents an informationtheoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and grey-box settings.

Abstract: We propose Orpheus, a novel programming model for communicating agents based on information protocols and realized using cognitive programming. Whereas traditional models are focused on reactions to handle incoming messages, Orpheus supports organizing the internal logic of an agent based on its goals. We give an operational semantics for Orpheus and implement this semantics in an adapter to help build agents. We use the adapter to demonstrate how Orpheus simplifies the programming of decentralized multiagent systems compared to the reactive programming model.

Abstract: Advancements in chip design and manufacturing have enabled the processing of complex tasks such as deep learning and natural language processing, paving the way for the development of artificial general intelligence (AGI). AI, on the other hand, can be leveraged to innovate and streamline semiconductor technology from planning and implementation to manufacturing. In this paper, we present Intelligent OPC Engineer Assistant, an AI/LLMpowered methodology designed to solve the core manufacturing-aware optimization problem known as Optical Proximity Correction (OPC). The methodology involves a reinforcement learning-based OPC recipe search and a customized multi-modal agent system for recipe summarization. Experiments demonstrate that our methodology can efficiently build OPC recipes on various chip designs with specially handled design topologies, a task that typically requires the full-time effort of OPC engineers with years of experience.

Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Tencent AI Lab, Tencent AI Lab, Tsinghua University, Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences AiRiA

Abstract: Opponent Modeling (OM) aims to enhance decisionmaking by modeling other agents in multi-agent environments. Existing works typically learn opponent models against a pre-designated fixed set of opponents during training. However, this will cause poor generalization when facing unknown opponents during testing, as previously unseen opponents can exhibit out-of-distribution (OOD) behaviors that the learned opponent models cannot handle. To tackle this problem, we introduce a novel Open-Ended Opponent Modeling (OEOM) framework, which continuously generates opponents with diverse strengths and styles to reduce the possibility of OOD situations occurring during testing. Founded on population-based training and information-theoretic trajectory space diversity regularization, OEOM generates a dynamic set of opponents. This set is then fed to any OM approaches to train a potentially generalizable opponent model. Upon this, we further propose a simple yet effective OM approach that naturally fits within the OEOM framework. This approach is based on in-context reinforcement learning and learns a Transformer that dynamically recognizes and responds to opponents based on their trajectories. Extensive experiments in cooperative, competitive, and mixed environments demonstrate that OEOM is an approach-agnostic framework that improves generalizability compared to training against a fixed set of opponents, regardless of OM approaches or testing opponent settings. The results also indicate that our proposed approach generally outperforms existing OM baselines.

Abstract: Responsibility plays a key role in the development and deployment of trustworthy autonomous systems. In this paper, we focus on the problem of strategic reasoning in probabilistic multiagent systems with responsibility-aware agents. We introduce the logic PATL+R, a variant of Probabilistic Alternating-time Temporal Logic. The novelty of PATL+R lies in its incorporation of modalities for causal responsibility, providing a framework for responsibility-aware multi-agent strategic reasoning. We present an approach to synthesise joint strategies that satisfy an outcome specified in PATL+R, while optimising the share of expected causal responsibility and reward. This provides a notion of balanced distribution of responsibility and reward gain among agents. To this end, we utilise the Nash equilibrium as the solution concept for our strategic reasoning problem and demonstrate how to compute responsibility-aware Nash equilibrium strategies via a reduction to parametric model checking of concurrent stochastic multi-player games.

Abstract: In recent years, agents have become capable of communicating seamlessly via natural language and navigating in environments that involve cooperation and competition, a fact that can introduce social dilemmas. Due to the interleaving of cooperation and competition, understanding agents' decisionmaking in such environments is challenging, and humans can benefit from obtaining explanations. However, such environments and scenarios have rarely been explored in the context of explainable AI. While some explanation methods for cooperative environments can be applied in mixed-motive setups, they do not address inter-agent competition, cheap-talk, or implicit communication by actions. In this work, we design explanation methods to address these issues. Then, we proceed to establish generality and demonstrate the applicability of the methods to three games with vastly different properties. Lastly, we demonstrate the effectiveness and usefulness of the methods for humans in two mixed-motive games. The first is a challenging 7-player game called no-press Diplomacy. The second is a 3-player game inspired by the prisoner's dilemma, featuring communication in natural language.

Abstract: Selfcorrection in text-to-SQL is the process of prompting large language model (LLM) to revise its previously incorrectly generated SQL, and commonly relies on manually crafted self-correction guidelines by human experts that are not only labor-intensive to produce but also limited by the human ability in identifying all potential error patterns in LLM responses. We introduce MAGIC, a novel multi-agent method that automates the creation of the self-correction guideline. MAGIC uses three specialized agents: a manager, a correction, and a feedback agent. These agents collaborate on the failures of an LLM-based method on the training set to iteratively generate and refine a self-correction guideline tailored to LLM mistakes, mirroring human processes but without human involvement. Our extensive experiments show that MAGIC's guideline outperforms expert human's created ones. We empirically find out that the guideline produced by MAGIC enhances the interpretability of the corrections made, providing insights in analyzing the reason behind the failures and successes of LLMs in self-correction.

Abstract: The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLMbased Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, We successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks.

Abstract: Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an antimonitoring framework, where trust is predicted to be correlated with presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial Chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.

Abstract: Text generation with citations makes it easy to verify the factuality of Large Language Models’ (LLMs) generations. Existing onestep generation studies expose distinct shortages in answer refinement and in-context demonstration matching. In light of these challenges, we propose R2-MGA, a Retrieval and Reflection Memory-augmented Generative Agent. Specifically, it first retrieves the memory bank to obtain the best-matched memory snippet, then reflects the retrieved snippet as a reasoning rationale, next combines the snippet and the rationale as the best-matched in-context demonstration. Additionally, it is capable of in-depth answer refinement with two specifically designed modules. We evaluate R2-MGA across five LLMs on the ALCE benchmark. The results reveal R2-MGA’ exceptional capabilities in text generation with citations. In particular, compared to the selected baselines, it delivers up to +58.8% and +154.7% relative performance gains on answer correctness and citation quality, respectively. Extensive analyses strongly support the motivations of R2-MGA.

Abstract: Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previouslygenerated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that model’s are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.

Abstract: Generating Chainof-Thought (CoT) before deriving the answer can effectively improve the reasoning capabilities of large language models (LLMs) and significantly improve the accuracy of the generated answer. However, in most cases, the length of the generated CoT is much longer than the desired final answer, which results in additional decoding costs. Furthermore, existing research has discovered that shortening the reasoning steps in CoT, even while preserving the key information, diminishes LLMs' abilities. These phenomena make it difficult to use LLMs and CoT in many real-world applications that only require the final answer and are sensitive to latency, such as search and recommendation. To reduce the costs of model decoding and shorten the length of the generated CoT, this paper presents Conditioned Compressed Chain-of-Thought (C3oT), a CoT compression framework that involves a compressor to compress an original longer CoT into a shorter CoT while maintaining key information and interpretability, a conditioned training method to train LLMs with both longer CoT and shorter CoT simultaneously to learn the corresponding relationships between them, and a conditioned inference method to gain the reasoning ability learned from longer CoT by generating shorter CoT. We conduct experiments over four datasets from arithmetic and commonsense scenarios, showing that the proposed method is capable of compressing the length of generated CoT by up to more than 50% without compromising its effectiveness.

Abstract: Uncertainty estimation has been widely applied for trustworthy automatic speech recognition (ASR) systems across training and inference stages. In the training stage, previous studies show that uncertainty can facilitate selftraining by filtering out unlabeled data samples with high uncertainty. However, the current sequence-level uncertainty estimation method for connectionist temporal classification (CTC) based ASR models drops the output probability information and depends only on the textual distance of decoded predictions. In this study, we argue that this results in limited performance improvement and propose a novel output probability-based sequence-level uncertainty estimation method. We also categorize uncertainty as pseudo-label uncertainty and in-training uncertainty for the self-training process. Finally, we present uncertainty-aware self-training for CTC-based ASR models and experimentally show the effectiveness of the proposed method compared to the baselines.

CCSE, School of Computer Science and Engineering, Beihang University, Beijing, China, CCSE, School of Computer Science and Engineering, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, CCSE, School of Computer Science and Engineering, Beihang University, Beijing, China Shen Yuan Honors College, Beihang University, Beijing, China, CCSE, School of Computer Science and Engineering, Beihang University, Beijing, China School of Software, Beihang University, Beijing, China, CCSE, School of Computer Science and Engineering, Beihang University, Beijing, China

Abstract: Weakly supervised phrase grounding tasks aim to learn alignments between phrases and regions with coarse imagecaption match information. One branch of previous methods established pseudo-label relationships between phrases and regions based on the Expectation-Maximization (EM) algorithm combined with contrastive learning. However, adopting a simplified batch-level local update (partial) of pseudo-labels in E-step is sub-optimal, while extending it to global update requires inefficiently numerous computations. In addition, their failure to consider potential false negative examples in contrastive loss negatively impacts the effectiveness of M-step optimization. To address these issues, we propose a Momentum Pseudo Labeling (MPL) method, which efficiently uses a momentum model to synchronize global pseudo-label updates on the fly with model parameter updating. Additionally, we explore potential relationships between phrases and regions from non-matching image-caption pairs and convert these false negative examples to positive ones in contrastive learning. Our approach achieved SOTA performance on 3 commonly used grounding datasets for weakly supervised phrase grounding tasks.

Abstract: How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a reference model. Recent approaches, such as direct preference optimization (DPO), have eliminated the need for unstable and sluggish reinforcement learning optimization by introducing closeformed supervised losses. However, a significant limitation of the current approach is its design for a single reference model only, neglecting to leverage the collective power of numerous pretrained LLMs. To overcome this limitation, we introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models, substantially enhancing preference learning capabilities compared to the single-reference DPO. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to exhibit superior performance in several downstream natural language processing tasks such as HH-RLHF, GSM8K and TruthfulQA.

Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP), providing versatile capabilities across various applications. However, their application to complex, domainspecific tasks, such as cyber-security, often faces substantial challenges. In this study, we introduce SecKnowledge and CyberPal.AI to address these challenges and train security-expert LLMs. SecKnowledge is a domain-knowledge-driven cyber-security instruction dataset, meticulously designed using years of accumulated expert knowledge in the domain through a multi-phase generation process. CyberPal.AI refers to a family of LLMs fine-tuned using SecKnowledge, aimed at building security-specialized LLMs capable of answering and following complex security-related instructions. Additionally, we introduce SecKnowledge-Eval, a comprehensive and diverse cyber-security evaluation benchmark, composed of an extensive set of cyber-security tasks we specifically developed to assess LLMs in the field of cyber-security, along with other publicly available security benchmarks. Extensive evaluations demonstrate a significant average improvement of up to 24% over the baseline models, underscoring the benefits of our expert-driven instruction dataset generation process. These findings contribute to the advancement of AI-based cyber-security applications, paving the way for robust security-expert LLMs that can enhance threat-hunting and investigation processes.

Abstract: Large language models (LLMs) require model editing to efficiently update specific knowledge within them and avoid factual errors. Most model editing methods are solely designed for singletime use and result in a significant forgetting effect in lifelong editing scenarios, where sequential edits are conducted over time. Previous approaches manage sequential edits by freezing original parameters and discretely allocating new parameters for each knowledge update. However, these methods lack robustness to minor input variations due to the discrete mapping between data and parameters. To overcome this challenge, we propose ELDER, a novel approach to create a continuous association between data and adapters. ELDER integrates multiple LoRAs through a router network and is trained to establish a smooth data-adapter association, thereby enhancing the edit robustness and generalization of semantically equivalent inputs. To ensure inputs containing the same knowledge will be processed by the same LoRAs, we design a novel loss to guide the model link LoRA allocations with edit knowledge. Furthermore, we propose a deferral mechanism to retain the original LLM capabilities post-edit. Extensive experiments on GPT-2 XL and LLaMA2-7B demonstrate that ELDER effectively edits models in the lifelong setting, outperforming eight baselines while exhibiting strong scalability and preserving LLMs' general abilities on downstream tasks.

School of Electronic Science and Engineering, Nanjing University, University of Arizona, Samsung Electronic Research Centre of China, School of Electronic Science and Engineering, Nanjing University, School of Electronic Science and Engineering, Nanjing University, School of Electronic Science and Engineering, Nanjing University Interdisciplinary Research Center for Future Intelligent Chips, Nanjing University, Suzhou, School of Electronic Science and Engineering, Nanjing University Interdisciplinary Research Center for Future Intelligent Chips, Nanjing University, Suzhou

Abstract: Large language models (LLMs) excel in language tasks, especially with supervised finetuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extend. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves 1.33x speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost.

Abstract: Large language models (LLMs) based on transformer architecture have shown outstanding performance across numerous realworld tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, where as beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given drafts sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling. Experimental results show that our approach achieves a 1.5-1.9x speed-up and1.8-2.5x lower energy consumption compared to beam sampling, with no loss in downstream performance. Moreover, it can produce significantly higher-quality outputs than speculative decoding, while maintaining similar time, memory, and energy costs. In summary, our method offers a more efficient and effective inference process for LLMs.

Abstract: Language models have become foundational to many widely used systems. However, these seemingly advantageous models are doubleedged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.

Abstract: Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a crossdomain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and study the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open source the code, corpus, and models to foster further research in this task.

Abstract: Fewshot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in dynamic real-world domains. Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge, with the scarcity of data in few-shot scenarios further exacerbating these issues by hindering effective data augmentation in the latent space. In this paper, we propose a novel retrieval-based solution, starting with a large language model to generate descriptions for each relation. From these descriptions, we introduce a bi-encoder retrieval training paradigm to enrich both sample and class representation learning. Leveraging these enhanced representations, we design a retrieval-based prediction method where each sample "retrieves" the best fitting relation via a reciprocal rank fusion score that integrates both relation description vectors and class prototypes. Extensive experiments on multiple datasets demonstrate that our method significantly advances the state-of-the-art by maintaining robust performance across sequential tasks, effectively addressing catastrophic forgetting.

SKL of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences, SKL of Processors, Institute of Computing Technology, CAS, SKL of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences, Baidu Inc., Beijing, China, Autodesk Research, Baidu Inc., Beijing, China, Baidu Inc., Beijing, China, SKL of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences, SKL of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences, SKL of Processors, Institute of Computing Technology, CAS, SKL of Processors, Institute of Computing Technology, CAS, SKL of Processors, Institute of Computing Technology, CAS, SKL of Processors, Institute of Computing Technology, CAS, Baidu Inc., Beijing, China, SKL of Processors, Institute of Computing Technology, CAS, SKL of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences

Abstract: Recent advancements in opensource code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g. CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.

Abstract: The RetrievalAugmented Language Model (RALM) has demonstrated remarkable performance on knowledge-intensive tasks by integrating external knowledge during inference, which mitigates the factual hallucinations inherited in large language models (LLMs). Despite these advancements, challenges persist in the implementation of RALMs, particularly in terms of reliability and traceability. Specifically, the irrelevant document retrieval may result in unhelpful responses or even deteriorate the performance of LLMs, while the lack of appropriate citations in outputs complicates efforts to verify the trustworthiness of the models. To this end, we propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs, whose core idea is to leverage reasoning trajectories generated by the LLM itself. The framework involves constructing self-reasoning trajectories through three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. We evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate its superiority. Our method can outperform existing state-of-the-art models and achieve performance comparable with GPT-4, using only 2,000 training samples.

Abstract: We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of timesynchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities.

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, National University of Singapore, Singapore, Singapore, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China Laboratory for Advanced Computing and Intelligence Engineering, Wuxi, China, North China Institute of Computing Technology, beijing, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Abstract: With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of finegrained metaphorical visual clue and the neglect of multimodal text-image weak correlation. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely-used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.

Abstract: Digital watermarking has shown its effectiveness in protecting multimedia content. However, existing watermarking is predominantly tailored for specific media types, rendering them less effective for the protection of content displayed on computer screens, which is often multimodal and dynamic. Visual Screen Content (VSC), is particularly susceptible to theft and leakage through screenshots, a vulnerability that current watermarking methods fail to adequately address. To address these challenges, we propose ScreenMark, a robust and practical watermarking method designed specifically for arbitrary VSC protection. ScreenMark utilizes a three-stage progressive watermarking framework. Initially, inspired by diffusion principles, we initialize the mutual transformation between regular watermark information and irregular watermark patterns. Subsequently, these patterns are integrated with screen content using a pre-multiplication alpha blending technique, supported by a pre-trained screen decoder for accurate watermark retrieval. The progressively complex distorter enhances the robustness of the watermark in real-world screenshot scenarios. Finally, the model undergoes fine-tuning guided by a joint-level distorter to ensure optimal performance. To validate the effectiveness of ScreenMark, we compiled a dataset comprising 100,000 screenshots from various devices and resolutions. Extensive experiments on different datasets confirm the superior robustness, imperceptibility, and practical applicability of the method.

Abstract: Backdoor attacks significantly compromise the security of large language models by triggering them to output specific and controlled content. Currently, triggers for textual backdoor attacks fall into two categories: fixedtoken triggers and sentence-pattern triggers. However, the former are typically easy to identify and filter, while the latter, such as syntax and style, do not apply to all original samples and may lead to semantic shifts. In this paper, inspired by cross-lingual (CL) prompts of LLMs in real-world scenarios, we propose a higher-dimensional trigger method at the paragraph level, namely CL-Attack. CL-Attack injects the backdoor by using texts with specific structures that incorporate multiple languages, thereby offering greater stealthiness and universality compared to existing backdoor attack techniques. Extensive experiments on different tasks and model architectures demonstrate that CL-Attack can achieve nearly 100 percents attack success rate with a low poisoning rate in both classification and generation tasks. We also empirically show that CL-Attack is more robust against current major defense methods compared to baseline backdoor attacks. Additionally, in response to CL-Attack, we further develop a new defense called TranslateDefense, which can partially mitigate the impact of CL-Attack.

Abstract: The longrun average payoff per transition (mean payoff) is the main tool for specifying the performance and dependability properties of discrete systems. The problem of constructing a controller (strategy) simultaneously optimizing several mean payoffs has been deeply studied for stochastic and game-theoretic models. One common issue of the constructed controllers is the instability of the mean payoffs, measured by the deviations of the average rewards per transition computed in a finite "window" sliding along a run. Unfortunately, the problem of simultaneously optimizing the mean payoffs under local stability constraints is computationally hard, and the existing works do not provide a practically usable algorithm even for non-stochastic models such as two-player games. In this paper, we design and evaluate the first efficient and scalable solution to this problem applicable to Markov decision processes.

Abstract: There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multistep reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many] additional problems can be created automatically. Our extensive evaluation of 21 LLMs and OpenAI o1 reasoning models highlight the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multiturn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions.

Abstract: MonteCarlo Tree Search (MCTS) is a popular approach to online planning under uncertainty. While MCTS uses statistical sampling via multi-armed bandits to avoid exhaustive search in complex domains, common closed-loop approaches typically construct enormous search trees to consider a large number of potential observations and actions. On the other hand, open-loop approaches offer better memory efficiency by ignoring observations but are generally not competitive with closed-loop MCTS in terms of performance - even with commonly integrated human knowledge. In this paper, we propose Counterfactual Open-loop Reasoning with Ad hoc Learning (CORAL) for open-loop MCTS, using a causal multi-armed bandit approach with unobserved confounders (MABUC). CORAL consists of two online learning phases that are conducted during the open-loop search. In the first phase, observational values are learned based on preferred actions. In the second phase, counterfactual values are learned with MABUCs to make a decision via an intent policy obtained from the observational values. We evaluate CORAL in four POMDP benchmark scenarios and compare it with closed-loop and open-loop alternatives. In contrast to standard open-loop MCTS, CORAL achieves competitive performance compared with closed-loop algorithms while constructing significantly smaller search trees.

Abstract: Generative Flow Networks (GFlowNets) is a new family of probabilistic samplers for generating objects under an unnormalized reward distribution. It has emerged as a promising framework for learning stochastic policies that generate highquality and diverse discrete objects proportional to their rewards, surpassing traditional reward-maximizing reinforcement learning methods. However, existing GFlowNets often suffer with data efficiency due to the direct parameterization of edge flows or dependence on backward policies that are challenging to specify or optimize, especially in high-dimensional action spaces. While the recent development of GFlowNets has primarily focused on developing alternative loss functions, we introduce a novel approach by exploring enhanced flow representations from an architectural perspective. In this paper, we propose to factorize the conventional edge flows into separate state flow and edge-based allocation streams. By introducing an effective method to synergistically combine these two streams to estimate the flows, we develop Bifurcated Generative Flow Networks (BN), a practical implementation to improve learning efficiency. We conduct extensive experiments on various standard benchmarks, and results show that BN significantly improves learning efficiency and effectiveness compared to state-of-the-art baselines.

Abstract: Machine learning (ML) has made significant advancements across various domains, with a shifting focus from purely predictive tasks to decisionmaking. The recent proposal by Zhou (2022) introduced a line of research known as rehearsal learning, which provides a novel perspective on modeling decision-making tasks. However, previous studies mainly focused on the linear Gaussian setting to constrain the modeling complexity. Furthermore, it has been demonstrated that finding exact optimal multivariate decisions within the sampling-based rehearsal framework is computationally infeasible in polynomial time, necessitating the development of approximate methods. In this work, we present Grad-Rh, the first gradient-based rehearsal learning method that can efficiently find multivariate decisions under non-linear and non-Gaussian settings. We address the uncertainty in decision-making tasks using flexible and expressive conditional normalizing flow models and derive four surrogate loss functions to enable efficient gradient-based optimization. Experimental results show that Grad-Rh performs comparably to exact baselines on linear data and significantly outperforms them on non-linear data in both decision quality and running time.

Abstract: In boundedsuboptimal heuristic search, the aim is to find a solution path within a given bound as quickly as possible, which is crucial when computational resources are limited. Recent research has demonstrated Weighted A* variants such as XDP that find bounded suboptimal solutions without needing to perform state re-expansions; they work by shifting where the suboptimality in the search is allowed. However, the suboptimality distribution is fixed before the search begins. This paper introduces Dynamic Suboptimality Weighted A* (DSWA*), a search framework that allows suboptimality to be dynamically distributed at runtime, based on the properties of the search. Experiments show that dynamic policies can consistently outperform existing algorithms across a diverse set of domains, particularly those with dynamic costs.

Abstract: LLMs produce harmful and undesirable behavior when trained on datasets containing even a small fraction of poisoned data. We demonstrate that GPT models remain vulnerable to finetuning on poisoned data, even when safeguarded by moderation systems. Given the persistence of data poisoning vulnerabilities in today's most capable models, this paper investigates whether these risks increase with model scaling. We evaluate three threat models—malicious fine-tuning, imperfect data curation, and intentional data contamination—across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

Abstract: Despite the extent of recent advances in Machine Learning (ML) and Neural Networks, providing formal guarantees on the behavior of these systems is still an open problem, and a crucial requirement for their adoption in regulated or safetycritical scenarios. We consider the task of training differentiable ML models guaranteed to satisfy designer-chosen properties, stated as input-output implications. This is very challenging, due to the computational complexity of rigorously verifying and enforcing compliance in deep neural models. We provide an innovative approach based on: 1) a general, simple architecture enabling efficient verification with a conservative semantic; 2) a rigorous training algorithm based on the Projected Gradient Method; 3) a formulation of the problem of searching for strong counterexamples. The proposed framework, being only marginally affected by model complexity, scales well to practical applications, and produces models that provide full property satisfaction guarantees. We evaluate our approach on properties defined by linear inequalities in regression, and on mutually exclusive classes in multi-label classification. Our approach is competitive with a baseline that includes property enforcement in preprocessing (on training data) and postprocessing (on model predictions). Finally, our contributions establish a framework that opens up multiple research directions and potential improvements.

Abstract: Multiobjective preference alignment of large language models (LLMs) is critical for developing AI systems that are more configurable, personalizable, helpful, and safe. However, optimizing model outputs to satisfy diverse objectives with variable weights at inference time for truly personalized models presents a significant challenge. Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors. This paper introduces the Multi-Objective Online DPO (MO-ODPO) algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences. Our approach incorporates a prompt conditioning mechanism, allowing us to train a single preference-conditional policy, that can adapt to new preference combinations at inference. Experiments on two popular benchmarks show that MO-ODPO Pareto-dominates existing baselines while providing excellent inference-time steerability between diverse objectives.

Abstract: We improve the efficacy of boundpropagation-based neural network verification by reducing the computational effort required by state-of-the-art propagation methods without incurring any loss in precision. We propose a method that infers the stability of ReLU nodes at every step of the back-substitution process, thereby dynamically simplifying the coefficient matrix of the symbolic bounding equations. We develop a heuristic for the effective application of the method and discuss its evaluation on common benchmarks where we show significant improvements in bound propagation times.

Abstract: Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equalsize transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.

Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to finetune the base models. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate RM effectiveness, focusing on feature imprint, feature resistance, and feature robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them -- feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to slightly perturbed texts. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% resistance incidence in portions of the dataset where LM labelers disagreed with human preferences. We also find that misalignment stems from confusing entries in the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.

Abstract: Value alignment, at the intersection of moral philosophy and AI safety, is dedicated to ensuring that artificially intelligent (AI) systems align with a certain set of values. One challenge facing value alignment researchers is accurately translating these values into a machine readable format. In the case of reinforcement learning (RL), a popular method within value alignment, this requires designing a reward function which accurately defines the value of all stateaction pairs. It is common for programmers to hand-set and manually tune these values. In this paper, we examine the challenges of hand-programming values into reward functions for value alignment, and propose mathematical models as an alternative grounding for reward function design in ethical scenarios. Experimental results demonstrate that our modelled-ethics approach offers a more consistent alternative and outperforms our hand-programmed reward functions.

Abstract: Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing blackbox LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs.

Abstract: In many applications of AI for Social Impact (e.g., when allocating spots in support programs for underserved communities), resources are scarce and an allocation policy is needed to decide who receives a resource. Before being deployed at scale, a rigorous evaluation of an AIpowered allocation policy is vital. In this paper, we introduce the methods necessary to evaluate index-based allocation policies, which allocate a limited number of resources to those who need them the most. Such policies create dependencies between agents, rendering standard statistical tests invalid and ineffective. Addressing the arising practical and technical challenges, we describe an efficient estimator and methods for drawing valid statistical conclusions. Our extensive experiments validate our methodology in practical settings while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial conducted with 10000 beneficiaries for a mHealth program for pregnant women. Our new methodology allows us to draw previously invisible conclusions when comparing two different ML allocation policies.

Abstract: A key strategy in societal adaptation to climate change is using alert systems to prompt preventative action and reduce the adverse health impacts of extreme heat events. This paper implements and evaluates reinforcement learning (RL) as a tool to optimize the effectiveness of such systems. Our contributions are threefold. First, we introduce a new publicly available RL environment enabling the evaluation of the effectiveness of heat alert policies to reduce heatrelated hospitalizations. The rewards model is trained from a comprehensive dataset of historical weather, Medicare health records, and socioeconomic/geographic features. We use scalable Bayesian techniques tailored to the low-signal effects and spatial heterogeneity present in the data. The transition model uses real historical weather patterns enriched by a data augmentation mechanism based on climate region similarity. Second, we use this environment to evaluate standard RL algorithms in the context of heat alert issuance. Our analysis shows that policy constraints are needed to improve RL's initially poor performance. Third, a post-hoc contrastive analysis provides insight into scenarios where our modified heat alert-RL policies yield significant gains/losses over the current National Weather Service alert policy in the United States.

Abstract: Most US school districts draw geographic "attendance zones" to assign children to schools based on their home address, a process that can replicate existing neighborhood racial/ethnic and socioeconomic status (SES) segregation in schools. Redrawing boundaries can reduce segregation, but estimating expected rezoning impacts is often challenging because families can optout of their assigned schools. This paper seeks to alleviate this societal problem by developing a joint redistricting and choice modeling framework, called redistricting with choices (RWC). The RWC framework is applied to a large US public school district to estimate how redrawing elementary school boundaries might realistically impact levels of socioeconomic segregation. The main methodological contribution of RWC is a contextual stochastic optimization model that aims to minimize district-wide segregation by integrating rezoning constraints with a machine learning-based school choice model. The study finds that RWC yields boundary changes that might reduce segregation by a substantial amount (23%) -- but doing so might require the re-assignment of a large number of students, likely to mitigate re-segregation that choice patterns could exacerbate. The results also reveal that predicting school choice is a challenging machine learning problem. Overall, this study offers a novel practical framework that both academics and policymakers might use to foster more diverse and integrated schools.

Abstract: Antibiotic Resistance (AR) is a critical global health challenge that necessitates the development of costeffective, efficient, and accurate diagnostic tools. Given the genetic basis of AR, techniques such as Polymerase Chain Reaction (PCR) that target specific resistance genes offer a promising approach for predictive diagnostics using a limited set of key genes. This study introduces GenoARM, a novel framework that integrates reinforcement learning (RL) with transformer-based models to optimize the selection of PCR gene tests and improve AR predictions, leveraging observed metadata for improved accuracy. In our evaluation, we developed several high-performing baselines and compared them using publicly available datasets derived from real-world bacterial samples representing multiple clinically relevant pathogens. The results show that all evaluated methods achieve strong and reliable performance when metadata is not utilized. When metadata is introduced and the number of selected genes increases, GenoARM demonstrates superior performance due to its capacity to approximate rewards for unseen and sparse combinations. Overall, our framework represents a major advancement in optimizing diagnostic tools for AR in clinical settings.

Abstract: Oracle Bone Inscriptions (OBIs), as the earliest systematically organized pictographic script in China, hold significant importance in the study of the origins of Chinese civilization. Of the approximately 4,500 excavated OBI characters, only about onethird have been deciphered, leaving the remaining characters shrouded in mystery. Over the past decade, an increasing number of researchers have attempted to leverage artificial intelligence to assist in deciphering OBIs, but these efforts have not yet fully met the demands of this challenging objective. In this paper, we identify a key task—Component-Level OBI Segmentation—based on a successful deciphering case from 2018. This task aims to help experts quickly identify specific components within OBIs, thereby accelerating the deciphering process. Accordingly, we propose a new model to accomplish this task. Our model leverages a small amount of annotated data and a large amount of weakly annotated data and incorporates expert-provided prior knowledge, i.e., stroke rules, to automatically segment OBI components. Additionally, we train a series of auxiliary classifiers to evaluate the segmentation results during the test stage. We also invite experts to conduct a professional assessment of the results, which we cross-validated against our proposed evaluation metrics. Experimental results demonstrate that our method can accurately and clearly present the segmented components to experts.

Abstract: In many humanrobot collaboration and multi-agent tasks, it is vital to model the partners and estimate their objectives to efficiently collaborate/interact with them. While learning from demonstrations is the most common approach for this, it is very data-hungry, which we cannot afford in many settings including robotics, and demonstrations are unreliable in a surprisingly large number of domains, including those we think humans perform reasonably well, e.g., driving. In this talk, I will start with introducing comparison-based feedback and explain why it does not suffer from most of the problems that demonstrations have, but is still data-hungry. To address this problem, I will propose comparative language based feedback and active learning techniques, which will result in (1) a new type of human feedback, and (2) an active querying algorithm that optimizes the information the AI agent will elicit from the human. I will conclude the talk by discussing what other types of human feedback exist, e.g., interventions or hand gestures, and how we can incorporate them into the existing learning algorithms.

Abstract: To build a responsible data economy and protect data ownerhip, it is crucial to enable learning models from separate, heterogeneous data sources without centralization. For example, federated learning (FL) aims to train models across massive remote devices or isolated organizations, while keeping user data local. However, federated learning can face critical practical issues such as scalability, noisy samples, biased learning systems or procedures, and privacy leakage. At the intersection between optimization, trustworthy (fair, robust, and private) ML, and learning in heterogeneous environments, my research aims to support scalable and responsible data sharing to collectively build intelligent models.

Abstract: In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived from maximum likelihood estimation of a random utility model based on pairwise comparisons made by humans. The problem of learning a reward function is one of preference aggregation that, we argue, largely falls within the scope of social choice theory. From this perspective, we can evaluate different aggregation methods via established axioms, examining whether these methods meet or fail wellknown standards. We demonstrate that both the Bradley-Terry-Luce Model and its broad generalizations fail to meet basic axioms. In response, we develop novel rules for learning reward functions with strong axiomatic guarantees. A key innovation from the standpoint of social choice is that our problem has a linear structure, which greatly restricts the space of feasible rules and leads to a new paradigm that we call linear social choice.

Abstract: Decision making is at the core of healthcare: clinicians constantly make complex decisions that span diagnosis, treatment, care coordination, and resource allocation. Yet, human decisions are never perfect, leading to suboptimal patient care. My research uses AI to augment and improve decisionmaking in healthcare, following a synergistic approach that combines novel AI methods with practical, real-world implementation. Here, I will explore two key themes: Application-Inspired AI Innovations, focused on novel AI methods grounded in practical healthcare problems; and Path to Deployment and Impact, which addresses AI integration into clinical workflows for real-world improvements.

Abstract: This talk explores the challenge of customizing largescale AI models, particularly generative AI, on cost-effective devices with limited memory and energy resources. Modern AI models demand substantial computational power, often relying on specialized hardware such as GPUs. To address this, the talk introduces compression-aware computing, a framework enabling AI models to recognize and adapt to their compressed states while preserving performance. Compression-aware computing integrates compression techniques like sparsification, quantization, and low-rank decomposition to enhance the efficiency and accuracy of AI models, broadening these models' accessibility across diverse devices. Additionally, this talk highlights one rationale of scalable and sustainable AI in advancing Alzheimer’s research by facilitating the analysis of large single-cell transcriptomics datasets for gene-gene interaction discovery.

Abstract: Predicting the next activity in an ongoing process is one of the most common tasks in the business process management (BPM) domain. It allows businesses to optimize resource allocation, enhance operational efficiency, and aid both in risk mitigation and strategic decisionmaking. Existing state-of-the-art AI models for BPM do not fully capitalize on available semantic information within process event logs. As current advanced AI-BPM systems provide semantically richer textual data, the need for new adequate models grows. To address this gap, we develop SNAP—a novel system that utilizes LLMs by constructing narratives and semantic contextual stories for historical event logs, which are then used to generate precise and actionable predictions for the ongoing process. SNAP was evaluated on six benchmark datasets, where it demonstrated significant performance improvements over eleven SOTA models, particularly on datasets with high levels of semantic content. This work showcases the potential of integrating LLMs in BPM and outlines a clear path toward future deployment, emphasizing the relevance and innovation of our approach within the broader AI application landscape.

Abstract: In this tool paper, we design, develop, and release BoolXAI, an interpretable machine learning classification approach for Explainable AI (XAI) based on expressive Boolean formulas. The Boolean formula defines a logical rule with tunable complexity according to which input data are classified. Beyond the classical conjunction and disjunction, BoolXAI offers expressive operators such as AtLeast, AtMost, and Choose and their parameterization. This provides higher expressiveness compared to rigid rulesand tree-based approaches. We show how to train BoolXAI classifiers effectively using native local optimization to search the space of feasible formulas. We provide illustrative results on several well-known public benchmarks that demonstrate the competitive nature of our approach compared to existing methods. Our work is embodied in the open-source BoolXAI library with a high-level user interface to serve researchers and practitioners. BoolXAI can be used either as a standalone interpretable classifier or for post-hoc explanations of other black-box models or observed behavior. We highlight several desirable benefits of our tool, especially in industrial settings where rapid experimentation, reusability, reproducibility, deployment, and maintenance are of great interest. Finally, we showcase a deployed service powered by BoolXAI as an enterprise application.

Abstract: Responsible AI (RAI) is the science and practice of ensuring the design, development, use, and oversight of AI are socially sustainablebenefiting diverse stakeholders while controlling the risks. Achieving this goal requires active engagement and participation from the broader public. This paper introduces "We are AI: Taking Control of Technology," a public education course that brings the topics of AI and RAI to the general audience in a peer-learning setting. We outline the goals behind the course's development, discuss the multi-year iterative process that shaped its creation, and summarize its content. We also discuss two offerings of "We are AI" to an active and engaged group of librarians and professional staff at New York University, highlighting successes and areas for improvement. The course materials, including a multilingual comic book series by the same name, are publicly available and can be used independently. By sharing our experience in creating and teaching "We are AI", we aim to introduce these resources to the community of AI educators, researchers, and practitioners, supporting their public education efforts.

Abstract: We build the first machinelearning-based algorithm selection tool for hardware verification described in the Btor2 format. In addition to hardware verifiers, our tool also selects from a set of software verifiers to solve a given Btor2 instance, enabled by a Btor2-to-C translator. We propose two embeddings for a Btor2 instance, Bag of Keywords and Bit-Width Aggregation. Pairwise classifiers are applied for algorithm selection. Upon evaluation, our tool Btor2-Select solves 30.0% more instances and reduces PAR-2 by 50.2%, compared to the PDR implementation in the HWMCC'20 winner model checker AVR. Measured by the Shapley values, the software verifiers collectively contributed 27.2% to Btor2-Select's performance.

Abstract: Recently, researchers have focused on methods that not only distill knowledge from a Graph Neural Network (GNN) into a MultiLayer Perceptron (MLP) but also leverage multiple teacher GNNs. However, existing methods assign a single attention weight to each teacher GNN. We propose a NodeAware Attention Mechanism (NAAM) that flexibly adjusts the attention weight for each node to leverage multiple GNNs fully. Experimental results show that NAAM outperforms existing GNN-to-MLP methods. our source code is available at: https://github.com/NakayamaItsuki/NAAM.

Abstract: In this paper, we propose a novel framework that leverages Large Language Models (LLMs) for predicting missing values in timevarying graph signals by exploiting spatial and temporal smoothness. We leverage the power of LLM to achieve a message-passing scheme. For each missing node, its neighbors and previous estimates are fed into and processed by LLM to infer the missing observations. Tested on the task of the online prediction of wind-speed graph signals, our model outperforms online graph filtering algorithms in terms of accuracy, demonstrating the potential of LLMs in effectively addressing partially observed signals in graphs.

Abstract: To diminish the substantial communication costs incurred by federated learning during the training of the global model and enhance the model update efficiency across both clients and server domains, we have integrated knowledge distillation into the federated learning framework. This integration has led to the development of a novel approach termed ClientsToServerKDFL, which streamlines the distillation process by directly transferring model insights from clients to the server for computational learning without the need for extensive computations across numerous clients. This iterative process ensures model accuracy and curtails communication expenses. Experimental data analysis has validated the efficacy of this algorithm.

Abstract: Biological neural systems often represent information on lowdimensional manifolds that reflect the topology of their encoded variables. This suggests that neural activity can be naturally organized in geometrically meaningful ways, as seen in rodent head direction cells forming circular manifolds. This proposal examines whether artificial neural networks (ANNs) trained on tasks with well-defined topologies—such as planar or spherical coordinates from autonomous driving datasets like Apolloscape, cyclic temporal variables, or graph-structured road networks—develop similar low-dimensional representations aligned with the variables' inherent topology. We consider convolutional and vision transformer models for image data, graph neural networks for road network graphs, and 3D or point-based models for LIDAR point clouds, analyzing their internal activations with dimensionality reduction and topological data analysis. If successful, this approach not only elucidates the nature of internal representations in ANNs but also offers insights into the computational principles that bridge artificial systems and biological cognition.

Abstract: Falls among older adults pose a significant public health challenge, impacting quality of life and healthcare costs. This research proposal aims to develop an innovative AIdriven personalized fall prevention system for older adults, leveraging advanced machine learning techniques in computer vision, natural language processing, and reinforcement learning. The proposed system will encompass five key components: (1) Advanced pose estimation and activity recognition using HRNet with attention mechanisms and hybrid LSTM-GCN models; (2) Personalized risk assessment through multi-modal deep learning, combining CNNs, RNNs, and federated learning for privacy-preserving distributed training; (3) Adaptive intervention strategies employing Deep Q-Networks and model-based reinforcement learning with GAN-simulated environments; (4) Human-AI interaction utilizing SHAP values for explainable AI and fine-tuned GPT-3 for natural language communication; and (5) Privacy-preserving techniques including differential privacy and homomorphic encryption. The research will be conducted over a five-year period, involving data collection, model development, large-scale testing, and clinical trials. Expected outcomes include a scalable, privacy-preserving AI system capable of significantly reducing fall incidents among older adults, thereby improving quality of life and reducing healthcare costs. This interdisciplinary research contributes to advancing AI techniques in real-world healthcare applications while addressing critical ethical and privacy concerns, potentially transforming elderly care on a global scale.

Abstract: Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present TRANSFORMER EXPLAINER, an interactive visualization tool designed for nonexperts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and smooth transitions across abstraction levels of math operations and model structures. It runs a live GPT-2 model locally in the user’s browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. 125,000 users have used our open-source tool at https://poloclub.github.io/ transformer-explainer/.

Abstract: Fake audios, videos, and images are now proliferating widely. We developed GODDS, the Global Online Deepfake Detection system, for a specific user community, namely journalists. GODDS leverages an ensemble of deepfake detectors, along with a human in the loop, to provide a deepfake report on each submitted video/image/audio or VIA artifact submitted to the system. To date, VIA artifacts submitted by over 50 journalists from outlets such as the New York Times, Wall Street Journal, CNN, Agence France Press, and others have been run through GODDS. Unlike other deepfake detection systems, GODDS doesn't just focus on the submitted artifact but automatically derives context about the subject of the VIA artifact. Because context is not always available on all subjects, GODDS focuses on alleged deepfakes of high profile individuals, organizations, and events, where there is likely to be considerable contextual information.

Abstract: We propose a 2D simulation system for multiagent collective construction (MACC) based on simple line-following intelligent machines (SLIM) - small differential drive mobile robots. Our MACC-SLIM system alleviates the high upfront cost of implementing MACC on real hardware. Our system builds upon widely available resources, namely a standard LCD screen and commodity mobile robots, allowing researchers and schools easier access to MACC hardware implementation. We test the system on plans generated by an optimal state-of-the-art MACC algorithm, demonstrating there are still non-insignificant synchronization delays. The MACC-SLIM system allows us to observe bottlenecks, parallelism, and possible execution failures of plans generated by the MACC algorithms.

Abstract: We present RLLTE: a longterm evolution, extremely modular, and open-source framework for reinforcement learning (RL) research and application. Beyond delivering top-notch algorithm implementations, RLLTE also serves as a toolkit for developing algorithms. More specifically, RLLTE decouples the RL algorithms completely from the exploitation-exploration perspective, providing a large number of components to accelerate algorithm development and evolution. In particular, RLLTE is the first RL framework to build a comprehensive ecosystem, which includes model training, evaluation, deployment, benchmark hub, and large language model (LLM)-empowered copilot. RLLTE is expected to set standards for RL engineering practice and be highly stimulative for industry and academia. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/rllte.

Abstract: Recent development in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved superior performance and generalization capabilities, covered extensive areas of traditional tasks. However, existing large model training frameworks support only a limited number of models and techniques, particularly lacking in support for new models, which makes fine-tuning LLMs challenging for most developers. Therefore, we develop SWIFT, a customizable one-stop infrastructure for large models. With support of over 350+ LLMs and 80+ MLLMs, SWIFT stands as the open-source framework that provide the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. Moreover, SWIFT integrates post-training processes such as inference, evaluation, and quantization, to facilitate fast adoptions of large models in various application scenarios, offering helpful utilities like benchmark comparisons among different training techniques.

Abstract: This paper explores the impact of relational state abstraction on sample efficiency and performance in collaborative MultiAgent Reinforcement Learning. The proposed abstraction is based on spatial relationships in environments where direct communication between agents is not allowed, leveraging the ubiquity of spatial reasoning in real-world multi-agent scenarios. We introduce MARC (Multi-Agent Relational Critic), a simple yet effective critic architecture incorporating spatial relational inductive biases by transforming the state into a spatial graph and processing it through a relational graph neural network. The performance of MARC is evaluated across four collaborative tasks, including a novel environment with heterogeneous agents. We conduct a comprehensive empirical analysis, comparing MARC against state-of-the-art MARL baselines, demonstrating improvements in both sample efficiency and asymptotic performance, as well as its potential for generalization. Our findings suggest that a minimal integration of spatial relational inductive biases as abstraction can yield substantial benefits without requiring complex designs or task-specific engineering. This work provides insights into the potential of relational state abstraction to address sample efficiency, a key challenge in MARL, offering a promising direction for developing more efficient algorithms in spatially complex environments.

Abstract: Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DPSGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy (ε≤1) and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL's high performance under strong privacy guarantees in challenging private learning setups.

Abstract: Deep multiview clustering incorporating graph learning has presented tremendous potential. Most methods encounter costly square time consumption w.r.t. data size. Theoretically, anchor-based graph learning can alleviate this limitation, but related deep models mainly rely on manual discretization approaches to select anchors, which indicates that 1) the anchors are fixed during model training and 2) they may deviate from the true cluster distribution. Consequently, the unreliable anchors may corrupt clustering results. In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model that performs clustering in linear time. Concretely, the initial anchors are intervened by the positive-incentive noise sampled from Gaussian distribution, such that they can be optimized with a newly designed anchor learning loss, which promotes a clear relationship between samples and anchors. Afterwards, anchor graph convolution is devised to model the cluster structure formed by the anchors, and the mutual information maximization loss is built to provide cross-view clustering guidance. In this way, the learned anchors can better represent clusters. With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering. Extensive experiments on several datasets demonstrate the superior performance and efficiency of DMAC compared to state-of-the-art competitors.

Abstract: Offline reinforcement learning confronts the distributional shift challenge, a consequence of learning policy from static datasets. Current methods primarily handle this issue by aligning the learned policy with the behavior policy or conservatively estimating Qvalues for out-of-distribution (OOD) actions. However, these approaches can lead to overly pessimistic estimation of Q-values of the OOD actions in unfamiliar situations, resulting in a suboptimal policy. To address this, we propose a new method, Dynamic Uncertainty estimation for Offline Reinforcement Learning. This method introduces a base density-truncated OOD data sampling approach to reduce the impact of extrapolation errors on uncertainty estimation. It enables conservative estimation of Q-values for OOD actions while avoiding negative impacts on in-distribution data. We also develop a dynamic uncertainty estimation mechanism to prevent excessive pessimism and enhance the generalization of the Q-function. This mechanism dynamically adjusts the degree of pessimism in the Q-function by minimizing the error between target and estimated values. Our method outperforms existing algorithms, as demonstrated by experimental results based on the D4RL benchmark, and proves its superiority in addressing the distributional shift challenge.

Abstract: In noisy partial label learning, each training sample is associated with a set of candidate labels, and the groundtruth label may be contained within this set. With the emergence of powerful pre-trained vision-language models, e.g. CLIP, it is natural to consider using these models to automatically label training samples instead of relying on laborious manual annotation. In this paper, we investigate the pipeline of learning with CLIP annotated noisy partial labels and propose a novel collaborative consistency regularization method, in which we simultaneously train two neural networks, which collaboratively purify training labels for each other, called Co-Pseudo-Labeling, and perform consistency regularization between label and representation levels. For instance-dependent noise that embodies the underlying patterns of the pre-trained model, our method employs multiple mechanisms to avoid overfitting to noisy annotations, effectively mines information from potentially noisy sample set while iteratively optimizing both representations and pseudo-labels during the training process. Comparison experiments with various kinds of annotations and weakly supervised methods, as well as other pre-trained model application methods demonstrates the effectiveness of method and the feasibility of incorporating weakly supervised learning into the distillation of pre-trained models.

Abstract: While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without finetuning the LLM backbone. To achieve this goal, we first construct a molecule-text paired dataset from multi-sources, including CHEBI and HMDB, to train a graph structure encoder, which reduces the distance between graphs and texts representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers, which are incorporated into the LLM prompts. By aligning molecular graphs as special tokens, we can activate LLMs' generalization ability to molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of our proposed Graph2Token.

Abstract: Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in textto-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.

Abstract: Partial multiview classification (PMvC) poses a significant challenge due to the incomplete nature of multi-view data, which complicates effective information fusion and accurate classification. Existing PMvC methods typically rely on heuristic evaluations of view informativeness to achieve global alignment for downstream classification tasks. However, these approaches suffer from two critical issues: information redundancy and semantic misalignment. The complexity of missing data not only leads to over-reliance on redundant or less informative views but also exacerbates semantic misalignment across views, making it difficult for existing methods to effectively capture and discriminate the class-related features. To address these issues, this work proposes a novel GLobal-semantic Alignment Distillation (GLAD) model for partial multi-view classification without requiring imputation. Our approach incorporates a self-distillation mechanism that enables the model to extract informative features and achieve global semantic alignment across views. The key insight of GLAD is leveraging labels as semantic anchors to guide the alignment of partial multi-view features. By integrating labels with extracted features via a cross-attention mechanism, we generate ideal embeddings that consistently capture global semantics across views. These embeddings then serve as intermediate supervision for distilling the student model, ensuring robust semantic alignment even with missing views. We further introduce a margin-aware weighting strategy to enhance the model's discriminative ability. Extensive experimental results validate the effectiveness and superiority of the proposed method, showcasing significant improvements in classification performance over existing techniques.

Abstract: This paper proposes a sensitivity analysis framework based on setvalued mapping for deep neural networks (DNN) to understand and compute how the solutions (model weights) of DNN respond to perturbations in the training data. As a DNN may not exhibit a unique solution (minima) and the algorithm of solving a DNN may lead to different solutions with minor perturbations to input data, we focus on the sensitivity of the solution set of DNN, instead of studying a single solution. In particular, we are interested in the expansion and contraction of the solution set in response to data perturbations. If the change of solution set can be bounded by the extent of the data perturbation, the model is said to exhibit the Lipschitz-like property. This 'set-to-set' analysis approach provides a deeper understanding of the robustness and reliability of DNNs during training. Our framework incorporates both isolated and non-isolated minima, and critically, does not require the assumption that the Hessian of loss function is non-singular. By developing set-level metrics such as distance between sets, convergence of sets, derivatives of set-valued mapping, and stability across the solution set, we prove that the solution set of the Fully Connected Neural Network holds Lipschitz-like properties. For general neural networks (e.g. Resnet), we introduce a graphical-derivative-based method to estimate the new solution set following data perturbation without retraining.

Abstract: Best arm identification (BAI) is a key problem in stochastic multiarmed bandits, where K arms each has an associated reward distribution, and the objective is to minimize the number of queries needed to identify the best arm with high confidence. In this paper, we explore BAI using quantum oracles. For the case where each query probes only one arm (m=1), we devise a quantum algorithm with a query complexity upper bound of O((K/Delta)log(1/delta)), where delta is the confidence parameter and Delta is the reward gap between best and second best arms. This improves on the classical bound by a factor of 1/Delta. For the general case where a single query can probe m arms (1<= m<= K) simultaneously, we propose an algorithm with an upper bound of O((K/(Delta x sqrt(m))) log(1/delta)), improving by a factor of sqrt(m) compared to the m=1 case. We also provide query complexity lower bounds for both scenarios, which match the upper bounds up to logarithmic factors, and validate our theoretical results with Qiskit-based simulations.

Abstract: We investigate constrained online convex optimization, in which decisions must belong to a fixed and typically complicated domain, and are required to approximately satisfy additional timevarying constraints over the long term. In this setting, the commonly used projection operations are often computationally expensive or even intractable. To avoid the time-consuming operation, several projection-free methods have been proposed with an O(T^¾ (log T)^½) regret bound and an O(T^⅞) cumulative constraint violation (CCV) bound for general convex losses. In this paper, we improve this result and further establish novel regret and CCV bounds when loss functions are strongly convex. The primary idea is to first construct a composite surrogate loss, involving the original loss and constraint functions, by utilizing the Lyapunov-based technique. Then, we propose a parameter-free variant of the classical projection-free method, namely online Frank-Wolfe (OFW), and run this new extension over the online-generated surrogate loss. Theoretically, for general convex losses, we achieve an O(T^¾) regret bound and an O(T^¾ log T) CCV bound, both of which are order-wise tighter than existing results. For strongly convex losses, we establish new guarantees of an O(T^⅔) regret bound and an O(T^⅚) CCV bound. Moreover, we also extend our methods to a more challenging setting with bandit feedback, obtaining similar theoretical findings. Empirically, experiments on real-world datasets have demonstrated the effectiveness of our methods.

Abstract: Crossmodal matching shows enormous potential to recognize objects across different sensory modalities, which is fundamental to numerous visual-language tasks like image-text retrieval and visual captioning. Existing works generally rely on massive and well-aligned data pairs for model training. Unfortunately, multimodal datasets are extremely difficult to annotate and collect. As an alternative, the co-occurred data pairs collected from the internet have been widely exploited to train a cross-modal matching model. However, the cheaply-collected dataset unavoidably contains mismatched pairs (i.e., noisy correspondence), which are detrimental to the matching model. In this paper, we propose an alternative method termed noisy correspondence rectification via Asymmetric Similarity Learning (ASL), and it allows for dealing with insufficient learning of positive and negative pairs caused by the popular triplet-based symmetric learning fashion. Specifically, the learning of positive or negative pairs within a triplet is conducted in an asymmetric fashion, and the self-paced weighting boundary is imposed on positive pairs to mitigate the effect of noise. Meanwhile, the optimization of negative samples will not be affected in the process of punishing potentially-noisy positive samples. To verify the effectiveness of our proposed approach, a series of experiments are conducted on three widely-used benchmarks (i.e., Flick30k, MS-COCO and CC152k), and the results show superior performance compared to the state-of-the-art methods.

Abstract: As a foundational clustering paradigm, Density Peak Clustering (DPC) partitions samples into clusters based on their density peaks, garnering widespread attention. However, traditional DPC methods usually focus on highdensity regions, neglecting representative peaks in relatively low-density areas, particularly in datasets with varying densities and multiple peaks. Moreover, existing DPC variants struggle to identify clusters correctly in high-dimensional spaces due to the indistinct distance differences among samples and sparse data distributions. Additionally, existing methods typically adopt a one-step label assignment strategy, making them prone to cascading errors when initial misassignments occur. To address these challenges, we propose an Enhanced Density Peak Clustering (EDPC) method, which creatively incorporates multilayer perceptron (MLP)-based dimensionality reduction and a hierarchical label assignment strategy to significantly improve clustering performance in high-dimensional scenarios. Specifically, we introduce an effective selection condition that combines average densities and density-related distances to generate potential cluster centers, ensuring that peaks across different density regions are considered simultaneously. Furthermore, an MLP, guided by pseudo-labels from sub-clusters, is designed to learn low-dimensional embeddings for high-dimensional data, preserving data locality while enhancing clusterability. Extensive experiments demonstrate the effectiveness and superiority of EDPC against state-of-the-art DPC methods.

Abstract: Numerous wellannotated human key-point datasets are publicly available to date. However, annotating human poses for newly collected images is still a costly and time-consuming progress. Pose distributions from different datasets share similar pose hinge-structure priors with different geometric transformations, such as pivot orientation, joint rotation, and bone length ratio. The difference between Pose distributions is essentially the difference between the transformation distributions. Inspired by this fact, we propose a method to calibrate a pre-trained pose generator in which the pose prior has already been learned to an adapted one following a new pose distribution. We treat the representation of human pose joint coordinates as skeleton image and transfer a pre-trained pose annotation generator with only a few annotation guidance. By fine-tuning a limited number of linear layers that closely related to the pose transformation, the adapted generator is able to produce any number of pose annotations that are similar to the target poses. We evaluate our proposed method, FlexPose, on several cross-dataset settings both qualitatively and quantitatively, which demonstrates that our approach achieves state-of-the-art performance compared to the existing generative-model-based transfer learning methods when given limited annotation guidance.

Abstract: In multiview multi-label classification (MVML), each object is described by several heterogeneous views while annotated with multiple related labels. The key to learn from such complicate data lies in how to fuse cross-view features and explore multi-label correlations, while accordingly obtain correct assignments between each object and its corresponding labels. In this paper, we proposed an advanced MVML method named VAMS, which treats each object as a bag of views and reformulates the task of MVML as a “view-label” matching selection problem. Specifically, we first construct an object graph and a label graph respectively. In the object graph, nodes represent the multi-view representation of an object, and each view node is connected to its K-nearest neighbor within its own view. In the label graph, nodes represent the semantic representation of a label. Then, we connect each view node with all labels to generate the unified “view-label” matching graph. Afterwards, a graph network block is introduced to aggregate and update all nodes and edges on the matching graph, and further generating a structural representation that fuses multi-view heterogeneity and multi-label correlations for each view and label. Finally, we derive a prediction score for each view-label matching and select the optimal matching via optimizing a weighted cross-entropy loss. Extensive results on various datasets have verified that our proposed VAMS can achieve superior or comparable performance against state-of-the-art methods.

Abstract: Backdoor attacks have posed a serious threat in machine learning models, wherein adversaries can poison training samples with maliciously crafted triggers to compromise the victim model. Advanced backdoor attack methods have focused on selectively poisoning more vulnerable training samples, achieving a higher attack success rate (ASR). However, we found that when the manipulation strength of the trigger is constrained to a very small value for imperceptible attacks, they suffer from extremely uneven classwise ASR due to the unequal selection of instances per class. To solve this issue, we propose a novel backdoor attack method based on Influence-based Fair Selection (IFS), including two objectives: 1) selecting samples that significantly contribute to ASR and 2) ensuring class balance during the selection process. Specifically, we adapt Influence Functions, a classic technique in robust statistics, to evaluate the influence of trigger-embedded training samples on ASR. In this case, training samples contributing to reducing the backdoored test risk could possess higher influence scores. Further, a group-based pruning strategy is designed to avoid calculating the influence on ASR for all training samples, thereby significantly reducing the computational cost. Then, based on the influence score, we design an adaptive thresholding scheme to dynamically select samples with higher influence while maintaining class balance. Extensive experiments on four datasets verify the effectiveness of IFS compared with advanced methods.

Abstract: Current Transformer methods for Multivariate TimeSeries Forecasting (MTSF) are all based on the conventional attention mechanism. They involve sequence embedding and performing a linear projection for Q, K, and V, and then computing attention within this latent space. We have not yet delved into the attention mechanism to explore whether such a mapping space is optimal for MTSF. To investigate this issue, we first propose Frequency Spectrum attention (FSatten), a novel attention mechanism based on the frequency domain space. It employs the Fourier transform for embedding and introduces Multi-head Spectrum Scaling (MSS) to replace the conventional linear mapping for Q and K. FSatten can accurately capture the periodic dependencies between sequences and outperform the conventional attention, without necessitating changes to mainstream architectures. We further design a more general method dubbed Scaled Orthogonal attention (SOatten). We propose an orthogonal embedding and a Head-Coupling Convolution (HCC) based on the neighboring similarity bias to guide the model in learning comprehensive dependency patterns. Experiments show that FSatten and SOatten surpass the SOTA which uses conventional attention, making it a good alternative as a basic attention mechanism for MTSF.

Abstract: Multiview feature learning aims to learn discriminative features by integrating the distinct information in each view. However, most existing methods still face significant challenges in learning viewconsistency features, which are crucial for effective multiview learning. Motivated by the theories of CCA and contrastive learning in multiview feature learning, we propose the hierarchical consensus network (HCN) in this paper. The HCN derives three consensus indices for capturing the hierarchical consensus across views, which are classifying consensus, coding consensus, and global consensus, respectively. Specifically, classifying consensus reinforces class-level correspondence between views from a CCA perspective, while coding consensus closely resembles contrastive learning and reflects contrastive comparison of individual instances. Global consensus aims to extract consensus information from two perspectives simultaneously. By enforcing the hierarchical consensus, the information within each view is better integrated to obtain more comprehensive and discriminative features. The extensive experimental results obtained on four multiview datasets demonstrate that the proposed method significantly outperforms several state-of-the-art methods.

Abstract: Outof-distribution (OOD) detection is crucial for ensuring the reliable deployment of deep models in real-world scenarios. Recently, from the perspective of over-parameterization, a series of methods leveraging weight sparsification techniques have shown promising performance. These methods typically focus on selecting important parameters for in-distribution (ID) data to reduce the negative impact of redundant parameters on OOD detection. However, we empirically find that these selected parameters may behave overconfidently toward OOD data and hurt OOD detection. To address this issue, we propose a simple yet effective post-hoc method called Instance-aware Test Pruning (ITP), which performs OOD detection by considering both coarse-grained and fine-grained levels of parameter pruning. Specifically, ITP first estimates the class-specific parameter contribution distribution by exploring the ID data. By using the contribution distribution, ITP conducts coarse-grained pruning to eliminate redundant parameters. More importantly, ITP further adopts a fine-grained test pruning process based on the right-tailed Z-score test, which can adaptively remove instance-level overconfident parameters. Finally, ITP derives OOD scores from the pruned model to achieve more reliable predictions. Extensive experiments on widely adopted benchmarks verify the effectiveness of ITP, demonstrating its competitive performance.

Abstract: Federated Learning (FL) suffers from severe performance degradation due to the data heterogeneity among clients. Existing works reveal that the fundamental reason is that data heterogeneity can cause client drift where the local model update deviates from the global one, and thus they usually tackle this problem from the perspective of calibrating the obtained local update. Despite effectiveness, existing methods substantially lack a deep understanding of how heterogeneous data samples contribute to the formation of client drift. In this paper, we bridge this gap by identifying that the drift can be viewed as a cumulative manifestation of biases present in all local samples and the bias between samples is different. Besides, the bias dynamically changes as the FL training progresses. Motivated by this, we propose FedBSS that first mitigates the heterogeneity issue in a samplelevel manner, orthogonal to existing methods. Specifically, the core idea of our method is to adopt a bias-aware sample selection scheme that dynamically selects the samples from small biases to large epoch by epoch to train progressively the local model in each round. In order to ensure the stability of training, we set the diversified knowledge acquisition stage as the warm-up stage to avoid the local optimality caused by knowledge deviation in the early stage of the model. Evaluation results show that FedBSS outperforms state-of-the-art baselines. In addition, we also achieved effective results on feature distribution skew and noise label dataset setting, which proves that FedBSS can not only reduce heterogeneity, but also has scalability and robustness.

Abstract: Multiview graph clustering methods have been widely concerned due to the ability of dealing with arbitrarily shaped datasets. However, many methods with higher time and space complexity make them challenging to deal with large-scale datasets. Besides, many fuzzy clustering methods needs additional regularization terms or hyper-parameters to obtain the membership matrix or avoid trivial solutions, which weakens the model generalization ability. Furthermore, inconsistent clustering labels can arise when there are significant discrepancies between views, making it challenging to effectively leverage the complementary information from different views. To this end, we propose Tensorized Label Learning based Fast Fuzzy Clustering (TLLFFC). Specifically, we design a novel balanced regularization term to reduce pressure of tuning regularization parameters for fuzzy clustering. The label transmission strategy with the anchor graph makes TLLFFC suitable for large-scale datasets. Moreover, incorporating the Schatten p-norm regularization on the label matrices can effectively unearth the complementary information distributed among views, thereby align the labels across views more consistently. Extensive experiments verify the superiority of TLLFFC.

Abstract: The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequencestructure alignment, hindering the seamless flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures through structure-contextualized modelling. This alignment enables free and bidirectional mappings between sequences and structures by utilizing a flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capabilities of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs solved only up to 3% of the puzzles due to the lack of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of downstream genome tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.

Abstract: Split Federated Learning (SFL) splits and collaboratively trains a shared model between clients and server, where clients transmit activations and clientside models to server for updates. Recent SFL studies assume synchronous transmission of activations and client-side models from clients to server. However, due to significant variations in computational and communication capabilities among clients, activations and client-side models arrive at server asynchronously. The delay caused by asynchrony significantly degrades the performance of SFL. To address this issue, we consider an asynchronous SFL framework, where an activation buffer and a model buffer are embedded on the server to manage the asynchronously transmitted activations and client-side models, respectively. Furthermore, as asynchronous activation transmissions cause the buffer to frequently receive activations from resource-rich clients, leading to biased updates of the server-side model, we propose Generative activations-aided Asynchronous SFL (GAS). In GAS, the server maintains an activation distribution for each label based on received activations and generates activations from these distributions according to the degree of bias. These generative activations are then used to assist in updating the server-side model, ensuring more accurate updates. We derive a tighter convergence bound, and our experiments demonstrate the effectiveness of the proposed method.

Abstract: Contrastive LanguageImage Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. Although knowledge distillation has been widely applied in single modality models, how to efficiently expand knowledge distillation to vision-language foundation models with extensive data remains relatively unexplored. In this paper, we introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model. We initially propose a simple but efficient image semantic balance method to reduce transfer learning bias and improve distillation efficiency. This method filters out 43.7% of image-text pairs from the LAION400M while maintaining superior performance. After that, we leverage cluster-instance discrimination to facilitate knowledge transfer from the teacher model to the student model, thereby empowering the student model to acquire a holistic semantic comprehension of the pre-training data. Experimental results demonstrate that CLIP-CID achieves state-of-the-art performance on various downstream tasks including linear probe and zero-shot classification.

Abstract: Spiking Neural Networks (SNNs) offer an attractive and energyefficient alternative to conventional Artificial Neural Networks (ANNs) due to their sparse binary activation. When SNN meets Transformer, it shows great potential in 2D image processing. However, their application for 3D point cloud remains underexplored. To this end, we present Spiking Point Transformer (SPT), the first transformer-based SNN framework for point cloud classification. Specifically, we first design Queue-Driven Sampling Direct Encoding for point cloud to reduce computational costs while retaining the most effective support points at each time step. We introduce the Hybrid Dynamics Integrate-and-Fire Neuron (HD-IF), designed to simulate selective neuron activation and reduce over-reliance on specific artificial neurons. SPT attains state-of-the-art results on three benchmark datasets that span both real-world and synthetic datasets in the SNN domain. Meanwhile, the theoretical energy consumption of SPT is at least 6.4x less than its ANN counterpart.

School of Computer Science and Technology, East China Normal University, Shanghai, China Innovation Center for Artificial Intelligence and Drug Discovery, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Innovation Center for Artificial Intelligence and Drug Discovery, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Innovation Center for Artificial Intelligence and Drug Discovery, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, Institute of Artificial Intelligence (TeleAI), China Telecom, Shenzhen Transsion Holdings CO.,LTD., Shenzhen Transsion Holdings CO.,LTD., Innovation Center for Artificial Intelligence and Drug Discovery, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China

Abstract: The interdisciplinary field of chemistry and artificial intelligence (AI) is an active area of research aimed at accelerating scientific discovery. Large language Models (LLMs) have shown significant promise in biochemical tasks, especially the molecule caption translation, which aims to align between molecules and natural language texts. However, existing works mainly focus on single molecules, while alignment between chemical reactions and natural language text remains largely unexplored. Additionally, the description of reactions is an essential part in biochemical patents and literature, and research on this aspect not only can help better understand chemical reactions but also promote research on automating chemical synthesis and retrosynthesis. In this work, we propose \textbf{ReactGPT}, a framework aiming to bridge the gap between chemical reaction and text. ReactGPT allows a new task: reaction captioning, by adapting LLMs to learn reactiontext alignment from context examples via In-Context Tuning. Specifically, ReactGPT jointly leverages a Fingerprints-based Reaction Retrieval module, a Domain-Specific Prompt Design module, and a two-stage In-Context Tuning module. We evaluate the effectiveness of ReactGPT on reaction captioning and experimental procedure prediction, both of these tasks can reflect the understanding of chemical reactions. Experimental results show that compared to previous models, ReactGPT exhibits competitive capabilities in resolving chemical reactions and generating high-quality text with correct structure.

Center for Integrated Nanotechnologies, Sandia National Laboratories, Albuquerque, NM, Center for Integrated Nanotechnologies, Sandia National Laboratories, Albuquerque, NM, Material, Physical and Chemical Sciences Center, Sandia National Laboratories, Albuquerque, NM, Center for Integrated Nanotechnologies, Sandia National Laboratories, Albuquerque, NM, Center for Computing Research, Sandia National Laboratories, Center for Computing Research, Albuquerque, NM, Center for Integrated Nanotechnologies, Sandia National Laboratories, Albuquerque, NM, Center for Integrated Nanotechnologies, Sandia National Laboratories, Albuquerque, NM

Abstract: Advances in robotic control and sensing have propelled the rise of automated scientific laboratories capable of highthroughput experiments. However, automated scientific laboratories are currently limited by human intuition in their ability to efficiently design and interpret experiments in high-dimensional spaces, throttling scientific discovery. We present AutoSciLab, a machine learning framework for driving autonomous scientific experiments, forming a surrogate researcher purposed for scientific discovery in high-dimensional spaces. AutoSciLab autonomously follows the scientific method in four steps: (i) generating high-dimensional experiments (x) using a variational autoencoder (ii) selecting optimal experiments by forming hypotheses using active learning (iii) distilling the experimental results to discover relevant low-dimensional latent variables (z) with a ‘directional autoencoder’ and (iv) learning a human interpretable equation connecting the discovered latent variables with a quantity of interest (y = f (z)), using a neural network equation learner. We validate the generalizability of AutoSciLab by rediscovering a) the principles of projectile motion and b) the phase-transitions within the spin-states of the Ising model (NP-hard problem). Applying our framework to an open-ended nanophotonics problem, AutoSciLab discovers a new way to steer incoherent light emission beyond current state-of-the-art, defining a new structure(material)-property(light-emission) relationship governing the physical process using closed-loop noisy experimental feedback.

Abstract: REST APIs have become key components of web services. However, they often contain logic flaws resulting in server side errors or security vulnerabilities. HTTP requests are used as test cases to find and mitigate such issues. Existing methods to modify requests, including those using deep learning, suffer from limited performance and precision, relying on undirected search or making limited usage of the contextual information. In this paper we propose APIRL, a fully automated deep reinforcement learning tool for testing REST APIs. A key novelty of our approach is the use of feedback from a transformer module pretrained on JSON-structured data, akin to that used in API responses. This allows APIRL to learn the subtleties relating to test outcomes, and generalise to unseen API endpoints. We show APIRL can find significantly more bugs than the state-of-the-art in real world REST APIs while minimising the number of required test cases. We also study how reward functions, and other key design choices, affect learnt policies with a thorough ablation study.

Abstract: Analog circuit design is a significant task in modern chip technology, focusing on the selection of component types, connectivity, and parameters to ensure proper circuit functionality. Despite advances made by Large Language Models (LLMs) in digital circuit design, the complexity and scarcity of data in analog circuitry pose significant challenges. To mitigate these issues, we introduce AnalogCoder, the first trainingfree LLM agent for designing analog circuits through Python code generation. Firstly, AnalogCoder incorporates a feedback-enhanced flow with tailored domain-specific prompts, enabling the automated and self-correcting design of analog circuits with a high success rate. Secondly, it proposes a circuit tool library to archive successful designs as reusable modular sub-circuits, simplifying composite circuit creation. Thirdly, extensive experiments on a benchmark designed to cover a wide range of analog circuit tasks show that AnalogCoder outperforms other LLM-based methods. It has successfully designed 20 circuits, 5 more than standard GPT-4o. We believe AnalogCoder can significantly improve the labor-intensive chip design process, enabling non-experts to design analog circuits efficiently.

Abstract: Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pretrained protein large language models (LLMs) have shown promising results on protein sequence generation. However, to control sequence generation for specific attributes, existing work still exhibits poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi-listwise preference optimization strategy to improve generation quality and support multi-attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation.

Abstract: Physicsinformed neural networks solve partial differential equations by training neural networks. Since this method approximates infinite-dimensional PDE solutions with finite collocation points, minimizing discretization errors by selecting suitable points is essential for accelerating the learning process. Inspired by number theoretic methods for numerical analysis, we introduce good lattice training and periodization tricks, which ensure the conditions required by the theory. Our experiments demonstrate that GLT requires 2-7 times fewer collocation points, resulting in lower computational cost, while achieving competitive performance compared to typical sampling methods.

Abstract: With the malicious use and dissemination of multimodal deepfake videos, researchers start to investigate multi-modal deepfake detection. Unfortunately, most of the existing methods tune all the parameters of the deep network with limited speech video datasets and are trained under coarse-grained consistency supervision, which hinders their generalization ability in practical scenarios. To solve these problems, in this paper, we propose the first multi-task audio-visual prompt learning method for multi-modal deepfake video detection, by exploiting multiple foundation models. Specifically, we construct a two-stream multi-task learning architecture and propose sequential visual prompts and short-time audio prompts to extract multi-modal features, which are aligned at the frame level and utilized in subsequent fine-grained feature matching and fusion. Due to the natural alignment of visual content and audio signal in real data, we propose a frame-level cross-modal feature matching loss function to learn the fine-grained audio-visual consistency. Comprehensive experiments demonstrate the effectiveness and superior generalization ability of our method against the state-of-the-art methods.

Computer Network Information Center, Chinese Academy of Sciences, Beijing, China, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China, Tianjin Medical University Eye Hospital, Tianjin, China, Tianjin Medical University Eye Hospital, Tianjin, China, Computer Network Information Center, Chinese Academy of Sciences, Beijing, China Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China University of Chinese Academy of Sciences, Beijing, China, Computer Network Information Center, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China

Abstract: Singlecell transcriptomics describes complex molecular features at the individual cell level, serving various roles in biological research, such as enhancing gene expression and predicting drug responses. Due to transcriptomic data structurally resembling sequential data, many researchers have trained numerous transformers on extensive transcriptomic datasets. However, they have consistently neglected to explore the intrinsic properties of the data and the appropriateness of their chosen model architecture. In this paper, we carefully investigate the nature of transcriptomics, identifying three overlooked problems: 1) long-tailed data problem, 2) model selection problem, and 3) evaluation problem. Consequently, by applying the weighted sampling strategy, we address the long-tailed data problem and achieve consistent improvement across all settings. By adapting different model structures to transcriptomic data, we discover that transformers are not the only option. By developing three downstream tasks and fair evaluation metrics, we establish a simple and comprehensive benchmark to validate the effectiveness of models for transcriptomics. Through extensive experiments, we clarify the misunderstandings in the traditional methods and provide competitive baselines, thereby paving the way for future research in this field.

Abstract: DNNbased watermarking methods have rapidly advanced, with the ``Encoder-Noise Layer-Decoder'' (END) framework being the most widely used. To ensure end-to-end training, the noise layer in the framework must be differentiable. However, real-world distortions are often non-differentiable, leading to challenges in end-to-end training. Existing solutions only treat the distortion perturbation as additive noise, which does not fully integrate the effect of distortion in training. To better incorporate non-differentiable distortions into training, we propose a novel dual-decoder architecture (END^2). Unlike conventional END architecture, our method employs two structurally identical decoders: the Teacher Decoder, processing pure watermarked images, and the Student Decoder, handling distortion-perturbed images. The gradient is backpropagated only through the Teacher Decoder branch to optimize the encoder thus bypassing the problem of non-differentiability. To ensure resistance to arbitrary distortions, we enforce alignment of the two decoders' feature representations by maximizing the cosine similarity between their intermediate vectors on a hypersphere. Extensive experiments demonstrate that our scheme outperforms state-of-the-art algorithms under various non-differentiable distortions. Moreover, even without the differentiability constraint, our method surpasses baselines with a differentiable noise layer. Our approach is effective and easily implementable across all END architectures, enhancing practicality and generalizability.

Abstract: Trajectory data mining is crucial for smart city management. However, collecting largescale trajectory datasets is challenging due to factors such as commercial conflicts and privacy regulations. Therefore, we urgently need trajectory generation techniques to address this issue. Existing trajectory generation methods rely on the global road network structure of cities. When the road network structure changes, these methods are often not transferable to other cities. In fact, there exist invariant mobility patterns between different cities: 1) People prefer paths with the minimal travel cost; 2) The travel cost of roads has an invariant relationship with the topological features of the road network. Based on the above insight, this paper proposes a Generalizable Trajectory Generation model (GTG). The model consists of three parts: 1) Extracting city-invariant road representation based on Space Syntax method; 2) Cross-city travel cost prediction through disentangled adversarial training; 3) Travel preference learning by shortest path search and preference update. By learning invariant movement patterns, the model is capable of generating trajectories in new cities. Experiments on three datasets demonstrates that our model significantly outperforms existing models in terms of generalization ability.

Abstract: Predicting RNA secondary structures is crucial for understanding RNA function, designing RNAbased therapeutics, and studying molecular interactions within cells. Existing deep-learning-based methods for RNA secondary structure prediction have mainly focused on local structural properties, often overlooking the global characteristics and evolutionary features of RNA sequences. Guided by biological priors, we propose PriFold, incorporating two key innovations: 1) improving attention mechanism with pairing probabilities to utilize global pairing characteristics, and 2) implementing data augmentation based on RNA covariation to leverage evolutionary information. Our structured enhanced pretraining and finetuning strategy significantly optimizes model performance. Extensive experiments demonstrate that PriFold achieves state-of-the-art (SOTA) results in RNA secondary structure prediction on benchmark datasets such as bpRNA, RNAStrAlign and ArchiveII. These results not only validate our prediction approach but also highlight the potential of integrating biological priors, such as global characteristics and evolutionary information, into RNA structure prediction tasks, opening new avenues for research in RNA biology and bioinformatics.

Abstract: As deep learning techniques advance rapidly, deepfake speech synthesized through textto-speech or voice conversion networks is becoming increasingly realistic, posing significant challenges for detection and raising potential threats to social security. This growing realism has prompted extensive research in speech deepfake detection. However, current detection methods primarily focus on extracting features from either the raw waveform or the spectrogram, often overlooking the valuable correspondences between these two modalities that could enhance the detection of previously unseen types of deepfakes. In this work, we propose a multi-view collaborative learning network for speech deepfake detection, which jointly learns robust speech representations from both raw waveforms and spectrograms. Specifically, we first design a Dual-Branch Contrastive Learning (DBCL) framework for learning different view features. DBCL consists of two branches that learn representations from the raw waveform or the spectrogram and utilizes contrastive learning to enhance inter- and inner-view correlations. Additionally, we introduce a Waveform-Spectrogram Fusion Module (WSFM) to exchange multi-view information for collaborative learning. In the feature learning process, WSFM converts features between views and merges them adaptively using waveform-spectrogram cross-attention. The final detection is conducted based on the concatenation of the waveform and spectrogram features. We conduct extensive experiments on four benchmark deepfake speech detection datasets, and the experimental results demonstrate that our method can achieve better detection performance than current state-of-the-art detection methods.

Abstract: Tropical cyclones (TCs) are complex weather systems with strong winds and heavy rainfall, causing substantial loss of life and property. Therefore, accurate TC forecasting is crucial for the effective prevention of disasters caused by TCs. TC forecasting can be regarded as a spatiotemporal prediction problem. It has been proven that using multi-modal data can effectively introduce atmospheric information to achieve better prediction results and higher interpretability. But it also introduces inevitably introduces noise into the prediction process. The diffusion model's unique noise modeling capability can reduce prediction noise when using multi-modal datasets. However, adapting it to TC forecasting has two main challenges: how to extract valuable information from multi-modal data, and how to utilize them to guide the generation process. For the first challenge, while recent methods can predict multiple TC attributes using multi-modal data, they often overlook the interdependence of multiple attributes and the semantic gap between modalities. Considering the interdependence of attributes, we propose two condition generators that capture the commonalities and characteristics of TC attributes, extracting spatio-temporal and environmental features and incorporating expert knowledge. To reduce the semantic gap between multi-modal data, we introduce the PGSA-LSTM module to map primary and auxiliary modalities. For the second challenge, we propose a novel Bi-condition diffusion model that sequentially processes conditions from the characteristics to commonalities of attributes, thereby expanding the guidance information that the diffusion model can accept. Our results surpass state-of-the-art deep learning models and outperform the numerical weather prediction model used by the China Central Meteorological Observatory. TC-Diffuser shows high generalizability across global ocean areas, strong robustness in handling missing data, and higher computational efficiency.

National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security Hubei Key Laboratory of Distributed System Security School of Cyber Science and Engineering, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security Hubei Key Laboratory of Distributed System Security School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Information and Communication Technology, Griffith University, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology, Huazhong University of Science and Technology

Abstract: With the advancement of deep learning, object detectors (ODs) with various architectures have achieved significant success in complex scenarios like autonomous driving. Previous adversarial attacks against ODs have been focused on designing customized attacks targeting their specific structures (eg, NMS and RPN), yielding some results but simultaneously constraining their scalability. Moreover, most efforts against ODs stem from imagelevel attacks originally designed for classification tasks, resulting in redundant computations and disturbances in object-irrelevant areas (eg, background). Consequently, how to design a model-agnostic efficient attack to comprehensively evaluate the vulnerabilities of ODs remains challenging and unresolved. In this paper, we propose NumbOD, a brand-new spatial-frequency fusion attack against various ODs, aimed at disrupting object detection within images. We directly leverage the features output by the OD without relying on its any internal structures to craft adversarial examples. Specifically, we first design a dual-track attack target selection strategy to select high-quality bounding boxes from OD outputs for targeting. Subsequently, we employ directional perturbations to shift and compress predicted boxes and change classification results to deceive ODs. Additionally, we focus on manipulating the high-frequency components of images to confuse ODs' attention on critical objects, thereby enhancing the attack efficiency. Our extensive experiments on nine ODs and two datasets show that NumbOD achieves powerful attack performance and high stealthiness.

Abstract: Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic helpseeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus of which quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

University of Chinese Academy of Sciences Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Institute of Medical Technology, Peking University Health Science Center, Peking University Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Institute of Medical Technology, Peking University Health Science Center, Peking University National Biomedical Imaging Center, Peking University, Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences

Abstract: Spiking Neural Networks (SNNs) have a lowpower advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state-of-the-art for SNNs in various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0x efficiency on ADE20K, +14.3% mIoU and 5.2x efficiency on VOC2012, and +9.1% mIoU and 6.6x efficiency on CityScapes.

Abstract: Hashing has been widely applied in largescale multimodal retrieval by mapping heterogeneous modalities data into binary codes. However, most cross-modal hashing methods cannot make the most of semantic information to construct the association relations of sample pairs, resulting in unsatisfactory retrieval accuracy. Concept lattice is a powerful tool for data mining and information retrieval, and for all we know, this is the first time to combine formal concept analysis and hash learning to improve cross-modal hashing retrieval performance. In this paper, we propose a novel framework for Asymmetric Cross-modal Hashing based on Formal Concept Analysis, denoted as ACHFCA. Initially, a flash-projection three-layer semantic enhancement descriptor is designed to extract latent representations from heterogeneous modalities. Subsequently, an asymmetric hash learning framework is established to enhance the semantics of different layers based on the fine-grained similarity values reconstructed from concept lattice to reinforce the discriminative competence of the model. Finally, an effective discrete optimization algorithm is proposed, which can directly learn compact hash codes. Comprehensive experiments on MIRFlickr, NUS-WIDE and IAPR-TC12 datasets demonstrate the superior performance of ACHFCA to state-of-the-art hashing approaches.

Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University Lanzhou University, South China University of Technology Peng Cheng Laboratory, Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, South China University of Technology, Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University Shenzhen University, Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University Beijing Institute of Technology

Abstract: Emotion recognition based on body movements is vital in humancomputer interaction. However, existing emotion recognition methods predominantly focus on enhancing classification accuracy, often neglecting the provision of textual explanations to justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by LargeLanguage Model (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs, which separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a specific language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly enhancing recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to existing methods. More importantly, with the support of background knowledge from LLMs, our model can generate detailed emotion descriptions based on classification results, even when trained on a limited amount of labeled skeleton data.

Abstract: In dyadic humanhuman interactions, individuals may express multiple different facial reactions in response to the same/similar behaviours expressed by their conversational partners depending on their personalised behaviour patterns. As a result, frequently-employed reconstruction loss-based strategies lead the training of previous automatic facial reaction generation (FRG) models to not only suffer from the 'one-to-many mapping' problem, but also fail to comprehensively consider the quality of the generated facial reactions. Besides, none of them considered such personalised behaviour patterns in generating facial reactions. In this paper, we propose the first adversarial FRG model training strategy which jointly learns appropriateness and realism discriminators to provide comprehensive task-specific supervision for training the target facial reaction generators, and reformulates the 'one-to-many (facial reactions) mapping' training problem as a 'one-to-one (distribution) mapping' training task, i.e., the FRG model is trained to output a distribution representing multiple appropriate/plausible facial reaction from each input human behaviour. In addition, our approach also serves as the first offline FRG approach that considers personalised behaviour patterns in generating of target individuals' facial reactions. Experiments show that our PerReactor not only largely outperformed all existing offline solutions for generating more appropriate, diverse and realistic facial reactions, but also is the first approach that can effectively generate personalised appropriate facial reactions.

Abstract: We propose ProtoArgNet, a novel interpretable deep neural architecture for image classification in the spirit of prototypicalpart-learning as found, e.g., in ProtoPNet. While earlier approaches associate every class with multiple prototypical-parts, ProtoArgNet uses super-prototypes that combine prototypical-parts into a unified class representation. This is done by combining local activations of prototypes in an MLP-like manner, enabling the localization of prototypes and learning (non-linear) spatial relationships among them. By leveraging a form of argumentation, ProtoArgNet is capable of providing both supporting (i.e. `this looks like that') and attacking (i.e. `this differs from that') explanations. We demonstrate on several datasets that ProtoArgNet outperforms state-of-the-art prototypical-part-learning approaches. Moreover, the argumentation component in ProtoArgNet is customisable to the user's cognitive requirements by a process of sparsification, which leads to more compact explanations compared to state-of-the-art approaches.

Abstract: Wholebody multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to process different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the text-to-motion semantic pre-training, followed by the multimodal low-level control adaptation. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.

Abstract: The topic of stitching images with globally natural structures holds paramount significance, with two main goals: pixellevel alignment and distortion prevention. The existing approaches exhibit the ability to align well, yet fall short in maintaining object structures. In this paper, we endeavour to safeguard the overall OBJect-level structures within images based on Global Similarity Prior (OBJ-GSP), on the basis of good alignment performance. Our approach leverages semantic segmentation models like the family of Segment Anything Model to extract the contours of any objects in a scene. Triangular meshes are employed in image transformation to protect the overall shapes of objects within images. The balance between alignment and distortion prevention is achieved by allowing the object meshes to strike a balance between similarity and projective transformation. We also demonstrate that object-level semantic information is necessary in low-altitude aerial image stitching. Additionally, we propose StitchBench, the largest image stitching benchmark with most diverse scenarios. Extensive experimental results demonstrate that OBJ-GSP outperforms existing methods in both pixel alignment and shape preservation.

School of Computer Science and Technology, Ocean University of China, School of Computer Science and Technology, Ocean University of China, School of Computer Science and Technology, Ocean University of China, School of Computer Science and Technology, Ocean University of China, School of Computer Science and Technology, Ocean University of China, School of Cyber Science and Engineering, Southeast University; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education; Purple Mountain Laboratories

Abstract: Online hashing has attracted much research attention for largescale image retrieval in a streaming way. The main challenge lies in keeping balance between high retrieval accuracy and low training time. Existing online hashing methods almost rely on shallow models rather than deep networks due to high training costs, because it is unacceptable to update hash functions on an order of hours. In addition, the multi-label supervision information is not fully utilized to guide the hash learning process and the affinity matrix is always fixed once constructed. In this paper, we propose a novel Deep Graph Online Hashing (DGOH) method, which for the first time introduces inductive graph neural networks (GNNs) to realize deep online hashing with acceptable training costs on an order of seconds. Furthermore, we mine the multi-label information of the images by constructing a label network and learn label-wise weights dynamically to help to update the affinity matrix. In addition, we provide a strategy to obtain examples from the old data to solve the catastrophic forgetting problem. An integrated objective function is designed to train the entire architecture. Extensive experiments on two common benchmarks demonstrate that the proposed method achieves up to 13.3% accuracy gains over state-of-the-art baselines and shows competitive performance on training time.

Abstract: This paper presents a comprehensive study on the role of ClassifierFree Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead propose to search for more efficient guidance policies. We formulate the discovery of such policies in the framework of differentiable neural architecture search. Our findings suggest that, as denoising progresses, the updates produced by CFG become increasingly aligned with simple conditional steps, which renders CFG's additional neural network evaluation redundant, especially in the second half of the denoising process. Building upon this insight, we propose "Adaptive Guidance" (AG), an efficient variant of CFG that adaptively omits network evaluations when the denoising process displays convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. Thus, AG constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the speed-ups of the latter, while being training-free and retaining the capacity to handle negative prompts. We conclude by uncovering further redundancies of CFG in the first half of the diffusion process, showing that entire neural network evaluations can be replaced by simple affine transformations of past score estimates.

Abstract: The primary objective of Optical Chemical Structure Recognition is to identify chemical structure images into corresponding markup sequences. However, the complex twodimensional structures of molecules, particularly those with rings and multiple branches, present significant challenges for current end-to-end methods to learn one-dimensional markup directly. To overcome this limitation, we propose a novel Ring-Free Language (RFL), which utilizes a divide-and-conquer strategy to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness while enhancing readability. This approach significantly reduces the learning difficulty for recognition models. Leveraging RFL, we propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings, along with a branch classification module for predicting branch information. Experimental results demonstrate that the proposed RFL and MSD can be applied to various mainstream methods, achieving superior performance compared to state-of-the-art approaches in both printed and handwritten scenarios.

Abstract: Signed Distance Functions (SDFs) are vital implicit representations to represent high fidelity 3D surfaces. Current methods mainly leverage a neural network to learn an SDF from various supervisions including signed distances, 3D point clouds, or multiview images. However, due to various reasons including the bias of neural network on low frequency content, 3D unaware sampling, sparsity in point clouds, or low resolutions of images, neural implicit representations still struggle to represent geometries with high frequency components like sharp structures, especially for the ones learned from images or point clouds. To overcome this challenge, we introduce a method to sharpen a low frequency SDF observation by recovering its high frequency components, pursuing a sharper and more complete surface. Our key idea is to learn a mapping from a low frequency observation to a full frequency coverage in a data-driven manner, leading to a prior knowledge of shape consolidation in the frequency domain, dubbed frequency consolidation priors. To better generalize a learned prior to unseen shapes, we introduce to represent frequency components as embeddings and disentangle the embedding of the low frequency component from the embedding of the full frequency component. This disentanglement allows the prior to generalize on an unseen low frequency observation by simply recovering its full frequency embedding through a test-time self-reconstruction. Our evaluations under widely used benchmarks or real scenes show that our method can recover high frequency component and produce more accurate surfaces than the latest methods.

College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China, Institute for Infocomm Research (I2R), A*STAR, Singapore, College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, College of Computer Science and Technology, Jilin University, Changchun, China Public Computer Education and Research Center, Jilin University, Changchun, China, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China

Abstract: Videobased human pose estimation has long been a fundamental yet challenging problem in computer vision. Previous studies focus on spatio-temporal modeling through the enhancement of architecture design and optimization strategies. However, they overlook the causal relationships in the joints, leading to models that may be overly tailored and thus estimate poorly to challenging scenes. Therefore, adequate causal reasoning capability, coupled with good interpretability of model, are both indispensable and prerequisite for achieving reliable results. In this paper, we pioneer a causal perspective on pose estimation and introduce a causal-inspired multitask learning framework, consisting of two stages. In the first stage, we try to endow the model with causal spatio-temporal modeling ability by introducing two self-supervision auxiliary tasks. Specifically, these auxiliary tasks enable the network to infer challenging keypoints based on observed keypoint information, thereby imbuing causal reasoning capabilities into the model and making it robust to challenging scenes. In the second stage, we argue that not all feature tokens contribute equally to pose estimation. Prioritizing causal (keypoint-relevant) tokens is crucial to achieve reliable results, which could improve the interpretability of the model. To this end, we propose a Token Causal Importance Selection module to identify the causal tokens and non-causal tokens (e.g., background and objects). Additionally, non-causal tokens could provide potentially beneficial cues but may be redundant. We further introduce a non-causal tokens clustering module to merge the similar non-causal tokens. Extensive experiments show that our method outperforms state-of-the-art methods on three large-scale benchmark datasets.

Abstract: Spike cameras, as innovative neuromorphic devices, generate continuous spike streams to capture highspeed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality images from the spike input under low-light conditions remains challenging. Conventional learning-based methods often rely on the synthetic dataset as the supervision for training. Still, these approaches falter when dealing with noisy spikes fired under the low-light environment, leading to further performance degradation in the real-world dataset. This phenomenon is primarily due to inadequate noise modelling and the domain gap between synthetic and real datasets, resulting in recovered images with unclear textures, excessive noise, and diminished brightness. To address these challenges, we introduce a novel spike-to-image reconstruction framework SpikeCLIP that goes beyond traditional training paradigms. Leveraging the CLIP model's powerful capability to align text and images, we incorporate the textual description of the captured scene and unpaired high-quality datasets as the supervision. Textual descriptions provide additional context that guides the network's feature reconstruction, while high-quality datasets help produce sharp latent images. Our experiments on real-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP significantly enhances texture details and the luminance balance of recovered images. Furthermore, the reconstructed images are well-aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.

Abstract: Existing works have extensively studied adversarial examples, which are minimal perturbations that can mislead the output of deep neural networks (DNNs) while remaining imperceptible to humans. However, in this work, we reveal the existence of a harmless perturbation space, in which perturbations drawn from this space, regardless of their magnitudes, leave the network output unchanged when applied to inputs. Essentially, the harmless perturbation space emerges from the usage of noninjective functions (linear or non-linear layers) within DNNs, enabling multiple distinct inputs to be mapped to the same output. For linear layers with input dimensions exceeding output dimensions, any linear combination of the orthogonal bases of the nullspace of the parameter consistently yields no change in their output. For non-linear layers, the harmless perturbation space may expand, depending on the properties of the layers and input samples. Inspired by this property of DNNs, we solve for a family of general perturbation spaces that are redundant for the DNN's decision, and can be used to hide sensitive data and serve as a means of model identification. Our work highlights the distinctive robustness of DNNs (i.e., consistency under large magnitude perturbations) in contrast to adversarial examples (vulnerability for small noises).

Abstract: This paper explores higherresolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Infinite-Canvas. It builds upon two core designs. First, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them. It allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window. It makes the generated spatial layout within each window harmonize with the source video. Coupling with these two designs enables us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Infinite-Canvas excels in large-scale video outpainting, e.g., from 512 × 512 to 1152 × 2048 (9×), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is available at https://github.com/mayuelala/FollowYourCanvas.

Abstract: In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGBbased, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack could serve as a strong foundation for further compelling research into unified tracking models.

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China SketchX, CVSSP, University of Surrey, United Kingdom, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China SketchX, CVSSP, University of Surrey, United Kingdom, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China, SketchX, CVSSP, University of Surrey, United Kingdom, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China SketchX, CVSSP, University of Surrey, United Kingdom, SketchX, CVSSP, University of Surrey, United Kingdom

Abstract: Despite the rapid advancements in textto-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

Abstract: Recent advancement in textto-image models and corresponding personalized technologies enables individuals to generate high-quality and imaginative images. However, they often suffer from limitations when generating images with resolutions outside of their trained domain. To overcome this limitation, we present the resolution adapter \textbf{(ResAdapter)}, a domain-consistent adapter designed for diffusion models to generate images with unrestricted resolutions and aspect ratios. Unlike other multi-resolution generation methods that process images of static resolution with complex post-process operations, ResAdapter directly generates images with the dynamical resolution. Especially, after learning a deep understanding of pure resolution priors, ResAdapter trained on the general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain. Comprehensive experiments demonstrate that ResAdapter with only 0.5M can process images with flexible resolutions for arbitrary diffusion models. More extended experiments demonstrate that ResAdapter is compatible with other modules for image generation across a broad range of resolutions, and can be integrated into other multi-resolution model for efficiently generating higher-resolution images.

Abstract: Adversarial training is one of the most effective approaches against adversarial attacks. However, adversarial training has primarily been studied in scenarios where data for all classes is provided, with limited research conducted in the context of incremental learning where knowledge is introduced sequentially. In this study, we investigate Adversarially Robust Class Incremental Learning (ARCIL), which deals with adversarial robustness in incremental learning. We first explore a series of baselines that integrate incremental learning with existing adversarial training methods, finding that they lead to conflicts between acquiring new knowledge and retaining past knowledge. Furthermore, we discover that training new knowledge causes the disappearance of a key characteristic in robust models: a flat loss landscape in input space. To address such issues, we propose a novel and robust baseline for ARCIL, named FLatness preserving Adversarial Incremental learning for Robustness (FLAIR). Experimental results demonstrate that FLAIR significantly outperforms other baselines. To the best of our knowledge, we are the first to comprehensively investigate the baselines, challenges, and solutions for ARCIL, which we believe represents a significant advance toward achieving realworld robustness.

Abstract: Visionbased 3D occupancy prediction has become a popular research task due to its versatility and affordability. Nowadays, conventional methods usually project the image-based vision features to 3D space and learn the geometric information through the attention mechanism, enabling the 3D semantic occupancy prediction. However, these works usually face two main challenges: 1) Limited geometric information. Due to the lack of geometric information in the image itself, it is challenging to directly predict 3D space information, especially in large-scale outdoor scenes. 2) Local restricted interaction. Due to the quadratic complexity of the attention mechanism, they often use modified local attention to fuse features, resulting in a restricted fusion. To address these problems, in this paper, we propose a language-assisted 3D semantic occupancy prediction network, named LOMA. In the proposed vision-language framework, we first introduce a VL-aware Scene Generator (VSG) module to generate the 3D language feature of the scene. By leveraging the vision-language model, this module provides implicit geometric knowledge and explicit semantic information from the language. Furthermore, we present a Tri-plane Fusion Mamba (TFM) block to efficiently fuse the 3D language feature and 3D vision feature. The proposed module not only fuses the two features with global modeling but also avoids too much computation costs. Experiments on the SemanticKITTI and SSCBench-KITTI360 datasets show that our algorithm achieves new state-of-the-art performances in both geometric and semantic completion tasks. Our code will be open soon.

Abstract: Recent advancements in languageguided diffusion models for image editing are often bottle-necked by cumbersome prompt engineering to precisely articulate desired changes. An intuitive alternative calls on guidance from in-the-wild image exemplars to help users bring their imagined edits to life. Contemporary exemplar-based editing methods shy away from leveraging the rich latent space learnt by pre-existing large text-to-image (TTI) models and fall back on training with curated objective functions to achieve the task. Though somewhat effective, this demands significant computational resources and lacks compatibility with diverse base models and arbitrary exemplar count. On further investigation, we also find that these techniques restrict user control to only applying uniform global changes over the entire edited region. In this paper, we introduce a novel framework for progressive exemplar-driven editing with off-the-shelf diffusion models, dubbed PIXELS, to enable customization by providing granular control over edits, allowing adjustments at the pixel or region level. Our method operates solely during inference to facilitate imitative editing, enabling users to draw inspiration from a dynamic number of reference images, or multimodal prompts, and progressively incorporate all the desired changes without retraining or fine-tuning existing TTI models. This capability of fine-grained control opens up a range of new possibilities, including selective modification of individual objects and specifying gradual spatial changes. We demonstrate that PIXELS delivers high-quality edits efficiently, leading to a notable improvement in quantitative metrics as well as human evaluation. By making high-quality image editing more accessible, PIXELS has the potential to enable professional-grade edits to a wider audience with the ease of using any open-source image generation model.

Abstract: NonRigid Structure-from-Motion (NRSfM) is a classic 3D vision problem, where a 2D sequence is taken as input to estimate the corresponding 3D sequence. Recently, the deep neural networks have greatly advanced the task of NRSfM. However, existing deep NRSfM methods still have limitations in handling the inherent sequence property and motion ambiguity associated with the NRSfM problem. In this paper, we revisit deep NRSfM from two perspectives to address the limitations of current deep NRSfM methods : (1) canonicalization and (2) sequence modeling. We propose an easy-to-implement per-sequence canonicalization method as opposed to the previous per-dataset canonicalization approaches. With this in mind, we propose a sequence modeling method that combines temporal information and subspace constraint. As a result, we have achieved a more optimal NRSfM reconstruction pipeline compared to previous efforts. The effectiveness of our method is verified by testing the sequence-to-sequence deep NRSfM pipeline with corresponding regularization modules on several commonly used datasets.

Abstract: Talking head video generation involves animating a still face image using facial motion cues derived from a driving video to replicate target poses and expressions. Traditional methods often rely on the assumption that the relative positions of facial keypoints remain unchanged. However, this assumption fails when keypoints are occluded or when the head is in a profile pose, leading to inconsistencies in identity and blurring in certain facial regions. In this paper, we introduce OcclusionInsensitive Talking Head Video Generation, a novel approach that eliminates the reliance on spatial correlation of keypoints and instead leverages semantic correlation. Our method transforms facial features into a facelet semantic bank, where each facelet token represents a specific facial semantic. This bank is devoid of spatial information, allowing it to compensate for any invisible or occluded face regions during motion warping. The facelet compensation module then populates the facelet tokens within the initially warped features by learning a correlation matrix between facial semantics and the facelet bank. This approach enables precise compensation for occlusions and pose changes, enhancing the fidelity of the generated videos. Extensive experiments demonstrate that our method achieves state-of-the-art results, preserving source identity, maintaining fine-grained facial details, and capturing nuanced facial expressions with remarkable accuracy.

Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Shanghai Artificial Intelligence Laboratory Shanghai Jiao Tong University, Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory Shanghai Jiao Tong University, Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Shanghai Artificial Intelligence Laboratory, Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Shanghai Artificial Intelligence Laboratory

Abstract: Despite recent advancements in textto-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES develops a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark of T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and 3D world.

Abstract: Efficiently applying fully supervised learning to virtual tryon tasks is challenging due to the lack of paired ground truth in available training samples. Recent works have achieved virtual try-ons by employing self-supervised learning-based inpainting paradigms. However, this approach is heavily dependent on the constraints of inpainting masks. An incorrect mask can mislead the generated results, while overly large mask areas can lose essential original information, thereby hindering the synthesis of high-quality results. To address these problems, we propose a latent diffusion model-based virtual try-on network that achieves fully supervised learning using the concept of cycle consistency and knowledge distillation. Specifically, we divide our approach into pretext and downstream tasks. In the pretext task, we generate a pseudo-label (pseudo-person image) to form paired training samples, which enables the downstream task to achieve fully supervised learning. To prevent the unreliable pseudo-person image from introducing irresponsible prior knowledge, we propose a noise-covering strategy, which aims at fully optimizing the pseudo-label to eliminate the impact of the incorrect inpainting mask as much as possible. Additionally, we propose a skin refinement loss to further enhance the generation of details in the skin region. Extended experiments demonstrate that our proposed method is superior to state-of-the-art methods.

Abstract: Scene Text Recognition (STR) methods have demonstrated robust performance in wordlevel text recognition. However, in real applications the text image is sometimes long due to detected with multiple horizontal words. It triggers the requirement to build long text recognition models from readily available short (i.e., word-level) text datasets, which has been less studied previously. In this paper, we term this task Out of Length (OOL) text recognition. We establish the first Long Text Benchmark (LTB) to facilitate the assessment of different methods in long text recognition. Meanwhile, we propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features, matching the sub-string and simultaneously recognizing its next and previous character. SMTR can recognize text of arbitrary length by iterating the process above. To avoid being trapped in recognizing highly similar sub-strings, we introduce a regularization training to compel SMTR to effectively discover subtle differences between similar sub-strings for precise matching. In addition, we propose an inference augmentation strategy to alleviate confusion caused by identical sub-strings in the same text and improve the overall recognition efficiency. Extensive experimental results reveal that SMTR, even when trained exclusively on short text, outperforms existing methods in public short text benchmarks and exhibits a clear advantage on LTB.

Abstract: Noisy labels can negatively impact the performance of deep neural networks. One common solution is label refurbishment, which involves reconstructing noisy labels through predictions and distributions. However, these methods may introduce problematic semantic associations, a phenomenon that we identify as Semantic Contamination. Through an analysis of Robust LR, a representative label refurbishment method, we found that utilizing the logits of views for refurbishment does not adequately balance the semantic information of individual classes. Conversely, using the logits of models fails to maintain consistent semantic relationships across models, which explains why label refurbishment methods frequently encounter issues related to Semantic Contamination. To address this issue, we propose a novel method called Collaborative Cross Learning, which utilizes semisupervised learning on refurbished labels to extract appropriate semantic associations from embeddings across views and models. Experimental results show that our method outperforms existing approaches on both synthetic and real-world noisy datasets, effectively mitigating the impact of label noise and Semantic Contamination.

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Intelligence Science and Technology, Nanjing University, Suzhou, China., School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Abstract: Image dehazing, particularly with learningbased methods, has gained significant attention due to its importance in real-world applications. However, relying solely on the RGB color space often fall short, frequently leaving residual haze. This arises from two main issues: the difficulty in obtaining clear textural features from hazy RGB images and the complexity of acquiring real haze/clean image pairs outside controlled environments like smoke-filled scenes. To address these issues, we first propose a novel Structure Guided Dehazing Network (SGDN) that leverages the superior structural properties of YCbCr features over RGB. It comprises two key modules: Bi-Color Guidance Bridge (BGB) and Color Enhancement Module (CEM). BGB integrates a phase integration module and an interactive attention module, utilizing the rich texture features of the YCbCr space to guide the RGB space, thereby recovering clearer features in both frequency and spatial domains. To maintain tonal consistency, CEM further enhances the color perception of RGB features by aggregating YCbCr channel information. Furthermore, for effective supervised learning, we introduce a Real-World Well-Aligned Haze dataset, which includes a diverse range of scenes from various geographical regions and climate conditions. Experimental results demonstrate that our method surpasses existing state-of-the-art methods across multiple real-world smoke/haze datasets.

Abstract: Reconstructing the intricate local morphology of neurons and their longrange projecting axons can address many connectivity related questions in neuroscience. The main bottleneck in connectomics pipelines is correcting topological errors, as multiple entangled neuronal arbors is a challenging instance segmentation problem. More broadly, segmentation of curvilinear, filamentous structures continues to pose significant challenges. To address this problem, we extend the notion of simple points from digital topology to connected sets of voxels (i.e. supervoxels) and propose a topology-aware neural network segmentation method with minimal computational overhead. We demonstrate its effectiveness on a new public dataset of 3-d light microscopy images of mouse brains, along with the benchmark datasets DRIVE, ISBI12, and CrackTree.

Abstract: Recently, dance generation has attracted increasing interest. In particular, the success of diffusion models in image generation has led to the emergence of dance generation systems based on the diffusion framework. However, these systems lack controllability, which limits their practical applications. In this paper, we propose a controllable dance generation method based on the diffusion model, which can generate 3D dance motions controlled by 2D keypoint sequences. Specifically, we design a transformerbased U-Net model to predict actual motions. Then, we fix the parameters of the U-Net model and train an additional control network, enabling the generated motions to be controlled by 2D keypoints. We conduct extensive experiments and compared our method with existing works on the widely used AIST++ dataset, demonstrating that our approach has certain advantages and controllability. Moreover, we also test our model on in-the-wild videos and find that it is capable of generating dance movements similar to the motions in the videos as well.

Abstract: Image inpainting is an important image generation task, which aims to restore corrupted image from partial visible area. Recently, diffusion Schrödinger bridge methods effectively tackle this task by modeling the translation between corrupted and target images as a diffusion Schrödinger bridge process along a noising schedule path. Although these methods have shown superior performance, in this paper, we find that 1) existing methods suffer from a schedulerestoration mismatching issue, i.e., the theoretical schedule and practical restoration processes usually exist a large discrepancy, which theoretically results in the schedule not fully leveraged for restoring images; and 2) the key reason causing such issue is that the restoration process of all pixels are actually asynchronous but existing methods set a synchronous noise schedule to them, i.e., all pixels shares the same noise schedule. To this end, we propose a schedule-Asynchronous Diffusion Schrödinger Bridge (AsyncDSB) for image inpainting. Our insight is preferentially scheduling pixels with high frequency (i.e., large gradients) and then low frequency (i.e., small gradients). Based on this insight, given a corrupted image, we first train a network to predict its gradient map in corrupted area. Then, we regard the predicted image gradient as prior and design a simple yet effective pixel-asynchronous noise schedule strategy to enhance the diffusion Schrödinger bridge. Thanks to the asynchronous schedule at pixels, the temporal interdependence of restoration process between pixels can be fully characterized for high-quality image inpainting. Experiments on real-world datasets show that our AsyncDSB achieves superior performance, especially on FID with around 3% ∼ 14% improvement over state-of-the-art baseline methods.

Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, School of Engineering Sciences, Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan United Imaging Healthcare Surgical Technology Co., Ltd., Tongji Medical College, Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology

Abstract: We propose MonoBox, an innovative boxsupervised segmentation method constrained by monotonicity to liberate its training from the user-unfriendly box-tightness assumption. In contrast to conventional box-supervised segmentation, where the box edges must precisely touch the target boundaries, MonoBox leverages imprecisely-annotated boxes to achieve robust pixel-wise segmentation. The 'linchpin' is that, within the noisy zones around box edges, MonoBox discards the traditional misguiding multiple-instance learning loss, and instead optimizes a carefully-designed objective, termed monotonicity constraint. Along directions transitioning from the foreground to background, this new constraint steers responses to adhere to a trend of monotonically decreasing values. Consequently, the originally unreliable learning within the noisy zones is transformed into a correct and effective monotonicity optimization. Moreover, an adaptive label correction is introduced, enabling MonoBox to enhance the tightness of box annotations using predicted masks from the previous epoch and dynamically shrink the noisy zones as training progresses. We verify MonoBox in the box-supervised segmentation task of polyps, where satisfying box-tightness is challenging due to the vague boundaries between the polyp and normal tissues. Experiments on both public synthetic and in-house real noisy datasets demonstrate that MonoBox exceeds other anti-noise state-of-the-arts by improving Dice by at least 5.5% and 3.3%, respectively.

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing University PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing University, Nanjing University, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Guangxi Normal University, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology

Abstract: Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatialtemporal tracking approach named STTrack. In contrast to previous paradigms that solely relied on updating reference information, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduced the mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios.

Abstract: In the domain of computer vision, ParameterEfficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks.

Abstract: Personalized textto-image generation methods can generate customized images based on the reference images, which have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state-of-the-arts on the Concept101 dataset and DreamBooth dataset of multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation.

Abstract: Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closedset predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Challenges arise when classifying aerial categories with high inter-class similarity and intra-class variance. Besides, the domain gap between vision-language models’ pretraining datasets and remote sensing datasets hinders the zero-shot capabilities of the pretrained model when it is directly applied to remote sensing images. To address these challenges, we propose a Zero-Shot Remote Sensing Instance Segmentation framework, dubbed ZoRI. Our approach features a discrimination-enhanced classifier that uses refined textual embeddings to increase the awareness of class disparities. Instead of direct fine-tuning, we propose a knowledge-maintained adaptation strategy that decouples semantic-related information to preserve vision-language alignment while adjusting features to capture remote sensing domain-specific visual cues. Additionally, we introduce a prior-injected prediction with cache bank of aerial visual prototypes to supplement the semantic richness of text embeddings and seamlessly integrate aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments demonstrate that ZoRI achieves the state-of-art performance on the zero-shot remote sensing instance segmentation task.

Abstract: LiDARbased 3D object detection is crucial for autonomous driving. However, due to the quality deterioration of LiDAR point clouds, it suffers from performance degradation in adverse weather conditions. Fusing LiDAR with the weatherrobust 4D radar sensor is expected to solve this problem; however, it faces challenges of significant differences in terms of data quality and the degree of degradation in adverse weather. To address these issues, we introduce L4DR, a weather-robust 3D object detection method that effectively achieves LiDAR and 4D Radar fusion. Our L4DR proposes Multi-Modal Encoding (MME) and Foreground-Aware Denoising (FAD) modules to reconcile sensor gaps, which is the first exploration of the complementarity of early fusion between LiDAR and 4D radar. Additionally, we design an Inter-Modal and IntraModal ({IM}2) parallel feature extraction backbone coupled with a Multi-Scale Gated Fusion (MSGF) module to counteract the varying degrees of sensor degradation under adverse weather conditions. Experimental evaluation on a VoD dataset with simulated fog proves that L4DR is more adaptable to changing weather conditions. It delivers a significant performance increase under different fog levels, improving the 3D mAP by up to 20.0% over the traditional LiDAR-only approach. Moreover, the results on the K-Radar dataset validate the consistent performance improvement of L4DR in realworld adverse weather conditions.

Abstract: Lowlight image enhancement (LLIE) aims to improve visibility and signal-to-noise ratio in images captured under poor lighting conditions. While deep learning has shown promise in this domain, current approaches require extensive paired training data, limiting their practical utility. We present a novel framework that reformulates low-light image enhancement as a zero-shot inference problem using pre-trained latent diffusion models (LDMs), eliminating the need for task-specific training data. Our key insight is that the rich natural image priors encoded in LDMs can be leveraged to recover well-lit images through a carefully designed optimization process. To address the ill-posed nature of low-light degradation and the complexity of latent space optimization, our framework introduces an exposure-aware degradation module that adaptively models illumination variations and a principled latent regularization scheme with adaptive guidance that ensures both enhancement quality and natural image statistics. Experimental results demonstrate that our framework outperforms existing zero-shot methods across diverse real-world scenarios.

Abstract: Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during distillation. To this end, we propose for the first time a generic knowledge distillation method for semantic segmentation from a heterogeneous perspective, named HeteroAKD. Due to the substantial disparities between heterogeneous architectures, such as CNN and Transformer, directly transferring crossarchitecture knowledge presents significant challenges. To eliminate the influence of architecture-specific information, the intermediate features of both the teacher and student are skillfully projected into an aligned logits space. Furthermore, to utilize diverse knowledge from heterogeneous architectures and deliver customized knowledge required by the student, a teacher-student knowledge mixing mechanism (KMM) and a teacher-student knowledge evaluation mechanism (KEM) are introduced. These mechanisms are performed by assessing the reliability and its discrepancy between heterogeneous teacher-student knowledge. Extensive experiments conducted on three main-stream benchmarks using various teacher-student pairs demonstrate that our HeteroAKD framework outperforms state-of-the-art KD methods in facilitating distillation between heterogeneous architectures.

Abstract: The visionbased geo-localization technology for UAV, serving as a secondary source of GPS information in addition to the global navigation satellite systems (GNSS), can still operate independently when communication with the external environment is cut off. Recent deep learning based methods attribute this as the task of image matching and retrieval. By retrieving drone-view images in satellite image database with GPS information tagged, approximate localization information can be obtained. However, due to high costs and privacy concerns, it is usually difficult to obtain large quantities of drone-view images from a continuous area. Existing drone-view datasets are mostly composed of small-scale aerial photography with a strong assumption that there exists a perfect one-to-one aligned reference image for any query, leaving a significant gap from the practical localization scenario. In this work, we construct a large-range continues area UAV geo-localization dataset named GTA-UAV, featuring multiple flight altitudes, attitudes, scenes, and targets using modern computer games. Based on this dataset, we introduce a more practical UAV geo-localization task including partial matches of cross-view paired data, and expand the image-level retrieval to the actual localization in terms of distance (meters). For the construction of drone-view and satellite-view pairs, we adopt a weight-based contrastive learning approach, which allows for effective learning while avoiding additional post-processing matching steps. Experiments demonstrate the effectiveness of our data and training method for UAV geo-localization, as well as the generalization capabilities to real-world scenarios.

Abstract: Spatial contexts, such as the backgrounds and surroundings, are considered critical in HumanObject Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-craft background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and v-coco benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

Abstract: Video question answering plays a vital role in computer vision, and recent advances in large language models have further propelled the development of this field. However, existing video question answering techniques often face limitations in grasping finegrained video content in spatial dimensions. It mainly stems from the fixed and low-resolution input of video frames. While some approaches using high-resolution inputs partially alleviate this problem, they introduce excessive computational burdens by encoding the entire high-resolution image. In this work, we propose a granularity-adaptive spatial evidence tokenization model for video question answering. Our method introduces multi-granular visual tokenization in the spatial dimension to produce video tokens at various granularities based on the question. It highlights spatially activated patches at low resolutions through a granularity weighting module and then adaptively encodes these activated patches at high resolution for detail supplementation. To mitigate the computational overhead associated with high-resolution frame encoding, a masking and acceleration module is developed for efficient visual tokenization. Moreover, a granularity compression module is designed to dynamically select and compress visual tokens of varying granularities based on questions. We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method.

Abstract: Compositional generalization is crucial for artificial intelligence agents to solve complex visionlanguage reasoning tasks. Neuro-symbolic approaches have demonstrated promise in capturing compositional structures, but they face critical challenges: (a) reliance on predefined predicates for symbolic representations that limit adaptability, (b) difficulty in extracting predicates from raw data, and (c) using non-differentiable operations for combining primitive concepts. To address these issues, we propose NeSyCoCo, a neuro-symbolic framework that leverages large language models (LLMs) to generate symbolic representations and map them to differentiable neural computations. NeSyCoCo introduces three innovations: (a) augmenting natural language inputs with dependency structures to enhance the alignment with symbolic representations, (b) employing distributed word representations to link diverse, linguistically motivated logical predicates to neural modules, and (c) using the soft composition of normalized predicate scores to align symbolic and differentiable reasoning. Our framework achieves state-of-the-art results on the ReaSCAN and CLEVR-CoGenT compositional generalization benchmarks and demonstrates robust performance with novel concepts in the CLEVR-SYN benchmark.

Abstract: Contextual information at the video level has become increasingly crucial for visual object tracking. However, existing methods typically use only a few tokens to convey this information, which can lead to information loss and limit their ability to fully capture the context. To address this issue, we propose a new videolevel visual object tracking framework called MCITrack. It leverages Mamba's hidden states to continuously record and transmit extensive contextual information throughout the video stream, resulting in more robust object tracking. The core component of MCITrack is the Contextual Information Fusion module, which consists of the mamba layer and the cross-attention layer. The mamba layer stores historical contextual information, while the cross-attention layer integrates this information into the current visual features of each backbone block. This module enhances the model's ability to capture and utilize contextual information at multiple levels through deep integration with the backbone. Experiments demonstrate that MCITrack achieves competitive performance across numerous benchmarks. For instance, it gets 76.6% AUC on LaSOT and 80.0% AO on GOT-10k, establishing a new state-of-the-art performance.

Abstract: Guided depth superresolution (GDSR) has demonstrated impressive performance across a wide range of domains, with numerous methods being proposed. However, existing methods often treat depth maps as images, where shading values are computed discretely, making them struggle to effectively restore the continuity inherent in the depth map. In this paper, we propose a novel approach that maximizes the utilization of spatial characteristics in depth, coupled with human abstract perception of real-world substance, by transforming the GDSR issue into deformation of a roughcast with ideal plasticity, which can be deformed by force like a continuous object. Specifically, we firstly designed a cross-modal operation, Continuity-constrained Asymmetrical Pixelwise Operation (CAPO), which can mimic the process of deforming an isovolumetrically flexible object through external forces. Utilizing CAPO as the fundamental component, we develop the Pixelwise Cross Gradient Deformation (PCGD), which is capable of emulating operations on ideal plastic objects (without volume constraint). Notably, our approach demonstrates state-of-the-art performance across four widely adopted benchmarks for GDSR, with significant advantages in large-scale tasks and generalizability.

Abstract: Geographic information is essential for modeling tasks in fields ranging from ecology to epidemiology. However, extracting relevant location characteristics for a given task can be challenging, often requiring expensive data fusion or distillation from massive global imagery datasets. To address this challenge, we introduce Satellite Contrastive LocationImage Pretraining (SatCLIP). This global, general-purpose geographic location encoder learns an implicit representation of locations by matching CNN and ViT inferred visual patterns of openly available satellite imagery with their geographic coordinates. The resulting SatCLIP location encoder efficiently summarizes the characteristics of any given location for convenient use in downstream tasks. In our experiments, we use SatCLIP embeddings to improve performance on nine diverse geospatial prediction tasks including temperature prediction, animal recognition, and population density estimation. Across tasks, SatCLIP consistently outperforms alternative location encoders and shows promise for improving geographic domain adaptation. These results demonstrate the potential of vision-location models to learn meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.

Abstract: 3D superresolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without involving fine-tuning or generating `smooth' trajectory from the trained 3D models over LR images. The experimental results show that the surprisingly simple algorithms can achieve the state-of-the-art results of 3D super-resolution tasks on standard benchmark datasets, such as the NeRF-synthetic and Mip-NeRF 360 datasets.

Abstract: While visual questionanswering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also necessitates lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step taxonomy-driven methodology for instantiating task examples. Then, we develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While the SotA vision language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.

Abstract: In this work, we focus on semisupervised learning for video action detection. Video action detection requires spatio-temporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatio-temporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four different spatio-temporal detection benchmarks: UCF101-24, JHMDB21, AVA, and Youtube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides a competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21 respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and Youtube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain.

Abstract: Multimodal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods extract a fixed number of frames, but this has critical challenges. If a limited number of frames are extracted, important frames with essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes consecutive frames, potentially causing redundancy in visual tokens extracted from consecutive video frames. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning that has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our numerical experiments with three different benchmark datasets demonstrate that the proposed framework significantly improves the performances of three recent video captioning models.

Abstract: Pointbased interactive colorization techniques allow users to effortlessly colorize grayscale images using user-provided color hints. However, point-based methods often face challenges when different colors are given to semantically similar areas, leading to color intermingling and unsatisfactory results—an issue we refer to as color collapse. The fundamental cause of color collapse is the inadequacy of points for defining the boundaries for each color. To mitigate color collapse, we introduce a lasso tool that can control the scope of each color hint. Additionally, we design a framework that leverages the user-provided lassos to localize the attention masks. The experimental results show that using a single lasso is as effective as applying 4.18 individual color hints and can achieve the desired outcomes in 30% less time than using points alone.

Abstract: Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in suboptimal probability distribution estimation for specific testing images during encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose pretrained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression.

Abstract: Deep online crossmodal hashing has gained much attention from researchers recently, as its promising applications with low storage requirement, fast retrieval efficiency and cross modality adaptive, etc. However, there still exists some technical hurdles that hinder its applications, e.g., 1) how to extract the coexistent semantic relevance of cross-modal data, 2) how to achieve competitive performance when handling the real time data streams, 3) how to transfer the knowledge learned from offline to online training in a lightweight manner. To address these problems, this paper proposes a lightweight contrastive distilled hashing (LCDH) for cross-modal retrieval, by innovatively bridging the offline and online cross-modal hashing by similarity matrix approximation in a knowledge distillation framework. Specifically, in the teacher network, LCDH first extracts the cross-modal features by CLIP, which are further fed into an attention module for representation enhancement after feature fusion. Then, the output of the attention module is fed into a FC layer to obtain hash codes for aligning the sizes of similarity matrices for online and offline training. In the student network, LCDH extracts the visual and textual features by lightweight models, and then the features are fed into a FC layer to generate binary codes. Finally, by approximating the similarity matrices, the performance of online hashing in the lightweight student network can be enhanced by the supervision of coexistent semantic relevance that is distilled from the teacher network. Experimental results on three widely used datasets demonstrate that LCDH outperforms some state-of-the-art methods.

Abstract: Blindspot networks (BSN) have been prevalent neural architectures in self-supervised image denoising (SSID). However, most existing BSNs are conducted with convolution layers. Although transformers have shown the potential to overcome the limitations of convolutions in many image restoration tasks, the attention mechanisms may violate the blind-spot requirement, thereby restricting their applicability in BSN. To this end, we propose to analyze and redesign the channel and spatial attentions to meet the blind-spot requirement. Specifically, channel self-attention may leak the blind-spot information in multi-scale architectures, since the downsampling shuffles the spatial feature into channel dimensions. To alleviate this problem, we divide the channel into several groups and perform channel attention separately. For spatial self-attention, we apply an elaborate mask to the attention matrix to restrict and mimic the receptive field of dilated convolution. Based on the redesigned channel and window attentions, we build a Transformer-based Blind-Spot Network (TBSN), which shows strong local fitting and global perspective abilities. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods.

School of Artificial Intelligence, Jilin University, Changchun, China Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, Huawei Technologies Co., Ltd., Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China School of Artificial Intelligence, Jilin University, Changchun, China

Abstract: Retrievalbased multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing these images to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information into the QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain multimodal hypothetical summary in question-form and description-form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming into text-to-text retrieval and helps improve retrieval. To more advantageously introduce retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy for calculating both sentence-level and word-level similarity scores, to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.

MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, MAIS, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, Beijing Normal University MAIS, Institute of Automation, Chinese Academy of Sciences, MAIS, Institute of Automation, Chinese Academy of Sciences, MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: The integration of deep generative networks into generating ComputerAided Design (CAD) models has garnered increasing attention over recent years. Traditional methods often rely on discrete sequences of parametric line/curve segments to represent sketches. Differently, we introduce RECAD, a novel framework that generates Raster sketches and 3D Extrusions for CAD models. Representing sketches as raster images offers several advantages over discrete sequences: 1) it breaks the limitations on the types and numbers of lines/curves, providing enhanced geometric representation capabilities; 2) it enables interpolation within a continuous latent space; and 3) it allows for more intuitive user control over the output. Technically, RECAD employs two diffusion networks: the first network generates extrusion boxes conditioned on the number and types of extrusions, while the second network produces sketch images conditioned on these extrusion boxes. By combining these two networks, RECAD effectively generates sketch-and-extrude CAD models, offering a more robust and intuitive approach to CAD model generation. Experimental results indicate that RECAD achieves strong performance in unconditional generation, while also demonstrating effectiveness in conditional generation and output editing.

Abstract: Recent advancements in unsupervised monocular depth estimation typically rely on an assumption that image photometry remains consistent across consecutive frames. However, this assumption often fails in endoscopic scenes due to: 1) local photometric inconsistency caused by specular reflections creating highlights; and 2) global photometric inconsistency resulting from the simultaneous movement of the light source and the camera. Since unsupervised depth estimation methods rely on appearance discrepancies between frames as a supervisory signal, these photometric inconsistencies inevitably deteriorate loss function calculation. In this paper, our goal is to obtain a strong and reliable supervisory signal for achieving photometricconsistent depth estimation. To this end, for local photometric inconsistency, we utilize the specular reflection model to introduce a Highlight Loss for handling the estimation of highlight regions. For global photometric inconsistency, we design a Photometric Match module, which utilizes the spotlight illumination model to derive an analytical expression, achieving photometric alignment across different frames. Unlike previous works that introduce additional optical flow or networks, our method is simpler and more efficient. Extensive experiments demonstrate our method achieves the state-of-the-art results on C3VD, SCARED and SERV-CT datasets.

Abstract: Reconstructing desired objects and scenes has long been a primary goal in 3D computer vision. Singleview point cloud reconstruction has become a popular technique due to its low cost and accurate results. However, single-view reconstruction methods often rely on expensive CAD models and complex geometric priors. Effectively utilizing prior knowledge about the data remains a challenge. In this paper, we introduce hyperbolic space to 3D point cloud reconstruction, enabling the model to represent and understand complex hierarchical structures in point clouds with low distortion. We build upon previous methods by proposing a hyperbolic Chamfer distance and a regularized triplet loss to enhance the relationship between partial and complete point clouds. Additionally, we design adaptive boundary conditions to improve the model's understanding and reconstruction of 3D structures. Our model outperforms most existing models, and ablation studies demonstrate the significance of our model and its components. Experimental results show that our method significantly improves feature extraction capabilities. Our model achieves outstanding performance in 3D reconstruction tasks.

Abstract: Traditional adversarial attacks typically produce adversarial examples under normconstrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA2) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, i.e, Style-GAN, is utilized to synthesize impersonated faces with desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA2 method can generate natural text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, i.e, Face++ and Aliyun, further demonstrating the practical potential of our approach.

Abstract: Personalized image generation has made significant strides in adapting content to novel concepts. However, a persistent challenge remains: balancing the accurate reconstruction of unseen concepts with the need for editability according to the prompt, especially when dealing with the complex nuances of facial features. In this study, we delve into the temporal dynamics of the textto-image conditioning process, emphasizing the crucial role of stage partitioning in introducing new concepts. We present PersonaMagic, a stage-regulated generative technique designed for high-fidelity face customization. Using a simple MLP network, our method learns a series of embeddings within a specific timestep interval to capture face concepts. Additionally, we develop a Tandem Equilibrium mechanism that adjusts self-attention responses in the text encoder, balancing text description and identity preservation, improving both areas. Extensive experiments confirm the superiority of PersonaMagic over state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, its robustness and flexibility are validated in non-facial domains, and it can also serve as a valuable plug-in for enhancing the performance of pretrained personalization models.

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 211106, China., College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 211106, China., Microsoft, Washington, USA, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 211106, China.

Abstract: Diffusionbased models have shown great promise in real-world image super-resolution (Real-ISR), but often generate content with structural errors and spurious texture details due to the empirical priors and illusions of these models. To address this issue, we introduce StructSR, a simple, effective, and plug-and-play method that enhances structural fidelity and suppresses spurious details for diffusion-based Real-ISR. StructSR operates without the need for additional fine-tuning, external model priors, or high-level semantic knowledge. At its core is the Structure-Aware Screening (SAS) mechanism, which identifies the image with the highest structural similarity to the low-resolution (LR) input in the early inference stage, allowing us to leverage it as a historical structure knowledge to suppress the generation of spurious details. By intervening in the diffusion inference process, StructSR seamlessly integrates with existing diffusion-based Real-ISR models. Our experimental results demonstrate that StructSR significantly improves the fidelity of structure and texture, improving the PSNR and SSIM metrics by an average of 5.27% and 9.36% on a synthetic dataset (DIV2K-Val) and 4.13% and 8.64% on two real-world datasets (RealSR and DRealSR) when integrated with four state-of-the-art diffusion-based Real-ISR methods.

School of Electronic and Computer Engineering, Peking University Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Peking University Shenzhen Graduate School, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG, Nanyang Technological University, Tsinghua University, University of Macau Shenzhen Institute of Advanced Technology (SIAT), ARC Lab, Tencent PCG, School of Electronic and Computer Engineering, Peking University Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Peking University Shenzhen Graduate School

Abstract: Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving laborintensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. An well-cultivated training strategy is proposed to separate distinct camera and object motion by camera LoRA weights and object LoRA weights. To further eliminate motion ambiguity from ill-posed trajectories, we introduce a camera-free guidance technique during inference process, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis.

Abstract: Answering questions related to audiovisual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.

Abstract: Most existing 3D visual speech animation methods synthesize lip movements synchronized with speech, which however neglect head poses and therefore degrade the animation realism. The animation of head poses presents two primary challenges: (1) the intricate mapping between speech and head poses remains poorly understood and (2) the absence of 4D face datasets featuring realistic head poses. Inspired by prosody decomposition in speech processing, we discern that head movements correlate with the fundamental frequency (F0) of speech prosody, while lip movements align with the language content. These observations motivate us to propose a novel framework, dubbed ProsodyTalker, that concurrently synthesizes lip and head movements, grounded in the principles of prosody decomposition. The core idea is first to adopt information perturbation to explicitly decompose the speech prosody into poserelated F0 and lip-related language content. Then, an autoregressive content-oriented fusion decoder is employed to enhance lip synchronization in the synthesized facial sequences. To synthesize head poses, we design a transformer-based variational autoencoder to learn a latent distribution of facial sequences and propose an F0-conditioned latent diffusion model to establish a probabilistic mapping from F0 to pose-related latent codes. Furthermore, we contribute a large-scale 4D face dataset containing bunches of variations in identities, head poses and facial motions. Extensive experiments show that our method achieves more realistic animation than state-of-the-art methods.

Abstract: Recent years have witnessed the remarkable success of Textto-3D generation, particularly with the rise of mainstream conditional diffusion models (DMs). Though achieving substantial progress, existing methods still face a knotty "human preference" dilemma, that is the 3D contents generated by the models often deviate greatly from the desired effects (e.g., perspective, aesthetics, shading, appearance, etc.) due to the lack of attention to human preferences. To mitigate the limitation of data deficiency and enable human preference learning, we first elaborately curate the HP3D, a text-to-3D dataset with expert preference annotations which is initally captioned by the multimodal large model LLava and then refined by human expert. Based on such a brand-new HP3D, we further propose DreamAlign, a reward-free method that does not require designing any complex reward models whereas only by introducing a light-weight lora adapter and then designing a novel direct 3D preference optimization (D-3DPO) algorithm for training. Moreover, in the stage of text-to-3D we design an additional Preference Contrastive Feedback training for score distillation sampling, which enables the generated 3D objects to align the human preferences (e.g., aesthetics, material, etc.). Extensive experiments demonstrate that DreamAlign consistently achieves state-of-the-art performance on generative effects and human preference alignment across various benchmark evaluations.

Abstract: The iterative sampling procedure employed by diffusion models (DMs) often leads to significant latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated textto-image generation, where high-quality generations can be achieved with just 2-4 sampling steps or even1 step, and further improvements can be obtained by additional cost, e.g., 4 steps. In contrast to vanilla consistency distillation (CD) which distills the ordinary differential equation solvers-based sampling process of a pre-trained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the sample quality with rare sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID of 21.9, surpassing that of the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation, with up to 16% improvement in a qualified metric.

Hangzhou Innovation Institute, Beihang University State Key Laboratory of Virtual Reality Technology and Systems, Beihang University School of Computer Science and Engineering, Beihang University, Hangzhou Innovation Institute, Beihang University State Key Laboratory of Virtual Reality Technology and Systems, Beihang University School of Computer Science and Engineering, Beihang University, School of Computer Science and Engineering, Beihang University, Hangzhou Innovation Institute, Beihang University State Key Laboratory of Virtual Reality Technology and Systems, Beihang University School of Computer Science and Engineering, Beihang University

Abstract: Deep learningbased methods have made significant progress in image dehazing. However, these methods often falter when applied to real-world hazy images, primarily due to the scarcity of paired real-world data and the limitations of current dehazing feature extractors. Toward these issues, we introduce a novel Physics Embedded Illumination Estimation (PEIE) method for adaptive real-world dehazing. Specifically, (1) we identify the limitations of the widely used Atmospheric Scattering Model and propose a new physical model, the Illumination-Adaptive Scattering Model (IASM), for more accurate illumination representation in hazy imaging; (2) we develop a robust data synthesis pipeline that leverages the physics embedded illumination estimation to generate realistic hazy images; and (3) we design an Illumination-Adaptive Dehazing Unit (IDU) to extract dehazing features consistent with our proposed IASM in the latent space. By integrating the IDU into a U-Net architecture to create IADNet, we achieve significant improvements in dehazing performance through end-to-end training on synthetic data. Extensive experiments validate the superior performance of our PEIE method, significantly surpassing the state-of-the-arts in real-world dehazing.

Abstract: Blind face restoration is a highly illposed problem due to the lack of necessary context. Although existing methods produce high-quality outputs, they often fail to faithfully preserve the individual's identity. In this paper, we propose a personalized face restoration method, FaceMe, based on a diffusion model. Given a single or a few reference images, we use an identity encoder to extract identity-related features, which serve as prompts to guide the diffusion model in restoring high-quality and identity-consistent facial images. By simply combining identity-related features, we effectively minimize the impact of identity-irrelevant features during training and support any number of reference image inputs during inference. Additionally, thanks to the robustness of the identity encoder, synthesized images can be used as reference images during training, and identity changing during inference does not require fine-tuning the model. We also propose a pipeline for constructing a reference image training pool that simulates the poses and expressions that may appear in real-world scenarios. Experimental results demonstrate that our FaceMe can restore high-quality facial images while maintaining identity consistency, achieving excellent performance and robustness.

Abstract: Openvocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets demonstrate that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.

School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China, School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Co., Ltd, Jinan, China, School of Software, Shandong University, Jinan 250101, China, School of Software, Shandong University, Jinan 250101, China

Abstract: Semisupervised hashing has shown promising efficacy in large-scale image retrieval, which learns similarity-preserving codes from both labeled and unlabeled data. To enable the use of advanced supervised hashing techniques, pseudo labels are widely applied. However, existing methods typically suffer from a biased learning issue due to pseudo label noise, which can be further aggravated during optimization. Although such a bias can adversely affect hashing accuracy, it has not been investigated sufficiently. In view of this, we present a comprehensive discussion on potential causes of biases, involving processes of pseudo-labeling, hash learning and optimization. Accordingly, a novel Generalized Debiased Semi-supervised Hashing (GDSH) method is proposed as a unified solution to mitigate the biases. Specifically, reliable pseudo labels are first predicted via a robust label completion strategy. Secondly, a debiased hash learning module is designed by combining label denoising and similarity updating. This can not only refine the supervision, but also obtain hash codes that are semantically debiased in both category and sample levels. Finally, a discrete semi-supervised hashing algorithm is proposed to alleviate the bias arising from optimization. Experimental results on three single-label and three multi-label image benchmarks demonstrate that GDSH remarkably outperforms the state-of-the-arts in different semi-supervised settings.

Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, RealAI, Beijing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China Nanjing Institute of Software Technology, Nanjing, China, Singapore Management University, Singapore, College of Computer and Information Science, Software College, Southwest University, Chongqing, China, The University of Liverpool, Liverpool, United Kingdom, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China

Abstract: Formal verification provides critical security assurances for neural networks, yet its practical application suffers from the long verification time. This work introduces a novel method for training verificationfriendly neural networks, which are robust, easy to verify, and relatively accurate. Our method integrates neuron behavior consistency into the training process, making neuron activation states remain consistent across different inputs within a local neighborhood. This reduces the number of unstable neurons and tightens the bounds of neurons thereby enhancing the network's verifiability. We evaluated our method using the MNIST, Fashion-MNIST, and CIFAR-10 datasets with various network architectures. The experimental results demonstrate that networks trained using our method are verification-friendly across different radii and architectures, whereas other tools fail to maintain verifiability as the radius increases. Additionally, we show that our method can be combined with existing approaches to further improve the verifiability of networks.

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, Beijing University of Posts and Telecommunications, Beijing, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou, China Guangdong Province Key Laboratory of Information Security Technology, China

Abstract: Edge labels are typically at various granularity levels owing to the varying preferences of annotators, thus handling the subjectivity of perpixel labels has been a focal point for edge detection. Previous methods often employ a simple voting strategy to diminish such label uncertainty or impose a strong assumption of labels with a pre-defined distribution, e.g., Gaussian. In this work, we unveil that the segment anything model (SAM) provides strong prior knowledge to model the uncertainty in edge labels. Our key insight is that the intermediate SAM features inherently correspond to object edges at various granularities, which reflects different edge options due to uncertainty. Therefore, we attempt to align uncertainty with granularity by regressing intermediate SAM features from different layers to object edges at multi-granularity levels. In doing so, the model can fully and explicitly explore diverse ``uncertainties'' in a data-driven fashion. Specifically, we inject a lightweight module (~ 1.5% additional parameters) into the frozen SAM to progressively fuse and adapt its intermediate features to estimate edges from coarse to fine. It is crucial to normalize the granularity level of human edge labels to match their innate uncertainty. For this, we simply perform linear blending to the real edge labels at hand to create pseudo labels with varying granularities. Consequently, our uncertainty-aligned edge detector can flexibly produce edges at any desired granularity (including an optimal one). Thanks to SAM, our model uniquely demonstrates strong generalizability for cross-dataset edge detection. Extensive experimental results on BSDS500, Muticue and NYUDv2 validate our model's superiority.

Abstract: Semisupervised learning improves semantic segmentation performance by leveraging unlabeled data, thereby significantly reducing labeling costs. Previous semi-supervised semantic segmentation (S4) methods explored perturbations at the image level but neglected to adequately utilize multi-scale information. When labeled information is insufficient, the scale variation between different objects makes learning instances with extreme scales even more difficult. To address this issue, we propose ScaleMatch, which aims to learn scale-invariant features by obtaining a mixed dual-scale pseudo-label and scale consistency learning. Specifically, the cross-scale interaction fusion (CIF) module enforces interactive information across different scaled-views, allowing for more reliable pseudo-label generation. More importantly, ScaleMatch introduces variable scale branches to utilize scale-invariant supervision. It consists of image-level scale variation consistency (ISVC) and feature-level scale variation consistency (FSVC). Consequently, our ScaleMatch enhances the model's generalization under scale variation, outperforming existing state-of-the-art methods on both the Pascal VOC and Cityscapes datasets under various partition protocols.

Abstract: Accurately describing images with text is a foundation of explainable AI. VisionLanguage Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification using VLMs.

The Hong Kong University of Science and Technology, Hong Kong, The Hong Kong University of Science and Technology, Hong Kong, Tencent, Hunyuan, China Tsinghua University, China, Tencent, Hunyuan, China, Tsinghua University, China, The Hong Kong University of Science and Technology, Hong Kong, Tsinghua University, China, Tencent, Hunyuan, China, Tencent, Hunyuan, China, The Hong Kong University of Science and Technology, Hong Kong Tsinghua University, China, Tencent, Hunyuan, China, The Hong Kong University of Science and Technology, Hong Kong

Abstract: Despite recent advances in imageto-video generation, better controllability and local animation are less explored. Most existing image-to-video methods are not locally aware and tend to move the entire scene. However, human artists may need to control the movement of different objects or regions. Additionally, current I2V methods require users not only to describe the target motion but also to provide redundant detailed descriptions of frame contents.These two issues hinder the practical utilization of current I2V tools. In this paper, we propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click (for specifying what to move) and a motion prompt (for specifying how to move). Technically, we propose the first-frame masking strategy, which significantly improves the video generation quality, and a motion-augmented module equipped with a motion prompt dataset to improve the motion prompt following abilities of our model. To further control the motion speed, we propose flow-based motion magnitude control to control the speed of target movement more precisely. Extensive experiments compared with 7 baselines, including both commercial tools and research methods on 8 metrics, suggest the superiority of our approach.

Abstract: Pansharpening aims to preserve the spectral information of the multi-spectral (MS) image while leveraging the high-frequency details from the guided high-resolution panchromatic (PAN) image to enhance its spatial resolution. The key challenge is how to preserve the spectral information from the MS image and the spatial details from the PAN image as much as possible. Diffusion models have achieved favorable results in image restoration and synthesis tasks but suffer from excessive computational resource and time consumption. In this paper, we design a novel and computationally efficient diffusion-based pan-sharpening network that achieves accelerated diffusion while reducing task complexity by decoupling the high and low-frequency components of the fused image. Specifically, leveraging the information-preserving characteristic of the wavelet transformation, we introduce a Wavelet-based Low-frequency Diffusion Model (WLDM). WLDM generates the low-frequency coefficient of high-resolution MS (HRMS) image from the low-resolution MS (LRMS) image. This approach significantly reduces computational resources and complexity compared to the direct restoration of the HRMS image. Furthermore, we have devised a High-frequency Information Restoration Module (HIRM) to restore the high-frequency information in the HRMS image through the interaction of high-frequency coefficients from the PAN image in three directions. Extensive experiments on three different datasets demonstrate that our method outperforms existing approaches in both quantitative metrics, qualitative metrics, and inference efficiency.

Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Beijing University of Technology, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University

Abstract: Textto-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the development of diffusion models recently. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods can not decompose the overall information into separate scenes, as well as fail to smoothly change scenes based on the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular network based module that well controls the camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments prove our proposed Modular-Cam's strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements. Generated results are available at https://modular-cam.github.io.

Abstract: Entrusted with the goal of pixellevel object classification, the semantic segmentation networks entails the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human efforts, recent few works have proposed to generate pairs of images and annotation masks by employing image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided Diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which however can provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention to the whole class from the seeds using multi-scale self-attention maps. We also observe that a simple-text-guided synthetic image often has a uniform background, which is easier to find correspondences, compared to complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without additional training procedure, prompt tuning, or a pre-trained segmentation network.

Abstract: Unsupervised Domain Adaptive (UDA) person search focuses on employing the model trained on a labeled source domain dataset to a target domain dataset without any additional annotations. Most effective UDA person search methods typically utilize the ground truth of the source domain and pseudolabels derived from clustering during the training process for domain adaptation. However, the performance of these approaches will be significantly restricted by the disrupting pseudo-labels resulting from inter-domain disparities. In this paper, we propose a Dual Self-Calibration (DSCA) framework for UDA person search that effectively eliminates the interference of noisy pseudo-labels by considering both the image-level and instance-level features perspectives. Specifically, we first present a simple yet effective Perception-Driven Adaptive Filter (PDAF) to adaptively predict a dynamic filter threshold based on input features. This threshold assists in eliminating noisy pseudo-boxes and other background interference, allowing our approach to focus on foreground targets and avoid indiscriminate domain adaptation. Besides, we further propose a Cluster Proxy Representation (CPR) module to enhance the update strategy of cluster representation, which mitigates the pollution of clusters from misidentified instances and effectively streamlines the training process for unlabeled target domains. With the above design, our method can achieve state-of-the-art (SOTA) performance on two benchmark datasets, with 80.2% mAP and 81.7% top-1 on the CUHK-SYSU dataset, with 39.9% mAP and 81.6% top-1 on the PRW dataset, which is comparable to or even exceeds the performance of some fully supervised methods.

Abstract: Video object detection has made significant progress in recent years thanks to convolutional neural networks (CNNs) and vision transformers (ViTs). Typically, CNNs excel at capturing local features but struggle to model global representations. Conversely, ViTs are adept at capturing longrange global features but face challenges in representing local feature details. Off-the-shelf video object detection methods solely rely on CNNs or ViTs to conduct feature aggregation, which hampers their capability to simultaneously leverage global and local information, thereby resulting in limited detection performance. In this paper, we propose a Transformer-GraphFormer Blender Network (TGBFormer) for video object detection, with three key technical improvements to fully exploit the advantages of transformers and graph convolutional networks while compensating for their limitations. First, we develop a spatial-temporal transformer module to aggregate global contextual information, constituting global representations with long-range feature dependencies. Second, we introduce a spatial-temporal GraphFormer module that utilizes local spatial and temporal relationships to aggregate features, generating new local representations that are complementary to the transformer outputs. Third, we design a global-local feature blender module to adaptively couple transformer-based global representations and GraphFormer-based local representations. Extensive experiments demonstrate that our TGBFormer establishes new state-of-the-art results on the ImageNet VID dataset. Particularly, our TGBFormer achieves 86.5% mAP while running at around 41.0 FPS on a single Tesla A100 GPU.

Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences Hong Kong University of Science and Technology, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Hong Kong University of Science and Technology, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Hong Kong University of Science and Technology

Abstract: Object pose estimation, crucial in computer vision and robotics applications, faces challenges with the diversity of unseen categories. We propose a zeroshot method to achieve category-level 6-DOF object pose estimation, which exploits both 2D and 3D universal features of input RGB-D image to establish semantic similarity-based correspondences and can be extended to unseen categories without additional model fine-tuning. Our method begins with combining efficient 2D universal features to find sparse correspondences between intra-category objects and gets initial coarse pose. To handle the correspondence degradation of 2D universal features if the pose deviates much from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities due to shape differences between intra-category objects, the coarse pose is refined by optimizing with dense alignment constraint of 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.

Abstract: Multifocus image fusion (MFIF) enhances depth of field in photography by generating an all-in-focus image from multiple images captured at different focal lengths. While deep learning has shown promise in MFIF, most existing methods overlooked the physical properties of defocus blurring in their network design, limiting their interoperability and generalization. This paper introduces a novel framework that integrates explicit defocus blur modelling into the MFIF process, improving both interpretability and performance. Using an atom-based spatially-varying parameterized defocus blurring model, our approach calculates pixel-wise defocus descriptors and initial focused images from multi-focus source images in a scale-recurrent manner to estimate soft decision maps. Fusion is then performed using masks derived from these decision maps, with special treatment for pixels likely defocused in all source images or near boundaries of defocused/focused regions. The model is trained with a fusion loss and a cross-scale defocus estimation loss. Extensive experiments on benchmark datasets demonstrated the effectiveness of our approach.

Abstract: AllWeather Image Restoration (AWIR) under adverse weather conditions is a challenging task due to the presence of different types of degradations. Prior research in this domain relies on extensive training data but lacks the utilization of additional contextual information for restoration guidance. Consequently, the performance of existing methods is limited by the degradation cues that are learnt from individual training samples. Recent advancements in visual in-context learning have introduced generalist models that are capable of addressing multiple computer vision tasks simultaneously by using the information present in the provided context as a prior. In this paper, we propose All-Weather Image Restoration using Visual In-Context Learning (AWRaCLe), a novel approach for AWIR that innovatively utilizes degradation-specific visual context information to steer the image restoration process. To achieve this, AWRaCLe incorporates Degradation Context Extraction (DCE) and Context Fusion (CF) to seamlessly integrate degradation-specific features from the context into an image restoration network. The proposed DCE and CF blocks leverage CLIP features and incorporate attention mechanisms to adeptly learn and fuse contextual information. These blocks are specifically designed for visual in-context learning under all-weather conditions and are crucial for effective context utilization. Through extensive experiments, we demonstrate the effectiveness of AWRaCLe for all-weather restoration and show that our method advances the state-of-the-art in AWIR.

Abstract: Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflectionaware appearance models to enhance NeRF's capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.

College of Computer Science, Beijing Information Science and Technology University, College of Computer Science, Beijing Information Science and Technology University, Key Laboratory of System Software (CAS), State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University Zhongguancun Laboratory, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, College of Computer Science, Beijing Information Science and Technology University

Abstract: As virtual experiences grow in popularity, the demand for realistic, personalized, and animatable human avatars increases. Traditional methods, relying on fixed templates, often produce costly avatars that lack expressiveness and realism. To overcome these challenges, we introduce Controllable Avatars generation via disentangled invertible networks (CtrlAvatar), a realtime framework for generating lifelike and customizable avatars. CtrlAvatar uses disentangled invertible networks to separate the deformation process into implicit body geometry and explicit texture components. This approach eliminates the need for repeated occupancy reconstruction, enabling detailed and coherent animations. The body geometry component ensures anatomical accuracy, while the texture component allows for complex, artifact-free clothing customization. This architecture ensures smooth integration between body movements and surface details. By optimizing transformations with position-varying offsets from the avatar’s initial Linear Blend Skinning vertices, CtrlAvatar achieves flexible, natural deformations that adapt to various scenarios. Extensive experiments show that CtrlAvatar outperforms other methods in quality, diversity, controllability, and cost-efficiency, marking a significant advancement in avatar generation.

Abstract: Large annotated datasets inevitably contain noisy labels, which poses a major challenge for training deep neural networks as they easily memorize the labels. Noiserobust loss functions have emerged as a notable strategy to counteract this issue, but it remains challenging to create a robust loss function which is not susceptible to underfitting. Through a quantitative approach, this paper explores the limited overlap between the network output at initialization and regions of non-vanishing gradients of bounded loss functions in the initial learning phase. Using these insights, we address underfitting of several noise robust losses with a novel method denoted as logit bias, which adds a real number epsilon to the logit at the position of the correct class. The logit bias enables these losses to achieve state-of-the-art results, even on datasets like WebVision, consisting of over a million images from 1000 classes. In addition, we demonstrate that our method can be used to determine optimal parameters for several loss functions – without having to train networks. Remarkably, our method determines the hyperparameters based on the number of classes, resulting in loss functions which require zero dataset or noise-dependent parameters.

Abstract: Image dehazing is a crucial task that involves the enhancement of degraded images to recover their sharpness and textures. While vision Transformers have exhibited impressive results in diverse dehazing tasks, their quadratic complexity and lack of dehazing priors pose significant drawbacks for realworld applications. In this paper, guided by triple priors, Bright Channel Prior (BCP), Dark Channel Prior (DCP), and Histogram Equalization (HE), we propose a Prior-guided Hierarchical Harmonization Network (PGHHNet) for image dehazing. PGHNet is built upon the UNet-like architecture with an efficient encoder and decoder, consisting of two module types: (1) Prior aggregation module that injects BCP/DCP and selects diverse contexts with gating attention. (2) Feature harmonization modules that subtract low-frequency components from spatial and channel aspects and learn more informative feature distributions to equalize the feature maps. Inspired by observing the sparsity of BCP/DCP and the histogram equalization, we harmonize the deep features using a histogram equation-guided module and further leverage BCP/DCP to guide spatial attention through a sandwich module as the bottleneck. Comprehensive experiments demonstrate that our model efficiently attains the highest level of performance among existing methods across four different datasets for image dehazing tasks.

Abstract: Connected component (CC) is a proper text shape representation that aligns with human reading intuition. However, CCbased text detection methods have recently faced a developmental bottleneck that their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method dispensing with post-processing entirely. Additionally, we observe that there is an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet. It consistently achieves state-of-the-art accuracy while holding highly competitive inference speed.

Abstract: Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Although recent DETRbased models have achieved notable progress by leveraging multiple learnable moment queries, they suffer from overlapped and redundant proposals, leading to inaccurate predictions. We attribute this limitation to the lack of task-related guidance for the learnable queries to serve a specific mode. Furthermore, the complex solution space generated by variable and open-vocabulary language descriptions complicates optimization, making it harder for learnable queries to adaptively distinguish each other, leading to more severe overlapped proposals. To address this limitation, we present the Region-Guided TRansformer (RGTR) for temporal sentence grounding, which introduces regional guidance to increase query diversity and eliminate overlapped proposals. Instead of using learnable queries, RGTR adopts a set of anchor pairs as moment queries to introduce explicit regional guidance. Each moment query takes charge of moment prediction for a specific temporal region, which reduces the optimization difficulty and ensures the diversity of the proposals. In addition, we design an IoU-aware scoring head to improve proposal quality. Extensive experiments demonstrate the effectiveness of RGTR, outperforming state-of-the-art methods on three public benchmarks and exhibiting good generalization and robustness on out-of-distribution splits.

Abstract: Sign Language Production (SLP) aims to generate semantically consistent sign videos from textual statements, where the conversion from textual glosses to sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign poses as discrete threedimensional coordinates and directly fit them, which overlooks the relative positional relationships among joints. To this end, we provide a new perspective, constraining joint associations and gesture details by modeling the limb bones to improve the accuracy and naturalness of the generated poses. In this work, we propose a pioneering iconicity disentangled diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap between relative positions among joints. The ID module disentangles the conventional 3D joint representation into a 4D bone representation, comprising the 3D spatial direction vector and 1D spatial distance vector between adjacent joints. Additionally, an Attribute Controllable Diffusion (ACD) module is introduced to further constrain joint associations, in which the attribute separation layer aims to separate the bone direction and length attributes, and the attribute control layer is designed to guide the pose generation by leveraging the above attributes. The ACD module utilizes the gloss embeddings as semantic conditions and finally generates sign poses from noise embeddings. Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the effectiveness of our method.

Abstract: Fewshot anomaly detection (FSAD) aims to detect unseen anomaly regions with the guidance of very few normal support images from the same class. Existing FSAD methods usually find anomalies by directly designing complex text prompts to align them with visual features under the prevailing large vision-language model paradigm. However, these methods, almost always, neglect intrinsic contextual information in visual features, e.g., the interaction relationships between different vision layers, which is an important clue for detecting anomalies comprehensively. To this end, we propose a kernel-aware graph prompt learning framework, termed as KAG-prompt, by reasoning the cross-layer relations among visual features for FSAD. Specifically, a kernel-aware hierarchical graph is built by taking the different layer features focusing on anomalous regions of different sizes as nodes, meanwhile, the relationships between arbitrary pairs of nodes stand for the edges of the graph. By message passing over this graph, KAG-prompt can capture cross-layer contextual information, thus leading to more accurate anomaly prediction. Moreover, to integrate the information of multiple important anomaly signals in the prediction map, we propose a novel image-level scoring method based on multi-level information fusion. Extensive experiments on MVTecAD and VisA datasets show that KAG-prompt achieves state-of-the-art FSAD results for image-level/pixel-level anomaly detection.

Abstract: This work presents ParGo, a novel PartialGlobal projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between the separately pre-trained vision encoders and the LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of our ParGo, highlighting its superiority in aligning vision and language modalities. Compared to conventional Q-Former projector, our ParGo achieves an improvement of 259.96 in MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability.

Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Microsoft Research Asia, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Abstract: Recent advances in diffusionbased generative models have demonstrated superior performance in subject-driven image generation. Identity (ID) preserving image generation, as a subtask of subject-driven image generation, aims to generate customized images for specific human identity and has broad application potential. However, this task remains challenging due to the requirement for high ID fidelity and precise detail preservation. Additionally, generating high-quality context presents another challenge, as existing methods struggle to achieve both high ID fidelity and satisfactory context simultaneously. To address the issues of insufficient ID fidelity, we introduce a simple yet effective test-time fine-tuning approach. Specifically, we propose an attribute-driven training method that establishes global-level and local-level tasks to learn the global face feature and fine-grained attribute features, respectively. Furthermore, we introduce a novel ID-context decoupling framework that decouples image context generation from human ID generation, ensuring the quality of contextual content as well as facilitating the learning of ID information. Through extensive experiments, we demonstrate the effectiveness of the proposed method and showcase its capabilities across various applications.

Tsinghua Shenzhen International Graduate School, Tsinghua University, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Tsinghua Shenzhen International Graduate School, Tsinghua University, Meituan, Beijing, Harbin Institute of Technology, Shenzhen Research Center of Artificial Intelligence, Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory

Abstract: Selfsupervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency.

Abstract: As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dualstream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that the InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.

Abstract: Current deep multimodal graph clustering methods primarily rely on Graph Neural Network (GNN) to fully exploit attribute features and graph structures, including message propagation and low-dimensional feature embedding. However, these methods lack further exploration of graph structural information, such as the relationship between nodes and shortest paths. Additionally, they may not sufficiently mine complementary information among multi-modal graph data. To address these issues, we propose a novel Deep Multi-modal Graph Clustering via Graph Transformer Network method, called DMGC-GTN. This method thoroughly dissects and utilizes graph structural information, applying graph smoothing to node features and incorporating various forms of embeddings into the transformer architecture. This achieves a unified embedding of graph structure and multi-modal feature attributes, fully exploiting the complementary information within multi-modal graph data. Extensive experiments demonstrate the effectiveness of our algorithm.

Deepwise AI Lab, Center on Frontiers of Computing Studies, School of Computer Science, Nat’l Eng. Research Center of Visual Technology, Peking University, Center on Frontiers of Computing Studies, School of Computer Science, Nat’l Eng. Research Center of Visual Technology, Peking University, Deepwise AI Lab, Deepwise AI Lab, Center on Frontiers of Computing Studies, School of Computer Science, Nat’l Eng. Research Center of Visual Technology, Peking University State Key Lab of General Artificial Intelligence, Inst. for Artificial Intelligence, Peking University, School of Computing and Data Science, The University of Hong Kong

Abstract: Threedimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced when considering the variability across different organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information and extract meaningful insights from these images remains an open challenge to the community. While current self-supervised learning methods have shown potential, they often consider an image as a whole thereby overlooking the extensive, complex relationships among local regions from one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by the superior performance over others on nine downstream tasks in public datasets.

Abstract: Style transfer enables the seamless integration of artistic styles from a style image into a content image, resulting in visually striking and aesthetically enriched outputs. Despite numerous advances in this field, existing methods did not explicitly focus on the signature style, which represents the distinct and recognizable visual traits of the image such as geometric and structural patterns, color palettes and brush strokes etc. In this paper, we introduce SigStyle, a framework that leverages the semantic priors that embedded in a personalized textto-image diffusion model to capture the signature style representation. This style capture process is powered by a hypernetwork that efficiently fine-tunes the diffusion model for any given single style image. Style transfer then is conceptualized as the reconstruction process of content image through learned style tokens from the personalized diffusion model. Additionally, to ensure the content consistency throughout the style transfer process, we introduce a time-aware attention swapping technique that incorporates content information from the original image into the early denoising steps of target image generation. Beyond enabling high-quality signature style transfer across a wide range of styles, SigStyle supports multiple interesting applications, such as local style transfer, texture transfer, style fusion and style-guided text-to-image generation. Quantitative and qualitative evaluations demonstrate our approach outperforms existing style transfer methods for recognizing and transferring the signature styles.

Abstract: Existing VisionLanguage Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples(NAS), a refined vision-language pretraining model that innovatively incorporates NAS to specifically address the challenge of fine-grained understanding. NAS utilizes a Visual Dictionary(VD) as a semantic bridge between visual and linguistic domains. Additionally, it employs a Negative Visual Augmentation(NVA) method based on the VD to generate challenging negative image samples. These samples deviate from positive samples exclusively at the token level, thereby necessitating that the model discerns the subtle disparities between positive and negative samples with greater precision. Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension.

Abstract: Selfsupervised stereo matching has drawn attention due to its ability to estimate disparity without needing ground-truth data. However, existing self-supervised stereo matching methods heavily rely on the photo-metric consistency assumption, which is vulnerable to natural disturbances, resulting in ambiguous supervision and inferior performance compared to the supervised ones. To relax the limitation of the photo-metric consistency assumption and even bypass this assumption, we propose a novel self-supervised framework named DualNet, which consists of two key steps: robust self-supervised teacher learning and pseudo-label supervised student training. Specifically, the teacher model is first trained in a self-supervised manner with a focus on feature-metric consistency and data augmentation consistency. Then, the output of the teacher model is geometrically constrained to obtain high-quality pseudo labels. Benefiting from these high-quality pseudo labels, the student model can outperform its teacher model by a large margin. With the two well-designed steps, the proposed framework DualNet ranks 1st among all self-supervised methods on multiple benchmarks, surprisingly even outperforming several supervised counterparts.

Abstract: Rendering photorealistic head avatars from arbitrary viewpoints is crucial for various applications like virtual reality. Although previous methods based on Neural Radiance Fields (NeRF) can achieve impressive results, they lack fidelity and efficiency. Recent methods using 3D Gaussian Splatting (3DGS) have improved rendering quality and realtime performance but still require significant storage overhead. In this paper, we introduce a method called GraphAvatar that utilizes Graph Neural Networks (GNN) to generate 3D Gaussians for the head avatar. Specifically, GraphAvatar trains a geometric GNN and an appearance GNN to generate the attributes of the 3D Gaussians from the tracked mesh. Therefore, our method can store the GNN models instead of the 3D Gaussians, significantly reducing the storage overhead to just 10MB. To reduce the impact of face-tracking errors, we also present a novel graph-guided optimization module to refine face-tracking parameters during training. Finally, we introduce a 3D-aware enhancer for post-processing to enhance the rendering quality. We conduct comprehensive experiments to demonstrate the advantages of GraphAvatar, surpassing existing methods in visual fidelity and storage consumption. The ablation study sheds light on the trade-offs between rendering quality and model size.

Abstract: Incorporating uncertainty is crucial to provide trustworthy explanations of deep learning models. Recent works have demonstrated how uncertainty modeling can be particularly important in the unsupervised field of representation learning explainable artificial intelligence (RXAI). Current R-XAI methods provide uncertainty by measuring variability in the importance score. However, they fail to provide meaningful estimates of whether a pixel is certainly important or not. In this work, we propose a new R-XAI method called REPEAT that addresses the key question of whether or not a pixel is certainly important. REPEAT leverages the stochasticity of current R-XAI methods to produce multiple estimates of importance, thus considering each pixel in an image as a Bernoulli random variable that is either important or unimportant. From these Bernoulli random variables we can directly estimate the importance of a pixel and its associated certainty, thus enabling users to determine certainty in pixel importance. Our extensive evaluation shows that REPEAT gives certainty estimates that are more intuitive, better at detecting out-of-distribution data, and more concise.

Abstract: Recent advances in textto-image personalization have enabled high-quality and controllable image synthesis for user-provided concepts. However, existing methods still struggle to balance identity preservation with text alignment. Our approach is based on the fact that generating prompt-aligned images requires a precise semantic understanding of the prompt, which involves accurately processing the interactions between the new concept and its surrounding context tokens within the CLIP text encoder. To address this, we aim to embed the new concept properly into the input embedding space of the text encoder, allowing for seamless integration with existing tokens. We introduce Context Regularization (CoRe), which enhances the learning of the new concept's text embedding by regularizing its context tokens in the prompt. This is based on the insight that appropriate output vectors of the text encoder for the context tokens can only be achieved if the new concept's text embedding is correctly learned. CoRe can be applied to arbitrary prompts without requiring the generation of corresponding images, thus improving the generalization of the learned text embedding. Additionally, CoRe can serve as a test-time optimization technique to further enhance the generations for specific prompts. Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment.

Abstract: Allin-one image restoration is a fundamental low-level vision task with significant real-world applications. The primary challenge lies in addressing diverse degradations within a single model. While current methods primarily exploit task prior information to guide the restoration models, they typically employ uniform multi-task learning, overlooking the heterogeneity in model optimization across different degradation tasks. To eliminate the bias, we propose a task-aware optimization strategy, that introduces adaptive task-specific regularization for multi-task image restoration learning. Specifically, our method dynamically weights and balances losses for different restoration tasks during training, encouraging the implementation of the most reasonable optimization route. In this way, we can achieve more robust and effective model training. Notably, our approach can serve as a plug-and-play strategy to enhance existing models without requiring modifications during inference. Extensive experiments in diverse all-in-one restoration settings demonstrate the superiority and generalization of our approach. For example, AirNet retrained with TUR achieves average improvements of 1.16 dB on three distinct tasks and 1.81 dB on five distinct all-in-one tasks. These results underscore TUR's effectiveness in advancing the SOTAs in all-in-one image restoration, paving the way for more robust and versatile image restoration.

Abstract: Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various visionlanguage tasks. However, MLLMs face significant challenges with hallucinations, and misleading outputs that do not align with the input data. While existing efforts are paid to combat MLLM hallucinations, several pivotal challenges are still unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level requiring factual commonsense can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which is yet a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and cause hallucinations, while unfortunately, this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.

Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China 01AI, Beijing, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China 01AI, Beijing, China Tsinghua University, Shenzhen, Guangdong, China, 01AI, Beijing, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, 01AI, Beijing, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, Sun Yat-sen University, Guangzhou, Guangdong, China, Tsinghua University, Shenzhen, Guangdong, China, Shenzhen University, Shenzhen, Guangdong, China Carleton University, Canada

Abstract: Posecontrolled human video generation is of significant interest and finds extensive applications in areas such as automated advertising and content creation on social media platforms. While existing methods employing pose sequences and reference images for human image animation have exhibited notable performance, they tend to encounter issues such as specific region blurring, background sharpening, and decreased identity consistency. In this paper, we introduce ReMask-Animate, which utilizes masks as additional priors to guide the model's local visual attention to specific areas, thereby alleviating feature confusion between different regions of the image. Three distinct mask-guided adapters are designed for cross-condition regional fusion of hand and face pose features, mitigating feature confusion between the foreground and background, and enhancing the visual consistency of character identity. Moreover, these lightweight adapters introduce minimal computational overhead and can be seamlessly integrated into specific layers of the backbone architecture. Extensive experiments show that our method outperforms state-of-the-art methods on five metrics in public datasets. Additionally, qualitative evaluations highlight a significant improvement in the quality of generated videos, demonstrating our approach's superiority.

Abstract: Recently, foundational models have significantly advanced in different tasks, accompanied by Transformer as the general backbone. However, Transformer's quadratic complexity poses challenges for handling longer sequences and higher resolution images, which may limit foundational models further development. To alleviate this issue, various efficient State Space Models (SSMs) like Mamba have emerged, initially matching Transformer performance and gradually surpassing it. To improve the performance of SSMs in computer vision tasks, one crucial viewpoint is effective serialization of images. Existing vision Mambas, which rely on a linear scanning mechanism, often struggle to capture complex spatial relationships in 2D images. This results in feature loss during serialization and negatively impacts model performance. To overcome this limitation, we propose the use of fractal scanning curves for image serialization to enhance the Mambas’ ability to accurately model complex spatial dependencies. Additionally, unlike existing vision Mambas, which are designed with various curve scanning directions that increase the complexity, contradicting the original intent of Mamba to enhance model performance. We novelty introduce the Fractal Fusion Pathway (FFP) for our FractalMamba, which can enhance its performance efficiently. Extensive experiments underscore the superiority of our proposed FractalMamba.

Department of Film and Television Engineering, Shanghai University Shanghai Engineering Research Center of Motion Picture Special Effects, Department of Film and Television Engineering, Shanghai University, Department of Film and Television Engineering, Shanghai University, Department of Film and Television Engineering, Shanghai University Shanghai Engineering Research Center of Motion Picture Special Effects, AI Lab, Giant Network, School of Information Science and Technology, ShanghaiTech University

Abstract: Fashion design is a challenging and complex process. Recent works on fashion generation and editing are all agnostic of the actual fashion design process, which limits their usage in practice. In this paper, we propose a novel hierarchical diffusionbased framework tailored for fashion design, coined as HieraFashDiff. Our model is designed to mimic the practical fashion design workflow, by unraveling the denosing process into two successive stages: 1) an ideation stage that generates design proposals given high-level concepts and 2) an iteration stage that continuously refines the proposals using low-level attributes. Our model supports fashion design generation and fine-grained local editing in a single framework. To train our model, we contribute a new dataset of full-body fashion images annotated with hierarchical text descriptions. Extensive evaluations show that, as compared to prior approaches, our method can generate fashion designs and edited results with higher fidelity and better prompt adherence, showing its promising potential to augment the practical fashion design workflow.

Abstract: The field of Autonomous Driving (AD) has witnessed significant progress in recent years. Among the various challenges faced, the safety evaluation of autonomous vehicles (AVs) stands out as a critical concern. Traditional evaluation methods are both costly and inefficient, often requiring extensive driving mileage in order to encounter rare safetycritical scenarios, which are distributed on the long tail of the complex real-world driving landscape. In this paper, we propose a unified approach, Diffusion-Based Safety-Critical Scenario Generation (DiffScene), to generate high-quality safety-critical scenarios which are both realistic and safety-critical for efficient AV evaluation. In particular, we propose a diffusion-based generation framework, leveraging the power of approximating the distribution of low-density spaces for diffusion models. We design several adversarial optimization objectives to guide the diffusion generation under predefined adversarial budgets. These objectives, such as safety-based objective, functionality-based objective, and constraint-based objective, ensure the generation of safety-critical scenarios while adhering to specific constraints. Extensive experimentation has been conducted to validate the efficacy of our approach. Compared with 6 SOTA baselines, DiffScene generates scenarios that are (1) more safety-critical under 3 metrics, (2) more realistic under 5 distance functions, and (3) more transferable to different AV algorithms. In addition, we demonstrate that training AV algorithms with scenarios generated by DiffScene leads to significantly higher performance in terms of the safety-critical metrics compared to baselines. These findings highlight the potential of DiffScene in addressing the challenges of AV safety evaluation, paving the way for safer AV development.

School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University Peng Cheng Laboratory, Peng Cheng Laboratory School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University Nio Inc., School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Peng Cheng Laboratory

Abstract: Occupancy networks aim to reconstruct the surroundings with occupied semantic voxels. However, frequent object occlusions often occur in dynamic realworld scenarios, which cannot be captured by independent frames. Most existing occupancy networks generate results without explicitly considering past occupancy states and continuous visual changes over time, limiting their temporal accuracy. We tackle it by treating the task from a new continuous updating perspective, which considers historical data and continuous motion clues. We propose a new approach termed Continuous Motion clue exploitation for Occupancy Prediction (CMOP), which incorporates three key designs: (i) Propagator: which forecasts future occupancy states based on historical data; (ii) Tracker: which updates the occupancy on a per-frame basis using dynamic visual motion information; and (iii) Fuser: which aggregates results from the Propagator and Tracker into more robust and accurate occupancy results. Experiments on several benchmarks demonstrate that CMOP outperforms state-of-the-art baselines.

Abstract: Shadows can originate from occlusions in both direct and indirect illumination. Although most current shadow removal research focuses on shadows caused by direct illumination, shadows from indirect illumination are often just as pervasive, particularly in indoor scenes. A significant challenge in removing shadows from indirect illumination is obtaining shadowfree images to train the shadow removal network. To overcome this challenge, we propose a novel rendering pipeline for generating shadowed and shadow-free images under direct and indirect illumination, and create a comprehensive synthetic dataset that contains over 30,000 image pairs, covering various object types and lighting conditions. We also propose an innovative shadow removal network that explicitly integrates semantic and geometric priors through concatenation and attention mechanisms. The experiments show that our method outperforms state-of-the-art shadow removal techniques and can effectively generalize to indoor and outdoor scenes under various lighting conditions, enhancing the overall effectiveness and applicability of shadow removal methods.

Abstract: With the emergence of visionlanguage pre-trained models, such as CLIP, some textual prompts have been gradually introduced recently into re-identification (Re-ID) tasks to obtain considerably robust multimodal information. However, most textual descriptions based on vehicle Re-ID tasks only contain identity index words without specific words to describe vehicle view information, thereby resulting in difficulty to be widely applied in vehicle Re-ID tasks with view variations. This case inspires us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template called view-aware context optimization (ViewCoOp) based on dynamic multi-view word embeddings, which can fully obtain the proportion and position encoding of each view in the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore the connections between inter-modal and intra-modal. Each sample is treated as a graph node, which extracts textual features based on ViewCoOp and the visual features of images. Moreover, leveraging the inter-cluster and intra-cluster correlation in the bimodal clustering results in the determination of connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method utilizes supervised information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method is capable of effectively obtaining cross-modal description ability from multiple views.

Abstract: Neural Radiance Fields (NeRF) has achieved remarkable success in synthesizing impressive novel views. However, existing methods usually fail to handle scenes with adverse lighting conditions caused by external time variations and different camera settings, leading to poor visual quality. To address this challenge, we propose a physicalaware NeRF for efficient exposure correction, named PHY-NeRF. Specifically, we design Adaptive Lighting Particles inspired by the theory of light scattering and absorption, which can adjust the illumination intensity during volume rendering. Subsequently, we can handle scenes with different lighting conditions by jointly optimizing camera parameters and these lighting particles. Moreover, to promote natural brightness transitions, we devise a global illumination consistency module to control the lighting intensity across views at the feature level while completing more details. Benefiting from the above designs, our PHY-NeRF can tackle arbitrary low-light or overexposed scenes in an unsupervised manner. Extensive experiments show that our PHY-NeRF achieves state-of-the-art results in addressing adverse lighting problems while ensuring high rendering efficiency.

Abstract: The concept of function and affordance is a critical aspect of 3D scene understanding and supports taskoriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we develop a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances and also object-specific affordances for each object. We employ a Transformer Based Hierarchical Scene Understanding (TB-HSU) model to learn the 3DHSG. We use a multi-task learning framework that learns both room classification and learns to define spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.

Abstract: We propose a novel hybrid calibrationfree method FreeCap to accurately capture global multi-person motions in open environments. Our system combines a single LiDAR with expandable moving cameras, allowing for flexible and precise motion estimation in a unified world coordinate. In particular, We introduce a local-to-global pose-aware cross-sensor human-matching module that predicts the alignment among each sensor, even in the absence of calibration. Additionally, our coarse-to-fine sensor-expandable pose optimizer further optimizes the 3D human key points and the alignments, it is also capable of incorporating additional cameras to enhance accuracy. Extensive experiments on Human-M3 and FreeMotion datasets demonstrate that our method significantly outperforms state-of-the-art single-modal methods, offering an expandable and efficient solution for multi-person motion capture across various applications.

Abstract: Dataset distillation (DD) allows datasets to be distilled to fractions of their original size while preserving the rich distributional information so that models trained on the distilled datasets can achieve a comparable accuracy while saving significant computational loads. Recent research in this area has been focusing on improving the accuracy of models trained on distilled datasets. In this paper, we aim to explore a new perspective of DD. We study how to embed adversarial robustness in distilled datasets, so that models trained on these datasets maintain the high accuracy and meanwhile acquire better adversarial robustness. We propose a new method that achieves this goal by incorporating curvature regularization into the distillation process with much less computational overhead than standard adversarial training. Extensive empirical experiments suggest that our method not only outperforms standard adversarial training on both accuracy and robustness with less computation overhead but is also capable of generating robust distilled datasets that can withstand various adversarial attacks.

Abstract: Existing research on humancentric video understanding typically focuses on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos through language queries. This task requires not only a deep semantic understanding of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, duration labels and detailed textual descriptions. These descriptions cover three critical elements: (1) Visual content; (2) Fine-grained actions; and (3) Human pose descriptions. Together, these elements provide the necessary precision to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) ShotGPT4o Dataset, which is algorithmically generated by GPT-4o and (ii) Image-SMPLText Dataset, which features large-scale and accurate per-frame pose descriptions using PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL, specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.

School of Cybersecurity, Northwestern Polytechnical University AI Business, Alibaba Group, AI Business, Alibaba Group, AI Business, Alibaba Group, College of Computer Science and Technology, Zhejiang University, AI Business, Alibaba Group, AI Business, Alibaba Group, College of Information and Control Engineering, Xi’an University of Architecture and Technology, School of Computer Science, Northwestern Polytechnical University, School of Cybersecurity, Northwestern Polytechnical University, College of Computer Science and Technology, Zhejiang University

Abstract: Currently, inspired by the success of visionlanguage models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our proposed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.

Abstract: Despite significant progress has been made in image deraining, most existing methods are limited to handling only a single type of rain degradation or a specific pattern of rain. However, realworld rain scenarios tend to contain diverse rainy patterns due to variations in the rainfall process and lighting conditions. To address this dilemma and advance this field, we introduce a new task: Universal Rainy Image Restoration (URIR), which aims to handle multiple types of rain degradation on a single model. To benchmark this task, we construct a high-quality dataset called URIR-8K, which contains four patterns: rain streak, raindrop, rain accumulation and nighttime rain. Building upon this dataset, we present a comprehensive study on existing approaches by evaluating their universal deraining capabilities and their effect on downstream object detection task. In addition, we design a multi-scale vision Mamba as a baseline model, leveraging the benefits of multi-scale learning for its robustness to diverse rain appearances. Unlike existing methods that use fixed-scale scanning for feature extraction, we employ a multi-scale 2D scanning technique to better help image restoration in the richer scale space. Extensive experimental analysis shows the potential of our proposed task and the effectiveness of our model.

Abstract: To enhance autonomous driving, innovative approaches have been proposed to generate simulated LiDAR data. However, these methods often face challenges in producing highquality and controllable foreground objects. To cater to the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating controllable and high-fidelity LiDAR data at both the object and scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable and diverse output at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad efficacy of OLiDM is demonstrated across both unconditional and conditional LiDAR generation tasks, as well as 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 11.8 in FPD, producing data that closely mirrors real-world distributions. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, in 3D object detection, OLiDM enhances the performance of mainstream detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing 3D perception models.

Abstract: Diffusion models represent the stateof-the-art in generative modeling. Due to their high training costs, many works leverage pre-trained diffusion models' powerful representations for downstream tasks, such as face super-resolution (FSR), through fine-tuning or prior-based methods. However, relying solely on priors without supervised training makes it challenging to meet the pixel-level accuracy requirements of discrimination task. Although prior-based methods can achieve high fidelity and high-quality results, ensuring consistency remains a significant challenge. In this paper, we propose a masking strategy with strong and weak constraints and iterative refinement for real-world FSR, termed Diffusion Prior Interpolation (DPI). We introduce conditions and constraints on consistency by masking different sampling stages based on the structural characteristics of the face. Furthermore, we propose a condition Corrector (CRT) to establish a reciprocal posterior sampling process. DPI can balance consistency and diversity and can be seamlessly integrated into pre-trained models. In extensive experiments conducted on synthetic and real datasets, along with consistency validation in face recognition, DPI demonstrates superiority over SOTA FSR methods.

Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University Shanghai Collaborative Innovation Center on Intelligent Visual Computing Institute of Information Engineering, CAS, School of Computer Science and Technology, Beijing JiaoTong University Institute of Information Engineering, CAS Guangdong Provincial Key Lab of Intell. Info. Processing & Shenzhen Key Lab of Media Security, Shenzhen University, Institute of Information Engineering, CAS, Information Research Center of Military Science, PLA Academy of Military Science, National Key Laboratory of Science and Technology on Information System Security, School of Computer Science and Engineering, Beihang University, School of Artificial Intelligence, Hebei University of Technology

Abstract: Scene Graph Generation (SGG) aims to detect all objects and identify their pairwise relationships existing in the scene. Considering the substantial human labor costs, existing scene graph annotations are often sparse and biased, which result in confusion training with lowfrequency predicates. In this work, we design a Semi-Supervised Clustering framework for Scene Graph Generation (SSC-SGG) that uses the sparse labeled data to guide the generation of effective pseudo-labels from unlabeled object pairs, thus enriching the labeled sample space, especially for low-frequency interaction samples. We approach from the perspective of clustering, reducing the problem of confirmation bias in a self-training manner. Specifically, we first enhance the model's robustness to feature extraction via prototype-based clustering, aggregating different relationship augmented features onto the same prototype. Secondly, we design a dynamic pseudo-label assignment algorithm based on a mini-batch, which adjusts the detection sensitivity to different frequency samples from the historical assignment. Finally, we conduct joint training on the pseudo-labels and the labeled data. We conduct experiments on various SGG models and achieve substantial overall performance improvements, demonstrating the effectiveness of SSC-SGG.

Abstract: Synthetic aperture radar (SAR) object detection requires accurate identification and localization of targets at various scales within SAR images. However, background clutter and speckle noise can obscure key features and mislead the knowledge distillation process. To address these challenges, we introduce the Dual Information Purification Knowledge Distillation (DIPKD) method, which improves the performance of the student model through three key strategies: denoising, enrichment, and decoupling. First, our Selective Noise Suppression (SNS) technique reduces speckle noise in global features by minimizing misleading information from the teacher model. Second, the Knowledge Level Decoupling (KLD) module separates features into target and nontarget knowledge, balancing feature mapping and reducing background noise to enhance the extraction of critical information for the student model. Finally, the Reverse Information Transfer (RIT) module refines intermediate features in the student model, compensating for the loss of detailed local information. Experimental results demonstrate that DIPKD significantly outperforms existing distillation techniques in SAR object detection, achieving 60.2% and 51.4% mAP scores on the SSDD and HRSID datasets, respectively. Additionally, the student model shows performance improvements of 1.3% and 2.9% over the teacher model, highlighting the effectiveness of the information purification approach.

Abstract: We introduce a wearable driving status recognition device and our opensource dataset, along with a new real-time method robust to changes in lighting conditions for identifying driving status from eye observations of drivers. The core of our method is generating event frames from conventional intensity frames, and the other is a newly designed Attention Driving State Network (ADSN). Compared to event cameras, conventional cameras offer complete information and lower hardware costs, enabling captured frames to encode rich spatial information. However, these textures lack temporal information, posing challenges in effectively identifying driving status. DriveGazen addresses this issue from three perspectives. First, we utilize video frames to generate realistic synthetic dynamic vision sensor (DVS) events.Second, we adopt a spiking neural network to decode pertinent temporal information. Lastly, ADSN extracts crucial spatial cues from corresponding intensity frames and conveys spatial attention to convolutional spiking layers during both training and inference through a novel guide attention module to guide the feature learning and feature enhancement of the event frame. We specifically collected the Driving Status (DriveGaze) dataset to demonstrate the effectiveness of our approach. Additionally, we validate the superiority of the DriveGazen on the Single-eye Event-based Emotion (SEE) dataset. To the best of our knowledge, our method is the first to utilize guide attention spiking neural networks and eye-based event frames generated from conventional cameras for driving status recognition.Please refer to our project page and supplementary materials for more details.

Abstract: In this paper, we propose FreqTS, a novel FrequencyAware Token Selection approach for accelerating diffusion models without requiring retraining. Diffusion models have gained significant attention in the field of image synthesis due to their impressive generative capabilities. However, these models often suffer from high computational costs, primarily due to the sequential denoising process and large model size. Additionally, diffusion models tend to prioritize low-frequency features, leading to sub-optimal quantitative results. To address these challenges, FreqTS introduces an amplitude-based sorting method that separates Token features in the frequency domain of diffusion models into high-frequency and low-frequency subsets. It then utilizes fast Token Selection to reduce the presence of low-frequency features, effectively reducing the computational overhead. Moreover, FreqTS incorporates a Bayesian hyper-parameter search to dynamically assign different selection strategies for various denoising processes. Extensive experiments conducted on Stable Diffusion series models, PixArt-Alpha, LCM, and other models demonstrate that FreqTS achieves a minimum acceleration of 2.3× without the need for retraining. Furthermore, FreqTS showcases its versatility by being applicable to different sampling techniques and compatible with other dimension-specific acceleration algorithms.

Abstract: Extracting lane topology from perspective views (PV) is crucial for planning and control in autonomous driving. This approach extracts potential drivable trajectories for selfdriving vehicles without relying on high-definition (HD) maps. However, the unordered nature and weak long-range perception of the DETR-like framework can result in misaligned segment endpoints and limited topological prediction capabilities. Inspired by the learning of contextual relationships in language models, the connectivity relations in roads can be characterized as explicit topology sequences. In this paper, we introduce Topo2Seq, a novel approach for enhancing topology reasoning via topology sequences learning. The core concept of Topo2Seq is a randomized order prompt-to-sequence learning between lane segment decoder and topology sequence decoder. The dual-decoder branches simultaneously learn the lane topology sequences extracted from the Directed Acyclic Graph (DAG) and the lane graph containing geometric information. Randomized order prompt-to-sequence learning extracts unordered key points from the lane graph predicted by the lane segment decoder, which are then fed into the prompt design of the topology sequence decoder to reconstruct an ordered and complete lane graph. In this way, the lane segment decoder learns powerful long-range perception and accurate topological reasoning from the topology sequence decoder. Notably, topology sequence decoder is only introduced during training and does not affect the inference efficiency. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of Topo2Seq in topology reasoning.

Abstract: We present RSDiffusion, the first Diffusion Models-based method for single-frame Rolling Shutter (RS) correction. RS artifacts compromise visual quality of frames due to the row-wise exposure of CMOS sensors. Most previous methods have focused on multi-frame approaches, using temporal information from consecutive frames for the motion rectification. However, few approaches address the more challenging but important single frame RS correction. In this work, we present an ``image-to-motion" framework via diffusion techniques, with a designed patch-attention module. In addition, we present the RS-Real dataset, comprised of captured RS frames alongside their corresponding Global Shutter (GS) ground-truth pairs. The GS frames are corrected from the RS ones, guided by the corresponding Inertial Measurement Unit (IMU) gyroscope data acquired during capture. Experiments show that RS-Diffusion surpasses previous single-frame RS methods, demonstrates the potential of diffusion-based approaches, and provides a valuable dataset for further research.

Abstract: Historical documents encompass a wealth of cultural treasures but suffer from severe damages including character missing, paper damage, and ink erosion over time. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a largescale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained on HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real scenarios. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations.

Abstract: Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling. This is because UAVMOT faces both local object motion and global camera motion. Motion blur also increases the difficulty of detecting large moving objects. Previous UAV motion modeling approaches either focus only on local motion or ignore motion blurring effects, thus limiting their tracking performance and speed. To address these issues, we propose the Motion Mamba Module, which explores both local and global motion features through cross-correlation and bi-directional Mamba Modules for better motion modeling. To address the detection difficulties caused by motion blur, we also design motion margin loss to effectively improve the detection accuracy of motion blurred objects. Based on the Motion Mamba module and motion margin loss, our proposed MM-Tracker surpasses the state-of-the-art in two widely open-source UAV-MOT datasets.

Abstract: Realworld image dehazing remains a challenging task due to the diverse nature of haze degradation and the lack of large-scale paired datasets. Existing methods based on hand-crafted priors or generative priors struggle to recover accurate backgrounds and fine details from dense haze regions. In this work, we propose a novel paradigm, PromptHaze, for real-world image dehazing via the depth prompt from the Depth Anything model. By employing a prompt-by-prompt strategy, our method iteratively updates the depth prompt and progressively restores the background through a dehazing network with controllable dehazing strength. Extensive experiments on widely-used real-world dehazing benchmarks demonstrate the superiority of PromptHaze in recovering authentic backgrounds and fine details from various haze scenes, outperforming state-of-the-art methods across multiple quality metrics.

Abstract: Highquality, pixel-level annotated datasets are crucial for training deep learning models, while their creation is often labor-intensive, time-consuming, and costly. Generative diffusion models have then gained prominence for producing synthetic datasets, yet existing text-to-data methods struggle with generating complex scenes involving multiple objects and intricate spatial arrangements. To address these limitations, we introduce FlexDataset, a framework that pioneers the composition-to-data (C2D) paradigm. FlexDataset generates high-fidelity synthetic datasets with versatile annotations, tailored for tasks like salient object detection, depth estimation, and segmentation. Leveraging a meticulously designed composition-to-image (C2I) framework, it offers precise positional and categorical control. Our Versatile Annotation Generation (VAG) Plan A further enhances efficiency by exploiting rich latent representations through tuned perception decoders, reducing annotation time by nearly fivefold. FlexDataset allows unlimited generation of customized, multi-instance and multi-category (MIMC) annotated data. Extensive experiments show that FlexDataset sets a new standard in synthetic dataset generation across multiple datasets and tasks, including zero-shot and long-tail scenarios.

Abstract: In this paper, we explore how to develop salient object detection models using adder neural networks (ANNs), which are more energy efficient than convolutional neural networks (CNNs), especially for realworld applications. Based on our empirical studies, we show that directly replacing the convolutions in CNN-based models with adder layers leads to a substantial loss of activations in the decoder part. This makes the feature maps learned in the decoder lack pattern diversity and hence results in a significant performance drop. To alleviate this issue, by investigating the statistics of the feature maps produced by adder layers, we introduce a simple yet effective differential merging strategy to augment the feature representations learned by adder layers and present a simple baseline for SOD using ANNs. Experiments on popular salient object detection benchmarks demonstrate that our proposed method with a simple feature pyramid network (FPN) architecture achieves comparable performance to previous state-of-theart CNN-based models and consumes much less energy. We hope this work could facilitate the development of ANNs in binary segmentation tasks.

Abstract: Person ReIDentification (ReID) aims to identify specific persons from non-overlapping cameras. Recently, some works have suggested using large-scale pre-trained vision-language models like CLIP to boost ReID performance. Unfortunately, existing methods still struggle to address two key issues simultaneously: efficiently transferring the knowledge learned from CLIP and comprehensively extracting the context information from images or videos. To address these issues, we introduce CLIMB-ReID, a pioneering hybrid framework that synergizes the impressive power of CLIP with the remarkable computational efficiency of Mamba. Specifically, we first propose a novel Multi-Memory Collaboration (MMC) strategy to transfer CLIP's knowledge in a parameter-free and prompt-free form. Then, we design a Multi-Temporal Mamba (MTM) to capture multi-granular spatiotemporal information in videos. Finally, with Importance-aware Reorder Mamba (IRM), information from various scales is combined to produce robust sequence features. Extensive experiments show that our proposed method outperforms other state-of-the-art methods on both image and video person ReID benchmarks.

Abstract: Recent Large VisionLanguage Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance for cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), where a visual-text cross-lingual alignment is built by maximizing mutual information between the model's outputs and visual information. This is achieved by distilling knowledge from monolingual to cross-lingual settings through KL divergence minimization, where monolingual output logits serve as a teacher. Experimental results on the XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, shedding new light on the potential practice for improving LVLMs.

Abstract: Diffusion models have achieved cuttingedge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.

Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing ByteDance Inc., Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, ByteDance Inc., Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Abstract: Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on textual conditions. However, the generation process of diffusion models involves denoising dozens of steps to produce photorealistic images/videos, which is computationally expensive. Unlike previous methods that design ``onesize-fits-all'' approaches for speed up, we argue denoising steps should be sample-specific conditioned on the richness of input texts. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function, balancing inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline using a fixed 50 denoising steps while reducing inference time by at least 33%, going as high as 40%. Furthermore, our method can be used on top of other acceleration methods to provide further speed benefits. Lastly, qualitative analysis shows that AdaDiff allocates more steps to more informative prompts and fewer steps to simpler prompts.

Abstract: We propose symmetric power transformation to enhance the capacity of Implicit Neural Representation (INR) from the perspective of data transformation. Unlike prior work utilizing random permutation or index rearrangement, our method features a reversible operation that does not require additional storage consumption. Specifically, we first investigate the characteristics of data that can benefit the training of INR, proposing the RangeDefined Symmetric Hypothesis, which posits that specific range and symmetry can improve the expressive ability of INR. Based on this hypothesis, we propose a nonlinear symmetric power transformation to achieve both range-defined and symmetric properties simultaneously. We use the power coefficient to redistribute data to approximate symmetry within the target range. To improve the robustness of the transformation, we further design deviation-aware calibration and adaptive soft boundary to address issues of extreme deviation boosting and continuity breaking. Extensive experiments are conducted to verify the performance of the proposed method, demonstrating that our transformation can reliably improve INR compared with other data transformations. We also conduct 1D audio, 2D image and 3D video fitting tasks to demonstrate the effectiveness and applicability of our method.

Abstract: Realtime object detection is critical for the decision-making process for many real-world applications, such as collision avoidance and path planning in autonomous driving. This work presents an innovative real-time streaming perception method, Transtreaming, which addresses the challenge of real-time object detection with dynamic computational delays. The core innovation of Transtreaming lies in its adaptive delay-aware transformer, which can concurrently predict multiple future frames and select the output that best matches the real-world present time, compensating for any system-induced computational delays. The proposed model outperforms existing state-of-the-art methods, even in single-frame detection scenarios, by leveraging a transformer-based methodology. It demonstrates robust performance across a range of devices, from powerful V100 to modest 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods that struggle to complete computation within a single frame on less powerful devices, Transtreaming meets the stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving.

Abstract: We present InstantSticker, a disentangled reconstruction pipeline based on ImageBased Lighting (IBL), which focuses on highly realistic decal blending, simulates stickers attached to the reconstructed surface, and allows for instant editing and real-time rendering. To achieve stereoscopic impression of the decal, we introduce shadow factor into IBL, which can be adaptively optimized during training. This allows the shadow brightness of surfaces to be accurately decomposed rather than baked into the diffuse color, ensuring that the edited texture exhibits authentic shading. To address the issues of warping and blurriness in previous methods, we apply As-Rigid-As-Possible (ARAP) parameterization to pre-unfold a specified area of the mesh and use the local UV mapping combined with a neural texture map to enhance the ability to express high-frequency details in that area. For instant editing, we utilize the Disney BRDF model, explicitly defining material colors with 3-channel diffuse albedo. This enables instant replacement of albedo RGB values during the editing process, avoiding the prolonged optimization required in previous approaches. In our experiment, we introduce the Ratio Variance Warping (RVW) metric to evaluate the local geometric warping of the decal area. Extensive experimental results demonstrate that our method surpasses previous decal blending methods in terms of editing quality, editing speed and rendering speed, achieving the state-of-the-art.

Abstract: Recent approaches to VO have significantly improved performance by using deep networks to predict optical flow between video frames. However, existing methods still suffer from noisy and inconsistent flow matching, making it difficult to handle challenging scenarios and longsequence estimation.To overcome these challenges, we introduce Spatio-Temporal Visual Odometry (STVO), a novel deep network architecture that effectively leverages inherent spatio-temporal cues to enhance the accuracy and consistency of multi-frame flow matching. With more accurate and consistent flow matching, STVO can achieve better pose estimation through the bundle adjustment (BA).Specifically, STVO introduces two innovative components: 1) the Temporal Propagation Module that utilizes multi-frame information to extract and propagate temporal cues across adjacent frames, maintaining temporal consistency; 2) the Spatial Activation Module that utilizes geometric priors from the depth maps to enhance spatial consistency while filtering out excessive noise and incorrect matches.Our STVO achieves state-of-the-art performance on TUM-RGBD, EuRoc MAV, ETH3D and KITTI Odometry benchmarks. Notably, it improves accuracy by 77.8% on ETH3D benchmark and 38.9% on KITTI Odometry benchmark over the previous best methods.

Abstract: Handobject interaction modeling from a single RGB image is a significantly challenging task. Previous works typically reconstruct hand-object interactions as texture-less meshes, ignoring photo-realistic image generation. In this work, we introduce the HO123, a novel method to synthesize novel-view hand-object interaction images from a single image. To this end, we first train a 2D diffusion prior. Given the camera pose in novel views, our approach transfers the camera information into explicit hand representations, including hand depth and skeleton images. We propose a global hand embedding to control the diffusion model based on these hand representations. We then learn a 3D Gaussian splatting for novel-view rendering using the diffusion prior. However, occluded objects present a persistent challenge. To address this issue, we further introduce local hand embedding, where a contact field is defined in the 3D Gaussian Splatting. We leverage contact information to guide the rendering in the contact field. Extensive experiments on the HO3D and DexYCB datasets demonstrate that our method significantly outperforms state-of-the-art novel-view synthesis for hand-object interactions.

Abstract: Due to the density inconsistency and distribution difference between crosssource point clouds, previous methods fail in cross-source point cloud registration. We propose a density-robust feature extraction and matching scheme to achieve robust and accurate cross-source registration. To address the density inconsistency between cross-source data, we introduce a density-robust encoder for extracting density-robust features. To tackle the issue of challenging feature matching and few correct correspondences, we adopt a loose-to-strict matching pipeline with a ``loose generation, strict selection'' idea. Under it, we employ a one-to-many strategy to loosely generate initial correspondences. Subsequently, high-quality correspondences are strictly selected to achieve robust registration through sparse matching and dense matching. On the challenging Kinect-LiDAR scene in the cross-source 3DCSR dataset, our method improves feature matching recall by 63.5 percentage points (pp) and registration recall by 57.6 pp. It also achieves the best performance on 3DMatch, while maintaining robustness under diverse downsampling densities.

Abstract: Open vocabulary semantic segmentation is a hot topic in research, focusing on segmenting and recognizing a diverse array of categories in varied environments, including those previously unknown, thereby holding significant practical value. Mainstream studies utilize the CLIP model for direct semantic segmentation (denoted as “forward methods”), which often struggles to represent underrepresented categories effectively. To address this issue, this paper introduces a novel approach Excluding the ImpossibLe Semantic Segmentation Network (ELSENet) based on reverse thinking. By excluding improbable categories, ELSE-Net narrows the selection range for forward methods, significantly reducing the risk of misclassification. In implementation, we initially draw on leading research to design the General Processing Block (GP-Block), which generates inclusion probabilities (the likelihood of belonging to a category) by using the CLIP model cooperated with a Mask Proposal Network (MPN). We then present the EXcluding the ImPossible Block (EXP-Block), which computes exclusion probabilities (the likelihood of not belonging to a category) through the CLIPN model and a custom-designed Reverse Retrieval Adapter (R2-Adapter). These exclusion probabilities are subsequently used to refine the inclusion probabilities, which are ultimately employed to annotate class-agnostic masks. Moreover, the core component of our EXP-Block is model-agnostic, enabling it to enhance the capabilities of existing frameworks. Experimental results from four benchmark datasets validate the effectiveness of ELSE-Net and underscore the seamless model-agnostic functionality of the EXP-Block.

Abstract: Openvocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. The recent success of vision-language models (VLMs) has demonstrated their remarkable capabilities to understand open vocabularies. Existing works that leverage VLMs for 3D object detection (3DOD) generally resort to representations that lose the rich scene context required for 3D perception. To address this problem, we propose in this paper a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD. Specifically, we first design a Hierarchical Data Integration (HDI) approach to obtain coarse-to-fine 3D-image-text data, which is fed into a VLM to extract object-centric knowledge. To facilitate the association of feature hierarchies, we then propose an Interactive Cross-Modal Alignment (ICMA) strategy to establish effective intra-level and inter-level feature connections. To better align features across different levels, we further propose an Object-Focusing Context Adjustment (OFCA) module to refine multi-level features by emphasizing object-related features. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations.

Abstract: Eventbased semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D\textsuperscript{2}CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

Abstract: Portraits often suffer from specular highlights due to factors like skin oiliness, lighting conditions, and shooting angles, which degrade aesthetics and affect downstream tasks. Thus, portrait highlight removal is imperative. Previous methods struggle to remove highlights and achieve highfidelity restoration of disturbed regions simultaneously. In this work, we propose a novel patch-based diffusion model for this task, named PHR-DIFF. Specifically, in the training, we present a patchify training strategy that divides the portrait into equal-sized patches and performs diffusion on these patches individually. This patchify can extract more compact facial features and reduce training costs. Besides, to learn the global coherence of the face, we propose a patch-residual approach. It encodes the full-resolution highlight-free portrait into latent features, which are further used as residual terms to constrain the forward training. In the sampling, we remove portrait highlights in a patch-wise manner and propose a Patch-Aware Highlight Removal (PAHR) mechanism. PAHR leverages features from non-highlight regions to effectively guide the patch-wise removal of highlight components. Experimental results on multiple public datasets demonstrate that PHR-DIFF removes highlights more cleanly and avoids artifacts.

Abstract: Sourcefree domain adaptation (SFDA) aims to transfer knowledge from the well-trained source model and optimize it to adapt target data distribution. SFDA methods are suitable for medical image segmentation task due to its data-privacy protection and achieve promising performances. However, cross-domain distribution shift makes it difficult for the adapted model to provide accurate decisions on several hard instances and negatively affects model generalization. To overcome this limitation, a novel method `supportive negatives spectral augmentation' (SNSA) is presented in this work. Concretely, SNSA includes the instance selection mechanism to automatically discover a few hard samples for which source model produces incorrect predictions. And, active learning strategy is adopted to re-calibrate their predictive masks. Moreover, SNSA deploys the spectral augmentation between hard instances and others to encourage source model to gradually capture and adapt the attributions of target distribution. Considerable experimental studies demonstrate that annotating merely 4%~5% of negative instances from the target domain significantly improves segmentation performance over previous methods.

Abstract: Although diffusionbased techniques have shown remarkable success in image generation and editing tasks, their abuse can lead to severe negative social impacts. Recently, some works have been proposed to provide defense against the abuse of diffusion-based methods. However, their protection may be limited in specific scenarios by manually defined prompts or the stable diffusion (SD) version. Furthermore, these methods solely focus on tuning methods, overlooking editing methods that could also pose a significant threat. In this work, we propose Anti-Diffusion, a privacy protection system designed for general diffusion-based methods, applicable to both tuning and editing techniques. To mitigate the limitations of manually defined prompts on defense performance, we introduce the prompt tuning (PT) strategy that enables precise expression of original images. To provide defense against both tuning and editing methods, we propose the semantic disturbance loss (SDL) to disrupt the semantic information of protected images. Given the limited research on the defense against editing methods, we develop a dataset named Defense-Edit to assess the defense performance of various methods. Experiments demonstrate that our Anti-Diffusion achieves superior defense performance across a wide range of diffusion-based techniques in different scenarios.

Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

Abstract: LIDARbased 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent labels inherently contain significant redundancy. This paper aims to eliminate the redundancy by supervising 3D object detection using only semantic labels. However, the challenge arises due to the incomplete geometry structure and boundary ambiguity of point cloud instances, leading to inaccurate pseudo-labels and poor detection results. To address these challenges, we propose a novel method, named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module is proposed to enhance the performance by progressively refining the pseudo-labels and mining the instances without generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the great label-efficient potential and advancement of our method.

Abstract: Contemporary face recognition systems use feature templates extracted from face images to identify persons. To enhance privacy, face template protection techniques are widely employed to conceal sensitive identity and appearance information stored in the template. This paper identifies an emerging privacy attack form utilizing diffusion models that could nullify prior protection. The attack can synthesize highquality, identity-preserving face images from templates, revealing persons' appearance. Based on studies of the diffusion model's generative capability, this paper proposes a defense by rotating templates to a noise-like distribution. This is achieved efficiently by spherically and linearly interpolating templates on their located hypersphere. This paper further proposes to group-wisely divide and drop out templates' feature dimensions, to enhance the irreversibility of rotated templates. The proposed techniques are concretized as a novel face template protection technique, SlerpFace. Extensive experiments show that SlerpFace provides satisfactory recognition accuracy and comprehensive protection against inversion and other attack forms, superior to prior arts.

Abstract: Point cloud completion aims to recover partial geometric and topological shapes caused by equipment defects or limited viewpoints. Current methods either solely rely on the 3D coordinates of the point cloud to complete it or incorporate additional images with wellcalibrated intrinsic parameters to guide the geometric estimation of the missing parts. Although these methods have achieved excellent performance by directly predicting the location of complete points, the extracted features lack fine-grained information regarding the location of the missing area. To address this issue, we propose a rapid and efficient method to expand an unimodal framework into a multimodal framework. This approach incorporates a position-aware module designed to enhance the spatial information of the missing parts through a weighted map learning mechanism. In addition, we establish a Point-Text-Image triplet corpus PCI-TI and MVP-TI based on the existing unimodal point cloud completion dataset and use the pre-trained vision-language model CLIP to provide richer detail information for 3D shapes, thereby enhancing performance. Extensive quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art point cloud completion methods.

Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimizationbased methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we proposed a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on the mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves start-of-the-art performance on multiple datasets and ranks 1st on the Middlebury dataset and outperforms all the published work on the ETH3D benchmark.

Abstract: Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a highperformance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during the distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type loss struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space using contrastive learning. Additionally, we also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms the state-of-the-art KD methods across various image restoration tasks and backbones.

Abstract: Class incremental semantic segmentation (CISS) aims to segment new classes during continual steps while preventing the forgetting of old knowledge. Existing methods alleviate catastrophic forgetting by replaying distributions of previously learned classes using stored prototypes or features. However, they overlook a critical issue: in CISS, the representation of class knowledge is updated continuously through incremental learning, whereas prototype replay methods maintain fixed prototypes. This mismatch between updated representation and fixed prototypes limits the effectiveness of the prototype replay strategy. To address this issue, we propose the Adaptive prototype replay (Adapter) for CISS in this paper. Adapter comprises an adaptive deviation compensation (ADC) strategy and an uncertaintyaware constraint (UAC) loss. Specifically, the ADC strategy dynamically updates the stored prototypes based on the estimated representation shift distance to match the updated representation of old class. The UAC loss reduces prediction uncertainty, aggregating discriminative features to aid in generating compact prototypes. Additionally, we introduce a compensation-based prototype similarity discriminative (CPD) loss to ensure adequate differentiation between similar prototypes, thereby enhancing the efficiency of the adaptive prototype replay strategy. Extensive experiments on Pascal VOC and ADE20K datasets demonstrate that Adapter achieves state-of-the-art results and proves effective across various CISS tasks, particularly in challenging multi-step scenarios.

Abstract: Weakly supervised person search aims to detect and match individuals using only bounding box annotations jointly. The existing methods mainly alternate between the clustering stage and the training stage, where the former is responsible for instance level label allocation tasks and the latter needs to undertake proposal level label allocation tasks. In the clustering phase, the conventional use of the DBSCAN algorithm for clustering pedestrian instance features often neglects key contextual information such as scene context and relative positioning of individuals. During the training phase, the Region Proposal Network assigns labels based on the MaxIoU, which tends to produce locally ambiguous labels. Finally, the proposals updated to the memory bank with extensive background information tend to interfere with the task of pseudolabel generation. To address these issues, this paper proposes an Optimizing Label Assignment (OLA) for weakly supervised person search. Firstly, in the clustering phase, Context Aware Clustering is introduced to integrate contextual information and constraints, enhancing the accuracy of clustering. Secondly, in the training phase, we adopt Prototype Matching based on Optimal Transport theory to optimize label distribution from a global perspective. Furthermore, we propose Dual Memory Bank Enhancement that effectively enhances the accuracy of label assignment. Extensive experiments conducted on the CUHK-SYSU and PRW datasets demonstrate that our method achieves state-of-the-art performance in weakly supervised person search.

Abstract: Handwritten Mathematical Expression Recognition (HMER) has extensive applications in automated grading and office automation. However, existing sequencebased decoding methods, which directly predict LaTeX sequences, struggle to understand and model the inherent tree structure of LaTeX and often fail to ensure syntactic correctness in the decoded results. To address these challenges, we propose a novel model named TAMER (Tree-Aware Transformer) for handwritten mathematical expression recognition. TAMER introduces an innovative Tree-aware Module while maintaining the flexibility and efficient training of Transformer. TAMER combines the advantages of both sequence decoding and tree decoding models by jointly optimizing sequence prediction and tree structure prediction tasks, which enhances the model's understanding and generalization of complex mathematical expression structures. During inference, TAMER employs a Tree Structure Prediction Scoring Mechanism to improve the structural validity of the generated LaTeX sequences. Experimental results on CROHME datasets demonstrate that TAMER outperforms traditional sequence decoding and tree decoding models, especially in handling complex mathematical structures, achieving state-of-the-art (SOTA) performance.

Abstract: With the development of Vision Foundation Models (VFMs) in recent years, Visual InContext Learning (VICL) has become a better choice compared to modifying models in most scenarios. Different from retraining or fine-tuning models, VICL does not require modifications to the model's weights and architecture, and only needs a prompt with demonstrations to teach VFM how to solve tasks. Currently, significant computational cost for finding optimal prompts for every test sample hinders the deployment of VICL, as determining which demonstrations to use for constructing the prompt is very costly. In this paper, however, we find a counterintuitive phenomenon that most test samples actually achieve optimal performance under the same prompts, and searching for sample-level prompts only costs much time but results in completely identical prompts actually. Therefore, we propose task-level prompting to reduce the cost of searching for prompts during the inference stage and introduce two time-saving yet effective task-level prompt search strategies accordingly. Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved.

Abstract: Remote photoplethysmography (rPPG) is a method for noncontact measurement of physiological signals from facial videos, holding great potential in various applications such as healthcare, affective computing, and anti-spoofing. Existing deep learning methods struggle to address two core issues of rPPG simultaneously: understanding the periodic pattern of rPPG among long contexts and addressing large spatiotemporal redundancy in video segments. These represent a trade-off between computational complexity and the ability to capture long-range dependencies. In this paper, we introduce RhythmMamba, a state space model-based method that captures long-range dependencies while maintaining linear complexity. By viewing rPPG as a time series task through the proposed frame stem, the periodic variations in pulse waves are modeled as state transitions. Additionally, we design multi-temporal constraint and frequency domain feed-forward, both aligned with the characteristics of rPPG time series, to improve the learning capacity of Mamba for rPPG signals. Extensive experiments show that RhythmMamba achieves state-of-the-art performance with 319% throughput and 23% peak GPU memory.

Abstract: Large language models (LLMs) have recently shown notable progress in unifying various visual tasks with an openended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding ability in general domains, they lack further human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to a pre-defined form and lack open-ended task capability. Therefore, it is necessary to propose a large multi-modal model which utilizes LLMs to unify various human-centric tasks. We forge ahead along this path from the aspects of dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns based on existing 20 open datasets from 6 diverse downstream tasks, which provides sufficient and diverse data to implement multi-modal training. Then, a model named L-Man including a query adapter is designed to extract the multi-grained semantics of image and align the cross-modal information between image and text. In practice, we introduce a two-stage training strategy, where the first stage extracts generic text-relevant visual information, and the second stage maps the visual features to the embedding space of the LLM. By tuning on HumanIns, our model shows significant superiority on human-centric tasks compared with existing large multi-modal models, and also achieves even better results on downstream datasets compared with respective task-specific models.

Abstract: This paper initiates the first exploratory study to investigate the robust integrated sensing and communication (ISAC) systems under channel estimation errors from the perspective of GPUAccelerated bilevel optimization. Within this framework, the upper-level problem is dedicated to simultaneously optimizing communication and sensing objectives, quantified respectively by weighted sum rate and Cram\'er-Rao lower bound, while the lower-level problem considers the channel uncertainties. We then propose an efficient algorithm that can find a set of Pareto optimal solutions with different trade-offs among communication rates and sensing accuracy. The theoretical analysis regarding the convergence rate has also been provided. Furthermore, we design a bilevel optimization inspired deep neural network architecture for that can be realized efficiently on GPU platform. Experiments have been conducted to evaluate the performances of proposed methods. In particular, the proposed GPU-accelerated parallel bilevel optimization can accelerate the convergence speed by up to 50 times compared to conventional gradient-based methods. This characteristic renders it especially suitable for real-time applications, exemplified by the demanding requirements of robust ISAC in upcoming 6G networks.

Abstract: Proof systems can be used for certification of logic problems, and proof complexity can inform us how succinct certificates can be. In the PSPACE complete logic QBF (Quantified Boolean Formulas) refutation proofs often contain information that reproduce the witnesses of the quantified variables. This is known as strategy extraction. There are two known kinds of strategy extraction for proof systems, local strategy extraction and roundbased strategy extraction. Formalisation of local strategy extraction was done previously, in this paper we formalise round-based strategy extraction. By formalising the strategy extraction into circuits we can show new p-simulations. P-simulations are processes that allow you to transform proofs from a weaker proof system to a stronger proof system. Thus we solve an open problem in QBF proof complexity that Extended QBF Frege p-simulates LD-Q(\Drrs)-Resolution. LD-Q(\Drrs)-Resolution is the underlying proof system for the solver Qute. This is a positive result for certification. By clarifying the hierarchy of proof systems further suggests the feasibility of using known formats such as Extended QU-Resolution or QRAT to certify QCDCL solvers. The p-simulation is our main result, but we also make other observations from the specifics of the formalisation.

Abstract: The performance of models is intricately linked to the abundance of training data. In VisibleInfrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-expensive, costly and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtain massive RGB-IR paired images with identity preserving by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning the Stable Diffusion (SD) on modality-specific data. DiVE extend the text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about 9% in mAP over the baseline on the LLCM dataset.

Abstract: Mixed Integer Linear Program (MILP) solvers are mostly built upon a Branchand-Bound (B&B) algorithm, where the efficiency of traditional solvers heavily depends on hand-crafted heuristics for branching. The past few years have witnessed the increasing popularity of data-driven approaches to automatically learn these heuristics. However, the success of these methods is highly dependent on the availability of high-quality demonstrations, which requires either the development of near-optimal heuristics or a time-consuming sampling process. This paper averts this challenge by proposing Suboptimal-Demonstration-Guided Reinforcement Learning (SORREL) for learning to branch. SORREL selectively learns from suboptimal demonstrations based on value estimation. It utilizes suboptimal demonstrations through both offline reinforcement learning on the demonstrations generated by suboptimal heuristics and self-imitation learning on past good experiences sampled by itself. Our experiments demonstrate its advanced performance in both branching quality and training efficiency over previous methods for various MILPs.

Abstract: Index tracking is a popular passive investment strategy aimed at optimizing portfolios, but fully replicating an index can lead to high transaction costs. To address this, partial replication have been proposed. However, the cardinality constraint renders the problem nonconvex, non-differentiable, and often NP-hard, leading to the use of heuristic or neural network-based methods, which can be non-interpretable or have NP-hard complexity. To overcome these limitations, We propose a Differentiable Cardinality Constraint (DCC) for index tracking and introduce a floating-point precision-aware method to address implementation issues. We theoretically prove our methods calculate cardinality accurately and enforce actual cardinality with polynomial time complexity. We propose the range of the hyperparameter ensures that our method has no error in real implementations, based on theoretical proof and experiment. Our method applied to mathematical method outperforms baseline methods across various datasets, demonstrating the effectiveness of the identified hyperparameter.

Abstract: Traffic prediction is critical for optimizing travel scheduling and enhancing public safety, yet the complex spatial and temporal dynamics within traffic data present significant challenges for accurate forecasting. In this paper, we introduce a novel model, the Spatiotemporalaware Trend-Seasonality Decomposition Network (STDN). This model begins by constructing a dynamic graph structure to represent traffic flow and incorporates novel spatio-temporal embeddings to jointly capture global traffic dynamics. The representations learned are further refined by a specially designed trend-seasonality decomposition module, which disentangles the trend-cyclical component and seasonal component for each traffic node at different times within the graph. These components are subsequently processed through an encoder-decoder network to generate the final predictions. Extensive experiments conducted on real-world traffic datasets demonstrate that STDN achieves superior performance with remarkable computation cost. Furthermore, we have released a new traffic dataset named JiNan, which features unique inner-city dynamics, thereby enriching the scenario comprehensiveness in traffic prediction evaluation.

Abstract: Generative document retrieval is a novel retrieval framework, which represents documents as identifiers (DocID) and retrieves documents by generating DocIDs. It has the advantage of endto-end optimization over traditional retrieval methods and has attracted much research interest. Nonetheless, the development of efficient and precise DocIDs for document representation remains a pertinent issue within the field. Existing methods for designing DocIDs tend to consider only the relevance of DocIDs to the corresponding documents, while neglecting the ability of the DocIDs to distinguish the corresponding documents from similar ones, which is crucial for the retrieval task. In this paper, we design learnable descriptive and discriminative document Identifiers (D2-DocID) for Generative Retrieval and propose the paired retrieval model D2Gen. The D2-DocID is semantically similar to the corresponding documents (descriptive) and is able to distinguish similar documents (discriminative) in the corpus, thus enhancing retrieval performance. We use a contrastive learning assisted generative retrieval task to enable the model to understand the document and then complete the generative retrieval. We then design a DocID selection method to select DocIDs based on the retrieval model's understanding of the documents. Our experimental results on the MS MARCO and NQ320k dataset illustrate the effectiveness of the approach.

Abstract: By generating new yet effective data, data augmentation has become a promising method to mitigate the data sparsity problem in sequential recommendation. Existing works focus on augmenting the original data but rarely explore the issue of imbalanced relevance and diversity for augmented data, leading to semantic drift problems or limited performance improvements. In this paper, we propose a novel Balanced data Augmentation Plugin for Sequential Recommendation (BASRec) to generate data that balance relevance and diversity. BASRec consists of two modules: Singlesequence Augmentation and Cross-sequence Augmentation. The former leverages the randomness of the heuristic operators to generate diverse sequences for a single user, after which the diverse and the original sequences are fused at the representation level to obtain relevance. Further, we devise a reweighting strategy to enable the model to learn the preferences based on the two properties adaptively. The Cross-sequence Augmentation performs nonlinear mixing between different sequence representations from two directions. It produces virtual sequence representations that are diverse enough but retain the vital semantics of the original sequences. These two modules enhance the model to discover fine-grained preferences knowledge from single-user and cross-user perspectives. Extensive experiments verify the effectiveness of BASRec. The average improvement is up to 72.0% on GRU4Rec, 33.8% on SASRec, and 68.5% on FMLP-Rec. We demonstrate that BASRec generates data with a better balance between relevance and diversity than existing methods.

Abstract: Hubs are a few points that frequently appear in the knearest neighbors (kNN) of many other points in a high-dimensional data set. The hubs' effects, called the hubness phenomenon, degrade the performance of kNN based models in high dimensions. We present SamHub, a simple sampling approach to efficiently identify hubs with theoretical guarantees. Apart from previous works based on approximate kNN indexes, SamHub is generic and applicable to any distance measure with negligible additional memory footprint. Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks.

Abstract: Neural networks remain blackbox systems, unsure about their outputs, and their performance may drop unpredictably in real applications. An open question is how to qualitatively extend neural networks, so that they are sure about their reasoning results, or reasoning-for-sure. Here, we introduce set-theoretic relations explicitly and seamlessly into neural networks by extending vector embedding into sphere embedding, so that part-whole relations can explicitly encode set-theoretic relations through sphere boundaries in the vector space. A reasoning-for-sure neural network successfully constructs, within a constant number M of epochs, a sphere configuration as its semantic model for any consistent set-theoretic relation. We implement Hyperbolic Sphere Neural Network (HSphNN), the first reasoning-for-sure neural network for all types of Aristotelian syllogistic reasoning. Its construction process is realised as a sequence of neighbourhood transitions from the current towards the target configuration. We prove M=1 for HSphNN. In experiments, HSphNN achieves the symbolic level rigour of syllogistic reasoning and successfully checks both decisions and explanations of ChatGPT (gpt-3.5-turbo and gpt-4o) without errors. Through prompts, HSphNN improves the performance of gpt-3.5-turbo from 46.875% to 58.98%, and of gpt-4o from 82.42% to 84.76%. We show ways to extend HSphNN for various kinds of logical and Bayesian reasoning, and to integrate it with traditional neural networks seamlessly.

State Key Laboratory of Blockchain and Data Security, Zhejiang University Bangsheng Technology Co,Ltd., State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Big Graph Center, Hangzhou City University State Key Laboratory of Blockchain and Data Security, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, State Key Laboratory of Blockchain and Data Security, Zhejiang University Bangsheng Technology Co,Ltd., State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Abstract: Fraud is increasingly prevalent, and its patterns are frequently changing, posing challenges for fraud detection methods such as random forests and Graph Neural Networks (GNNs), which rely on binbased and mixture features separately. The former may lose crucial graph-associated features, while the latter face incorrect feature fusion. To overcome these limitations, we propose an approach based on attribute-association pattern that leverages the distinct attribute and association patterns differentiating fraudulent from benign behaviors, to enhance fraud detection capabilities. Attribute features are adaptively split into separate bins to eliminate incorrect attribute fusion and combine association patterns through graph neighbor message passing, thereby deriving attribute-association pattern features. Using the learned attribute-association patterns, the fraud patterns between a single pattern and the patterns across the entire graph are globally aggregated. Extensive experiments comparing our approach with 24 methods on 7 datasets demonstrate that the proposed method achieves SOTA performance.

Abstract: We study the problem of modeling a nonlinear dynamical system when given a time series by deriving equations directly from the data. Despite the fact that time series data are given as input, models for dynamics and estimation algorithms that incorporate long-term temporal dependencies are largely absent from existing studies. In this paper, we introduce a latent state to allow time-dependent modeling and formulate this problem as a dynamics estimation problem in latent states. We face multiple technical challenges, including (1) modeling latent non-linear dynamics and (2) solving circular dependencies caused by the presence of latent states. To tackle these challenging problems, we propose a new method, Latent Non-Linear equation modeling (LaNoLem), that can model a latent non-linear dynamical system and a novel alternating minimization algorithm for effectively estimating latent states and model parameters. In addition, we introduce criteria to control model complexity without human intervention. Compared with the state-of-the-art model, LaNoLem achieves competitive performance for estimating dynamics while outperforming other methods in prediction.

School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China, Iowa State University, Iowa, USA, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China Kash Institute of Electronics and Information Industry, Kashgar, China, Kash Institute of Electronics and Information Industry, Kashgar, China

Abstract: The metro flow in Urban Rail Transit Systems (URTS) differs from other urban traffic flows because it is characterized by: (1) highly predetermined scheduling; and (2) interactively dynamic dependencies over the fixed physical infrastructure that vary with spatiotemporal and environmental factors. Notwithstanding the advances in graph neural networks, existing efforts fail to fully capture the characteristics and complex spatiotemporal dynamics specific to metro flow, as the innate graphaware interactions underlying a metro flow are frequently affected by an amalgamation of: intrinsic connectivity, environmental associations, and flow-activated correlation, which usually dynamically evolve over time while containing redundant signals. We propose ReDyNet, a novel Responsive Dynamic Graph Neural Network to accurately understand the spatiotemporal dynamics of metro flow and external factors. Specifically, it employs a responsive mechanism that adapts to variations in metro flow and external influences, ensuring the construction of an appropriate dynamic graph. In addition, ReDyNet follows the merits of information bottleneck (IB) theory with redundancy disentanglement to enhance the clarity and precision of contextual spatial signals. Our experiments conducted on three real-world metro passenger flow datasets demonstrate that the proposed ReDyNet outperforms several representative baselines.

Abstract: Given the large volume of side information from different modalities, multimodal recommender systems have become increasingly vital, as they exploit richer semantic information beyond useritem interactions. Recent works highlight that leveraging Graph Convolutional Networks (GCNs) to explicitly model multimodal item-item relations can significantly enhance recommendation performance. However, due to the inherent over-smoothing issue of GCNs, existing models benefit only from shallow GCNs with limited representation power. This drawback is especially pronounced when facing complex and high-dimensional patterns such as multimodal data, as it requires large-capacity models to accommodate complicated correlations. To this end, in this paper, we investigate bypassing GCNs when modeling multimodal item-item relationship. More specifically, we propose a Topology-aware Multi-Layer Perceptron (TMLP), which uses MLPs instead of GCNs to model the relationships between items. TMLP enhances MLPs with topological pruning to denoise item-item relations and intra (inter)-modality learning to integrate higher-order modality correlations. Extensive experiments on three real-world datasets verify TMLP's superiority over nine baselines. We also find that by discarding the internal message passing in GCNs, which is sensitive to node connections, TMLP achieves significant improvements in both training efficiency and robustness against existing models.

Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University, Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University, Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University Information Systems Technology Design Pillar, Singapore University of Technology and Design, Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University, Guangxi Key Lab of Multisource Information Mining Security, Guangxi Normal University School of Computer Science and Engineering, University of Electronic Science and Technology of China

Abstract: Although unsupervised multiplex graph representation learning (UMGRL) has been a hot research topic, existing UMGRL methods still has limitations to be addressed. For example, previous works either preserve structural information by ignoring the impact of heterophily in the graph structure or only focus on nodelevel consistency by ignoring class-level consistency. To address these issues, in this paper, we propose a new UMGRL method to explore both homophily and consistency in the multiplex graph. Specifically, we propose to restructure the multi-order relationships of every graph between every node and its multi-order neighbors to improve the homophily and reduce the impact of the heterophily in the graph structure. We also design a contrastive loss based on a self-expression matrix of the node representation to achieve node-level and class-level consistency. Furthermore, we theoretically prove our method to achieve class-level consistency. Extensive experimental results on real datasets verify the effectiveness of the proposed method with respect to node classification tasks, compared to SOTA methods.

Abstract: Geoentity resolution involves linking records that refer to the same entities across different spatial datasets, which underpins location-based services. Given the varying quality of geo-data, this task is known to be challenging, as directly comparing the semantic-centric representations of two entities is no longer reliable. To robustify geo-entity resolution in this context, the main research question is how to effectively extend the current semantics-centric representations of geo-entity with geographical context from its spatial neighbors. Existing methods consider names from neighbors, but they struggle to fully utilize the unaligned neighbor attributes. In this paper, we study the representation of geo-context for robust geo-entity resolution and propose two adaptations that efficiently leverage unaligned geo-entity attributes across spatial neighbors: (1) A plugin module, namely Unaligned Message-Passing (UMP), that propagates unaligned neighbor features to integrate geo-context into the token embeddings output by language model; (2) a contextualized pretraining framework (CP) that allows the former to leverage unlabelled geo-entity data. Experiments show that our method surpasses the baselines, achieving higher F1 scores on 8 real-world geo-datasets in terms of robustness, with an improvement of up to 7.9%. The ablation study further justifies our proposal.

Abstract: As a typical problem of Spatiotemporal Resource Management, Time Series Supplier Allocation (TSSA) poses a complex NPhard challenge, aimed at refining future order dispatching strategies to satisfy the trade-off between demands and maximum supply. The Black-Litterman (BL) model, which comes from financial portfolio management, offers a new perspective for the TSSA by balancing expected returns against insufficient supply risks. However, the BL model is not only constrained by manually constructed perspective matrices and spatio-temporal market dynamics but also restricted by the absence of supervisory signals and unreliable supplier data. To solve these limitations, we introduce the pioneering Deep Black-Litterman Model for TSSA, which innovatively adapts the BL model from financial domain to supply chain context. Specifically, DBLM leverages Spatio-Temporal Graph Neural Networks (STGNNs) to capture spatio-temporal dependencies for automatically generating future perspective matrices. Moreover, a novel Spearman rank correlation is designed as our DBLM supervise signal to navigate complex risks and interactions of the supplier. Finally, DBLM further uses a masking mechanism to counteract the bias of unreliable data, thus improving precision and reliability. Extensive experiments on two datasets demonstrate significant improvements of DBLM on TSSA.

Abstract: Heterogeneous graphs, which are common in realworld downstream tasks, have recently sparked a wave of research interest. The performance of end-to-end heterogeneous graph neural networks (HGNNs) greatly relies on supervised training for specific tasks. To reduce the labeling cost, the "pretrain-finetune" paradigm has been widely adopted, but it leads to a knowledge gap between the pre-trained model and downstream tasks. In an effort to address this gap, the "pretrain-prompt" paradigm has emerged as a promising approach. This involves fine-tuning randomly initialized learnable vectors in downstream tasks. However, this approach may result in an insufficient representation of downstream task features. Existing techniques for heterogeneous graph prompting restructure the heterogeneous graph to align with the homogeneous graph prompting scheme. This can potentially introduce the same limitations as homogeneous graph prompt learning. In this paper, we propose HePa, short for Heterogeneous Graph Prompting for all-level classification tasks. It not only includes a unified prompt template-graph adapted for heterogeneous graphs but also introduces a novel pre-prompt token optimized during the pre-training phase to convey task information downstream. With these designs, HePa can complete all levels of classification tasks toward few-shot scenarios while activating in-context learning. Finally, we conducted a comprehensive experimental analysis of HePa on three benchmark datasets.

Abstract: Spatialtemporal graphs are widely used in a variety of real-world applications. Spatial-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool to extract meaningful insights from this data. However, in real-world applications, most nodes may not possess any available temporal data during training. For example, the pandemic dynamics of most cities on a geographical graph may not be available due to the asynchronous nature of outbreaks. Such a phenomenon disagrees with the training requirements of most existing spatial-temporal forecasting methods, which jeopardizes their effectiveness and thus blocks broader deployment. In this paper, we propose to formulate a novel problem of inductive forecasting with limited training data. In particular, given a spatial-temporal graph, we aim to learn a spatial-temporal forecasting model that can be easily generalized onto those nodes without any available temporal training data. To handle this problem, we propose a principled framework named ST-FiT. ST-FiT consists of two key learning components: temporal data augmentation and spatial graph topology learning. With such a design, ST-FiT can be used on top of any existing STGNNs to achieve superior performance on the nodes without training data. Extensive experiments verify the effectiveness of ST-FiT in multiple key perspectives.

Abstract: Training graph neural networks (GNNs) for graph representation has received increasing concerns due to its outstanding performance in the link prediction and node classification tasks, but it incurs much time and storage for tackling largescale graphs. To alleviate this issue, graph condensation has been emerged to condense the large graph into a small but highly-informative graph, while achieving comparable performance of GNNs trained on the small graph and large graph. However, existing works mainly focus on the gradient or distribution matching under GNN training trajectories to condense simple link structures, while overlooking the structure matching for condensing signed graph that exists conflict links and structural balance among nodes. To bridge this gap, we propose a novel Structure Balance and Gradient Matching-Based Signed Graph Condensation (SGSGC) method for condensing signed graph with node attributes, conflict links and structural balance into informative smaller ones. Specifically, we first propose a structure-balanced matching to match the structural balance between the original and condensed signed graph, and then combine it with the gradient matching to condense signed graph for the link sign prediction task, while preserving both conflicting link structures and node attributes. Moreover, we use the feature smoothing and the graph sparsification technique to improve the robustness for the GNN training, respectively. Finally, a bi-level optimization technique is proposed to simultaneously find the optimal node attributes and conflict structure of the condensed graph. Experiments on six datasets demonstrate that SGSGC achieves excellent performance. On Epinions, 94% test accuracy of training on the original signed graph, while reducing their graph size by 99.95% - 99.99%, and there exist 2.24% – 6.26% accuracy improvements for link sign prediction compared to the state-of-the-arts.

Abstract: Graph Neural Networks (GNNs) have proven effective and typically benefit from pretraining on accessible graphs to enhance performance on tasks with limited labeled data. However, existing GNNs are constrained by the ``one-domain-one-model'' limitation, which restricts their effectiveness across diverse graph domains. In this paper, we tackle this problem by developing a method called Multi-Domain Pre-training for a Unified GNN Model (MDP-GNN). This method is based on the philosophical notion that everything is interconnected, suggesting that a latent meta-domain exists to encompass the diverse graph domains and their interconnections. MDP-GNN seeks to identify and utilize this meta-domain to train a unified GNN model through three core strategies. Firstly, it integrates node feature semantics from different domains to create unified representations. Secondly, it employs a bi-level learning strategy to build a domain-synthesized network that identifies latent connections to facilitate cross-domain knowledge transfer. Thirdly, it uses Wasserstein distance to map diverse domains into the common meta-domain for graph distribution alignment. We validate the effectiveness of MDP-GNN through theoretical analysis and extensive experiments on four real-world graph datasets, showing its superiority in enhancing GNN performance across diverse domains.

Abstract: Explainable recommender systems are designed to elucidate the explanation behind each recommendation, enabling users to comprehend the underlying logic. Previous works perform rating prediction and explanation generation in a multitask manner. However, these works suffer from incoherence between predicted ratings and explanations. To address the issue, we propose a novel framework that employs a large language model (LLM) to generate a rating, transforms it into a rating vector, and finally generates an explanation based on the rating vector and user-item information. Moreover, we propose utilizing publicly available LLMs and pre-trained sentiment analysis models to automatically evaluate the coherence without human annotations. Extensive experimental results on three datasets of explainable recommendation show that the proposed framework is effective, outperforming state-of-the-art baselines with improvements of 7.3% in explainability and 4.4% in text quality.

Abstract: Flight trajectory data plays a vital role in the traffic management community, especially for downstream tasks such as trajectory prediction, flight recognition, and anomaly detection. Existing works often utilize handcrafted features and design models for different tasks individually, which heavily rely on domain expertise and are hard to extend. We argue that different flight analysis tasks share the same useful features of the trajectory. Jointly learning a unified representation for flight trajectories could be beneficial for improving the performance of various tasks. However, flight trajectory representation learning (TRL) faces two primary challenges, \ie unbalanced behavior density and 3D spatial continuity, which disable recent general TRL methods. In this paper, we propose Flight2Vec, a flightspecific representation learning method to address these challenges. Specifically, a behavior-adaptive patching mechanism is used to inspire the learned representation to pay more attention to behavior-dense segments. Moreover, we introduce a motion trend learning technique that guides the model to memorize not only the precise locations, but also the motion trend to generate better representations. Extensive experimental results demonstrate that Flight2Vec significantly improves performance in downstream tasks such as flight trajectory prediction, flight recognition, and anomaly detection.

Abstract: Given an unweighted, undirected, and simple graph, the Densest kSubgraph (DkS) problem aims to find a subgraph of k vertices that has the maximum average induced degree. In this paper, we consider an equivalent reformulation of the DkS problem via diagonal loading. On relaxing the combinatorial constraint of the reformulated problem, we show that the resulting non-convex, continuous relaxation is tight under certain conditions by leveraging an extension of the Motzkin-Straus theorem. We utilize two projection-free approaches to solve the relaxed problem: one based on the Frank-Wolfe algorithm and the other on explicit constraint parameterization. We compare their performance to state-of-the-art baselines across various benchmarks. Our empirical results show that the Frank-Wolfe-based algorithm proposed in this paper outperforms existing baselines in terms of subgraph density and computational complexity.

Abstract: Sessionbased recommendation (SBR) is widely used in e-commerce and streaming services, with the task of performing real-time recommendations based on short-term anonymous user history data. Most existing SBR frameworks follow the pattern of learning a single representation for a specific session, which makes it difficult to capture potential multiple interests, thus preventing discriminative recommendations. Multi-Interest learning has emerged as an effective approach for addressing this issue on sequential data in recent years. However, the current Multi-Interest frameworks act terrible on session data because they may generate excessive interests. To address these issues, we proposed a model named Dynamic Multi-Interst Graph Neural Network (DMI-GNN) ,which introduces the Multi-Interest learning framework into SBR and refines it by proposing a multiple positional patterns (MPP) learning method and a Dynamic Multi-Interest (DMI) regularization.To be specific, the MPP learning layer ensures the model to obtain representations with different positional information for sessions.The DMI regularization, on the other hand, mitigates the influence of excessive interests. Experiments on three bench-mark datasets demonstrate that our methods achieve better performance on different metrics

Abstract: Cognitive diagnosis is a key task in computeraided education, aimed at assessing a students' proficiency in specific knowledge concepts based on their responses to exercises. However, existing cognitive diagnosis models often overlook anomalies in students and exercises. For instance, some students might incorrectly response exercises despite having a strong grasp of the knowledge concept, or they might response correctly despite a lack of understanding. Such subtle anomalies can adversely affect the diagnostic results of the models. To address these anomalies, we conduct a qualitative analysis of how anomalous student states and exercise properties impact response outcomes using causal diagrams. We propose a framework named Anomaly Detection for Cognitive Diagnosis (AD4CD) to enhance the ability of Learning-to-Detect-Anomalous. AD4CD approaches the problem from a causal perspective, analyzing confounding paths that affect the true causal relationship between student ability and response outcomes, and designing an anomaly detection mechanism suitable for cognitive diagnostic models. Specifically, we first account for anomalous student behaviors and exercise properties and introduce response times from both students and exercises as modeling factors. By quantifying the response time distributions in high-dimensional features, we identify anomalies within skewed distributions, including both left-tail and right-tail anomalies. Using the detected anomaly scores, we comprehensively model the students' anomalous behaviors and exercise anomalies. Additionally, we reconstruct unbiased true abilities under natural conditions and use reconstruction loss as an anomaly score to assist in modeling guessing and slipping features. Lastly, AD4CD leverages a general cognitive diagnosis model as its backbone, optimizing the guessing and slipping features to provide unbiased and accurate feedback. Extensive experimental results demonstrate that AD4CD effectively captures anomalous data in the diagnostic process across three real-world datasets, enhancing the accuracy of the diagnostic results.

Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Aalborg University, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Carnegie Mellon University, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence

Abstract: Uncertainty quantification in travel time estimation (TTE) aims to estimate the confidence interval for travel time, given the origin (O), destination (D), and departure time (T). Accurately quantifying this uncertainty requires generating the most likely path and assessing travel time uncertainty along the path. This involves two main challenges: 1) Predicting a path that aligns with the ground truth, and 2) modeling the impact of travel time in each segment on overall uncertainty under varying conditions. We propose DutyTTE to address these challenges. For the first challenge, we introduce a deep reinforcement learning method to improve alignment between the predicted path and the ground truth, providing more accurate travel time information from road segments to improve TTE. For the second challenge, we propose a mixture of experts guided uncertainty quantification mechanism to better capture travel time uncertainty for each segment under varying contexts. Extensive experiments on two realworld datasets demonstrate the superiority of our proposed method.

Abstract: In recent years, there has been a burgeoning interest in multimodal recommender systems within the recommendation systems domain. These systems aim to understand user preferences by leveraging both user interaction data and multimodal information associated with items. This approach frequently results in superior recommendation accuracy compared to traditional models that rely solely on useritem interactions. Despite the advancements of these methods, there is a relatively low utilization of image features in propagating item-item characteristics, an overreliance on text feature similarity, and a frequent neglect of the deep relationships between items, users, and modalities. In response to these challenges, we introduce a novel model termed LLMs-Enhanced Hyper-Knowledge Graph Recommender for Multimodal Recommendation (DOGE). DOGE utilizes large language models (LLMs) to understand image information under the guidance of text information, generating cross-modal features that effectively enhance the relationship between text and image modalities. Subsequently, DOGE constructs a Hyper-Knowledge Graph (HKG) using user-item interaction information and modality features enhanced by large language models. This graph encompasses a wide range of item-item and user-user binary relations and hyper-relations, effectively expanding the feature propagation mechanisms and mitigating the overreliance on text modality. By learning on heterogeneous user-item graphs and homogeneous item-item, user-user graphs, DOGE enhances potential effective propagation between item features and user features, acquiring more effective feature representations of users and items. Comprehensive experimentation across three public real-world datasets illustrates that DOGE attains state-of-the-art (SOTA) performance, exhibiting a 7.2% improvement over the strongest baseline.

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China Key Lab of AI Safety, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, CAS, Beijing, China, Independent Researcher, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China Key Lab of AI Safety, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, CAS, Beijing, China, Independent Researcher, Independent Researcher, Independent Researcher, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China Key Lab of AI Safety, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, CAS, Beijing, China CASMINO Ltd., Suzhou, China

Abstract: Antifraud machine learning systems are perpetually confronted with the significant challenge of concept drift, driven by the continuous and intense evolution of fraudulent techniques. That is, outdated models trained on historical fraudulent behaviors often fall short in addressing the evolving tactics of malicious users over time. The key issue lies in effectively tackling the rapid and significant evolution of fraudsters' behaviors to detect these emerging and unforeseen anomalies. In this paper, we propose a solution by directly accessing real-time data and introducing a lightweight plug-in approach named TRE (Test-time Retrieval-based Representation Enrichment). Considering the similarity among samples, TRE employs a retriever to efficiently identify the top-K most relevant recent samples and implements an aggregation strategy to provide neighboring embeddings to the predictor. It thus adjusts the trained classifiers during the test time, providing them with the information from the latest unlabeled data. Extensive experiments on three large-scale real-world datasets demonstrate the superiority of TRE. By consistently incorporating information from the nearest neighbors, TRE demonstrates high adaptability and surpasses existing methods in performance.

Abstract: Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a lossresilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliot model into the training process, we enhance the model's ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments.

Abstract: LocationBased Social Networks (LBSNs) offer a rich dataset of user activity at Points-of-Interest (POIs), making next POI recommendation a key task. Traditional algorithms face challenges due to broad searching scopes, affecting recommendation accuracy. Users tend to visit nearby POIs and show temporal concentration in their activities, reflecting personalized spatio-temporal clustering. However, individual user data may be insufficient to capture these clustering effects for personalized recommendations. In this paper, we propose an integrated Personalized Spatio-Temporal Clustering Model (iPCM) for next POI recommendation. The model learns this kind of personalized spatio-temporal clustering effect by using global historical trajectory data in conjunction with user feature embeddings. It integrates the features of personalized spatio-temporal clustering with the user's trajectory, and completes the user's POI recommendation through a Transformer encoding and MLP decoding. To enhance the accuracy of predictions, we add a module of probability adjustment. The experimental results on multiple datasets show that with the help of personalized spatio-temporal clustering, the proposed iPCM is superior to existing methods in various evaluation metrics.

Abstract: Dynamic interacting system modeling is important for understanding and simulating real world systems, e.g., meteorology and the spread of COVID. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODE) receive increasing research attentions. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space, and neglect the intrinsic geometry of the system and physics laws, e.g., the principle of entropy increasing. The aforementioned limitations motivate us to rethink the system dynamics from a fresh perspective of Riemannian geometry, and pose a more realistic problem of physicsinformed dynamic system modeling, considering the underlying geometry and physics law for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed as Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow, and a manifold preserving Gyro-transform aware of system geometry. Theoretically, we report the provable entropy non-decreasing of our formulation, obeying the physics laws. Empirical results show the superiority of Pioneer on real datasets.

Abstract: Model extraction attack shows promising performance in revealing sequential recommendation (SeqRec) robustness, e.g., as an upstream task of transferbased attack to provide optimization feedback for downstream attacks. However, existing work either heavily relies on impractical prior knowledge or has impressive attack performance. In this paper, we focus on data-free model extraction attack on SeqRec, which aims to efficiently train a surrogate model that closely imitates the target model in a practical setting. Conducting such an attack is challenging. First, imitating sequential training data for accurate model extraction is hard without prior knowledge. Second, limited queries for the target model require the attack to be efficient. To address these challenges, we propose a novel adversarial framework Sim4Rec which includes two modules, i.e., controllable sequence generation and reinforced adversarial distillation. The former allows a sequential generator to produce synthetic data similar to training data through pre-training with controllable generated samples. The latter efficiently extracts the target model via reinforced adversarial knowledge distillation. Extensive experiments demonstrate the advancement of Sim4Rec.

School of Software, Beihang University, Beijing, China Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, School of Software, Beihang University, Beijing, China Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China

Abstract: Graph neural networks (GNNs) provide important prospective insights in applications such as social behavior analysis and financial risk analysis based on their powerful learning capabilities on graph data. Nevertheless, GNNs' predictive performance relies on the quality of taskspecific node labels, so it is common practice to improve the model's generalization ability in the downstream execution of decision-making tasks through pre-training. Graph prompting is a prudent choice but risky without taking measures to prevent data leakage. In other words, in high-risk decision scenarios, prompt learning can infer private information by accessing model parameters trained on private data (publishing model parameters in pre-training, i.e., without directly leaking the raw data, is a tacitly accepted trend). However, myriad graph inference attacks necessitate tailored module design and processing to enhance inference capabilities due to variations in supervision signals. In this paper, we propose a novel Prompt-based unifying Inference Attack framework on GNNs, named ProIA. Specifically, ProIA retains the crucial topological information of the graph during pre-training, enhancing the background knowledge of the inference attack model. It then utilizes a unified prompt and introduces additional disentanglement factors in downstream attacks to adapt to task-relevant knowledge. Finally, extensive experiments show that ProIA enhances attack capabilities and demonstrates remarkable adaptability to various inference attacks.

Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL) Ping An Property & Casualty Insurance Company of China, Ltd, Shenzhen Yishi Huolala Technology Limited

Abstract: Collaborative Filtering (CF) based on graph neural networks (GNNs) has yielded immense success for recommendation systems by capturing highorder dependencies from implicit feedback. Recently, the outstanding text comprehension ability of the Large Language Models (LLMs) has shown promising potential to provide auxiliary semantics for collaborative representation. However, when aligning textual information with collaborative signals, inconsistent semantics between user-item and item-item text pairs may lead to the degradation of the alignment model, thus hindering the recommender system from effectively utilizing heterogeneous information. In this paper, we propose a novel method: Semantic Enhanced Heterogeneous Hypergraph Network (SEHHN), which enhances the representations of CF correlations with semantics, thereby avoiding alignment degradation. To better model the collaborative signals, we design a graph autoencoder that captures the bidirectional relationship between user preferences and item features in review semantics. Furthermore, we develop an LLM-based item classifier to adaptively exploit potential correlations of items via the co-occurrences of item features. Finally, we design a heterogeneous hypergraph network to achieve efficient alignment and propagation of heterogeneous information, thereby alleviating the impact of semantic inconsistency on CFs. Extensive experiments on three real-world datasets demonstrate that our proposed SEHHN outperforms existing SOTA methods and validates the effectiveness of each component.

Abstract: Sensors are commonly deployed to perceive the environment. However, due to the high cost, sensors are usually sparsely deployed. Kriging is the tailored task to infer the unobserved nodes (without sensors) using the observed nodes (with sensors). The essence of kriging task is transferability. Recently, several inductive spatiotemporal kriging methods have been proposed based on graph neural networks, being trained based on a graph built on top of observed nodes via pretext tasks such as masking nodes out and reconstructing them. However, the graph in training is inevitably much sparser than the graph in inference that includes all the observed and unobserved nodes. The learned pattern cannot be well generalized for inference, denoted as graph gap. To address this issue, we first present a novel Increment training strategy: instead of masking nodes (and reconstructing them), we add virtual nodes into the training graph so as to mitigate the graph gap issue naturally. Nevertheless, the empty-shell virtual nodes without labels could have bad-learned features and lack supervision signals. To solve these issues, we pair each virtual node with its most similar observed node and fuse their features together; to enhance the supervision signal, we construct reliable pseudo labels for virtual nodes. As a result, the learned pattern of virtual nodes could be safely transferred to real unobserved nodes for reliable kriging. We name our new Kriging model with Increment Training Strategy as KITS. Extensive experiments demonstrate that KITS consistently outperforms existing methods by large margins, e.g., the improvement over MAE score could be as high as 18.33%.

Abstract: Visual Entity Linking (VEL) is a crucial task for achieving finegrained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards fine-grained. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention by a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.

Abstract: Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. ParameterEfficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perception Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness.

Abstract: Heterogeneous graphs (HGs) that contain various node and edge types are ubiquitous in realworld scenarios. Considering the common label sparsity problem in HGs, some researchers propose to pretrain on source HGs to extract general knowledge and then fine-tune on a target HG for knowledge transfer. However, existing methods often assume that source and target HGs share a single heterogeneity, meaning that they have the same types of nodes and edges, which contradicts the real-world scenarios requiring cross-heterogeneity transfer. Although a recent study has made some preliminary attempts in cross-heterogeneity learning, its definition of general knowledge heavily rely on human knowledge, which lacks flexibility and further leads to a suboptimal transfer. To address the problem, we propose a novel Language Model-enhanced Cross-Heterogeneity learning model, namely LMCH. Specifically, we first design a metapath-based corpus construction method to unify HG representations as languages. The corpora of source HGs are then used to fine-tune a pretrained Language Model (LM), enabling the LM to autonomously extract general knowledge across different HGs. Furthermore, to fully utilize the extensive unlabeled nodes in a few-labeled target HG, we propose an iterative training pipeline with the help of an extra Graph Neural Network (GNN) predictor, enhanced by LM-GNN contrastive alignment at the end of each iteration. Extensive experiments on four real-world datasets have demonstrated the superior performance of LMCH over state-of-the-art methods.

Abstract: Tabular anomaly detection under the oneclass classification setting poses a significant challenge, as it involves accurately conceptualizing "normal" derived exclusively from a single category to discern anomalies from normal data variations. Capturing the intrinsic correlation among attributes within normal samples presents one promising method for learning the concept. To do so, the most recent effort relies on a learnable mask strategy with a reconstruction task. However, this wisdom may suffer from the risk of producing uniform masks, i.e., essentially nothing is masked, leading to less effective correlation learning. To address this issue, we presume that attributes related to others in normal samples can be divided into two non-overlapping and correlated subsets, defined as CorrSets, to capture the intrinsic correlation effectively. Accordingly, we introduce an innovative method that disentangles CorrSets from normal tabular data. To our knowledge, this is a pioneering effort to apply the concept of disentanglement for one-class anomaly detection on tabular data. Extensive experiments on 20 tabular datasets show that our method substantially outperforms the state-of-the-art methods and leads to an average performance improvement of 6.1% on AUC-PR and 2.1% on AUC-ROC.

School of Hotel and Tourism Management, The Hong Kong Polytechnic University, School of Computing and Artificial Intelligence, Southwest University of Finance and Economics, School of Computer Science, Wuhan University, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, German Research Center for Artificial Intelligence (DFKI) and RPTU Kaiserslautern, School of Computer Science, Wuhan University, German Research Center for Artificial Intelligence (DFKI) and RPTU Kaiserslautern, University of Queensland, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, College of Information Science and Technology, Shihezi University

Abstract: Graph neural networks (GNNs) have shown promise in integrating proteinprotein interaction (PPI) networks for identifying cancer genes in recent studies. However, due to the insufficient modeling of the biological information in PPI networks, more faithfully depiction of complex protein interaction patterns for cancer genes within the graph structure remains largely unexplored. This study takes a pioneering step toward bridging biological anomalies in protein interactions caused by cancer genes to statistical graph anomaly. We find a unique graph anomaly exhibited by cancer genes, namely weight heterogeneity, which manifests as significantly higher variance in edge weights of cancer gene nodes within the graph. Additionally, from the spectral perspective, we demonstrate that the weight heterogeneity could lead to the "flattening out" of spectral energy, with a concentration towards the extremes of the spectrum. Building on these insights, we propose the HIerarchical-Perspective Graph Neural Network (HIPGNN) that not only determines spectral energy distribution variations on the spectral perspective, but also perceives detailed protein interaction context on the spatial perspective. Extensive experiments are conducted on two reprocessed datasets STRINGdb and CPDB, and the experimental results demonstrate the superiority of HIPGNN.

Abstract: Graph Masked AutoEncoder (GMAE) has recently attracted vast interest in handling graphrelated tasks by adopting the 'masking-reconstruction' learning paradigm. Most existing GMAE-based methods adhere to the homophily assumption, i.e., connected nodes share the same attributes or labels. However, this assumption is not always right because most graphs from real-world applications are mixed by both homophilic and heterophilic edges. Therefore, it is necessary to distinguish them to improve the representative ability of GMAE. In this paper, we propose a teacher-guided edge discriminator for the personalized graph masked autoencoder (TEDMAE). Specifically, we design a teacher-guided edge discriminator that distinguishes homophilic and heterophilic edges by leveraging the embeddings from teacher models with structure and attribute knowledge. Then, we present a personalized graph masked autoencoder that individually tailors the masking, encoding, and reconstruction processes for each graph. Finally, we optimize the model by minimizing two types of loss functions, i.e., the scaled cosine error (SCE) loss and the InfoNCE loss. Experimental results on 10 datasets demonstrate the superior performance of TEDMAE on the tasks of node classification and node clustering.

Abstract: Spatialtemporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to learn spatial-temporal representations. However, it encounters three key challenges: 1) the difficulty in selecting reliable negative pairs due to the homogeneity of variables, hindering contrastive learning methods; 2) overlooking spatial correlations across variables over time; 3) limitations of efficiency and scalability in existing self-supervised learning methods. To tackle these, we propose a lightweight representation-learning model ST-ReP, integrating current value reconstruction and future value prediction into the pre-training framework for spatial-temporal forecasting. And we design a new spatial-temporal encoder to model fine-grained relationships. Moreover, multi-time scale analysis is incorporated into the self-supervised loss to enhance predictive capability. Experimental results across diverse domains demonstrate that the proposed model surpasses pre-training-based baselines, showcasing its ability to learn compact and semantically enriched representations while exhibiting superior scalability.

Abstract: Graph Neural Networks (GNNs) are widely used in graph data mining tasks. Traditional GNNs follow a message passing scheme that can effectively utilize local and structural information. However, the phenomena of oversmoothing and over-squashing limit the receptive field in message passing processes. Graph Transformers were introduced to address these issues, achieving a global receptive field but suffering from the noise of irrelevant nodes and loss of structural information. Therefore, drawing inspiration from fine-grained token-based representation learning in Natural Language Processing (NLP), we propose the Structure-aware Multi-token Graph Transformer (Tokenphormer), which generates multiple tokens to effectively capture local and structural information and explore global information at different levels of granularity. Specifically, we first introduce the walk-token generated by mixed walks consisting of four walk types to explore the graph and capture structure and contextual information flexibly. To ensure local and global information coverage, we also introduce the SGPM-token (obtained through the Self-supervised Graph Pre-train Model, SGPM) and the hop-token, extending the length and density limit of the walk-token, respectively. Finally, these expressive tokens are fed into the Transformer model to learn node representations collaboratively. Experimental results demonstrate that the capability of the proposed Tokenphormer can achieve state-of-the-art performance on node classification tasks.

Abstract: The recognition of whether or not a predicate should be invented is an important problem in the domain of predicate invention. Despite its significance, existing research has yet to fully harness the rich data available in knowledge graphs. In this paper, we introduce a novel problem formulation, ReLPI (Representation Learning for Predicate Invention in Knowledge Graphs), marking a pioneering effort in this domain. To address the core issues of ReLPI, we devise a scoring function that informs the learning process. By optimizing embeddings towards this scoring function, we endow them with semantic meaning, crucial for capturing the nuances of predicate presence patterns. Furthermore, we present SEmPI (Semantic Embeddings for Predicate Invention), a framework that leverages predicate (relation) embeddings as a trainable medium. SEmPI uncovers latent patterns governing predicate occurrences in knowledge graphs, enabling the invention of novel predicates grounded in these discovered patterns. This approach represents a significant step forward in leveraging datadriven methods for predicate invention in knowledge graphs. We evaluate the proposed approach on FB15k and DRKG datasets, and the results demonstrate the effectiveness of SEmPI in discovering new predicates.

Abstract: Coalition formation concerns autonomous agents that strategically interact to form selforganized coalitions. When agents lack initial sufficient information to evaluate their preferences before interacting with others, they learn them online through repeated feedback while iteratively forming coalitions. In this work, we introduce online learning in coalition formation from a non-cooperative perspective, studying the impact of collective data utilization where selfish agents aim to accelerate their learning by leveraging a shared data platform. Thus, the efficiency and dynamics of the learning process are affected by each agent's local feedbacks, motivating us to explore the tension between semi-bandit and bandit feedback, which differ in the granularity of utility information observed by each agent. Under our non-cooperative viewpoint, we evaluate the system by means of Nash stability, where no agent can improve her utility by unilaterally deviating. Our main result is a sample-efficient algorithm for selfish agents that aims to minimize their Nash regret under both semi-bandit and bandit feedback, implying approximately Nash stable outcomes. Under both feedback settings, our algorithm enjoys Nash regret and sample complexity bounds that are optimal up to logarithmic factors.

Abstract: To model complex realworld systems, such as traders in stock markets, or the dissemination of contagious diseases, graphon mean-field games (GMFG) have been proposed to model many agents. Despite the empirical success, our understanding of GMFG is limited. Popular algorithms such as mirror descent are deployed but remain unknown for their convergence properties. In this work, we give the first last-iterate convergence rate of mirror descent in regularized monotone GMFG. In tabular monotone GMFG with finite state and action spaces and under bandit feedback, we show a last-iterate convergence rate of O(T^{-1/4}). Moreover, when exact knowledge of costs and transitions is available, we improve this convergence rate to O(T^{-1}), matching the existing convergence rate observed in strongly convex games. In linear GMFG, our algorithm achieves a last-iterate convergence rate of O(T^{-1/5}). Finally, we verify the performance of the studied algorithms by empirically testing them against fictitious play in a variety of tasks.

Abstract: Mean field games (MFGs) tractably model behavior in large agent populations. The literature on learning MFG equilibria typically focuses on finding Nash equilibria (NE), which assume perfectly rational agents and are hence implausible in many realistic situations. To overcome these limitations, we incorporate bounded rationality into MFGs by leveraging the wellknown concept of quantal response equilibria (QRE). Two novel types of MFG QRE enable the modeling of large agent populations where individuals only noisily estimate the true objective. We also introduce a second source of bounded rationality to MFGs by restricting the agents' planning horizon. The resulting novel receding horizon (RH) MFGs are combined with QRE and existing approaches to model different aspects of bounded rationality in MFGs. We formally define MFG QRE and RH MFGs and compare them to existing equilibrium concepts such as entropy-regularized NE. Subsequently, we design generalized fixed point iteration and fictitious play algorithms to learn QRE and RH equilibria. After a theoretical analysis, we give different examples to evaluate the capabilities of our learning algorithms and outline practical differences between the equilibrium concepts.

Abstract: Coalition formation over graphs is a well studied class of games whose players are vertices and feasible coalitions must be connected subgraphs. In this setting, the existence and computation of equilibria, under various notions of stability, has attracted a lot of attention. However, the natural process by which players, starting from any feasible state, strive to reach an equilibrium after a series of unilateral improving deviations, has been less studied. We investigate the convergence of dynamics towards individually stable outcomes under the following perspective: what are the most general classes of preferences and graph topologies guaranteeing convergence? To this aim, on the one hand, we cover a hierarchy of preferences, ranging from the most general to a subcase of additively separable preferences, including individually rational and monotone cases. On the other hand, given that convergence may fail in graphs admitting a cycle even in our most restrictive preference class, we analyze acyclic graph topologies such as trees, paths, and stars.

Abstract: Due to the sensitivity of data, Federated Learning (FL) is employed to enable distributed machine learning while safeguarding data privacy and accommodating the requirements of various devices. However, in the context of semidecentralized FL, clients’ communication and training states are dynamic. This variability arises from local training fluctuations, heterogeneous data distributions, and intermittent client participation. Most existing studies primarily focus on stable client states, neglecting the dynamic challenges inherent in realworld scenarios. To tackle this issue, we propose a TRust-Aware clIent scheduLing mechanism called TRAIL, which assesses client states and contributions, enhancing model training efficiency through selective client participation. We focus on a semi-decentralized FL framework where edge servers and clients train a shared global model using unreliable intra-cluster model aggregation and inter-cluster model consensus. First, we propose an adaptive hidden semi-Markov model to estimate clients’ communication states and contributions. Next, we address a client-server association optimization problem to minimize global training loss. Using convergence analysis, we propose a greedy client scheduling algorithm. Finally, our experiments conducted on real-world datasets demonstrate that TRAIL outperforms state-of-the-art baselines, achieving an improvement of 8.7% in test accuracy and a reduction of 15.3% in training loss.

Abstract: We consider correlated equilibria in an adversarial environment, where an adversary can compromise the public signal used by the players for choosing their strategies, while players aim at detecting a potential attack as soon as possible to avoid loss of utility. We model the interaction between the adversary and the players as a zerosum game and we derive the maxmin strategies for both the defender and the attacker using the framework of quickest change detection. We define a class of adversarial strategies that achieve the optimal trade-off between the impact and the detectability of the attack for the adversary and show that a generalized CUSUM scheme is asymptotically optimal for their detection. Our numerical results on the Sioux-Falls benchmark traffic routing game show that the proposed detection scheme can effectively limit the utility loss by a potential adversary.

Abstract: This paper studies a generalized variant of the Colonel Blotto game, referred to as the Colonel Blotto game with costs. Unlike the classic Colonel Blotto game, which imposes the useit-or-lose-it budget assumption, the Colonel Blotto game with costs captures the strategic importance of costs related both to obtaining resources and assigning them across battlefields. We show that every instance of the Colonel Blotto game with costs is strategically equivalent to an instance of the zero-sum Colonel Blotto game with one additional battlefield. This enables the computation of Nash equilibria of the Colonel Blotto game with costs in polynomial time with respect to the game parameters: the number of battlefields plus the number of resources available to the players.

Abstract: We propose a decentralized market model in which agents can negotiate bilateral contracts. This builds on a similar, but centralized, model of trading networks introduced by Hatfield et al. in 2013. Prior work has established that fullysubstitutable preferences guarantee the existence of competitive equilibria which can be centrally computed. Our motivation comes from the fact that prices in markets such as over-the-counter markets and used car markets arise from decentralized negotiation among agents, which has left open an important question as to whether equilibrium prices can emerge from agent-to-agent bilateral negotiations. We design a best response dynamic intended to capture such negotiations between market participants. We assume fully substitutable preferences for market participants. In this setting, we provide proofs of convergence for sparse markets (covering many real world markets of interest), and experimental results for more general cases, demonstrating that prices indeed reach equilibrium, quickly, via bilateral negotiations. Our best response dynamic, and its convergence behavior, forms an important first step in understanding how decentralized markets reach, and retain, equilibrium.

Abstract: In forecasting competitions, the traditional mechanism scores the predictions of each contestant against the outcome of each event, and the contestant with the highest total score wins. While it is wellknown that this traditional mechanism can suffer from incentive issues, it is folklore that contestants will still be roughly truthful as the number of events grows. Yet thus far the literature lacks a formal analysis of this traditional mechanism. This paper gives the first such analysis. We first demonstrate that the "long-run truthfulness" folklore is false: even for arbitrary numbers of events, the best forecaster can have an incentive to hedge, reporting more moderate beliefs to increase their win probability. On the positive side, however, we show that two contestants will be approximately truthful when they have sufficient uncertainty over the relative quality of their opponent and the outcomes of the events, a case which may arise in practice.

Abstract: We study the kfacility location games with optional preferences on the line. In the games, each strategic agent has a public location preference on the k facility locations and a private optional preference on the preferred/acceptable set of facilities out of the k facilities. Our goal is to design strategyproof mechanisms to elicit agents’ optional preferences and locate k facilities to minimize the social or maximum cost of agents based on their facility preferences and public agent locations. We consider two variants of the facility location games with optional preferences: the Min variant and the Max variant where the agent’s cost is defined as their distance to the closest acceptable facility and the farthest acceptable facility, respectively. For the Min variant, we present two deterministic strategyproof mechanisms to minimize the maximum cost and social cost with k ≥ 3 facilities, achieving approximation ratios of 3 and 2n+1 respectively. We complement the results by establishing lower bounds of 3/2 and n/4 for the approximation ratios achievable by any deterministic strategyproof mechanisms for the maximum cost and social cost, respectively. We then improve our results in a special setting of the Min variant where there are exactly three facilities and present two deterministic strategyproof mechanisms to minimize the maximum cost and social cost. For the Max variant, we present an optimal deterministic strategyproof mechanism for the maximum cost and a k-approximation deterministic strategyproof mechanism for the social cost.

Abstract: Strategic interactions can be represented more concisely, and analyzed and solved more efficiently, if we are aware of the symmetries within the multiagent system. Symmetries also have conceptual implications, for example for equilibrium selection. We study the computational complexity of identifying and using symmetries. Using the classical framework of normalform games, we consider game symmetries that can be across some or all players and/or actions. We find a strong connection between game symmetries and graph automorphisms, yielding graph automorphism and graph isomorphism completeness results for characterizing the symmetries present in a game. On the other hand, we also show that the problem becomes polynomial-time solvable when we restrict the consideration of actions in one of two ways. Next, we investigate when exactly game symmetries can be successfully leveraged for Nash equilibrium computation. We show that finding a Nash equilibrium that respects a given set of symmetries is PPAD- and CLS-complete in general-sum and team games respectively---that is, exactly as hard as Brouwer fixed point and gradient descent problems. Finally, we present polynomial-time methods for the special cases where we are aware of a vast number of symmetries, or where the game is two-player zero-sum and we do not even know the symmetries.

Abstract: Knowledge tracing (KT) involves using the historical records of studentlearning interactions to anticipate their performance on forthcoming questions. Central to this process is the modeling of human cognition to gain deeper insights into how knowledge is acquired and retained. Human cognition is characterized by two key features: long-term cognitive trends, reflecting the gradual accumulation and stabilization of knowledge over time, and short-term cognitive fluctuations, which arise from transient factors such as forgetting or momentary lapses in attention. Although existing attention-based KT models effectively capture long-term cognitive trends, they often fail to adequately address short-term cognitive fluctuations. These limitations lead to overly smoothed cognitive features and reduced model performance, especially when the test data length exceeds the training data length. To address these problems, we propose FlucKT, a novel short-term cognitive fluctuations enhanced attention network for KT tasks. FlucKT improves the attention mechanism in two ways: First, by using a decomposition-based layer with causal convolution to separate and dynamically reweight long-term and short-term cognitive features. Second, by introducing a kernelized bias attention score penalty to enhance focus on short-term fluctuations, improving length generalization capabilities. Our contributions are validated through extensive experiments on three real-world datasets, demonstrating significant improvements in length generalization and prediction performance.

Abstract: Even though data annotation is extremely important for interpretability, research, and development of artificial intelligence solutions, annotating data remains costly. Research efforts such as active learning or fewshot learning alleviate the cost by increasing sample efficiency, yet the problem of annotating data more quickly has received comparatively little attention. Leveraging a predictor has been shown to reduce annotation cost in practice but has not been theoretically considered. We ask the following question: to annotate a binary classification dataset with N samples, can the annotator answer less than N yes/no questions? Framing this question-and-answer (Q&A) game as an optimal encoding problem, we find a positive answer given by the Huffman encoding of the possible labelings. Unfortunately, the algorithm is computationally intractable even for small dataset sizes. As a practical method, we propose to minimize a cost function a few steps ahead, similarly to lookahead minimization in optimal control. This solution is analyzed, compared with the optimal one, and evaluated using several synthetic and real-world datasets. The method allows a significant improvement (23-86%) in the annotation efficiency of real-world datasets.

Abstract: The gold standard in humanAI collaboration is complementarity: when combined performance exceeds both the human and algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a "No Free Lunch"-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved "for free." The result does suggest one model of collaboration with guarantees, where one agent identifies "obvious" errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance to human-AI collaboration.

Abstract: Reconstructing perceived images from human brain activity forms a crucial link between human and machine learning through BrainComputer Interfaces. Early methods primarily focused on training separate models for each individual to account for individual variability in brain activity, overlooking valuable cross-subject commonalities. Recent advancements have explored multisubject methods, but these approaches face significant challenges, particularly in data privacy and effectively managing individual variability. To overcome these challenges, we introduce BrainGuard, a privacy-preserving collaborative training framework designed to enhance image reconstruction from multisubject fMRI data while safeguarding individual privacy. BrainGuard employs a collaborative global-local architecture where personalized models are trained on each subject's data and operate in conjunction with a shared commonality model that captures and leverages cross-subject patterns. This architecture eliminates the need to aggregate fMRI data across subjects, thereby ensuring privacy preservation. To tackle the complexity of fMRI data, BrainGuard integrates a hybrid synchronization strategy, enabling individual models to dynamically incorporate parameters from the global model. By establishing a secure and collaborative training environment, BrainGuard not only protects sensitive brain activity data but also improves the accuracy of image reconstructions. Extensive experiments demonstrate that BrainGuard sets a new benchmark in both high-level and low-level metrics, advancing the state-of-the-art in brain decoding through its innovative design.

Abstract: Functional Magnetic Resonance Image (fMRI) is commonly employed to study human brain activity, since it offers insight into the relationship between functional fluctuations and human behavior. To enhance analysis and comprehension of brain activity, Graph Neural Networks (GNNs) have been widely applied to the analysis of functional connectivities (FC) derived from fMRI data, due to their ability to capture the synergistic interactions among brain regions. However, in the human brain, performing complex tasks typically involves the activation of certain pathways, which could be represented as paths across graphs. As such, conventional GNNs struggle to learn from these pathways due to the longrange dependencies of multiple pathways. To address these challenges, we introduce a novel framework BrainMAP to learn multiple pathways in brain networks. BrainMAP leverages sequential models to identify long-range correlations among sequentialized brain regions and incorporates an aggregation module based on Mixture of Experts (MoE) to learn from multiple pathways. Our comprehensive experiments highlight BrainMAP's superior performance. Furthermore, our framework enables explanatory analyses of crucial brain regions involved in tasks.

Abstract: Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, either curation of new data for continual alignment or manual correction of outdated data for realignment, demand costly human resources. To address this, we propose a novel approach, LLM BehAvior Correction with INfluence FunCtion REcall and Post-Training (LANCET), which needs no human involvement. LANCET consists of two phases: (1) using a new method LinFAC to efficiently identify the training data that significantly impact undesirable model outputs, and (2) applying an novel Influence-driven Bregman Optimization (IBO) technique to adjust the model’s outputs based on these influence distributions. Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. Further more, LANCET exhibits stronger generalization ability than all baselines under out-of-distribution harmful prompts, offering better interpretability and compatibility with real-world applications of LLMs.

Abstract: Video captioning automatically generates natural language phrases to explain the contents in video frames. When deploying captioning models in specialized domains, active learning can help reduce the high annotation cost. However, the generative nature of the captioning process is more complex than standard supervised learning tasks and introduces several challenges for active learning in video captioning. Entropybased uncertainty estimation, which is widely used in active learning, may be inflated in captioning tasks and mislead active sampling. Another challenge arises from the rich content of videos, as each video could be described in multiple ways. A single uncertainty score obtained from one possible caption does not capture the diversity induced by the rich content. To fill out this gap, we propose identifying multiple sources of uncertainty and performing hierarchical aggregation to integrate uncertainty from distinct sources. This innovates a holistic uncertainty metric to quantify the overall informativeness of video content for active sampling. The overall uncertainty is built upon conditional vacuity, an extension of the second-order uncertainty introduced along with the evidential learning framework to the captioning setting, leading to more robust uncertainty estimation without inflation. Both theoretical analysis and experimental evaluation are conducted to demonstrate the effectiveness of the proposed framework for complex uncertainty estimation and interactive learning.

State Key Laboratory of Brain-machine Intelligence, Zhejiang University； College of Computer Science and Technology, Zhejiang University;, State Key Laboratory of Brain-machine Intelligence, Zhejiang University； College of Computer Science and Technology, Zhejiang University;, State Key Laboratory of Brain-machine Intelligence, Zhejiang University； College of Computer Science and Technology, Zhejiang University;, Department of Neurobiology, Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, Zhejiang University School of Medicine; MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University; State Key Laboratory of Brain-machine Intelligence, Zhejiang University, State Key Laboratory of Brain-machine Intelligence, Zhejiang University； College of Computer Science and Technology, Zhejiang University;, The First Affiliated Hospital, College of Medicine, Zhejiang University, Department of Neurobiology, Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, Zhejiang University School of Medicine; MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University; State Key Laboratory of Brain-machine Intelligence, Zhejiang University, State Key Laboratory of Brain-machine Intelligence, Zhejiang University； College of Computer Science and Technology, Zhejiang University; MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University;

Abstract: Sleep staging is important for monitoring sleep quality and diagnosing sleeprelated disorders. Recently, numerous deep learning-based models have been proposed for automatic sleep staging using polysomnography recordings. Most of them are trained and tested on the same labeled datasets which results in poor generalization to unseen target domains. However, they regard the subjects in the target domains as a whole and overlook the individual discrepancies, which limits the model's generalization ability to new patients (i.e., unseen subjects) and plug-and-play applicability in clinics. To address this, we propose a novel Source-Free Unsupervised Individual Domain Adaptation (SF-UIDA) framework for sleep staging, leveraging sequential cross-view contrasting and pseudo-label based fine-tuning. It is actually a two-step subject-specific adaptation scheme, which enables the source model to effectively adapt to newly appeared unlabeled individual without access to the source data. It meets the practical needs in real-world scenarios, where the personalized customization can be plug-and-play applied to new ones. Our framework is applied to three classic sleep staging models and evaluated on three public sleep datasets, achieving the state-of-the-art performance.

Abstract: Multirobot task planning and collaboration are critical challenges in robotics. While Behavior Trees (BTs) have been established as a popular control architecture and are plannable for a single robot, the development of effective multi-robot BT planning algorithms remains challenging due to the complexity of coordinating diverse action spaces. We propose the Multi-Robot Behavior Tree Planning (MRBTP) algorithm, with theoretical guarantees of both soundness and completeness. MRBTP features cross-tree expansion to coordinate heterogeneous actions across different BTs to achieve the team's goal. For homogeneous actions, we retain backup structures among BTs to ensure robustness and prevent redundant execution through intention sharing. While MRBTP is capable of generating BTs for both homogeneous and heterogeneous robot teams, its efficiency can be further improved. We then propose an optional plugin for MRBTP when Large Language Models (LLMs) are available to reason goal-related actions for each robot. These relevant actions can be pre-planned to form long-horizon subtrees, significantly enhancing the planning speed and collaboration efficiency of MRBTP. We evaluate our algorithm in warehouse management and everyday service scenarios. Results demonstrate MRBTP's robustness and execution efficiency under varying settings, as well as the ability of the pre-trained LLM to generate effective task-specific subtrees for MRBTP.

Abstract: Natural language is the most intuitive means for humans to interact with robots, making task planning based on natural language commands a longstanding area of research. Large language models (LLMs) have significantly improved task planning by enhancing understanding of language and common sense. However, current methods still face several challenges: they lack a deep understanding of physical environments, their performance relies heavily on prompt examples, LLMs are oversized and not customized for specific tasks, and the planning costs remain high. To overcome these issues, we introduce the GNNTransformer Task Planner (GTTP), designed to predict task-level actions by leveraging the semantic environment and incorporating historical state data. The GTTP architecture is scalable through the use of GNN layers, while transformer layers facilitate understanding task progression. In addition, our model uses a text encoder to embed environments, allowing it to be trained on simulated datasets and applied directly in real-world scenarios. We also propose an automated data generation method that includes semantic augmentation, planning verification, and instruction generation via LLM. This method enables the collection of 14k instruction-annotated tasks in the VirtualHome environment with minimal human effort. The model has been validated across diverse scenes containing up to 715 objects, achieving significantly higher success rates compared to baseline models. It has also been successfully deployed on a physical mobile manipulator, demonstrating its practical applicability and effectiveness.

Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China, Intel Labs, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China, Intel Labs, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China

Abstract: Visual localization is a fundamental machine learning problem. Absolute Pose Regression (APR) trains a scenedependent model to efficiently map an input image to the camera pose in a pre-defined scene. However, many applications have continually changing environments, where inference data at novel poses or scene conditions (weather, geometry) appear after deployment. Training APR on a fixed dataset leads to overfitting, making it fail catastrophically on challenging novel data. This work proposes Continual Domain Expansion (ConDo), which continually collects unlabeled inference data to update the deployed APR. Instead of applying standard unsupervised domain adaptation methods which are ineffective for APR, ConDo effectively learns from unlabeled data by distilling knowledge from scene-agnostic localization methods. By sampling data uniformly from historical and newly collected data, ConDo can effectively expand the generalization domain of APR. Large-scale benchmarks with various scene types are constructed to evaluate models under practical (long-term) data changes. ConDo consistently and significantly outperforms baselines across architectures, scene types, and data changes. On challenging scenes (Fig.1), it reduces the localization error by >7x (14.8m vs 1.7m). Analysis shows the robustness of ConDo against compute budgets, replay buffer sizes and teacher prediction noise. Comparing to model re-training, ConDo achieves similar performance up to 25x faster.

Abstract: Learning discriminative state representations of agents, encompassing the spatial layout and temporal pose trajectory, is essential for effective navigation decisions. However, existing approaches often rely on simplistic plain networks for navigation information fusion, overlooking the complex longrange dependencies across spatio-temporal cues, which leads to suboptimal state perception and potential decision failures. In this paper, we introduce NaviFormer, an effective encoder-decoder navigation transformer, to aggregate discriminative spatio-temporal context information for object navigation. Our navigation encoder not only encodes spatial layouts and temporal agent poses but also innovatively constructs and encodes a passable frontier map, enriching the original state encoding with cues of potential exploration regions. Furthermore, our navigation decoder employs spatio-temporal self-attention and cross-attention mechanisms to model the dependencies among spatial layout encoding, temporal pose encoding, and passable frontier encoding, thereby facilitating comprehensive contextual state feature aggregation. Finally, we leverage these learned spatio-temporal contextual state representations for PPO-based navigation decisions. Extensive experiments on the Gibson, Habitat-Matterport3D (HM3D) and Matterport3D (MP3D) datasets demonstrate the superiority of our approach.

Abstract: In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. To that end, we formally define causal interventions and their effects over time on discretetime stochastic processes (DSPs). Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework.

Abstract: We present a new formal framework for generalized planning (GP) based on the situation calculus extended with LTL constraints. The GP problem is specified by a firstorder basic action theory whose models are the problem instances. This low-level theory is then abstracted into a high-level propositional nondeterministic basic action theory with a single model. A refinement mapping relates the two theories. LTL formulas are used to specify the temporally extended goals as well as assumed trace constraints. If all LTL trace constraints hold at the low level and the high-level model can simulate all the low-level models with respect to the mapping, we say that we have a temporally lifted abstraction. We prove that if we have such an abstraction and the agent has a strategy to achieve a LTL goal under some trace constraints at the abstract level, then there exists a refinement of the strategy to achieve the refinement of the goal at the concrete level. We use LTL synthesis to generate the strategy at the abstract level. We illustrate our approach by synthesizing a program that solves a data structure manipulation problem.

Abstract: This paper introduces a general framework for generateand-test-based solvers for epistemic logic programs that can be instantiated with different generate and test programs, and it provides sufficient conditions on those programs for the correctness of the solvers built using this framework. It also introduces a new generator program that incorporates the propagation of epistemic consequences and shows that this can exponentially reduce the number of candidates that need to be tested while only incurring a linear overhead. We implement a new solver based on these theoretical findings and experimentally show that it outperforms existing solvers by achieving a ~3.3x speed-up and solving 87% more instances on well-known benchmarks.

Abstract: Structural equation models (SEM) are a standard approach to representing causal dependencies between variables. In this paper we propose a new interpretation of existing formalisms in the field of Actual Causality in which SEM's are viewed as mechanisms transforming the dynamics of exogenous variables into the dynamics of endogenous variables. This allows us to combine counterfactual causal reasoning with existing temporal logic formalizms, and to introduce a temporal logic, CPLTL, for causal reasoning about such structures. Then, we demonstrate that the standard restriction to socalled recursive models (with no cycles in the dependency graphs) is not necessary in our approach. This fact provides us extra tools for reasoning about mutually dependent processes and feedback loops. Finally, we introduce the notions of model equivalence for temporal causal models and show that CPLTL has an efficient model-checking procedure.

Abstract: This paper introduces two new compilation languages restricting weak decomposable negation normal form (wDNNF) circuits and integrates them into the knowledge compilation map. Positive (resp. negative) wDNNF circuits restrict wDNNF circuits so that each variable shared among the inputs of a conjunction node can only have positive (resp. negative) occurrences in that subcircuit. Unlike wDNNF circuits, pwDNNF (resp. nwDNNF) circuits satisfy the maximum (resp. minimum) cardinality query. We present a compiler for converting CNF formulae into pwDNNF and nwDNNF circuits by extending Bella the state-of-the-art compiler for wDNNF circuits. We introduce a new caching scheme, called Cara, that exploits isomorphism. Using that scheme, we show a new compilation method based on copying subcircuits, which may significantly speed up compilations at the expense of increasing circuit sizes. Our experiments demonstrate that nwDNNF circuits are suitable for computing most probable explanations (MPEs) in two-layer Bayesian networks (BNs) with large domains.

Abstract: We investigate the problem of checking the consistency of qualitative preferences expressed in CPtheory. This problem is PSPACE-Complete even when the preferences are locally consistent or the preference variables have binary domain. We present a new sufficient condition for consistency of preferences and show that the condition can be checked in polynomial time in settings of practical relevance (locally consistent or binary domain preference variables). We further show how the resulting sufficient condition can be used to efficiently identify a subset of outcomes that are non-dominated with respect to a set of qualitative preferences.

Abstract: Efficiently computing accurate representations of highdimensional data is essential for data analysis and unsupervised learning. Dendrograms, also known as ultrametrics, are widely used representations that preserve hierarchical relationships within the data. However, popular methods for computing them, such as *linkage* algorithms, suffer from quadratic time and space complexity, making them impractical for large datasets. The "best ultrametric embedding" (a.k.a. "best ultrametric fit") problem, which aims to find the ultrametric that best preserves the distances between points in the original data, is known to require at least quadratic time for an exact solution. Recent work has focused on improving scalability by approximating optimal solutions in subquadratic time, resulting in a (sqrt(2) + epsilon)-approximation (Cohen-Addad, de Joannis de Verclos and Lagarde, 2021). In this paper, we present the first subquadratic algorithm that achieves arbitrarily precise approximations of the optimal ultrametric embedding. Specifically, we provide an algorithm that, for any c >1, outputs a c-approximation of the best ultrametric in time O(n^(1 + 1/c)). In particular, for any fixed epsilon > 0, the algorithm computes a (1+ epsilon)-approximation in time O(n^(2 - epsilon + o(epsilon ^2))). Experimental results show that our algorithm improves upon previous methods in terms of approximation quality while maintaining comparable running times.

Abstract: In Imitation Learning (IL), utilizing suboptimal and heterogeneous demonstrations presents a substantial challenge due to the varied nature of realworld data. However, standard IL algorithms consider these datasets as homogeneous, thereby inheriting the deficiencies of suboptimal demonstrators. Previous approaches to this issue rely on impractical assumptions like high-quality data subsets, confidence rankings, or explicit environmental knowledge. This paper introduces IRLEED, *Inverse Reinforcement Learning by Estimating Expertise of Demonstrators*, a novel framework that overcomes these hurdles without prior knowledge of demonstrator expertise. IRLEED enhances existing Inverse Reinforcement Learning (IRL) algorithms by combining a general model for demonstrator suboptimality to address reward bias and action variance, with a Maximum Entropy IRL framework to efficiently derive the optimal policy from diverse, suboptimal demonstrations. Experiments in both online and offline IL settings, with simulated and human-generated data, demonstrate IRLEED's adaptability and effectiveness, making it a versatile solution for learning from suboptimal demonstrations.

Abstract: Marked temporal point processes (MTPPs) have been shown to be extremely effective in modeling continuous time event sequences (CTESs). In this work, we present adversarial attacks designed specifically for MTPP models. A key criterion for a good adversarial attack is its imperceptibility. For objects such as images or text, this is often achieved by bounding perturbation in some fixed Lp normball. However, similarly minimizing distance norms between two CTESs in the context of MTPPs is challenging due to their sequential nature and varying time-scales and lengths. We address this challenge by first permuting the events and then incorporating the additive noise to the arrival timestamps. However, the worst case optimization of such adversarial attacks is a hard combinatorial problem, requiring exploration across a permutation space that is factorially large in the length of the input sequence. As a result, we propose a novel differentiable scheme - PERMTPP - using which we can perform adversarial attacks by learning to minimize the likelihood, while minimizing the distance between two CTESs. Our experiments on four real-world datasets demonstrate the offensive and defensive capabilities, and lower inference times of PERMTPP.

Abstract: Offline safe reinforcement learning (OSRL) involves learning a decisionmaking policy to maximize rewards from a fixed batch of training data to satisfy pre-defined safety constraints. However, adapting to varying safety constraints during deployment without retraining remains an under-explored challenge. To address this challenge, we introduce constraint-adaptive policy switching (CAPS), a wrapper framework around existing offline RL algorithms. During training, CAPS uses offline data to learn multiple policies with a shared representation that optimize different reward and cost trade-offs. During testing, CAPS switches between those policies by selecting at each state the policy that maximizes future rewards among those that satisfy the current cost constraint. Our experiments on 38 tasks from the DSRL benchmark demonstrate that CAPS consistently outperforms existing methods, establishing a strong wrapper-based baseline for OSRL.

Abstract: FewShot Class-Incremental Learning (FSCIL) studies how to empower the machine learning system to learn novel classes with only a few annotated examples continually. To tackle the FSCIL task, recent state-of-the-art methods propose to employ the meta-learning mechanism, which constructs the pseudo incremental episodes/tasks in the training phase. However, these methods only select part of the base classes to construct the pseudo novel classes in the feature space of the base classes, which cannot mimic the real novel classes of the testing scenario. To deal with this problem, we propose a new Pseudo Informative Episode Construction (PIEC) framework. Specifically, we first perform distribution-level mixing to generate a set of pseudo novel classes in the feature space of the novel class. Then, we propose two diversity criteria to select the informative pseudo novel classes that have large discrepancies with each other and high information gain over the base classes to construct the pseudo incremental session. In this way, we can allow the model to learn rich new concepts beyond the base classes as in the real incremental session during the episodic training procedure, thus improving its generalization ability. Extensive experiments on three popular classification benchmarks (i.e., CUB200, miniImageNet, and CIFAR100) show that the proposed framework can outperform other state-of-the-art methods.

Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of the Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of the Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences

Abstract: Semantic segmentation and 3D reconstruction are two fundamental tasks in remote sensing, typically treated as separate or loosely coupled tasks. Despite attempts to integrate them into a unified network, the constraints between the two heterogeneous tasks are not explicitly modeled, since the pioneering studies either utilize a loosely coupled parallel structure or engage in only implicit interactions, failing to capture the inherent connections. In this work, we explore the connections between the two tasks and propose a new network that imposes semantic constraints on the stereo matching task, both implicitly and explicitly. Implicitly, we transform the traditional parallel structure to a new cascade structure termed SemanticGuided Cascade structure, where the deep features enriched with semantic information are utilized for the computation of initial disparity maps, enhancing semantic guidance. Explicitly, we propose a Semantic Selective Refinement (SSR) module and a Left-Right Semantic Consistency (LRSC) module. The SSR refines the initial disparity map under the guidance of the semantic map. The LRSC ensures semantic consistency between two views via reducing the semantic divergence after transforming the semantic map from one view to the other using the disparity map. Experiments on the US3D and WHU datasets demonstrate that our method achieves state-of-the-art performance for both semantic segmentation and stereo matching.

Abstract: Graph representation learning is fundamental for analyzing graphstructured data. Exploring invariant graph representations remains a challenge for most existing graph representation learning methods. In this paper, we propose a cross-view graph consistency learning (CGCL) method that learns invariant graph representations for link prediction. First, two complementary augmented views are derived from an incomplete graph structure through a coupled graph structure augmentation scheme. This augmentation scheme mitigates the potential information loss that is commonly associated with various data augmentation techniques involving raw graph data, such as edge perturbation, node removal, and attribute masking. Second, we propose a CGCL model that can learn invariant graph representations. A cross-view training scheme is proposed to train the proposed CGCL model. This scheme attempts to maximize the consistency information between one augmented view and the graph structure reconstructed from the other augmented view. Furthermore, we offer a comprehensive theoretical CGCL analysis. This paper empirically and experimentally demonstrates the effectiveness of the proposed CGCL method, achieving competitive results on graph datasets in comparisons with several state-of-the-art algorithms.

Abstract: Despite some promising results in federated learning using gametheoretical methods, most existing studies mainly employ a one-level game in either a cooperative or competitive environment, failing to capture the complex dynamics among participants in practice. To address this issue, we propose DualGFL, a novel federated learning framework with a dual-level game in cooperative-competitive environments. DualGFL includes a lower-level hedonic game where clients form coalitions and an upper-level multi-attribute auction game where coalitions bid for training participation. At the lower-level DualGFL, we introduce a new auction-aware utility function and propose a Pareto-optimal partitioning algorithm to find a Pareto-optimal partition based on clients' preference profiles. At the upper-level DualGFL, we formulate a multi-attribute auction game with resource constraints and derive equilibrium bids to maximize coalitions' winning probabilities and profits. A greedy algorithm is proposed to maximize the utility of the central server. Extensive experiments on real-world datasets demonstrate DualGFL's effectiveness in improving both server utility and client utility.

Abstract: Contrastive clustering performs clustering and data representation in a unified model, where instanceand cluster-level constrastive learning are conducted simultaneously. However, commonly-used data augmentation methods make contrastive mechanism effect but may cause representation learning getting stuck in domain-specific information, which further deteriorates clustering performance and limits generalization ability. To this end, we propose a new framework, named Generalized Contrastive Clustering with domain shifts modeling (GeCC), which can integrate diverse domain knowledge to improve the clustering performance. Specifically, we first design a cluster-guided domain shifts modeling module to synthesize a reference view with diverse domain information. Then, we introduce instance representation and cluster assignment contrastive modules with well-designed attention weights to guide the representation learning and clustering. In this way, our method can maximize the extraction of cluster-related information and avoid over-fitting domain-specific features. Experimental results on four benchmark datasets demonstrate that our proposed method consistently outperforms other state-of-the-art methods.

Abstract: Wearable Human Action Recognition (wHAR) uses motion sensor data to identify human movements, which is essential for mobile and wearable devices. However, traditional wHAR systems are only trained on a limited set of activities. Hence, they fail to generalize to diverse human motions, prompting ZeroShot Learning (ZSL). Existing ZSL methods for wHAR focus solely on augmenting labels, such as representing them as attribute matrices, images, videos, or text. We propose ZeroHAR that enhances ZSL by not just focusing on activity labels, but by augmenting motion data with sensor context features. Our approach incorporates information about the sensor type, the Cartesian axis of the data, and the sensor's body position, providing the model with crucial spatial and biomechanical insights. This helps the model generalize better to new actions. First, we train the model by aligning the latent space of the motion time-series with its corresponding sensor context, while distancing it from unrelated sensor contexts. Finally, we train the model using the target activity descriptions. We tested our method against eight baselines on five benchmark HAR datasets with various sensors, placements, and activities. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 262% in the zero-shot setting.

Abstract: Prompt learning has emerged as a promising method for adapting pretrained visual-language models (VLMs) to a range of downstream tasks. While optimizing the context can be effective for improving performance on specific tasks, it can often lead to poor generalization performance on unseen classes or datasets sampled from different distributions. It may be attributed to the fact that textual prompts tend to overfit downstream data distributions, leading to the forgetting of generalized knowledge derived from hand-crafted prompts. In this paper, we propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting. SPTR is a two-pronged design based on hand-crafted prompts that is an inseparable framework. 1) To avoid forgetting general textual knowledge, we introduce the optimal transport as a textual regularization to finely ensure approximation with hand-crafted features and tuning textual features. 2) In order to continuously unleash the general ability of multiple hand-crafted prompts, we propose a similarity paradigm for natural alignment score and adversarial alignment score to improve model robustness for generalization. Both modules share a common objective in addressing generalization issues, aiming to maximize the generalization capability derived from multiple hand-crafted prompts. Four representative tasks (i.e., non-generalization few-shot learning, base-to-novel generalization, cross-dataset generalization, domain generalization) across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.

Abstract: Multiobjective Bayesian optimization (MOBO) aims to optimize multiple competing objective functions in the expensive-to-evaluate scenario. The Expected Hypervolume Improvement (EHVI) is a commonly used acquisition function for MOBO and shows a good performance. However, the computation of EHVI becomes challenging as the number of objective functions grows. In this paper, we revisit the formulation of EHVI, as well as its multi-point counterpart qEHVI, and derive much simpler analytic expressions for them. The main contributions of this paper include: (1) first formulating EHVI as a particular hypervolume improvement, and thus immediately obtaining a formal proof of its NP-hardness, faster algorithms in both theory and practice, and more results on its derivatives; (2) first obtaining the analytic expressions of qEHVI for any q > 1 and m ≥ 2 where m is the number of objectives; and (3) demonstrating the advantages of our formulation over existing exact and approximation methods for computing EHVI and qEHVI through a large number of numerical experiments.

Abstract: The Diffusion Transformers Models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to highdefinition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.

Abstract: Learning agents that excel at sequential decisionmaking tasks must continuously resolve the problem of exploration and exploitation for optimal learning. However, such interactions with the environment online might be prohibitively expensive and may involve some constraints, such as a limited budget for agent-environment interactions and restricted exploration in certain regions of the state space. Examples include selecting candidates for medical trials and training agents in complex navigation environments. This problem necessitates the study of active reinforcement learning strategies that collect minimal additional experience trajectories by reusing existing offline data previously collected by some unknown behavior policy. In this work, we propose an active reinforcement learning method capable of collecting trajectories that can augment existing offline data. With extensive experimentation, we demonstrate that our proposed method reduces additional online interaction with the environment by up to 75% over competitive baselines across various continuous control environments such as Gym-MuJoCo locomotion environments as well as Maze2d, AntMaze, CARLA and IsaacSimGo1. To the best of our knowledge, this is the first work that addresses the active learning problem in the context of sequential decision-making and reinforcement learning.

Abstract: Generative AI has made remarkable progress in addressing various design challenges. One prominent area where generative AI could bring significant value is in engineering design. In particular, selecting an optimal set of components and their interfaces to create a mechanical system that meets design requirements is one of the most challenging and timeconsuming tasks for engineers. This configuration design task is inherently challenging due to its categorical nature, multiple design requirements a solution must satisfy, and the reliance on physics simulations for evaluating potential solutions. These characteristics entail solving a combinatorial optimization problem with multiple constraints involving black-box functions. To address this challenge, we propose a deep generative model to predict the optimal combination of components and interfaces for a given design problem. To demonstrate our approach, we solve a gear train synthesis problem by first creating a synthetic dataset using a domain-specific language, a parts catalogue, and a physics simulator. We then train a Transformer-based model using this dataset, named GearFormer, which can not only generate quality solutions on its own, but also augment traditional search methods such as an evolutionary algorithm and Monte Carlo tree search. We show that GearFormer outperforms such search methods on their own in terms of satisfying the specified design requirements with orders of magnitude faster generation time. Additionally, we showcase the benefit of hybrid methods that leverage both GearFormer and search methods, which further improve the quality of the solutions.

Abstract: We introduce OmniMark, a novel and efficient fingerprinting method for Latent Diffusion Models (LDM). OmniMark can encode userspecific fingerprints across diverse dimensions of the weights of the LDM, including kernels, filters, channels, and spatial domains. The LDM is fine-tuned to encode the invisible fingerprint into generated images, which can be decoded by a decoder. By altering fingerprints and re-encoding the weights, OmniMark supports efficient and scalable ad-hoc generation (<100 ms) of numerous models with unique fingerprints that enable user accountability and model attribution. Extensive experiments demonstrate that OmniMark can be applied to various image generation and editing tasks and achieve highly accurate fingerprint detection without compromising image quality. Furthermore, OmniMark demonstrates good robustness against both white-box model attacks and image attacks, including fine-tuning and JPEG compression.

Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Economics, Wuhan Textile University, School of Computer Science and Technology, Hainan University

Abstract: Backdoor attacks in federated learning (FL) face challenges such as lower attack success rates and compromised main task accuracy (MA) compared to local training. Existing methods like distributed backdoor attack (DBA) mitigate these issues by modifying malicious clients’ updates and partitioning global triggers to enhance backdoor persistence and stealth. The recent full combination backdoor attack (FCBA) further improves backdoor efficiency with a full combination strategy. However, these methods are mainly applicable in smallscale FL. In large-scale FL, small trigger patterns weaken impact, and scaling them requires controlling exponentially more clients, which poses significant challenges, while simply reverting to DBA may decrease backdoor performance. To overcome these challenges, we propose the self-adaptive distributed backdoor attack (SADBA), which achieves similar performance to FCBA with a lower percentage of malicious clients (PMC). It also adapts more flexibly through an optimized model poisoning strategy and a self-adaptive data poisoning strategy. Experiments demonstrate SADBA outperforms state-of-the-art methods, achieving higher or comparable backdoor performance and MA across various datasets with limited PMC.

Abstract: Multiview clustering (MVC) methods have garnered considerable attention within centralized data frameworks. However, real-world multi-view data are often collected and stored by different organizations, complicating the practical deployment of MVC and motivating the emergence of federated multi-view clustering (FMVC). Existing FMVC approaches typically necessitate post-processing to derive clustering labels and confront challenges in effectively exploring the complementary and consistent information across multi-view data residing in different entities. To address these limitations, we propose a novel framework termed Scalable Federated One-Step Multi-View Clustering with Tensorized Regularization (SFOMVC-TR). This framework facilitates one-step clustering at each client and employs tensor learning to capture consistent and complementary information through a centralized server. Additionally, it adopts anchor graphs to enhance clustering efficiency and scalability in high-dimensional data. By incorporating a Lp,q sparse regularization on the projection matrix, SFOMVC-TR enables the direct projection of anchors into clustering assignments to mitigate redundancy. A federated optimization framework is developed to support collaborative and privacy-preserving training under the coordination of the server. Extensive experiments on multiple datasets validate the privacy and effectiveness of our method.

Abstract: Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kthnearest neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.

Abstract: Dataset condensation has significantly improved model training efficiency, but its application on devices with different computing power brings new requirements for different data sizes. For sparse graph data with nonEuclidean structures, repeated condensation of each scale may lead to significant computational costs. Thus, condensing multiple scale graphs simultaneously is the core of achieving efficient training in different on-device scenarios. Existing efficient works for multi-scale graph dataset condensation mainly perform efficient approximate computation in scale order (large-to-small or small-to-large scales). However, these two commonly used paradigms for multi-scale graph dataset condensation have serious ''scaling down degradation'' and ''scaling up collapse" problems of a graph. The main bottleneck of the above paradigms is whether the effective information of the original graph is fully preserved when consenting to the primary sub-scale (the first of multiple scales), which determines the condensation effect and consistency of all scales. In this paper, we proposed a novel GNN-centric Bi-directional Multi-Scale Graph Dataset Condensation (BiMSGC) framework, to explore unifying paradigms by operating on both large-to-small and small-to-large for multi-scale graph condensation. Based on the mutual information theory, we estimate an optimal ''meso-scale'' to obtain the minimum necessary dense graph preserving the maximum utility information of the original graph, and then we achieve stable and consistent ''bi-directional'' condensation learning by optimizing graph eigenbasis matching with information bottleneck on other scales. Encouraging empirical results on several datasets demonstrates the significant superiority of the proposed framework in graph condensation at different scales.

Abstract: In numerous settings, agents lack sufficient data to learn a model directly. Collaborating with other agents may help, but introduces a biasvariance trade-off when local data distributions differ. A key challenge is for each agent to identify clients with similar distributions while learning the model, a problem that remains largely unresolved. This study focuses on a particular instance of the overarching problem, where each agent collects samples from a real-valued distribution over time to estimate its mean. Existing algorithms face impractical per-agent space and time complexities (linear in the number of agents |A|). To address scalability challenges, we propose a framework where agents self-organize into a graph, allowing each agent to communicate with only a selected number of peers r. We propose two collaborative mean estimation algorithms: one employs a consensus-based approach, while the other uses a message-passing scheme, with complexity O(r) and O(r log |A|), respectively. We establish conditions for both algorithms to yield asymptotically optimal estimates and we provide a theoretical characterization of their performance.

Abstract: Advanced Deep Neural Networks (DNNs) perform well for highquality images, but their performance dramatically decreases for degraded images. Data augmentation is commonly used to alleviate this problem, but using too much perturbed data might seriously decrease the performance on pristine images. To tackle this challenge, we take our cue from the assumption of spatial coincidence in human visual perception, i.e. multiscale and varying receptive fields are required for understanding pristine and degraded images. Correspondingly, we propose a novel plug-and-play network architecture, dubbed Quality-Adaptive Receptive Fields (QuARF), to automatically select the optimal receptive fields based on the quality of the input image. To this end, we first design a multi-kernel convolutional block, which comprises multiscale continuous receptive fields. Afterward, we design a quality-adaptive routing network to predict the significance of each kernel, based on the quality features extracted from the input image. In this way, QuARF automatically selects the optimal inference route for each image. To further boost efficiency and effectiveness, the input feature map is split into multiple groups, with each group independently learning its quality-adaptive routing parameters. We apply QuARF to a variety of DNNs and conduct experiments in both discriminative and generation tasks, including semantic segmentation, image translation, and restoration. Thorough experimental results show that QuARF significantly and robustly improves the performance for degraded images, and outperforms data augmentation in most cases.

Abstract: Time series forecasting (TSF) is essential in various domains, and recent advancements in diffusionbased TSF models have shown considerable promise. However, these models typically adopt traditional diffusion patterns, treating TSF as a noise-based conditional generation task. This approach neglects the inherent continuous sequential nature of time series, leading to a fundamental misalignment between diffusion mechanisms and the TSF objective, thereby severely impairing performance. To bridge this misalignment, and inspired by the classic Auto-Regressive Moving Average (ARMA) theory, which views time series as continuous sequential progressions evolving from previous data points, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model to first achieve the continuous sequential diffusion-based TSF. Unlike previous methods that start from white Gaussian noise, our model employs chain-based diffusion with priors, accurately modeling the evolution of time series and leveraging intermediate state information to improve forecasting accuracy and stability. Specifically, our approach reinterprets the diffusion process by considering future series as the initial state and historical series as the final state, with intermediate series generated using a sliding-based technique during the forward process. This design aligns the diffusion model's sampling procedure with the forecasting objective, resulting in an unconditional, continuous sequential diffusion TSF model. Extensive experiments conducted on seven widely used datasets demonstrate that our model achieves state-of-the-art performance, significantly outperforming existing diffusion-based TSF models.

Abstract: Uncertainty estimation is essential for practical applications such as decisionmaking, risk assessment, and human-AI collaboration. However, Uncertainty estimation in open-ended question-answering (QA) tasks presents unique challenges. The output space for open-ended QA is vast and discrete, and the autoregressive nature of LLMs, combined with the rapid increase in model parameters, makes inference sampling significantly costly. An ideal uncertainty estimation for LLMs should meet two criteria: 1) incur no additional inference cost and 2) capture the semantic dependencies of token-level uncertainty within sequences. We propose a promising solution that converts redundancy into randomness in the extensive parameters of LLMs to quantify knowledge uncertainty. We can obtain token-level Monte Carlo samples without multiple inferences by introducing randomness during a single forward pass. We theoretically analyze the FLUE sampling method and employ a post-processing method to learn the state transitions from token uncertainty to sequence uncertainty. In open-ended question-answering tasks, we demonstrate that FLUE can achieve competitive performance in estimating the uncertainty of generated sentences without adding extra inference overhead.

Abstract: Learning dynamics governing physical and spatiotemporal processes is a challenging problem, especially in scenarios where states are partially measured. In this work, we tackle the problem of learning dynamics governing these systems when parts of the system's states are not measured, specifically when the dynamics generating the nonmeasured states are unknown. Inspired by state estimation theory and Physics Informed Neural ODEs, we present a sequential optimization framework in which dynamics governing unmeasured processes can be learned. We demonstrate the performance of the proposed approach leveraging both numerical simulations and a real dataset extracted from an electro-mechanical positioning system. We show how the underlying equations fit into our formalism and demonstrate the improved performance of the proposed method when compared with standard baselines.

Abstract: Conditional demographic parity (CDP) is a measure of the demographic parity of a predictive model or decision process when conditioning on an additional feature or set of features. Many algorithmic fairness techniques exist to target demographic parity, but CDP is much harder to achieve, particularly when the conditioning variable has many levels and/or when the model outputs are continuous. The problem of auditing and enforcing CDP is understudied in the literature. In light of this, we propose novel measures of conditional demographic disparity (CDD) which rely on statistical distances borrowed from the optimal transport literature. We further design and evaluate regularizationbased approaches based on these CDD measures. Our methods, FairBiT and FairLeap, allow us to target conditional demographic parity even when the conditioning variable has many levels. When model outputs are continuous, our methods target full equality of the conditional distributions, unlike other methods that only consider first moments or related proxy quantities. We validate our approaches on real-world datasets.

Abstract: Feature transformation aims to reconstruct the feature space of raw features to enhance the performance of downstream models. However, the exponential growth in the combinations of features and operations poses a challenge, making it difficult for existing methods to efficiently explore a wide space. Additionally, their optimization is solely driven by the accuracy of downstream models in specific domains, neglecting the acquisition of general feature knowledge. To fill this research gap, we propose an evolutionary LLM framework for automated feature transformation. This framework consists of two parts: 1) constructing a multipopulation database through an RL data collector while utilizing evolutionary algorithm strategies for database maintenance, and 2) utilizing the ability of Large Language Model (LLM) in sequence understanding, we employ few-shot prompts to guide LLM in generating superior samples based on feature transformation sequence distinction. Leveraging the multi-population database initially provides a wide search scope to discover excellent populations. Through culling and evolution, high-quality populations are given greater opportunities, thereby furthering the pursuit of optimal individuals. By integrating LLMs with evolutionary algorithms, we achieve efficient exploration within a vast space, while harnessing feature knowledge to propel optimization, thus realizing a more adaptable search paradigm. Finally, we empirically demonstrate the effectiveness and generality of our proposed method.

Abstract: Multiview clustering (MVC) for remote sensing data is a critical and challenging task in Earth observation. Although recent advances in graph neural network (GNN)-based MVC have shown remarkable success, the most prevalent approaches have two major limitations: 1) heavily relying on a predefined yet fixed graph, which limits the performance of clustering because the large number of indistinguishable background samples contained in remote sensing data would introduce noise information and increase structure heterogeneity; 2) ignoring the effect of confusing samples on cluster structure compactness, which leads to fluffy cluster structure and decrease feature discriminability. To address these issues, we propose a Structure-Adaptive Multi-View Graph Clustering method named SAMVGC on remote sensing data which boosts the structure homogeneity and cluster compactness by adaptively learning the graph and cluster structures, respectively. Concretely, we use the geometric structure within the feature embedding space to refine adjacency matrices. The adjacency matrices are dynamically fused with the previous ones to improve the homogeneity and stability of structure information. Additionally, the samples are separated into two categories, including the central (intra-cluster center samples) and the confusing (inter-cluster boundary samples). On the basis, we deploy the contrastive learning paradigm on the central samples within views and the consistent learning paradigm on the confusing samples between views, improving the cluster compactness and consistency. Finally, we conduct extensive experiments on four benchmarks and achieve promising results, well demonstrating the effectiveness and superiority of the proposed method.

Abstract: Traditional Federated Learning (FL) necessitates numerous rounds of communication between the server and clients, posing significant challenges including high communication costs, connection drop risks and susceptibility to privacy attacks. Oneshot FL has become a compelling learning paradigm to overcome above drawbacks by enabling the training of a global server model via a single communication round. However, existing one-shot FL methods suffer from expensive computation cost on the server or clients and cannot deal with non-IID (Independent and Identically Distributed) data stably and effectively. To address these challenges, this paper proposes FedCGS, a novel Federated learning algorithm that Capture Global feature Statistics leveraging pre-trained models. With global feature statistics, we achieve training-free and heterogeneity-resistant one-shot FL. Furthermore, we expand its application to personalization scenario, where clients only need execute one extra communication round with server to download global statistics. Extensive experimental results demonstrate the effectiveness of our methods across diverse data-heterogeneity settings.

Abstract: Multimodal sentiment analysis (MSA) is an emerging research topic that aims to understand and recognize human sentiment or emotions through multiple modalities. However, in realworld dynamic scenarios, the distribution of target data is always changing and different from the source data used to train the model, which leads to performance degradation. Common adaptation methods usually need source data, which could pose privacy issues or storage overheads. Therefore, test-time adaptation (TTA) methods are introduced to improve the performance of the model at inference time. Existing TTA methods are always based on probabilistic models and unimodal learning, and thus can not be applied to MSA which is often considered as a multimodal regression task. In this paper, we propose two strategies: Contrastive Adaptation and Stable Pseudo-label generation (CASP) for test-time adaptation for multimodal sentiment analysis. The two strategies deal with the distribution shifts for MSA by enforcing consistency and minimizing empirical risk, respectively. Extensive experiments show that CASP brings significant and consistent improvements to the performance of the model across various distribution shift settings and with different backbones, demonstrating its effectiveness and versatility.

Abstract: Virtual screening (VS) is a critical step in computeraided drug discovery, aiming to identify molecules that bind to a specific target protein. Traditional VS methods, such as docking, are often too time-consuming to efficiently screen large-scale molecular databases. Recent advances in deep learning have demonstrated that learning vector representations for both proteins and molecules using contrastive learning can outperform traditional docking methods. However, considering that the target databases often contain billions of molecules, real-valued vector representations adopted by existing methods can still incur large memory and time cost in VS. To address this problem, we propose DrugHash, a hashing-based contrastive learning method for VS. DrugHash formulates VS as a retrieval task that leverages binary hash codes for efficient retrieval. In particular, DrugHash designs a simple yet effective hashing strategy to enable end-to-end learning of binary hash codes for both proteins and molecules, which can dramatically reduce the memory and time cost with higher accuracy compared with existing methods. Experimental results show that DrugHash can outperform existing methods to achieve state-of-the-art accuracy, with at least a 32 times reduction in memory cost and a 4.6 times improvement in speed.

Abstract: In recent years, multiview multi-label learning (MVML) has gained popularity due to its close resemblance to real-world scenarios. However, the challenge of selecting informative features to ensure both performance and efficiency remains a significant question in MVML. Existing methods often extract information separately from the consistency part and the complementary part, which may result in noise due to unclear segmentation. In this paper, we propose a unified model constructed from the perspective of global-view reconstruction. Additionally, while feature selection methods can discern the importance of features, they typically overlook the uncertainty of samples, which is prevalent in realistic scenarios. To address this, we incorporate the perception of sample uncertainty during the reconstruction process to enhance trustworthiness. Thus, the global-view is reconstructed through the graph structure between samples, sample confidence, and the view relationship. The accurate mapping is established between the reconstructed view and the label matrix. Experimental results demonstrate the superior performance of our method on multi-view datasets.

State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences. University of Chinese Academy of Sciences., State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences. University of Chinese Academy of Sciences., Shenyang University of Chemical Technology., School of Automation Science and Engineering, South China University of Technology., State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences.

Abstract: Mimicking the real interaction trajectory in the inference of the world model has been shown to improve the sample efficiency of modelbased reinforcement learning (MBRL) algorithms. Many methods directly use known state sequences for reasoning. However, this approach fails to enhance the quality of reasoning by capturing the subtle variation between states. Much like how humans infer trends in event development from this variation, in this work, we introduce Global-Local variation Awareness Mamba-based world model (GLAM) that improves reasoning quality by perceiving and predicting variation between states. GLAM comprises two Mamba-based parallel reasoning modules, GMamba and LMamba, which focus on perceiving variation from global and local perspectives, respectively, during the reasoning process. GMamba focuses on identifying patterns of variation between states in the input sequence and leverages these patterns to enhance the prediction of future state variation. LMamba emphasizes reasoning about unknown information, such as rewards, termination signals, and visual representations, by perceiving variation in adjacent states. By integrating the strengths of the two modules, GLAM accounts for higher-value variation in environmental changes, providing the agent with more efficient imagination-based training. We demonstrate that our method outperforms existing methods in normalized human scores on the Atari 100k benchmark.

Abstract: Universal domain adaptation (UniDA) transfers knowledge from a labelled source domain to an unlabelled target domain under domainshift and category-shift for annotation. In reality, due to privacy protection or other limits, not only source data but also pre-trained models on it may be unavailable when training on target data. In this paper, we go a step further to explore the black-box universal domain adaptation (B^2-UniDA) problem. It requires tackling the labelling task under shifts by only accessing the interface of pre-trained source models. To this end, we introduce GSS which proposes a novel sample selection criterion based on gradient descent and Bayes' Theorem to identify samples of potential unknown classes. This criterion doesn't require manually-set thresholds depending on data used and is suitable for various datasets. GSS builds an open-set classifier and enables it to estimate probabilities of belonging to each class including the unknown category and adjust estimates adaptively. To overcome class-imbalance, especially imbalance between the unknown and known classes, we propose a balancing mechanism by measuring training status and estimating DA type. In addition to distilling knowledge from source model outputs, we focus on mining the categorical structure of target domain by self-training. Experiments on benchmarks show the state-of-the-art performance of GSS compared to typical methods, including source models or source data dependent methods.

College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, Shandong Artificial Intelligence Institute School of Mathematics and Statistics, Qilu University of Technology, College of Computer Science and Technology, Zhejiang University

Abstract: In the burgeoning domain of machine learning, the reliance on thirdparty services for model training and the adoption of pre-trained models have surged. However, this reliance introduces vulnerabilities to model hijacking attacks, where adversaries manipulate models to perform unintended tasks, leading to significant security and ethical concerns, like turning an ordinary image classifier into a tool for detecting faces in pornographic content, all without the model owner’s knowledge. This paper introduces Category-Agnostic Model Hijacking (CAMH), a novel model hijacking attack method capable of addressing the challenges of class number mismatch, data distribution divergence, and performance balance between the original and hijacking tasks. CAMH incorporates synchronized training layers, random noise optimization, and a dual-loop optimization approach to ensure minimal impact on the original task’s performance while effectively executing the hijacking task. We evaluate CAMH across multiple benchmark datasets and network architectures, demonstrating its potent attack effectiveness while ensuring minimal degradation in the performance of the original task.

Abstract: Influence maximization is a key topic in data mining, with broad applications in social network analysis and viral marketing. In recent years, researchers have increasingly turned to machine learning techniques to address this problem. By learning the underlying diffusion processes from data, these methods improve the generalizability of solutions while optimizing objectives to identify the optimal seed set for maximizing influence. Nonetheless, two fundamental challenges remain unresolved: (1) While Graph Neural Networks (GNNs) are increasingly employed to learn diffusion models, their traditional architectures often fail to capture the complex dynamics of influence diffusion, (2) Designing optimization objectives is inherently difficult due to the combinatorial explosion associated with solving this problem. To address these challenges, we propose a novel framework, DeepSN. Our framework employs sheaf neural diffusion to learn diverse influence patterns in a datadriven, end-to-end manner, providing enhanced separability in capturing diffusion characteristics. We also propose an optimization technique that accounts for overlapping influence between vertices, significantly reducing the search space and facilitating the identification of the optimal seed set efficiently. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate the effectiveness of our framework.

Abstract: Time series forecasting requires reliable uncertainty estimates. Gaussian process regression provides a powerful framework for modelling this in a probabilistic fashion. However, its application to large time series is challenging, due to its cubic time complexity and quadratic memory requirement. In this work, we present KernelMatmul, a novel method that accelerates Gaussian process inference and thus facilitates scaling of Gaussian process regression to large, irregularly sampled and multioutput time series. Leveraging conjugate gradients in combination with sparsity approximation, KernelMatmul achieves time and memory complexity linear in the number of samples. We thoroughly benchmark our new method against multiple baselines to demonstrate its benefits and limitations, both in efficiency and accuracy.

Abstract: Multiview clustering (MVC), especially contrastive MVC, has demonstrated promising potential in many fields and practical scenarios. However, existing contrastive MVC methods still ignore the reliability of clustering results and the impact of false negative pairs, which limits the application of methods in critical security areas. To solve the above challenges, we propose a Self-supervised Trusted Contrastive Multi-view Clustering with Uncertainty Refined (STCMC-UR) method, which integrates clustering results and uncertainty learning to guide the self-supervised contrastive learning (CL). First, the belief of a specific view is generated in the evidence generation module. Afterwards, the belief mass and uncertainty of each view are learned using the Dirichlet distribution and we fuse multiple views with the Dempster-Shafer theory to generate the final clustering result and the uncertainty of the view. Then, the view weight is further quantified to adjust the belief of each view. Different from existing methods, with the clustering result and uncertainty generated by the fusion, we design a feature-level uncertainty-refined self-supervised CL module, where the pseudo-label is selectively employed in each iteration to conduct more accurate CL. As a result, the modules are mutually beneficial, which is conducive to more effective feature learning and clustering structure discovery, and more accurate learning results are obtained. Extensive experiments on five datasets show that the proposed method has significant improvements in effectiveness compared with the latest methods.

Abstract: Deep multimodal clustering can extract useful information among modals, thus benefiting the final clustering and many related fields. However, existing multi-modal clustering methods have two major limitations. First, they often ignore different levels of guiding information from both the feature representations and cluster assignments, which thus are difficult in learning discriminative representations. Second, most methods fail to effectively eliminate redundant information between multi-modal data, negatively affecting clustering results. In this paper, we propose a novel multi-aspect self-guided deep information bottleneck (MSDIB) method for multi-modal clustering, which can effectively employ different aspects of guiding information for learning cluster-friendly information among modals. MSDIB mainly contains two parts: information compression and information preservation. In information compression, we extract from the private information of each modality to obtain the compact representation and meanwhile conduct mutual compression between them. In information preservation, the aim is to preserve the shared information among modals and the self-supervised information from the clustering results in each iteration. In the above process, there are mainly three aspects of self-guiding information, the modality-private information, the modality-shared information and the self-supervised pseudo label information. By minimizing the mutual information based objective function with a variational optimization method, we can fully extract useful discriminative information while eliminating the irrelevant parts. Extensive experimental results demonstrate that our method outperforms state-of-the-art multi-modal clustering methods, showcasing its superior performance and broad application prospects.

Abstract: Transformerbased and MLP-based methods have emerged as leading approaches in time series forecasting (TSF). However, real-world time series often show different patterns at different scales, and future changes are shaped by the interplay of these overlapping scales, requiring high-capacity models. While Transformer-based methods excel in capturing long-range dependencies, they suffer from high computational complexities and tend to overfit. Conversely, MLP-based methods offer computational efficiency and adeptness in modeling temporal dynamics, but they struggle with capturing temporal patterns with complex scales effectively. Based on the observation of multi-scale entanglement effect in time series, we propose a novel MLP-based Adaptive Multi-Scale Decomposition (AMD) framework for TSF. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns. Complemented by the Dual Dependency Interaction (DDI) block and the Adaptive Multi-predictor Synthesis (AMS) block, our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration. Comprehensive experiments demonstrate our AMD framework not only overcomes the limitations of existing methods but also consistently achieves state-of-the-art performance across various datasets.

Abstract: Forwardonly algorithms offer a promising memory-efficient alternative to Backpropagation (BP) for on-device learning. However, state-of-the-art forward-only algorithms, e.g., Forward-Forward (FF), still require a substantial amount of memory during the training process, often exceeding the limits of mobile edge and Internet of Things (IoT) devices. At the same time, existing memory-optimization techniques, e.g., binarizing parameters and activations, are mainly designed for BP, hence significantly degrading the classification performance when applied to state-of-the-art forward-only algorithms. In this paper, we propose a memory-efficient forward-only algorithm called TinyFoA, to reduce dynamic memory overhead in the training process. Our TinyFoA optimizes the memory efficiency not only by layer-wise training but also by partially updating each layer, as well as by binarizing the weights and the activations. We extensively evaluate our proposed TinyFoA against BP and other forward-only algorithms and demonstrate its effectiveness and superiority compared to state-of-the-art forward-only algorithms in terms of classification performance and training memory overhead, reducing the memory overheads by an order of magnitude.

Abstract: Federated Learning (FL) is a distributed approach that enables collaborative model training while safeguarding client data privacy. Nevertheless, FL encounters difficulties due to statistical heterogeneity from the varied data distributions across numerous clients, which can affect overall efficiency and performance. Existing stateof-the-art FL methods often concentrate on optimizing interactions between clients, neglecting the potential insights from individual clients during training. Additionally, these approaches generally assume that every period of training has an equal impact on the final model's performance. To address these issues, this paper introduces a novel method, PA3Fed, which conducts period-aware adaptive aggregation for improved federated learning. The key idea is to identify the most critical periods, i.e., those with the highest information content and entropy, where we leverages each client's own performance variations during training for adaptive aggregation. Furthermore, because it operates independently of inter-client optimization approaches, it can be easily incorporated into other baselines for improved performance. Experimental results show that our method improves accuracy by up to 15% and significantly enhances stability.

Abstract: Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sampleaware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that the AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.

AIRI, Moscow, Russia Skolkovo Institute of Science and Technology, Moscow, Russia, AIRI, Moscow, Russia Skolkovo Institute of Science and Technology, Moscow, Russia, AIRI, Moscow, Russia ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia, AIRI, Moscow, Russia Skolkovo Institute of Science and Technology, Moscow, Russia Moscow Technical University of Communications and Informatics, Moscow, Russia, AIRI, Moscow, Russia Skolkovo Institute of Science and Technology, Moscow, Russia

Abstract: Speaker recognition technology is applied to various tasks, from personal virtual assistants to secure access systems. However, the robustness of these systems against adversarial attacks, particularly to additive perturbations, remains a significant challenge. In this paper, we pioneer applying robustness certification techniques to speaker recognition, initially developed for the image domain. Our work covers this gap by transferring and improving randomized smoothing certification techniques against normbounded additive perturbations for classification and few-shot learning tasks to speaker recognition. We demonstrate the effectiveness of these methods on VoxCeleb 1 and 2 datasets for several models. We expect this work to improve the robustness of voice biometrics and accelerate the research of certification methods in the audio domain.

Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pretrained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT’s robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems.

Abstract: In the Noisy IntermediateScale Quantum (NISQ) era, using variational quantum algorithms (VQAs) to solve optimization problems has become a key application. However, these algorithms face significant challenges, such as choosing an effective initial set of parameters and the limited quantum processing time that restricts the number of optimization iterations. In this study, we introduce a new framework for optimizing parameterized quantum circuits (PQCs) that employs a classical optimizer, inspired by Model-Agnostic Meta-Learning (MAML) technique. This approach aim to achieve better parameter initialization that ensures fast convergence. Our framework features a classical neural network, called Learner, which interacts with a PQC using the output of Learner as an initial parameter. During the pre-training phase, Learner is trained with the meta objective function. In the adaptation phase, the framework requires only a few PQC updates to converge to a more accurate value, while the learner remains unchanged. This method is highly adaptable and is effectively extended to various Hamiltonian optimization problems. We validate our approach through experiments, including distribution function mapping and optimization of the Heisenberg XYZ Hamiltonian. The result implies that the Learner successfully estimates initial parameters that generalize across the problem space, enabling fast adaptation.

Abstract: Outof-distribution (OOD) detection is indispensable for deploying reliable machine learning systems in real-world scenarios. Recent works, using auxiliary outliers in training, have shown good potential. However, they seldom concern the intrinsic correlations between in-distribution (ID) and OOD data. In this work, we discover an obvious correlation that OOD data usually possesses significant ID attributes. These attributes should be factored into the training process, rather than blindly suppressed as in previous approaches. Based on this insight, we propose a structured multi-view-based out-of-distribution detection learning (MVOL) framework, which facilitates rational handling of the intrinsic in-distribution attributes in outliers. We provide theoretical insights on the effectiveness of MVOL for OOD detection. Extensive experiments demonstrate the superiority of our framework to others. MVOL effectively utilizes both auxiliary OOD datasets and even wild datasets with noisy ID data.

Abstract: Safe reinforcement learning (RL) is a popular and versatile paradigm to learn rewardmaximizing policies with safety guarantees. Previous works tend to express the safety constraints in an expectation form due to the ease of implementation, but this turns out to be ineffective in maintaining safety constraints with high probability. To this end, we move to the quantile-constrained RL that enables a higher level of safety without any expectation-form approximations. We directly estimate the quantile gradients through sampling and provide the theoretical proofs of convergence. Then a tilted update strategy for quantile gradients is implemented to compensate the asymmetric distributional density, with a direct benefit of return performance. Experiments demonstrate that the proposed model fully meets safety requirements (quantile constraints) while outperforming the state-of-the-art benchmarks with higher return.

School of Automation,Guangdong Provincial Key Laboratory of Intelligent Systems and Optimization Integration,Guangdong University of Technology, School of Automation,Guangdong Provincial Key Laboratory of Intelligent Systems and Optimization Integration,Guangdong University of Technology, School of Automation,Guangdong Provincial Key Laboratory of Intelligent Systems and Optimization Integration,Guangdong University of Technology Key Laboratory of iDetection and Manufacturing-IoT,Ministry of Education,Guangdong University of Technology

Abstract: Federated learning is essential for enabling collaborative model training across decentralized data sources while preserving data privacy and security. This approach mitigates the risks associated with centralized data collection and addresses concerns related to data ownership and compliance. Despite significant advancements in federated learning algorithms that address communication bottlenecks and enhance privacy protection, existing works overlook the impact of differences in data feature dimensions, resulting in global models that disproportionately depend on participants with large feature dimensions. Additionally, current singleview federated learning methods fail to account for the unique characteristics of multi-view data, leading to suboptimal performance in processing such data. To address these issues, we propose a Self-expressive Hypergraph Based Federated Multi-view Learning method (FedMSGL). The proposed method leverages self-expressive character in the local training to learn uniform dimension subspace with latent sample relation. At the central side, an adaptive fusion technique is employed to generate the global model, while constructing a hypergraph from the learned global and view-specific subspace to capture intricate interconnections across views. Experiments on multi-view datasets with different feature dimensions validated the effectiveness of the proposed method.

Abstract: Noisy labels are both inevitable and problematic in machine learning methods, as they negatively impact models' generalization ability by causing overfitting. In the context of learning with noise, the transition matrix plays a crucial role in the design of statistically consistent algorithms. However, the transition matrix is often considered unidentifiable. One strand of methods typically addresses this problem by assuming that the transition matrix is instanceindependent; that is, the probability of mislabeling a particular instance is not influenced by its characteristics or attributes. This assumption is clearly invalid in complex real-world scenarios. To better understand the transition relationship and relax this assumption, we propose to study the data generation process of noisy labels from a causal perspective. We discover that an unobservable latent variable can affect either the instance itself, the label annotation procedure, or both, which complicates the identification of the transition matrix. To address various scenarios, we have unified these observations within a new causal graph. In this graph, the input instance is divided into a noise-resistant component and a noise-sensitive component based on whether they are affected by the latent variable. These two components contribute to identifying the “causal transition matrix”, which approximates the true transition matrix with theoretical guarantee. In line with this, we have designed a novel training framework that explicitly models this causal relationship and, as a result, achieves a more accurate model for inferring the clean label.

Abstract: Due to the challenges in acquiring paired Text3D data and the inherent irregularity of 3D data structures, combined representation learning of 3D point clouds and text remains unexplored. In this paper, we propose a novel Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D retrieval. Specifically, the extracted text and point cloud features are refined by their respective Adaptive Feature Refiner (AFR). Furthermore, we introduce the innovative Riemann Local Similarity (RLS) module and the Global Pooling Similarity (GPS) module. However, as 3D point cloud data and text data often possess complex geometric structures in high-dimensional space, the proposed RLS employs a novel Riemann Attention Mechanism to reflect the intrinsic geometric relationships of the data. Without explicitly defining the manifold, RMARN learns the manifold parameters to better represent the distances between text-point cloud samples. To address the challenges of lacking paired text-3D data, we have created the large-scale Text-3D Retrieval dataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud data. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained Chinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs, respectively. Experiments on our custom datasets demonstrate the superior performance of the proposed method.

Abstract: Despite federated learning (FL)'s potential in collaborative learning, its performance has deteriorated due to the data heterogeneity of distributed users. Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity. However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns. To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBSCFL. The proposed EBS-CFL supports effectively training CFL while maintaining users' cluster identity confidentially. Moreover, it detects potential poisonous attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach. The server also authenticates correct gradient encoding by clients. EBS-CFL has high efficiency with client-side overhead O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities, and l is the gradient size. When m = 1, EBS-CFL's computational efficiency of client is at least O(log(n)) times better than comparison schemes, where n is the number of clients. In addition, we validate the scheme through extensive experiments. Finally, we theoretically prove the scheme's security.

Abstract: Federated Collaborative Filtering (FedCF) is an emerging field focused on developing a new recommendation framework with preserving privacy in a federated setting. Existing FedCF methods typically combine distributed Collaborative Filtering (CF) algorithms with privacypreserving mechanisms, and then preserve personalized information into a user embedding vector. However, the user embedding is usually insufficient to preserve the rich information of the fine-grained personalization across heterogeneous clients. This paper proposes a novel personalized FedCF method by preserving users' personalized information into a latent variable and a neural model simultaneously. Specifically, we decompose the modeling of user knowledge into two encoders, each designed to capture shared knowledge and personalized knowledge separately. A personalized gating network is then applied to balance personalization and generalization between the global and local encoders. Moreover, to effectively train the proposed framework, we model the CF problem as a specialized Variational AutoEncoder (VAE) task by integrating user interaction vector reconstruction with missing value prediction. The decoder is trained to reconstruct the implicit feedback from items the user has interacted with, while also predicting items the user might be interested in but has not yet interacted with. Experimental results on benchmark datasets demonstrate that the proposed method outperforms other baseline methods, showcasing superior performance.

Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, University of Trento, Italy, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Lanzhou Jiaotong University, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE Pengcheng Laboratory, Shenzhen, China, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE

Abstract: Although diffusion models have achieved remarkable success in the field of image generation, their latent space remains underexplored. Current methods for identifying semantics within latent space often rely on external supervision, such as textual information and segmentation masks. In this paper, we propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. By projecting the Jacobian of the targeted semantic region into a low-dimensional subspace which is orthogonal to the non-masked regions, our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations. We conducted extensive experiments across multiple datasets and various architectures of diffusion models, achieving state-of-the-art performance. In particular, for some specific face attributes, the performance of our proposed method even surpasses that of supervised approaches, demonstrating its superior ability in editing local image properties.

Abstract: Graph contrastive learning (GCL) has become a hot topic in the field of graph representaion learning. In contrast to traditional supervised learning relying on a large number of labels, GCL exploits augmentation techniques to generate multiple views and positive/negative pairs, both of which greatly influence the performance. Unfortunately, commonly used random augmentations may disturb the underlying semantics of graphs. Moreover, traditional GNNs, a type of widely employed encoders in GCL, are inevitably confronted with oversmoothing and over-squashing problems. To address these issues, we propose GNN-Transformer Cooperative Architecture for Trustworthy Graph Contrastive Learning (GTCA), which inherits the advantages of both GNN and Transformer, incorporating graph topology to obtain comprehensive graph representations. Theoretical analysis verifies the trustworthiness of the proposed method. Extensive experiments on benchmark datasets demonstrate state-of-the-art empirical performance.

ZJU-Angelalign R&D Center for Intelligence Healthcare, ZJU-UIUC Institute, Zhejiang University, China Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Zhejiang University, China, ZJU-Angelalign R&D Center for Intelligence Healthcare, ZJU-UIUC Institute, Zhejiang University, China Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Zhejiang University, China, Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore, ZJU-Angelalign R&D Center for Intelligence Healthcare, ZJU-UIUC Institute, Zhejiang University, China, Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore, ZJU-Angelalign R&D Center for Intelligence Healthcare, ZJU-UIUC Institute, Zhejiang University, China Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Zhejiang University, China

Abstract: Visual Language Models such as CLIP excel in image recognition due to extensive imagetext pre-training. However, applying the CLIP inference in zero-shot classification, particularly for medical image diagnosis, faces challenges due to: 1) the inadequacy of representing image classes solely with single category names; 2) the modal gap between the visual and text spaces generated by CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets exhibit instability when applied to medical datasets. To tackle these challenges, we introduce the Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP's multimodal understandings for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from the constructed knowledge-enhanced base to enrich semantic text proxies. It then harnesses input images and these descriptions, encoded via CLIP, to stably generate multimodal proxies that boost the zero-shot classification performance. Extensive experiments conducted on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential in this paradigm of mining knowledge from CLIP for medical image classification and broader areas.

Abstract: Decentralized federated learning (DFL) realizes cooperative model training among connected clients without relying on a central server, thereby mitigating communication bottlenecks and eliminating the singlepoint failure issue present in centralized federated learning (CFL). Most existing work on DFL focuses on supervised learning, assuming each client possesses sufficient labeled data for local training. However, in real-world applications, much of the data is unlabeled. We address this by considering a challenging yet practical semi-supervised learning (SSL) scenario in DFL, where clients may have varying data sources: some with few labeled samples, some with purely unlabeled data, and others with both. In this work, we propose SemiDFL, the first semi-supervised DFL method that enhances DFL performance in SSL scenarios by establishing a consensus in both data and model spaces. Specifically, we utilize neighborhood information to improve the quality of pseudo-labeling, which is crucial for effectively leveraging unlabelled data. We then design a consensus-based diffusion model to generate synthesized data, which is used in combination with pseudo-labeled data to create mixed datasets. Additionally, we develop an adaptive aggregation method that leverages the model accuracy of synthesized data to further enhance SemiDFL performance. Through extensive experimentation, we demonstrate the remarkable performance superiority of the proposed DFL-Semi method over existing CFL and DFL schemes in both iid and Non-iid SSL scenarios.

Abstract: Molecular design inherently involves the optimization of multiple conflicting objectives, such as enhancing bioactivity and ensuring synthesizability. Evaluating these objectives often requires resource-intensive computations or physical experiments. Current molecular design methodologies typically approximate the Pareto set using a limited number of molecules. In this paper, we present an innovative approach, called Multi-Objective Molecular Design through Learning Latent Pareto Set (MLPS). MLPS initially utilizes an encoder-decoder model to seamlessly transform the discrete chemical space into a continuous latent space. We then employ local Bayesian optimization models to efficiently search for local optimal solutions (i.e., molecules) within predefined trust regions. Using surrogate objective values derived from these local models, we train a global Pareto set learning model to understand the mapping between direction vectors (called “preferences”) in the objective space and the entire Pareto set in the continuous latent space. Both the global Pareto set learning model and local Bayesian optimization models collaborate to discover high-quality solutions and adapt the trust regions dynamically. Our work is an effective endeavor towards learning the Pareto set for multi-objective molecular design, providing decision-makers with the capability to fine-tune their preferences and thoroughly explore the Pareto set. Experimental results demonstrate that MLPS achieves state-of-the-art performance across various multi-objective scenarios, encompassing diverse objective types and varying numbers of objectives. The effectiveness of MLPS was further validated through real-world challenges in discovering antifungal peptides with low toxicity and high activity.

Abstract: Mixup is a data augmentation technique that enhances model generalization by interpolating between data points using a mixing ratio lambda in the image domain. Recently, the concept of mixup has been adapted to the graph domain through nodecentric interpolations. However, these approaches often fail to address the complexity of interconnected relationships, potentially damaging the graph's natural topology and undermining node interactions. Furthermore, current graph mixup methods employ a one-size-fits-all strategy with a randomly sampled lambda for all mixup pairs, ignoring the diverse needs of different pairs. This paper proposes an Adaptive Graph Mixup (AGMixup) framework for semi-supervised node classification. AGMixup introduces a subgraph-centric approach, which treats each subgraph similarly to how images are handled in Euclidean domains, thus facilitating a more natural integration of mixup into graph-based learning. We also propose an adaptive mechanism to tune the mixing ratio lambda for diverse mixup pairs, guided by the contextual similarity and uncertainty of the involved subgraphs. Extensive experiments across seven datasets on semi-supervised node classification benchmarks demonstrate AGMixup's superiority over state-of-the-art graph mixup methods.

Abstract: Visual prompt tuningbased continual learning (CL) methods have shown promising performance in exemplar-free scenarios, where their key component can be viewed as a prompt generator. Existing approaches generally rely on freezing old prompts, slow updating and task discrimination for prompt generators to preserve stability and minimize forgetting. In contrast, we introduce a novel approach that trains a consistent prompt generator to ensure stability during CL. Consistency means that for any instance from an old task, its corresponding instance-ware prompt generated by the prompt generator remains consistent even as the generator continually updates in a new task. This ensures that the representation of a specific instance remains stable across tasks and thereby prevents forgetting. We employ a mixture of experts (MoE) as the prompt generator, which contains a router and multiple experts. By deriving conditions sufficient to achieve the consistency for the MoE prompt generator, we demonstrate that: during training in a new task, if the router and experts update in the directions orthogonal to the subspaces spanned by old input features and gating vectors, respectively, the consistency can be theoretically guaranteed. To implement this orthogonality, we project parameter gradients to those orthogonal directions using the orthogonal projection matrices computed via the null space method. Extensive experiments on four class-incremental learning benchmarks validate the effectiveness and superiority of our approach.

Abstract: Blendedtarget domain adaptation (BTDA) leverages learned source knowledge to adapt the model to a blended-target domain that is composed of multiple unlabeled sub-target domains with distinct statistical characteristics. The existing BTDA methods usually overlook semantic correlation information across multiple domains and domain shifts among sub-target domains, resulting in suboptimal adaptation performance. To fully harness semantic knowledge and alleviate domain shifts in hybrid data distribution, we propose a collaborative semantic consistency alignment (CSCA) method for BTDA. Specifically, we achieve distribution alignment by minimizing the sliced Wasserstein distance between the source and target feature distributions. To alleviate complex domain shifts among all sub-target domains in the hybrid feature space, we design graph networks to propagate and share semantic knowledge across domains, which reduces semantic discrepancies among multiple domains. Additionally, we propose a double consistency regularization method to reduce the susceptibility of the model to domain-specific information, further facilitating semantic alignment and alleviating domain shifts. Extensive experiments on several datasets show that CSCA achieves promising classification performance.

Abstract: Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server without exposing their individual data. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks potential privacy leakage, and even precludes collaboration among heterogeneous clients. Distillationbased FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it highly relies on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE), which estimates the gradients after flowing through on-device models in a black-box optimization manner to complete the training of the generator in terms of fidelity, transferability, diversity, and equilibrium, without involving any auxiliary data or sharing any model parameters, thus combining the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection.

School of Software, Shandong University, Jinan, China Shandong Research Institute of Industrial Technology, Jinan, China, School of Software, Shandong University, Jinan, China School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, China, School of Software, Shandong University, Jinan, China, School of Software, Shandong University, Jinan, China, School of Software, Shandong University, Jinan, China, School of Software, Shandong University, Jinan, China, School of Software, Shandong University, Jinan, China

Abstract: Incorporating tagging information to regularize the representation learning of images usually leads to improved performance in image classification by aligning the visual features with the textual ones of higher discriminative power. Existing methods typically follow the predictive approach, which uses tags as the semantic labels for visual input to make predictions. However, they typically face the problem of handling the heterogeneity between modalities. In order to learn accurate visualsemantic mapping, this paper presents a visual-semantic causal association modeling framework termed VSCNet. It aligns visual regions with tags, uses a pre-learned hierarchy of visual and semantic exemplars to refine tag predictions and constructs an augmented heterogeneous graph to perform causal intervention. Specifically, the fine-grained visual-semantic alignment (FVA) module adaptively locates the semantic-intensive regions corresponding to tags. The heterogeneous association refinement (HAR) module associates the visual regions, semantic elements and pre-learned visual prototypes in a heterogeneous graph to filter the error predictions and enrich the information. The causal inference with graphical masking (CIM) module applies self-learned masks to discover the causal nodes and edges in the heterogeneous graph to address the spurious association, forming robust causal representations. Experimental results from two benchmarking datasets show that VSCNet effectively builds the visual-semantic associations from images and leads to better performance than the state-of-the-art methods with enriched predictive information.

Abstract: In this paper we describe a novel framework for diffusionbased generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.

Abstract: Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing transformer large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in mechanistic interpretability (MI) study in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna33B across different tasks, and a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. Especially, we identify two classes of heads whose activations represent the abstract notion of true and false in GAR tasks respectively. They play fundamental roles in CRR across various models and tasks.

Abstract: Incomplete multiview multi-label classification aims to accurately predict labels for each sample in the face of some missing views. Due to its widespread presence in real-world scenarios, it has become an extensively researched topic. In addition to the challenges brought by missing views, it also encounters issues caused by redundant views, whose inclusion fails to make a positive contribution to performance. In this paper, we make the first attempt to take advantage of diffusion models to address the missing view problem and design a strategy to identify and remove redundant views. Specifically, we train a diffusion model conditioned on the pseudo-labels to recover information of missing views. The learned diffusion model can carry data distribution knowledge in training split to the data. Regarding redundant identification strategy, it is designed by considering both the additional information of views and the classification difficulty level of samples, thereby adaptively identifying and removing redundant views. We conduct extensive experiments on five datasets, and the proposed method achieves favorable performance against several state-of-the-art methods on the multi-view multi-label classification task.

The Chinese University of Hong Kong, Shenzhen, The Chinese University of HongKong, Shenzhen, The Chinese University of Hong Kong, Shenzhen Shenzhen Research Institute of Big Data, University of Michigan - Ann Arbor, The Chinese University of Hong Kong, Shenzhen, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society The Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society

Abstract: Federated Learning (FL) has received much attention in recent years. However, although clients are not required to share their data in FL, the global model itself can implicitly remember clients' local data. Therefore, it’s necessary to effectively remove the target client's data from the FL global model to ease the risk of privacy leakage and implement "the right to be forgotten". Federated Unlearning (FU) has been considered a promising solution to remove data without full retraining. But the model utility easily suffers significant reduction during unlearning due to the gradient conflicts. Furthermore, when conducting the posttraining to recovery the model utility, it’s prone to move back and revert what have already been unlearned. To address these issues, we propose Federated Unlearning with Orthogonal Steepest Descent (FedOSD). We first design an unlearning cross entropy loss to overcome the convergence issue of the gradient ascent. A steepest descent direction for unlearning is then calculated in the condition of being non-conflicting with other clients’ gradients and closest to the target client's gradient. This benefits to efficiently unlearn and mitigate the model utility reduction. After unlearning, we recover the model utility by maintaining the achievement of unlearning. Finally, extensive experiments in several FL scenarios verify that FedOSD outperforms the SOTA FU algorithms in terms of unlearning and the model utility.

Abstract: Domain adaptation (DA) tackles the issue of distribution shift by learning a model from a source domain that generalizes to a target domain. However, most existing DA methods are designed for scenarios where the source and target domain data lie within the same feature space, which limits their applicability in realworld situations. Recently, heterogeneous DA (HeDA) methods have been introduced to address the challenges posed by heterogeneous feature space between source and target domains. Despite their successes, current HeDA techniques fall short when there is a mismatch in both feature and label spaces. To address this, this paper explores a new DA scenario called open-set HeDA (OSHeDA). In OSHeDA, the model must not only handle heterogeneity in feature space but also identify samples belonging to novel classes. To tackle this challenge, we first develop a novel theoretical framework that constructs learning bounds for prediction error on target domain. Guided by this framework, we propose a new DA method called Representation Learning for OSHeDA (RL-OSHeDA). This method is designed to simultaneously transfer knowledge between heterogeneous data sources and identify novel classes. Experiments across text, image, and clinical data demonstrate the effectiveness of our algorithm.

Abstract: Highdimensional data visualization is crucial in big data era and these techniques such as t-SNE and UMAP have been widely used in science and engineering. Big data, however, is often distributed across multiple data centers and subject to security and privacy concerns, which leads to difficulties for the standard algorithms of t-SNE and UMAP. To tackle the challenge, this work proposes Fed-tSNE and Fed-UMAP, which provide high-dimensional data visualization under the framework of federated learning, without exchanging data across clients or sending data to the central server. The main idea of Fed-tSNE and Fed-UMAP is implicitly learning the distribution information of data in a manner of federated learning and then estimating the global distance matrix for t-SNE and UMAP. To further enhance the protection of data privacy, we propose Fed-tSNE+ and Fed-UMAP+. We also extend our idea to federated spectral clustering, yielding algorithms of clustering distributed data. In addition to these new algorithms, we offer theoretical guarantees of distance and similarity estimation and analyze the property of differential privacy. Experiments on multiple datasets demonstrate that, compared to the original algorithms, the accuracy drops of our federated algorithms are tiny.

Abstract: Multiview clustering has gained increasing attention by utilizing the complementary and consensus information across views. To alleviate the computation cost for the existing multi-view clustering approaches on datasets with large scales, studies based on anchor have been presented. Although extensively adopted in the real scenarios, most of these works ignore to learn an integral subspace revealing the cluster structure with anchors from different views being aligned, where the centroid and cluster assignment matrix can be directly achieved based on the integral subspace. Moreover, these works neglect to perform the alignment among anchors and integral subspace learning in a unified model on the incomplete multi-view dataset. Then the mutual improvements among aligning anchors and learning integral subspace are not guaranteed in optimizing the objective function, which inevitably limit the representation ability of the model and result in the suboptimal clustering performance. In this paper, we propose a novel anchor learning method for incomplete multi-view dataset termed Scalable One-pass incomplete Multi-view clustEring by Aligning anchorS (SOME-AS). Specifically, we capture the complementary information among multiple views by building the anchor graph for each view on the incomplete dataset. The integral subspace reflecting the cluster structure is learned with the alignment among anchors from different views being considered. We build the cluster assignment and centroid representation with orthogonal constraint to approximate the integral subspace. Then the subspace itself and the partition are simultaneously taken into account in this manner. Besides, the mutual improvements among aligning anchors and learning integral subspace are able to be ensured. Experiments on several incomplete multi-view datasets validate the efficiency and effectiveness of SOME-AS.

Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences, University of Electronic Science and Technology of China, Institute of Automation, Chinese Academy of Sciences The Hong Kong Polytechnic University, SynSense AG Corporation, Huinao Zhixin, Institute of Automation, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences Peng Cheng Laboratory Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences

Abstract: Spiking Neural Networks (SNNs) provide an energyefficient way to extract 3D spatio-temporal features. Point clouds are sparse 3D spatial data, which suggests that SNNs should be well-suited for processing them. However, when applying SNNs to point clouds, they often exhibit limited performance and fewer application scenarios. We attribute this to inappropriate preprocessing and feature extraction methods. To address this issue, we first introduce the Spike Voxel Coding (SVC) scheme, which encodes the 3D point clouds into a sparse spike train space, reducing the storage requirements and saving time on point cloud preprocessing. Then, we propose a Spike Sparse Convolution (SSC) model for efficiently extracting 3D sparse point cloud features. Combining SVC and SSC, we design an efficient 3D SNN backbone (E-3DSNN), which is friendly with neuromorphic hardware. For instance, SSC can be implemented on neuromorphic chips with only minor modifications to the addressing function of vanilla spike convolution. Experiments on ModelNet40, KITTI, and Semantic KITTI datasets demonstrate that E-3DSNN achieves state-of-the-art (SOTA) results with remarkable efficiency. Notably, our E-3DSNN (1.87M) obtained 91.7% top-1 accuracy on ModelNet40, surpassing the current best SNN baselines (14.3M) by 3.0%. To our best knowledge, it is the first direct training 3D SNN backbone that can simultaneously handle various 3D computer vision tasks (e.g., classification, detection, and segmentation) with an event-driven nature.

Abstract: Deep NeuroFuzzy Inference Systems (DNFIS) seamlessly fuse neural networks with the fuzzy inference system enabling intricate decision-making and knowledge representation, while upholding a commendable degree of adaptability and interpretability. However, the challenge of privacy-preserving inference (PI) over DNFIS has remained largely uncharted, with no prior research addressing this critical issue. In this paper, we embark on an exploration of this issue. We introduce an efficient and secure PI framework for DNFIS, named PrivDNFIS, which leverages the post-quantum lattice-based homomorphic encryption to implement secure computation protocols for PI over DNFIS. Our work incorporates several non-trivial performance enhancements. Firstly, it consolidates multiple elements of input feature vectors into a single message, reducing encryption/decryption overhead. Secondly, building upon this novel encoding approach, PrivDNFIS can perform ciphertext aggregation and vector-vector inner production without necessitating time-consuming ciphertext rotation operations. Thirdly, we replace the softmax function in the DNFIS layer with a quadratic function to further enhance inference efficiency, without compromising the inference accuracy. Under the given threat model, we provide formal security proof for PrivDNFIS. In comprehensive experimental results, PrivDNFIS demonstrates an approximately 1.9 to 4.4 times reduction in end-to-end time cost compared to the benchmark.

Abstract: This paper focuses on the newly emerged research topic, i.e., whether the complex decisionmaking logic of a DNN can be mathematically summarized into a few simple logics. Beyond the explanation of a static DNN, in this paper, we hope to show that the seemingly complex learning dynamics of a DNN can be faithfully represented as the change of a few primitive interaction patterns encoded by the DNN. Therefore, we redefine the interaction of principal feature components in intermediate-layer features, which enables us to concisely summarize the highly complex dynamics of interactions throughout the learning of the DNN. The mathematical faithfulness of the new interaction is experimentally verified. From the perspective of learning efficiency, we find that the interactions naturally belong to five groups (reliable, withdrawn, forgotten, betraying, and fluctuating interactions), each representing a distinct type of dynamics of an interaction being learned and/or being forgotten. This provides deep insights into the learning process of a DNN.

Abstract: Test Time Adaptation (TTA) addresses the problem of distribution shift by adapting a pretrained model to a new domain during inference. When faced with challenging shifts, most methods collapse and perform worse than the original pretrained model. In this paper, we find that not all layers are equally receptive to the adaptation, and the layers with the most misaligned gradients often cause performance degradation. To address this, we propose GALA, a novel layer selection criterion to identify the most beneficial updates to perform during test time adaptation. This criterion can also filter out unreliable samples with noisy gradients. Its simplicity allows seamless integration with existing TTA loss functions, thereby preventing degradation and focusing adaptation on the most trainable layers. This approach also helps to regularize adaptation to preserve the pretrained features, which are crucial for handling unseen domains. Through extensive experiments, we demonstrate that the proposed layer selection framework improves the performance of existing TTA approaches across multiple datasets, domain shifts, model architectures, and TTA losses.

Abstract: Incorporating user preferences into multiobjective Bayesian optimization (MOBO) allows for personalization of the op- timization procedure. Preferences are often abstracted in the form of an unknown utility function, estimated through pair- wise comparisons of potential outcomes. However, utility-driven MOBO methods can yield solutions that are dominated by nearby solutions, as non-dominance is not enforced. Additionally, classical MOBO commonly relies on estimating the entire Pareto front to identify the Pareto-optimal solutions, which can be expensive and ignore user preferences. Here, we present a new method, termed preference-utility-balanced MOBO (PUB-MOBO), that allows users to disambiguate between near-Pareto candidate solutions. PUB-MOBO combines utility-based MOBO with local multi-gradient descent to refine user-preferred solutions to be near-Pareto-optimal. To this end, we propose a novel preference-dominated utility function that concurrently preserves user-preferences and dominance amongst candidate solutions. A key advantage of PUB-MOBO is that the local search is restricted to a (small) region of the Pareto front directed by user preferences, alleviating the need to estimate the entire Pareto-front. PUB-MOBO is tested on three synthetic benchmark problems: DTLZ1, DTLZ2 and DH1, as well as on three real-world problems: Vehicle Safety, Conceptual Marine Design, and Car Side Impact. PUB-MOBO consistently outperforms state-of-the-art competitors in terms of proximity to the Pareto-front and utility regret across all the problems.

Abstract: Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing longrange dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to effectively capture local and global dependencies in quasiliniear time. Our study on synthetic datasets reveals that GECO reaches 169x speedup on a graph with 2M nodes w.r.t. optimized attention. Further evaluations on diverse range of benchmarks showcase that it scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA up to 4.5%, and offering a scalable and effective solution for large-scale graph learning.

Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering, Chinese Academy of Sciences Key Laboratory of Cyberspace Security Defense, College of Information and Intelligence Engineering, Zhejiang Wanli University, Institute of Information Engineering, Chinese Academy of Sciences Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering, Chinese Academy of Sciences Key Laboratory of Cyberspace Security Defense, Tsinghua University

Abstract: For general users, training a neural network from scratch is usually challenging and laborintensive. Fortunately, neural network zoos enable them to find a well-performing model for directly use or fine-tuning it in their local environments. Although current model retrieval solutions attempt to convert neural network models into vectors to avoid complex multiple inference processes required for model selection, it is still difficult to choose a suitable model due to inaccurate vectorization and biased correlation alignment between the query dataset and models. From the perspective of knowledge consistency, i.e., whether the knowledge possessed by the model can meet the needs of query tasks, we propose a model retrieval scheme, named Know2Vec, that acts as a black-box retrieval proxy for model zoo. Know2Vec first accesses to models via a black-box interface in advance, capturing vital decision knowledge from models while ensuring their privacy. Next, it employs an effective encoding technique to transform the knowledge into precise model vectors. Secondly, it maps the user's query task to a knowledge vector by probing the semantic relationships within query samples. Furthermore, the proxy ensures the knowledge-consistency between query vector and model vectors within their alignment space, which is optimized through the supervised learning with diverse loss functions, and finally it can identify the most suitable model for a given task during the inference stage. Extensive experiments show that our Know2Vec achieves superior retrieval accuracy against the state-of-the-art methods in diverse neural network retrieval tasks.

Abstract: Despite empirical risk minimization (ERM) is widely applied in the machine learning community, its performance is limited on data with spurious correlation or subpopulation that is introduced by hidden attributes. Existing literature proposed techniques to maximize groupbalanced or worst-group accuracy when such correlation presents, yet, at the cost of lower average accuracy. In addition, many existing works conduct surveys on different subpopulation methods without revealing the inherent connection between these methods, which could hinder the technology advancement in this area. In this paper, we identify important sampling as a simple yet powerful tool for solving the subpopulation problem. On the theory side, we provide a new systematic formulation of the subpopulation problem, and explicitly identify the assumptions that are not clearly stated in the existing works. This helps to uncover the cause of the dropped average accuracy. We provide the first theoretical discussion on the connections of existing methods, revealing the core components that make them different. On the application side, we demonstrate a single estimator is enough to solve the subpopulation problem. In particular, we introduce the estimator in both attribute-known and -unknown scenarios in the subpopulation setup, offering flexibility in practical use cases. And empirically, we achieve state-of-the-art performance on commonly used benchmark datasets.

Abstract: Constrained ksubmodular maximization is a general framework that captures many discrete optimization problems such as ad allocation, influence maximization, personalized recommendation, and many others. In many of these applications, datasets are large or decisions need to be made in an online manner, which motivates the development of efficient streaming and online algorithms. In this work, we develop single-pass streaming and online algorithms for constrained k-submodular maximization with both monotone and general (possibly non-monotone) objectives subject to cardinality and knapsack constraints. Our algorithms achieve provable constant-factor approximation guarantees which improve upon the state of the art in almost all settings. Moreover, they achieve the fastest known running times and have optimal space usage. We experimentally evaluate our algorithms on instances for ad allocation and other applications, where we observe that our algorithms are practical and scalable, and construct solutions that are comparable in value even to offline greedy algorithms.

Abstract: Symbolic regression (SR) has emerged as a pivotal technique for uncovering the intrinsic information within data and enhancing the interpretability of AI models. However, current stateof-the-art (sota) SR methods struggle to perform correct recovery of symbolic expressions from high-noise data. To address this issue, we introduce a novel noise-resilient SR (NRSR) method capable of recovering expressions from high-noise data. Our method leverages a novel reinforcement learning (RL) approach in conjunction with a designed noise-resilient gating module (NGM) to learn symbolic selection policies. The gating module can dynamically filter the meaningless information from high-noise data, thereby demonstrating a high noise-resilient capability for the SR process. And we also design a mixed path entropy (MPE) bonus term in the RL process to increase the exploration capabilities of the policy. Experimental results demonstrate that our method significantly outperforms several popular baselines on benchmarks with high-noise data. Furthermore, our method also can achieve sota performance on benchmarks with clean data, showcasing its robustness and efficacy in SR tasks.

Abstract: ClassIncremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones. Despite Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing work seeks to utilize lightweight components to adjust the PTM, while the forgetting phenomenon still comes from parameter and retrieval levels. Specifically, iterative updates of the model result in parameter drift, while mistakenly retrieving irrelevant modules leads to the mismatch during inference. To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge. By training task-specific adapters, we continually adjust the PTM to downstream tasks. To mitigate parameter-level forgetting, we present an adapter merging approach to learn task-specific adapters, which aims to bridge the gap between different components while reserve task-specific information. Besides, to address retrieval-level forgetting, we introduce a training-free self-refined adapter retrieval mechanism during inference, which leverages the model's inherent ability for better adapter retrieval. By jointly rectifying the model with those steps, MOS can robustly resist catastrophic forgetting in the learning process. Extensive experiments on seven benchmark datasets validate MOS's state-of-the-art performance.

Abstract: Most graph contrastive learning (GCL) methods heavily rely on crossview contrast, thus facing several concomitant challenges, such as the complexity of designing effective augmentations, the potential for information loss between views, and increased computational costs. To mitigate reliance on cross-view contrasts, we propose SIGNA, a novel single-view graph contrastive learning framework. Regarding the inconsistency between structural connection and semantic similarity of neighborhoods, we resort to soft neighborhood awareness for GCL. Specifically, we leverage dropout to obtain structurally-related yet randomly-noised embedding pairs for neighbors, which serve as potential positive samples. At each epoch, the role of partial neighbors is switched from positive to negative, leading to probabilistic neighborhood contrastive learning effect. Moreover, we propose a normalized Jensen-Shannon divergence estimator for a better effect of contrastive learning. Experiments on diverse node-level tasks demonstrate that our simple single-view GCL framework consistently outperforms existing methods by margins of up to 21.74% (PPI). In particular, with soft neighborhood awareness, SIGNA can adopt MLPs instead of complicated GCNs as the encoder in transductive learning tasks, thus speeding up its inference process by 109× to 331×.

Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China, Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore Institute for InfoComm Research, Agency for Science, Technology and Research, Singapore, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China

Abstract: Deep learning has significantly advanced time series forecasting through its powerful capacity to capture sequence relationships. However, training these models with the Mean Square Error (MSE) loss often results in oversmooth predictions, making it challenging to handle the complexity and learn high-entropy features from time series data with high variability and unpredictability. In this work, we introduce a novel approach by tokenizing time series values to train forecasting models via cross-entropy loss, while considering the continuous nature of time series data. Specifically, we propose a Hierarchical Classification Auxiliary Network, HCAN, a general model-agnostic component that can be integrated with any forecasting model. HCAN is based on a Hierarchy-Aware Attention module that integrates multi-granularity high-entropy features at different hierarchy levels. At each level, we assign a class label for timesteps to train an Uncertainty-Aware Classifier. This classifier mitigates the over-confidence in softmax loss via evidence theory. We also implement a Hierarchical Consistency Loss to maintain prediction consistency across hierarchy levels. Extensive experiments integrating HCAN with state-of-the-art forecasting models demonstrate substantial improvements over baselines on several real-world datasets.

Abstract: We introduce a novel matrix representation for differentially private training and prediction methods tailored to random forest classifiers. Our approach involves representing each rootto-leaf decision path in all trees as a row vector in a matrix. Similarly, inference queries are represented as a matrix. This representation enables us to collectively analyze privacy across multiple trees and inference queries, resulting in optimal DP noise allocation under the Laplace Mechanism. Our experimental results show significant accuracy improvements of up to 40% compared to state-of-the-art methods.

Abstract: We investigate safe online convex optimization (SOCO), where each decision must satisfy a set of unknown linear constraints. Assuming that the unknown constraints can be observed with a subGaussian noise for each chosen decision, previous studies have established a high-probability regret bound of O(T^{2/3}). However, this assumption may not hold in many practical scenarios. To address this limitation, in this paper, we relax the assumption to allow any noise that admits finite (1+ε)-th moments for some ε∈(0,1], and propose two algorithms that enjoy an O(T^{c_ε}) regret bound with high probability, where T is the time horizon and c_ε=(1+ε)/(1+2ε). The key idea of our two algorithms is to respectively utilize the median-of-means and truncation techniques to achieve accurate estimation under heavy-tailed noises. To the best of our knowledge, these are the first algorithms designed to handle SOCO with heavy-tailed observation noises.

Abstract: Lately, deep generative models have achieved excellent results after learning predefined and static data distribution. Meanwhile, their performance on continual learning suffers from degeneration, caused by catastrophic forgetting. In this paper, we study the unsupervised generative modelling in a more realistic continual learning scenario, where class and task information are absent during both training and inference learning phases. To implement this goal, the proposed memory approach consists of a temporary memory system, which stores data examples while a dynamic expansion memory system would gradually preserve those samples that are crucial for long-term memorization. A novel memory expansion mechanism is then proposed, by employing optimal transport distances between the statistics of memorized samples and each newly seen datum. This paper proposes the Sinkhorn-based Dual Dynamic Memory (SDDM) method, by considering Sinkhorn distance as an optimal transport measure, for evaluating the significance of the data to be stored in the memory buffer. The Sinkhorn transport algorithm leads to preserving a diversity of samples within a compact memory capacity. The memory buffering approach does not interact with the model's training process and can be optimized independently in both supervised and unsupervised learning without any modifications. Moreover, we also propose a novel dynamic model expansion mechanism to automatically increase the model's capacity whenever necessary, which can deal with infinite data streams and further improve the model's performance. Experimental results show that the proposed approach achieves state-of-the-art performance in both supervised and unsupervised learning.

Abstract: The diffusion model has lately been shown to achieve remarkable performances through its ability of generating high quality images. However, current diffusion model studies consider only learning from a single data distribution, resulting in catastrophic forgetting when attempting to learn new data. In this paper, we explore a more realistic learning scenario where training data is continuously acquired. We propose the Dynamic Expansion Diffusion Model (DEDM) for addressing catastrophic forgetting and data distribution shifts under Online TaskFree Continual Learning (OTFCL) paradigm. New diffusion components are added to a mixture model following the evaluation of a criterion which compares the probabilistic representation of the new data with the existing knowledge of the DEDM model. In addition, to maintain an optimal architecture, we propose a component discovery approach that ensures the diversity of knowledge while minimizing the total number of parameters in the DEDM. Furthermore, we show how the proposed DEDM can be implemented as a teacher module in a unified framework for representation learning. In this approach, knowledge distillation is proposed for training a student module aiming to compress the teacher's knowledge into the latent space of the student.

Abstract: The widespread adoption of Batch Normalization (BN) in contemporary deep neural architectures has demonstrated significant efficacy, particularly in the domain of Unsupervised Domain Adaptation (UDA) for crossdomain applications. Notwithstanding its success, extant BN variants often conflate source and target domain information within identical channels, potentially compromising transferability due to inter-domain feature misalignment. To address this limitation, we introduce Refined Batch Normalization (RBN), a novel normalization paradigm that leverages estimated shift to quantify discrepancies between estimated population statistics and their expected values. Our pivotal observation reveals that estimated shift can accumulate through BN stacking within the network, potentially degrading target domain performance. We elucidate how RBN mitigates this accumulation, thereby enhancing overall system efficacy. The practical implementation of this technique is realized through the RBNBlock, which supplants conventional BN with RBN in the bottleneck architecture of residual networks. Extensive empirical evaluation across diverse cross-domain benchmarks corroborates the superiority of RBN in augmenting inter-domain transferability. This perspective transcends immediate performance metrics, offering a foundational lens through which subsequent research can more deeply understand and refine the interplay between normalization strategies and domain adaptation.

Abstract: Spiking Neural Networks (SNNs) are emerging as a promising alternative to Artificial Neural Networks (ANNs) due to their inherent energy efficiency. Owing to the inherent sparsity in spike generation within SNNs, the indepth analysis and optimization of intermediate output spikes are often neglected. This oversight significantly restricts the inherent energy efficiency of SNNs and diminishes their advantages in spatiotemporal feature extraction, resulting in a lack of accuracy and unnecessary energy expenditure. In this work, we analyze the inherent spiking characteristics of SNNs from both temporal and spatial perspectives. In terms of spatial analysis, we find that shallow layers tend to focus on learning vertical variations, while deeper layers gradually learn horizontal variations of features. Regarding temporal analysis, we observe that there is not a significant difference in feature learning across different time steps. This suggests that increasing the time steps has limited effect on feature learning. Based on the insights derived from these analyses, we propose a Frequency-based Spatial-Temporal Attention (FSTA) module to enhance feature learning in SNNs. This module aims to improve the feature learning capabilities by suppressing redundant spike features. The experimental results indicate that the introduction of the FSTA module significantly reduces the spike firing rate of SNNs, demonstrating superior performance compared to state-of-the-art baselines across multiple datasets.

Abstract: Graph transfer learning endeavors to develop a Graph Neural Network (GNN) model in a fullylabeled source domain, with the intention of deploying it on a target domain that has limited labeled data for inference. We reveal that prevalent graph transfer learning methods are susceptible to the homophily shift problem. This issue arises from the divergence in homophily structures between the source and target graphs, leading to a notable deterioration in the performance of GNN models. In this paper, we introduce a novel Contextual Structural Graph Neural Network (CS-GNN) method, leveraging a tailored attention mechanism to apprehend a variety of local structural cues, facilitating structural knowledge transfer across domains. It features an ego-network module to distill local structural diversity and a moment-based approach to gauge structural patterns without needing ground-truth labels. CS-GNN crafts a feature smoothness matrix from node attributes, guiding a customized attention mechanism for feature aggregation. A group-wise fairness loss is employed to balance learning across various structural patterns, enhancing the model's ability to transfer knowledge across domains. Comprehensive experiments conducted on six benchmark datasets substantiate the superiority of CS-GNN over the state-of-the-art methods, demonstrating significant improvements in accuracy and robustness against homophily shifts.

Abstract: Largescale pre-trained Vision Transformer (ViT) models have demonstrated remarkable performance on visual tasks but are computationally expensive to transfer to downstream tasks. Parameter-Efficient Fine-Tuning (PEFT) offers a promising transferring approach by updating only a subset of parameters. However, PEFT's effectiveness is hindered by discrepancies between pre-training and downstream tasks in terms of object scale and granularity. Downstream tasks often focus on finer-grained and more specialized recognition, requiring more detailed features. The diversity of feature scales of existing PEFT methods for ViT is limited. To address this, we propose a novel PEFT method named Wavelet-based multi-Scale Tuning (WST), which learns multi-scale features in a simple and efficient way. WST introduces a parallel fine-tuning patch embedding branch with a smaller patch size than the pre-trained model to capture finer-grained features. Furthermore, to handle the computational challenge from the resulting longer token sequence, WST designs wavelet fine-tuning blocks that balance both efficiency and performance. In the block, wavelet transform enables invertible and lossless down-sampling of the longer token sequence, aligning it with that of the backbone, and two lightweight linear mappings are employed to learn task-specific features. This design facilitates efficient multi-scale information exchange between the pre-trained backbone and fine-tuning branch. Extensive experiments on transfer learning demonstrate the promising performance and efficiency of our WST.

Abstract: Machine learning models are vulnerable to both security attacks (e.g., adversarial examples) and privacy attacks (e.g., private attribute inference). We take the first step to mitigate both the security and privacy attacks, and maintain task utility as well. Particularly, we propose an informationtheoretic framework to achieve the goals through the lens of representation learning, i.e., learning representations that are robust to both adversarial examples and attribute inference adversaries. We also derive novel theoretical results under our framework, e.g., the inherent trade-off between adversarial robustness/utility and attribute privacy, and guaranteed attribute privacy leakage against attribute inference adversaries.

Abstract: In this paper, we study the problem of (finite sum) minimax optimization in the Differential Privacy (DP) model. Unlike most of the previous studies on the (strongly) convexconcave settings or loss functions satisfying the Polyak-Lojasiewicz condition, here we mainly focus on the nonconvex-strongly-concave one, which encapsulates many models in deep learning such as deep AUC maximization. Specifically, we first analyze a DP version of Stochastic Gradient Descent Ascent (SGDA) and show the utility bound in terms of the Euclidean norm of the gradient for the empirical risk function. We then propose a new method with less gradient noise variance and improve the upper bound to the best-known result for DP Empirical Risk Minimization with non-convex loss. We also discussed several lower bounds of private minimax optimization. Finally, experiments on AUC maximization, generative adversarial networks, and temporal difference learning with real-world data support our theoretical analysis.

Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences

Abstract: While Reinforcement Learning (RL) agents can successfully learn to handle complex tasks, effectively generalizing acquired skills to unfamiliar settings remains a challenge. One of the reasons behind this is the visual encoder used are taskdependent, preventing effective feature extraction in different settings. To address this issue, recent studies have tried to pretrain encoders with diverse visual inputs in order to improve their performance. However, they rely on existing pretrained encoders without further exploring the impact of pretraining period. In this work, we propose APE: efficient reinforcement learning through Adaptively Pretrained visual Encoder—a framework that utilizes adaptive augmentation strategy during the pretraining phase and extracts useful features with only a few interactions within the task environments in the policy learning period. Experiments are conducted across various domains, including DeepMind Control Suite, Atari Games and Memory Maze benchmarks, to verify the effectiveness of our method. Results show that mainstream RL methods, such as DreamerV3 and DrQ-v2, achieve state-of-the-art performance when equipped with APE. In addition, APE significantly improves the sampling efficiency during learning, approaching the efficiency of state-based method using only visual inputs in several control tasks. These findings demonstrate the potential of adaptive pretraining of encoder in enhancing the generalization ability and efficiency of visual RL algorithms.

School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China, School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China, Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China, Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China

Abstract: Federated Clustering (FC) is crucial to mining knowledge from unlabeled nonIndependent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including the heterogeneity in terms of clients' communication capacity and the unknown number of proper clusters. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown proper cluster numbers are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads the excessive number of seed points to clients as a learning medium and coordinates them across clients to form a consensus. To alleviate the distribution imbalance cumulated due to the unforeseen asynchronous uploading from the heterogeneous clients, we also design a balancing mechanism for seeds updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.

Abstract: The Coarseto-Fine Few-Shot (C2FS) task is designed to train models using only coarse labels, then leverages a limited number of subclass samples to achieve fine-grained recognition capabilities. This task presents two main challenges: coarse-grained supervised pre-training suppresses the extraction of critical fine-grained features for subcategory discrimination, and models suffer from overfitting due to biased distributions caused by limited fine-grained samples. In this paper, we propose the Twofold Debiasing (TFB) method, which addresses these challenges through detailed feature enhancement and distribution calibration. Specifically, we introduce a multi-layer feature fusion reconstruction module and an intermediate layer feature alignment module to combat the model's tendency to focus on simple predictive features directly related to coarse-grained supervision, while neglecting complex fine-grained level details. Furthermore, we mitigate the biased distributions learned by the fine-grained classifier using readily available coarse-grained sample embeddings enriched with fine-grained information. Extensive experiments conducted on five benchmark datasets demonstrate the efficacy of our approach, achieving state-of-the-art results that surpass competitive methods.

Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China Nanjing Institute of Software Technology, University of Chinese Academy of Sciences, Nanjing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China University of Chinese Academy of Sciences, Beijing, China

Abstract: Learning with softmax crossentropy on one-hot labels often leads to overconfidence on the correct class. While label smoothing regulates this overconfidence by redistributing some confidence from the correct class to other incorrect classes, it compromises the representation in the logits about the similarity between samples of different classes and may hurt calibration if higher confidence is required for high accuracy. To overcome these limitations, we propose a Virtual Smoothing (VS) label that redistributes certain confidence from the correct class to additional VS classes to regularize overconfidence. In VS labels, the VS class nodes act as adversaries to the original class nodes, enforcing regularization by clustering samples across all classes. The zero confidence assigned to each incorrect class also allows the incorrect logits to be different from each other without erasing information about sample similarities. The prediction probability can still approach 1 when applying softmax to the logits of the original real classes, which avoids harming but consistently improves calibration. Experiments show that VS labels consistently improve accuracy and calibration while providing better logits for improved knowledge distillation. Additionally, VS labels exhibit effectiveness in improving adversarial training, robust distillation, and out-of-distribution detection.

Abstract: The imageto-video adaptation task seeks to effectively harness both labeled images and unlabeled videos for achieving effective video recognition. The modality gap of the image and video modalities and the domain discrepancy across the two domains are the two essential challenges in this task. Existing methods reduce the domain discrepancy via close-set domain adaptation techniques, resulting in inaccurate domain alignment as there exist outlier target frames. To tackle this issue, we extend the vanilla classifier with outlier classes, where each outlier class responsible for capturing outlier frames for a specific class via batch nuclear norm maximization loss. We further propose a new loss by treating the source images apart from class c as instances from outlier class specific for c. As for the modality gap, existing methods usually utilize the pseudo labels obtained from an image-level adapted model to learn a video-level model. Rare efforts are dedicated to handling the noise in pseudo labels. We proposed a new metric based on label propagation consistency to select samples for training a better video-level model. Experiments on 3 benchmarks validating the effectiveness of our method.

AIRI, Moscow, Russia, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia AIRI, Moscow, Russia, AIRI, Moscow, Russia Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia Moscow Institute of Physics and Technology, Dolgoprudny, Russia, AIRI, Moscow, Russia Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Moscow, Russia Moscow Institute of Physics and Technology, Dolgoprudny, Russia

Abstract: Multiagent pathfinding (MAPF) is a problem that generally requires finding collision-free paths for multiple agents in a shared environment. Solving MAPF optimally, even under restrictive assumptions, is NP-hard, yet efficient solutions for this problem are critical for numerous applications, such as automated warehouses and transportation systems. Recently, learning-based approaches to MAPF have gained attention, particularly those leveraging deep reinforcement learning. Typically, such learning-based MAPF solvers are augmented with additional components like single-agent planning or communication. Orthogonally, in this work we rely solely on imitation learning that leverages a large dataset of expert MAPF solutions and transformer-based neural network to create a foundation model for MAPF called MAPF-GPT. The latter is capable of generating actions without additional heuristics or communication. MAPF-GPT demonstrates zero-shot learning abilities when solving the MAPF problems that are not present in the training dataset. We show that MAPF-GPT notably outperforms the current best-performing learnable MAPF solvers on a diverse range of problem instances and is computationally efficient during inference.

Abstract: The Mutliagent Path Finding (MAPF) problem consists of identifying the trajectories that a set of agents should follow inside a given network in order to reach their desired destinations as soon as possible, but without colliding with each other. We aim to minimize the maximum time any agent takes to reach their goal, ensuring optimal path length. In this work, we complement a recent thread of results that aim to systematically study the algorithmic behavior of this problem, through the parameterized complexity point of view. First, we show that MAPF is NPhard when the given network has a star-like topology (bounded vertex cover number) or is a tree with 11 leaves. Both of these results fill important gaps in our understanding of the tractability of this problem that were left untreated in the recent work of Fioravantes et al., Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology, presented in AAAI'24. Nevertheless, our main contribution is an exact algorithm that scales well as the input grows (FPT) when the topology of the given network is highly centralized (bounded distance to clique). This parameter is significant as it mirrors real-world networks. In such environments, a bunch of central hubs or nodes (e.g., processing areas) are connected to peripheral nodes.

Abstract: MultiAgent Path Finding (MAPF) focuses on planning collision-free paths for multiple agents. However, during the execution of a MAPF plan, agents may encounter unexpected delays, which can lead to inefficiencies, deadlocks, or even collisions. To address these issues, the Switchable Temporal Plan Graph provides a framework for finding an acyclic Temporal Plan Graph with the minimum execution cost under delays, ensuring deadlock- and collision-free execution. Unfortunately, existing optimal algorithms, such as Mixed Integer Linear Programming and Graph-Based Switchable Edge Search (GSES), are often too slow for practical use. This paper introduces Improved GSES, which significantly accelerates GSES through four speedup techniques: stronger admissible heuristics, edge grouping, prioritized branching, and incremental implementation. Experiments conducted on four different map types with varying numbers of agents demonstrate that Improved GSES consistently achieves over twice the success rate of GSES and delivers up to a 30-fold speedup on instances where both methods successfully find solutions.

Abstract: Applying Reinforcement Learning (RL) to Restless MultiArm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors - a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems.

Abstract: The recent studies show that Large Language Models (LLMs) often fall short in tasks demanding creative, lateral thinking due to lacking a clear awareness of their own reasoning processes. To cope with this issue, we propose a novel metacognitive prompting method (titled as MP) by mimicking human metacognition. Through integrating metacognitive principles, MP endows LLMs with lateral thinking ability, thereby enhancing their abilities to strategize, monitor, and reflect on their responses when dealing with creative tasks. The experimental results with five base LLMs across three lateral thinking datasets demonstrate that: All LLMs armed with MP consistently outperform the representative baseline methods. For example, MP demonstrates superior performance over CoT prompting across Sentence Puzzle (+5.00%), Word Puzzle (+10.07%), BiRdQA (+6.48%), and RiddleSense (+2.65%) with GPT3.5-turbo model. In particular, the deployment of MP with GPT-4 achieves significant performance improvements that even surpass human performance on BRAINTEASER benchmark, demonstrating the transformative potential of MP in enhancing the creative problem-solving abilities of LLMs.

Abstract: Chainof-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models and improve a variety of downstream tasks. CoT mainly demonstrates excellent performance in English, but its usage in low-resource languages is constrained due to poor language generalization. To bridge the gap among different languages, we propose a cross-lingual instruction fine-tuning framework (xCoT) to transfer knowledge from high-resource languages to low-resource languages. Specifically, the multilingual instruction training data (xCoT-Instruct) is created to encourage the semantic alignment of multiple languages. We introduce cross-lingual in-context few-shot learning (xICL) to accelerate multilingual agreement in instruction tuning, where some fragments of source languages in examples are randomly substituted by their counterpart translations of target languages. During multilingual instruction tuning, we adopt the randomly online CoT strategy to enhance the multilingual reasoning ability of the large language model by first translating the query to another language and then answering in English. To further facilitate the language transfer, we leverage the high-resource CoT to supervise the training of low-resource languages with cross-lingual distillation. Experimental results demonstrate the superior performance of xCoT in reducing the gap among different languages, highlighting its potential to reduce the cross-lingual gap.

Abstract: Previous research on causal reasoning often overlooks the subtleties crucial to understanding causal reasoning. To address this gap, our study introduces the concept of causal epistemic consistency, which focuses on the selfconsistency of Large Language Models (LLMs) in differentiating intermediates with nuanced differences in causal reasoning. We propose a suite of novel metrics -- intensity ranking concordance, cross-group position agreement, and intra-group clustering -- to evaluate LLMs on this front. Through extensive empirical studies on 21 high-profile LLMs, including GPT-4, Claude3, and LLaMA3-70B, we have favoring evidence that current models struggle to maintain epistemic consistency in identifying the polarity and intensity of intermediates in causal reasoning. Additionally, we explore the potential of using internal token probabilities as an auxiliary tool to maintain causal epistemic consistency. In summary, our study bridges a critical gap in AI research by investigating the self-consistency over fine-grained intermediates involved in causal reasoning.

Abstract: Recent literature uses language to build foundation models for audio. These AudioLanguage Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive Question-Answering, requires proficiency in logical reasoning- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with potential conclusions being entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets-AudioCaps and Clotho-and hypotheses generated using Large Language Models (LLMs). We benchmark state-of-the-art ALMs and find deficiencies in logical reasoning with both zero-shot and linear probe evaluations. Finally, we propose "caption-before-reason", an intermediate step of captioning that improves the Zero-Shot and linear-probe performance of ALMs by an absolute 6% and 3%, respectively.

Abstract: Following natural instructions is crucial for the effective application of RetrievalAugmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, an automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (<100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general samples, scaling up to a high-quality VIF-RAG-QA dataset (>100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems.

School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST) National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security and Hubei Key Laboratory of Distributed System Security, Huawei International, School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST) National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security and Hubei Key Laboratory of Distributed System Security JinYinHu Laboratory, Huawei International, Huawei International, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab, School of Computer Science and Technology, HUST, Huawei International

Abstract: Large Language Models (LLMs) have achieved significant performance in various natural language processing tasks but also pose safety and ethical threats, thus requiring red teaming and alignment processes to bolster their safety. To effectively exploit these aligned LLMs, recent studies have introduced jailbreak attacks based on multiturn dialogues. These attacks aim to prompt LLMs to generate harmful or biased content by guiding them through contextual content. However, the underlying reasons for the effectiveness of multi-turn jailbreaks remain unclear. Existing attacks often focus on optimizing queries and escalating toxicity to construct dialogues, lacking a thorough analysis of the inherent vulnerabilities of LLMs. In this paper, we first conduct an in-depth analysis of the differences between single-turn and multi-turn jailbreaks and find that successful multi-turn jailbreaks can effectively disperse the attention of LLMs on keywords associated with harmful behaviors, especially in historical responses. Based on this, we propose ASJA, a new multi-turn jailbreak approach by shifting the attention of LLMs, specifically by iteratively fabricating the dialogue history through a genetic algorithm to induce LLMs to generate harmful content. Extensive experiments on three LLMs and two datasets show that our approach surpasses existing approaches in jailbreak effectiveness, the stealth of jailbreak prompts, and attack efficiency. Our work emphasizes the importance of enhancing the robustness of LLMs' attention mechanism in multi-turn dialogue scenarios for a better defense strategy.

Abstract: Finetuning large language models (LLMs) with a collection of large and diverse instructions has improved the model’s generalization to different tasks, even for unseen tasks. However, most existing instruction datasets include only single instructions, and they struggle to follow complex instructions composed of multiple subtasks. In this work, we propose a novel concept of compositional instructions called chain-of-instructions (CoI), where the output of one instruction becomes an input for the next like a chain. Unlike the conventional practice of solving single instruction tasks, our proposed method encourages a model to solve each subtask step by step until the final answer is reached. CoI-tuning (i.e., fine-tuning with CoI instructions) improves the model’s ability to handle instructions composed of multiple subtasks as well as unseen composite tasks such as multilingual summarization. Overall, our study find that simple CoI tuning of existing instruction data can provide consistent generalization to solve more complex, unseen, and longer chains of instructions.

Abstract: Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional taskoriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

Abstract: In recent years, visuallyrich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SORIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba's potential for length extrapolation.

Abstract: Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full finetuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5\% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning.

School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu, China Kash Institute of Electronics and Information Industry, Kashgar, China, Kash Institute of Electronics and Information Industry, Kashgar, China

Abstract: Fewshot Named Entity Recognition (NER) spotlights the tag of novel entity types in data-limited scenarios or lower-resource settings. Advances with Pre-trained Language Models (PLMs), including BERT, GPT, and their variants, have driven tremendous strategies to leverage context-dependent representations and exploit predefined relational cues, yielding significant gains in witnessing unseen entities. Nevertheless, a fundamental issue exists in prior efforts regarding their susceptibility to adversarial attacks in the intricate semantic environment. This vulnerability undermines the robustness of semantic representations, exacerbating the challenge of accurate entity identification, especially when transitioning across domains. To this end, we propose an Adversity-aware Augment Learning (AAL) solution for the few-shot NER task, dedicated to retrieving and reinforcing entity prototypes resilient to adversarial inference, thereby enhancing cross-domain semantic coherence. In particular, AAL employs a two-stage paradigm consisting of training and fine-tuning. The process initiates with augmentation learning by leveraging two kinds of prompt learning schemes, then identifies prototypes under the guidance of a variational manner. Furthermore, we devise a domain-oriented prototype refinement to optimize prototype learning under conditions of uncertainty attack, facilitating the effective transfer of common knowledge from source to target domains. The experimental results, encompassing the few-shot NER datasets under both certainty and uncertainty conditions, affirm the superiority of the proposed AAL over several representative baselines, particularly its capability against adversarial attacks.

Abstract: Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, finetuning LLMs using high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been vibrantly developed recently. The typical paradigm of LLM self-improvement involves training LLM on self-generated data, part of which may be detrimental and should be filtered out due to the unstable data quality. While current works primarily employs filtering strategies based on answer correctness, in this paper, we demonstrate that filtering out correct but with high distribution shift extent (DSE) samples could also benefit the results of self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by the Importance Weighting methods. Consequently, we integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny valid set (up to 5% size of the training set) to compute DS weight, our approach can notably promote the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.

Abstract: Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce computational costs to adapt to large language models(LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient finetuning applications. In this work, we investigate model editing to serve as an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the model’s predictions. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in and out of domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, Swansea University, United Kingdom

Abstract: Emotion Recognition in Conversations (ERC) involves automatically identifying the emotion of each utterance in conversations. The emotion of an utterance is contingent to the conversation context, and thus, annotating each utterance in ERC entails repetitive screening the whole conversation from annotators. Such a requirement leads to prohibitive cost in finegrained labeling on utterance. In this paper, we propose an efficient coarse-grained labeling strategy for ERC, which assigns a set of emotions for each conversation. In specific, we reformulate the ERC predictors with conversation-level emotion sets as weakly-supervised learning to optimise a potential candidate for ERC, which is termed as Dataless ERC (DERC). To validate this, we propose a simple-yet-flexible DERC framework with Progressive Learning (DERC-PL). We jointly update pseudo-utterance-level emotions and the ERC predictor in a self-training manner, where we progressively update the ERC predictor from training subsets with lower noise densities to the ones with higher noise densities. We implemented several versions of \baby by incorporating various off-the-shelf ERC methods. Extensive experimental results demonstrate that the proposed \baby can be on par with existing weakly-supervised learning baselines and supervised learning ERC methods.

Abstract: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competitionlevel math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that such improvement can generalize to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM.

Abstract: Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latencysensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or relying on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called Adaptix, which accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM. The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that Adaptix significantly accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.

College of Management and Economics, Tianjin University Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University, College of Management and Economics, Tianjin University Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University, AI Lab, Lenovo Research, College of Management and Economics, Tianjin University Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University, College of Management and Economics, Tianjin University Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University, College of Management and Economics, Tianjin University Laboratory of Computation and Analytics of Complex Management Systems (CACMS), Tianjin University

Abstract: Significant efforts have been dedicated to integrating the powerful Large Language Models (LLMs) with diverse modalities, particularly focusing on the fusion of language, vision and audio data. However, the graphstructured data, which is inherently rich in structural and domain-specific knowledge, has not yet been gracefully adapted to LLMs. Existing methods either describe the graph with raw text, suffering the loss of graph structural information, or feed Graph Neural Network (GNN) embeddings into LLMs at the cost of losing explainable prompt semantics. To bridge this gap, we introduce an end-to-end modality-aligning framework for LLM-graph alignment: Dual-Residual Vector Quantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully designed to facilitate token-level alignment with LLMs, enabling an effective translation of the intrinsic `language' of graphs into comprehensible natural language. We also manage to enhance LLMs' more robust structural understanding of graphs by incorporating multiple views of the central nodes based on their surrounding nodes at various distances. Our experimental evaluations on standard graph tasks demonstrate competitive performance against other state-of-the-art (SOTA) approaches. Additionally, our framework ensures certain visual interpretability, efficiency, and robustness, marking the promising successful endeavor to achieve token-level alignment between LLMs and GNNs.

Abstract: Generative retrieval constitutes an innovative approach in information retrieval, leveraging generative language models(LM) to generate a ranked list of document identifiers (docid) for a given query. It simplifies the retrieval pipeline by replacing the large external index with model parameters. However, existing works merely learned the relationship between queries and document identifiers, which is unable to directly represent the relevance between queries and documents. To address the above problem, we propose a novel and general generative retrieval framework, namely Leveraging DocumentOriented Contrastive Learning in Generative Retrieval (DOGR), which leverages contrastive learning to improve generative retrieval tasks. It adopts a two-stage learning strategy that captures the relationship between queries and documents comprehensively through direct interactions. Furthermore, negative sampling methods and corresponding contrastive learning objectives are implemented to enhance the learning of semantic representations, thereby promoting a thorough comprehension of the relationship between queries and documents. Experimental results demonstrate that DOGR achieves state-of-the-art performance compared to existing generative retrieval methods on two public benchmark datasets. Further experiments have shown that our framework is generally effective for common identifier construction techniques.

Abstract: Authorship models have historically generalized poorly to new domains because of the wide distribution of authoridentifying signals across domains. In particular, the effects of topic and genre are highly domain-dependent and impact authorship analysis performance greatly. This paper addresses the existing data gap in authorship for these resources by introducing CROSSNEWS, a novel cross-genre dataset that connects formal journalistic articles and casual social media posts. CROSSNEWS is the largest authorship dataset of its kind for supporting both verification and attribution tasks, with comprehensive topic and genre annotations. We use CROSSNEWS to demonstrate that current models exhibit poor performance in genre transfer scenarios, underscoring the need for authorship models robust to genre-specific effects. We also explore SELMA, a new LLM embedding approach for large-scale authorship setups that outperforms existing models in both same-genre and cross-genre settings.

Abstract: Mental disorders, such as anxiety and depression, have become a global concern that affects people of all ages. Early detection and treatment are crucial to mitigate the negative effects these disorders can have on daily life. Although AIbased detection methods show promise, progress is hindered by the lack of publicly available large-scale datasets. To address this, we introduce the Multi-Modal Psychological assessment corpus (MMPsy), a large-scale dataset containing audio recordings and transcripts from Mandarin-speaking adolescents undergoing automated anxiety/depression assessment interviews. MMPsy also includes self-reported anxiety/depression evaluations using standardized psychological questionnaires. Leveraging this dataset, we propose Mental-Perceiver, a deep learning model for estimating mental disorders from audio and textual data. Extensive experiments on MMPsy and the DAIC-WOZ dataset demonstrate the effectiveness of Mental-Perceiver in anxiety and depression detection.

Abstract: Textto-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better model alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3, thus the designated external entropy-based selective classifier has better performance. The study also reveal that, in terms of error detection, selective classifier with a higher probability detects errors associated with irrelevant questions rather than incorrect query generations.

Abstract: Parameterefficient fine-tuning (PEFT) methods optimize large language models (LLMs) by modifying or introducing a small number of parameters to enhance alignment with downstream tasks. However, they can result in catastrophic forgetting, where LLMs prioritize new knowledge at the expense of comprehensive world knowledge. A promising approach to mitigate this issue is to recall prior memories based on the original knowledge. To this end, we propose a model-agnostic PEFT framework, IMSM, which Interweaves Memories of a Siamese Large Language Model. Specifically, our siamese LLM is equipped with an existing PEFT method. Given an incoming query, it generates two distinct memories based on the pre-trained and fine-tuned parameters. IMSM then incorporates an interweaving mechanism that regulates the contributions of both original and enhanced memories when generating the next token. This framework is theoretically applicable to all open-source LLMs and existing PEFT methods. We conduct extensive experiments across various benchmark datasets, evaluating the performance of popular open-source LLMs using the proposed IMSM, in comparison to both classical and leading PEFT methods. Our findings indicate that IMSM maintains comparable time and space efficiency to backbone PEFT methods while significantly improving performance and effectively mitigating catastrophic forgetting.

Abstract: Predicting unknown drugdrug interactions (DDIs) is crucial for improving medication safety. Previous efforts in DDI prediction have typically focused on binary classification or predicting DDI categories, with the absence of explanatory insights that could enhance trust in these predictions. In this work, we propose to generate natural language explanations for DDI predictions, enabling the model to reveal the underlying pharmacodynamics and pharmacokinetics mechanisms simultaneously as making the prediction. To do this, we have collected DDI explanations from DDInter and DrugBank and developed various models for extensive experiments and analysis. Our models can provide accurate explanations for unknown DDIs between known drugs. This paper contributes new tools to the field of DDI prediction and lays a solid foundation for further research on generating explanations for DDI predictions.

Abstract: For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformerbased models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.

Abstract: Iterative preference optimization has recently become one of the defacto training paradigms for large language models (LLMs), but the performance is still underwhelming due to too much noisy preference data yielded in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework to make the LLM self-evolve with reliable feedback. The key idea is mitigating the noisy preference pairs derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously reliable feedback sampling. To reach this goal, we thus introduce an estimator model, which incorporates Monte Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty estimation for the batch of preference pairs. Compared to the existing methods that directly filter generated responses based on the reward score, the estimator focuses on the model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we also propose an uncertainty-enhanced self-evolution algorithm to better improve the LLM robustly align with these reliable feedback data. Extensive experiments over multiple benchmarks demonstrate our framework substantially improves the performance of iterative preference optimization.

Abstract: Online psychological counseling dialogue systems are trending, offering a convenient and accessible alternative to traditional inperson therapy. However, existing psychological counseling dialogue systems mainly focus on basic empathetic dialogue or QA with minimal professional knowledge and without goal guidance. In many real-world counseling scenarios, clients often seek multi-type help, such as diagnosis, consultation, therapy, console, and common questions, but existing dialogue systems struggle to combine different dialogue types naturally. In this paper, we identify this challenge as how to construct mixed-type dialogue systems for psychological counseling that enable clients to clarify their goals before proceeding with counseling. To mitigate the challenge, we collect a mixed-type counseling dialogues corpus termed STAMPsy, covering five dialogue types, task-oriented dialogue for diagnosis, knowledge-grounded dialogue, conversational recommendation, empathetic dialogue, and question answering, over 5,000 conversations. Moreover, spatiotemporal-aware knowledge enables systems to have world awareness and has been proven to affect one's mental health. Therefore, we link dialogues in STAMPsy to spatiotemporal state and propose a spatiotemporal-aware mixed-type psychological counseling dataset. Additionally, we build baselines on STAMPsy and develop an iterative self-feedback psychological dialogue generation framework, named Self-STAMPsy. Results indicate that clarifying dialogue goals in advance and utilizing spatiotemporal states are effective.

Abstract: Most work treats large language models as black boxes without an indepth understanding of their internal working mechanism. To explain the internal representations of LLMs, we utilize a gradient-based metric to assess the activation level of model parameters. Based on this metric, we obtain three preliminary findings. (1) When the inputs are in the same domain, parameters in the shallow layers will be activated densely, which means a larger portion of parameters will have great impacts on the outputs. In contrast, parameters in the deep layers are activated sparsely. (2) When the inputs are across different domains, parameters in shallow layers exhibit higher similarity in the activation behavior than in deep layers. (3) In deep layers, the similarity of the distributions of activated parameters is positively correlated to the empirical data relevance. Further, we develop three validation experiments to solidify these findings. (1) Firstly, starting from the first finding, we attempt to configure different sparsities for different layers and find this method can benefit model pruning. (2) Secondly, we find that a pruned model based on one calibration set can better handle tasks related to the calibration task than those not related, which validates the second finding. (3) Thirdly, Based on the STS-B and SICK benchmarks, we find that two sentences with consistent semantics tend to share similar parameter activation patterns in deep layers, which aligns with our third finding. Our work sheds light on the behavior of parameter activation in LLMs, and we hope these findings will have the potential to inspire more practical applications.

Abstract: Multimodal multi-party conversation (MMC) is a less studied yet important topic of research due to that it well fits real-world scenarios and thus potentially has more widely-used applications. Compared with the traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities as there are many interlocutors appearing in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC in this paper, an MMC dataset that contains 24,000+ unique utterances paired with video context. To explore the character-centered understanding of the dialogue, we also annotate the speaker of each utterance, the names and bounding bboxes of faces that appear in the video. Based on this Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have the multi-party nature with the video or image as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline method that leverages an optimization solver to utilize the context of two modalities to achieve better performance. For conversation response prediction, we fine-tune generative dialogue models on Friend-MMC, and analyze the benefits of speaker information. The code and dataset will be publicly available, and thus we call for more attention on modelling speaker information when understanding conversations.

Abstract: Audiovisual Automatic Speech Recognition (AVASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Alibaba Group, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Shenzhen University of Advanced Technology, Alibaba Group, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Alibaba Group

Abstract: Recent advancements in large language models (LLMs) have broadened their application scope but revealed challenges in balancing capabilities across general knowledge, coding, and mathematics. To address this, we introduce a Collaborative and Semantic Experts (CoE) approach for supervised finetuning (SFT), which employs a two-phase training strategy. Initially, expert training fine-tunes the feed-forward network on specialized datasets, developing distinct experts in targeted domains. Subsequently, expert leveraging synthesizes these trained experts into a structured model with semantic guidance to activate specific experts, enhancing performance and interpretability. Evaluations on comprehensive benchmarks across MMLU, HumanEval, GSM8K, MT-Bench, and AlpacaEval confirm CoE's efficacy, demonstrating improved performance and expert collaboration in diverse tasks, significantly outperforming traditional SFT methods.

Abstract: Text readability assessment involves categorizing texts based on readers' comprehension levels. Hybrid automatic readability assessment (ARA) models, combining deep and linguistic features, have recently attracted rising attention due to their impressive performance. However, existing hybrid ARA models generally ignore the specificintrinsic information of deep and linguistic representations, and cannot fully explore their common-intrinsic information. In this paper, we introduce a self-supervised collaborative information bottleneck (SCIB) module for ARA to address these issues. Specifically, we collaboratively consider both specific-intrinsic and common-intrinsic information of the linguistic representation and various levels of deep representations including the document-, sentence- and word-level deep representations, and yield their refined representations via a self-supervised information bottleneck scheme. Extensive experiments are conducted on four English and two Chinese corpora to demonstrate the effectiveness of the proposed model. Experimental results show that the proposed model outperforms state-of-the-art models in terms of four important evaluation metrics, and the suggested SCIB module can effectively capture the specific- and common-intrinsic information.

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS Key Lab of AI Safety of Chinese Academy of Sciences (CAS) University of Chinese Academy of Sciences, CAS, Gaoling School of Artificial Intelligence, Renmin University of China, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS Key Lab of AI Safety of Chinese Academy of Sciences (CAS) University of Chinese Academy of Sciences, CAS, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS Key Lab of AI Safety of Chinese Academy of Sciences (CAS) University of Chinese Academy of Sciences, CAS, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS Key Lab of AI Safety of Chinese Academy of Sciences (CAS) University of Chinese Academy of Sciences, CAS

Abstract: As large language models (LLMs) are widely deployed across various domains, the ability to control their generated outputs has become more critical. This control involves aligning LLMs outputs with human values and ethical principles or customizing LLMs on specific topics or styles for individual users. Existing controlled generation methods either require significant computational resources and extensive trialand-error or provide coarse-grained control. In this paper, we propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework that ensures accurate control without requiring resource-extensive fine-tuning. Specifically, GCAV first trains a concept activation vector for specified concepts to be controlled, such as toxicity. During inference, GCAV steers the concept vector in LLMs, for example, by removing the toxicity concept vector from the activation layers. Control experiments from different perspectives, including toxicity reduction, sentiment control, linguistic style, and topic control, demonstrate that our framework achieves state-of-the-art performance with granular control, allowing for fine-grained adjustments of both the steering layers and the steering magnitudes for individual samples.

Abstract: The emergence of longcontext text applications utilizing large language models (LLMs) has presented significant scalability challenges, particularly in memory footprint. The linear growth of the Key-Value (KV) cache, which stores attention keys and values to reduce redundant computations, can significantly increase memory usage and may prevent models from functioning properly in memory-constrained environments. To address this issue, we propose a novel approach called Cache Sparse Representation (CSR), which converts the KV cache by transforming the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural network-based method to automatically generate the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR matches the performance of state-of-the-art KV cache quantization algorithms while ensuring robust functionality in memory-constrained environments.

Abstract: Document Information Extraction (DIE) aims to extract structured information from Visually Rich Documents (VRDs). Previous fulltraining approaches have demonstrated strong performance but may struggle with generalization to unseen data. In contrast, training-free methods leverage powerful pre-trained models like Large Language Models (LLMs) to address various downstream tasks with only a few examples. Nonetheless, training-free methods for DIE encounter two primary challenges: (1) understanding the complex relationship between layout and textual elements in VRDs, and (2) providing accurate guidance to pre-trained models. To address these challenges, we propose SAmple-centric In-context Learning (SAIL). SAIL introduces a fine-grained entity-level textual similarity to facilitate in-depth text analysis by LLMs and incorporates layout similarity to enhance the analysis of layouts in VRDs. Moreover, SAIL formulates a unified In-Context Learning (ICL) prompt template for various sample-centric examples, enabling tailored prompts that deliver precise guidance to pre-trained models for each sample. Extensive experiments on FUNSD, CORD, and SROIE benchmarks with various base models (e.g., LLMs) indicate that our SAIL outperforms training-free baselines, even closer to the full-training methods, showing the superiority and generalization of our method.

Abstract: We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is wellestablished in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications such as detecting misleading information in images, enhancing visual question answering, and refining decision-making processes in autonomous systems. Existing metrics do not adequately capture the change in the entailment relationship brought by updates. To address this, we propose a novel inference-aware evaluator designed to capture changes in entailment strength induced by updates, using pairwise contrastive learning and categorical information learning. Additionally, we introduce a reward-driven update optimization method to further enhance the quality of updates generated by multimodal models. Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method.

College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University

Abstract: Multimodal Relation Extraction (MRE) aims to predict relations between head and tail entities based on the context of sentenceimage pairs. Most existing MRE methods progressively incorporate textual and visual inputs to dominate the learning process, assuming both contribute significantly to the task. However, the diverse visual appearances and text with ambiguous semantics contain less-informative contexts for the corresponding relation. To tackle these challenges, we highlight the importance of semantically invariant entity attributes that encompass fine-grained categories. Towards this, we propose a novel Prototype-Guided Multimodal Relation Extraction (PG-MRE) framework based on Entity Attributes. Specifically, we first generate detailed entity explanations using Large Language Models (LLMs) to supplement the attribute semantics. Then, the Attribute Prototype Module (APM) refines attribute categories and condenses scattered entity attribute features into cluster-level prototypes. Furthermore, prototype-aligned attribute features guide diverse visual appearance features to produce compact and distinctive multimodal representations in the Relation Prototype Module (RPM). Extensive experiments demonstrate that our method gains superior relation classification capability (especially in scenarios involving various unseen entities), achieving new state-of-the-art performances on MNRE dataset.

Abstract: Recently, both closedsource and open-source LLMs have made significant strides, outperforming humans in various general domains. However, their performance in specific professional domains such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. In this paper, we propose CareBot, a bilingual medical LLM, which leverages a comprehensive approach integrating continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Our novel two-stage CPT method, comprising Stable CPT and Boost CPT, effectively bridges the gap between general and domain-specific data, facilitating a smooth transition from pre-training to fine-tuning and enhancing domain knowledge progressively. We also introduce DataRater, a model designed to assess data quality during CPT, ensuring that the training data is both accurate and relevant. For SFT, we develope a large and diverse bilingual dataset, along with ConFilter, a metric to enhance multi-turn dialogue quality, which is crucial to improving the model's ability to handle more complex dialogues. The combination of high-quality data sources and innovative techniques significantly improves CareBot's performance across a range of medical applications. Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain.

Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Lingxin AI Northwest Minzu University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University, Lingxin AI Beijing Normal University, Lingxin AI Tsinghua University, Tsinghua University, Lingxin AI Guangdong University of Finance & Economics, Fuxi AI Lab, Netease, Fuxi AI Lab, Netease, Fuxi AI Lab, Netease, Fuxi AI Lab, Netease, Tsinghua University The CoAI Group, DCST, Tsinghua University, Tsinghua University, Tsinghua University The CoAI Group, DCST, Tsinghua University

Abstract: Characterbased dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs’ character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters’ responses related to specific dimensions. Further, we develop CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark’s potential to optimize LLMs’ character customization.

Abstract: Human feedback in generative systems is a highly active frontier of research that aims to improve the quality of generated content and align it with subjective preferences. Existing efforts predominantly focus on textonly large language models (LLMs) or text-based image generation, while cross-modal generation between audio and text remains largely unexplored. Moreover, there is currently no open-source preference dataset to support the deployment of alignment algorithms in this domain. In this work, we take audio speech translation (AST) and audio captioning (AAC) tasks as examples to explore how to enhance the performance of mainstream audio-based text generation models with limited human annotation. Specifically, we propose an novel framework named IPO that includes a model adversarial sampling concept--human annotators act as referees to determine model outcomes, using these results as pseudo-labels for the corresponding beam search hypotheses. Given these imbalance win-loss results, IPO effectively enable the two models to update interactively to win the next round of adversarial sampling. We conduct both subjective and objective evaluations to demonstrate the alignment benefits of IPO and its enhancement on model perception and generation capacities. On both AAC and AST, a few hundreds of annotations significantly enhance the weak model, and the strong model can also be encouraged to achieve new state-of-the-art results in terms of different objective metrics. Additionally, we show the extensibility of IPO by applying it to the reverse task of text-to-speech generation, improving the robustness of system on unseen reference speaker.

Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, The Hong Kong University of Science and Technology, Hong Kong, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, The Hong Kong University of Science and Technology, Hong Kong, China

Abstract: Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domainspecific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements.

Abstract: Despite the impressive success of textto-image (TTI) models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by TTI models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (ρ=0.95) with human judgments. Our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI models.

Abstract: This paper considers the problem of fair probabilistic binary classification with binary protected groups. The classifier assigns scores, and a practitioner predicts labels using a certain cutoff threshold based on the desired trade-off between false positives vs. false negatives. It derives these thresholds from the ROC of the classifier. The resultant classifier may be unfair to one of the two protected groups in the dataset. It is desirable that no matter what threshold the practitioner uses, the classifier should be fair to both the protected groups; that is, the ℒₚ norm between FPRs and TPRs of both the protected groups should be at most ε. We call such fairness on ROCs of both the protected attributes εₚ-Equalized ROC. Given a classifier not satisfying ε₁-Equalized ROC, we aim to design a post-processing method to transform the given (potentially unfair) classifier's output (score) to a suitable randomized yet fair classifier. That is, the resultant classifier must satisfy ε₁-Equalized ROC. First, we introduce a threshold query model on the ROC curves for each protected group. The resulting classifier is bound to face a reduction in AUC. With the proposed query model, we provide a rigorous theoretical analysis of the minimal AUC loss to achieve ε₁-Equalized ROC. To achieve this, we design a linear time algorithm, namely FROC, to transform a given classifier's output to a probabilistic classifier that satisfies ε₁-Equalized ROC. We prove that under certain theoretical conditions, FROC achieves the theoretical optimal guarantees. We also study the performance of our FROC on multiple real-world datasets with many trained classifiers.

Abstract: Traffic prediction is pivotal in intelligent transportation systems. Existing works focus mainly on improving overall accuracy, overlooking a crucial problem of whether prediction results will lead to biased decisions by transportation authorities. In practice, the uneven deployment of traffic sensors in different urban areas produces imbalanced data, making the traffic prediction model fail in some urban areas and leading to unfair regional decisionmaking that eventually severely affects equity and quality of residents’ life. Existing fairness machine learning models struggle to maintain fair traffic prediction over prolonged periods. Although these models might achieve fairness at certain time slots, this static fairness will break down as traffic conditions change. To fill this research gap, we investigate prolonged fair traffic prediction, introducing two novel fairness metrics, i.e., region-based static fairness and sensor-based dynamic fairness, tailored to fairness fluctuations over time and across areas. An innovative prolonged fairness traffic prediction framework, namely FairTP, is then proposed. FairTP achieves prolonged fairness by alternating between “sacrifice” and “benefit” the prediction accuracy of each traffic sensor or area, ensuring that the number of these two actions are balanced over time. Specifically, FairTP incorporates a state identification module to discriminate whether the traffic sensors or areas are in a “sacrifice” or “benefit” state, thereby enabling prolonged fairness-aware traffic predictions. Additionally, we devise a state-guided balanced sampling strategy to select training examples to further enhance prediction fairness by mitigating the performance disparities among areas with uneven sensor distribution over time. Extensive experiments in two real-world datasets show that FairTP significantly improves prediction fairness without causing significant accuracy degradation.

Abstract: Learningbased methods provide a promising approach to solving highly non-linear control tasks that are often challenging for classical control methods. To ensure the satisfaction of a safety property, learning-based methods jointly learn a control policy together with a certificate function for the property. Popular examples include barrier functions for safety and Lyapunov functions for asymptotic stability. While there has been significant progress on learning-based control with certificate functions in the white-box setting, where the correctness of the certificate function can be formally verified, there has been little work on ensuring their reliability in the black-box setting where the system dynamics are unknown. In this work, we consider the problems of certifying and repairing neural network control policies and certificate functions in the black-box setting. We propose a novel framework that utilizes runtime monitoring to detect system behaviors that violate the property of interest under some initially trained neural network policy and certificate. These violating behaviors are used to extract new training data, that is used to re-train the neural network policy and the certificate function and to ultimately repair them. We demonstrate the effectiveness of our approach empirically by using it to repair and to boost the safety rate of neural network policies learned by a state-of-the-art method for learning-based control on two autonomous system control tasks.

Abstract: We design sensitivity oracles for errorprone networks. For a network problem Π, the data structure preprocesses a network G=(V,E) and sensitivity parameter f such that, for any set F of up to f link or node failures, it can report the solution of Π in G-F. We study three network problems Π. - L-Hop Shortest Path: Given s,t in V, is there a shortest s-t-path in G-F with at most L links? - k-Path: Does G-F contain a simple path with k links? - k-Clique: Does G-F contain a clique of k nodes? Our main technical contribution is a new construction of (L,f)-replacement path coverings ((L,f)-RPC) in the parameter realm where f = o(log L). An (L,f)-RPC is a family G' of subnetworks of G which, for every set F of at most f links, has a subfamily G'_F such that (i) no subnetwork in G'_F contains a link of F and (ii) for each s,t in V, if G-F contains a shortest s-t-path with at most L links, then some subnetwork in G'_F retains at least one such path. Our (L,f)-RPC has almost the same size as the one by Weimann and Yuster (2013) but it improves the time to query G'_F from Õ(f^2 L^f) to Õ(f^(5/2) L^o(1)). It also improves over the size and query time of the (L,f)-RPC by Karthik and Parter (2021) by nearly a factor of L. From this construction, we derive oracles for L-Hop Shortest Path, k-Path, and k-Clique. Notably, our solution for k-Path improves the query time of the one by Bilò for f=o(log k).

Abstract: It is well known that numeric planning can be made decidable if the domain of all numeric state variables is finite. This bounded formulation can be polynomially compiled into classical planning with Boolean conditions and conditional effects preserving the plan size exactly. However, it remains unclear whether this compilation has any practical utility. To explore this aspect, this work revisits the theoretical compilation framework from a practical perspective, focusing on the fragment of simple numeric planning. Specifically, we introduce three different compilations. The first, called onehot, aims to systematise the current practice among planning practitioners of modelling numeric planning through classical planning. The other two, termed binary compilations, extend and specialise the logarithmic encoding introduced in previous literature. Our experimental analysis reveals that the overly complex logarithmic encoding can, surprisingly, be made practical with some representational expedients. Among these, the use of axioms is particularly crucial. Furthermore, we identify a class of mildly numeric planning problems where a classical planner, i.e., LAMA, when run on the compiled problem, is highly competitive with state-of-the-art numeric planners.

Abstract: One of the major techniques to tackle temporal planning problems is heuristic search augmented with a symbolic representation of time in the states. Augmenting the problem with composite actions (macroactions) is a simple and powerful approach to create "shortcuts" in the search space, at the cost of augmenting the branching factor of the problem and thus the expansion time of a heuristic search planner. Hence, it is of paramount importance to select the right macro-actions and minimize the number of such actions to optimize the planner performance. In this paper, we first discuss a simple, yet powerful, model similar to macro-actions for the case of temporal planning, and we call these macro-events. Then, we present a novel ranking function to extract and select a suitable set of macro-events from a dataset of valid plans. In our ranking approach, we consider an estimation of the hypothetical search space for a blind search including a candidate set of macro-events under four different exploitation schemata. Finally, we experimentally demonstrate that the proposed approach yields a substantial performance improvement for a state-of-the-art temporal planner.

Abstract: Markov decision processes (MDP) are a wellestablished model for sequential decision-making in the presence of probabilities. In *robust* MDP (RMDP), every action is associated with an *uncertainty set* of probability distributions, modelling that transition probabilities are not known precisely. Based on the known theoretical connection to stochastic games, we provide a framework for solving RMDPs that is generic, reliable, and efficient. It is *generic* both with respect to the model, allowing for a wide range of uncertainty sets, including but not limited to intervals, L1- or L2-balls, and polytopes; and with respect to the objective, including long-run average reward, undiscounted total reward, and stochastic shortest path. It is *reliable*, as our approach not only converges in the limit, but provides precision guarantees at any time during the computation. It is *efficient* because -- in contrast to state-of-the-art approaches -- it avoids explicitly constructing the underlying stochastic game. Consequently, our prototype implementation outperforms existing tools by several orders of magnitude and can solve RMDPs with a million states in under a minute.

Abstract: The Traveling Tournament Problem (TTPk) is a well-known benchmark problem in tournament timetabling. It involves designing a feasible double round-robin tournament for a sports league of n teams under several feasibility requirements, while minimizing the total traveling costs of the teams. The parameter k requires that in the tournament at most k consecutive home games or away games for each team are allowed. TTP-k with a small k, especially for k=2,3 and 4, have been extensively studied in the literature. In this paper, we focus on TTP-4 and design an efficient algorithm for it based on minimum weight matching. In theory, we prove that our algorithm has an approximation ratio of 1.625+ε for any constant ε>0, improving the best-known approximation ratio of 1.7+ε. In practice, our experimental results indicate an average improvement of 6.65% over the best-known solutions on 9 benchmark instances.

Abstract: Understanding causal relationships among the variables of a system is paramount to explain and control its behavior. For many realworld systems, however, the true causal graph is not readily available and one must resort to predictions made by algorithms or domain experts. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an absolute number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a baseline through node permutations. By comparing the number of inconsistencies with those on the baseline, we derive an interpretable metric that captures whether the graph is significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.

Abstract: In decisionmaking problems under uncertainty, predicting unknown parameters is often considered independent of the optimization part. Decision-focused learning (DFL) is a task-oriented framework that integrates prediction and optimization by adapting the predictive model to give better decisions for the corresponding task. Here, an inevitable challenge arises when computing the gradients of the optimal decision with respect to the parameters. Existing research copes with this issue by smoothly reforming surrogate optimization or constructing surrogate loss functions that mimic task loss. However, they are applied to restricted optimization domains. In this paper, we propose Locally Convex Global Loss Network (LCGLN), a global surrogate loss model that can be implemented in a general DFL paradigm. LCGLN learns task loss via a partial input convex neural network which is guaranteed to be convex for chosen inputs while keeping the non-convex global structure for the other inputs. This enables LCGLN to admit general DFL through only a single surrogate loss without any sense for choosing appropriate parametric forms. We confirm the effectiveness and flexibility of LCGLN by evaluating our proposed model with three stochastic decision-making problems.

Abstract: The classic Resource Constrained Shortest Path (RCSP) problem aims to find a cost optimal path between a pair of nodes in a network such that the resources used in the path are within a given limit. Having been studied for over a decade, RCSP has seen recent solutions that utilize heuristicguided search to solve the constrained problem faster. Building upon the bidirectional A* search paradigm, this paper introduces a novel constrained search framework that uses efficient pruning strategies to allow for accelerated and effective RCSP search in large-scale networks. Results show that, compared to the state of the art, our enhanced framework can significantly reduce the constrained search time, achieving speed-ups of over to two orders of magnitude.

Abstract: We consider the Online Facility Location (OFL) problem in the framework of learningaugmented online algorithms. In Online Facility Location (OFL), demands arrive one-by-one in a metric space and must be (irrevocably) assigned to an open facility upon arrival, without any knowledge about future demands. We focus on uniform facility opening costs and present an online algorithm for OFL that exploits potentially imperfect predictions on the locations of the optimal facilities. We prove that the competitive ratio decreases from sublogarithmic in the number n of demands to constant as the so-called η1 error, i.e., the sum of distances of the predicted locations to the optimal facility locations, decreases towards zero. E.g., our analysis implies that if for some ε > 0, η1 = OPT / n^ε, where OPT is the cost of the optimal solution, the competitive ratio is O(1/ε). We complement our analysis with a matching lower bound establishing that the dependence of the algorithm's competitive ratio on the η1 error is optimal, up to constant factors.

Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University, School of Computer Science, Wuhan University, Department of Statistics and Data Science, Southern University of Science and Technology Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University

Abstract: Multiobjective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm (CDM-PSL) for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion model for generating high-quality samples efficiently. Besides, we introduce a weighting method based on information entropy to balance different objectives. This method is integrated with a guiding strategy to appropriately balancing different objectives during the optimization process. Experimental results on both synthetic and real-world problems demonstrates that CDM-PSL attains superior performance compared with state-of-the-art MOBO algorithms.

Abstract: This paper addresses theory in evolutionary multiobjective optimisation (EMO) and focuses on the role of crossover operators in manyobjective optimisation. The advantages of using crossover are hardly understood and rigorous runtime analyses with crossover are lagging far behind its use in practice, specifically in the case of more than two objectives. We present a many-objective problem class together with a theoretical runtime analysis of the widely used NSGA-III to demonstrate that crossover can yield an exponential speedup on the runtime. In particular, this algorithm can find the Pareto set in expected polynomial time when using crossover while without crossover it requires exponential time to even find a single Pareto-optimal point. To our knowledge, this is the first rigorous runtime analysis in many-objective optimisation demonstrating an exponential performance gap when using crossover for more than two objectives.

Abstract: Heuristics are commonly used to tackle various search and optimization problems. Design heuristics usually require tedious manual crafting with domain knowledge. Recent works have incorporated Large Language Models (LLMs) into automatic heuristic search, leveraging their powerful language and coding capacity. However, existing research focuses on the optimal performance on the target problem as the sole objective, neglecting other criteria such as efficiency and scalability, which are vital in practice. To tackle this challenge, we propose to model the heuristic search as a multiobjective optimization problem and consider introducing additional practical criteria beyond optimal performance. Due to the complexity of the search space, conventional multi-objective optimization methods struggle to effectively handle LLM-based multi-objective heuristic search. We propose the first LLM-based multi-objective heuristic search framework, Multi-objective Evolution of Heuristic (MEoH), which integrates LLMs in a zero-shot manner to generate a non-dominated set of heuristics to meet multiple design criteria. We design a new dominance-dissimilarity mechanism for effective population management and selection, which incorporates both code dissimilarity in the search space and dominance in the objective space. MEoH is demonstrated in two well-known combinatorial optimization problems: the online Bin Packing Problem (BPP) and the Traveling Salesman Problem (TSP). The results indicate that a variety of elite heuristics are automatically generated in a single run, offering more trade-off options than the existing methods. It successfully achieves competitive or superior performance while improving efficiency up to 10 times. Moreover, we also observe that the multi-objective search introduces novel insights into heuristic design and leads to the discovery of diverse heuristics.

Abstract: A key aspect of alignment is the proper use of withindocument evidence to construct document-level decisions. We analyze the relationship between the retrieval and interpretation of within-document evidence for large language model in a few-shot setting. Specifically, we measure the extent to which model prediction errors are associated with evidence retrieval errors with respect to gold-standard human-annotated extractive evidence for five datasets, using two popular closed proprietary models. We perform two ablation studies to investigate when both label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find that there is a strong empirical relationship between model prediction and evidence retrieval error, but that evidence retrieval error is mostly not associated with evidence interpretation error--a hopeful sign for downstream applications built on this mechanism.

Abstract: With the rising use of neural networks across various application domains, it becomes increasingly important to ensure that they do not exhibit dangerous or undesired behaviour. In light of this, several neural network robustness verification algorithms have been developed, among which methods based on Branch and Bound (BaB) constitute the current state of the art. However, these algorithms still require immense computational resources. In this work, we seek to reduce this cost by leveraging running time prediction techniques, thereby allowing for more efficient resource allocation and use. Towards this end, we present a novel method that dynamically predicts whether a verification instance can be solved in the remaining time budget available to the verification algorithm. We introduce features describing BaBbased verification instances and use these to construct running time, and more specifically, timeout prediction models. We leverage these models to terminate runs on instances early in the verification process that would otherwise result in a timeout. Overall, using our method, we were able to reduce the total running time by 64% on average compared to the standard verification procedure, while certifying a comparable number of instances.

Abstract: Large language models (LLMs) have significantly advanced the field of automated code generation. However, a notable research gap exists in evaluating social biases that may be present in the code produced by LLMs. To solve this issue, we propose a novel fairness framework, i.e., Solar, to assess and mitigate the social biases of LLMgenerated code. Specifically, Solar can automatically generate test cases for quantitatively uncovering social biases of the auto-generated code by LLMs. To quantify the severity of social biases in generated code, we develop a dataset that covers a diverse set of social problems. We applied Solar and the crafted dataset to four state-of-the-art LLMs for code generation. Our evaluation reveals severe bias in the LLM-generated code from all the subject LLMs. Furthermore, we explore several prompting strategies for mitigating bias, including Chain-of-Thought (CoT) prompting, combining positive role-playing with CoT prompting and dialogue with Solar. Our experiments show that dialogue with Solar can effectively reduce social bias in LLM-generated code by up to 90%. Last, we make the code and data publicly available is highly extensible to evaluate new social problems.

Institute for AI, Peking University State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University, Institute for AI, Peking University State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University, Institute for AI, Peking University State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University, Institute for AI, Peking University State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University

Abstract: The rapid advancement of large language models (LLMs) has led to significant improvements in their capabilities, but also to increased concerns about their alignment with human values and intentions. Current alignment strategies, including adaptive training and inferencetime methods, have demonstrated potential in this area. However, these approaches still struggle to balance deployment complexity and capability across various tasks and difficulties. In this work, we introduce the Streaming Distribution Induce Aligner (Stream Aligner), a novel alignment paradigm that combines efficiency with enhanced performance in various tasks throughout the generation process. Stream Aligner achieves dynamic sentence-level correction by using a small model to learn the preferences of the suffix sentence, iteratively correcting the suffix sentence output by the upstream model, and then using the corrected sentence to replace the suffix sentence in subsequent generations. Compared to Aligner, our experiments demonstrate that Stream Aligner reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction. Specifically, Stream Aligner-2B model has achieved an improvement of 76.1% in helpfulness, 36.0% in harmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B has achieved an improvement of 3.5% on the math ability of the tested Llama3-70B-Instruct model.

Authors:Adrian de Wynter, Ishaan Watts, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Nektar Ege Altıntoprak, Lena Baur, Samantha Claudet, Pavel Gajdušek, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Ivan Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie F. Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

Abstract: Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTPLX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when scoring holistically the toxicity of a prompt; and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.

Huazhong University of Science and Technology, Huazhong University of Science and Technology, The University of Sydney, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, University of Sydney, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, University of Sydney, Huazhong University of Science and Technology

Abstract: Breast cancer remains a leading cause of mortality among women, with millions of new cases diagnosed annually. Early detection through screening is crucial. Using neural networks to improve the accuracy of breast cancer screening has become increasingly important. In accordance with radiologists' practices, we proposed using images from the unaffected side to create adversarial samples with critical medical implications in our adversarial learning process. By introducing beneficial perturbations, this method aims to reduce overconfidence and improve the precision and robustness of breast cancer classification. Our proposed framework is an adversarial quadrupleview classification network (NaFV-Net) incorporating images from both affected and unaffected perspectives. By comprehensively capturing local and global information and implementing adversarial learning from four mammography views, this framework allows for the fusion of features and the integration of medical principles and radiologist evaluation techniques, thus facilitating the accurate identification and characterization of breast tissues. Extensive experiments have shown the high effectiveness of our model in accurately distinguishing between benign and malignant findings, demonstrating state-of-the-art classification performance on both internal and public datasets.

Abstract: As artificial intelligence techniques evolve, we are approaching a critical moment for the widespread deployment of autonomous vehicles. Subsequently, the emergence of mixedautonomy traffic environments presents formidable challenges to autonomous vehicles, especially for the accurate prediction of lane change intentions of their surrounding human-driven vehicles, which is crucial for ensuring the safety of autonomous vehicles. Existing lane change prediction models mainly focus on capturing the temporal variations in the movement dynamics of individual vehicles. However, the neglect to consider inter-vehicle interactions hinders their capability in complex lane change scenarios, resulting in suboptimal prediction performance. Moreover, current interaction-aware approaches for autonomous driving fail to explicitly model future interactions between vehicles, leading to unreasonable prediction results that can cause collisions between vehicles. To address the above issues, we propose to incorporate the concept of perceived safety into future interaction modeling and design a dual-view interaction-aware lane change prediction model. We evaluate the proposed model on two real-world datasets and experimental results show that the proposed model achieves average improvements of 11.7-12.4% in classification ability and 75.6-95.7% in forecast ability over the best-performing baselines across the two datasets. The ablation study and investigation into future interaction modeling demonstrate that our model has advantages in interpreting lane change scenarios from a driving safety perspective.

Abstract: Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses. Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of realworld patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. Here, we construct Clinically Uncommon Patient Cases and Diagnosis Dataset (CUPCase) based on 3,563 real-world case reports from BMC, which we formulate into diagnoses in open-ended textual format and as multiple-choice options with distractors. Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and Clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available. Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large. Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in multiple-choice and free text, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. An error analysis demonstrates the complexity of the task, and attempts to hypothesise about the models' reasoning. CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.

Abstract: Tenant evictions threaten housing stability and are a major concern for many cities. An open question concerns whether datadriven methods enhance outreach programs that target at-risk tenants to mitigate their risk of eviction. We propose a novel active geospatial search (AGS) modeling framework for this problem. AGS integrates property-level information in a search policy that identifies a sequence of rental units to canvas to both determine their eviction risk and provide support if needed. We propose a hierarchical reinforcement learning approach to learn a search policy for AGS that scales to large urban areas containing thousands of parcels, balancing exploration and exploitation and accounting for travel costs and a budget constraint. Crucially, the search policy adapts online to newly discovered information about evictions. Evaluation using eviction data for a large urban area demonstrates that the proposed framework and algorithmic approach are considerably more effective at sequentially identifying eviction cases than baseline methods.

Abstract: In coastal river systems, floods, often during major storms or king tides, severely threaten lives and property. However, hydraulic structures such as dams, gates, pumps, and reservoirs exist in these river systems, and these floods can be mitigated or even prevented by strategically releasing water before extreme weather events. A standard approach used by local water management agencies is the “rulebased” method, which specifies predetermined water prereleases based on historical human experience, but which tends to result in excessive or inadequate water release. Iterative optimization methods that rely on detailed physics-based models for prediction are an alternative approach. Whereas, such methods tend to be computationally intensive, requiring hours or even days to solve the problem optimally. In this paper, we propose a Forecast Informed Deep Learning Architecture, FIDLAR, to achieve rapid and near-optimal flood management with precise water prereleases. FIDLAR seamlessly integrates two neural network modules: one called the Flood Manager, which is responsible for generating water pre-release schedules, and another called the Flood Evaluator, which evaluates those generated schedules. The Evaluator module is pre-trained separately, and its gradient-based feedback is utilized to train the Manager model, ensuring near-optimal water pre-releases. We have conducted experiments with a flood-prone coastal area in South Florida. Results show that FIDLAR is several orders of magnitude faster than currently used physics-based approaches while outperforming baseline methods with improved water pre-release schedules.

Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, Max Planck Institute for Security and Privacy (MPI-SP), Bochum, Germany, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, Max Planck Institute for Security and Privacy (MPI-SP), Bochum, Germany Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Abstract: Recent studies on the urban heat island phenomenon reveal how rapid urbanization intensifies temperature disparities in urban cores, highlighting the need for sustainable urban planning solutions. Analyzing the problems caused by these effects requires highresolution climate data; however, physical weather stations often lack sufficient regional coverage and resolution. Proposals for alternative methods have attempted to bridge this gap, but they fall short in capturing regional characteristics adequately or necessitate obtaining difficult-to-get input data. This research proposes to use satellite data, where the visual spectrum provides rich information about the degree of human development and is easy to obtain, to measure urban air temperature. Our model, UrbanHeat, uses multi-resolution satellite imagery and employs land surface temperature and global climate data as proxy labels to predict air temperature at a granular scale. The results show that the model provides predictions at a much finer scale while showing superior performance in measuring ordinal relationships between points by capturing both local and broad land cover details of the region. Our case studies demonstrate how predictions at high resolution can help protect vulnerable populations from extreme heat (e.g., elders or developing countries) and contribute to sustainable urban development worldwide.

Abstract: Blackbox optimization (BBO) has become increasingly relevant for tackling complex decision-making problems, especially in public policy domains such as police redistricting. However, its broader application in public policymaking is hindered by the complexity of defining feasible regions and the high-dimensionality of decisions. This paper introduces a novel BBO framework, termed as the Conditional And Generative Black-box Optimization (CageBO). This approach leverages a conditional variational autoencoder to learn the distribution of feasible decisions, enabling a two-way mapping between the original decision space and a simplified, constraint-free latent space. The CageBO efficiently handles the implicit constraints often found in public policy applications, allowing for optimization in the latent space while evaluating objectives in the original space. We validate our method through a case study on large-scale police redistricting problems in Atlanta, Georgia. Our results reveal that our CageBO offers notable improvements in performance and efficiency compared to the baselines.

Abstract: Local governments around the world are making consequential decisions on behalf of their constituents, and these constituents are responding with requests, advice, and assessments of their officials at public meetings. So many small meetings cannot be covered by traditional newsrooms at scale. We propose PublicSpeak, a probabilistic framework which can utilize meeting structure, domain knowledge, and linguistic information to discover public remarks in local government meetings. We then use our approach to inspect the issues raised by constituents in 7 cities across the United States. We evaluate our approach on a novel dataset of local government meetings and find that PublicSpeak improves over stateof-the-art by 10% on average, and by up to 40%.

Abstract: Longsequence precipitation forecasting is critical for both meteorological science and smart city applications. The primary objective of this task is to predict future radar echo sequences, which provide high resolution and timely references for atmospheric precipitation distribution based on current observations. However, the chaotic nature of precipitation systems poses significant challenges in extending reliable forecast horizons. Most existing methods struggle with accuracy and clarity when extended to long-sequence predictions, such as three-hour forecasts. This is primarily due to the insufficiency of spatio-temporal information within a single modality over time. In this paper, we propose a cascading forecasting framework that adaptively extracts and integrates multimodal spatio-temporal information to support accurate and realistic long-sequence radar forecasting. Our framework includes a temporal adaptive predictor and a flow-based precipitation distribution adaptor. The predictor utilizes a multi-branch encoder-decoder architecture. This design allows it to extract meteorological sequences from multiple sources at varying scales, resulting in an initial global precipitation estimate. The core component is a carefully designed cross-attention module with a temporal adaptive layer to enhance multi-modality alignment. The initial estimate is then refined by the flow-based adaptor, which adjusts the prediction to match the target precipitation distribution, enhancing local details and correcting extreme precipitation patterns. We validated our method using real multi-source dataset for long-sequence forecasting, and the experimental results demonstrate that our approach outperforms existing state-of-the-art methods.

Abstract: Robots are increasingly being used in different application domains due to rapid advancements in hardware and computational methods. However, state of the art methods for many problems in robotics are based on deep networks and similar datadriven models. These methods and models are resource-hungry and opaque, and they are known to provide arbitrary decisions in previously unseen situations, whereas practical robot application domains require transparent, multi-step, multi-level decision-making and ad hoc collaboration under resource constraints and open world uncertainty. In this talk, I argue that for widespread use of robots, we need to revisit principles such as refinement and adaptive satisficing, which can be traced back to the early pioneers of AI. We also need to make these principles the foundation of the architectures we develop for robots, with modern data-driven methods being just another tool in our toolbox. I then illustrate the potential benefits of this approach in the context of fundamental problems in robotics such as visual scene understanding, planning, changing-contact manipulation, and multiagent/human-agent collaboration.

Abstract: Artificial intelligence (AI) has made substantial impacts in numerous fields, including education. Within education, learning and assessment are two key areas. Although many AI techniques have been applied to improve teaching and learning, their potential in educational assessment remains underexplored. This paper explores the intersection of AI and educational assessment and presents a rich landscape of challenges and opportunities, especially in the context of trustworthy AI, including fairness, transparency, accountability, explainability, and robustness. We will begin by outlining the foundations of trustworthy AI and educational assessment. Next, we will delve into the application of trustworthy AI for various assessment tasks, such as test item generation, test design, and automated scoring. In addition, the talk will also discuss how insights from educational measurement theory, such as item response theory (IRT) and validity frameworks, can inform the development and evaluation of trustworthy AI models. These frameworks help ensure that AI systems in education are not only accurate, but also equitable and aligned with educational goals. Finally, we will highlight future research directions, focusing on the integration of ethical AI principles into educational technology and the need for interdisciplinary collaboration to tackle the emerging challenges in this field. The aim is to foster a new generation of AIpowered educational tools that are both innovative and trustworthy, ultimately contributing to a more equitable and more effective educational landscape.

Abstract: The concern that Artificial Intelligence (AI) and Machine Learning (ML) are entering a "reproducibility crisis" has spurred significant research in the past few years. Yet with each paper, it is often unclear what someone means by "reproducibility". Our work attempts to clarify the scope of "reproducibility" as displayed by the community at large. In doing so, we propose to refine the research to eight general topic areas. In this light, we see that each of these areas contains many works that do not advertise themselves as being about "reproducibility", in part because they go back decades before the matter came to broader attention.

Abstract: Artificial intelligence has made remarkable progress in reasoning over complex, structured, multimodal, and multilingual data, addressing critical challenges in domains such as finance and healthcare. This abstract underscores key advancements in tabular reasoning, temporal analysis, and structured multimodal reasoning. Key contributions include the development of TempTabQA, a benchmark for temporal question answering, along with novel methods for enhancing temporal reasoning in large language models (LLMs). Additionally, a framework for evaluating mathematical reasoning in financial documents has been introduced, establishing robust techniques for interpreting timesensitive and quantitative data. Building on these foundations, we have developed hybrid SQL-text adaptive reasoning models (H-STAR) and knowledge-aware reasoning techniques for semi-structured tables (MMTabQA), enabling precise and efficient handling of complex queries. In the vision-language domain, our contributions include advancements in spatial reasoning for geographic data (MAPWise), methods to improve robustness in chart interpretation (FlowVQA), and evaluations of LLMs’ ability to understand visual data, such as charts. Furthermore, we have addressed challenges in multilingual and cross-modal robustness through innovations such as multilingual table synchronization (InfoSync), concurrent robustness evaluations across languages and modalities, and numerical reasoning in tabular data. Our work aims to enhance reasoning on dynamically evolving data using hybrid LLM-SQL queries, symbolic query generation, and multi-table retrieval techniques. We also plan to tackle challenges in interpreting hierarchical table structures, analyzing multiple complex chart types, and exploring diverse map types, while advancing real-world multimodal data analysis. Additionally, we plan to improve table generation in both closed/open-book scenarios and refine evaluation frameworks for structured tasks. These advancements demonstrate the potential of AI in tackling complex, multimodal data and delivering impactful real-world solutions.

Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for sequential decisionmaking with numerous real-world applications. However, in practical environments such as recommender systems, search engines, and LLMs, RL algorithms must efficiently learn from biased human feedback that may be subject to corruption. In this talk, I will present our recent efforts in developing robust RL algorithms that can provably effectively handle such challenging scenarios. First, I will introduce our works on reinforcement learning from biased click feedback in ranking. While previous approaches typically relied on strong assumptions about human click behavior (formalized as click models) and required specialized debiasing methods for different models, we propose a novel unified framework that formulates the ranking process under general click models as a Markov Decision Process, enabling the development of a click model-agnostic RL algorithm. Second, I will introduce the fundamental vulnerability of bandits and reinforcement learning under corrupted feedback. Our theoretical analysis provides complete necessity and sufficiency characterizations of the attackability of linear bandits and linear RL, revealing their intrinsic robustness and limitations. Lastly, I will discuss our recent works on improving RL finetuning for LLMs, including sample efficient off-policy RLHF and solving the gradient entanglement issue in margin-based alignment methods.

Abstract: Time series data, prevalent in fields like medical, ecommerce, finance, etc., is used for forecasting, such as predicting next quarter’s product demand based on past trends. However, some problems necessitate causal models to answer questions like “What the product demand would have been without a specific intervention (e.g., products with slower delivery time suppressed from the search results)?” Such questions require causal models to estimate unobserved counterfactual outcome. In this paper, we propose a novel Graph Causal Forecasting (GCF) model, that predicts the unobserved demand leveraging the relationship of a product with other similar products in the marketplace (spatial aspect), along with change in demand over time for each product (temporal aspect). The core idea is to estimate the counterfactual outcome using a synthetic control unaffected by the treatment. Our approach uses RGCN-dilated CNN based network, which leverages domain knowledge to automatically design a synthetic control during training. Using GCF for our demand forecasting problem, we achieve 75.3% lower MAPE compared to baseline. We use the forecasted values to recommend high demand products, in terms of our business metric (discussed later) which tracks the quality of these recommendations, we achieve a significant jump of 61.2%. Moreover, it adds 67.8% more high demand products to the marketplace, compared to existing model in production. Deployment of GCF in 2023, led to +1399 bps improvement in number of products with a view from customers, and +310 bps improvement in number of products with a sale. We also compare GCF with state of the art forecasting methods on a semi-synthetic data, created by simulating a treatment on open source traffic data METR-LA. We achieve 30% lower MSE against TGCN, a time series forecasting approach and 30% lower MSE against CRN and 25% lower MSE against Google Causal Impact model, both of which are causal forecasting approaches.

Abstract: In a prior paper, we argued that Artificial Intelligence (AI) should be placed on a different foundation, one based on pattern recognition and feature learning rather than symbol manipulation and feature engineering. In this paper, we provide a proof of concept of an AI course that follows that proposed approach. Students study how these systems become so incredibly powerful through machine learning of features and through pattern matching. Students learn how those systems represent knowledge and they study their currently limited reasoning abilities. Students spend time discussing the accomplishments of current systems, positive as well as negative and they study the projected impact of anticipated systems. In this paper, we give a brief argument of why one would want to offer such a course. We present a detailed outline of the contents of such a course, together with learning materials and their proposed use. We summarize relevant anonymous student feedback and offer a subjective evaluation of the pilot course.

Abstract: Modern public health data contains information about changes in disease dynamics that can have significant downstream benefits if these phenomena can be identified. However, systemic data quality issues hamper automated analysis of these vast data volumes, and there is now far too much data (34 million data points/day) for public health data experts to inspect manually as they may have done in the past. This interdisciplinary thesis addresses practical questions about large-scale data monitoring that impact public health data users and are also reflected in the larger public health community. This work has been deployed for over a year and a half at the Delphi Research Group at Carnegie Mellon University, a national public health data curator, where data reviewers have been able to detect approximately 200 significant outbreaks, data issues, or changes in disease dynamics from 15 million new data points weekly.

Abstract: Deep learning has given rise to the field of representation learning, which aims to automatically extract rich semantics from data. However, there have been several challenges in the generalization capabilities of deep learning models. Recent works have highlighted beneficial properties of causal models that are desirable for learning robust models under distribution shifts. Thus, there has been a growing interest in causal representation learning for achieving generalizability in tasks involving reasoning and planning. The goal of my dissertation is to develop theoretical intuitions and practical algorithms that uncover the nature of causal representations and their applications. In my work, I focus on causal generative modeling with an emphasis on either representation or generation. For representation learning, I investigate the disentanglement of causal representations through the lens of independent causal mechanisms. For generation tasks, I develop algorithms for counterfactual generation under weak supervision settings by leveraging recent advances in generative modeling. The proposed approaches have been empirically shown to be effective in achieving disentanglement and generating counterfactuals.

Abstract: The advancements in Knowledge Graphs (KGs) and Large Language Models (LLMs) are driving transformative changes across various research fields, including metabolomics. These tools present exceptional opportunities to elucidate complex metabolic pathways and identify biomarkers essential to biological systems. My research focuses on harnessing the potential of KGs and LLMs within metabolomics, specifically making interactions between them and with biological researches. KGs, with their structured representation of metabolic entities and relationships, provide a robust foundation for managing extensive multimodal metabolomic knowledge. Recently, I developed a metabolitecentric knowledge graph and explored innovative methodologies to leverage KGs and LLMs for enhancing predictive modeling in clinical settings. My future research aims to fully exploit the capabilities of KGs and LLMs in metabolomics, advancing our understanding and applications in this field.

Abstract: My PhD research focuses on developing a highly accurate and explainable multioutput virtual metrology system for semiconductor manufacturing. Using machine learning, we predict the physical properties of metal layers from process parameters captured by production equipment sensors. Key contributions include a model-agnostic explanatory method based on projective operators, providing insights into the most influential features for multi-output predictions and feature selection algorithms for these tasks.

Abstract: Large Language Models (LLMs) have shown promise in educational applications, but challenges such as hallucinations, lack of contextual relevance, and limited personalization impede their practical adoption. To address these issues, my research introduces MerryQuery, an LLMpowered educational agent that integrates Retrieval-Augmented Generation (RAG), rule-based content control, and Reinforcement Learning from Human Feedback (RLHF). The system features a dynamic learning profile module for adaptive personalization and a multi-step verification framework that cross-checks responses against external sources to enhance trustworthiness. A functional prototype of MerryQuery is being piloted in a real-world classroom. Preliminary results demonstrate improved response reliability and student understanding.

Abstract: Economic growth and development require a consistent supply of energy. Energy which has mainly been supplied from fossil fuels. The impacts of these on the environment such as global warming, raised an alarm on their use. As a result other sources of energy such as wind energy are used as alternatives for electricity production. Wind energy assessment nevertheless faces barriers due to its stochastic nature. This later creates various regimes, which traditional models can't always fit thereby producing poor estimates. In this work, we aim to use Large Language Models (LLMs) to predict the wind potential in a given location. Through this approach, we aim at lifting the barrier on energy problems in developing countries by providing knowledge on the state of wind energy in given locations.

Abstract: The neural tangent kernel (NTK) has emerged as an important tool in recent years, both for developing a theoretical understanding of deep learning as well as for various applications. Even though recursive closed form expressions have been derived for computing the NTK, these become computationally expensive as the complexity of a network increases. Recent papers have looked at reducing this complexity using various sketching techniques along with random features. Building on these techniques, we propose an additional optimization step which results in better approximation of the NTK.

Abstract: We applied Riskaverse Reinforcement Learning (RL) to optimize investment portfolios while incorporating risk constraints. Given that portfolios must adhere to risk constraints set by investors and regulators, enforcing hard constraints is essential for practical portfolio optimization. Traditional techniques often lack the flexibility to model the complexities of dynamic financial markets. To address this, we used the Augmented Lagrangian Multiplier (ALM) to impose constraints on the agent, reducing risk during decision-making. Our risk-constrained RL algorithm demonstrated no constraint violations during testing and outperformed other Risk-averse RL methods, indicating its potential for optimizing portfolios for risk-averse investors.

Abstract: Healthrisk behaviors such as overeating and smoking have a profound impact on public health, making their monitoring and mitigation critical. Wearable RGB-Thermal cameras are being employed to monitor these behaviors by capturing hand-to-mouth (HTM) gestures, which are central to them. However, detection models relying on single modalities—either RGB or thermal—often struggle to accurately distinguish these confounding gestures due to inherent sensor limitations, such as sensitivity to lighting conditions or thermal occlusions. We present a family of fusion models that integrate RGB and thermal video data using early-, decision- , and a novel mid-fusion architecture, RGB-Thermal Fusion Video Network (RTFVNet), designed to enhance the recognition of HTM gestures associated with eating and smoking. Our evaluation shows that while decision fusion achieves the highest F1-score of 88% (0.44 TFLOPs), RTFVNet offers an optimal balance between performance (85%) and complexity (0.37 TFLOPs) for gesture classification of eating, smoking, and non-gesture activities.

Abstract: A rich line of theoretical work has modeled scenarios in which a set of agents make decisions sequentially, based on observing a growing mix of public and private signals that are revealed as these decisions occur. Here, we study a second crucial dimension, which is the way in which strategies can depend on crowding. In particular, consider a setting in which agents must sequentially decide which of several options to invest in, each based on a public signal that they receive. One of these options will ultimately be revealed to be valuable; but crucially, all the agents who selected this option must divide the value that comes from it. As a result, when a given agent j goes to make a decision among the options, the decisions of earlier agents convey information about the payoff that j will receive in any eventual division of the value. When many earlier agents have chosen a specific option, the greater crowding on this option means it must be divided more finely, resulting in lower payoffs. To simulate large games when signals are public, we define a polynomialtime algorithm to compute equilibrium strategies. We show that even in this case of public signals, the interaction of crowding with informational effects leads to complex non-monotonicities in the resulting sequential decisions, with agents sometimes choosing options with lower expected levels of crowding --- and hence a better split of the potential value --- over options with better informational or current crowding properties.

Abstract: We introduce LLM Stinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or whitebox access, LLM Stinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLM Stinger across open and closed-source models.

Abstract: This study investigates user engagement and political polarization on YouTube Shorts, a special category of YouTube videos with a duration of 1560 seconds. Via a substantial corpus of 38,838 videos gleaned from 100 YouTube channels focusing on political content, we contrast YouTube Shorts with long-form content in terms of user engagement, content toxicity, and polarization. Our analyses reveal that (1) YouTube Shorts receive more likes and views and fewer comments as compared to their long-form video counterparts; (2) YouTube Shorts are more toxic; and (3) considerably more polarized than long-form YouTube videos.

Abstract: We study a model of sequential decisionmaking where voters have dynamic preferences over a set of candidates that are undesirable. This models scenarios such as the implementation of projects that are overall beneficial to society, but impose individual costs on certain affected individuals. We show that while minimizing the sum of agents' disutilities can be done in polynomial time, minimizing the maximum disutility obtained by any agent is computationally intractable, even in restricted cases. We then examine the potential for agents to engage in strategic manipulation in response to these welfare objectives, offering insights into possible misconduct within such decision-making environments.

Abstract: We present FÆRDXEL, an expert system for providing answers and explanations to legal questions (queries) regarding Danish traffic law cases. It utilizes a Datalog encoding of Danish traffic laws and uses SLDresolution to answer queries. The SLD-resolution’s trace of operations can be converted into a legal explanation for the query. A user interface allows legal professionals to input case facts, ask questions, and explore explanations. Feedback from legal experts identified usability challenges and potential end-users, including traffic police and judges. Future steps include empirical evaluation of soundness, integrating punishment reasoning, and enhancing usability through natural language processing.

Departments of Biostatistics and Epidemiology, Harvard T. H. Chan School of Public Health, MA, USA Department of Dermatology, Massachusetts General Hospital, Harvard Medical School, MA, USA, Department of Dermatology, Massachusetts General Hospital, Harvard Medical School, MA, USA Department of Biomedical Informatics, Harvard Medical School, MA, USA, Department of Computer Science, University of Texas at Dallas, Texas, USA, Department of Computer Science, University of Texas at Dallas, Texas, USA, Department of Computer Science, Baylor University, Texas, USA, Department of Biomedical Informatics, Harvard Medical School, MA, USA, Department of Dermatology, Massachusetts General Hospital, Harvard Medical School, MA, USA

Abstract: Given a data matrix, unsupervised column subset selection refers to the problem of identifying a subset of columns that can be used to linearly approximate the original data matrix. This problem has many applications, such as feature selection and representative selection, but solving it optimally is known to be NPhard. We consider multi-view unsupervised column subset selection, which extends the concept of (single-view) column subset selection to data represented in multiple views or modalities. We introduce a combinatorial search algorithm for this generalized problem. One variant of the algorithm is guaranteed to compute an optimal solution in a setting similar to the classical A* algorithm. Other suboptimal variants, in a setting similar to the weighted A* algorithm, are much faster and provide a solution along with a bound on its quality.

Abstract: The recent success of transformer language models owes much to their conversational fluency, which includes linguistic and morphological proficiency. An affine Taylor approximation has been found to be a good approximation for transformer computations over certain factual and encyclopedic relations. We show that the truly linear approximation W s, where s is a early layer representation of the base form and W is a local model derivative, is necessary and sufficient to approximate morphological derivation, achieving above 80% top1 accuracy across most morphological tasks in the Bigger Analogy Test Set. We argue that many morphological forms in transformer models are likely linearly encoded.

Abstract: Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating highquality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset. The code is in https://github.com/adlnlp/MCKDVQA.

Abstract: This research proposes an AIdriven early warning system to predict patient deterioration in real-time using electronic health records (EHRs) and wearable devices. Leveraging deep learning techniques, such as recurrent neural networks (RNNs) for sequential data and convolutional neural networks (CNNs) for pattern recognition, the system adapts dynamically through reinforcement learning. Evaluation strategies include retrospective and prospective studies in clinical settings, measuring prediction accuracy and impact on patient outcomes. If successful, this system has the potential to save lives, reduce ICU admissions, and transform healthcare into a proactive, data-driven field.

Abstract: Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, cannot often understand the semantic context and deep learning approaches are resourceintensive, requiring task and dataset-specific training. We present an automated system that utilizes large language models to generate executable code for tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-dependent and memory-independent tasks.

Abstract: Bringing a new AI system into a production environment involves multiple different stakeholders such as business owners, risk officer, ethics officers approving the AI System for a specific usage. Governance frameworks typically include multiple manual steps, including curating information needed to assess risks and reviewing outcomes to identify appropriate actions and governance strategies. We demo a humanin-the-loop automation system that takes a natural language description of an intended use case for an AI system in order to create semi-structured governance information, recommend the most appropriate model for that use case, prioritise risks to be evaluated, automatically running those evaluations and finally storing these results for auditing, reporting and future recommendations. As a result we increase transparency to stakeholders and provide valuable information to aid in decision making when assessing risks associated with an AI solution.

Abstract: Matching markets, in which agents are assigned to one another based on preferences and capacity constraints, are pervasive in various domains. This paper introduces MATWA (https://matwa.optimalmatching.com), a web application that offers the most comprehensive collection to date of algorithms for fundamental matching under preference problem classes. MATWA provides results of algorithm executions and visualisations of structural properties. It is intended to be a resource for the community of researchers, educators and practitioners, supporting experimentation, as well as aiding the understanding of matching algorithms.

Abstract: Social skills training targets behaviors necessary for success in social interactions. However, traditional classroom training for such skills is often insufficient to teach effective communication — oneto-one interaction in real-world scenarios is preferred to lecture-style information delivery. This paper introduces a framework that allows instructors to collaborate with large language models to dynamically design realistic scenarios for students to communicate. Our framework uses these scenarios to enable student rehearsal, provide immediate feedback and visualize performance for both students and instructors. Unlike traditional intelligent tutoring systems, instructors can easily co-create scenarios with a large language model without technical skills. Additionally, the system generates new scenario branches in real time when existing options don't fit the student's response.

Abstract: While large language models (LLMs) have shown remarkable capability to generate convincing text across diverse domains, concerns around its potential risks have highlighted the importance of understanding the rationale behind text generation. We present LLM ATTRIBUTOR, a Python library that provides interactive visualizations for training data attribution of an LLM’s text generation. Our library offers a new way to quickly attribute an LLM’s text generation to training data points to inspect model behaviors, enhance its trustworthiness, and compare modelgenerated text with user-provided text. Thanks to LLM ATTRIBUTOR’s broad support for computational notebooks, users can easily integrate it into their workflow to interactively visualize attributions of their models.

Abstract: As the demand for healthier, personalized culinary experiences grows, so does the need for advanced food computation models that offer more than basic nutritional insights. However, current food computation models lack the depth to provide actionable insights like ingredient substitution or alternative cooking actions to suit users’ dietary goals. To address this, we introduce and demonstrate Pic2Prep, a multimodal conversational system that generates detailed cooking instructions, actions and ingredient lists from both images and text provided by users. The system is developed using a novel dataset generated through Stable Diffusion, where the input consists of recipe titles and ingredient lists from the Recipe1M dataset to create synthesized food images with variations. This dataset is used to finetune the Bootstrapping Language-Image Pre-training (BLIP) model to extract cooking instructions and ingredients from food images. Pic2Prep also employs the CookGen model, a small-scale custom generative model to derive specific cooking actions from cooking instructions. A custom mapper, trained on the Mistral model, links these actions to the corresponding ingredients, creating a comprehensive understanding of the cooking process. The system features an interactive user interface that allows users to input images and ask targeted questions, receiving real-time responses.

Abstract: Exploratory Data Analysis (EDA) derives meaningful insights from extensive and complex datasets. This process typically involves a series of analytical operations to identify the patterns within the data. However, the effectiveness of EDA is often limited by the user's domain knowledge and proficiency in data exploration methods. To overcome these challenges, we developed QUIS, a fully automated EDA system that uncovers insights by generating datarelated questions and exploring subspaces in the dataset without prior training. QUIS allows users to control key system parameters such as beam width, beam depth, and expansion factor for subspace selection, the interestingness score for filtering valuable insights, and parameters for managing the quality and quantity of generated questions.

Abstract: In this paper, we present PRIORITY2REWARD a Large Language Model (LLM) based application which incorporates health worker preferences for resource allocation planning in public health programs. LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning problems. We focus on LLMdesigned rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In the context of public health, our approach empowers grassroots health workers to tailor automated allocation decisions to community needs. We showcase a simulated application of PRIORITY2REWARD for a large-scale mobile health program in India. The tool allows health workers to enter natural language preferences and leverages LLMs to search for reward functions aligned with the preference. Our tool then dynamically showcases how the LLM generated reward function modifies the policy outcomes with respect to different demographic groups in the population. This can help inform policy implementation at a community level.

Abstract: Explanation for deep learning models on time series classification (TSC) tasks is an important and challenging problem. Most existing approaches use attribution maps to explain outcomes. However, they have limitations in generating explanations that are wellaligned with humans's perceptions. Recently LIME-based approaches provide a more meaningful explanation via segmenting the data. However, these approaches are still suffering from the processes of segment generations and evaluations. In this paper, we propose a novel time series explanation approach called InteDisUX to overcome these problems. Our technique utilizes the segment-level integrated gradient (SIG) for calculating importance scores for an initial set of small and equal segments before iteratively merge two consecutive ones to create better explanations under a unique greedy strategy guided by two new proposed metrics including discrimination and faithfulness gains. By this way, our method does not depend on predefined segments like others while being robusts to instability, poor local fidelity and data imbalance like LIME-based methods. Furthermore, InteDisUX is the first work to use the model's information to improve the set of segments} for time series explanation. Extensive experiments show that our method outperforms LIME-based ones in 12 datasets in terms of faithfulness and 8/12 datasets in terms of robustness.

Abstract: Among various branches of offline reinforcement learning (RL) methods, goalconditioned supervised learning (GCSL) has gained increasing popularity as it formulates the offline RL problem as a sequential modeling task, therefore bypassing the notoriously difficult credit assignment challenge of value learning in conventional RL paradigm. Sequential modeling, however, requires capturing accurate dynamics across long horizons in trajectory data to ensure reasonable policy performance. To meet this requirement, leveraging large, expressive models has become a popular choice in recent literature, which, however, comes at the cost of significantly increased computation and inference latency. Contradictory yet promising, we reveal that lightweight models as simple as shallow 2-layer MLPs, can also enjoy accurate dynamics consistency and significantly reduced sequential modeling errors against large expressive models by adopting a simple recursive planning scheme: recursively planning coarse-grained future sub-goals based on current and target information, and then executes the action with a goal-conditioned policy learned from data relabeled with these sub-goal ground truths. We term our method as Recursive Skip-Step Planning (RSP). Simple yet effective, RSP enjoys great efficiency improvements thanks to its lightweight structure, and substantially outperforms existing methods, reaching new SOTA performances on the D4RL benchmark, especially in multi-stage long-horizon tasks.

Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China College of Computer Science and Software Engineering, Shenzhen University, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, College of Computer Science and Software Engineering, Shenzhen University, China School of Information Technology, Carleton University, Canada, HKUST (GZ), AI Thrust, College of Computer Science and Software Engineering, Shenzhen University, China

Abstract: Graph Neural Networks (GNNs) have shown efficacy in graph node classification, but face computational challenges on largescale graphs. Although existing graph reduction methods address these issues, they still require high computational resources and fail to prioritize robust performance on out-of-distribution data. To tackle these challenges, we introduce the subgraph invariant learning paradigm, inspired by the small-world phenomenon. This approach enables models trained on specific subgraphs to generalize across diverse subgraphs, reducing computational demands, and enhancing scalability. To promote generalization, we maximize the invariance log-likelihood by deriving a theoretical lower bound of it and formulating the InVar loss. This loss minimizes the discrepancy between node representations and their corresponding invariance representations while maximizing the entropy of the node representation. In response to InVar loss, we propose the Invariance Facilitation Model (IFM), comprising the Invariance Representation Encoder (IRE) and Node Representation Encoder (NRE). IRE, capturing the invariance representations, utilizes Invariance ATTention (InvarATT) to compress long-range dependencies, while NRE learns the node representation, by integrating invariance representations via Telematic ATTention (TeleATT) and exchanging local information within each subgraph through GNNs. Evaluations on four large-scale graph datasets demonstrate the effectiveness, computational efficiency, and interpretability of IFM for large-scale graph node classification.

Abstract: This paper addresses generalized category discovery (GCD), the task of clustering unlabeled data from potentially known or unknown categories with the help of labeled instances from each known category. Compared to traditional semisupervised learning, GCD is more challenging because unlabeled data could be from novel categories not appearing in labeled data. Current state-of-the-art methods typically learn a parametric classifier assisted by self-distillation. While being effective, these methods do not make use of cross-instance similarity to discover class-specific semantics which are essential for representation learning and category discovery. In this paper, we revisit the association-based paradigm and propose a Prior-constrained Association Learning method to capture and learn the semantic relations within data. In particular, the labeled data from known categories provides a unique prior for the association of unlabeled data. Unlike previous methods that only adopts the prior as a pre or post-clustering refinement, we fully incorporate the prior into the association process, and let it constrain the association towards a reliable grouping outcome. The estimated semantic groups are utilized through non-parametric prototypical contrast to enhance the representation learning. A further combination of both parametric and non-parametric classification complements each other and leads to a model that outperforms existing methods by a significant margin. On multiple GCD benchmarks, we perform extensive experiments and validate the effectiveness of our proposed method.

Abstract: DomainIncremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

Abstract: Multiview clustering aims to identify consistent and complementary information across multiple views to partition data into clusters, emerging as a popular unsupervised method for multi-view data analysis. However, existing methods often design view-specific encoders to extract distinct features from each view, lacking exploration of their complementarity. Additionally, current contrastive-based multi-view clustering methods may lead to erroneous negative sample pairs conflicting with the clustering objective. To address these challenges, we propose a novel Contrastive Multi-view Subspace Clustering via Tensor Transformers Autoencoder (TTAE). On the one hand, it facilitates information exchange between views by tensor transformers autoencoder, thereby enhancing complementarity. On the other hand, It learns a consistent subspace with a self-expression layer. Meanwhile, adaptive contrastive learning helps to provide more discriminative features for the self-expression learning layer, and the self-expression learning layer in turn supervises contrastive learning. Moreover, our method adaptively selects positive and negative samples for contrastive learning to mitigate the impact of inappropriate negative sample pairs. Extensive experiments on several multi-view datasets demonstrate the effectiveness and superiority of our model.

School of Computer Science and Engineering, Central South University, Changsha 410083, China Xiangjiang Laboratory, Changsha 410205, China, School of Computer Science and Engineering, Central South University, Changsha 410083, China Xiangjiang Laboratory, Changsha 410205, China, School of Computer Science and Engineering, Central South University, Changsha 410083, China Xiangjiang Laboratory, Changsha 410205, China, Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA, Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College, School of Computer Science and Engineering, Central South University, Changsha 410083, China Xiangjiang Laboratory, Changsha 410205, China The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha 410083, China

Abstract: In this paper, we consider the kcenter problem with outliers (the (k, z)-center problem) in the context of Massively Parallel Computation (MPC). Existing MPC algorithms for the (k, z)-center problem typically require Ω(k) local space per machine. While this may be feasible when k is small, these algorithms become impractical for large k, where each machine may lack sufficient space for computation. This motivates the study of fully-scalable algorithms with sublinear local space. We propose the first fully-scalable MPC algorithm for the (k, z)-center problem. The main challenge is to design an MPC algorithm that operates with sublinear local space for finding the inliers close to the optimal clustering centers, and ensuring the approximation loss remains bounded. To address this issue, we propose an iterative sampling-based algorithm with sublinear local space in the data size. A key component of our approach is an outliers-removal algorithm that adjusts the sample size in each iteration to select inliers as clustering centers. However, the number of discarded inliers increases with the iteration of the outliers-removal algorithm, making it difficult to bound. To address this, we propose a self-adaptive method that can automatically adjust sample size to account for different data distributions on each machine, ensuring a lower bound on the sampling success probability. With these techniques, we present an O(log^*n)-approximation MPC algorithm for the (k, z)-center problem in constant-dimensional Euclidean space. The algorithm discards at most (1 + ε)z outliers, completing in O(log log n) computation rounds while using Θ(n^δ) local space per machine.

Abstract: Generative adversarial networks (GANs) have emerged as a powerful tool for generating highfidelity data. However, the main bottleneck of existing approaches is the lack of supervision on the generator training, which often results in undamped oscillation and unsatisfactory performance. To address this issue, we propose an algorithm called Monte Carlo GAN (MCGAN). This approach, utilizing an innovative generative loss function, termed the regression loss, reformulates the generator training as a regression task and enables the generator training by minimizing the mean squared error between the discriminator's output of real data and the expected discriminator of fake data. We demonstrate the desirable analytic properties of the regression loss, including discriminability and optimality, and show that our method requires a weaker condition on the discriminator for effective generator training. These properties justify the strength of this approach to improve the training stability while retaining the optimality of GAN by leveraging strong supervision of the regression loss. Extensive experiments on diverse datasets, including image data (CIFAR-10/100, FFHQ256, ImageNet, and LSUN Bedroom), time series data (VAR and stock data), and video data, are conducted to demonstrate the flexibility and effectiveness of our proposed MCGAN. Numerical results show that the proposed MCGAN is versatile in enhancing a variety of backbone GAN models and achieves consistent and significant improvement in terms of quality, accuracy, training stability, and learned latent space.

Abstract: Scientific discovery serves as the cornerstone for advances in various fields, from the fundamental laws of physics to the intricate mechanisms of biology. However, two existing mainstream methodssymbolic regression and dimensional analysis, are significantly limited in this task: the former suffers from low computational efficiency due to the vast search space and often results in formulas without physical meaning; the latter provides a useful theoretical framework but also struggles in searching in a huge space because of lacking effective analysis for the latent variables. To address this issue, here we propose a framework for efficiently discovering underlying formulas in data, named FIND. We draw inspiration from Buckingham’s Pi theorem, imposing dimensional constraints on the input and output, thereby ensuring discovered expressions possess physical meaning. Additionally, we propose a theoretical scheme for identifying the latent structure as well as a coarse-to-fine framework, significantly reducing the search space of latent variables. This framework not only improves computational efficiency but also enhances model interpretability. From comprehensive experimental validation, FIND showcases its potential to uncover meaningful scientific insights across various domains, providing a robust tool for advancing our understanding of unknown systems.

Abstract: The powerful capability of HyperGraph Neural Networks (HGNNs) in modeling intricate, highorder relationships among multiple data samples stems primarily from their ability to aggregate both the direct neighborhood features of individual nodes and those associated with hyperedges. However, the limited scope of feature propagation in existing HGNNs significantly reduces the utilization of hypergraph information, exacerbating over-squashing and over-smoothing issues. To this end, we propose a novel K-hop HyperGraph Neural Network (KHGNN) to facilitate the interactions of distant nodes and hyperedges. Specifically, the bisection nested convolution based on HyperGINE is employed to extract features from nodes, hyperedges, and structures along all shortest paths between nodes or hyperedges, providing representations of long-distance relationships. With these comprehensive path features, nodes and hyperedges are guided to aggregate distant information while learning their complex relationships. The extensive experiments, particularly on long-range graph datasets, demonstrate that the proposed method achieves SOTA performance compared to existing HGNNs and graph neural networks.

MAIS, Institute of Automation, Chinese Academy of Sciences Pengcheng Laboratory School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), MAIS, Institute of Automation, Chinese Academy of Sciences Pengcheng Laboratory School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Pengcheng Laboratory, Pengcheng Laboratory Harbin Institute of Technology, Shenzhen, MAIS, Institute of Automation, Chinese Academy of Sciences Pengcheng Laboratory School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)

Abstract: In this paper, we explore a novel federated multimodal instruction tuning task(FedMIT), which is significant for collaboratively finetuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework(Pilot). Our framework integrates two-stage of ``adapter on adapter” into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters(CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training parameters, which optimizes parameter aggregation by calculating weights based on the euclidean distance between parameters, so that parameter aggregation can benefit from positive effects to the greatest extent while effectively reducing negative effects. Our framework can collaboratively exploit distributed data from different local clients to learn cross-task knowledge without being affected by the task heterogeneity during instruction tuning. The effectiveness of our method is verified in two different cross-task scenarios.

Abstract: Anomaly detection aims to identify deviations from normal patterns within data. This task is particularly crucial in dynamic graphs, which are common in applications like social networks and cybersecurity, due to their evolving structures and complex relationships. Although recent deep learningbased methods have shown promising results in anomaly detection on dynamic graphs, they often lack of generalizability. In this study, we propose GeneralDyG, a method that samples temporal ego-graphs and sequentially extracts structural and temporal features to address the three key challenges in achieving generalizability: Data Diversity, Dynamic Feature Capture, and Computational Cost. Extensive experimental results demonstrate that our proposed GeneralDyG significantly outperforms state-of-the-art methods on four real-world datasets.

Abstract: Complex claim factchecking performs a crucial role in disinformation detection. However, existing fact-checking methods struggle with claim vagueness, specifically in effectively handling latent information and complex relations within claims. Moreover, evidence redundancy, where non-essential information complicates the verification process, remains a significant issue. To tackle these limitations, we propose Bilateral Defusing Verification (BiDeV), a novel fact-checking working-flow framework integrating multiple role-played LLMs to mimic the human-expert fact-checking process. BiDeV consists of two main modules: Vagueness Defusing identifies latent information and resolves complex relations to simplify the claim, and Redundancy Defusing eliminates redundant content to enhance the evidence quality. Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings. This highlights the effectiveness of BiDeV in handling complex claims and ensuring precise fact-checking.

Abstract: While finetuned large language models (LLMs) excel in generating grammatically valid SQL in Text-to-SQL parsing, they often struggle to ensure semantic accuracy in queries, leading to user confusion and diminished system usability. To tackle this challenge, we introduce SQLFixAgent, a new consistency-enhanced multi-agent collaborative framework designed for detecting and repairing erroneous SQL. Our framework comprises a core agent, SQLRefiner, alongside two auxiliary agents: SQLReviewer and QueryCrafter. The SQLReviewer agent employs the rubber duck debugging method to identify potential semantic mismatches between SQL and user query. If the error is detected, the QueryCrafter agent generates multiple SQL as candidate repairs using a fine-tuned SQLTool. Subsequently, leveraging similar repair retrieval and failure memory reflection, the SQLRefiner agent selects the most fitting SQL statement from the candidates as the final repair. We evaluated our proposed framework on five Text-to-SQL benchmarks. The experimental results show that our method consistently enhances the performance of the baseline model, specifically achieving an execution accuracy improvement of over 3% on the Bird benchmark. Our framework also has a higher token efficiency compared to other advanced methods, making it more competitive.

Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce FoldTokenizer to represent protein sequencestructure as discrete symbols. This approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We name the learned discrete symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language on general backbone inpainting task, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).

Abstract: Understanding and leveraging the 3D structures of proteins is central to a variety of biological and drug discovery tasks. While deep learning has been applied successfully for structurebased protein function prediction tasks, current methods usually employ distinct training for each task. However, each of the tasks is of small size, and such a single-task strategy hinders the models' performance and generalization ability. As some labeled 3D protein datasets are biologically related, combining multi-source datasets for larger-scale multi-task learning is one way to overcome this problem. In this paper, we propose a neural network model to address multiple tasks jointly upon the input of 3D protein structures. In particular, we first construct a standard structure-based multi-task benchmark called Protein-MT, consisting of 6 biologically relevant tasks, including affinity prediction and property prediction, integrated from 4 public datasets. Then, we develop a novel graph neural network for multi-task learning, dubbed Heterogeneous Multichannel Equivariant Network (HeMeNet), which is E(3) equivariant and able to capture heterogeneous relationships between different atoms. Besides, HeMeNet can achieve task-specific learning via the task-aware readout mechanism. Extensive evaluations of our benchmark verify the effectiveness of multi-task learning, and our model generally surpasses state-of-the-art models.

Abstract: In complex electromagnetic environments, the identification and differentiation of diverse radio frequency (RF) emitters become particularly crucial. Existing RF fingerprinting methods demonstrate limitations when dealing with numerous unknown emitters, making it challenging for accurate classification and recognition. These limitations hinder the effective handling of specific unknown emitters.To address this issue, we introduce a novel RF fingerprinting method suitable for openworld conditions for the first time. We develop a novel RF fingerprinting model, Roinformer, to extract signal features with positional attention. We then leverage data augmentation strategies such as noise jitter and signal frame rearrangement to construct an effective pre-training model. Moreover, by incorporating instance-level similarity loss and a novel local entropy regularization approach, we significantly enhance the accuracy of known class identification and mitigate the catastrophic forgetting of known signal samples. Experimental results on three temporal signal datasets demonstrate that our method effectively recognizes both the known and unknown classes, outperforming several state-of-the-art methods by a large margin.

Abstract: Recent studies have shown that Hypergraph Neural Networks (HGNNs) are vulnerable to adversarial attacks. Existing approaches focus on hypergraph modification attacks guided by gradients, overlooking node spanning in the hypergraph and the group identity of hyperedges, thereby resulting in limited attack performance and detectable attacks. In this manuscript, we present a novel framework, i.e., Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges (IEAttack), to tackle these challenges. Initially, utilizing the node spanning in the hypergraph, we propose the elite hyperedges sampler to identify hyperedges to be injected. Subsequently, a node generator utilizing Kernel Density Estimation (KDE) is proposed to generate the homogeneous node with the group identity of hyperedges. Finally, by injecting the homogeneous node into elite hyperedges, IE-Attack improves the attack performance and enhances the imperceptibility of attacks. Extensive experiments are conducted on five authentic datasets to validate the effectiveness of IE-Attack and the corresponding superiority to state-of-the-art methods.

Abstract: Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of selfsupervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom & Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.

Abstract: Density Functional Theory (DFT) stands as a widely used and efficient approach for addressing the manyelectron Schrödinger equation across various domains such as physics, chemistry, and biology. However, a core challenge that persists over the long term pertains to refining the exchange-correlation (XC) approximation. This approximation significantly influences the triumphs and shortcomings observed in DFT applications. Nonetheless, a prevalent issue among XC approximations is the presence of systematic errors, stemming from deviations from the mathematical properties of the exact XC functional. For example, although both B3LYP and DM21 (DeepMind 21) exhibit improvements over previous benchmarks, there is still potential for further refinement. In this paper, we propose a strategy for enhancing XC approximations by estimating the neural uncertainty of the XC functional, named Residual XC-Uncertain Functional. Specifically, our approach involves training a neural network to predict both the mean and variance of the XC functional, treating it as a Gaussian distribution. To ensure stability in each sampling point, we construct the mean by combining traditional XC approximations with our neural predictions, mitigating the risk of divergence or vanishing values. It is crucial to highlight that our methodology excels particularly in cases where systematic errors are pronounced. Empirical outcomes from three benchmark tests substantiate the superiority of our approach over existing state-of-the-art methods. Our approach not only surpasses related techniques but also significantly outperforms both the popular B3LYP and the recent DM21 methods, achieving average RMSE improvements of 62% and 37%, respectively, across the three benchmarks: W4-17, G21EA, and G21IP.

Abstract: As the Ethereum platform continues to mature and gain widespread usage, it is crucial to maintain high standards of smart contract writing practices. While bad practices in smart contracts may not directly lead to security issues, they do elevate the risk of encountering problems. Therefore, to understand and avoid these bad practices, this paper introduces the first systematic study of bad practices in smart contracts, delving into over 35 specific issues. Specifically, we propose a large language models (LLMs)based framework, SCALM. It combines Step-Back Prompting and Retrieval-Augmented Generation (RAG) to effectively identify and address various bad practices. Our extensive experiments using multiple LLMs and datasets have shown that SCALM outperforms existing tools in detecting bad practices in smart contracts.

Abstract: We study cascades in social networks with the independent cascade (IC) model and the SusceptibleInfected-recovered (SIR) model. The well-studied IC model fails to capture the feature of node recovery, and the SIR model is a variant of the IC model with the node recovery feature. In the SIR model, by computing the probability that a node successfully infects another before its recovery and viewing this probability as the corresponding IC parameter, an equivalence between the two models is established, except that the events of the infections along different out-going edges of a node become dependent in the SIR model, whereas these events are independent in the IC model. In this paper, we thoroughly compare the two models and examine the effect of this extra dependency in the SIR model. By a carefully designed coupling argument, we show that the seeds in the IC model have a stronger influence spread than their counterparts in the SIR model, and sometimes it can be significantly stronger. Specifically, we prove that, given the same network, the same seed sets, and the parameters of the two models being set based on the above-mentioned equivalence, the expected number of infected nodes at the end of the cascade for the IC model is weakly larger than that for the SIR model, and there are instances where this dominance is significant. We also study the influence maximization problem (the optimization problem of selecting a set of nodes as initial seeds in a social network to maximize their influence) with the SIR model. We show that the above-mentioned difference in the two models yields different seed-selection strategies, which motivates the design of influence maximization algorithms specifically for the SIR model. We design efficient approximation algorithms with theoretical guarantees by adapting the reverse-reachable-set-based algorithms, commonly used for the IC model, to the SIR model.

Abstract: An image encoder pretrained by self-supervised learning can be used as a general-purpose feature extractor to build downstream classifiers for various downstream tasks. However, many studies showed that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built based on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If not, the test input is processed in a normal way to maintain the utility. Otherwise, the test input will be further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) from a given test input and recover it under state-of-the-art trojan attacks. We further demonstrate by experiments that our TrojanDec outperforms the state-of-the-art defenses.

Abstract: Video surveillance systems are crucial components for ensuring public safety and management in smart city. As a fundamental task in video surveillance, textto-image person retrieval aims to retrieve the target person from an image gallery that best matches the given text description. Most existing text-to-image person retrieval methods are trained in a supervised manner that requires sufficient labeled data in the target domain. However, it is common in practice that only unlabeled data is available in the target domain due to the difficulty and cost of data annotation, which limits the generalization of existing methods in practical application scenarios. To address this issue, we propose a novel unsupervised domain adaptation method, termed Graph-Based Cross-Domain Knowledge Distillation (GCKD), to learn the cross-modal feature representation for text-to-image person retrieval in a cross-dataset scenario. The proposed GCKD method consists of two main components. Firstly, a graph-based multi-modal propagation module is designed to bridge the cross-domain correlation among the visual and textual samples. Secondly, a contrastive momentum knowledge distillation module is proposed to learn the cross-modal feature representation using the online knowledge distillation strategy. By jointly optimizing the two modules, the proposed method is able to achieve efficient performance for cross-dataset text-to-image person retrieval. Extensive experiments on three publicly available text-to-image person retrieval datasets demonstrate the effectiveness of the proposed GCKD method, which consistently outperforms the state-of-the-art baselines.

Abstract: Malicious users attempt to replicate commercial models functionally at low cost by training a clone model with query responses. It is challenging to timely prevent such modelstealing attacks to achieve strong protection and maintain utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) to recognize queries from malicious users by leveraging account-wise local dependency. We formulate each class as a Multivariate Normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancy. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Results of extensive experimental studies show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users for both soft and hard-label settings.

Abstract: Most existing semisupervised community detection algorithms leverage known communities to learn community structures, subsequently identifying communities that align with these learned community structures. However, differences in community structures may render the community structures learned by these methods inappropriate for the community containing the given node of interest. As a result, the identified community may exclude the given node or be of poor quality. Inspired by the success of reinforcement learning, we propose a Semi-supervised Local community detection method based on Reinforcement Learning, named SLRL, which only explores parts of the network surrounding the given node. It first extracts the local structure around a given node with an extractor, followed by selecting communities that are similar to this local structure to distill useful communities. These selected communities are employed to train the expander, which expands the community containing a given node. Experimental results demonstrate that SLRL outperforms state-of-the-art algorithms on five real-world datasets.

School of Artificial Intelligence, Nanjing University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, School of Artificial Intelligence, Nanjing University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, School of Artificial Intelligence, Nanjing University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, School of Artificial Intelligence, Nanjing University, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China; Jiangsu Provincial Medical Key Discipline Cultivation Unit, Nanjing, China, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China; Jiangsu Provincial Medical Key Discipline Cultivation Unit, Nanjing, China, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China; Jiangsu Provincial Medical Key Discipline Cultivation Unit, Nanjing, China, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China; Jiangsu Provincial Medical Key Discipline Cultivation Unit, Nanjing, China, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China; Jiangsu Provincial Medical Key Discipline Cultivation Unit, Nanjing, China

Abstract: The accurate assessment of sperm morphology is crucial in andrological diagnostics, where the segmentation of sperm images presents significant challenges. Existing approaches frequently rely on large annotated datasets and often struggle with the segmentation of overlapping sperm and the presence of dye impurities. To address these challenges, this paper first analyzes the issue of overlapping sperm tails from a geometric perspective and introduces a novel clustering algorithm, Con2Dis, which effectively segments overlapping tails by considering three essential factors: CONnectivity, CONformity, and DIStance. Building on this foundation, we propose an unsupervised method, SpeHeaTal, designed for the comprehensive segmentation of the SPErm HEAd and TAiL. SpeHeaTal employs the Segment Anything Model (SAM) to generate masks for sperm heads while filtering out dye impurities, utilizes Con2Dis to segment tails, and then applies a tailored mask splicing technique to produce complete sperm masks. Experimental results underscore the superior performance of SpeHeaTal, particularly in handling images with overlapping sperm.

Abstract: Generative models have gained significant attention in multivariate time series forecasting (MTS), particularly due to their ability to generate highfidelity samples. Forecasting the probability distribution of multivariate time series is a challenging yet practical task. Although some recent attempts have been made to handle this task, two major challenges persist: 1) some existing generative methods underperform in high-dimensional multivariate time series forecasting, which is hard to scale to higher dimensions; 2) The inherent high-dimensional multivariate attributes constrain the forecasting lengths of existing generative models. In this paper, we point out that discrete token representations can model high-dimensional MTS with faster inference time, and forecast the target with the long-term trends of itself can extend the forecasting length with high accuracy. Motivated by this, we propose a vector quantized framework called Hierarchical Discrete Transformer (HDT) that models time series into discrete token representations with l2 normalization enhanced vector quantized strategy, in which we transform the MTS forecasting into discrete tokens generation. To address the limitations of generative models in long-term forecasting, we propose a hierarchical discrete Transformer. This model captures the discrete long-term trend of the target at the low level and leverages this trend as a condition to generate the discrete representation of the target at the high level that introduces the features of target itself for extending the forecasting length in high-dimensional MTS. Extensive experiments on five popular MTS datasets verify the effectiveness of our proposed method. The source code will be released.

State Key Laboratory of Aerodynamics, Sichuan, China Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Sichuan, China, State Key Laboratory of Aerodynamics, Sichuan, China Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Sichuan, China, Sichuan Tianfu Fluid Big Data Research Center, Chengdu, China State Key Laboratory of Aerodynamics, Sichuan, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Sichuan, China Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Sichuan, China, State Key Laboratory of Aerodynamics, Sichuan, China Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Sichuan, China College of Computer Science and Technology, National University of Defense Technology, Changsha, China, Sichuan Tianfu Fluid Big Data Research Center, Chengdu, China State Key Laboratory of Aerodynamics, Sichuan, China

Abstract: Aerodynamic coefficient prediction is pivotal in aircraft and vehicles' design, performance evaluation, and motion control. Integrating artificial neural networks into aerodynamic coefficient prediction offers a promising alternative to traditional numerical methods burdened by extensive computations and high costs. Nevertheless, this datadriven approach faces several critical challenges, which limit its further performance enhancement: i) The current research lacks a profound understanding of the complex interplay between the shape of an object and its aerodynamic characteristics. ii) The scarcity of high-quality aerodynamic data poses a significant barrier. The models trained on limited datasets lack generalization ability, struggling to accurately predict and adapt to diverse aerodynamic performance under new shapes or conditions. To overcome these challenges, we introduce an innovative framework that employs cross-attention to capture the intimate interplay between shape and flow conditions and allows for the direct utilization of pre-trained models on general shape datasets to mitigate the scarcity of aerodynamic data. Furthermore, to bolster the inference capabilities of this data-driven approach, we integrate physical information constraints into the model, leveraging them as guiding principles to enhance the model's predictive power under unknown conditions. Experimental validation demonstrates that our proposed method performs excellently in multiple aerodynamic prediction tasks. This achievement brings a new technological breakthrough to the field of aerodynamic prediction and provides robust support for the design optimization of complex systems such as aircraft and vehicles.

Abstract: Molecular dynamics (MD) has long been the de facto choice for simulating intricate physical systems from first principles. Recent efforts utilize the implicit neural representation (INR) to directly learn surface point clouds' signed distance function (SDF) with promising outcomes. However, INR's temporal generalization to unexplored molecular systems remains limited, which poses a significant barrier to applying INR to a broader range of realworld scenarios. This study introduces MoE-DSR, an enhanced version of dynamic surface representations (DSR) that effectively integrates the mixture-of-experts (MoE) strategy. Specifically, the router employs a novel geometric surface cloud network to extract the structural information from the initial static protein conformation as the prior knowledge. Meanwhile, experts compromising a team of equivariant implicit neural networks (E-INNs), each responsible for distinct protein families, ensure precise SDF estimation across varied protein data landscapes. We showcase the ability of MoE-DSR to model dynamic protein surface shapes using ensembles from ATLAS, the largest available protein MD simulations database. Extensive experiments validate its effectiveness in analyzing complex molecular systems across continuous space and time domains.

Abstract: Human Activity Recognition (HAR) aims to recognize activities by training models on massive sensor data. In realworld deployment, a crucial aspect of HAR that has been largely overlooked is that the test sets may have different distributions from training sets due to inter-subject variability including age, gender, behavioral habits, etc., which leads to poor generalization performance. One promising solution is to learn domain-invariant representations to enable a model to generalize on an unseen distribution. However, most existing methods only consider the feature-invariance of the penultimate layer for domain-invariant learning, which leads to suboptimal results. In this paper, we propose a Categorical Concept Invariant Learning (CCIL) framework for generalizable activity recognition, which introduces a concept matrix to regularize the model in the training stage by simultaneously concertrating on feature-invariance and logit-invariance. Our key idea is that the concept matrix for samples belonging to the same activity category should be similar. Extensive experiments on four public HAR benchmarks demonstrate that our CCIL substantially outperforms the state-of-the-art approaches under cross-person, cross-dataset, cross-position, and one-person-to-another settings.

Abstract: The fault location task in power grids is crucial for maintaining social order and ensuring public safety. However, existing methods that rely on tabular state records often neglect the intrinsic topological influences of transmission lines, resulting in a segmented approach to fault location that consists of multiple stages. In this paper, we propose an Disentangled TableGraph representation framework, termed DTG, which integrates fault location tasks at coarse-grained line levels and fine-grained point levels within an end-to-end learning paradigm. Our innovative disentanglement strategy produces interpretable attribution coefficients that connect tabular records and transmission line topology, thereby facilitating fault location at both line- and point-levels. The joint prediction tasks designed around our disentangled tabular graph representation promote mutual information exchange between features and topology of transmission lines in an interpretable manner. Experimental results on the 7-bus system, 36-bus system and a realistic 325-bus system in China demonstrate that the proposed method adapt to different topological structures and handle different types of faults. Compared to traditional methods, DTG4Power achieves high accuracy in both fault lines and fault points.

Abstract: Exception handling is crucial but challenging in program development. It needs to identify and handle all potential exceptions within programs to ensure system security and stabilization. Traditional exception handling relies on the expertise and experience of programmers, which often leads to oversights. Therefore, identifying exceptional code and recommending handling solutions are hot research topics with significant practical value. This paper presents a model called CodeHunter for exception localization and type prediction. The model first utilizes BERTbased model to represent code features and then uses Bi-LSTM for sequence labeling to pinpoint exceptional code. Additionally, this model also considers contextual features of the exception code and learns weights for the code within the try block and its context through the self-attention mechanism. Subsequently, it performs exception localization and predicts exception types. We conduct experiments on three different datasets. The results demonstrate that in the task of exception localization, our model can achieve a maximum accuracy of 98.6%, exceeding SOTA baselines by 11.2%. In the task of exception type prediction, our model can surpass the accuracy of SOTA baselines by a maximum of 18.7%, achieving 92.0% Top-1 accuracy. The rationality of techniques used in our model is also proved by the ablation testing. The model is implemented as an IDE plugin for programming convenience.

Abstract: Barrier certificate generation is an efficient and powerful technique for formally verifying safety properties of cyberphysical systems. Feed-forward neural networks (FNNs) are commonly used to synthesize barrier certificates, but the fixed activation functions limit their efficiency and scalability. In this paper, we propose a novel method for generating barrier certificates using Fourier Kolmogorov-Arnold Networks (KANs). Specifically, it utilizes Fourier KANs to replace FNNs as the template of barrier certificates. Since Fourier KAN has learnable activation functions and uses trigonometric functions as its basis functions, it can efficiently improve the representation power and is easy to train for neural barrier certificates. Then, it formally verifies the validity of the candidate Fourier KAN barrier certificates using both the Lipschitz method and the Satisfiability Modulo Theories, improving the efficiency and success rate of verification. We implement the tool KAN4BC, and evaluate its performance over a set of benchmarks. The experimental results demonstrate the effectiveness and efficiency of our method.

The School of Artifcial Intelligence and Data Science, University of Science & Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China, The School of Artifcial Intelligence and Data Science, University of Science & Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China, State Key Laboratory of Cognitive Intelligence, Hefei, China, State Key Laboratory of Cognitive Intelligence, Hefei, China, The School of Artifcial Intelligence and Data Science, University of Science & Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China, State Key Laboratory of Cognitive Intelligence, Hefei, China IFLYTEK Research, Hefei, China, State Key Laboratory of Cognitive Intelligence, Hefei, China IFLYTEK Research, Hefei, China, The School of Artifcial Intelligence and Data Science, University of Science & Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China

Abstract: Cognitive diagnosis, which assesses the learners' competence from learners' interaction logs, plays a vital role in education. It provides a crucial reference for gauging learners' proficiency levels and tailoring future learning activities accordingly. Researchers have proposed numerous cognitive diagnosis models to address this task. Despite their success, these models continue to face the illposed problem because of the information loss caused by under-expressive interaction function and incomplete observations. In this paper, we address these challenges by proposing a novel cognitive diagnosis model, DMC-CDM, based on the theoretical premise that cognitive states can be captured with minimal information loss by maximizing the mutual information between observed and potential observations. Specifically, DMC-CDM incorporates a semantic extractor to provide a comprehensive semantic understanding of learners' interaction logs, thereby enhancing current collaborative-based cognitive state representations. It then consolidates multi-perspective observations to capture precise cognitive states by maximizing mutual information between these observations. We conducted extensive experiments on three datasets, and the experimental results demonstrate that our proposed model is both effective and beneficial for downstream applications in education.

Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen Guangdong Provincial Key Laboratory of Intelligent Information Processing, Mohamed Bin Zayed University of Artificial Intelligence, University of Exeter, Great Bay University

Abstract: Multimodal sentiment analysis, which learns a model to process multiple modalities simultaneously and predict a sentiment value, is an important area of affective computing. Modeling sequential intramodal information and enhancing cross-modal interactions are crucial to multimodal sentiment analysis. In this paper, we propose MSAmba, a novel hybrid Mamba-based architecture for multimodal sentiment analysis, consisting of two core blocks: Intra-Modal Sequential Mamba (ISM) block and Cross-Modal Hybrid Mamba (CHM) block, to comprehensively address the above-mentioned challenges with hybrid state space models. Firstly, the ISM block models the sequential information within each modality in a bi-directional manner with the assistance of global information. Subsequently, the CHM blocks explicitly model centralized cross-modal interaction with a hybrid combination of Mamba and attention mechanism to facilitate information fusion across modalities. Finally, joint learning of the intra-modal tokens and cross-modal tokens is utilized to predict the sentiment values. This paper serves as one of the pioneering works to unravel the outstanding performances and great research potential of Mamba-based methods in the task of multimodal sentiment analysis. Experiments on CMU-MOSI, CMU-MOSEI and CH-SIMS demonstrate the superior performance of the proposed MSAmba over prior Transformer-based and CNN-based methods.

Abstract: Humans excel at understanding and reasoning about novel, compositionally structured knowledge, largely due to their capacity for compositional generalization—a cognitive skill that has recently been validated in structured neural networks. However, most existing research has focused primarily on semantic translation within canonical language environments, often neglecting the explicit connection to compositional generalization behavior. In contrast, humans typically demonstrate this ability through interaction with their environments rather than solely through internal reasoning. To address this gap, we propose CraftFactory, a benchmark designed for evaluating compositional generalization in an interactive control environment. This benchmark introduces a new challenge for testing compositional generalization in a more realistic and comprehensive manner. CraftFactory stands out due to three key features: (1) it offers an openended interactive control environment with thousands of items and flexible actions; (2) it requires advanced compositional inference through various combinations and complex permutations of instructions; and (3) it evaluates compositional generalization intuitively through interactive behavior. By leveraging CraftFactory, we aim to promote the development of more advanced compositional generalization methods, thereby contributing to the broader field of general AI.

Abstract: Completely automated public Turing test to tell humans apart (CAPTCHA) is an effective mechanism to protect websites and online applications from malicious bots programs. Imagebased CAPTCHA is one of the most widely used schemes. However, deep learning techniques have significantly weakened the security of some image-based CAPTCHA schemes. Mooney images (MIs) are important research materials in the field of cognitive science. Compared to natural images, MI exhibits fewer visual cues, fragmented content, and greater ambiguity, leading to the perception of MI relying more on the iterative process between feedforward and feedback mechanisms. In this paper, we raise an intriguing question: can MIs be used to enhance the security of CAPTCHA? Before this study, we first propose a novel framework HiMI that generates the high-quality MIs from natural images and also allows flexible adjustment of the perceived difficulty. Based on MI, we design two MI-CAPTCHA schemes related to object detection and instance segmentation tasks, respectively. We experimentally demonstrate that HiMI performs better than other baseline methods in terms of both image quality and application potential in two MI-CAPTCHA schemes. Additionally, we conduct experiments to explore the solving performance of humans and CAPTCHA solvers under different parameter settings of schemes, providing valuable reference for the practical application.

Abstract: Spiking neural networks (SNNs) are widely applied in various fields due to their energyefficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with similar performance as recurrent neural networks (RNNs) but with about 50\% power consumption.

Abstract: Recent advancements in neuroscience research have propelled the development of Spiking Neural Networks (SNNs), which not only have the potential to further advance neuroscience research but also serve as an energyefficient alternative to Artificial Neural Networks (ANNs) due to their spike-driven characteristics. However, previous studies often overlooked the multiscale information and its spatiotemporal correlation between event data, leading SNN models to approximate each frame of input events as static images. We hypothesize that this oversimplification significantly contributes to the performance gap between SNNs and traditional ANNs. To address this issue, we have designed a Spiking Multiscale Attention (SMA) module that captures multiscale spatiotemporal interaction information. Furthermore, we developed a regularization method named Attention ZoneOut (AZO), which utilizes spatiotemporal attention weights to reduce the model's generalization error through pseudo-ensemble training. Our approach has achieved state-of-the-art results on mainstream neuromorphic datasets. Additionally, we have reached a performance of 77.1\% on the Imagenet-1K dataset using a 104-layer ResNet architecture enhanced with SMA and AZO. This achievement confirms the state-of-the-art performance of SNNs with non-transformer architectures and underscores the effectiveness of our method in bridging the performance gap between SNN models and traditional ANN models.

Abstract: Multimodal sentiment analysis aims to integrate diverse modalities for precise emotional interpretation. However, external factors such as sensor malfunctions or network issues may disrupt certain modalities. This may lead to missing data, which poses challenges in realworld deployment. Most existing approaches focus on designing feature reconstruction strategies, overlooking the collaborative integration of reconstruction and fusion strategies. Moreover, they fail to capture the relationships between features in the global dimension and those in the local dimension. These limitations hinder the full capture of the complex nature of multimodal data, especially in scenarios involving missing modalities. To address the above issues, this paper proposes a robust model named MFMB-Net with multiple branches for feature multi-focus fusion and reconstruction. We design a two-stream fusion branch where macro-fusion focuses on the fusion of features in the global dimension and micro-fusion targets local dimension features. This dual-stream fusion branch distributes multi-focus across both pathways, simultaneously capturing global coarse-grained and local fine-grained features. Additionally, the reconstruction branch interacts collaboratively with the fusion branch to reconstruct and enhance the missing data. It integrates the reconstructed feature information with the fused information thus refining the representation fidelity of the missing information. Experiments performed on two benchmarks show that our approach obtains results superior to state-of-the-art models.

Abstract: Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture longterm dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.

School of Computer Science and Information Engineering, Hubei University, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application, China, School of Computer Science and Information Engineering, Hubei University, China Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, China, School of Computer Science and Information Engineering, Hubei University, China Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, China, School of Computer Science and Information Engineering, Hubei University, China Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, China, School of Computer Science and Information Engineering, Hubei University, China Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, China, School of Computer Science and Information Engineering, Hubei University, China Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, China, School of Computer Science and Information Engineering, Hubei University, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application, China

Abstract: Learning concept prerequisite relations helps better master and build a logically coherent knowledge structure. Many studies use graph neural networks to create heterogeneous knowledge networks that enhance concept representations. However, different types of relations in these networks can influence each other. Existing research often focuses solely on concept relations, neglecting other types of knowledge connections. To address this issue, this paper proposes a novel concept prerequisite relation learning model, named the Global Knowledge Relation Optimization Model(GKROM). Specifically, we capture the impact of different knowledge relation types on document and concept semantic representations separately, integrating the document and concept semantic representations. Then, we introduce multiobjective learning to optimize the knowledge relation network from a global perspective. Through the above optimization, GKROM learns richer semantic representations for concepts and documents, improving the accuracy of concept prerequisite relation learning. Extensive experiments on public datasets demonstrate the effectiveness of our GKROM, achieving state-of-the-art performance in concept prerequisite relation learning.

Abstract: We address an advanced challenge of predicting pedestrian occupancy as an extension of multiview pedestrian detection in urban traffic. To support this, we have created a new synthetic dataset called MVP-Occ, designed for dense pedestrian scenarios in large-scale scenes. Our dataset provides detailed representations of pedestrians using voxel structures, accompanied by rich semantic scene understanding labels, facilitating visual navigation and insights into pedestrian spatial information. Furthermore, we present a robust baseline model, termed OmniOcc, capable of predicting both the voxel occupancy state and panoptic labels for the entire scene from multi-view images. Through in-depth analysis, we identify and evaluate the key elements of our proposed model, highlighting their specific contributions and importance.

Abstract: This paper challenges the prevailing view that convolutional neural network (CNN) filters become increasingly specialized in deeper layers. Motivated by recent observations of clusterable repeating patterns in depthwise separable CNNs (DSCNNs) trained on ImageNet, we extend this investigation across various domains and datasets. Our analysis of DS-CNNs reveals that deep filters maintain generality, contradicting the expected transition to class-specific features. We demonstrate the generalizability of these filters through transfer learning experiments, showing that frozen filters from models trained on different datasets perform well and can be further improved when sourced from larger, better-performing models. Our findings indicate that spatial features learned by depthwise separable convolutions remain generic across all layers, domains, and architectures. This research provides new insights into the nature of generalization in neural networks, particularly in DS-CNNs, and has significant implications for transfer learning and model design.

Abstract: Finegrained domain generalization (FGDG) aims to learn a fine-grained representation that can be well generalized to unseen target domains when only trained on the source domain data. Compared with generic domain generalization, FGDG is particularly challenging in that the fine-grained category can be only discerned by some subtle and tiny patterns. Such patterns are particularly fragile under the cross-domain style shifts caused by illumination, color and etc. To push this frontier, this paper presents a novel Hyperbolic State Space Hallucination (HSSH) method. It consists of two key components, namely, state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity for the state embeddings by firstly extrapolating and then hallucinating the source images. Then, the pre- and post- style hallucinate state embeddings are projected into the hyperbolic manifold. The hyperbolic state space models the high-order statistics, and allows a better discernment of the fine-grained patterns. Finally, the hyperbolic distance is minimized, so that the impact of style variation on fine-grained patterns can be eliminated. Experiments on three FGDG benchmarks demonstrate its state-of-the-art performance.

Abstract: Existing crossmodal retrieval methods typically rely on large-scale vision-language pair data. This makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data, has gained increasing attention. As a general parameter-efficient way, a common solution is to utilize adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, making it difficult to adapt to target-language captions with varied expressions. To alleviate it, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions. Considering that the semantics and expression styles of the input caption largely influence how to encode it, we propose a semantic disentangling module to extract the semantic-related and semantic-agnostic features from the input, ensuring that generated adapters are well-suited to the characteristics of input caption. Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models.

Abstract: Affordance refers to the interactable functional properties of an object, and affordance segmentation aims to pixellevel segment the object functional parts in a given image, which is crucial for various interactive vision tasks. Existing methods address the affordance segmentation problem by utilizing only image features, they can hardly solve the problems of interference between adjacent object pixels in complex scenes, and inability to generalize to the open-world. To tackle these problems, we propose a novel open-vocabulary affordance segmentation task and a benchmark dataset, and propose an approach with object shape mask prompts. The mask is used as prior for different granularity visual feature enhancement and fine-grained text prompt embedding. Specifically, we first propose a mask prompt generation module, which generates refined object shape masks, as well as text prompts for mask-focused regions. Based on the masks, we propose a mask prompt feature enhancement module. It uses masks to encode instance features, and then aggregates them with global features to enhance the visual feature representation. The enhanced visual features are combined with text prompts of different granularity to generate class-agnostic affordance mask proposals. We finally classify these proposals in a proposed affordance prediction module. Quantitative and qualitative evaluations compared with state-of-the-art methods demonstrate that the proposed method achieves superior performance on a proposed benchmark dataset. Our approach is also competitive on other open-vocabulary part segmentation datasets.

Abstract: The reconstruction of lowtextured areas is a prominent research focus in multi-view stereo (MVS). In recent years, traditional MVS methods have performed exceptionally well in reconstructing low-textured areas by constructing plane models. However, these methods often encounter issues such as crossing object boundaries and limited perception ranges, which undermine the robustness of plane model construction. Building on previous work (APD-MVS), we propose the DPE-MVS method. By introducing dual-level precision edge information, including fine and coarse edges, we enhance the robustness of plane model construction, thereby improving reconstruction accuracy in low-textured areas. Furthermore, by leveraging edge information, we refine the sampling strategy in conventional PatchMatch MVS and propose an adaptive patch size adjustment approach to optimize matching cost calculation in both stochastic and low-textured areas. This additional use of edge information allows for more precise and robust matching. Our method achieves state-of-the-art performance on the ETH3D and Tanks & Temples benchmarks. Notably, our method outperforms all published methods on the ETH3D benchmark.

Abstract: Subjectdriven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which will misconstrue the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject intrinsic attributes. This misconstruction leads to both overfitting or underfitting of irrelevant and intrinsic attributes of the subject, i.e., these attributes are over-represented or under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects have distinguished differences). Specifically, we propose CustomContrast, a novel framework, which includes a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is used to extract intrinsic features of subjects from high-level semantics to low-level appearance through crossmodal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.

Abstract: With the continuous improvement of device imaging resolution, the popularity of UltraHigh-Definition (UHD) images is increasing. Unfortunately, existing methods for fusing multi-exposure images in dynamic scenes are designed for low-resolution images, which makes them inefficient for generating high-quality UHD images on a resource-constrained device. To alleviate the limitations of extremely long-sequence inputs, inspired by the Large Language Model (LLM) for processing infinitely long texts, we propose a novel learning paradigm to achieve UHD multi-exposure dynamic scene image fusion on a single consumer-grade GPU, named Infinite Pixel Learning (IPL). The design of our approach comes from three key components: The first step is to slice the input sequences to relieve the pressure generated by the model processing the data stream; Second, we develop an attention cache technique, which is similar to the KV cache for infinite data stream processing; Finally, we design a method for attention cache compression to alleviate the storage burden of the cache on the device. In addition, we provide a new UHD benchmark to evaluate the effectiveness of our method. Extensive experimental results show that our method maintains high-quality visual performance while fusing UHD dynamic multi-exposure images in real-time (>40fps) on a single consumer-grade GPU.

Abstract: Generalized fewshot semantic segmentation (GFSS) aims to segment objects of both base and novel classes, using sufficient samples of base classes and few samples of novel classes. Representative GFSS approaches typically employ a two-phase training scheme, involving base class pre-training followed by novel class fine-tuning, to learn the classifiers for base and novel classes respectively. Nevertheless, distribution gap exists between base and novel classes in this process. To narrow this gap, we exploit effective knowledge transfer from base to novel classes. First, a novel prototype modulation module is designed to modulate novel class prototypes by exploiting the correlations between base and novel classes. Second, a novel classifier calibration module is proposed to calibrate the weight distribution of the novel classifier according to that of the base classifier. Furthermore, existing GFSS approaches suffer from a lack of contextual information for novel classes due to their limited samples, we thereby introduce a context consistency learning scheme to transfer the contextual knowledge from base to novel classes. Extensive experiments on PASCAL-5i and COCO-20i demonstrate that our approach significantly enhances the state of the art in the GFSS setting.

Abstract: Tongue diagnosis is a vital tool in both Western and Traditional Chinese Medicine, providing key insights into a patient's health by analyzing tongue attributes. The COVID19 pandemic has heightened the need for accurate remote medical assessments, emphasizing the importance of precise tongue attribute recognition via telehealth. To address this, we propose a Sign-Oriented multi-label Attributes Detection Framework. Our approach begins with an adaptive tongue feature extraction module that standardizes tongue images and mitigates environmental factors. This is followed by a Sign-oriented Network (SignNet) that identifies specific tongue attributes, emulating the diagnostic process of experienced practitioners and enabling comprehensive health evaluations. To validate our methodology, we developed an extensive tongue image dataset specifically designed for telemedicine. Unlike existing datasets, ours is tailored for remote diagnosis, with a comprehensive set of attribute labels. This dataset will be openly available, providing a valuable resource for research. Initial tests have shown improved accuracy in detecting various tongue attributes, highlighting our framework's potential as an essential tool for remote medical assessments.

Abstract: Scalable Vector Graphics (SVG) are essential XMLbased formats for versatile graphics, offering resolution independence and scalability. Unlike raster images, SVGs use geometric shapes and support interactivity, animation, and manipulation via CSS and JavaScript. Current SVG generation methods face challenges related to high computational costs and complexity. In contrast, human designers use component-based tools for efficient SVG creation. Inspired by this, SVGBuilder introduces a component-based, autoregressive model for generating high-quality colored SVGs from textual input. It significantly reduces computational overhead and improves efficiency compared to traditional methods. Our model generates SVGs up to 604 times faster than optimization-based approaches. To address the limitations of existing SVG datasets and support our research, we introduce ColorSVG-100K, the first large-scale dataset of colored SVGs, comprising 100,000 graphics. This dataset fills the gap in color information for SVG generation models and enhances diversity in model training. Evaluation against state-of-the-art models demonstrates SVGBuilder's superior performance in practical applications, highlighting its efficiency and quality in generating complex SVG graphics.

Abstract: High Dynamic Range (HDR) video reconstruction seeks to accurately restore the extensive dynamic range present in realworld scenes and is widely employed in downstream applications. Existing methods typically operate on one or a small number of consecutive frames, which often leads to inconsistent brightness across the video due to their limited perspective on the video sequence. Moreover, supervised learning-based approaches are susceptible to data bias, resulting in reduced effectiveness when confronted with test inputs exhibiting a domain gap relative to the training data. To address these limitations, we present an event-guided HDR video reconstruction method through building 3D Gaussian Splatting (3DGS), to ensure consistent brightness imposed by 3D consistency. We introduce HDR 3D Gaussians capable of simultaneously representing HDR and low-dynamic-range (LDR) colors. Furthermore, we incorporate a learnable HDR-to-LDR transformation optimized by input event streams and LDR frames to eliminate the data bias. Experimental results on both synthetic and real-world datasets demonstrate that the proposed method achieves state-of-the-art performance.

Abstract: The area of portrait image animation, propelled by audio input, has witnessed notable progress in the generation of lifelike and dynamic portraits. Conventional methods are limited to utilizing either audios or facial key points to drive images into videos, while they can yield satisfactory results, certain issues exist. For instance, methods driven solely by audios can be unstable at times due to the relatively weaker audio signal, while methods driven exclusively by facial key points, although more stable in driving, can result in unnatural outcomes due to the excessive control of key point information. In addressing the previously mentioned challenges, in this paper, we introduce a novel approach which we named EchoMimic. EchoMimic is concurrently trained using both audios and facial landmarks. Through the implementation of a novel training strategy, EchoMimic is capable of generating portrait videos not only by audios and facial landmarks individually, but also by a combination of both audios and selected facial landmarks. EchoMimic has been comprehensively compared with alternative algorithms across various public datasets and our collected dataset, showcasing superior performance in both quantitative and qualitative evaluations. The code and models are available on the project page.

National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China Ningbo Institute of Northwestern Polytechnical University, Ningbo, China

Abstract: Although recent years have witnessed significant advancements in medical image segmentation, the pervasive issue of domain shift among medical images from diverse centres hinders the effective deployment of pretrained models. Many Test-time Adaptation (TTA) methods have been proposed to address this issue by fine-tuning pre-trained models with test data during inference. These methods, however, often suffer from less-satisfactory optimization due to suboptimal optimization direction (dictated by the gradient) and fixed step-size (predicated on the learning rate). In this paper, we propose the Gradient alignment-based Test-time adaptation (GraTa) method to improve both the gradient direction and learning rate in the optimization procedure. Unlike conventional TTA methods, which primarily optimize the pseudo gradient derived from a self-supervised objective, our method incorporates an auxiliary gradient with the pseudo one to facilitate gradient alignment. Such gradient alignment enables the model to excavate the similarities between different gradients and correct the gradient direction to approximate the empirical gradient related to the current segmentation task. Additionally, we design a dynamic learning rate based on the cosine similarity between the pseudo and auxiliary gradients, thereby empowering the adaptive fine-tuning of pre-trained models on diverse test data. Extensive experiments establish the effectiveness of the proposed gradient alignment and dynamic learning rate and substantiate the superiority of our GraTa method over other state-of-the-art TTA methods on a benchmark medical image segmentation task.

Abstract: In this paper, we propose a new benchmark called "Archaeological Piece Grouping." In the field of archaeology, it is common for broken archaeological pieces, such as artifact fragments, to be mixed. Archaeologists often spend significant time distinguishing these pieces and categorizing them into different groups. Our benchmark introduces a novel, comprehensive dataset named ArcPie, along with new evaluation metrics for this task. Additionally, we propose a new framework called "3D Probabilistic Graph Search" (3DPGS) to address the problem of grouping mixed archaeological pieces. This framework includes a relation network designed to learn the relationships among all the input 3D pieces. Utilizing the relationships learned, our framework generates a probabilistic matching graph that describes the affinity of any two pieces. We also introduce a novel search algorithm to identify groups according to this matrix. Our framework significantly outperforms other baselines.

Abstract: The method for imageto-point cloud registration typically determines the rigid transformation using a coarse-to-fine pipeline. However, directly and uniformly matching image patches with point cloud patches may lead to focusing on incorrect noise patches during matching while ignoring key ones. Moreover, due to the significant differences between image and point cloud modalities, it may be challenging to bridge the domain gap without specific improvements in design. To address the above issues, we innovatively propose the Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal Alignment Module (AMAM). Within the UHMM, we model the uncertainty of critical information in image patches and facilitate multi-level fusion interactions between image and point cloud features. In the AMAM, we design an adversarial approach to reduce the domain gap between image and point cloud. Extensive experiments and ablation studies on RGB-D Scene V2 and 7-Scenes benchmarks demonstrate the superiority of our method, making it a state-of-the-art approach for image-to-point cloud registration tasks.

Abstract: Partially Relevant Video Retrieval~(PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. Typical training processes of PRVR assume a oneto-one relationship where each text query is relevant to only one video. However, we point out the inherent ambiguity between text and video content based on their conceptual scope and propose a framework that incorporates this ambiguity into the model learning process. Specifically, we propose Ambiguity-Restrained representation Learning~(ARL) to address ambiguous text-video pairs. Initially, ARL detects ambiguous pairs based on two criteria: uncertainty and similarity. Uncertainty represents whether instances include commonly shared context across the dataset, while similarity indicates pair-wise semantic overlap. Then, with the detected ambiguous pairs, our ARL hierarchically learns the semantic relationship via multi-positive contrastive learning and dual triplet margin loss. Additionally, we delve into fine-grained relationships within the video instances. Unlike typical training at the text-video level, where pairwise information is provided, we address the inherent ambiguity within frames of the same untrimmed video, which often contains multiple contexts. This allows us to further enhance learning at the text-frame level. Lastly, we propose cross-model ambiguity detection to mitigate the error propagation that occurs when a single model is employed to detect ambiguous pairs for its training. With all components combined, our proposed method demonstrates its effectiveness in PRVR.

Abstract: Sampling strategy (e.g., fixed farthest point sampling) of point cloud has been an essential step for developing practical solutions in 3D computer vision tasks. Previous fixed sampling is simple, but suffer from suboptimal performance for downstream tasks. To adapt to target networks properly, adaptive sampling methods with trainable parameters have been recently developed to enhance the performance. However, existing adaptive sampling methods still suffer from the overcoupling problem of target network, and thus become model-specific, which limits their practical applications. To address this issue, we propose a novel general cross-scale decoupled sampling method (GCD-sampling) for point cloud, which consists of original feature cache, cross-scale feature fusion and convex combination learning for better feature extraction. To reduce the coupling relationship with the target task network, our method only utilizes the point cloud coordinates as the input and output of itself. Besides, we introduce an arbitrary scale structure to enable parameter sharing across multi-scale sampling in point cloud networks. Extensive experiments on different architectures demonstrate the effectiveness of our method over other existing adaptive sampling methods.

Abstract: Scenelevel point cloud registration is very challenging when considering dynamic foregrounds. Existing indoor datasets mostly assume rigid motions, so the trained models cannot robustly handle scenes with non-rigid motions. On the other hand, non-rigid datasets are mainly object-level, so the trained models cannot generalize well to complex scenes. This paper presents HybridReg, a new approach to 3D point cloud registration, learning uncertainty mask to account for hybrid motions: rigid for backgrounds and non-rigid/rigid for instance-level foregrounds. First, we build a scene-level 3D registration dataset, namely HybridMatch, designed specifically with strategies to arrange diverse deforming foregrounds in a controllable manner. Second, we account for different motion types and formulate a mask-learning module to alleviate the interference of deforming outliers. Third, we exploit a simple yet effective negative log-likelihood loss to adopt uncertainty to guide the feature extraction and correlation computation. To our best knowledge, HybridReg is the first work that exploits hybrid motions for robust point cloud registration. Extensive experiments show HybridReg's strengths, leading it to achieve state-of-the-art performance on both widely-used indoor and outdoor datasets.

Abstract: Retinexbased methods have become a general approach for solving low-light image enhancement (LLIE). However, traditional methods require post-processing of illumination (e.g., gamma correction), which lacks adaptability and disrupts the illumination structure. Retinex-based deep networks typically follow a ‘decomposition-adjustment-exposure control’ process, which is redundant and lacks robustness. One major issue is the inaccuracy in estimating and decomposing the initial illumination. Accurate initial illumination can prevent further post-processing instability. We propose IniRetinex, rethinking the Retinex-based LLIE method from the perspective of initialization. By using neural networks to provide reasonable initial illumination and solving for smooth illumination through optimization, higher performance LLIE is achieved. We construct a two-layer convolutional neural network to capture the low-frequency structure of the image, adaptively compensating for classical initial illumination and avoiding additional post-processing. The network requires no pre-training and can be implemented in an unsupervised manner with just a few iterations, making it highly efficient. Additionally, we propose a new illumination optimization strategy by introducing an additional proximal penalty term, improving illumination in areas with varying levels and enhancing image details. Extensive experiments on various low-light image datasets demonstrate that our method achieves state-of-the-art (SOTA) results on multiple benchmarks, offering higher stability and inference efficiency compared to current advanced methods.

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, Huaiyin Institute of Technology，China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China

Abstract: In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these intertwined tasks, we propose a novel depthcentric learning framework that integrates the ASM model with the BCC constraint. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously exploits adjacent dehazed frames to enhance depth estimation via BCC and uses the refined depth cues to more effectively remove haze through ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: D_MFIR enhances high-frequency details in dehazed videos, and D_MDR reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes.

Abstract: Given some videoquery pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous respectable TSG methods have achieved remarkable success, they train each video-query pair separately and ignore the relationship between different pairs. To this end, in this paper, we pose a brand-new setting: Multi-Pair TSG, which aims to co-train these pairs. We propose a novel video-query co-training approach, Multi-Thread Knowledge Transfer Network, to locate a variety of video-query pairs effectively and efficiently. Firstly, we mine the spatial and temporal semantics across different queries to cooperate with each other. To learn intra- and inter-modal representations simultaneously, we design a cross-modal contrast module to explore the semantic consistency by a self-supervised strategy. To fully align visual and textual representations between different pairs, we design a prototype alignment strategy to 1) match object prototypes and phrase prototypes for spatial alignment, and 2) align activity prototypes and sentence prototypes for temporal alignment. Finally, we develop an adaptive negative selection module to adaptively generate a threshold for cross-modal matching. Extensive experiments show the effectiveness and efficiency of our proposed method.

Institute of High Performance Computing, Singapore, A*STAR, Institute of High Performance Computing, Singapore, A*STAR, Institute of High Performance Computing, Singapore, A*STAR, The Chinese University of Hong Kong, Shenzhen, Mohamed bin Zayed University of Artificial Intelligence, (MBZUAI), UAE Australian National University, Canberra ACT, Australia, Harbin Institute of Technology, Institute of High Performance Computing, Singapore, A*STAR, Institute of High Performance Computing, Singapore, A*STAR

Abstract: Albeit progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failure retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a postprocessing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failure retrieval results being inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can find the images inconsistent with relative caption when the answer by VQA is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.

Abstract: Defocus deblurring is a challenging task due to the spatially varying nature of defocus blur with multiple plausible solutions of a single given image. However, most existing methods falter when faced with extensive and variable defocus blur, either ignoring it or relying on additional loss functions to enhance perceptual quality. This often results in unrealistic reconstructions and compromised generalizability. In this paper, we propose a novel Residual Diffusion Deblurring Model framework for single image defocus deblurring. Our approach integrates a pretrained defocus map estimator and a lightweight pre-deblur module with a learnable receptive field, providing crucial posterior information to effectively address large-scale and varying shaped defocus blur. In addition, a carefully-design denoising network enables the generation of diverse reconstructions from a single input. This approach not only significantly improves the perceptual quality of defocus deblurring outputs through multi-step residual learning, but also offers a more efficient inference strategy. Experimental results demonstrate that our method achieves competitive performance on real-world defocus deblurring image datasets across both perceptual and distortion evaluation metrics.

Abstract: The generalization problem is broadly recognized as a critical challenge in detecting deepfakes. Most previous work believes that the generalization gap is caused by the differences among various forgery methods. However, our investigation reveals that the generalization issue can still occur when forgeryirrelevant factors shift. In this work, we identify two biases that detectors may also be prone to overfitting: position bias and content bias, as depicted in Fig. 1. For the position bias, we observe that detectors are prone to “lazily” depending on the specific positions within an image (e.g., central regions even no forgery). As for content bias, we argue that detectors may potentially and mistakenly utilize forgery-unrelated information for detection (e.g., background, and hair). To intervene in these biases, we propose two branches for shuffling and mixing with tokens in the latent space of transformers. For the shuffling branch, we rearrange the tokens and corresponding position embedding for each image while maintaining the local correlation. For the mixing branch, we randomly select and mix the tokens in the latent space between two images with the same label within the mini-batch to recombine the content information. During the learning process, we align the outputs of detectors from different branches in both feature space and logit space. Contrastive losses for features and divergence losses for logits are applied to obtain unbiased feature representation and classifiers. We demonstrate and verify the effectiveness of our method through extensive experiments on widely used evaluation datasets.

School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, School of Aeronautics and Astronautics, University of Electronic Science and Technology of China Aircraft Swarm Intelligent Sensing and Cooperative Control Key Laboratory of Sichuan Province, School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Technology, Tongji University

Abstract: Videos inherently contain complex temporal dynamics across various spatial directions, often entangled in ways that obscure effective dynamic extraction. Previous studies typically process video spatiotemporal features without disentangling, which hampers their ability to extract dynamic information. Additionally, the extraction of dynamics is disrupted by transient highdynamic information in video sequences, e.g., noise or flicker, which has received limited attention in the literature. To tackle those problems, this paper proposes the Disentangling and Filtering Dynamics Network (DFDNet). Firstly, to disentangle the interwoven dynamics, DFDNet decomposes the spatially encoded video sequences into lower dimensional sequences. Secondly, a learnable threshold filter is proposed to eliminate the transient high-dynamic information. Thirdly, the model incorporates an MLP to extract the temporal dependencies from the disentangled and filtered sequences. DFDNet demonstrates competitive performance across four chosen datasets, including both low and high-resolution videos. Specifically, on the low-resolution Moving MNIST dataset, DFDNet achieves a 19% improvement on MSE over the previous state-of-the-art model. On the high-resolution SJTU4K dataset, it outperforms the previous state-of-the-art model by 10% on the LPIPS metric under similar inference time.

Abstract: 3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single humanobject interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. Besides, the Multi-Image and Point Affordance (MIPA) benchmark is constructed and our method outperforms existing state-of-the-art methods on various experimental comparisons.

Abstract: In recent years, deep learning has revenue in automated medical landmark detection. Nonetheless, prevailing research in this field predominantly addresses singlecenter scenarios or domain adaptation settings. In practical environments, the acquisition of multi-center data faces privacy concerns, coupled with the time-intensive and costly nature of data collection and annotation. These challenges substantially impede the broader application of deep learning-based medical landmark detection. To mitigate these issues, we propose a novel domain-generalized medical landmark detection framework that relies solely on single-center data for training. Considering the availability of numerous public medical segmentation datasets, we design a simple yet effective method that utilizes single-center segmentation to enhance the domain generalization capabilities of the landmark detection task. Specifically, we introduce a novel boundary-aware pre-training approach to focus the model on regions pertinent to landmarks. To further enhance the robustness and generalization capabilities during pre-training, we have derived a mixing loss term and proved its effectiveness in theory and practice. Extensive experiments conducted on our new domain generalization benchmark for medical landmark detection demonstrate the superiority of our approach.

CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning, CompVis @ LMU Munich, Munich Center for Machine Learning

Abstract: Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noiseto-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its interpolation trajectories enhance both training and sampling efficiency while preserving high performance. While generative models typically require extensive training data, we mitigate this dependency by integrating external knowledge from a pre-trained image diffusion model, enabling effective transfer even across differing objectives. To further boost our model performance, we employ synthetic data and utilize image-depth pairs generated by a discriminative model on an in-the-wild image dataset. As a generative model, our model can reliably estimate depth confidence, which provides an additional advantage. Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and only requiring minimal synthetic data for training.

Abstract: Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRVbased methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods spatially require generating a high-dimension signal (i.e., an entire image) from the input of a low-dimension timestamp, and a video typically consists of tens of frames temporally that have a minor change between adjacent frames. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation for unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance with a multi-resolution loss aims to capture the information from different resolution stages, and the temporal guidance with an effective progressive learning strategy could gradually refine the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representations and video compression.

Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, Shanghai Collaborative Innovation Center on Intelligent Visual Computing

Abstract: The exceptional generative capability of textto-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model's ability to retain irrelevant concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the demage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module's outputs across different layers and timesteps, automatically balancing the erasure effects and model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure (detecting only 34 nude parts), Cartoon Concept Removal (with an average LPIPS_da of 0.428, 0.113 higher than SOTA at 0.315), and Artistic Style Erasure (with an average LPIPS_da of 0.387, 0.088 higher than SOTA at 0.299), clearly outperforming alternative methods.

Abstract: The rapid progress of multimedia technology has led to an increased focus on enhancing the quality of experience (QoE) for video. Specifically, the demand for lowlatency and high-quality decoding has grown significantly. Compressed Video Quality Enhancement (CVQE) methods based on Deep Neural Networks (DNNs) have achieved remarkable success. However, most of the methods suffer from high computational complexity, thereby limiting their practicality in low-latency scenarios. Recently, Look-Up Table (LUT) methods have shown great efficiency, which makes them considerably promising in the field of low-latency CVQE. In this paper, we propose an efficient multi-frame deformable Look-Up Table structure for CVQE. Firstly, we design an efficient CNN to explore the inter-frame correlation and then predict the multi-scale convolution offsets. Secondly, we introduce a temporal feature extraction module and a multi-scale fusion module. We first exploit the predicted offsets to guide sampling for precise temporal alignment and extract multi-frame information. Then, higher quality frames are reconstructed from the fused multi-scale features. During the inference, we convert these two modules into LUTs to achieve a sound trade-off between model performance and computational complexity. Experiments demonstrate that our proposed method dramatically outperforms the state-of-the-art LUT-based methods, and obtains competitive performance compared to CNN-based methods with the capability to run in real-time(30fps) at 1080p resolution.

Abstract: Transformers have revolutionized the point cloud learning task, but the quadratic complexity hinders its extension to long sequence and makes a burden on limited computational resources. The recent advent of RWKV, a fresh breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field with necessary modifications for point cloud learning tasks. Specifically, taking the embedded point patches as input, we first propose to explore the global processing capabilities within PointRWKV blocks using modified multiheaded matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch to encode the point cloud efficiently in a fixed radius near-neighbors graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show our proposed PointRWKV outperforms the transformer- and mamba-based counterparts, while significantly saving about 42\% FLOPs, demonstrating the potential option for constructing foundational 3D models.

Computer Science and Technology and Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Computer Science and Technology and Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Computer Science and Technology and Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, School of Computer Science and Technology, Harbin Institute of Technology

Abstract: Lens flares arise from light reflection and refraction within sensor arrays, whose diverse types include glow, veiling glare, reflective flare and so on. Existing methods are specialized for one specific type only, and overlook the simultaneous occurrence of multiple typed lens flares, which is common in the realworld, e.g. coexistence of glow and displacement reflections from the same light source. These co-occurring lens flares cannot be effectively resolved by the simple combination of individual flare removal methods, since these coexisting flares originates from the same light source and are generated simultaneously within the same sensor array, exhibit a complex interdependence rather than simple additive relation. To model this interdependent flares’ relationship, our Nighttime Lens Flare Formation model is the first attempt to learn the intrinsic physical relationship between flares on the imaging plane. Building on this physical model, we introduce a solution to this joint flare removal task named Self-supervised Generation-based Lens Flare Removal Network (SGLFR-Net), which is self-supervised without pre-training. Specifically, the nighttime glow is detangled in PSF Rendering Network(PSFR-Net) based on PSF Rendering Prior, while the reflective flare is modelled in Texture Prior Based Reflection Flare Removal Network (TPRR-Net). Empirical evaluations demonstrate the effectiveness of the proposed method in both joint and individual glare removal tasks.

Abstract: Zeroshot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings.

Abstract: Dynamic object modeling is a critical challenge in 3D scene reconstruction. Previous methods typically maintain a canonical space to represent the object model, and a deformation field to express the object motion. However, this approach fails when the object undergoes large motions. The position variation caused by significant motion not only complicates the establishment of a canonical space, but also misleads the interpretation of the deformation field. To overcome the above weaknesses, we propose Motion Decoupled Dynamic 3D Gaussian Splatting (M5DGS), the first 3D-GS model that separates motion and deformation modeling for dynamic object representation with large motion from a monocular camera. M5D-GS increases the practicality of 3D-GS, as it is common for objects to move, rotate, and deform simultaneously. Current datasets only contain object deformations with slight motions. We introduce a pipeline to reuse current benchmarks by adding large motions into the scene. We also introduce a new benchmark featuring several new scenes with complex motions, scenes augmented from previous datasets, and some real world recorded testcases, to fully demonstrate our improvements. Our M5D-GS significantly increases the accuracy under large motion scenarios while maintaining high rendering speed, which makes it suitable for dynamic object representation tasks including 4D novel view synthesis and real-time rendering.

Abstract: Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective training of advanced large visionlanguage models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

Abstract: Recently, there has been considerable exploration of methods for generating 3D point clouds, which is crucial for numerous 3D vision applications. Though conditional generation methods show promising performance, it depends on the additional paired label. On the other hand, unconditional generation methods usually fail to annotate the generated 3D point cloud. In this paper, we introduce a novel selfconditional architecture that trains on unlabeled data and then generates high-quality labeled 3D point clouds. Specifically, we design a module to extract geometry and view features, and then use a feature fusion module to integrate them as a substitute for label embedding in conditional point cloud generation. Then the point cloud generator is trained using the fused features. LPCG also harnesses CLIP to handle the view features of point clouds for generating label information. Besides, we train two feature diffusion modules to capture the essence of multimodal features and obtain diverse fused features for use as conditions in generating point clouds. Experiments on the ShapeNet dataset demonstrate that LPCG achieves state-of-the-art performance for single class generation. Our experimental results show that the accuracy of our generated label annotations reaches around 97.44% for a two-class generation task.

Abstract: Face antispoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. The one-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (\eg, facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (\eg, paper, or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelative domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.

Abstract: Dynamic 3D interaction has been attracting a lot of attention recently. However, creating such 4D content remains challenging. One solution is to animate 3D scenes with physicsbased simulation, which requires manually assigning precise physical properties to the object or the simulated results would become unnatural. Another solution is to learn the deformation of 3D objects with the distillation of video generative models, which, however, tends to produce 3D videos with small and discontinuous motions due to the inappropriate extraction and application of physics priors. In this work, to combine the strengths and complementing shortcomings of the above two solutions, we propose to learn the physical properties of a material field with video diffusion priors, and then utilize a physics-based Material-Point-Method (MPM) simulator to generate 4D content with realistic motions. In particular, we propose motion distillation sampling to emphasize video motion information during distillation. In addition, to facilitate the optimization, we further propose a KAN-based material field with frame boosting. Experimental results demonstrate that our method enjoys more realistic motions than state-of-the-arts do.

Abstract: Most indoor depth completion tasks rely on convolutional autoencoders to reconstruct depth images, especially in areas with significant missing values. While traditional convolution treats valid and missing pixels equally, Partial Convolution (PConv) has mitigated this limitation. However, PConv fails to distinguish the varying degree of invalidity across different missing areas, which highlights the need for a more refined strategy. To solve this problem, we propose a novel system for indoor depth completion tasks that leverages Mask-adaptive Gated Convolution (MagaConv). MagaConv utilizes gated signals to selectively apply convolution kernels based on the characteristics of missing depth data. These gating signals are generated using shared convolution kernels that jointly process depth features and corresponding masks, ensuring coherent weight optimization. Additionally, the mask undergoes iterative updates according to predefined rules. To improve the fusion of depth and color information, we introduce a Bi-directional Aligning Projection (Bid-AP) module, which utilizes a bi-directional projection scheme with global spatial-channel attention mechanisms to filter out depth-irrelevant features from other modalities. Extensive experiments on popular benchmarks, including NYU-Depth V2, DIML, and SUN RGB-D, demonstrate that our model outperforms state-of-the-art methods in both accuracy and efficiency.

Abstract: Exposure correction aims to adjust the exposure of an underand over-exposed image to enhance its overall visual quality. The core challenge of this task lies in that it requires to faithfully restore both the structure and perception information. In this work, we present a novel exposure correction method, referred to as CLIP-RestoreX, that leverages structural and perceptual priors from CLIP to tackle exposure correction. Specifically, we in CLIP-RestoreX propose to perform exposure correction by aligning CLIP-based structural and perceptual feature of the impaired image with its ground-truth image. To better restore the damaged structural information and perceptual information, we further design a frequency-domain based feature enhancement diffusion model, where we utilize the globality of Fourier transform to help reveal potential the relationship within the features. We conduct extensive experiments on several benchmark datasets. The results demonstrate that the proposed CLIP-RestoreX outperforms state-of-the-art exposure correction methods.

Abstract: Large Language Model (LLM)based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.

Abstract: Domain Adaptive Person Search (DAPS) aims to improve the generalization capability of person search models by training on both labeled source data and unlabeled target data, which is not that practical in realworld applications considering the storage/transmission costs and the privacy of source data. In this paper, we investigate a more practical and efficient person search setting, Source-Free Domain Adaptive Person Search (SFDA-PS), which seeks to generalize an existing source person search model to any unseen domain without requiring source data. Considering the absence of effective annotations in SFDA-PS, we propose a Doubly Contrastive Learning (DCL) method to adapt the target domain knowledge to the source model in a mutual learning and contrastive learning way. Specifically, we employ a mutual learning-based mean-teacher model as our baseline to incorporate target domain knowledge by pursuing prediction consistency between the teacher and student. Then, a Relation-embedded Contrastive (ReC) learning strategy is introduced to the detection head to ensure semantic consistency among proposals related to the same person while maintaining semantic distinction among proposals from different categories or persons. Furthermore, a Memory-aided Constrative (MaC) learning strategy is integrated into the re-identification (Re-ID) head to enhance its discriminative capability on target person embeddings. Extensive experiments on existing state-of-the-art person search models and two widely used benchmarks demonstrate the superiority of the proposed SFDA-PS task, as well as our proposed DCL.

College of Computer Science and Technology, Jilin University Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, College of Computer Science and Technology, Zhejiang Gongshang University, College of Computer Science and Technology, Jilin University Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, School of Computing, National University of Singapore, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, College of Computer Science and Technology, Zhejiang Gongshang University, College of Computer Science and Technology, Zhejiang Gongshang University

Abstract: Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and laborintensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.

Abstract: The rapid development of largescale deep learning models questions the affordability of hardware platforms, which necessitates the pruning to reduce their computational and memory footprints. Sparse neural networks as the product, have demonstrated numerous favorable benefits like low complexity, undamaged generalization, etc. Most of the prominent pruning strategies are invented from a model-centric perspective, focusing on searching and preserving crucial weights by analyzing network topologies. However, the role of data and its interplay with model-centric pruning has remained relatively unexplored. In this research, we introduce a novel data-model co-design perspective: to promote superior weight sparsity by learning important model topology and adequate input data in a synergetic manner. Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework. As a pioneering effort, this paper conducts systematic investigations about the impact of different visual prompts on model pruning and suggests an effective joint optimization approach. Extensive experiments with 3 network architectures and 8 datasets evidence the substantial performance improvements from VPNs over existing start-of-the-art pruning algorithms. Furthermore, we find that subnetworks discovered by VPNs from pre-trained models enjoy better transferability across diverse downstream scenarios. These insights shed light on new promising possibilities of data-model co-designs for vision model sparsification.

Abstract: The gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, exhibiting a promising technique for unrestrained human identification. With largely excluding gaitunrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have acted as two of the most prevailing gait modalities for a long time. Recently, several attempts have been made to introduce more informative data forms like human parsing and optical flow images to capture gait characteristics, along with multi-branch architectures. However, due to the inconsistency within model designs and experiment settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, involving the representational capacity and fusion strategy exploration, is still lacking. From the perspectives of fine vs. coarse-grained shape and whole vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a C²Fusion strategy, consequently building our new framework MultiGait++. C²Fusion preserves commonalities while highlighting differences to enrich the learning of gait features. To verify our findings and conclusions, extensive experiments on Gait3D, GREW, CCPG, and SUSTech1K are conducted.

Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany, Department of Computer Science, RWTH Aachen University, Aachen, Germany, Department of Computer Science, RWTH Aachen University, Aachen, Germany, Department of Computer Science, RWTH Aachen University, Aachen, Germany, Department of Computer Science, RWTH Aachen University, Aachen, Germany Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany, Independent Researcher, Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany

Abstract: Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining highquality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on public benchmarks, MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7% along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.

Abstract: Recent advancements in selfsupervised denoising have made it possible to train models without needing a large amount of noisy-clean image pairs. A significant development in this area is the use of blind-spot networks (BSNs), which use single noisy images as training pairs by masking some input information to prevent noise transmission to the network output. Researchers have shown that BSNs are capable of reconstructing clean pixels from various types of independent pixel-wise degradations, such as synthetic additive white Gaussian noise (AWGN). However, unlike synthetic noise, real noise often contains highly correlated components which can induce noise transmission and reduce the performance of BSNs. To address the spatial correlation of real noise, we propose the Adjacent Pixel Replacer (APR), which decorrelates noise without a downsampling process that is widely adopted in previous research. The dissimilarity in our APR-generated pairs serves as relatively different noise components during training. Hence, it enables the BSN to block noise transmission while utilizing clean information effectively. As a result, BSN can utilize denser information to reconstruct the corresponding center pixel. We also propose Recharged Distillation (RD) to enhance high-frequency textures without additional network modifications. This method selectively refines clean information from recharged noisy pixels during distillation. Extensive experimental results demonstrate that our proposed method outperforms the existing state-of-the-art self-supervised denoising methods in real sRGB space.

Abstract: With the growing demand for solutions to realworld video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical dense memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical dense memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.

Abstract: Conventional GANbased models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models have attempted to address these limitations and improve fidelity. However, they still face challenges, such as intensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, called MoDiTalker. We introduce two modules: the Audio-To-Motion (AToM) module, designed to generate synchronized lip movements from audio, and the Motion-To-Video (MToV) module, designed to produce high-quality talking head videos based on the generated motions. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. Additionally, MToV enhances temporal consistency by utilizing an efficient tri-plane representation. Our experiments on standard benchmarks demonstrate that our model outperforms existing GAN-based and diffusion-based models. We also provide comprehensive ablation studies and user study results.

Abstract: This paper introduces a method for efficiently interpolating 3D dynamic sequences using truncated signed distance function (TSDF) volumes. The method calculates bidirectional motions between TSDF volumes of two frames and refines them to reconstruct intermediate frames. Unlike point cloud-based methods, which can suffer from varying and irregular point densities, the uniform and dense grid structure of TSDF offers a consistent framework for estimating the true motion of objects within a scene. In our experiments, the TSDF-based method offers more precise and reliable smooth motion prediction compared to the often error-prone surface depiction in point clouds. Experimental results demonstrate improved accuracy and reduced computational complexity, making it suitable for real-time applications.

Abstract: Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works only utilize the retrieved text as text prompts, and the visual information relies only on the CLIP visual embedding. Because of this issue, there is a limitation that the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval textbased visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that contain image information. These retrieved features are integrated into the image and designated as the visual prompt, leading to performance improvements on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, demonstrating the potential for a plug-and-play solution.

Abstract: Largescale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains due to their creative and high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generating images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing the aforementioned issues typically necessitates training or fine-tuning models on higher-resolution datasets. However, this poses a formidable challenge due to the difficulty in collecting large-scale high-resolution images and substantial computational resources. While several preceding works have proposed alternatives to bypass the cumbersome training process, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at higher resolution beyond their original capability and propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images. Additionally, we integrate an image sharpening operation into our pipeline, further enhancing image quality. Our method obviates the need for additional training or fine-tuning which significantly lowers the burden of computational costs. Extensive experiments and results validate the efficiency and efficacy of our method.

Abstract: Understanding the 3D semantics of a scene is a fundamental problem for various scenarios such as embodied agents. While NeRFs and 3DGS excel at novelview synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are rendered as 2D masks that do not represent the entire 3D space. To address this limitation, we redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding. We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. We transfer the learned language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. Lastly, we introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations are available at the project page.

Abstract: Object detection in Unmanned Aerial Vehicle (UAV) images has emerged as a focal area of research, which presents two significant challenges: i) objects are typically small and dense within vast images; ii) computational resource constraints render most models unsuitable for realtime deployment. Current real-time object detectors are not optimized for UAV images, and complex methods designed for small object detection often lack real-time capabilities. To address these challenges, we propose a novel detector, RemDet (Reparameter efficient multiplication Detector). Our contributions are as follows: 1) Rethinking the challenges of existing detectors for small and dense UAV images, and proposing information loss as a design guideline for efficient models. 2) We introduce the ChannelC2f module to enhance small object detection performance, demonstrating that high-dimensional representations can effectively mitigate information loss. 3) We design the GatedFFN module to provide not only strong performance but also low latency, effectively addressing the challenges of real-time detection. Our research reveals that GatedFFN, through the use of multiplication, is more cost-effective than feed-forward networks for high-dimensional representation. 4) We propose the CED module, which combines the advantages of ViT and CNN downsampling to effectively reduce information loss. It specifically enhances context information for small and dense objects. Extensive experiments on large UAV datasets, Visdrone and UAVDT, validate the real-time efficiency and superior performance of our methods. On the challenging UAV dataset VisDrone, our methods not only provided state-of-the-art results, improving detection by more than 3.4%, but also achieve 110 FPS on a single 4090.

Abstract: Social Intelligence Queries (SocialIQ) serve as the primary multimodal benchmark for evaluating a model’s social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model’s interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.

School of Informatics, Xiamen University, School of Informatics, Xiamen University Institute of Artificial Intelligence, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing,Ministry of Education of China, Xiamen University, School of Computer Science and Technology, East China Normal University Chongqing Institute of East China Normal University, School of Informatics, Xiamen University Institute of Artificial Intelligence, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing,Ministry of Education of China, Xiamen University

Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to utilize segmentation model training on known source domains to make predictions on unknown target domains. Currently, there are two network architectures: one based on Convolutional Neural Networks (CNNs) and the other based on Visual Transformers (ViTs). However, both CNNbased and ViT-based DGSS methods face challenges: the former lacks a global receptive field, while the latter requires more computational demands. Drawing inspiration from State Space Models (SSMs), which not only possess a global receptive field but also maintain linear complexity, we propose SSM-based method for achieving DGSS. In this work, we first elucidate why does mask make sense in SSM-based DGSS and propose our mask learning mechanism. Leveraging this mechanism, we present our Mask Vision Mamba network (MaskViM), a model for SSM-based DGSS, and design our mask loss to optimize MaskViM. Our method achieves superior performance on four diverse DGSS setting, which demonstrates the effectiveness of our method.

Abstract: Infraredvisible object detection (IVOD) seeks to harness the complementary information in infrared and visible images, thereby enhancing the performance of detectors in complex environments. However, existing methods often neglect the frequency characteristics of complementary information, such as the abundant high-frequency details in visible images and the valuable low-frequency thermal information in infrared images, thus constraining detection performance. To solve this problem, we introduce a novel Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which effectively captures the unique frequency representations of complementary information across multimodal visual spaces. Speciﬁcally, we propose a feature decomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete cosine transform to capture representative high-frequency features, while the low-frequency unit (LFU) employs dynamic receptive ﬁelds to model the multi-scale context of diverse objects. Next, we adopt a parameter-free complementary strengths strategy to enhance multimodal features through seamless inter-frequency recoupling. Furthermore, we innovatively design a multimodal reconstruction mechanism that recovers image details lost during feature extraction, further leveraging the complementary information from infrared and visible images to enhance overall representational capacity. Extensive experiments demonstrate that FD2-Net outperforms state-of-the-art (SoTA) models across various IVOD benchmarks, i.e. LLVIP (96.2% mAP), FLIR (82.9% mAP), and M3FD (83.5% mAP).

Abstract: In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or taskspecific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.

Abstract: Multiview 3D human pose estimation (MHPE) is an important research task in computer vision. To maintain consistency during the data collection, hardware synchronization devices are commonly used to connect cameras, ensuring that images from different views are captured simultaneously. However, synchronizing with extra devices has two apparent limitations: the hardware is i) usually expensive and ii) less flexible for deployment in outdoor open scenarios. Suppose the model can improve its tolerance for the time differences in multi-view image capture. In that case, the difficulty and cost of deployment will be greatly reduced, and MHPE will become more widespread. In this paper, we try to answer how to build a model that performs pose estimation directly using ''weakly synchronized images" from multiple views, where the captured images shift from each other within a frame. To this end, we introduce a new multi-view 3D human pose estimation task given weakly synchronized image inputs. Apart from existing well-synchronized datasets, we present the first weakly synchronized dataset comprising 800k images. Thereon, we propose SyncDiffPose, a novel model based on the diffusion method for pose estimation to denoise the error in such data. By combining simple synchronization strategies, e.g., the timer method, our approach can perform pose estimation without hardware calibration.

Abstract: Textbased 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potentials for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow convergence and over-smoothed textures. We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing. SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we set a group of anchor views and propagate them to their neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to enhance the reconstruction of precise geometries. Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures, by enhancing geometric consistency at the noise and pixel levels.

Abstract: Transferable adversarial examples highlight the vulnerability of deep neural networks (DNNs) to imperceptible perturbations across various realworld applications. While there have been notable advancements in untargeted transferable attacks, targeted transferable attacks remain a significant challenge. In this work, we focus on generative approaches for targeted transferable attacks. Current generative attacks focus on reducing overfitting to surrogate models and the source data domain, but they often overlook the importance of enhancing transferability through additional semantics. To address this issue, we introduce a novel plug-and-play module into the general generator architecture to enhance adversarial transferability. Specifically, we propose a Semantic Injection Module (SIM) that utilizes the semantics contained in an additional guiding image to improve transferability. The guiding image provides a simple yet effective method to incorporate target semantics from the target class to create targeted and highly transferable attacks. Additionally, we propose new loss formulations that can integrate the semantic injection module more effectively for both targeted and untargeted attacks. We conduct comprehensive experiments under both targeted and untargeted attack settings to demonstrate the efficacy of our proposed approach.

Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Colleges and Universities Key Laboratory of Intelligent Software, Wuzhou University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Big Data Mining and Knowledge Management, University of Chinese Academy of Sciences, Guangxi Colleges and Universities Key Laboratory of Intelligent Software, Wuzhou University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University

Abstract: Effectively constructing context information with longterm dependencies from video sequences is crucial for object tracking. However, the context length constructed by existing work is limited, only considering object information from adjacent frames or video clips, leading to insufficient utilization of contextual information. To address this issue, we propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target change cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism. The target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target change cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to continuously extend the length of the context, capturing complete target change cues, which enhances the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time runing speeds.

Abstract: Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks. Currently, deep learning BIQA methods typically depend on using features from highlevel tasks for transfer learning. However, the inherent differences between BIQA and these high-level tasks inevitably introduce noise into the quality-aware features. In this paper, we take an initial step toward exploring the diffusion model for feature denoising in BIQA, namely Perceptual Feature Diffusion for IQA (PFD-IQA), which aims to remove noise from quality-aware features. Specifically, 1) we propose a Perceptual Prior Discovery and Aggregation module to establish two auxiliary tasks to discover potential low-level features in images that are used to aggregate perceptual textual prompt conditions for the diffusion model. 2) we propose a Perceptual Conditional Feature Refinement strategy, which matches noisy features to predefined denoising trajectories and then performs exact feature denoising based on textual prompt conditions. By incorporating a lightweight denoiser and requiring only a few feature denoising steps (e.g., just five iterations), our PFD-IQA framework achieves superior performance across eight standard BIQA datasets, validating its effectiveness.

Abstract: Point Transformers (PoinTr) have shown great potential in point cloud completion recently. Nevertheless, effective domain adaptation that improves transferability toward target domains remains unexplored. In this paper, we delve into this topic and empirically discover that direct feature alignment on point Transformer’s CNN backbone only brings limited improvements since it cannot guarantee sequencewise domain-invariant features in the Transformer. To this end, we propose a pioneering Domain Adaptive Point Transformer (DAPoinTr) framework for point cloud completion. DAPoinTr consists of three novel components: Domain Query-based Feature Alignment (DQFA), Point Token-wise Feature alignment (PTFA), and Voted Prediction Consistency (VPC). In particular, DQFA is presented to narrow the global domain gaps from the sequence via the presented domain proxy and domain query at the Transformer encoder and decoder, respectively. PTFA is proposed to close the local domain shifts by aligning the tokens, i.e., point proxy and dynamic query, at the Transformer encoder and decoder, respectively. VPC is designed to consider different Transformer decoders as multiple of experts (MoE) for ensembled prediction voting and pseudo-label generation. Extensive experiments with visualization on several challenging domain adaptation benchmarks demonstrate the effectiveness and superiority of our DAPoinTr compared with other state-of-the-art methods.

Abstract: 3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to the challenging realworld conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from the state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI360 datasets show that our approach not only outperforms other monocular techniques by a large margin, it also achieves competitive performance against stereo methods.

Abstract: Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardwareawareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models.

Abstract: Diffusion models for garmentcentric human generation from text or image prompts have garnered emerging attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generate inconsistent textures; while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for the garment-centric human generation. DreamFit has three key advantages: (1) Lightweight training: with the proposed adaptive attention and LoRA modules, DreamFit significantly minimizes the model complexity to 83.4M trainable parameters. (2) Anything-Dressing: Our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) Plug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both 768 x 512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities of garment-centric human generation.

Abstract: Event cameras, known for their low latency and high dynamic range, show great potential in pedestrian detection applications. However, while recent research has primarily focused on improving detection accuracy, the robustness of eventbased visual models against physical adversarial attacks has received limited attention. For example, adversarial physical objects, such as specific clothing patterns or accessories, can exploit inherent vulnerabilities in these systems, leading to misdetections or misclassifications. This study is the first to explore physical adversarial attacks on event-driven pedestrian detectors, specifically investigating whether certain clothing patterns worn by pedestrians can cause these detectors to fail, effectively rendering them unable to detect the person. To address this, we developed an end-to-end adversarial framework in the digital domain, framing the design of adversarial clothing textures as a 2D texture optimization problem. By crafting an effective adversarial loss function, the framework iteratively generates optimal textures through backpropagation. Our results demonstrate that the textures identified in the digital domain possess strong adversarial properties. Furthermore, we translated these digitally optimized textures into physical clothing and tested them in real-world scenarios, successfully demonstrating that the designed textures significantly degrade the performance of event-based pedestrian detection models. This work highlights the vulnerability of such models to physical adversarial attacks.

Abstract: Existing hands datasets are largely shortrange and the interaction is weak due to the self-occlusion and self-similarity of hands, which can not yet fit the need for interacting hands motion generation. To rescue the data scarcity, we propose HandDiffuse12.5M, a novel and real dataset that consists of temporal sequences with strong two-hand interactions. HandDiffuse12.5M has the largest scale and richest interactions among the existing two-hand datasets. We further present a strong baseline method HandDiffuse for the controllable motion generation of interacting hands using various controllers. Specifically, we apply the diffusion model as the backbone and design two motion representations for different controllers. To reduce artifacts, we also propose Interaction Loss which explicitly quantifies the dynamic interaction process. Our HandDiffuse enables various applications, i.e., motion in-betweening and trajectory controled generation. Experiments show that our method outperforms the state-of-the-art techniques in motion generation. The vivid two-hand motions generated by our method can also construct synthetic datasets and enhances the accuracy of existing hand motion capture algorithms.

Abstract: Modeling 3D openvocabulary language fields is challenging yet highly anticipated. Despite great progress, existing approaches heavily rely on a large number of training views to construct language-embedded 3D scenes, which is unfortunately impractical in real-world scenarios. This paper introduces SOVGaussian, the first method for few-shot novel view open-vocabulary language querying. We introduce a depth-constrained neural language field to mitigate the geometry degradation caused by overfitting training views. Rather than straightforwardly using dense depth maps for loosely accurate supervision, Language-Aware Depth Distillation (LAD) based on open-vocabulary object masks is proposed, ensuring intra-object geometric accuracy within the language field. To further refine the language-geometry consistency of the language field, we propose a novel Language-Guided Outlier Pruning (LOP) strategy, which identifies floating 3D Gaussian primitives overfitting training views based on their language-grouped densities. Our comprehensive experiments demonstrate that SOVGaussian is able to reconstruct a superior scene representation from few-shot images, outperforming existing state-of-the-art methods and achieving significantly better performance on novel view language querying and synthesis.

Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Tangshan Research Institute of Beijing Jiaotong University Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Hefei University of Technology, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, National University of Singapore, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology

Abstract: Zeroshot learning (ZSL) endeavors to transfer knowledge from the seen categories to recognize unseen categories, which mostly relies on the semantic-visual interactions between image and attribute tokens. Recently, the prompt learning has emerged in ZSL and demonstrated significant potential as it allows the zero-shot transfer of diverse visual concepts to downstream tasks. However, current methods explore the fixed adaptation of the learnable prompt on the seen domains, which make them over-emphasize the primary visual features observed during training, limiting their generalization capabilities to the unseen domains. In this work, we propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision. It is further integrated with primary visual features to attend to semantic-related information for visual enhancement, thus strengthening transferable ability. Experimental results on three benchmarks show that our AENet outperforms existing state-of-the-art ZSL methods.

Abstract: Allin-one restoration needs to implicitly distinguish between different degradation conditions and apply specific prior constraints accordingly. To fulfill this goal, our work makes the first effort to create an all-in-one restoration via unrolling from the typical maximum a-posterior optimization function. This unrolling framework naturally leads to the construction of progressively solving models, which are equivalent to a diffusion enhancer taking as input dynamically generated prompts. Under a score-based diffusion model, the prompts are integrated for propogating and updating several context-related variables, i.e. transmission map, atmospheric light map and noise or rain map progressively. Such learned prompt generation process, which simulates the nonlinear operations in the unrolled solution, is combined with linear operations owning clear physics implications to make the diffusion models well reguarlized and more effective in learning degradation-related visual priors. Experimental results demonstrate that our method achieves significant performance improvements across various image restoration tasks, realizing true all-in-one image restoration.

Abstract: Diffusion models have exhibited substantial success in textto-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.

Abstract: We present VQTalker, a Vector Quantizationbased framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512 × 512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation.

Abstract: Radiology report generation (RRG) models typically focus on individual exams, often overlooking the integration of historical visual or textual data, which is crucial for patient followups. Traditional methods usually struggle with long sequence dependencies when incorporating historical information, but large language models (LLMs) excel at in-context learning, making them well-suited for analyzing longitudinal medical data. In light of this, we propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for RRG, empowering LLMs with longitudinal report generation capabilities by constraining the consistency and differences between longitudinal images and their corresponding reports. Specifically, our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Then, we ensure consistent representation by applying intra-modality similarity constraints and aligning various features across modalities with multimodal contrastive and structural constraints. These combined constraints effectively guide the LLMs in generating diagnostic reports that accurately reflect the progression of the disease, achieving state-of-the-art results on the Longitudinal-MIMIC dataset. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models, enhancing its versatility.

Abstract: Knowledge Distillation (KD) is a promising approach for unsupervised Anomaly Detection (AD). However, the student network's overgeneralization often diminishes the crucial representation differences between teacher and student in anomalous regions, leading to detection failures. To address this problem, the widely accepted Reverse Distillation (RD) paradigm designs the asymmetry teacher and student network, using an encoder as teacher and a decoder as student. Yet, the design of RD does not ensure that the teacher encoder effectively distinguishes between normal and abnormal features or that the student decoder generates anomaly-free features. Additionally, the absence of skip connections results in a loss of fine details during feature reconstruction. To address these issues, we propose RD with Expert, which introduces a novel Expert-Teacher-Student network for simultaneous distillation of both the teacher encoder and student decoder. The added expert network enhances the student's ability to generate normal features and optimizes the teacher's differentiation between normal and abnormal features, reducing missed detections. Additionally, Guided Information Injection is designed to filter and transfer features from teacher to student, improving detail reconstruction and minimizing false positives. Experiments on several benchmarks prove that our method outperforms existing unsupervised AD methods under RD paradigm, fully unlocking RD’s potential.

College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education

Abstract: Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there exist inherent variations between vision and language data, such as information density, i.e., images can contain textual information from multiple different views, which makes it difficult to compute the similarity between these two modalities accurately and efficiently. In this paper, we propose a novel framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. To capture information from different views in the image, we design a radial bias sampling module to sample image patches and obtain image features from various views, Furthermore, AVSE introduces a novel module for efficient computation of visual semantic similarity between asymmetric image and text embeddings. Central to this module is the presumption of foundational semantic units within the embeddings, denoted as ``metasemantic embeddings." It segments all embeddings into meta-semantic embeddings with the same dimension and calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.

Abstract: Existing ultrahigh-definition (UHD) image restoration methods often struggle with consistency due to downsampling. We aim to address these challenges by leveraging the powerful latent space representation and reconstruction capabilities of Variational Autoencoders (VAE). However, applying VAE to UHD image restoration presents challenges: 1) High-performing VAEs have large parameter sizes, leading to significant carbon footprints; 2) The self-reconstruction property of VAE hinders bridging the domain gap between clean and degraded images; 3) Latent encoding in VAE can lose high-frequency information, compromising image detail. To overcome these challenges, we propose a frequency enhanced VAE UHD image restoration framework by integrating frequency priors. First, we design the Fourier-based lightweight frequency learning within the VAE to improve parameter efficiency. Then, we introduce a wavelet-based adapter that extracts multi-scale image information and employs frequency-aware adaptive modulation to bridge the domain gap by integrating degraded image data into the pre-trained VAE. Additionally, the adapter injects high-frequency information into the VAE decoder, enhancing detail in the restored images. In this way, our method effectively combines the powerful latent space representation with frequency priors to enhance UHD image restoration. Extensive experiments on various UHD image restoration tasks show that our method surpasses state-of-the-art methods both qualitatively and quantitatively.

Abstract: Visual Question Answering (VQA) has garnered significant attention as a crucial link between vision and language, aimed at generating accurate responses to visual queries. However, current VQA models still struggle with the challenges of minority class collapse and spurious semantic correlations posed by language bias and imbalanced distributions. To address these challenges, this paper proposes a novel PromptDriven Geometric Harmonization (PDGH) paradigm, which integrates both geometric structure and information entropy principles to enhance the ability of VQA models to generalize effectively across diverse scenarios. Specifically, our PDGH approach is meticulously designed to generate image-generated prompts that are guided by specific question cues, facilitating a more accurate and context-aware understanding of the visual content. Moreover, we project the prompt-visual-question and visual-question joint representations into a unified hypersphere space, applying feature weight self-orthogonality and prompt-information entropy correction constraints to optimize the margin, further alleviating minority class collapse and correcting language bias. To maintain the geometric integrity of the representation space, we introduce multi-space geometric contrast constraints to minimize the impact of spurious priors introduced during training. Finally, a semantic matrix is constructed for the coordinated joint representation to ensure that the learned instances are semantically consistent and improve reasoning ability. Extensive experiments on various general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.

Abstract: Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in visionbased sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language COntextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

Abstract: Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing indepth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that accompanied with extensive instruct-tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.

Abstract: Detecting and localizing objects in 3D space using multiple cameras, known as MultiCamera 3D Object Detection (MC3D-Det), has gained prominence with the advent of bird's-eye view (BEV) approaches. However, these methods often struggle with the serious domain gaps caused by various viewpoints and environments between the training and testing domains. To address this challenge, we propose a novel framework that aligns 3D detection with 2D camera plane results by perspective rendering, thus achieving consistent and accurate results when facing serious domain shifts. Our approach consists of two main steps in both source and target domains: 1) rendering diverse view maps from BEV features by leveraging implicit foreground volumes and 2) rectifying the perspective bias of these maps. This design promotes the learning of perspective- and context-independent features, crucial for accurate object detection across varying viewpoints, camera parameters, and environmental conditions. Notably, our model-agnostic approach preserves the original network structure without incurring additional inference costs, facilitating seamless integration across various models and simplifying deployment. Worth noting is that our approach achieves satisfactory results in real data when trained only with virtual datasets, eliminating the need for real scene annotations. Experimental results on both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) demonstrate its effectiveness.

Abstract: Zeroshot generalized anomaly detection (ZGAD) plays a critical role in industrial automation and health screening. Recent studies have shown that ZGAD methods built on visual-language models (VLMs) like CLIP have excellent cross-domain detection performance. Different from other computer vision tasks, ZGAD needs to jointly optimize both image-level anomaly classification and pixel-level anomaly segmentation tasks for determining whether an image contains anomalies and detecting anomalous parts of an image, respectively, this leads to different granularity of the tasks. However, existing methods ignore this problem, processing these two tasks with one set of broad text prompts used to describe the whole image. This limits CLIP to align textual features with pixel-level visual features and impairs anomaly segmentation performance. Therefore, for precise visual-text alignment, in this paper we propose a novel fine-grained text prompts generation strategy. We then apply the broad text prompts and the generated fine-grained text prompts for visual-textual alignment in classification and segmentation tasks, respectively, accurately capturing normal and anomalous instances in images. We also introduce the Text Prompt Shunt (TPS) model, which performs joint learning by reconstruction the complementary and dependency relationships between the two tasks to enhance anomaly detection performance. This enables our method to focus on fine-grained segmentation of anomalous targets while ensuring accurate anomaly classification, and achieve pixel-level comprehensible CLIP for the first time in the ZGAD task. Extensive experiments on 13 real-world anomaly detection datasets demonstrate that TPS achieves superior ZGAD performance across highly diverse datasets from industrial and medical domains.

Abstract: Open vocabulary object detection (OVOD) task aims to detect objects of novel categories beyond the base categories in the training set. To this end, the detector needs to access imagetext pairs containing rich semantic information or the visual language pre-trained model (VLM) learned on them. Recent OVOD methods rely on knowledge distillation from VLMs. However, there are two main problems in current methods: (1) Current knowledge distillation frameworks fail to take advantage of the global category information of VLMs and thus fail to learn category-specific knowledge. (2) Due to the overfitting phenomenon of base categories during training, current OVOD networks generally have the problem of suppressing novel categories as background. To address these two problems, we propose a Category Aware Knowledge Extraction framework (CAKE), which consists of a Category-Specific Knowledge Distillation branch (CSKD) and a Category Generalization Region Proposal Network (CG-RPN). CSKD can more fully extract category-strong related information through category-specific distillation, and it is also conducive to filtering the exclusion problem between individuals of the same category; in this process, the model constructs a category-specific feature set to maintain high-quality category features. CG-RPN leverages the guidance of feature set to adjust the confidence scores of region proposals, thereby mining proposals that potentially contain novel categories of objects. Extensive experiments show that our method can plug and play well with many existing methods and significantly improve their detection performance. Moreover, our CAKE framework can reach the-state-of-the-art performance on OV-COCO and OV-LVIS datasets.

Abstract: Novel view synthesis is a critical task in autonomous driving. Although 3D Gaussian Splatting (3DGS) has shown success in generating novel views, it faces challenges in maintaining high-quality rendering when viewpoints deviate significantly from the training set. This difficulty primarily stems from complex lighting conditions and geometric inconsistencies in texture-less regions. To address these issues, we propose an attention-based illumination model that leverages light fields from neighboring views, enhancing the realism of synthesized images. Additionally, we propose a geometry optimization method using planar homography to improve geometric consistency in texture-less regions. Our experiments demonstrate substantial improvements in synthesis quality for large-deviation viewpoints, validating the effectiveness of our approach.

Abstract: To equip artificial intelligence with a comprehensive understanding towards a temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across temporal dimensions (mask tubes), then predict their relations with temporal pooling operation, which does not fully utilize the motion indicative of the entities' relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion pattern for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subjectrelation-object triplets. Secondly, we seek to push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.

Abstract: Utilizing uniformly distributed sparse annotations, weakly supervised learning alleviates the heavy reliance on finegrained annotations in point cloud semantic segmentation tasks. However, few works discuss the inhomogeneity of sparse annotations, albeit it is common in real-world scenarios. Therefore, this work introduces the probability density function into the gradient sampling approximation method to qualitatively analyze the impact of annotation sparsity and inhomogeneity under weakly supervised learning. Based on our analysis, we propose an Adaptive Annotation Distribution Network (AADNet) capable of robust learning on arbitrarily distributed sparse annotations. Specifically, we propose a label-aware point cloud downsampling strategy to increase the proportion of annotations involved in the training stage. Furthermore, we design the multiplicative dynamic entropy as the gradient calibration function to mitigate the gradient bias caused by non-uniformly distributed sparse annotations and explicitly reduce the epistemic uncertainty. Without any prior restrictions and additional information, our proposed method achieves comprehensive performance improvements at multiple label rates and different annotation distributions.

Abstract: Fact verification has become increasingly vital in the internet age, driven by the proliferation of false claims and political misinformation. While traditional methods rely predominantly on textbased evidence, multi-modal evidence introduces richer sources of information, offering valuable insights for claim verification. Existing multi-modal verification models often focus on superficial correlations between claims and evidence, neglecting the complex semantic interactions present in fine-grained multi-modal signals. In this paper, we propose a novel framework for multi-modal fact-checking, named Hypergraph Transformer-based Multi-modal Fact-Checking (HGTMFC). Our approach captures high-order relationships between different modalities of evidence and claims by leveraging hypergraphs. HGTMFC models the intricate relationships among evidence across various modalities and enhances information propagation through a transformer-based mechanism embedded within the hypergraph. Moreover, we utilize linegraphs to refine this propagation process, further strengthening the model's reasoning capabilities. Experiments on benchmark datasets demonstrate that our model significantly outperforms existing approaches in multi-modal fact verification.

Abstract: We propose ForegroundCovering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively alternate the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performances on various datasets validate the effectiveness of the proposed method for FSS.

Abstract: Prior efforts in lightweight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands O(N^2). This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to O(N). Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates a atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevate the model performance. Experimental results show that, EfficientVMamba scales down the computational complexity while yields competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet.

Abstract: Unsupervised Person Reidentification (Re-ID) aims to identify the same person shot from non-overlapping cameras without any annotated data. In this task, attributes such as contrast, saturation, and resolution of the camera cause the deviation in target features. Since the camera label is readily available, they are employed to achieve the constraints across cameras and smooth the deviations during the model training phase. However, features from the same camera are prone to generating false positives due to the identical camera properties, which induce camera deviations on pseudo-label assignment. To address this problem, this paper proposes a novel camera-unbiased method named Camera Deviation Elimination Learning (CDE-Learning). In CDE-Learning, the Camera Deviation Compensation (CDC) module is designed to align data distributions from disparate cameras to decouple camera information from identity information during the pseudo-label allocation. Our Camera Deviation Balancing (CDB) module integrates different camera constraints in a united loss and adjusts camera constraints by constructing contrastive pairs between intra-camera and inter-camera. After explicit constraints, the Camera Attribution Auxiliary (CAA) task predicts whether a pair of images originates from the same camera to implicitly enhance the capacity to distinguish the camera deviation. We demonstrated the superior performance of the proposed CDE-Learning on benchmark datasets.

Abstract: The primary challenge of crossdomain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parameter adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results.

Abstract: Recently, memorybased methods have achieved progress in semi-supervised video object segmentation. However, these methods still suffer from unstructured challenges, such as object transformations, occlusions and disappearance-reappearance. To this end, we propose a Holistic Correction Network (HCNet) to adaptively acquire concise object prototypes for holistic correction at semantic, spatial and temporal aspects. Specifically, an Adaptive Prototype Update module is firstly designed to construct multi-level core object representations by associating object variations in consecutive frames with segmentation quality assessment. Based on the updated object prototypes, Semantic, Spatial and Temporal Correction modules are respectively designed to enhance the object semantics in the entire frame, eliminate the incorrect semantic enhancement outside the object regions and calibrate the estimated object regions with temporal changes of objects. Through the holistic correction mechanism with effective object prototypes, our proposed HCNet can robustly and efficiently deal with diverse complex scenarios. Extensive and comprehensive experiments conducted on seven datasets demonstrate that our proposed HCNet can significantly improve the segmentation performance.

Abstract: Unsupervised semantic segmentation algorithms aim to identify meaningful semantic groups without annotations. Recent approaches leveraging selfsupervised transformers as pre-training backbones have successfully obtained high-level dense features that effectively express semantic coherence. However, these methods often overlook local semantic coherence and low-level features such as color and texture. We propose integrating low-level visual cues to complement high-level visual cues derived from self-supervised pre-training branches. Our findings indicate that low-level visual cues provide a more coherent recognition of color-texture aspects, ensuring the continuity of spatial structures within classes. This insight led us to develop IL2Vseg, an unsupervised semantic segmentation method that leverages the complementation of low-level visual cues. The core of IL2Vseg is a spatially-constrained fuzzy clustering algorithm based on color affinities, which preserves the intra-class affinity of spatially-adjacent and similarly-colored pixels in low-level visual cues. Additionally, to effectively couple low-level and high-level visual cues, we introduce a feature similarity loss function to optimize the feature representation of fused visual cues. To further enhance consistent feature learning, we incorporate contrast loss functions based on color invariance and luminosity invariance, which improve the learning of features from different semantic categories. Extensive experiments on multiple datasets, including COCO-Stuff-27, Cityscapes, Potsdam, and MaSTr1325, demonstrate that IL2Vseg achieves state-of-the-art results.

Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Aritificial Intelligence, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Aritificial Intelligence, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Aritificial Intelligence, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Aritificial Intelligence, Beihang University, Beijing, China

Abstract: Neural implicit methods have made remarkable progress in 3D reconstruction. However, previous methods often assume viewindependent properties of target objects, which fails to accurately reconstruct objects with challenging characteristics, such as transparency and high reflectivity. To address this limitation, we propose a polarimetric implicit 3D reconstruction method that integrates geometric and polarization information, enabling the production of high-quality meshes in complex scenes. For high-fidelity surface reconstruction, we introduce a view-dependent physical representation that thoroughly analyzes the subtle physical properties of reflections. The reconstruction process is further enhanced by a simple yet effective view-dependent detection algorithm and optimized using the principles of ray tracing and polarization. Experimental results demonstrate the superior performance of the proposed method in both real and synthetic scenarios.

Abstract: Salient Object Detection (SOD) is crucial in computer vision, yet RGBbased methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images provide a promising solution for more accurate Hyperspectral Salient Object Detection (HSOD) by abundant spectral information, while HSOD methods are hindered by the lack of extensive and available datasets. In this context, we introduce HSOD-BIT-V2, the largest and most challenging HSOD benchmark dataset to date. Five distinct challenges focusing on small objects and foreground-background similarity are designed to emphasize spectral advantages and real-world complexity. To tackle these challenges, we propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing the self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios.

Abstract: Large visuallanguage models (LVLMs) have achieved great success in multiple applications. However, they still encounter challenges in complex scenes, especially those involving camouflaged objects. This is primarily due to the lack of samples related to camouflaged scenes in the training dataset. To mitigate this issue, we construct the MM-CamObj dataset for the first time, comprising two subsets: CamObj-Align and CamObj-Instruct. Specifically, CamObj-Align contains 11,363 image-text pairs, and it is designed for VL alignment and injecting rich knowledge of camouflaged scenes into LVLMs. CamObj-Instruct is collected for fine-tuning the LVLMs with improved instruction-following capabilities, and it includes 11,363 images and 68,849 conversations with diverse instructions. Based on the MM-CamObj dataset, we propose the CamObj-Llava, an LVLM specifically designed for addressing tasks in camouflaged scenes. To facilitate our model's effective acquisition of knowledge about camouflaged objects and scenes, we introduce a curriculum learning strategy with six distinct modes. Additionally, we construct the CamObj-Bench to evaluate the existing LVLMs' capabilities of understanding, recognition, localization and count in camouflage scenes. This benchmark includes 600 images and 7 tasks, with a total of 9,449 questions. Extensive experiments are conducted on the CamObj-Bench with CamObj-Llava, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o.

Bosch Center for Artificial Intelligence, Germany Technical University of Munich, Germany Munich Center for Machine Learning, Germany, Bosch Center for Artificial Intelligence, Germany Saarland University, Germany, Technical University of Munich, Germany Munich Center for Machine Learning, Germany University of Strasbourg, France, Bosch Center for Artificial Intelligence, Germany, Technical University of Munich, Germany, University of Strasbourg, France IHU Strasbourg, France, CISPA Helmholtz Center for Information Security, Germany

Abstract: Medical multimodal large language models (MLLMs) are becoming an instrumental part of healthcare systems, assisting medical personnel with decision making and results analysis. Models for radiology report generation are able to interpret medical imagery, thus reducing the workload of radiologists. As medical data is scarce and protected by privacy regulations, medical MLLMs represent valuable intellectual property. However, these assets are potentially vulnerable to model stealing, where attackers aim to replicate their functionality via blackbox access. So far, model stealing for the medical domain has focused on image classification; however, existing attacks are not effective against MLLMs. In this paper, we introduce Adversarial Domain Alignment (ADA-Steal), the first stealing attack against medical MLLMs. ADA-Steal relies on natural images, which are public and widely available, as opposed to their medical counterparts. We show that data augmentation with adversarial noise is sufficient to overcome the data distribution gap between natural images and the domain-specific distribution of the victim MLLM. Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that Adversarial Domain Alignment enables attackers to steal the medical MLLM without any access to medical data.

Abstract: We propose an approach for reconstructing freemoving object from a monocular RGB video. Most existing methods either assume scene prior, hand pose prior, object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that reduces the search space of the optimization significantly. We evaluate our method on the standard HO3D dataset and a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach outperforms most methods significantly, and is on par with recent techniques that assume prior information.

Abstract: Adversarial robustness in the context of Unsupervised Domain Adaptation (UDA) is particularly challenging due to the lack of labels in the target domain. Pseudo labels are often used to make adversarial robust models but compromise robustness and accuracy, falling short of the performance due to noise and inaccuracies in these pseudo labels. The main challenges in achieving robustness and accuracy include ensuring reliable pseudo labels and developing effective training methods that bring alignment between clean and adversarial examples of target data. To address these challenges, we propose a novel training method within the selftraining paradigm Consistent Attention Mapping with Self Pseudo Label Refinement (CAM+SPLR). It begins with the pre-training of the UDA model, resulting in a UDA pre-trained model, which is initialized into two separate models: the Anchor model and the TargetNet model. The Anchor model encourages the attention maps of clean images and their adversarial counterparts to be similar, while the TargetNet model simultaneously performs self-training using Adversarial target data and refining the pseudo labels. CAM+SPLR improves both semantically relevant key features and pseudo-labels through a two-step stochastic gradient descent process during training. We conducted extensive experiments on benchmark datasets, including OfficeHome, PACS, and VisDA, demonstrating significant improvements in both robustness and accuracy. Our method achieves an average accuracy improvement of 6% and 8.1% and an average robustness improvement of 10.2% and 4.9%, compared to state-of-the-art methods on the PACS and VisDA datasets.

Department of Electrical Engineering, Pohang University of Science and Technology, Department of Electrical Engineering, Pohang University of Science and Technology, Department of Statistics, Inha University, Department of Electrical Engineering, Pohang University of Science and Technology, Department of Electrical Engineering, Pohang University of Science and Technology Graduate School of Artificial Intelligence, Pohang University of Science and Technology Institute for Convergence Research and Education in Advanced Technology, Yonsei University

Abstract: We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing imageediting works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene editing dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the input sound semantics while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation.

Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, China School of Big Data and Software Engineering, Chongqing University, China, Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, China School of Big Data and Software Engineering, Chongqing University, China, School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou, China, Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, China School of Big Data and Software Engineering, Chongqing University, China, Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, China School of Big Data and Software Engineering, Chongqing University, China

Abstract: Video scene detection involves assessing whether each shot and its surroundings belong to the same scene. Achieving this requires meticulously correlating multimodal cues, e.g., visual entity and place modalities, among shots and comparing semantic changes around each shot. However, most methods treat multi-modal semantics equally and do not examine contextual differences between the two sides of a shot, leading to sub-optimal detection performance. In this paper, we propose the Modality-Aware Shot Relating and Comparing approach (MASRC), which enables relating shots per their own characteristics of visual entity and place modalities, as well as comparing multi-shots similarities to have scene changes explicitly encoded. Specifically, to fully harness the potential of visual entity and place modalities in modeling shot relations, we mine long-term shot correlations from entity semantics while simultaneously revealing short-term shot correlations from place semantics. In this way, we can learn distinctive shot features that consolidate coherence within scenes and amplify distinguishability across scenes. Once equipped with distinctive shot features, we further encode the relations between preceding and succeeding shots of each target shot by similarity convolution, aiding in the identification of scene ending shots. We validate the broad applicability of the proposed components in MASRC. Extensive experimental results on public benchmark datasets demonstrate that the proposed MASRC significantly advances video scene detection.

Abstract: Training semantic segmenter with synthetic data has been attracting great attention due to its easy accessibility and huge quantities. Most previous methods focused on producing largescale synthetic image-annotation samples and then training the segmenter with all of them. However, such a solution remains a main challenge in that the poor-quality samples are unavoidable, and using them to train the model will damage the training process. In this paper, we propose a training-free Synthetic Data Selection (SDS) strategy with CLIP to select high-quality samples for building a reliable synthetic dataset. Specifically, given massive synthetic image-annotation pairs, we first design a Perturbation-based CLIP Similarity (PCS) to measure the reliability of synthetic image, thus removing samples with low-quality images. Then we propose a class-balance Annotation Similarity Filter (ASF) by comparing the synthetic annotation with the response of CLIP to remove the samples related to low-quality annotations. The experimental results show that using our method significantly reduces the data size by half, while the trained segmenter achieves higher performance.

Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, Sun Yat-sen University

Abstract: TextVideo Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to CLIP's inherent plain structure, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.

Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, China Pengcheng Laboratory, China, Pengcheng Laboratory, China, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, China Pengcheng Laboratory, China, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, China Pengcheng Laboratory, China, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, China, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, China Pengcheng Laboratory, China

Abstract: The success of 3D Gaussian Splatting (3DGS) in static scenes has inspired numerous attempts to construct FreeViewpoint Videos (FVVs) of dynamic scenes from multi-view videos. Despite advancements in current techniques, simultaneously achieving photo-realistic view synthesis results, fast on-the-fly training, real-time rendering, and low storage costs remains a formidable problem. To address these challenges, we propose the first Gaussian-based streamable FVV intelligent compression framework named iFVC. Specifically, we utilize an anchor-based Gaussian representation to model the scene. To achieve on-the-fly training, we propose a Binary Transformation Cache (BTC) to model the dynamic changes between adjacent timesteps, which not only ensures compactness but also supports precise bit rate estimation. Furthermore, we carefully design a high-resolution transformation tri-plane assisted by a saliency grid as our BTC, allowing for accurate dynamic capture. The entire pipeline is regarded as a joint optimization of rate and distortion to achieve optimal compression performance. Experiments on widely used datasets demonstrate the state-of-the-art performance of our framework in both synthesis quality and efficiency, i.e., achieving per-frame training in 13 seconds with a storage cost of 0.1 MB and real-time rendering at 120 FPS.

Abstract: The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there comes a growing demand for retrieving data to provide specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, firstly, we propose the BEVTSR framework which leverages descriptive text as an input to retrieve corresponding scenes in the Bird’s Eye View (BEV) space. Then to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To achieve feature alignment between the BEV feature and language embedding, we propose Shared Cross-modal Embedding with a set of shared learnable embeddings to bridge the gap between these two modalities, and employ a caption generation task to further enhance the alignment. Furthermore, there lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-test and text-to-scene retrieval respectively.

Abstract: Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of largescale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By finetuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to localize audio-visual events in videos temporally. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.

Abstract: Recent 3D large reconstruction models typically employ a twostage process, including first generate multi-view images by a multi-view diffusion model, and then utilize a feed-forward model to reconstruct images to 3D content. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically utilizes a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, 2D diffusion model is applied for generating high-quality texture, and the reconstruction model guarantees multi-view consistency. Moreover, 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high-quality and consistency compared with state-of-the-art baselines.

Abstract: Gait attracts growing interest from researchers due to its advantages as a noninvasive and non-cooperative biometric feature. Current gait-based attribute recognition methods primarily focus on estimating attributes such as gender, age, and emotions. However, there is insufficient attention to diverse gait attributes in various covariate scenarios. In this paper, we design and collect a Richly Annotated benchmark for 15 gait attributes, named RA-GAR, comprising data from 533 individuals with over 120,000 sequences. To our knowledge, RA-GAR represents the largest and most diverse benchmark of gait attributes currently available. Furthermore, to fully leverage the semantic information and enhance attribute-specific local perception, we propose a two-stage CLIP-based method for Gait Attribute Recognition, named CLIP-GAR. Experiments on the RA-GAR and MA-Gait datasets demonstrate the effectiveness of CLIP-GAR, showing significant improvements in mean accuracy and F1 score.

Abstract: We propose Intra and Inter ParserPrompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.

Abstract: Many machine learning models are susceptible to adversarial attacks, with decisionbased black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision boundaries identified through query-intensive exact search, significantly limiting the attack success rate. This paper introduces a novel approach using the Approximation Decision Boundary (ADB) to efficiently and accurately compare perturbation directions without precisely determining decision boundaries. The effectiveness of our ADB approach (ADBA) hinges on promptly identifying suitable ADB, ensuring reliable differentiation of all perturbation directions. For this purpose, we analyze the probability distribution of decision boundaries, confirming that using the distribution's median value as ADB can effectively distinguish different perturbation directions, giving rise to the development of the ADBA-md algorithm. ADBA-md only requires four queries on average to differentiate any pair of perturbation directions, which is highly query-efficient. Extensive experiments on six well-known image classifiers clearly demonstrate the superiority of ADBA and ADBA-md over multiple state-of-the-art black-box attacks.

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education, China Purple Mountain Laboratories, Nanjing 210000, China

Abstract: Video Anomaly Detection (VAD) is essential for computer vision and multimedia research. Existing VAD methods utilize either reconstructionbased or prediction-based frameworks. The former excels at detecting irregular patterns or structures, whereas the latter is capable of spotting abnormal deviations or trends. We address pose-based video anomaly detection and introduce a novel framework called Dual Conditioned Motion Diffusion (DCMD), which enjoys the advantages of both approaches. The DCMD integrates conditioned motion and conditioned embedding to comprehensively utilize the pose characteristics and latent semantics of observed movements, respectively. In the reverse diffusion process, a motion transformer is proposed to capture potential correlations from multi-layered characteristics within the spectrum space of human motion. To enhance the discriminability between normal and abnormal instances, we design a novel United Association Discrepancy (UAD) regularization that primarily relies on a Gaussian kernel-based time association and a self-attention-based global association. Finally, a mask completion strategy is introduced during the inference stage of the reverse diffusion process to enhance the utilization of conditioned motion for the prediction branch of anomaly detection. Extensive experiments conducted on four datasets demonstrate that our method dramatically outperforms state-of-the-art methods and exhibits superior generalization performance.

Tongji University State Key Laboratory of Intelligent Autonomous Systems, Frontiers Science Center for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems, Tongji University State Key Laboratory of Intelligent Autonomous Systems, Frontiers Science Center for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems, Tongji University State Key Laboratory of Intelligent Autonomous Systems, Frontiers Science Center for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems, Tongji University State Key Laboratory of Intelligent Autonomous Systems, Frontiers Science Center for Intelligent Autonomous Systems, Shanghai Key Laboratory of Intelligent Autonomous Systems

Abstract: 3D part assembly is a promising task in 3D computer vision and robotics, focusing on assembling 3D parts together by predicting their 6DoF poses. Like most 3D shape understanding tasks, existing methods primarily address this task by memorizing the poses of parts during the training process, leading to inaccuracies in complex assemblies and poor generalization to novel categories. In order to essentially improve the performance, structure knowledge of the target assembly is indispensable before assembling, which abstracts the potential part composition and their structural relationships. An image of the target assembly can serve as a common source for constructing this structure knowledge. Nevertheless, the image is far from enough, as its knowledge can be incomplete and ambiguous due to part occlusion and varying views. To tackle these issues, we propose Imagine, a novel Image-guided 3D part assembly framework with structure knowledge graph. As a novel assembly prior, the structure knowledge graph originates from the image and is refined as understanding the 3D parts. It encodes robust part-aware structural and semantic information of the assembly, guides the 3D parts from a coarse super-structure to a fine assembly, and co-evolves progressively throughout the assembly process. Extensive experiments demonstrate the state-of-the-art performance of our framework, along with strong generalization to novel images and categories.

Abstract: Textto-image diffusion model has inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vannila attentions often deliver visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices to achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is present to assess mask quality based on its variation over uniform sampling timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that the existing segmenters trained on FreeGen narrow the performance gap with real data counterparts and remarkably outperform the state-of-the-art methods.

School of Computer Science and Technology, Wuhan University of Science and Technology School of Computer Science, Wuhan University, School of Computer Science and Technology, Wuhan University of Science and Technology Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China, School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, School of Computer Science and Technology, Wuhan University of Science and Technology Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, China

Abstract: Unsupervised visibleinfrared person re-identification (US-VI-ReID) seeks to match infrared and visible images of the same individual without the use of annotations. Current methods typically derive cross-modal correspondences through a single global feature matching process for generating pseudo labels and learning modality-invariant features. However, this matching approach is hindered by both intra-modality and inter-modality discrepancies, which result in imprecise measurements. As a consequence, the clustering of individuals with single global feature is often incomplete and unreliable, leading to suboptimal performance in cross-modal clustering tasks. To address these challenges and to extract cross-modality discriminative identity information, we propose a TokenMatcher, which encompasses three key components: Diverse Tokens Matching (DTM), Diverse Tokens Neighbor Learning (DTNL), and the Homogeneous Fusion (HF) Module. DTM utilizes multiple class tokens within the visual transformer framework to capture diverse embedding representations, thereby facilitating the integration of fine-grained information essential for reliable cross-modality correspondences. DTNL enhances the intra-modality and inter-modality consistency among diverse tokens by refining neighborhood sets with insights from neighboring tokens and camera information, promoting robust neighborhood learning and fostering discriminative identity information. Additionally, the HF module consolidates clusters of the same identity while effectively separating those of different identities. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets demonstrate the efficacy of the proposed method.

Abstract: Existing unsupervised distillationbased methods rely on the differences between encoded and decoded features to locate abnormal regions in test images. However, the decoder trained only on normal samples still reconstructs abnormal patch features well, degrading performance. This issue is particularly pronounced in unsupervised multi-class anomaly detection tasks. We attribute this behavior to ‘over-generalization’ (OG) of decoder: the significantly increasing diversity of patch patterns in multi-class training enhances the model generalization on normal patches, but also inadvertently broadens its generalization to abnormal patches. To mitigate ‘OG’, we propose a novel approach that leverages class-agnostic learnable prompts to capture common textual normality across various visual patterns, and then apply them to guide the decoded features towards a ‘normal’ textual representation, suppressing ‘over-generalization’ of the decoder on abnormal patterns. To further improve performance, we also introduce a gated mixture-of-experts module to specialize in handling diverse patch patterns and reduce mutual interference between them in multi-class training. Our method achieves competitive performance on the MVTec AD and VisA datasets, demonstrating its effectiveness.

Abstract: Deep learning based dehazing networks trained on paired synthetic data have shown impressive performance, but they struggle with significant degradation in generalization ability on realworld hazy scenes. In this paper, we propose Dehaze-RetinexGAN, a lightweight Retinex-based Generative Adversarial Network for real-world image Dehazing using unpaired data. Our Dehaze-RetinexGAN consists of two stages: self-supervised pre-training and weakly-supervised fine-tuning. During the pre-training, we reduce the image dehazing task to an illumination-reflectance decomposition task based on the duality correlation between Retinex and dehazing. Specifically, a decomposition network named DecomNet is constructed to obtain an illumination and a reflectance, simultaneously. Moreover, a self-supervised learning strategy is developed to construct the connection between the preliminary dehazed result and the input hazy image, which constrains the solution space of DecomNet and accelerates training, leading to a more realistic dehazed result. In the fine-tuning stage, we develop a dual DTCWT-based attention module and embed it into the U-Net architecture to further improve the quality of preliminary result in the frequency domain. In addition, the adversarial learning is employed to constrain the relevance between the clean image and the final dehazed result in a weakly supervised manner, which can promote more natural performance. Extensive experiments on several real-world datasets demonstrate that our proposed framework performs favorably over state-of-the-art dehazing methods in visual quality and quantitative evaluation.

Abstract: Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a textguided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we utilize GPT-4V to design an automatic annotation pipeline, constructing an instruction-video paired training dataset. This is combined with a novel two-branch diffusion-based generator to predict avatars using both audio and text instructions simultaneously. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness.

Abstract: Designing an appropriate loss function can enhance the discriminative power on gait recognition. However, previous research focuses on improving network structure and enriching input modalities but overlooks the loss functions. Although transferring loss functions from face recognition can address samplelevel loss, additional design is needed for part-level loss. Therefore, we have designed a new loss function called Waveloss, aimed at adaptively and dynamically changing the preference for parts of different difficulties. First, the previous method treats the loss of different parts equally, which brings the problems of difficult convergence or susceptibility to noise interference, so we propose norm-fusion to adaptively learn samples of different difficulties. Additionally, since we find the exponential value represents preference for learning different samples, we introduce the Dynamic Learning Process, which dynamically adjusts the exponential value during iteration to focus on samples of varying difficulties at different training stages. Finally, as the changes of the exponential value leads to significant fluctuations in the gradient, we introduce the gradient truncation and normalization to avoid getting trapped in local optima and gradient vanishing or exploding by adaptively adjusting the gradient. Experimental results demonstrate that our proposed Waveloss achieves state-of-the-art performance on various gait recognition datasets and can improve the performance of different backbones as well.

Abstract: Diffusion models have been utilized as powerful tools for various image editing tasks, including semantic image painting (SIP), which aims to generate content within masked regions conditioned on a reference image or text. SIP, especially those using images as conditions, often suffers from three issues: semantic inconsistency, unnatural transitions, and style inconsistency, which significantly hinder its practical application. To address these challenges, we propose a novel Semantic Image Painting framework with INdependent INformation INjection (Spin). Specifically, we compute a saliency map to segregate the reference image into salient and nonsalient components. We then filter out the non-salient information during the semantic embedding extraction phase and precisely inject the semantic embedding into the masked region instead of the whole image during the semantic generation phase. Furthermore, we impose an additional style guidance to promote style consistency between background and foreground. Experimental results demonstrate that Spin achieve superior semantic similarity and image coherence across various styles, including realistic, pencil drawings, cartoon, and oil painting. Additionally, Spin offers diversity and editability, and can be integrated into other models that meet our prerequisites.

Abstract: In recent years, semantic segmentation has flourished in various applications. However, the high computational cost remains a significant challenge that hinders its further adoption. The filter pruning method for structured network slimming offers a direct and effective solution for the reduction of segmentation networks. Nevertheless, we argue that most existing pruning methods, originally designed for image classification, overlook the fact that segmentation is a locationsensitive task, which consequently leads to their suboptimal performance when applied to segmentation networks. To address this issue, this paper proposes a novel approach, denoted as Spatial-aware Information Redundancy Filter Pruning (SIRFP), which aims to reduce feature redundancy between channels. First, we formulate the pruning process as a maximum edge weight clique problem (MEWCP) in graph theory, thereby minimizing the redundancy among the remaining features after pruning. Within this framework, we introduce a spatial-aware redundancy metric based on feature maps, thus endowing the pruning process with location sensitivity to better adapt to pruning segmentation networks. Additionally, based on the MEWCP, we propose a low computational complexity greedy strategy to solve this NP-hard problem, making it feasible and efficient for structured pruning. To validate the effectiveness of our method, we conducted extensive comparative experiments on various challenging datasets. The results demonstrate the superior performance of SIRFP for semantic segmentation tasks.

Abstract: Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed and long videos through crossmodal requires such as textual descriptions and synchronized audios. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn the rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to search crucial visual component from these untrimmed and long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further solve the problem of untrimmed and long video alignment, an anomaly-biased weighting is devised in the fine-grained alignment, which identifies key segments in untrimmed long videos using anomaly priors, giving them more attention, thereby discarding irrelevant segment information, and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrates high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements on both text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets against the best competitors by 5.0% and 5.3% R@1.

Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial 3D point clouds. With advancements in deep learning techniques, various methods for point cloud completion have been developed. Despite achieving encouraging results, a significant issue remains: these methods often overlook the variability in point clouds sampled from a single 3D object surface. This variability can lead to ambiguity and hinder the achievement of more precise completion results. Therefore, in this study, we introduce a novel point cloud completion network, namely DualCodebook Point Completion Network (DC-PCN), following an encder-decoder pipeline. The primary objective of DC-PCN is to formulate a singular representation of sampled point clouds originating from the same 3D surface. DC-PCN introduces a dual-codebook design to quantize point-cloud representations from a multilevel perspective. It consists of an encoder-codebook and a decoder-codebook, designed to capture distinct point cloud patterns at shallow and deep levels. Additionally, to enhance the information flow between these two codebooks, we devise an information exchange mechanism. This approach ensures that crucial features and patterns from both shallow and deep levels are effectively utilized for completion. Extensive experiments on the PCN, ShapeNet_Part, and ShapeNet34 datasets demonstrate the state-of-the-art performance of our method.

Abstract: Pansharpening is a challenging image fusion task that involves restoring images using two different modalities: lowresolution multispectral images (LRMS) and high-resolution panchromatic (PAN). Many end-to-end specialized models based on deep learning (DL) have been proposed, yet the scale and performance of these models are limited by the size of dataset. Given the superior parameter scales and feature representations of pre-trained models, they exhibit outstanding performance when transferred to downstream tasks with small datasets. Therefore, we propose an efficient fine-tuning method, namely PanAdapter, which utilizes additional advanced semantic information from pre-trained models to alleviate the issue of small-scale datasets in pansharpening tasks. Specifically, targeting the large domain discrepancy between image restoration and pansharpening tasks, the PanAdapter adopts a two-stage training strategy for progressively adapting to the downstream task. In the first stage, we fine-tune the pre-trained CNN model and extract task-specific priors at two scales by proposed Local Prior Extraction (LPE) module. In the second stage, we feed the extracted two-scale priors into two branches of cascaded adapters respectively. At each adapter, we design two parameter-efficient modules for allowing the two branches to interact and be injected into the frozen pre-trained VisionTransformer (ViT) blocks. We demonstrate that by only training the proposed LPE modules and adapters with a small number of parameters, our approach can benefit from pre-trained image restoration models and achieve state-of-the-art performance in several benchmark pansharpening datasets.

MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University Academy of Artificial Intelligence, College of Mathematics Science, Inner Mongolia Normal University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Electronic Engineering, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Electronic Engineering, Xidian University, School of Artificial Intelligence, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University

Abstract: 3D Change Detection (3DCD) has gradually become another research hotspot after image change detection. Recent works focus on using artificial labels for supervised or weaklysupervised training of siamese networks to segment changed points. However, labeling every points of multi-temporal point clouds is very expensive and time-consuming. In addition, these works lack effective self-supervised signals, and existing self-supervised signals often fail to capture sufficiently rich change information. To solve this problem, we assume that the powerful representation of 3D objects should model the consistency information of unchanged regions and distinguish different objects. Based on this assumption, we propose a new unsupervised framework called MUCD to learn change information of multi-temporal point clouds through bidirectional optimization of change segmentor and feature extractor. The training of network is divided into two stages. We first design a foreknowledge point contrastive loss based on the characteristics of the 3DCD task to initialize the feature extractor, and then propose a masked consistency loss to further learn the shared geometric information of unchanged regions in the multi-temporal point clouds, utilizing it as a free and powerful supervised signal to train a change segmentor. In the inference stage, only the segmentor is used to take multi-temporal point clouds as input and produce change segmentation result. Extensive experiments are conducted on SLPCCD and Urb3DCD, two real-world datasets of streets and urban buildings, to verify that our proposed unsupervised method is highly competitive and even outperforms supervised methods in scenes where semantic information changes occur, exhibiting better performance in generalization ability and robustness.

Abstract: Audiodriven talking head synthesis is a critical task in digital human modeling. While recent advances using diffusion models and Neural Radiance Fields (NeRF) have improved visual quality, they often require substantial computational resources, limiting practical deployment. We present a novel framework for audio-driven talking head synthesis, namely it Hierarchically Controlled Deformable 3D Gaussians (HiCoDe), which achieves state-of-the-art performance with significantly reduced computational costs. Our key contribution is a hierarchical control strategy that effectively bridges the gap between sparse audio features and dense 3D Gaussian point clouds. Specifically, this strategy comprises two control levels: i) coarse-level control based on a 3D Morphable Model (3DMM) and ii) fine-level control using facial landmarks. Extensive experiments on the HDTF dataset and additional test sets demonstrate that our method outperforms existing approaches in visual quality, facial landmark accuracy, and audio-visual synchronization while being more computationally efficient in both training and inference.

Abstract: Lane detection plays a crucial role in autonomous driving systems, enabling vehicles to navigate safely and efficiently in complex environment. Despite significant advancements in recent years, accurate lane detection remains a challenging task, particularly in scenarios with occlusions, ambiguous lane markings, and diverse lighting conditions. In this paper, we propose the Global Enhancement and Optimization Network (GEONet) for lane detection, which is designed to refine both feature extraction and global feature transmission. Traditional approaches typically depend on deep convolutional layer stacks for global feature extraction, a process that often compromises inference speed and the precision of global feature representation. In contrast, GEONet introduces a novel and more effective methodology. We present the Global Feature Extraction Module (GFEM), which is specifically engineered to capture comprehensive global features with higher accuracy. Additionally, we introduce the TopTier Supplementary Module (TTSM), which enhances these features through a bottom-up approach, improving overall lane detection accuracy. To further bolster our framework, we incorporate Whitening Batch Normalization (WBN) and Whitening Contrastive Learning (WCL), which enhance feature robustness and ensure better generalization. In addition to our novel network design, we propose two new loss functions to enhance lane detection accuracy. The Generalized Rectangular Intersection over Union (GRIoU) Loss extends the predicted points into rectangles, optimizing overlap and smoothness of lane predictions.The Angle Loss accounts for angular differences between predicted and ground truth lanes, improving alignment and continuity. Experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art lane detection techniques.

Abstract: Multishape matching is a central problem in various applications of computer vision and graphics, where cycle consistency constraints play a pivotal role. For this issue, we propose a novel and efficient approach that models multi-shapes as directed graphs for two-stage optimization, i.e., optimizing pairwise correspondence accuracy using landmarks, and refining matching consistency through cycle consistency basis. Specifically, we utilize local mapping distortion to identify landmarks and extract the dimension of the functional space, which is then used to upsample in the spectral domain, thereby producing smoother results. Next, to optimize the consistency of correspondences, we introduce the cycle consistency basis, which succinctly describes all consistent cycles in the collection. We then propose cycle consistency refinement, which resolves inconsistencies in cycles efficiently via the alternating direction method of multipliers. Our approach simultaneously balances the accuracy and consistency of multi-shape matching, achieving lower correspondence errors. Extensive experiments on several public datasets demonstrate the superiority of our approach over current state-of-the-art methods.

Abstract: Obtaining planetary images with good visual quality is not an easy task since they are usually degenerated by atmospheric turbulence during the imaging procedure. Existing atmospheric turbulence mitigation methods designed for conventional images cannot be applied to planetary images, since the objects on the Earth have totally different degeneration patterns to planets. Besides, in planetary imaging, photographers often capture as many frames as possible to reduce the noise level of planetary images, which requires the method designed for planetary images to support an arbitrary number of input frames. In this paper, we propose a vertical distanceaware turbulence simulation pipeline to synthesize realistic planetary images in accordance with their unique degeneration patterns at a large scale with affordable computational cost, and design a neural network to mitigate the turbulence with flexible input frames by adopting an edge-based supervision strategy to handle the background scarcity issue. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world images.

Abstract: Current research on adversarial attacks mainly focuses on RGB trackers, with no existing methods for attacking RGBT cross-modal trackers. To fill this gap and overcome its challenges, we propose a progressive adversarial patch generation framework and achieve cross-modal stealth. On the one hand, we design a coarse-to-fine architecture grounded in the latent space to progressively and precisely uncover the vulnerabilities of RGB-T trackers. On the other hand, we introduce a correlation-breaking loss that disrupts the modal coupling within trackers, spanning from the pixel to the semantic level. These two design elements ensure that the proposed method can overcome the obstacles posed by cross-modal information complementarity in implementing attacks. Furthermore, to enhance the reliable application of the adversarial patches in real world, we develop a point tracking-based reprojection strategy that effectively mitigates performance degradation caused by multi-angle distortion during imaging. Extensive experiments demonstrate the superiority of our method.

Abstract: Embedded flight devices with visual capabilities have become essential for a wide range of applications. In aerial image detection, while many existing methods have partially addressed the issue of small target detection, challenges remain in optimizing small target detection and balancing detection accuracy with efficiency. These issues are key obstacles to the advancement of realtime aerial image detection. In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO, to address the imbalance between detection accuracy and efficiency. Our method comprises two lightweight modules: Feature Complementary Mapping Module (FCM) and Multi-Kernel Perception Unit (MKP), designed to enhance object perception for small targets in aerial images. FCM focuses on alleviating the problem of information imbalance caused by the loss of small target information in deep networks. It aims to integrate spatial positional information of targets more deeply into the network, better aligning with semantic information in the deeper layers to improve the localization of small targets. We introduce MKP, which leverages convolutions with kernels of different sizes to enhance the relationships between targets of various scales and improve the perception of targets at different scales. Extensive experimental results on three major aerial image datasets, including Visdrone, UAVDT, and AI-TOD, demonstrate that FBRT-YOLO outperforms various real-time detectors in terms of performance and speed.

Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Key Laboratory of Big Data Mining and Knowledge Management, University of Chinese Academy of Sciences, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University

Abstract: Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive framecontext methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), JD AI Research, China University of Petroleum (Beijing) at Karamay, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China)

Abstract: The significance of visual emotion distribution learning (VEDL) has surged, particularly with the growing inclination to convey emotions through images. The key of VEDL lies in capturing both lowand high-level features within the same visual content, thus promoting the model for salient and subtle emotion awareness. To learn the distribution of emotions involved in images, most previous works learn coarse semantic knowledge with unbiased filtering. Consequently, they focus on the entire scene and suffer from the redundancy of semantic-irrelevant information, which diminishes the affective coherence, impeding the comprehension of emotional attributes within the treated features. In light of this, we reanalyze from the perspective of information filtering and propose a novel method called Multiple Feature Refining Network (MFRN). To minimize low-level feature redundancy, we design a wavelet-based separated frequency modeling, named Spectral Mixer, to learn invariant representations and enhance emotion saliency in low-level image features. At the higher semantic level, we design a Semantic Graph Prompt Learning for emotional semantic filtering, ensuring the purity of emotional information and providing the model with richer content semantics. Experiments conducted on three commonly used datasets have demonstrated the superiority of our MFRN model over cutting-edge methods.

Abstract: Humanobject interaction (HOI) detection aims to detect the spatial positions of human-object pairs and recognize their interactions. Existing single-branch, two-branch, and three-branch methods are challenging to make an appropriate trade-off on efficiency, multi-task decoupling, and collaborative learning, while they fail to identify rare and complex interaction categories effectively as well. In this work, we propose a novel Efficient Mamba-based Disentangled Progressive Learning (HOIMamba) for HOI Detection to absorb the advantages of the existing three approaches and adaptively aggregate multi-level interaction semantics guided by cross-task bidirectional information contexts. Specifically, HOIMamba builds an efficient and effective decoder through cascaded Low-Rank Adaptations (LoRAs), with high efficiency, thorough decoupling of tasks, and good multi-task collaborative learning. Furthermore, to alleviate the recognition problem of interactions in difficult HOI samples, a novel Mamba-based comprehensive progressive learning strategy with Cross-enhance Mamba (CEM) blocks and Detection Context Propagation (DCP) blocks is designed to gradually excavate interaction-related discriminative cues from four levels. CEM blocks automatically aggregate context to generate diverse task-shared semantics and simultaneously realize the cross-task interaction between human and object branches, guiding the interaction branch to extract more expressive HOI representation. DCP blocks further transfer the comprehensive interaction context to human and object branches to achieve rich and effective information exchange, facilitating the model to discover more HOI instances. Extensive experimental results on two standard benchmarks demonstrate the effectiveness of our HOIMamba.

Abstract: Face retouching aims to remove facial imperfections from image and videos while at the same time preserving face attributes. The existing methods are designed to perform noninteractive end-to-end retouching, while the ability to interact with users is highly demanded in downstream applications. In this paper, we propose RetouchGPT, a novel framework that leverages Large Language Models (LLMs) to guide the interactive retouching process. Towards this end, we design an instruction-driven imperfection prediction module to accurately identify imperfections by integrating textual and visual features. To learn imperfection prompts, we further incorporate a LLM-based embedding module to fuse multi-modal conditioning information. The prompt-based feature modification is performed in each transformer block, such that the imperfection features are suppressed and replaced with the features of normal skin progressively. Extensive experiments have been performed to verify effectiveness of our design elements and demonstrate that RetouchGPT is a useful tool for interactive face retouching and achieves superior performance over state-of-the-arts.

Abstract: Embedding links in brand logos is a promising technology, which allows consumers to access the online information of products by capturing physical logo images. Previous physical data hiding methods primarily embed data within cover media in a global manner, making them ineffective for processing brand logos in vector graphics format with a transparent background. To address this issue, we propose in this paper a novel physical deep hiding scheme for invisibly embedding links in printed trademarks. Specifically, the encoder embeds links only into the area of the brand logo under the constraints of a mask, which is generated from the transparency information of the logo image. A background variation distortion is introduced into the distortion layer that approximate practical logo printcamera environments, such that the decoder could be learnt to retrieve the link from the camera-captured logo with various backgrounds. A feature prompt subspace modulator is further proposed and employed in the encoder to enhance the invisibility of the encoded logo pattern and in the decoder to boost hyperlink extraction accuracy. Various experiments have been conducted to demonstrate the advantage of our proposed method for embedding links in printed brand logos, which provides reliable extraction accuracy under both simulated and real scenarios.

Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of SciencesUniversity of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences

Abstract: Multiteacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods often develop weighting strategies from an individual perspective of teacher performance or teacher-student gaps, lacking comprehensive information for guidance. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights. In this framework, we construct both teacher performance and teacher-student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD-RL reinforces the interaction between the student and teacher using an agent in an RL-based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation tasks, demonstrate that MTKD-RL achieves state-of-the-art performance compared to the existing multi-teacher KD works.

Abstract: Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Although LLMs are introduced these methods decode the action steps into a closedset of one-hot vectors, limiting the model's capability of generalizing to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may contain noise in specific instances of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module which fully uses the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose Mutual Information Maximization module to connect world-level commonsense of step descriptions and sample-specific information of visual states, enabling LLMs to employ the reasoning ability to generate step sequences. With the assistance of LLMs, our method can both closed-set and open vocabulary procedure planning tasks. Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.

Abstract: Event cameras are bioinspired sensors that are capable of capturing motion information with high temporal resolution, which show potential in aiding image motion deblurring recently. Most existing methods indiscriminately handle feature fusion of two modalities with symmetric unidirectional/bidirectional interactions at different-level layers in feature encoder, while ignoring the different dependencies between cross-modal hierarchical features. To tackle these limitations, we propose a novel Asymmetric Hierarchical Difference-aware Interaction Network (AHDINet) for event-based motion deblurring, which explores the complementarity of two modalities with differential dependence modeling of cross-modal hierarchical features. Thereby, an event-assisted edge complement module is designed to leverage event modality to enhance the edge details of the image features in low-level encoder stage, and an image-assisted semantic complement module is developed to transfer contextual semantics of image features to event branch in high-level encoder stage. Benefiting from the proposed differentiated interaction mode, the respective advantages of image and event modalities are fully exploited. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance.

Abstract: 3D single object tracking (3D SOT) in LiDAR point clouds is essential for autonomous driving. Most existing 3D SOT methods focus on clear weather, where point clouds are more defined. However, adverse weather conditions lead to sparser and noisier point clouds, significantly degrading tracking performance and posing safety risks. In this study, we introduce UAWTrack, a universal 3D SOT model designed to perform effectively across diverse realworld weather conditions. UAWTrack comprises three key modules: 1) Voxel Feature Extraction, which mitigates the perturbations in point clouds caused by adverse weather; 2) Motion-centric Spatial-temporal Aggregation and Motion-guided Feature Fusion, capturing motion clues and sampling dense BEV motion features to address the issue of sparsity; and 3) Weather-Specific Tracker, which efficiently handles tracking in various weather conditions. To fill the gap of lacking benchmarks for 3D SOT in adverse weather, we simulate physically valid adverse weather conditions on the KITTI and NuScenes datasets, creating two benchmarks: KITTI-A and NuScenes-A. Extensive experiments demonstrate that UAWTrack achieves state-of-the-art performance under all weather conditions.

School of Computer Science, Peking University, School of Artificial Intelligence, Beihang University, Beijing Digital Native Digital City Research Center, School of Computer Science and Engineering, Beihang University, Beijing Digital Native Digital City Research Center, Beijing Digital Native Digital City Research Center, School of Computer Science, Peking University, School of Computer Science and Engineering, Beihang University, School of Computer Science, Peking University, School of Computer Science, Peking University, Theta Labs, Inc., Beijing Digital Native Digital City Research Center, Technical University of Munich

Abstract: Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graphbased methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance.

Abstract: The customization of textto-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple customized models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and trained models will be made publicly available.

Fujian Province Key Laboratory of Information Security and Network Systems College of Computer and Data Science, Fuzhou University, Fujian Province Key Laboratory of Information Security and Network Systems College of Computer and Data Science, Fuzhou University, College of Computer and Data Science, Fuzhou University, Fujian Provincial Key Laboratory of Big Data Mining and Applications College of Computer Science and Mathematics, Fujian University of Technology, College of Computer and Data Science, Fuzhou University Lion Rock Labs of Cyberspace Security

Abstract: Backdoor attacks and adversarial attacks are two major security threats to deep neural networks (DNNs), with the former one is a trainingtime data poisoning attack that aims to implant backdoor triggers into models by injecting trigger patterns into training samples, and the latter one is a testing-time attack trying to generate adversarial examples (AEs) from benign images to mislead a well-trained model. While previous works generally treat these two attacks separately, the inherent connection between these two attacks is rarely explored. In this paper, we focus on bridging backdoor and adversarial attacks and observe two intriguing phenomena when applying adversarial attacks on an infected model implanted with backdoors: 1) the sample is harder to be turned into an AE when the trigger is presented; 2) the AEs generated from backdoor samples are highly likely to be predicted as its true labels. Inspired by these observations, we proposed a novel backdoor defense method, dubbed Adversarial-Inspired Backdoor Defense (AIBD), to isolate the backdoor samples by leveraging a progressive top-q scheme and break the correlation between backdoor samples and their target labels using adversarial labels. Through extensive experiments on various datasets against six state-of-the-art backdoor attacks, the AIBD-trained models on poisoned data demonstrate superior performance over the existing defense methods.

Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Department of Artificial Intelligence, Korea University, Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology

Abstract: Scene Graph Generation (SGG) research has suffered from two fundamental challenges: the longtailed predicate distribution and semantic ambiguity between predicates. These challenges lead to a bias towards head predicates in SGG models, favoring dominant general predicates while overlooking fine-grained predicates. In this paper, we address the challenges of SGG by framing it as multi-label classification problem with partial annotation, where relevant labels of fine-grained predicates are missing. Under the new frame, we propose Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies potential instances to be multilabeled and enriches the single-label with multi-labels that are semantically similar to the original label by retrieving relevant samples from our established memory bank. Based on augmented relations (i.e., discovered multi-labels), we apply multi-prototype learning to train our SGG model. Several comprehensive experiments have demonstrated that RASGG outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA, particularly in terms of F@K, showing that RA-SGG effectively alleviates the issue of biased prediction caused by the long-tailed distribution and semantic ambiguity of predicates.

Abstract: Modern methods for autonomous driving perception widely adopt multimodal fusion to enhance 3D scene understanding. However, existing methods suffer from inferior semantic extraction in image encoders that treat all pixels equally, ignoring contextual differences. The generated multi-modal representations also typically lack comprehensive semantic and spatial geometry information, which is crucial for the 3D panoptic segmentation task. In this paper, we propose a novel Semantic-Geometry Fusion Transformer (SGFormer) that extracts adaptive semantic contexts, aggregates geometric information and captures the semantic-geometry fusion. First, in the Image Branch, we tailor semantic contexts for each pixel with context-guided attention and spatial context alignment to refine semantic details. Second, we transform image and voxel features into point-pixel geometry representations, simultaneously learning semantic category priors as embeddings to better represent scene geometry and semantics. Finally, to aggregate semantic information with related geometry, we design a semantic-geometry fusion that combines the transformer, effectively capturing semantic-geometry relationships into multi-modal panoptic representations. Notably, SGFormer achieves the state-of-the-art (SOTA) results on the nuScenes and SemanticPOSS, as well as yielding competitive performance on the SemanticKITTI. Moreover, SGFormer exhibits superior robustness compared to leading methods, marking an improvement of 2% to 10%.

Abstract: Categorylevel object pose estimation is an important task in computer vision. Some prior methods based on assumptions often struggle with drastic changes in object appearance. To address this challenge, we propose a new method for object pose estimation based on object-adaptive keypoints. In this paper, we first introduce a transformer-based keypoint prediction method for adaptive forecasting of point cloud keypoints. This method calculates the similarity between keypoint features and point cloud features, allowing keypoints to represent object geometry more effectively. Furthermore, to enhance the geometric feature construction of keypoints, we propose a graph-based keypoint feature aggregation method, which considers both the structural relationships between keypoints and the point cloud, strengthening the network's understanding of geometric structures. At this stage, keypoints remain at the geometric spatial level of the object and have not been predicted in NOCS. To improve the accuracy of keypoint prediction in NOCS, we design a NOCS voxelization method that divides NOCS into multiple voxels and accurately predicts NOCS keypoints within these voxels. Experimental results on multiple benchmark datasets demonstrate that our proposed KeyPose method outperforms all existing methods, achieving over 20% improvement in pose accuracy on some critical datasets.

Abstract: Large Multimodal Models (LMMs) have significantly progressed by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixelwise segmentation by integrating segmentation models. Despite the innovations, existing works’ textual responses and segmentation masks remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name such a new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. Besides, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple/empty segmentation, and reasoning segmentation. The great properties and versatility of MGLMM underscore its potential impact on advancing multimodal research.

Abstract: Patch deformationbased methods have recently exhibited substantial effectiveness in multi-view stereo, due to the incorporation of deformable and expandable perception to reconstruct textureless areas. However, such approaches typically focus on exploring correlative reliable pixels to alleviate match ambiguity during patch deformation, but ignore the deformation instability caused by mistaken edge-skipping and visibility occlusion, leading to potential estimation deviation. To remedy the above issues, we propose DVP-MVS, which innovatively synergizes depth-edge aligned and cross-view prior for robust and visibility-aware patch deformation. Specifically, to avoid unexpected edge-skipping, we first utilize Depth Anything V2 followed by the Roberts operator to initialize coarse depth and edge maps respectively, both of which are further aligned through an erosion-dilation strategy to generate fine-grained homogeneous boundaries for guiding patch deformation. In addition, we reform view selection weights as visibility maps and restore visible areas by cross-view depth reprojection, then regard them as cross-view prior to facilitate visibility-aware patch deformation. Finally, we improve propagation and refinement with multi-view geometry consistency by introducing aggregated visible hemispherical normals based on view selection and local projection depth differences based on epipolar lines, respectively. Extensive evaluations on ETH3D and Tanks & Temples benchmarks demonstrate that our method can achieve state-of-the-art performance with excellent robustness and generalization.

Abstract: 3D color lookup tables (LUTs) enable precise color manipulation by mapping input RGB values to specific output RGB values. 3D LUTs are instrumental in various applications, including video editing, incamera processing, photographic filters, computer graphics, and color processing for displays. While an individual LUT does not incur a high memory overhead, software and devices may need to store dozens to hundreds of LUTs that can take over 100 MB. This work aims to develop a neural network architecture that can encode hundreds of LUTs in a single compact representation. To this end, we propose a model with a memory footprint of less than 0.25 MB that can reconstruct 512 LUTs with only minor color distortion (ΔE ≤ 2.0 on average) over the entire color gamut. We also show that our network can weight colors to provide further quality gains on natural image colors (ΔE ≤ 1.0 on average). Finally, we show that minor modifications to the network architecture enable a bijective encoding that produces LUTs that are invertible, allowing for reverse color processing.

Abstract: Gaze estimation methods encounter significant performance deterioration when being evaluated across different domains, because of the domain gap between the testing and training data. Existing methods try to solve this issue by reducing the deviation of data distribution, however, they ignore the existence of label deviation in the data due to the acquisition mechanism of the gaze label and the individual physiological differences. In this paper, we first point out that the influence brought by the label deviation cannot be ignored, and propose a gaze label alignment algorithm (GLA) to eliminate the label distribution deviation. Specifically, we first train the feature extractor on all domains to get domain invariant features, and then select an anchor domain to train the gaze regressor. We predict the gaze label on remaining domains and use a mapping function to align the labels. Finally, these aligned labels can be used to train gaze estimation models. Therefore, our method can be combined with any existing method. Experimental results show that our GLA method can effectively alleviate the label distribution shift, and SOTA gaze estimation methods can be further improved obviously.

Abstract: Despite the significant role textto-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we aim to design a lightweight model to reduce usage costs. First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module. Second, we are the first to introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we have designed a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing parameter count. Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation. By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M. Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on HumanML3D dataset and 0.161 (vs. 0.228) on KIT-ML dataset.

Abstract: Multiobject tracking faces a major challenge in handling the variations of tracked targets within complex scenes. In existing transformer-based tracking methods, typically each tracked target is only associated with one track query. However, trajectories in crowded scenes often experience varying levels of occlusion, making the association brittle for using a single track query to identify the tracked target. Therefore, we argue that relying on a single track query to track a target in complex scenes is inadequate. In this paper, we introduce TGFormer, with the core idea of designing a Track Query Group for each tracked target. Each group encompasses track queries that handle the same tracked target across different levels of occlusion scenes. To achieve long-term robust association, we propose a novel updater that integrates temporal memories and occlusion-aware features to update the Track Query Group, ensuring the tracked target can be consistently captured in complex scenes. Additionally, we introduce a Position Predictor that allows TGFormer to forecast motion trends, helping the model accurately locate moving tracklets. Experimental results show that our method achieves competitive performance on the MOT Challenge and DanceTrack datasets.

Beijing Institute of Technology Shenzhen MSU-BIT University Chongqing Changan Automobile Co., Ltd., Beijing Institute of Technology Chongqing Changan Automobile Co., Ltd., Chongqing Changan Automobile Co., Ltd., Beijing Institute of Technology Chongqing Changan Automobile Co., Ltd., Chongqing Changan Automobile Co., Ltd., Beijing Institute of Technology, Chongqing Changan Automobile Co., Ltd., Chongqing Changan Automobile Co., Ltd., Shenzhen MSU-BIT University Beijing Institute of Technology, Shenzhen MSU-BIT University

Abstract: The Multimodal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework, which aims to improve autonomous driving performance under perception-limited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural language QA pairs, 1.7 million grounding task data. To evaluate the model’s utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution. Extensive experiments validate the effectiveness of our proposed method.

Abstract: Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformerbased detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.

Academy for Engineering and Technology, Fudan University, Shanghai, China Innovation Platform for Academicians of Hainan Province, Haikou, Hainan, China, Academy for Engineering and Technology, Fudan University, Shanghai, China, Academy for Engineering and Technology, Fudan University, Shanghai, China, Duke Kunshan University, Suzhou, China, Chongqing Changan Automobile Co., Ltd. Chongqing, China, Chongqing Changan Automobile Co., Ltd. Chongqing, China, Chongqing Changan Automobile Co., Ltd. Chongqing, China, Academy for Engineering and Technology, Fudan University, Shanghai, China Innovation Platform for Academicians of Hainan Province, Haikou, Hainan, China

Abstract: As a potential application of Vehicleto-Everything (V2X) communication, multi-agent collaborative perception has achieved significant success in 3D object detection. While these methods have demonstrated impressive results on standard benchmarks, the robustness of such approaches in the face of complex real-world environments requires additional verification. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate the robustness of collaborative perception methods in the presence of natural corruptions typical of real-world environments. Furthermore, we propose DSRC, a robustness-enhanced collaborative perception method aiming to learn Density-insensitive and Semantic-aware collaborative Representation against Corruptions. DSRC consists of two key designs: i) a semantic-guided sparse-to-dense distillation framework, which constructs multi-view dense objects painted by ground truth bounding boxes to effectively learn density-insensitive and semantic-aware collaborative representation; ii) a feature-to-point cloud reconstruction approach to better fuse critical collaborative representation across agents. To thoroughly evaluate DSRC, we conduct extensive experiments on real-world and simulated datasets. The results demonstrate that our method outperforms state-of-the-art collaborative perception methods in both clean and corrupted conditions.

Abstract: Multimodal salient object detection (SOD) through the integration of additional data such as depth or thermal information has become a significant task in computer vision during recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to effectively integrate multi-modal information, often resulting in poorly defined object edges or overconfident inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task utilizing diffusion models. We introduce DiMSOD, which enables the concurrent use of local (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient, only requiring fine-tuning of our newly introduced modules on the existing stable diffusion, which not only reduces the fine-tuning cost, making it more viable for practical use, but also enhances the integration of multi-modal conditional controls. Specifically, we have developed modules including SOD-ControlNet, Feature Adaptive Network (FAN), and Feature Injection Attention Network (FIAN) to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.

Abstract: Recently, transformerbased methods have been introduced to estimate 3D human pose from multiple views by aggregating the spatial-temporal information of human joints to achieve the lifting of 2D to 3D. However, previous approaches cannot model the inter-frame correspondence of each view's joint individually, nor can they directly consider all view interactions at each time, leading to insufficient learning of multi-view associations. To address this issue, we propose a Spatial-View-Temporal transformer (SVTformer) to decouple spatial-view-temporal information in sequential order for correlation learning and model dependencies between them in a local-to-global manner. SVTformer includes an attended Spatial-View-Temporal (SVT) patch embedding to attentively capture the local features of the input poses and stacked SVT encoders to extract global spatial-view-temporal dependencies progressively. Specifically, SVT encoders perform three reconstructions sequentially to attended features with the learning through view decoupling for temporal-enhanced spatial correlation, temporal decoupling for spatial-enhanced view correlation, and another view decoupling for spatial-enhanced temporal relationship. This decoupling-coupling-decoupling multi-view scheme enables us to alternatively model the inter-joint spatial relationships, cross-view dependencies, and temporal motion associations. We evaluate the proposed SVTformer on three popular 3D HPE datasets, and it yields state-of-the-art performance. It effectively deals with ill-posed problems and enhances the accuracy of 3D human pose estimation.

Abstract: Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine crossview semantic consistency and necessitate complex data preparation processes, therefore hindering view-consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian. We optimize IDSF with a two-step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel-view semantic segmentation, object selection, and 3D object detection. Extensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.

Abstract: Lipreading is to utilize the visual information of the speaker’s lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate different frame rate branches to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably leads to the loss of fine-grained temporal information within frames. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely time-segmented voxel graph list, where the most significant local voxels are temporally connected into a graph list. Then we design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames, together with the local relative spatial and temporal features contained in voxel graph list are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, which enables the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both the event-based and video-based lip-reading counterparts.

School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V., Tsinghua Shenzhen International Graduate School, Tsinghua University, School of Science, Harbin Institute of Technology (Shenzhen), Tsinghua Shenzhen International Graduate School, Tsinghua University, School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) Pengcheng Laboratory, School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

Abstract: Nuclei segmentation and classification provide an essential basis for tumor immune microenvironment analysis. The previous nuclei segmentation and classification models require splitting large images into smaller patches for training, leading to two significant issues. First, nuclei at the borders of adjacent patches often misalign during inference. Second, this patchbased approach significantly increases the model's training and inference time. Recently, Mamba has garnered attention for its ability to model large-scale images with linear time complexity and low memory consumption. It offers a promising solution for training nuclei segmentation and classification models on full-sized images. However, the Mamba orientation-based scanning method lacks account for category-specific features, resulting in suboptimal performance in scenarios with imbalanced class distributions. To address these challenges, this paper introduces a novel scanning strategy based on category probability sorting, which independently ranks and scans features for each category according to confidence from high to low. This approach enhances the feature representation of uncertain samples and mitigates the issues caused by imbalanced distributions. Extensive experiments conducted on four public datasets demonstrate that our method outperforms state-of-the-art approaches, delivering superior performance in nuclei segmentation and classification tasks.

Harbin Institute of Technology, China Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK, Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China, Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China Pengcheng Laboratory, China, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Yau Mathematical Sciences Center, Tsinghua University, China

Abstract: We introduce SONO, a novel method leveraging SecondOrder Neural Ordinary Differential Equations (Second-Order NODEs) to enhance cross-modal few-shot learning. By employing a simple yet effective architecture consisting of a Second-Order NODEs model paired with a cross-modal classifier, SONO addresses the significant challenge of overfitting, which is common in few-shot scenarios due to limited training examples. Our second-order approach can approximate a broader class of functions, enhancing the model's expressive power and feature generalization capabilities. We initialize our cross-modal classifier with text embeddings derived from class-relevant prompts, streamlining training efficiency by avoiding the need for frequent text encoder processing. Additionally, we utilize text-based image augmentation, exploiting CLIP’s robust image-text correlation to enrich training data significantly. Extensive experiments across multiple datasets demonstrate that SONO outperforms existing state-of-the-art methods in few-shot learning performance.

Abstract: Learning representations from numerous 2D image data has shown promising performance, yet very few works apply this representations to point cloud registration. In this paper, we explore how to leverage the 2D information to assist the point cloud registration, and propose IAPReg, an ImageAssisted Partial 3D point cloud Registration framework with the multi-view images generated by the input point cloud. It is expected to enrich 3D information with 2D knowledge, and leverage 2D knowledge to assist with point cloud registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using some well-established models. To fuse the information learned from 2D and 3D modalities, inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method to reduce the impact of inaccurate correspondences on registration. However, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, where the 2D knowledge is employed to provide weight information of correspondences. Extensive experiments perform that our method outperform the state-of-the-art method without additional 2D training data.

School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of automation, Chinese Academy of Sciences Luoyang Institute for Robot and Intelligent Equipment, School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of automation, Chinese Academy of Sciences Luoyang Institute for Robot and Intelligent Equipment, GigaAI, GigaAI, GigaAI, School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of automation, Chinese Academy of Sciences Luoyang Institute for Robot and Intelligent Equipment, Institute of automation, Chinese Academy of Sciences Luoyang Institute for Robot and Intelligent Equipment

Abstract: World models have demonstrated superiority in autonomous driving, particularly in the generation of multiview driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which incorporates a Large Language Model (LLM) to facilitate the creation of user-defined driving videos. Specifically, a trajectory generation function library is developed to produce trajectories that conform to user descriptions. Subsequently, an HDMap generator is designed to learn the mapping from trajectories to road structures. Ultimately, we propose the Unified Multi-View Model (UniMVM) to enhance temporal and spatial coherence in the generated multi-view driving videos. To the best of our knowledge, DriveDreamer-2 is the first world model to generate customized driving videos, and it can generate uncommon driving videos (e.g., vehicles abruptly cut in) in a user-friendly manner. Besides, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, video generation quality of DriveDreamer-2 surpasses other state-of-the-art methods, showcasing FID and FVD scores of 11.2 and 55.7, representing relative improvements of ~30% and ~50%.

Abstract: The AudioVisual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a novel event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.

Abstract: Autonomous vehicles (AVs) rely on LiDAR sensors for environmental perception and decisionmaking in driving scenarios. However, ensuring the safety and reliability of AVs in complex environments remains a pressing challenge. To address this issue, we introduce a real-world dataset (ROLiD) comprising LiDAR-scanned point clouds of two random objects: water mist and smoke. In this paper, we introduce a novel adversarial perspective by proposing an attack framework that utilizes water mist and smoke to simulate environmental interference. Specifically, we propose a point cloud sequence generation method using a motion and content decomposition generative adversarial network named PCS-GAN to simulate the distribution of random objects. Furthermore, leveraging the simulated LiDAR scanning characteristics implemented with Range Image, we examine the effects of introducing random object perturbations at various positions on the target vehicle. Extensive experiments demonstrate that adversarial perturbations based on random objects effectively deceive vehicle detection and reduce the recognition rate of 3D object detection models.

Abstract: Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous framesentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for visual storytelling. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames for guiding the model. Extensive experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.

Abstract: Domain Adaptive Object Detection (DAOD) transfers knowledge from a labeled source domain to an unannotated target domain under closedset assumption. Universal DAOD (UniDAOD) extends DAOD to handle open-set, partial-set, and closed-set domain adaptation. In this paper, we first unveil two issues: domain-private category alignment is crucial for global-level features, and the domain probability heterogeneity of features across different levels. To address these issues, we propose a novel Dual Probabilistic Alignment (DPA) framework to model domain probability as Gaussian distribution, enabling the heterogeneity domain distribution sampling and measurement. The DPA consists of three tailored modules: the Global-level Domain Private Alignment (GDPA), the Instance-level Domain Shared Alignment (IDSA), and the Private Class Constraint (PCC). GDPA utilizes the global-level sampling to mine domain-private category samples and calculate alignment weight through a cumulative distribution function to address the global-level private category alignment. IDSA utilizes instance-level sampling to mine domain-shared category samples and calculates alignment weight through Gaussian distribution to conduct the domain-shared category domain alignment to address the feature heterogeneity. The PCC aggregates domain-private category centroids between feature and probability spaces to mitigate negative transfer. Extensive experiments demonstrate that our DPA outperforms state-of-the-art UniDAOD and DAOD methods across various datasets and scenarios, including open, partial, and closed sets.

Academy for Engineering and Technology, Fudan University Cognition and Intelligent Technology Laboratory (CIT Lab) Institute of Metaverse & Intelligent Medicine, Fudan University, Academy for Engineering and Technology, Fudan University Cognition and Intelligent Technology Laboratory (CIT Lab) Institute of Metaverse & Intelligent Medicine, Fudan University, Academy for Engineering and Technology, Fudan University Cognition and Intelligent Technology Laboratory (CIT Lab) Institute of Metaverse & Intelligent Medicine, Fudan University, Academy for Engineering and Technology, Fudan University Cognition and Intelligent Technology Laboratory (CIT Lab) Institute of Metaverse & Intelligent Medicine, Fudan University Jilin Provincial Key Laboratory of Intelligence Science and Engineering, Changchun, China Engineering Research Center of AI and Robotics, Ministry of Education, Shanghai, China

Abstract: As the global population ages and the incidence of chronic diseases increases, the demand for early detection of abnormal medical conditions is increasing. Traditional health monitoring methods often require significant resources and specialized personnel, limiting their widespread use. Leveraging advancements in AI technologies, this study proposes a noninvasive method for detecting abnormal medical conditions from image data. A multimodal perception framework is introduced, integrating features from various modalities, including facial expressions and body postures, to enhance detection accuracy. The framework employs a Cascaded Squeeze-Excitation (CSE) module, consisting of Adaptive and Multi-modal Squeeze-Excitation components, to capture complex feature dependencies and improve cross-modal performance. Extensive experiments demonstrate the effectiveness of this approach, showing improved performance over existing methods. In addition, a new dataset that encompasses a wide range of medical conditions has been released, providing a valuable resource for future research in this domain.

Abstract: Given a pair of images depicting a person and a garment separately, imagebased 3D virtual try-on methods aim to reconstruct a 3D human model that realistically portrays the person wearing the desired garment. In this paper, we present IPVTON, a novel image-based 3D virtual try-on framework. IPVTON employs score distillation sampling with image prompts to optimize a hybrid 3D human representation, integrating target garment features into diffusion priors through an image prompt adapter. To avoid interference with non-target areas, we leverage mask-guided image prompt embeddings to focus the image features on the try-on regions. Moreover, we impose geometric constraints on the 3D model with a pseudo silhouette generated by ControlNet, ensuring that the clothed 3D human model retains the shape of the source identity while accurately wearing the target garments. Extensive qualitative and quantitative experiments demonstrate that IPVTON outperforms previous methods in image-based 3D virtual try-on tasks, excelling in both geometry and texture.

Abstract: Adversarial finetuning methods enhance adversarial robustness via fine-tuning the pre-trained model in an adversarial training manner. However, we identify that some specific latent features of adversarial samples are confused by adversarial perturbation and lead to an unexpectedly increasing gap between features in the last hidden layer of natural and adversarial samples. To address this issue, we propose a disentanglement-based approach to explicitly model and further remove the specific latent features. We introduce a feature disentangler to separate out the specific latent features from the features of the adversarial samples, thereby boosting robustness by eliminating the specific latent features. Besides, we align clean features in the pre-trained model with features of adversarial samples in the fine-tuned model, to benefit from the intrinsic features of natural samples. Empirical evaluations on three benchmark datasets demonstrate that our approach surpasses existing adversarial fine-tuning methods and adversarial training baselines.

Abstract: Recent studies showed that the generalization of neural networks is correlated with the sharpness of the loss landscape and flat minima suggests a better generalization ability than sharp minima. In this paper, we propose a novel method called optimum shifting, which changes the parameters of a neural network from a sharp minimum to a flatter one while maintaining the same training loss value. Our method is based on the observation that when the input and output of a neural network are fixed, the matrix multiplications within the network can be treated as systems of underdetermined linear equations, enabling adjustment of parameters in the solution space, which can be simply accomplished by solving a constrained optimization problem. Furthermore, we introduce a practical stochastic optimum shifting technique utilizing the neural collapse theory to reduce computational costs and provide more degrees of freedom for optimum shifting. Extensive experiments with various deep neural network architectures on benchmark datasets demonstrate the effectiveness of our method.

Abstract: We consider a class of optimization problems defined by a system of linear equations with min and max operators. This class of optimization problems has been studied under restrictive conditions, such as, (C1) the halting or stability condition; (C2) the nonnegative coefficients condition; (C3) the sum upto 1 condition; and (C4) the only min or only max operator condition. Several seminal results in the literature focus on special cases. For example, turn-based stochastic games correspond to conditions C2 and C3; and Markov decision process to conditions C2, C3, and C4. However, the systematic computational complexity study of all the cases has not been explored, which we address in this work. Some highlights of our results are: with conditions C2 and C4, and with conditions C3 and C4, the problem is NP-complete, whereas with condition C1 only, the problem is in UP intersects coUP. Finally, we establish the computational complexity of the decision problem of checking the respective conditions.

Abstract: The problem of checking satisfiability of linear real arithmetic (LRA) and nonlinear real arithmetic (NRA) formulas has broad applications, in particular, they are at the heart of logic-related applications such as logic for artificial intelligence, program analysis, etc. While there has been much work on checking satisfiability of unquantified LRA and NRA formulas, the problem of checking satisfiability of quantified LRA and NRA formulas remains a significant challenge. The main bottleneck in the existing methods is a computationally expensive quantifier elimination step. In this work, we propose a novel method for efficient quantifier elimination in quantified LRA and NRA formulas. We propose a template-based Skolemization approach, where we automatically synthesize linear/polynomial Skolem functions in order to eliminate quantifiers in the formula. The key technical ingredient in our approach are Positivstellensätze theorems from algebraic geometry, which allow for an efficient manipulation of polynomial inequalities. Our method offers a range of appealing theoretical properties combined with a strong practical performance. On the theory side, our method is sound, semi-complete, and runs in subexponential time and polynomial space, as opposed to existing sound and complete quantifier elimination methods that run in doubly-exponential time and at least exponential space. On the practical side, our experiments show superior performance compared to state of the art SMT solvers in terms of the number of solved instances and runtime, both on LRA and on NRA benchmarks.

Abstract: Graph generation and enumeration problems often require handling equivalent graphsthose that differ only in vertex labeling. We study how to extend SAT Modulo Symmetries (SMS), a framework for eliminating such redundant graphs, to handle more complex constraints. While SMS was originally designed for constraints in propositional logic (in NP), we now extend it to handle quantified Boolean formulas (QBF), allowing for more expressive specifications like non-3-colorability (a coNP-complete property). We develop two approaches: a static QBF encoding and a dynamic method integrating SMS into QBF solvers. Our analysis reveals that while specialized approaches can be faster, QBF-based methods offer easier implementation and formal verification capabilities.

Institute of Software, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Stanford University, University of Oxford, Institute of Software, Chinese Academy of Sciences University of the Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of the Chinese Academy of Sciences

Abstract: Optimization Modulo Nonlinear Real Arithmetic, abbreviated as OMT(NRA), generally focuses on optimizing a given objective subject to quantifierfree Boolean combinations of primitive constraints, including Boolean variables, polynomial equations, and inequalities. It is widely applicable in areas like program verification, analysis, planning, and so on. The existing solver, OptiMathSAT, officially supporting OMT(NRA), employs an incomplete algorithm. We present a sound and complete algorithm, Optimization Cylindrical Algebraic Covering (OCAC), integrated within the Conflict-Driven Clause Learning (CDCL) framework, specifically tailored for OMT(NRA) problems. We establish the correctness and termination of CDCL(OCAC) and explore alternative approaches using cylindrical algebraic decomposition (CAD) and first-order formulations. Our work includes the development of the first complete OMT solver for NRA, demonstrating significant performance improvements. In benchmarks generated from SMT-LIB instances, our algorithm finds the optimum value in about 150% more instances compared to the current leading solver, OptiMathSAT.

Abstract: The MaxSAT problem is an optimization version of the satisfiability problem (SAT). A tight lower bound (LB) on the number of falsified soft clauses in a MaxSAT solution is crucial for the efficiency of Branchand-Bound (BnB) MaxSAT solvers. To compute an LB, modern BnB solvers detect disjoint inconsistent subsets of soft clauses, called cores, using unit propagation. A notable feature of these solvers is that soft clauses belonging to already detected cores cannot be reused to detect additional cores, limiting the number of cores that can be detected. In this paper, we propose an unlocking mechanism that allows the reuse of soft clauses in already detected cores while ensuring the soundness of LB. Experimental results show that this unlocking mechanism consistently improves the performance of a state-of-the-art BnB solver. In addition, it allowed us to win the first two places in the exact unweighted category of the MaxSAT Evaluation 2024.

Abstract: The effectiveness of satisfiability solvers strongly depends on the quality of the encoding of a given problem into conjunctive normal form. Cardinality constraints are prevalent in numerous problems, prompting the development and study of various types of encoding. We present a novel approach to optimizing cardinality constraint encodings by exploring the impact of literal orderings within the constraints. By strategically placing related literals nearby each other, the encoding generates auxiliary variables in a hierarchical structure, enabling the solver to reason more abstractly about groups of related literals. Unlike conventional metrics such as formula size or propagation strength, our method leverages structural properties of the formula to redefine the roles of auxiliary variables to enhance the solver's learning capabilities. The experimental evaluation on benchmarks from the maximum satisfiability competition demonstrates that literal orderings can be more influential than the choice of the encoding type. Our literal ordering technique improves solver performance across various encoding techniques, underscoring the robustness of our approach.

Abstract: Algorithms and hardware for solving quadratic unconstrained binary optimization (QUBO) problems have made significant recent progress. This advancement has focused attention on formulating combinatorial optimization problems as quadratic polynomials. To improve the performance of solving large QUBO problems, it is essential to minimize the number of binary variables used in the objective function. In this paper, we propose a QUBO formulation that offers a bit capacity advantage over conventional quadratization techniques. As a key application, this formulation significantly reduces the number of binary variables required for scorebased Bayesian network structure learning. Experimental results on 16 instances, ranging from 37 to 223 variables, demonstrate that our approach requires fewer binary variables than quadratization by orders of magnitude. Moreover, an annealing machine that implement our formulation have outperformed existing algorithms in score maximization.

Abstract: The satisfiability (SAT) problem of higherorder quantified Boolean formula (HOQBF) emerged as a natural generalization of SAT, quantified SAT, and second-order quantified SAT. It allows succinct encoding of k-EXPTIME problems beyond the reach of prior Boolean satisfiability formulations, but its application was hampered by the lack of solvers. In this paper, we present the first HOQBF solver that leverages techniques from the model-checking community. Our HOQBF solver is based on reduction to higher-order model checking, which is a generalization from model checking of while-programs to that of higher-order functional programs. The ability of a higher-order model checker to deal with higher-order functions in a program is used to reason about higher-order quantifiers in HOQBF.

Abstract: Solving a Constraint Satisfaction Problem (CSP) usually requires a model typically using existing basic constraints. The most flexible form of constraint, adhoc (generic) constraints defined with certain constraint representations, such as binary constraint tree (BCT) and decision diagrams, have been proposed where basic constraints in intensional form are insufficient. A modeller may wish to combine basic constraints using logic operators (and, or, negation). However, negation, a key logical operator for expressivity is not tractable in many existing constraint representations. This creates a dilemma, for modelling, we would desire more flexibility, but a model whose operations are intractable may in turn be impractical. In this paper, we give a framework which allows for a tractable negation operator on constraint representations. We apply the framework on the BCT and ordered decision diagram constraints, giving new subforms. These subforms can be strictly more succinct than ordered multi-valued decision diagrams (OMDD), while being as tractable as OMDD for logical combinations. We give applications to show effective propagators from logical combinations and in building large constraint models for configuration problems.

Abstract: Variable ordering heuristics (VOH) play a central role in solving Constraint Satisfaction Problems (CSP). The performance of different VOHs may vary greatly when solving the same CSP instance, so identifying an efficient candidate VOH for a given CSP has been a key issue in the community. In this study, we propose a predictionbased approach to adaptively select efficient VOHs for different CSPs from a set of candidates. Our work demonstrates that efficient candidate VOHs can be identified by learning from the topology of search trees. Specifically, we propose to represent the topology of a binary search tree by the sequence of the Numbers of Positive Decisions (NPD) made before each failure occurs. Based on the representation, we predict the total failure number of a search tree from its beginning part. When solving a CSP, we run a probing procedure to obtain the NPD sequences generated by candidate VOHs and select an efficient one for the resolution according to the prediction results. Our experiments show that the Long Short Term Memory model and Gradient Boosting Decision Tree models trained with the search trees sampled from easy instances are effective in identifying efficient VOHs for hard instances. The models capture some common structure properties hidden in the search trees of different problems. Our approach outperforms the state-of-the-art adaptive VOHs in terms of the number of solved instances and the PAR2 score of runtime.

Abstract: Networked time series are time series on a graph, one for each node, with applications in traffic and weather monitoring. Graph neural networks are natural candidates for networked time series imputation and have recently outperformed existing alternatives such as recurrent and generative models for time series imputation as they utilize a relational inductive bias for imputation. However, existing GNNbased approaches fail to capture the higher-order topological structure between sensors, which are shaped by recurring substructures in the graph, referred to as temporal motifs. In addition, it remains uncertain which motifs are the most pivotal motifs guiding the imputation task in networked time series. In this paper, we fill in this gap by proposing a graph neural network designed to leverage motif structures within the network by employing weighted motif adjacency matrices to capture higher-order neighborhood information. In particular, (1) we design a motif-wise multi-view attention module that explicitly captures various higher-order structures along with an attention mechanism that automatically assigns high weights to informative ones in order to maximize the use of higher-order information. (2) We introduce a gated fusion module by merging gated recurrent networks and graph convolutional networks to capture the spatial and temporal dependency in order to reflect the intricate impacts of temporal and spatial influence. Experimental results demonstrate that when compared to state-of-the-art models for time-series imputation tasks, our proposed model can reduce the error by around 19%.

Abstract: The problem of forecasting spatiotemporal events such as crimes and accidents is crucial to public safety and city management. Besides accuracy, interpretability is also a key requirement for spatiotemporal forecasting models to justify the decisions. Merely presenting predicted scores fails to convince the public and does not contribute to future urban planning. Interpretation of the spatiotemporal forecasting mechanism is, however, challenging due to the complexity of multisource spatiotemporal features, the non-intuitive nature of spatiotemporal patterns for non-expert users, and the presence of spatial heterogeneity in the data. Currently, no existing deep learning model intrinsically interprets the complex predictive process learned from multi-source spatiotemporal features. To bridge the gap, we propose GeoPro-Net, an intrinsically interpretable spatiotemporal model for spatiotemporal event forecasting problems. GeoPro-Net introduces a novel Geo-concept convolution operation, which employs statistical tests to extract predictive patterns in the input as "Geo-concepts'', and condenses the "Geo-concept-encoded'' input through interpretable channel fusion and geographic-based pooling. In addition, GeoPro-Net learns different sets of prototypes of concepts inherently, and projects them to real-world cases for interpretation. Comprehensive experiments and case studies on four real-world datasets demonstrate that GeoPro-Net provides better interpretability while still achieving competitive prediction performance compared with state-of-the-art baselines.

Abstract: Stock prediction stands as a pivotal research objective within the Fintech. Existing deep learning research revolves around the development and scaling of one individual neural network predictor. However, in the dynamic and noisy landscape of the stock market, reliance solely on a single predictor poses risks of limited adaptability to diverse market conditions and challenges in effectively integrating multisource information. Besides, top-down teaching and bottom-up hierarchical decision-making paradigms are critical for robust and accurate stock prediction within successful quantitative firms. Nonetheless, there is scarcely any research that integrates this workflow into stock prediction. To this end, we propose Diffusion Generated Hierarchical Mixture-of-Experts (DHMoE) to emulate such workflow in stock prediction. Specifically, DHMoE is crafted as a three-layer tree structure, where each expert functions as a node within the tree and their parameters are generated in a top-down, recursive manner. Recognizing the leading role of the top-level root expert, we harness the robust capabilities of diffusion models for generating and introduce the Diffusion Inverted Transformer (DIT) as the root expert. The DIT is tailored to receive information from various modalities as conditional inputs and allocate parameters to bottom-level experts. These bottom-level experts are responsible for performing predictions specific to their respective input modalities. The prediction results are then synthesized in a bottom-up manner, culminating in the final prediction outcomes. Experiments on three stock trading datasets reveal that DHMoE outperforms state-of-the-art methods in terms of both cumulative and risk-adjusted returns.

Abstract: Modeling geospatial tabular data with deep learning has become a promising alternative to traditional statistical and machine learning approaches. However, existing deep learning models often face challenges related to scalability and flexibility as datasets grow. To this end, this paper introduces GeoAggregator, an efficient and lightweight algorithm based on transformer architecture designed specifically for geospatial tabular data modeling. GeoAggregators explicitly account for spatial autocorrelation and spatial heterogeneity through Gaussianbiased local attention and global positional awareness. Additionally, we introduce a new attention mechanism that uses the Cartesian product to manage the size of the model while maintaining strong expressive power. We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregators achieve the best or second-best performance compared to their competitors on nearly all datasets. GeoAggregator's efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing the GeoAggregator's performance.

Abstract: Knowledge Graphs (KGs) are structured data presented as directed graphs. Due to the common issues of incompleteness and inaccuracy encountered during construction and maintenance, completing KGs becomes a critical task. Inductive Knowledge Graph Completion (KGC) excels at inferring patterns or models from seen data to be applied to unseen data. However, existing methods mainly focus on new entities, while relations are usually randomly initialized. To this end, we propose TARGI, a simple yet effective inductive method for KGC. Specifically, we first construct a global relation graph for each topology from a global graph perspective, thus leveraging the invariance of relation structures. We then utilize this graph to aggregate the rich embeddings of new relations and new entities, thereby performing KGC robustly in inductive scenarios. This successfully addresses the excessive reliance on the degree of relations and resolves the high complexity and limited scope of enclosing subgraph sampling in existing fully inductive algorithms. We conduct KGC experiments on six inductive datasets using inference data where entities are entirely new and new relations at 100 percent, 50 percent, and 0 percent radios. Extensive results demonstrate that our model accurately learns the topological structures and embeddings of new relations, and guides the embedding learning of new entities. Notably, our model outperforms 15 SOTA methods, especially in two fully inductive datasets.

Abstract: Large language models (LLMs) provide a promising way for accurate sessionbased recommendation (SBR), but they demand substantial computational time and memory. Knowledge distillation (KD)-based methods can alleviate these issues by transferring the knowledge to a small student, which trains a student based on the predictions of a cumbersome teacher. However, these methods encounter difficulties for LLM-based KD in SBR. 1) It is expensive to make LLMs predict for all instances in KD. 2) LLMs may make ineffective predictions for some instances in KD, e.g., incorrect predictions for hard instances or similar predictions as existing recommenders for easy instances. In this paper, we propose an active LLM-based KD method in SBR, contributing to sustainable AI. To efficiently distill knowledge from LLMs with limited cost, we propose to extract a small proportion of instances predicted by LLMs. Meanwhile, for a more effective distillation, we propose an active learning strategy to extract instances that are as effective as possible for KD from a theoretical view. Specifically, we first formulate gains based on potential effects (e.g., effective, similar, and incorrect predictions by LLMs) and difficulties (e.g., easy or hard to fit) of instances for KD. Then, we propose to maximize the minimal gains of distillation to find the optimal selection policy for active learning, which can largely avoid extracting ineffective instances in KD. Experiments on real-world datasets show that our method significantly outperforms state-of-the-art methods for SBR.

Abstract: Multimedia recommender systems focus on utilizing behavioral information and content information to model user preferences. Typically, it employs pretrained feature encoders to extract content features, then fuses them with behavioral features. However, pre-trained feature encoders often extract features from the entire content simultaneously, including excessive preference-irrelevant details.We speculate that it may result in the extracted features not containing sufficient features to accurately reflect user preferences. To verify our hypothesis, we introduce an attribution analysis method for visually and intuitively analyzing the content features. The results indicate that certain items’ content features exhibit the issues of information drift and information omission, reducing the expressive ability of features. Building upon this finding, we propose an effective and efficient general Behaviordriven Feature Adapter (BeFA) to tackle these issues. This adapter reconstructs the content feature with the guidance of behavioral information, enabling content features accurately reflecting user preferences. Extensive experiments demonstrate the effectiveness of the adapter across all multimedia recommendation methods.

Abstract: In recommender systems, postclick conversion rate (CVR) estimation is an essential task to model user preferences for items and estimate the value of recommendations. Sample selection bias (SSB) and data sparsity (DS) are two persistent challenges for post-click conversion rate (CVR) estimation. Currently, entire-space approaches that exploit unclicked samples through knowledge distillation are promising to mitigate SSB and DS simultaneously. Existing methods use non-conversion, conversion, or adaptive conversion predictors to generate pseudo labels for unclicked samples. However, they fail to consider the unbiasedness and information limitations of these pseudo labels. Motivated by such analysis, we propose an entire-space variational information exploitation framework (EVI) for CVR prediction. First, EVI uses a conditional entire-space CVR teacher to generate unbiased pseudo labels. Then, it applies variational information exploitation and logit distillation to transfer non-click space information to the target CVR estimator. We conduct extensive offline experiments on six large-scale datasets. EVI demonstrated a 2.25% average improvement compared to the state-of-the-art baselines.

Abstract: With the widely adoption of microservice architecture in the cloud computing industry, accurate prediction of workloads, especially CPU cores, can support reasonable resource allocation, thereby optimizing the resource utilization of the system. However, workload prediction is challenging in two dimensions. In the temporal dimension, workload series 1) has nonstationary characteristics, leading to poor predictability; 2) has a multi-periodic nature with entangled temporal patterns; 3) may be influenced by dynamic system states like response time and number of requests. In the spatial dimension, when regarding microservices as nodes in a distributed system, there is no topology caused by physical connections, but exists complex similarity dependencies. Extracting robust spatial features from these dependencies presents difficulties. To address these, we propose STEAM, a Spatio Temporal Heterogenous Graph Contrastive Learning for Microservice Workload Prediction. STEAM leverages non-stationary decomposition self-attention to extract temporal features from non-stationary and multi-periodic workload series, while the decoupled embedding is used to capture system state information of microservices. By treating microservices as nodes and constructing a similarity graph, STEAM effectively models the similarity relationships between microservices. To reduce the prior interference caused by the similarity threshold and improve the robustness, STEAM constructs two heterogeneous augmentation views and uses contrastive learning to extract the shared consistent spatial features. The multi-scale learning is adopted to model the long- and short-term temporal features, forming a spatio-temporal stacking structure. Experiments on two datasets, including MS dataset obtained from Ant Group, which is one of the world’s largest cloud service providers, demonstrate the superiority of STEAM.

Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Machine Learning Platform Department, Tencent TEG, Machine Learning Platform Department, Tencent TEG, Department of Computer Science and Technology, Tsinghua University Machine Learning Platform Department, Tencent TEG, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University

Abstract: Crossdomain recommendation (CDR) mitigates data sparsity and cold-start issues in recommendation systems. While recent CDR approaches using graph neural networks (GNNs) capture complex user-item interactions, they rely on manually designed architectures that are often suboptimal and labor-intensive. Additionally, extracting valuable behavioral information from source domains to improve target domain recommendations remains challenging. To address these challenges, we propose Behavior importance-aware Graph Neural Architecture Search (BiGNAS), a framework that jointly optimizes GNN architecture and data importance for CDR. BiGNAS introduces two key components: a Cross-Domain Customized Supernetwork and a Graph-Based Behavior Importance Perceptron. The supernetwork, as a one-shot, retrain-free module, automatically searches the optimal GNN architecture for each domain without the need for retraining. The perceptron uses auxiliary learning to dynamically assess the importance of source domain behaviors, thereby improving target domain recommendations. Extensive experiments on benchmark CDR datasets and a large-scale industry advertising dataset demonstrate that BiGNAS consistently outperforms state-of-the-art baselines. To the best of our knowledge, this is the first work to jointly optimize GNN architecture and behavior data importance for cross-domain recommendation.

SKLCCSE, School of Computer Science and Engineering, Beihang University, China, SKLCCSE, School of Computer Science and Engineering, Beihang University, China, SKLCCSE, School of Computer Science and Engineering, Beihang University, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Huawei Technologies Co., Ltd, China, Institute of Artificial Intelligence, Beihang University, China, SKLCCSE, School of Computer Science and Engineering, Beihang University, China

Abstract: Realworld graphs have inherently complex and diverse topological patterns, known as topological heterogeneity. Most existing works learn graph representation in a single constant curvature space that is insufficient to match the complex geometric shapes, resulting in low-quality embeddings with high distortion. This also constitutes a critical challenge for graph foundation models, which are expected to uniformly handle a wide variety of diverse graph data. Recent studies have indicated that product manifold gains the possibility to address topological heterogeneity. However, the product manifold is still homogeneous, which is inadequate and inflexible for representing the mixed heterogeneous topology. In this paper, we propose a novel Graph Mixture of Riemannian Experts (GraphMoRE) framework to effectively tackle topological heterogeneity by personalized fine-grained topology geometry pattern preservation. Specifically, to minimize the embedding distortion, we propose a topology-aware gating mechanism to select the optimal embedding space for each node. By fusing the outputs of diverse Riemannian experts with learned gating weights, we construct personalized mixed curvature spaces for nodes, effectively embedding the graph into a heterogeneous manifold with varying curvatures at different points. Furthermore, to fairly measure pairwise distances between different embedding spaces, we present a concise and effective alignment strategy. Extensive experiments on real-world and synthetic datasets demonstrate that our method achieves superior performance with lower distortion, highlighting its potential for modeling complex graphs with topological heterogeneity, and providing a novel architectural perspective for graph foundation models.

Abstract: The microscopic cascade prediction task has wide applications in downstream areas like ''rumor detection''. Its goal is to forecast the diffusion routines of information cascade within networks. Existing works typically formulate it as a classification task, which fails to well align with the Social Homophily assumption, as it just use the features of ''infected'' users while neglecting those of ''uninfected'' users in representation learning. Moreover, these methods focus primarily on social relationships, thereby dismissing other vital dimensions like users' historical behavior and the underlying preferences behind it. To address these challenges, we introduce the MSR (Multifaceted SelfRetrieval) framework. During encoding, in addition to the existing social graph, we construct a preference graph to represent ''behavioral preferences'' and further propose a modified multi-channel GRAU for multi-view analysis of cascade phenomenon. For decoding, our approach diverges from classification-based methods by reformulating the task as an information retrieval problem that predicts the target user with similarity measures. Empirical evaluations on public datasets demonstrate that this framework significantly outperforms baselines on Hits@κ and MAP@κ, affirming its enhanced ability.

Abstract: Soccer is a rich testbed for studying multiagent adversarial systems. In this work we focus on the task of reconstructing the noisy trajectories of soccer agents (players and the ball). Previous works that model the behaviours of agents in soccer are limited in two respects: (i) they only focus on short-term context windows (less than or equal to 10 seconds) which are not suitable for reconstructing trajectories impacted by long-term noise, and (ii) they exclusively rely on trajectory context, and do not leverage soccer's auxiliary data streams that can provide additional context. Our Event2Tracking model addresses these limitations. First, our architecture models soccer's long-term structure by processing long-term trajectories (60 seconds in duration). Secondly, our architecture is multimodal. Specifically, it fuses soccer tracking data with event data (which specifies the high-level semantic events that transpire in a game), providing rich context that cannot strictly be inferred from the raw trajectories. We evaluate our method empirically using a reconstruction loss metric. Compared to state-of-the-art approaches, our method substantially improves the accuracy of the ball's and players' reconstructed trajectories.

Abstract: Entity alignment (EA) is crucial for integrating knowledge graphs (KGs) constructed from diverse sources. Conventional unsupervised EA approaches attempt to eliminate human intervention but often suffer from accuracy limitations. With the rise of large language models (LLMs), leveraging their capabilities for EA presents a promising direction. However, it introduces new challenges: formulating the LLMbased EA problem and extracting the background knowledge in LLMs to realize EA without human intervention. This paper proposes HLMEA, a novel hybrid language model-based unsupervised EA method. HLMEA formulates the EA task into a filtering and single-choice problem and synergistically integrates small language models (SLMs) and LLMs. Specifically, SLMs filter candidate entities based on textual representations generated from KG triples. Then, LLMs refine this selection to identify the most semantically aligned entities. An iterative self-training mechanism allows SLMs to distill knowledge from LLM outputs, enhancing the EA ability of hybrid language models in subsequent rounds cooperatively. We also conducted extensive experiments on benchmark datasets to evaluate HLMEA's performance. The results demonstrate that HLMEA significantly outperforms unsupervised and even supervised EA baselines, proving its potential for scalable and effective EA across large KGs. The code and data are available at \url{https://github.com/xnjin-ai/HLMEA}.

Abstract: Information retrieval methods often rely on a single embedding model trained on large, generaldomain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, they often underperform models trained on domain-specific data when testing on their respective domains. Prior work in information retrieval has tackled this through multi-task training, but the idea of routing over a mixture of domain-specific expert retrievers remains unexplored despite the popularity of such ideas in language model generation research. In this work, we introduce RouterRetriever, a retrieval model that leverages a mixture of domain-specific experts by using a routing mechanism to select the most appropriate expert for each query. RouterRetriever is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both models trained on MSMARCO (+2.1 absolute nDCG@10) and multi-task models (+3.2). This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. RouterRetriever is the first work to demonstrate the advantages of routing over a mixture of domain-specific expert embedding models as an alternative to a single, general-purpose embedding model, especially when retrieving from diverse, specialized domains.

Abstract: The coldstart recommendation has been challenging due to the limited historical interactions for new users and new items. Recently, methods based on meta-learning and graph neural networks have been effective in this problem. However, these methods mainly focus on the missing user-item interactions in cold-start scenarios, overlooking the missing of user/item feature information, which significantly limits the quality and effectiveness of node embeddings. To address this issue, we propose a new method called Feature-Structure Adaptive Completion Graph Neural Network (FS-GNN), which is designed to tackle the cold-start problem by simultaneously addressing the missing feature and structure information in a bipartite graph composed of users and items. Specifically, we first design a trainable feature completion module that leverages the knowledge emergence abilities of large language models to enhance node embedding and mitigate the impact of missing features. Then, we incorporate a three-channel structure completion module to simultaneously complete the structures among users-users, items-items, as well as users-items. Finally, we adaptively integrate the feature and structure completion modules in an end-to-end fashion, so as to minimize cross-module interference when completing features and structures simultaneously. This generates more comprehensive and robust embeddings for users and items in recommendation tasks. Experimental results on multiple public benchmark datasets demonstrate significant improvements in our proposed FS-GNN in cold-start scenarios, outperforming or being competitive with state-of-the-art methods.

State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China Computer Center, Peking University, Beijing, China, School of Information Technology & Management, University of International Business and Economics, Beijing, China, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA, Computer Center, Peking University, Beijing, China, Computer Center, Peking University, Beijing, China, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, College of Computer Science, Sichuan University, Chengdu, China

Abstract: Recommender systems are widely used in various realworld applications, but they often encounter the persistent challenge of the user cold-start problem. Cross-domain recommendation (CDR), which leverages user interactions from one domain to improve prediction performance in another, has emerged as a promising solution. However, users with similar preferences in the source domain may exhibit different interests in the target domain. Therefore, directly transferring embeddings may introduce irrelevant source-domain collaborative information. In this paper, we propose a novel graph-based disentangled contrastive learning framework to capture fine-grained user intent and filter out irrelevant collaborative information, thereby avoiding negative transfer. Specifically, for each domain, we use a multi-channel graph encoder to capture diverse user intents. We then construct the affinity graph in the embedding space and perform multi-step random walks to capture high-order user similarity relationships. Treating one domain as the target, we propose a disentangled intent-wise contrastive learning approach, guided by user similarity, to refine the bridging of user intents across domains.Extensive experiments on four benchmark CDR datasets demonstrate that DisCo consistently outperforms existing state-of-the-art baselines, thereby validating the effectiveness of both DisCo and its components.

Abstract: Crossdomain recommendations in healthcare services differ from traditional ones in electronic commerce due to the need for heightened medical privacy protection for a small group of users, while ensuring the majority, who may lack sufficient medical knowledge, can understand the recommendations. To recommend doctors who provide online consultations to health video viewers and enable multimodal cross-domain recommendations from short video platforms (source domain) to online healthcare communities (target domain), this paper introduces a framework based on the User-Centric Synthetic Data Architect (UCSDA) and Pre-trained Large Language Model (PtLLM). UCSDA employs a user-centric, advanced selection-synthesis mechanism to filter users' cold interaction items and synthesize noise items, reducing privacy leakage risk. PtLLM focuses on necessary patient and doctor IDs during the recommendation decision process to generate explanations. The model's effectiveness and scalability were validated using three public datasets and a healthcare cross-domain recommendation dataset. In addition to traditional evaluation metrics, strong privacy metrics and the unique sentence ratio were used to assess privacy protection and interpretability. We also compared the characteristics of privacy protection and interpretability between e-commerce and healthcare recommendation scenarios.

Abstract: Anomaly detection has garnered significant attention for its extensive industrial application value. Most existing methods focus on singleview scenarios and fail to detect anomalies hidden in blind spots, leaving a gap in addressing the demands of multi-view detection in practical applications. Ensemble of multiple single-view models is a typical way to tackle the multi-view situation, but it overlooks the correlations between different views. In this paper, we propose a novel multi-view anomaly detection framework, Intra-view Decoupling and Inter-view Fusion (IDIF), to explore correlations among views. Our method contains three key components: 1) a proposed Consistency Bottleneck module extracting the common features of different views through information compression and mutual information maximization; 2) an Implicit Voxel Construction module fusing features of different views with prior knowledge represented in the form of voxels; and 3) a View-wise Dropout training strategy enabling the model to learn how to cope with missing views during test. The proposed IDIF achieves state-of-the-art performance on three datasets. Extensive ablation studies also demonstrate the superiority of our methods.

Abstract: Recently, Knowledge Graphs (KGs) have been successfully coupled with Large Language Models (LLMs) to mitigate their hallucinations and enhance their reasoning capability, e.g., KGbased retrieval-augmented framework. However, current KG-LLM frameworks lack rigorous uncertainty estimation, limiting their reliable deployment in applications where the cost of errors is significant. Directly incorporating uncertainty quantification into KG-LLM frameworks presents a challenge due to their more complex architectures and the intricate interactions between the knowledge graph and language model components. To address this crucial gap, we propose a new trustworthy KG-LLM framework, UAG (Uncertainty Aware Knowledge-Graph Reasoning), which incorporates uncertainty quantification into the KG-LLM framework. We design an uncertainty-aware multi-step reasoning framework that leverages conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi-step process, we additionally introduce an error rate control module to adjust the error rate within the individual components. Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines.

Abstract: Graph fraud detection (GFD) has rapidly advanced in protecting online services by identifying malicious fraudsters. Recent supervised GFD research highlights that heterophilic connections between fraudster and user greatly impacts detection performance, where the fraudsters tend to camouflage themselves by building more connections to benign users. Despite their promising performance, their label reliance limits its application in unsupervised scenarios; Additionally, accurately capturing complex and diverse heterophily patterns without labels poses a further challenge. Therefore, we propose a Heterophilyguided Unsupervised Graph fraud dEtection approach (HUGE) for unsupervised GFD, which contains two essential components: a heterophily estimation module and an alignment-based fraud detection module. In the heterophily estimation module, we design a novel unsupervised heterophily metric called HALO, which captures the critical graph properties for GFD, enabling its outstanding ability to estimate heterophily with attributes. In the alignment-based fraud detection module, we develop a joint MLP-GNN architecture with ranking loss and asymmetric alignment loss. The ranking loss aligns the predicted fraud score with the relative order of HALO, providing an extra robustness guarantee by comparing heterophily between non-adjacent nodes. Moreover, the asymmetric alignment loss effectively utilizes structural information to alleviate the feature-smooth effects. Extensive experiments on six datasets demonstrate that HUGE consistently outperforms competitors, showcasing its effectiveness and robustness.

Abstract: An issue concerning the use of deep reinforcement learning (RL) agents is whether they can be trusted to perform reliably when deployed, as training environments may not reflect reallife environments. Anticipating instances outside their training scope, learning-enabled systems are often equipped with out-of-distribution (OOD) detectors that alert when a trained system encounters a state it does not recognize or in which it exhibits uncertainty. There exists limited work conducted on the problem of OOD detection within RL, with prior studies being unable to achieve a consensus on the definition of OOD execution within the context of RL. By framing our problem using a Markov Decision Process, we assume there is a transition distribution mapping each state-action pair to another state with some probability. Based on this, we consider the following definition of OOD execution within RL: A transition is OOD if its probability during real-life deployment differs from the transition distribution encountered during training. As such, we utilize conditional variational autoencoders (CVAE) to approximate the transition dynamics of the training environment and implement a conformity-based detector using reconstruction loss that is able to guarantee OOD detection with a pre-determined confidence level. We evaluate our detector by adapting existing benchmarks and compare it with existing OOD detection models for RL.

Abstract: User data spread across multiple modalities has popularized multimodal recommender systems (MMRS). They recommend diverse content such as products, social media posts, TikTok reels, etc., based on a user-item interaction graph. With rising data privacy demands, recent methods propose unlearning private user data from uni-modal recommender systems (RS). However, methods for unlearning item data related to outdated user preferences, revoked licenses, and legally requested removals are still largely unexplored. Previous RS unlearning methods are unsuitable for MMRS due to the incompatibility of their matrix-based representation with the multi-modal user-item interaction graph. Moreover, their data partitioning step degrades performance on each shard due to poor data heterogeneity and requires costly performance aggregation across shards. This paper introduces MMRecUn, the first approach known to us for unlearning in MMRS and unlearning item data. Given a trained RS model, MMRecUn employs a novel Reverse Bayesian Personalized Ranking (BPR) objective to enable the model to forget marked data. The reverse BPR attenuates the impact of user-item interactions within the forget set, while the forward BPR reinforces the significance of user-item interactions within the retain set. Our experiments demonstrate that MMRecUn outperforms baseline methods across various unlearning requests when evaluated on benchmark MMRS datasets. MMRecUn achieves recall performance improvements of up to 49.85% compared to baseline methods and is up to 1.3× faster than the Gold model, which is trained on retain set from scratch. MMRecUn offers significant advantages, including superiority in removing target interactions, preserving retained interactions, and zero overhead costs compared to previous methods.

Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences

Abstract: Graph anomaly detection has attracted significant attention due to its critical applications, such as identifying money laundering in financial systems and detecting fake reviews on social networks. However, two major challenges persist: (1) anomaly detection at the node, edge, and graph levels is often addressed in isolation, hindering the integration of complementary information to identify anomalies arising from collective behaviors; and (2) the inherent label sparsity in graph data, coupled with the difficulty of obtaining highquality annotations, exacerbates bias in detection. To address these challenges, we propose UniFORM, a unified self-supervised anomaly detection framework comprising two modules: UIO and UMC. UIO unifies node-, edge-, and graph-level tasks from a subgraph perspective, leveraging an energy-based GNN for iterative multi-granular anomaly detection. UMC enhances meta-learning through contrastive learning and employs Langevin dynamics to generate phantom samples as substitutes for anomalous data, reducing reliance on labeled data. Extensive experiments on real-world datasets demonstrate that UniFORM significantly outperforms state-of-the-art methods across multiple granularities.

Abstract: Multilabel Out-Of-Distribution (OOD) detection aims to discriminate the OOD samples from the multi-label In-Distribution (ID) ones. Compared with its multiclass counterpart, it is crucial to model the joint information among classes. To this end, JointEnergy, which is a representative multi-label OOD inference criterion, summarizes the logits of all the classes. However, we find that JointEnergy can produce an imbalance problem in OOD detection, especially when the model lacks enough discrimination ability. Specifically, we find that the samples only related to minority classes tend to be classified as OOD samples due to the ambiguous energy decision boundary. Besides, imbalanced multi-label learning methods, originally designed for ID ones, would not make sense for OOD detection scenarios, even producing a serious negative transfer effect. In this paper, we resort to auxiliary outlier exposure (OE) and propose an unknown-aware multi-label learning framework to reshape the uncertainty energy space layout. In this framework, the energy score is separately optimized for tail ID samples and unknown samples, and the energy distribution gap between them is expanded, such that the tail ID samples can have a significantly larger energy score than the OOD ones. What's more, a simple yet effective measure is designed to select more informative OE datasets. Finally, comprehensive experimental results on multiple multi-label and OOD datasets reveal the effectiveness of the proposed method.

Abstract: Recommender Systems (RSs) are widely applied for navigating information, and Collaborative Filtering (CF) is one of prominent recommendation techniques due to the advantages of domain independence and easy interpretation. Among the numerous CF methods, Variational Autoencoders (VAE), benefiting from modeling in a probabilitistic way, stands out in capturing user preferences through representation learning. Despite the superiority, VAEbased CF models still suffer from two challenging problems: (1) Exposure bias: models in training state are narrowly exposed to a limited, biased sample of data, leading to a skewed understanding of users' true preferences; (2) Posterior collapse: models excessively simplify the learned latent variable distributions, generating na"ive representations that are unable to encapsulate the complex data patterns and thereby resulting improper recommendations. In this paper, we propose a Debiased and Representation-enhanced Variational AutoEncoder (DR-VAE) framework for collaborative recommendations. Specifically, for exposure bias problem, DR-VAE incorporates a Debiasing Estimator, mitigating the impact of exposure bias. For poster collapse issue, DR-VAE innovatively introduces a Flow-based Representation Enhancement module, ensuring us to encapsulate complex data patterns by fitting complex and intricate posterior distributions directly. We provide experimental validations over four datasets to substantiate the efficacy of our DR-VAE framework.

Abstract: Multimodal Entity Linking (MEL) is a fundamental component for various downstream tasks. However, existing MEL datasets suffer from small scale, scarcity of topic types and limited coverage of tasks, making them incapable of effectively enhancing the entity linking capabilities of multi-modal models. To address these obstacles, we propose a dataset construction pipeline and publish M^3EL, a large-scale dataset for MEL. M^3EL includes 79,625 instances, covering 9 diverse multi-modal tasks, and 5 different topics. In addition, to further improve the model's adaptability to multi-modal tasks, We propose a modality-augmented training strategy. Utilizing M^3EL as a corpus, train the CLIP_ND model based on CLIP (ViT-B-32), and conduct a comparative analysis with an existing multi-modal baselines. Experimental results show that the existing models perform far below expectations (ACC of 49.4%-75.8%), After analysis, it was obtained that small dataset sizes, insufficient modality task coverage, and limited topic diversity resulted in poor generalization of multi-modal models. Our dataset effectively addresses these issues, and the CLIP_ND model fine-tuned with M^3EL shows a significant improvement in accuracy, with an average improvement of 9.3% to 25% across various tasks. Our dataset publicly available to facilitate future research.

Abstract: Personalized news recommendation aims to recommend candidate news to the target user. Since the data and knowledge involved in traditional recommender systems are restricted, recent studies utilize large language models (LLMs) to generate news articles and augment the original dataset. However, despite the superiority of LLMbased augmentation in news recommendation, previous studies still suffer from two serious problems, i.e., structure-level deficiency and semantic-level noise. Since the LLM-based augmentation is mainly implemented at the semantic level, collaborative signals, the critical structure information in recommender systems, is neglected during the generation process. Thus, it is inappropriate to perform recommendation based on the augmented user-news bipartite, which manifests as multiple isolated cliques. Moreover, utilizing the open-world knowledge of LLMs to extend the closed systems will inevitably introduce noise information, leading to difficulties in mining users' real preferences. In this paper, we propose a novel Structure-aware and Semantic-aware approach for LLM-Empowered personalized News Recommendation, named S^2LENR, to tackle the mentioned problems. Specifically, we propose a structure-aware refinement module to inject collaborative information in a parametric way, in order to construct a valid augmented bipartite. Besides, we devise a semantic-aware denoising module utilizing contrastive learning paradigm to overcome the negative effects of noise information. Finally, we calculate the relevance score between target user and candidate news representations. We conduct experiments on two real-world news recommendation datasets MIND-Large, MIND-Small and empirical results demonstrate the effectiveness of our approach from multiple perspectives.

Abstract: CrossDomain Recommendation (CDR) leverages additional knowledge from auxiliary domains to address the long-standing data sparsity issue. However, existing methods typically acquire this knowledge by minimizing the average loss over all domains, overlooking the fact that different domains possess different user-preference distributions. As a result, the acquired knowledge may contain biased information from data-rich domains, leading to performance degradation in data-scarce domains. In this paper, we propose a novel CDR method, which takes domain distinctions into consideration to extract and adapt unbiased information. Specifically, our method consists of two key components: Unbiased Information Extraction (UIE) and Unbiased Information Adaptation (UIA). In the UIE, inspired by distributionally robust optimization, we optimize the worst-case performance across all domains to extract domain-invariant information, preventing the potential bias from auxiliary domains. In the UIA, we introduce a new user-item attention module, which employs domain-specific information from historically interacted items to attend the adaptation of domain-invariant information. To verify the effectiveness of our method, we conduct extensive experiments on three real-world datasets, each of which contains three extremely sparse domains. Experimental results demonstrate the considerable superiority of our proposed method compared to baselines.

Abstract: The connections between symbolic rules and neural networks have been explored in various directions, including rule mining through neural networks and rulebased explanation for neural networks. These approaches allow symbolic rules to be extracted from neural network models, which offers explainability to the models. However, the plausibility of the extracted rules is rarely analysed. In this paper, we show that the confidence degrees of extracted rules are generally not high, and we propose a new family of Graph Neural Networks that can be trained with the guidance of rules. Hence, the inference of our model simulates the rule reasoning. Moreover, rules with high confidence degrees can be extracted from the trained model that aligns with the inference of the model, which verifies the effectiveness of the rule guidance. Experimental evaluation of knowledge graph reasoning tasks further demonstrates the effectiveness of our model.

Abstract: Accurately predicting the trajectory of vehicles is critically important for ensuring safety and reliability in autonomous driving. Although considerable research efforts have been made recently, the inherent trajectory uncertainty caused by various factors including the dynamic driving intends and the diverse driving scenarios still poses significant challenges to accurate trajectory prediction. To address this issue, we propose C2FTP, a coarse-to-fine denoising framework for uncertainty-aware vehicle trajectory prediction. C2F-TP features an innovative two-stage coarse-to-fine prediction process. Specifically, in the spatial-temporal interaction stage, we propose a spatial-temporal interaction module to capture the inter-vehicle interactions and learn a multimodal trajectory distribution, from which a certain number of noisy trajectories are sampled. Next, in the trajectory refinement stage, we design a conditional denoising model to reduce the uncertainty of the sampled trajectories through a step-wise denoising operation. Extensive experiments are conducted on two real datasets NGSIM and highD that are widely adopted in trajectory prediction. The result demonstrates the effectiveness of our proposal.

Abstract: Emerging advancements in large language models (LLMs) show significant potential for enhancing recommendations. However, promptbased methods often struggle to find ideal prompts without task-specific feedback, while fine-tuning-based methods are hindered by high computational demands and dependence on open-source backbones. To address these challenges, we propose a Reflective Reinforcement Large Language Model (Re2LLM) for session-based recommendation, which refines LLMs to generate and utilize specialized knowledge effectively and efficiently. Specifically, we first devise the Reflective Exploration Module to extract and present knowledge in a form that LLMs can easily process. This module enables LLMs to reflect on their recommendation mistakes and construct a hint knowledge base to rectify them effectively. Next, we design the Reinforcement Utilization Module to train a lightweight retrieval agent that elicits correct LLM reasoning. This module recognizes hints as signals to facilitate LLM recommendations and learns to select appropriate hints from the constructed knowledge base using task-specific feedback efficiently. Lastly, we conduct experiments on real-world datasets and demonstrate the superiority of our Re2LLM over state-of-the-art methods.

Abstract: Social recommendation leverages the social connections between users to mitigate the issue of data sparsity and enhance recommendation quality. Although existing related works show their effectiveness, there remain two critical questions: i) The patterns of preference interactions among users are varied and heterogeneous. Current models struggle to accurately capture preference shifts from user interactions in noisy social environments. ii) Existing methods handle the integration of auxiliary information coarsely, potentially introducing noise and leading to biases in user preferences. To address the limitations above, we introduce a novel framework named Robust Graph Based Social Recommendation through Contrastive Multiview Learning (RGCML). This framework leverages denoised social relations and global intents as dual auxiliary information sources to provide comprehensive characterization of users. Firstly, RGCML employs the concept of opinion dynamics to simulate how user preferences evolve due to noisy social relations. Then, it utilizes a specifically designed information fusion module to extract critical contextual information from multiple semantic perspectives, thereby achieving efficient personalized information fusion. Finally, it adopts the designed global-local contrastive learning paradigm that untangles and discriminates user preferences from global intents, further addressing the noise problem and enhancing the quality of user representations. Extensive experiments conducted on three real-world datasets demonstrate the superior performance of RGCML compared to several state-of-the-art (SOTA) baselines.

Abstract: While the mining of modalities is the focus of most multimodal recommendation methods, we believe that how to fully utilize both collaborative and multimodal information is pivotal in ecommerce scenarios where, as clarified in this work, the user behaviors are rarely determined entirely by multimodal features. In order to combine the two distinct types of information, some additional challenges are encountered: 1) Modality erasure: Vanilla graph convolution, which proves rather useful in collaborative filtering, however erases multimodal information; 2) Modality forgetting: Multimodal information tends to be gradually forgotten as the recommendation loss essentially facilitates the learning of collaborative information. To this end, we propose a novel approach named STAIR, which employs a novel stepwise graph convolution to enable a co-existence of collaborative and multimodal information in e-commerce recommendation. Besides, it starts with the raw multimodal features as an initialization, and the forgetting problem can be significantly alleviated through constrained embedding updates. As a result, STAIR achieves state-of-the-art recommendation performance on three public e-commerce datasets with minimal computational and memory costs.

Abstract: Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges to the existing LLMbased recommendation paradigm which relies solely on text modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper, we propose the Multimodal Large Language Model-enhanced Sequential Multimodal Recommendation (MLLM-MSR) model. To capture the dynamic user preference, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract image feature given an item and convert the image into text. Then, we employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for multi-modal recommendation task, we propose to fine-tune a MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.

Abstract: Graph Neural Network (GNN)based methods have recently emerged as effective approaches for multimedia recommendation. Typically, these methods employ message passing on the user-item interaction graph, and model user preferences by exploiting co-occurrence patterns. Despite their effectiveness, we argue that they insufficiently exploit the individual information, potentially limiting recommendation performance. To validate our argument, we first analyze existing methods from spectral graph theory. We identify that existing methods focus on capturing global structural features, but underutilize local structural features that convey individual information. Further detailed experiments reveal that such an underutilization leads to overly similar user preferences modeling. Furthermore, we propose a novel Principal Graph Learning (PGL) framework to address this issue. The idea is to enhance user preference modeling by effectively mining and utilizing principal local structural features. PGL first extracts the principal subgraph from the user-item interaction graph using two novel extraction operators: global-aware and local-aware subgraph extraction. It then employs message passing on the principal subgraph to comprehensively model user perference, with the aim of simultaneously capturing co-occurrence patterns and individual information. Compared to existing methods, PGL achieves an average performance improvement of 9%.

Abstract: Graph neural networks (GNNs) are widely used for node classification tasks, but when encountering distribution shifts due to environmental change in realworld scenarios, they tend to learn unstable correlations between features and labels. To overcome this dilemma, a powerful class of approaches views the environment as the root cause of those unstable correlations, thereby their key focus is to infer the environment involved, enabling the model to avoid capturing environment-sensitive correlations. However, their inferences rely solely on the single-level information from one low-hop ego-graph, neglecting both global information and multi-granularity information in local ego-graphs with different hops. Although applying deeper GNNs on the high-hop ego-graph could capture global information, it will bring the side effect of over-smoothing node representations. To tackle these issues, we propose a novel Multi-Level Environment Inference model named MLEI, which effectively broadens the horizon of training GNNs under node-level distribution shifts. Specifically, MLEI first leverages a linear graph transformer to surpass the scope of ego-graph, efficiently enabling high-level global environment inference. This global environment is in turn used as an overview to assist layer-by-layer environment inference on local multi-hop ego-graphs. Finally, we combine the environment from global and local views and utilize the designed objective function to capture stable predictive patterns. Extensive experiments on real-world datasets demonstrate that our model achieves satisfactory performance compared with the state-of-the-art methods under various distribution shifts.

Abstract: Learningbased probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). Specifically, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making it the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances the learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves 48% bit saving compared to gzip compressor. Besides, L3TC offers compression performance comparable to other learned compressors, with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, providing real-time decoding speeds up to megabytes per second.

Abstract: Recent work shows that linear models can outperform several transformer models in longterm time-series forecasting (TSF). However, instead of explicitly performing temporal interaction through self-attention, linear models implicitly perform it based on stacked MLP structures, which may be insufficient in capturing the complex temporal dependencies and their performance still has potential for improvement. To this end, we propose a Lightweight Sparse Interaction Network (LSINet) for TSF task. Inspired by the sparsity of self-attention, we propose a Multihead Sparse Interaction Mechanism (MSIM). Different from self-attention, MSIM learns the important connections between time steps through sparsity-induced Bernoulli distribution to capture temporal dependencies for TSF. The sparsity is ensured by the proposed self-adaptive regularization loss. Moreover, we observe the shareability of temporal interactions and propose to perform Shared Interactions Learning (SIL) for MSIM to further enhance efficiency and improve convergence. LSINet is a linear model comprising only MLP structures with low overhead and equipped with explicit temporal interaction mechanisms. Extensive experiments on public datasets show that LSINet achieves both higher accuracy and better efficiency than advanced linear models and transformer models in TSF tasks.

College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, Ant Group, Ant Group, School of Software Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph Zhejiang Key Laboratory of Big Data Intelligent Computing

Abstract: Multimodal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given multi-modal knowledge graphs (MMKG), collaboratively leveraging structural information from the triples and multi-modal information of the entities to overcome the inherent incompleteness. Existing MMKGC methods usually extract multi-modal features with pre-trained models and employ fusion modules to integrate multi-modal features for the entities. This often results in coarse handling of multi-modal entity information, overlooking the nuanced, fine-grained semantic details and their complex interactions. To tackle this shortfall, we introduce a novel framework MyGO to tokenize, fuse, and augment the fine-grained multi-modal representations of entities and enhance the MMKGC performance. Motivated by the tokenization technology, MyGO tokenizes multi-modal entity information as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder. To further augment the multi-modal representations, MyGO incorporates fine-grained contrastive learning to highlight the specificity of the entity representations. Experiments on standard MMKGC benchmarks reveal that our method surpasses 19 of the latest models, underlining its superior performance.

Abstract: Recent studies have revealed the vulnerability of graph neural networks (GNNs) to adversarial attacks. In practice, effectively attacking GNNs is not easy. Existing attack methods primarily focus on modifying the topology of the graph data. In many scenarios, attackers do not have the authority to manipulate the graph's topology, making such attacks challenging to execute. Although node injection attacks are more feasible than modifying the topology, current injection attacks rely on knowledge of the victim model's architecture. This dependency significantly degrades attack quality when there is inconsistency in the victim models. Moreover, the generation of injected nodes often lacks precise control over features, making it difficult to balance attack effectiveness and stealthiness. In this paper, we investigate a node injection attack under modelagnostic conditions and propose Targeted Evasion Attack via Node Injection (TEANI). Specifically, TEANI models the generation of adversarial nodes as a Markov process. Without considering the target model's structure, it guides the agent to select features that maximize attack effectiveness within a budget, based solely on the results of queries to a black-box model. Extensive experiments on real-world datasets and mainstream GNN models demonstrate that the proposed TEANI poses more effective and imperceptible threats than state-of-the-art attack methods.

Abstract: Simple random negative sampling is a technique used to enhance decisionmaking in sequential models with numerous potential negative instances, like recommender systems. However, it ignores the patterns that can be discovered in complex sequences to select the most informative negative samples. In this paper, we address this challenge by introducing a Neighborhood-Aware Negative Sampling (NANS) technique in the context of student knowledge modeling (KM) and behavior modeling (BM). In the education domain, KM quantifies student knowledge based on past performance, while BM focuses on behaviors like student preferences of questions. With the vast number of problems to choose from and the intricate relationship between student knowledge and behavior, selecting the proper negative samples becomes a notable challenge in this problem. NANS, along with our proposed multi-objective, multi-task sequential model for KM and BM, NANS-KoBeM frames the simultaneous modeling of student knowledge and question selection as a multi-task learning problem with dual objectives: predicting students’ performance and their question selections.

State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, Department of Statistics, University of California, Los Angeles, CA, USA, Department of Computer Science, University of California, Los Angeles, CA, USA, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA, State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China

Abstract: Graph neural networks (GNNs) have gained superior performance in graphbased prediction tasks with a variety of applications such as social analysis and drug discovery. Despite the remarkable progress, their performance often degrades on test graphs with distribution shifts. Existing domain adaptation methods rely on unlabeled test graphs during optimization, limiting their applicability to graphs in the wild. Towards this end, this paper studies the problem of multi-domain generalization on graphs, which utilizes multiple source graphs to learn a GNN with high performance on unseen target graphs. We propose a new approach named Topological Adversarial Learning with Prototypical Mixup (TRACI) to solve the problem. The fundamental principle behind our TRACI is to produce virtual adversarial and mixed graph samples from a data-centric view. In particular, TRACI enhances GNN generalization by employing a gradient-ascent strategy that considers both label prediction entropy and graph topology to craft challenging adversarial samples. Additionally, it generates domain-agnostic node representations by characterizing class-graph pair prototypes through latent distributions and applying multi-sample prototypical Mixup for distribution alignment across graphs. We further provide theoretical analysis showing that TRACI reduces the model's excess risk. Extensive experiments on various benchmark datasets demonstrate that TRACI outperforms state-of-the-art baselines, validating its effectiveness.

Abstract: We examine an approvalbased model of Liquid Democracy with a budget constraint on voting and delegating costs, aiming to centrally select casting voters ensuring complete representation of the electorate. From a computational complexity perspective, we focus on minimizing overall costs, maintaining short delegation paths, and preventing excessive concentration of voting power. Furthermore, we explore computational aspects of strategic control, specifically, whether external agents can change election components to influence the voting power of certain voters.

Abstract: Proportional representation plays a crucial role in electoral systems. In ordinal elections, where voters rank candidates based on their preferences, the Single Transferable Vote (STV) is the most widely used proportional voting method. STV is considered proportional because it satisfies an axiom requiring that large enough "solid coalitions" of voters are adequately represented. Using realworld data from local Scottish elections, we observe that solid coalitions of the required size rarely occur in practice. This observation challenges the importance of proportionality axioms and raises the question of how the proportionality of voting methods can be assessed beyond their axiomatic performance. We address these concerns by developing quantitative measures of proportionality. We apply these measures to evaluate the proportionality of voting rules on real-world election data. Besides STV, we consider SNTV, the Expanding Approvals Rule, and Sequential Ranked-Choice Voting. We also study the effects of ballot truncation by artificially completing truncated ballots and comparing the proportionality of outcomes under complete and truncated ballots.

Abstract: We initiate the study of computing envyfree allocations of indivisible items in the extension setting, i.e., when some part of the allocation is fixed and the task is to allocate the remaining items. In view of the NP-hardness of the problem, we investigate whether - and under which conditions - one can obtain fixed-parameter algorithms for computing a solution in settings where most of the allocation is already fixed. Our results provide a broad complexity-theoretic classification of the problem which includes: (a) fixed-parameter algorithms tailored to settings with few distinct types of agents or items; (b) lower bounds which exclude the generalization of these positive results to more general settings. We conclude by showing that - unlike when computing allocations from scratch - the non-algorithmic question of whether more relaxed EF1 or EFX allocations exist can be completely resolved in the extension setting.

Abstract: We study fair division of indivisible chores among n agents with additive cost functions using the popular fairness notion of maximin share (MMS). Since MMS allocations do not always exist for more than two agents, the goal has been to improve its approximations and identify interesting special cases where MMS allocations exist. We show the existence of · 1out-of-9n/11 MMS allocations, which improves the state-of-the-art factor of 1-out-of-3n/4. · MMS allocations for factored instances, which resolves an open question posed by Ebadian et al. (2021). · 15/13-MMS allocations for personalized bivalued instances, improving the state-of-the-art factor of 13/11. We achieve these results by leveraging the HFFD algorithm of Huang and Lu (2021). Our approach also provides polynomial-time algorithms for computing an MMS allocation for factored instances and a 15/13-MMS allocation for personalized bivalued instances.

Abstract: Tâtonnement is a simple, intuitive market process where prices are iteratively adjusted based on the difference between demand and supply. Many variants under different market assumptions have been studied and shown to converge to a market equilibrium, in some cases at a fast rate. However, the classical case of linear Fisher markets have long eluded the analyses, and it remains unclear whether tâtonnement converges in this case. We show that, for a sufficiently small stepsize, the prices given by the tâtonnement process are guaranteed to converge to equilibrium prices, up to a small approximation radius that depends on the stepsize. To achieve this, we consider the dual EisenbergGale convex program in the price space, view tâtonnement as subgradient descent on this convex program, and utilize novel last-iterate convergence results for subgradient descent under error bound conditions. In doing so, we show that the convex program satisfies a particular error bound condition, the quadratic growth condition, and that the price sequence generated by tâtonnement is bounded above and away from zero. We also show that a similar convergence result holds for tâtonnement in quasi-linear Fisher markets. Numerical experiments are conducted to demonstrate that the theoretical linear convergence aligns with empirical observations.

Abstract: We investigate the problem of designing randomized obviously strategyproof (OSP) mechanisms in several canonical auction settings. Obvious strategyproofness, introduced by Li [American Economic Review 2017], strengthens the wellknown concept of dominant-strategy incentive compatibility (DSIC). Loosely speaking, it ensures that even agents who struggle with contingent reasoning can identify that their dominant strategy is optimal. Thus, one would hope to design OSP mechanisms with good approximation guarantees. Unfortunately, Ron [SODA 2024] has showed that deterministic OSP mechanisms fail to achieve an approximation better than the minimum of the number of items and the number of bidders, even for the simple settings of additive and unit-demand bidders. We circumvent these impossibilities by showing that randomized mechanisms that are obviously strategy-proof in the universal sense obtain a constant factor approximation for these classes. We show that this phenomenon occurs also for the setting of a multi-unit auction with single-minded bidders. Thus, our results provide a more positive outlook on the design of OSP mechanisms and exhibit a stark separation between the power of randomized and deterministic OSP mechanisms. To complement the picture, we provide lower bounds on the performance of randomized OSP mechanisms in each setting. This further demonstrates that OSP mechanisms are significantly weaker than dominant-strategy mechanisms: it is well known that the deterministic VCG mechanism outputs an optimal allocation in dominant-strategies, whereas we show that even randomized OSP mechanisms cannot obtain more than 87.5% of the optimal welfare.

Abstract: Student placements under diversity constraints are a common practice globally. This paper addresses the selection of students by a single school under a oneto-one convention, where students can belong to multiple types but are counted only once based on one type. While existing algorithms in economics and computer science aim to help schools meet diversity goals and priorities, we demonstrate that these methods can result in significant imbalances among students with different type combinations. To address this issue, we introduce a new property called balanced representation, which ensures fair representation across all types and type combinations. We propose a straightforward choice function that uniquely satisfies four fundamental properties: maximal diversity, non-wastefulness, justified envy-freeness, and balanced representation. While previous research has primarily focused on algorithms based on bipartite graphs, we take a different approach by utilizing flow networks. This method provides a more compact formalization of the problem and significantly improves computational efficiency. Additionally, we present efficient algorithms for implementing our choice function within both the bipartite graph and flow network frameworks.

Abstract: We present an algorithm for computing purestrategy epsilon-perfect Bayesian equilibria in sequential auctions with continuous action and value spaces. Importantly, our algorithm includes a verification phase that computes an upper bound on the utility loss of the found strategies. Prior work on equilibrium computation in auctions with verification has focussed on the single-round case, but the methods do not work for sequential auctions because of two main challenges: (1) there are infinitely many subgames, and (2) the setting has no optimal substructure as bidders' beliefs and best response strategies depend on the strategies of previous rounds. We make two contributions. First, we introduce a tailor-made game abstraction that discretizes the auction and augments the state space with the public beliefs, such that an approximate equilibrium can be computed via dynamic programming. Second, we prove a decomposition theorem to upper bound the utility loss of the computed equilibrium. This is essential because it is neither guaranteed that the auction has an equilibrium nor that any algorithm converges to it. We validate our algorithm on multiple settings with known equilibria and apply it to a new multi-round combinatorial auction.

Abstract: Existing learningfrom-crowds methods aim to design proper aggregation strategies to infer the unknown true labels from noisy labels provided by crowdsourcing. They treat the ground truth as hidden variables and use statistical or deep learning based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise characterizations and negatively impacting the quality of learning-from-crowds. This paper proposes a new paradigm of multi-task supervised learning-from-crowds, which eliminates the need for modeling of items's ground truth in worker behavior models. Within this paradigm, we propose a worker behavior model at the item feature level called Mixture of Experts based Multi-task Supervised Learning-from-Crowds (MMLC), then, two aggregation strategies are proposed within MMLC. The first strategy, named MMLC-owf, utilizes clustering methods in the worker spectral space to identify the projection vector of the oracle worker. Subsequently, the labels generated based on this vector are regarded as the items's ground truth The second strategy, called MMLC-df, employs the MMLC model to fill the crowdsourced data, which can enhance the effectiveness of existing aggregation strategies . Experimental results demonstrate that MMLC-owf outperforms state-of-the-art methods and MMLC-df enhances the quality of existing learning-from-crowds methods.

Abstract: Artificial intelligence (AI) models for computer vision trained with supervised machine learning are assumed to solve classification tasks by imitating human behavior learned from training labels. Most efforts in recent vision research focus on measuring the model task performance using standardized benchmarks such as accuracy. However limited work has sought to understand the perceptual difference between humans and machines. To fill this gap, this study first analyzes the statistical distributions of mistakes from the two sources, and then explores how task difficulty level affects these distributions. We find that even when AI learns an excellent model from the training data, one that outperforms humans in overall accuracy, these AI models have significant and consistent differences from human perception. We demonstrate the importance of studying these differences with a simple humanAI teaming algorithm that outperforms humans alone, AI alone, or AI-AI teaming.

Abstract: Procedural Content Generation via Machine Learning (PCGML) has enhanced game content creation, yet challenges in controllability and limited training data persist. This study addresses these issues by distilling a constructive PCG algorithm into a controllable PCGML model. We first generate a large amount of content with a constructive algorithm and label it using a Large Language Model (LLM). We use these synthetic labels to condition two PCGML models for contentspecific generation, a diffusion model and the five-dollar model. This neural network distillation process ensures that the generation aligns with the original algorithm while introducing controllability through plain text. We define this text-conditioned PCGML as a Text-to-game-Map (T2M) task, offering an alternative to prevalent text-to-image multi-modal tasks. We compare our distilled models with the baseline constructive algorithm. Our analysis of the variety, accuracy, and quality of our generation demonstrates the efficacy of distilling constructive methods into controllable text-conditioned PCGML models.

Abstract: Diffusion models have achieved remarkable success in sequential decisionmaking by leveraging the highly expressive model capabilities in policy learning. A central problem for learning diffusion policies is to align the policy output with human intents in various tasks. To achieve this, previous methods conduct return-conditioned policy generation or Reinforcement Learning (RL)-based policy optimization, while they both rely on pre-defined reward functions. In this work, we propose a novel framework, Forward KL regularized Preference optimization for aligning Diffusion policies, to align the diffusion policy with preferences directly. We first train a diffusion policy from the offline dataset without considering the preference, and then align the policy to the preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning in a diffusion policy, where the forward KL regularization is employed in preference optimization to avoid generating out-of-distribution actions. We conduct extensive experiments for MetaWorld manipulation and D4RL tasks. The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.

Abstract: MultiRobot Coverage problems have been extensively studied in robotics, planning and multi-agent systems. In this work, we consider the coverage problem when there are constraints on the proximity (e.g., maximum distance between the agents, or a blue agent must be adjacent to a red agent) and the movement (e.g., terrain traversability and material load capacity) of the robots. Such constraints naturally arise in many real-world applications, e.g. in search-and-rescue and maintenance operations. Given such a setting, the goal is to compute a covering tour of the graph with a minimum number of steps, and that adheres to the proximity and movement constraints. For this problem, our contributions are four: (i) a formal formulation of the problem, (ii) an exact algorithm that is FPT in parameters ||F||, d and ω - the set of robot formations that encode the proximity constraints, the maximum nodes degree, and the tree-width of the graph, respectively, (iii) for the case that the graph is a tree: a PTAS approximation scheme, that given an ε produces a tour that is within a 1+ ε⋅error(||F||, d)) of the optimal one, and the computation runs in time poly(n) ⋅ h(1/ε, ||F||). (iv) for the case that the graph is a tree, with k=3 robots, and the constraint is that all agents are connected: a PTAS scheme with multiplicative approximation error of 1 + O(ε), independent of d.

Abstract: Understanding how humans cooperatively utilize semantic knowledge to explore unfamiliar environments and decide on navigation directions is critical for house service multirobot systems. Previous methods primarily focused on single-robot centralized planning strategies, which severely limited exploration efficiency. Recent research has considered decentralized planning strategies for multiple robots, assigning separate planning models to each robot, but these approaches often overlook communication costs. In this work, we propose Multimodal Chain-of-Thought Co-Navigation (MCoCoNav), a modular approach that utilizes multimodal Chain-of-Thought to plan collaborative semantic navigation for multiple robots. MCoCoNav combines visual perception with Vision Language Models (VLMs) to evaluate exploration value through probabilistic scoring, thus reducing time costs and achieving stable outputs. Additionally, a global semantic map is used as a communication bridge, minimizing communication overhead while integrating observational results. Guided by scores that reflect exploration trends, robots utilize this map to assess whether to explore new frontier points or revisit history nodes. Experiments on HM3D_v0.2 and MP3D demonstrate the effectiveness of our approach.

Abstract: We study the problem of optimizing a guidance policy capable of dynamically guiding the agents for lifelong MultiAgent Path Finding based on real-time traffic patterns. Multi-Agent Path Finding (MAPF) focuses on moving multiple agents from their starts to goals without collisions. Its lifelong variant, LMAPF, continuously assigns new goals to agents. In this work, we focus on improving the solution quality of PIBT, a state-of-the-art rule-based LMAPF algorithm, by optimizing a policy to generate adaptive guidance. We design two pipelines to incorporate guidance in PIBT in two different ways. We demonstrate the superiority of the optimized policy over both static guidance and human-designed policies. Additionally, we explore scenarios where task distribution changes over time, a challenging yet common situation in real-world applications that is rarely explored in the literature.

Abstract: Reinforcement learning is a widely used approach for training an agent to maximize rewards in a given environment. Action policies learned with this technique see a broad range of applications in practical areas like games, healthcare, robotics, or autonomous driving. However, enforcing ethical behavior or norms based on deontic constraints that the agent should adhere to during policy execution remains a complex challenge. Especially constraints that emerge after the training can necessitate to redo policy learning, which can be costly and, more critically, timeintense. In order to mitigate this problem, we present a framework for policy fixing in case of a norm violation, which allows the agent to stay operational. Based on answer set programming (ASP), emergency plans are generated that exclude or minimize cost of norm violations by future actions in a horizon of interest. By combining and developing optimization techniques, efficient policy fixing under real-time constraints can be achieved.

Abstract: In this paper, we introduce a new family of argumentranking semantics which can be seen as a refinement of the classification of arguments into skeptically accepted, credulously accepted and rejected. To this end we use so-called social ranking functions which have been developed recently to rank individuals based on their performance in groups. We provide necessary and sufficient conditions for a social ranking function to give rise to an argument-ranking semantics satisfying the desired refinement property.

Abstract: Knowledge graph (KG) completion aims to identify additional facts that can be inferred from the existing facts in the KG. Recent developments in this field have explored this task in the inductive setting, where at test time one sees entities that were not present during training; the most performant models in the inductive setting have employed path encoding modules in addition to standard subgraph encoding modules. This work similarly focuses on KG completion in the inductive setting, without the explicit use of path encodings, which can be timeconsuming and introduces several hyperparameters that require costly hyperparameter optimization. Our approach uses a Transformer-based subgraph encoding module only; we introduce connection-biased attention and entity role embeddings into the subgraph encoding module to eliminate the need for an expensive and time-consuming path encoding module. Evaluations on standard inductive KG completion benchmark datasets demonstrate that our Connection-Biased Link Prediction (CBLiP) model has superior performance to models that do not use path information. Compared to models that utilize path information, CBLiP shows competitive or superior performance while being faster. Additionally, to show that the effectiveness of connection-biased attention and entity role embeddings also holds in the transductive setting, we compare CBLiP's performance on the relation prediction task in the transductive setting.

Abstract: We investigate the synthesis of policies for highlevel agent programs expressed in Golog, a language based on situation calculus that incorporates nondeterministic programming constructs. Unlike traditional approaches for program realization that assume full agent control or rely on incremental search, we address scenarios where environmental nondeterminism significantly influences program outcomes. Our synthesis problem involves deriving a policy that successfully realizes a given Golog program while ensuring the satisfaction of a temporal specification, expressed in Linear Temporal Logic on finite traces (LTLf), across all possible environmental behaviors. By leveraging an expressive class of first-order action theories, we construct a finite game arena that encapsulates program executions and tracks the satisfaction of the temporal goal. A game-theoretic approach is employed to derive such a policy. Experimental results demonstrate this approach's feasibility in domains with unbounded objects and non-local effects. This work bridges agent programming and temporal logic synthesis, providing a framework for robust agent behavior in nondeterministic environments.

School of Computer Science, Hubei University, Wuhan 430062, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China, School of Cyber Science and Technology, Hubei University, Wuhan 430062, China, School of Computer Science, Hubei University, Wuhan 430062, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China, School of Computer Science, Hubei University, Wuhan 430062, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China, School of Computer Science, Hubei University, Wuhan 430062, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China, School of Computer Science, Hubei University, Wuhan 430062, China Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China, Institute of Vocational Education, Guangdong Industry Polytechnic University, Guangzhou 510300, China

Abstract: Multimodal knowledge graphs (MMKG) store structured world knowledge enriched with multimodal descriptive information. However, MMKG often faces the challenge of incompleteness. The primary objective of multimodal knowledge graph completion (MMKGC) is to predict missing entities within MMKG. Current MMKGC methods struggle with addressing the issue of overtrust attention and how to enhance the robustness of the model. To overcome these problems, we introduce APKGC, a noise-enhanced multimodal method for knowledge graph completion with attention penalty. APKGC effectively adjusts the attention scores in the language model and alleviates over-trust attention through a specifically designed attention penalty module. Additionally, an adaptive noise sampling module is proposed to supplement the entity's multimodal information, thereby enhancing the model's robustness. Experimental evaluation demonstrates that APKGC excels in overcoming these challenges. Compared to the existing state-of-the-art MMKGC model, APKGC improves Hit@1 by 3.3% on the DB15K dataset and by 3.4% on the MKG-W dataset.

Abstract: Recent works have explored the use of counting queries coupled with Description Logic ontologies. The answer to such a query in a model of a knowledge base is either an integer or infinity, and its spectrum is the set of its answers over all models. While it is unclear how to compute and manipulate such a set in general, we identify a class of counting queries whose spectra can be effectively represented. Focusing on atomic counting queries, we pinpoint the possible shapes of a spectrum over ALCIF ontologies: they are essentially the subsets of N and infinity closed under addition. For most sublogics of ALCIF, we show that possible spectra enjoy simpler shapes, being [ m, infinity ] or variations thereof. To obtain our results, we refine constructions used for finite model reasoning and notably rely on a cyclereversion technique for the Horn fragment of ALCIF. We also study the data complexity of computing the proposed effective representation and establish the FP^NP[log]-completeness of this task under several settings.

Abstract: Knowledge stored in large language models requires timely updates to reflect the dynamic nature of realworld information. To update the knowledge, most knowledge editing methods focus on the low layers, since recent probes into the knowledge recall process reveal that the answer information is enriched in low layers. However, these probes only and could only reveal critical recall stages for the original answers, while the goal of editing is to rectify model's prediction for the target answers. This inconsistency indicates that both the probe approaches and the associated editing methods are deficient. To mitigate the inconsistency and identify critical editing regions, we propose a contrast-based probe approach, and locate two crucial stages where the model behavior diverges between the original and target answers: Information Enrichment in low layers and Probability Promotion in high layers. Building upon the insights, we develop the Joint knowledge Editing for information Enrichment and probability Promotion (JEEP) method, which jointly edits both the low and high layers to modify the two critical recall stages. Considering the mutual interference and growing forgetting due to dual modifications, JEEP is designed to ensure that updates to distinct regions share the same objectives and are complementary. We rigorously evaluate JEEP by editing up to thousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and addressing diverse editing objectives, i.e., adding factual and counterfactual knowledge. In all tested scenarios, JEEP achieves best performances, validating the effectiveness of the revealings of our probe approach and the designs of our editing method.

Abstract: CrossDomain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen source domains to unseen target domains, which is crucial for evaluating the generalization and robustness of models. Recent studies focus on utilizing visual styles to bridge the domain gap between different domains. However, the serious dilemma of gradient instability and local optimization problem occurs in those style-based CD-FSL methods. This paper addresses these issues and proposes a novel crop-global style perturbation method, called Self-Versatility Adversarial Style Perturbation (SVasP), which enhances the gradient stability and escapes from poor sharp minima jointly. Specifically, SVasP simulates more diverse potential target domain adversarial styles via diversifying input patterns and aggregating localized crop style gradients, to serve as global style perturbation stabilizers within one image, a concept we refer to as self-versatility. Then a novel objective function is proposed to maximize visual discrepancy while maintaining semantic consistency between global, crop, and adversarial features. Having the stabilized global style perturbation in the training phase, one can obtain a flattened minima in the loss landscape, boosting the transferability of the model to the target domains. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.

Abstract: The independence assumption between random variables is a useful tool to increase the tractability of a modelling framework. However, this assumption can be too simplistic; failing to take dependencies into account can cause models to fail dramatically. The field of multiaxis graphical modelling (also called multi-way modelling, Kronecker-separable modelling) has seen growth over the past decade, but these models require that the data have zero mean. In the multi-axis case, inference is typically done in the single sample scenario, making mean inference impossible. In this paper, we demonstrate how the zero-mean assumption can cause egregious modelling errors for Kronecker-sum-decomposable Gaussian graphical models, as well as propose a relaxation to the zero-mean assumption that allows the avoidance of such errors. Specifically, we propose the "Kronecker-sum-structured mean" assumption, which leads to models with nonconvex-but-unimodal log-likelihoods that can be solved efficiently with coordinate descent.

Abstract: Fewshot learning (FSL) commonly requires a model to identify images (queries) that belong to classes unseen during training, based on a few labelled samples of the new classes (support set) as reference. So far, plenty of algorithms involve training data augmentation to improve the generalization capability of FSL models, but outlier queries or support images during inference can still pose great generalization challenges. In this work, to reduce the bias caused by the outlier samples, we generate additional test-class samples by combining original samples with suitable train-class samples via a generative image combiner. Then, we obtain averaged features via an augmentor, which leads to more typical representations through the averaging. We experimentally and theoretically demonstrate the effectiveness of our method, obtaining a test accuracy improvement proportion of around 10% (e.g., from 46.86% to 53.28%) for trained FSL models. Importantly, given a pretrained image combiner, our method is training-free for off-the-shelf FSL models, whose performance can be improved without extra datasets nor further training of the models themselves.

Abstract: We introduce Neural Conjugate Flows (NCF), a class of neuralnetwork architectures equipped with exact flow structure. By leveraging topological conjugation, we prove that these networks are not only naturally isomorphic to a continuous group, but are also universal approximators for flows of ordinary differential equation (ODEs). Furthermore, topological properties of these flows can be enforced by the architecture in an interpretable manner. We demonstrate in numerical experiments how this topological group structure leads to concrete computational gains over other physics informed neural networks in estimating and extrapolating latent dynamics of ODEs, while training up to five times faster than other flow-based architectures.

Abstract: In many RL applications, ensuring an agent's actions adhere to constraints is crucial for safety. Most previous methods in ActionConstrained Reinforcement Learning (ACRL) employ a projection layer after the policy network to correct the action. However projection-based methods suffer from issues like the zero gradient problem and higher runtime due to the usage of optimization solvers. Recently methods were proposed to train generative models to learn a differentiable mapping between latent variables and feasible actions to address this issue. However, generative models require training using samples from the constrained action space, which itself is challenging. To address such limitations, first, we define a target distribution for feasible actions based on constraint violation signals, and train normalizing flows by minimizing the KL divergence between an approximated distribution over feasible actions and the target. This eliminates the need to generate feasible action samples, greatly simplifying the flow model learning. Second, we integrate the learned flow model with existing deep RL methods, which restrict it to exploring only the feasible action space. Third, we extend our approach beyond ACRL to handle state-wise constraints by learning the constraint violation signal from the environment. Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods.

Abstract: Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the socalled optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.

Abstract: Learning with noisy labels (LNL) methods have enabled the deployment of machine learning systems with imperfectly labeled data. However, these methods often struggle to identify noise in the presence of longtailed (LT) class distributions, where the memorization effect becomes class-dependent. Conversely, LT methods are suboptimal under label noise, as it hinders access to accurate label frequency statistics. This study aims to address the long-tailed noisy data by bridging the methodological gap between LNL and LT approaches. We propose a direct solution, termed Robust Logit Adjustment, which estimates ground-truth labels through label refurbishment, thereby mitigating the impact of label noise. Simultaneously, our method incorporates the distribution of training-time corrected target labels into the LT method logit adjustment, providing class-rebalanced supervision. Extensive experiments on both synthetic and real-world long-tailed noisy datasets demonstrate the superior performance of our method.

Abstract: Network traffic includes data transmitted across a network, such as web browsing and file transfers, and is organized into packets (small units of data) and flows (sequences of packets exchanged between two endpoints). Classifying encrypted traffic is essential for detecting security threats and optimizing network management. Recent advancements have highlighted the superiority of foundation models in this task, particularly for their ability to leverage large amounts of unlabeled data and demonstrate strong generalization to unseen data. However, existing methods that focus on tokenlevel relationships fail to capture broader flow patterns, as tokens, defined as sequences of hexadecimal digits, typically carry limited semantic information in encrypted traffic. These flow patterns, which are crucial for traffic classification, arise from the interactions between packets within a flow, not just their internal structure. To address this limitation, we propose a Multi-Instance Encrypted Traffic Transformer (MIETT), which adopts a multi-instance approach where each packet is treated as a distinct instance within a larger bag representing the entire flow. This enables the model to capture both token-level and packet-level relationships more effectively through Two-Level Attention (TLA) layers, improving the model's ability to learn complex packet dynamics and flow patterns. We further enhance the model's understanding of temporal and flow-specific dynamics by introducing two novel pre-training tasks: Packet Relative Position Prediction (PRPP) and Flow Contrastive Learning (FCL). After fine-tuning, MIETT achieves state-of-the-art (SOTA) performance across five datasets, demonstrating its effectiveness in classifying encrypted traffic and understanding complex network behaviors.

Abstract: Anchorbased multi-view clustering has received extensive attention due to its efficient performance. Existing methods only focus on how to dynamically learn anchors from the original data and simultaneously construct anchor graphs describing the relationships between samples and perform clustering, while ignoring the reality of anchors, i.e., high-quality anchors should be generated uniformly from different clusters of data rather than scattered outside the clusters. To deal with this problem, we propose a noval method termed Anchor Learning with Potential Cluster Constraints for Multi-view Clustering (ALPC) method. Specifically, ALPC first establishes a shared latent semantic module to constrain anchors to be generated from specific clusters, and subsequently, ALPC improves the representativeness and discriminability of anchors by adapting the anchor graph to capture the common clustering center of mass from samples and anchors, respectively. Finally, ALPC combines anchor learning and graph construction into a unified framework for collaborative learning and mutual optimization to improve the clustering performance. Extensive experiments demonstrate the effectiveness of our proposed method compared to some state-of-the-art MVC methods.

Abstract: Graph contrastive learning (GCL) has drawn much research attention for its ability to learn node representations in a selfsupervised manner. However, the homophily assumption inherent in GNN encoders limits the direction (macro-level) and the process (micro-level) of message passing in current GCL frameworks, impairing the expressive power of GCL in non-homophilous graphs. This paper presents a novel framework that employs Macro and Micro Message Passing in GCL (M3P-GCL) to overcome these limitations and advance performance in both homophilous and non-homophilous graphs. Specifically, at the macro-level, we integrate structural and attribute views to enhance the direction of message passing, and employ an Aligned Priority-Supporting View Encoding (APS-VE) strategy to facilitate contrastive training; at the micro-level, we propose an Adaptive Self-Propagation (ASP) strategy based on role segmentation of self-loops to diversify the process of message passing in the encoder. These enhancements effectively address the limitations imposed by the homophily assumption. Experiments demonstrate that M3P-GCL outperforms both supervised and unsupervised baselines in the node classification task on various datasets with different levels of homophily.

Abstract: Advancements in hardware accelerators, such as graphics processing units and neural processing units, have significantly propelled computer vision research. The vision transformer (ViT), leveraging the multihead self-attention (MHSA) mechanism, has surpassed convolutional neural networks (CNNs) in accuracy but faces challenges in mobile and edge deployment due to its large size and computational demands. In addition, as privacy concerns push for on-device training, research on quantization methods for ViTs, particularly gradient quantization, has gained attention. Unlike CNNs, ViTs face challenges due to outliers and a complex loss landscape. To address this, we propose a gradient quantization framework that stabilizes training by adapting quantization points based on interquartile ranges and constructing an outlier-robust loss function. Additionally, we employ a scaling method to align quantized gradients with original gradients and adaptively assign the learning rate based on quantization error analysis. When quantizing weights, activations, and gradients to INT8, our method improves performance by 0.52% and 0.21% over DeiT-Base and Swin-Base, respectively, and achieves near parity with MobileViT-S with only a 0.09% accuracy drop. Furthermore, a 2.06x speedup was observed when applying our framework to MobileViT in a CUDA 11.8 environment.

Abstract: Implicit imitation reinforcement learning (IIRL) is a framework that aims to aid a trainee agent’s learning process via observing the state transitions of a mentor, but without access to the latter's action information. Standard IIRL assumes a shared Markov decision process (MDP) between the mentor and trainee, consequently implying an identical action space. This restriction imposes limitations on the applicability of implicit imitation frameworks in reallife scenarios where, possibly due to variations in physical characteristics, the mentor agent may possess distinct own actions, thereby creating a heterogeneous action setting. In this work, we extend the deep implicit imitation Q-networks (DIIQN) method -an online, model-free, deep RL algorithm for implicit imitation- to allow for heterogeneous action sets between mentor and trainee agents. Equipped with our heterogeneous actions DIIQN (HA-DIIQN) method, a trainee agent can harvest the benefits of IIRL even in heterogeneous action settings, achieving accelerated learning and outperforming non-optimal mentor agents.

Abstract: AudioVisual Learning (AVL) aims at the audio-visual perception with both audio and vision modalities. AVL also suffers from data insufficiency in many applications as with other unimodal tasks. Concurrently, AVL often needs to continuously learn over time rather than all knowledge simultaneously. Considering the above two perspectives, our work mainly focuses on benchmarking the unexplored Few-Shot Audio-Visual Class-Incremental Learning (FS-AVCIL), i.e., continually perceiving novel categories described by a limited number of labeled examples with audio and visual modalities. Firstly, we provide the detailed task configuration together with a thorough analysis of the challenges in FS-AVCIL: (1) how to efficiently learn and fuse multimodal information with limited labeled examples; and (2) how to alleviate catastrophic forgetting cross-modal semantic correlations with limited data. Then, we propose an efficient framework based on Vision Transformer to solve FS-AVCIL. This framework contains two parts: temporal-residual prompting for audio-visual synergy adapter and temporal prompt regularization. Specifically, temporal-residual prompting is incorporated into the audio-visual adapter to efficiently finetune the pre-trained foundation model with limited data and capture audio-visual correlation by learning temporal-relevant prompts. Besides, we regularize temporal-relevant prompts to memorize previous knowledge by fully using the temporal knowledge from various perspectives. This framework is validated in audio-visual classification tasks under the FS-AVCIL scenario, and extensive experiments demonstrate its superior performance.

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China

Abstract: Graph node clustering is a fundamental unsupervised task. Existing methods typically train an encoder through selfsupervised learning and then apply K-means to the encoder output. Some methods use this clustering result directly as the final assignment, while others initialize centroids based on this initial clustering and then finetune both the encoder and these learnable centroids. However, due to their reliance on K-means, these methods inherit its drawbacks when the cluster separability of encoder output is low, facing challenges from the Uniform Effect and Cluster Assimilation. We summarize three reasons for the low cluster separability in existing methods: (1) lack of contextual information prevents discrimination between similar nodes from different clusters; (2) training tasks are not sufficiently aligned with the downstream clustering task; (3) the cluster information in the graph structure is not appropriately exploited. To address these issues, we propose conTrastive grapH clustEring by SwApping fUsed gRomov-wasserstein coUplingS (THESAURUS). Our method introduces semantic prototypes to provide contextual information, and employs a cross-view assignment prediction pretext task that aligns well with the downstream clustering task. Additionally, it utilizes Gromov-Wasserstein Optimal Transport (GW-OT) along with the proposed prototype graph to thoroughly exploit cluster information in the graph structure. To adapt to diverse real-world data, THESAURUS updates the prototype graph and the prototype marginal distribution in OT by using momentum. Extensive experiments demonstrate that THESAURUS achieves higher cluster separability than the prior art, effectively mitigating the Uniform Effect and Cluster Assimilation issues.

Abstract: SharpnessAware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to the base optimizer (e.g., SGD). We propose Asymptotic Unbiased data sampling to accelerate SAM (AUSAM), which maintains the model's generalization capacity while significantly enhancing computational efficiency. Concretely, we probabilistically sample a subset of data points beneficial for SAM optimization based on a theoretically guaranteed criterion, i.e., the Gradient Norm of each Sample (GNS). We further approximate the GNS by evaluating the difference in loss values before and after perturbation in SAM. As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across various tasks and networks, i.e., classification, human pose estimation, and network quantization. On CIFAR-10/100 and Tiny-ImageNet, AUSAM achieves results comparable to SAM while providing a speedup of over 70%. By adjusting hyperparameters, AUSAM can match the speed of the base optimizer while significantly surpassing the base optimizer's performance. Compared to recent dynamic data pruning methods, AUSAM is better suited for SAM and excels in maintaining performance. Additionally, AUSAM accelerates optimization in human pose estimation and model quantization without sacrificing performance, demonstrating its broad practicality.

Abstract: Live comments, also known as Danmaku, are usergenerated messages that are synchronized with video content. These comments overlay directly onto streaming videos, capturing viewer emotions and reactions in real-time. While prior work has leveraged live comments in affective analysis, its use has been limited due to the relative rarity of live comments across different video platforms. To address this, we first construct the Live Comment for Affective Analysis (LCAffect) dataset, which contains live comments for English and Chinese videos spanning diverse genres that elicit a wide spectrum of emotions. Then, using this dataset, we use contrastive learning to train a video encoder to produce synthetic live comment features for enhanced multimodal affective content analysis. Through comprehensive experimentation on a wide range of affective analysis tasks (sentiment, emotion recognition, and sarcasm detection) in both English and Chinese, we demonstrate that these synthetic live comment features significantly improve performance over state-of-the-art methods.

Abstract: Learning curve extrapolation predicts neural network performance from early training epochs and has been applied to accelerate AutoML, facilitating hyperparameter tuning and neural architecture search. However, existing methods typically model the evolution of learning curves in isolation, neglecting the impact of neural network (NN) architectures, which influence the loss landscape and learning trajectories. In this work, we explore whether incorporating neural network architecture improves learning curve modeling and how to effectively integrate this architectural information. Motivated by the dynamical system view of optimization, we propose a novel architectureaware neural differential equation model to forecast learning curves continuously. We empirically demonstrate its ability to capture the general trend of fluctuating learning curves while quantifying uncertainty through variational parameters. Our model outperforms current state-of-the-art learning curve extrapolation methods and pure time-series modeling approaches for both MLP and CNN-based learning curves. Additionally, we explore the applicability of our method in Neural Architecture Search scenarios, such as training configuration ranking.

Abstract: Causal learning tackles the computationally demanding task of estimating causal graphs. This paper introduces a new divideand-conquer approach for causal graph learning, called DCILP. In the divide phase, the Markov blanket MB(Xi) of each variable Xi is identified, and causal learning subproblems associated with each MB(Xi) are independently addressed in parallel. This approach benefits from a more favorable ratio between the number of data samples and the number of variables considered. In counterpart, it can be adversely affected by the presence of hidden confounders, as variables external to MB(Xi) might influence those within it. The reconciliation of the local causal graphs generated during the divide phase is a challenging combinatorial optimization problem, especially in large-scale applications. The main novelty of DCILP is an original formulation of this reconciliation as an integer linear programming (ILP) problem, which can be delegated and efficiently handled by an ILP solver. Through experiments on medium to large scale graphs, and comparisons with state-of-the-art methods, DCILP demonstrates significant improvements in terms of computational complexity, while preserving the learning accuracy on real-world problem and suffering at most a slight loss of accuracy on synthetic problems.

Abstract: We study neural network training (NNT): optimizing a neural network's parameters to minimize the training loss over a given dataset. NNT has been studied extensively under theoretic lenses, mainly on twolayer networks with linear or ReLU activation functions where the parameters can take any real value (here referred to as continuous NNT (C-NNT)). However, less is known about deeper neural networks, which exhibit substantially stronger capabilities in practice. In addition, the complexity of the discrete variant of the problem (D-NNT in short), in which the parameters are taken from a given finite set of options, has remained less explored despite its theoretical and practical significance. In this work, we show that the hardness of NNT is dramatically affected by the network depth. Specifically, we show that, under standard complexity assumptions, D-NNT is not in the complexity class NP even for instances with fixed dimensions and dataset size, having a deep architecture. This separates D-NNT from any NP-complete problem. Furthermore, using a polynomial reduction we show that the above result also holds for C-NNT, albeit with more structured instances. We complement these results with a comprehensive list of NP-hardness lower bounds for D-NNT on two-layer networks, showing that fixing the number of dimensions, the dataset size, or the number of neurons in the hidden layer leaves the problem challenging. Finally, we obtain a pseudo-polynomial algorithm for D-NNT on a two-layer network with a fixed dataset size.

Abstract: In recent years, methods based on heterogeneous graph neural networks (HGNNs) have been widely used for embedding heterogeneous graphs (HGs) due to their ability to effectively encode the rich information from HGs into lowdimensional node embeddings. Existing HGNNs focus on neighbor aggregation and semantic fusion while neglecting the HG structure and learning paradigms. However, the original HG data might lack node features, which existing models may not effectively account for. Additionally, exclusively relying on a single supervised learning approach may only partially leverage the invariant information in graph data. To address these challenges, we introduce the Contrastive Auxiliary Learning Model for Heterogeneous Graphs (CALHG). This model combines edge perturbation and graph diffusion to enhance graph data, allowing it to capture the inherent structural information within heterogeneous graphs fully. Additionally, we employ a category-guided multi-view contrastive learning approach, which does not rely on positive and negative samples for model training, enabling us to capture the intrinsic invariances in heterogeneous graph data. Extensive experiments and analyses on five benchmark datasets without node features and three benchmark datasets with node features demonstrate the effectiveness and efficiency of our novel method compared with several state-of-the-art methods.

Abstract: In multiagent reinforcement learning, a commonly considered paradigm is centralized training with decentralized execution. However, in this framework, decentralized execution restricts the development of coordinated policies due to the local observation limitation. In this paper, we consider the cooperation among neighboring agents during execution and formulate their interactions as a graph. Thus, we introduce a novel encoder-decoder architecture named Factor-based Multi-Agent Transformer (f-MAT) that utilizes a transformer to enable communication between neighboring agents during both training and execution. By dividing agents into different overlapping groups and representing each group with a factor, f-MAT achieves efficient message passing and parallel action generation through factor-based attention layers. Empirical results in networked systems such as traffic scheduling and power control demonstrate that f-MAT achieves superior performance compared to strong baselines, thereby paving the way for handling complex collaborative problems.

Abstract: Many unsupervised visual anomaly detection methods train an autoencoder to reconstruct normal samples and then leverage the reconstruction error map to detect and localize the anomalies. However, due to the powerful modeling and generalization ability of neural networks, some anomalies can also be well reconstructed, resulting in unsatisfactory detection and localization accuracy. In this paper, a small coarsely-labeled anomaly dataset is first collected. Then, a coarse-knowledge-aware adversarial learning method is developed to align the distribution of reconstructed features with that of normal features. The alignment can effectively suppress the auto-encoder's reconstruction ability on anomalies and thus improve the detection accuracy. Considering that anomalies often only occupy very small areas in anomalous images, a patch-level adversarial learning strategy is further developed. Although no patch-level anomalous information is available, we rigorously prove that by simply viewing any patch features from anomalous images as anomalies, the proposed knowledge-aware method can also align the distribution of reconstructed patch features with the normal ones. Experimental results on four medical datasets and two industrial datasets demonstrate the effectiveness of our method in improving the detection and localization performance.

Abstract: In this paper, we delve into the utilization of the negative momentum technique in constrained minimax games. From an intuitive mechanical standpoint, we introduce a novel framework for momentum buffer updating, which extends the findings of negative momentum from the unconstrained setting to the constrained setting and provides a universal enhancement to the classic gamesolver algorithms. Additionally, we provide theoretical guarantees of convergence for our momentum-augmented learning algorithms. We then extend these algorithms to their extensive-form counterparts. Experimental results on both Normal Form Games (NFGs) and Extensive Form Games (EFGs) demonstrate that our momentum techniques can significantly improve algorithm performance, surpassing both their original versions and the SOTA baselines by a large margin.

Abstract: In this paper, we consider a perturbationbased metric of predictive faithfulness of feature rankings (or attributions) that we call PGI squared When applied to decision tree-based regression models, the metric can be computed exactly and efficiently for arbitrary independent feature perturbation distributions. In particular, the computation does not involve Monte Carlo sampling that has been typically used for computing similar metrics and which is inherently prone to inaccuracies. As a second contribution, we proposed a procedure for constructing feature ranking based on PGI squared. Our results indicate the proposed ranking method is comparable to the widely recognized SHAP explainer, offering a viable alternative for assessing feature importance in tree-based models.

College of Computer Science and Technology, National University of Defense Technology, Changsha, China. State Key Laboratory of Complex & Critical Software Environment, Changsha, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China., School of Computer Science, Tsinghua University, Beijing, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China. State Key Laboratory of Complex & Critical Software Environment, Changsha, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China. State Key Laboratory of Complex & Critical Software Environment, Changsha, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China. State Key Laboratory of Complex & Critical Software Environment, Changsha, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China., College of Computer Science and Technology, National University of Defense Technology, Changsha, China. State Key Laboratory of Complex & Critical Software Environment, Changsha, China.

Abstract: Logitbased knowledge distillation (KD) is commonly used to mitigate catastrophic forgetting in class-incremental learning (CIL) caused by data distribution shifts. However, the strict match of logit values between student and teacher models conflicts with the cross-entropy (CE) loss objective of learning new classes, leading to significant recency bias (i.e. unfairness). To address this issue, we rethink the overlooked limitations of KD-based methods through empirical analysis. Inspired by our findings, we introduce a plug-and-play pre-process method that normalizes the logits of both the student and teacher across all classes, rather than just the old classes, before distillation. This approach allows the student to focus on both old and new classes, capturing intrinsic inter-class relations from the teacher. By doing so, our method avoids the inherent conflict between KD and CE, maintaining fairness between old and new classes. Additionally, recognizing that overconfident teacher predictions can hinder the transfer of inter-class relations (i.e., dark knowledge), we extend our method to capture intra-class relations among different instances, ensuring fairness within old classes. Our method integrates seamlessly with existing logit-based KD approaches, consistently enhancing their performance across multiple CIL benchmarks without incurring additional training costs.

Abstract: Multimodal learning aims to learn predictive models based on the data from different modalities. However, due to the requirement of data security and privacy protection, real-world multi-modal data are often scattered to different agents and cannot be shared across the agents, which limits the application of existing multi-modal learning methods. To achieve robust multi-modal learning in such a challenging scenario, we propose a novel optimal transport-based mixer (OTM), which works as an effective latent code alignment and augmentation method for unaligned and distributed multi-modal data. In particular, we train a Wasserstein autoencoder (WAE) for each agent, which encodes its single modal samples in a latent space. Through a central server, the proposed OTM computes a stochastic fused Gromov-Wasserstein barycenter (FGWB) to mix different modalities' latent codes, so that each agent applies the barycenter to reconstruct its samples. This method neither requires well-aligned multi-modal data nor assumes the data to share the same latent distribution, and each agent can learn a specific model based on multi-modal data while achieving inference based on its local modality. Experiments on multi-modal clustering and classification demonstrate that the models learned with the OTM method outperform the corresponding baselines.

Abstract: Graph Contrastive Learning (GCL), as a primary paradigm of graph selfsupervised learning, spurs a fruitful line of research in tackling the data sparsity issue by maximizing the consistency of user/item embeddings between different augmented views with random perturbations. However, diversity, as a crucial metric for recommendation performance and user satisfaction, has received rather little attention. In fact, there exists a challenging dilemma in balancing accuracy and diversity. To address these issues, we propose a new Graph Contrastive Learning (DivGCL) model for diversifying recommendations. Inspired by the excellence of the determinant point process (DPP), DivGCL adopts a DPP likelihood-based loss function to achieve an ideal trade-off between diversity and accuracy, optimizing it jointly with the advanced Gaussian noise-augmented GCL objective. Extensive experiments on four popular datasets demonstrate that DivGCL surpasses existing approaches in balancing accuracy and diversity, with an improvement of 23.47% at T@20 (abbreviation for trade-off metric) on ML-1M.

Abstract: This paper presents a novel framework for multiarmed bandit problems with side-observations and switching constraints, which arises in a range of real-world applications such as robotic. To address the challenges of effectively utilizing graph-structured observations while adhering to graph constraints, we design graph-agnostic and graph-aware algorithms tailored to this new setting. Specifically, our graph-agnostic algorithm selects nodes with the highest upper confidence bound without prior knowledge of feedback probabilities, while minimizing switching costs using offline shortest path planning and the doubling trick. If the graph structure and associated probability matrix are known, our graph-aware algorithm plans the exploration step using a linear programming approach and eliminates suboptimal nodes iteratively. We rigorously analyze the performance of our proposed algorithms, providing near-optimal minimax and instance-dependent regret upper bounds. Our analysis shows that our algorithms outperform generic reinforcement learning methods in terms of both regret and computational efficiency. Extensive numerical experiments on various types of graphs, including two real-world datasets, demonstrate the efficacy of our proposed methods and their advantages over benchmark methods in graph bandit settings.

Abstract: Federated Learning (FL) aims to protect data privacy by enabling clients to collectively train machine learning models without sharing their raw data. However, recent studies demonstrate that information exchanged during FL is subject to Gradient Inversion Attacks (GIA) and, consequently, a variety of privacypreserving methods have been integrated into FL to thwart such attacks, such as Secure Multi-party Computing (SMC), Homomorphic Encryption (HE), and Differential Privacy (DP). Despite their ability to protect data privacy, these approaches inherently involve substantial privacy-utility trade-offs. By revisiting the key to privacy exposure in FL under GIA, which lies in the frequent sharing of model gradients that contain private data, we take a new perspective by designing a novel privacy preserve FL framework that effectively ``breaks the direct connection'' between the shared parameters and the local private data to defend against GIA. Specifically, we propose a Hypernetwork Federated Learning (HyperFL) framework that utilizes hypernetworks to generate the parameters of the local model and only the hypernetwork parameters are uploaded to the server for aggregation. Theoretical analyses demonstrate the convergence rate of the proposed HyperFL, while extensive experimental results show the privacy-preserving capability and comparable performance of HyperFL.

Abstract: When dealing with multiview data, the heterogeneity of data attributes across different views often leads to label ambiguity. To effectively address this challenge, this paper designs a Multi-View Partial-Label Learning (MVPLL) framework, where each training instance is described by multiple view features and associated with a set of candidate labels, among which only one is correct. The key to deal with such problem lies in how to effectively fuse multi-view information and accurately disambiguate these ambiguous labels. In this paper, we propose a novel approach named CFDM, which explores the consistency and complementarity of multi-view data by multi-view contrastive fusion and reduces label ambiguity by multi-class contrastive prototype disambiguation. Specifically, we first extract view-specific representations using multiple view-specific autoencoders, and then integrate multi-view information through both inter-view and intra-view contrastive fusion to enhance the distinctiveness of these representations. Afterwards, we utilize these distinctive representations to establish and update prototype vectors for each class within each view. Based on these, we apply contrastive prototype disambiguation to learn global class prototypes and accordingly reduce label ambiguity. In our model, multi-view contrastive fusion and multi-class contrastive prototype disambiguation are conducted mutually to enhance each other within a coherent framework, leading to a more ideal classification performance. Experimental results on multiple datasets have demonstrated that our proposed method is superior to other state-of-the-art methods.

Abstract: Autoregressive models have made significant progress in the realm of text-to-image synthesis, yet devising an appropriate model architecture and training strategy to achieve a satisfactory level remains an important avenue of exploration. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information—freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

College of Computer Science and Technology, National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology

Abstract: As the demands for superior agents grow, the training complexity of Deep Reinforcement Learning (DRL) becomes higher. Thus, accelerating training of DRL has become a major research focus. Dividing the DRL training process into subtasks and using parallel computation can effectively reduce training costs. However, current DRL training systems lack sufficient parallelization due to data assignment between sub-task components. This assignment issue has been ignored, but addressing it can further boost training efficiency. Therefore, we propose a high-throughput distributed RL training system called TianJi. It relaxes assignment dependencies between sub-task components and enables event-driven asynchronous communication. Meanwhile, TianJi maintains clear boundaries between sub-task components. To address convergence uncertainty from relaxed assignment dependencies, TianJi proposes a distributed strategy based on the balance of sample production and consumption. The strategy controls the staleness of samples to correct their quality, ensuring convergence. We conducted extensive experiments. TianJi achieves a convergence time acceleration ratio of up to 4.37 compared to related comparison frameworks. When scaled to eight computational nodes, TianJi shows a convergence time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, emonstrating its capability to accelerate training and scalability. In data transmission efficiency experiments, TianJi significantly outperforms other frameworks, approaching hardware limits. TianJi also shows effectiveness in on-policy algorithms, achieving convergence time acceleration ratios of 4.36 and 2.95 compared to RLlib and XingTian.

Abstract: In this study, we consider an optimization problem with uncertainty dependent on decision variables, which has recently attracted attention due to its importance in machine learning and pricing applications. In this problem, the gradient of the objective function cannot be obtained explicitly because the decisiondependent distribution is unknown. Therefore, several zeroth-order methods have been proposed, which obtain noisy objective values by sampling and update the iterates. Although these existing methods have theoretical convergence for optimization problems with decision-dependent uncertainty, they require strong assumptions about the function and distribution or exhibit large variances in their gradient estimators. To overcome these issues, we propose two zeroth-order methods under mild assumptions. First, we develop a zeroth-order method with a new one-point gradient estimator including a variance reduction parameter. The proposed method updates the decision variables while adjusting the variance reduction parameter. Second, we develop a zeroth-order method with a two-point gradient estimator. There are situations where only one-point estimators can be used, but if both one-point and two-point estimators are available, it is more practical to use the two-point estimator. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case iteration and sample complexity analysis. Our simulation experiments with real data on a retail service application show that our methods output solutions with lower objective values than the conventional zeroth-order methods.

Abstract: The scalability of instructable agents in robotics or gaming is often hindered by limited data that pairs instructions with agent trajectories. However, large datasets of unannotated trajectories containing sequences of various agent behaviour (play trajectories) are often available. In a semisupervised setup, we explore methods to extract labelled segments from play trajectories. The goal is to augment a small annotated dataset of instruction-trajectory pairs to improve the performance of an instruction-following policy trained downstream via imitation learning. Assuming little variation in segment length, recent video segmentation methods can effectively extract labelled segments. To address the constraint of segment length, we propose Play Segmentation (PS), a probabilistic model that finds maximum likely segmentations of extended subsegments, while only being trained on individual instruction segments. Our results in a game environment and a simulated robotic gripper setting underscore the importance of segmentation; randomly sampled segments diminish performance, while incorporating labelled segments from PS improves policy performance to the level of a policy trained on twice the amount of labelled data.

Abstract: In recent years, deep multimodal learning has seen significant advancements. However, there remains a lack of multimodal fusion methods capable of dynamically adjusting the weighting of information both within and across modalities based on input samples. In the domain of multimodal intent recognition, the text modality often contains the most relevant information for intent detection, while the audio and visual modalities provide comparatively less critical information. There is a significant variation in the density of important information across different modalities and samples. To address this challenge, we propose a Dynamic Attention Allocation Fusion (DAF) method with an adaptive network structure that dynamically allocates attention both within individual modalities and across multiple modalities. This approach enables the model to focus more effectively on the most informative modalities and their respective internal features. Furthermore, we introduce a multiview contrastive learning framework based on DAF (MVCL-DAF). This framework uses distinct and isolated modules to process information from various modalities, taking inspiration from the way the human brain processes multimodal information. Each modality independently infers intent using its respective module, while DAF integrates the multimodal information to produce a comprehensive global intent prediction. The text modality, functioning as the primary modality due to its rich semantic content, guides the other modules in the multi-view contrastive learning process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods.

Abstract: Automating architectural floorplan design is vital for housing and interior design, offering a faster, costeffective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges in design complexity and constrained generation with extensive post-processing, and tend to obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model’s perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design.

Abstract: NeuroSymbolic (NeSy) AI could be regarded as an analogy to human dual-process cognition, modeling the intuitive System 1 with neural networks and the algorithmic System 2 with symbolic reasoning. However, for complex learning targets, NeSy systems often generate outputs inconsistent with domain knowledge and it is challenging to rectify them. Inspired by the human Cognitive Reflection, which promptly detects errors in our intuitive response and revises them by invoking the System 2 reasoning, we propose to improve NeSy systems by introducing Abductive Reflection (ABL-Refl) based on the Abductive Learning (ABL) framework. ABL-Refl leverages domain knowledge to abduce a reflection vector during training, which can then flag potential errors in the neural network outputs and invoke abduction to rectify them and generate consistent outputs during inference. ABL-Refl is highly efficient in contrast to previous ABL implementations. Experiments show that ABL-Refl outperforms state-of-the-art NeSy methods, achieving excellent accuracy with fewer training resources and enhanced efficiency.

Abstract: Partial label learning (PLL) is a complicated weakly supervised multiclassification task compounded by class imbalance. Currently, existing methods only rely on inter-class pseudo-labeling from inter-class features, often overlooking the significant impact of the intra-class imbalanced features combined with the inter-class. To address these limitations, we introduce Granular Ball Representation for Imbalanced PLL (GBRIP), a novel framework for imbalanced PLL. GBRIP utilizes coarse-grained granular ball representation and multi-center loss to construct a granular ball-based feature space through unsupervised learning, effectively capturing the feature distribution within each class. GBRIP mitigates the impact of confusing features by systematically refining label disambiguation and estimating imbalance distributions. The novel multi-center loss function enhances learning by emphasizing the relationships between samples and their respective centers within the granular balls. Extensive experiments on standard benchmarks demonstrate that GBRIP outperforms existing state-of-the-art methods, offering a robust solution to the challenges of imbalanced PLL.

Abstract: Hierarchical pooling in conjunction with Graph Neural Networks (GNN) improves performance in graph classification tasks. Hierarchical pooling has to produce Multiresolution representations while preserving graph-level information. One such hierarchical pooling is Standard-Sparse-Pooling (SSP). SSP assigns an importance score to each node, selects the Top-K nodes, and scales their attributes by their scores to produce the output. We reveal SSPs’ tendency to pool Over Representative Regions (ORR) on the graph’s signal space leaving some regions unpooled thus proving that SSP is incapable of preserving graph-level information robustly. We propose to overcome this by an improved differentiable exploration over the graph’s signal space dubbed as skipping hence the name SkipPool. We tested SkipPool & its variant SkipPool-Full each against matching pooling methods. Proposed methods achieve new state-of-the-art performance on the majority of benchmark datasets. Moreover, we show that skipping is more robust at capturing the graph signal space consequently preserving more graph-level information than its counterparts. Proposed methods require a reasonably few parameters and their execution time can be kept low with parallelization.

Abstract: Gradient Descent (GD) and Conjugate Gradient (CG) methods are among the most effective iterative algorithms for solving unconstrained optimization problems, particularly in machine learning and statistical modeling, where they are employed to minimize cost functions. In these algorithms, tunable parameters, such as step sizes or conjugate parameters, play a crucial role in determining key performance metrics, like runtime and solution quality. In this work, we introduce a framework that models algorithm selection as a statistical learning problem, and thus learning complexity can be estimated by the pseudodimension of the algorithm group. We first propose a new cost measure for unconstrained optimization algorithms, inspired by the concept of primal-dual integral in mixed-integer linear programming. Based on the new cost measure, we derive an improved upper bound for the pseudo-dimension of gradient descent algorithm group by discretizing the set of step size configurations. Moreover, we generalize our findings from gradient descent algorithm to the conjugate gradient algorithm group for the first time, and prove the existence a learning algorithm capable of probabilistically identifying the optimal algorithm with a sufficiently large sample size.

Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany

Abstract: This work introduces IsUMap, a novel manifold learning technique that enhances data representation by integrating aspects of UMAP and Isomap with VietorisRips filtrations and metric realization of one-parameter filtrations of simplicial complexes. Inferring topological information from combinatorial models which have been built according to metric relations (Vietoris-Rips complexes) has proven useful in topological data analysis and general machine learning applications. This encourages the use of such objects for geometric inference. We extend this research direction by proposing a clear theoretical pipeline that not only provides a comprehensive guide for assigning a (triangulated) metric space to every admissible one- parameter filtration of simplicial complexes but also offers a method for merging these objects. With this, our method presents a systematic and detailed construction of a metric representation for locally distorted metric spaces that captures complex data structures more accurately than the previous schemes. Our approach addresses limitations in existing methods by accommodating non-uniform data distributions and intricate local geometries. We validate its performance through extensive experiments on examples with known geometries and in applications to data, in particular from computational biology.

Abstract: Probabilistic Circuits (PCs) have emerged as an efficient framework for representing and learning complex probability distributions. Nevertheless, the existing body of research on PCs predominantly concentrates on datadriven parameter learning, often neglecting the potential of knowledge-intensive learning, a particular issue in data-scarce/knowledge-rich domains such as healthcare. To bridge this gap, we propose a novel unified framework that can systematically integrate diverse domain knowledge into the parameter learning process of PCs. Experiments on several benchmarks as well as real world datasets show that our proposed framework can both effectively and efficiently leverage domain knowledge to achieve superior performance compared to purely data-driven learning approaches.

Abstract: In this study, we address causal inference when only observational data and a valid causal ordering from the causal graph are available. We introduce a set of flow models that can recover componentwise, invertible transformation of exogenous variables. Our flow-based methods offer flexible model design while maintaining causal consistency regardless of the number of discretization steps. We propose design improvements that enable simultaneous learning of all causal mechanisms and reduce abduction and prediction complexity to linear O(n) relative to the number of layers, independent of the number of causal variables. Empirically, we demonstrate that our method outperforms previous state-of-the-art approaches and delivers consistent performance across a wide range of structural causal models in answering observational, interventional, and counterfactual questions. Additionally, our method achieves a significant reduction in computational time compared to existing diffusion-based techniques, making it practical for large structural causal models.

Abstract: Given an edgeincomplete graph, how can we accurately find its missing links? The problem aims to discover the missing relations between entities when their relationships are represented as a graph. Edge-incomplete graphs are prevalent in real-world due to practical limitations, such as not checking all users when adding friends in a social network. Addressing the problem is crucial for various tasks, including recommending friends in social networks and finding references in citation networks. However, previous approaches rely heavily on the given edge-incomplete (observed) graph, making it challenging to consider the missing (unobserved) links. In this paper, we propose PULL, an accurate link prediction method based on the positive-unlabeled (PU) learning. PULL treats the observed edges in the training graph as positive examples, and the unconnected node pairs as unlabeled ones. PULL effectively prevents the link predictor from blindly trusting the observed graph by proposing latent variables for every edge, and leveraging the expected graph structure with respect to these variables. Extensive experiments on real- world datasets show that PULL consistently outperforms the baselines for predicting links in edge-incomplete graphs.

Abstract: Online ClassIncremental Learning (OCIL) enables a model to learn new classes from a data stream. Since data stream samples are seen only once and the capacity of storage is constrained, OCIL is particularly susceptible to Catastrophic Forgetting (CF). While exemplar replay methods alleviate CF by storing representative samples, the limited capacity of the buffer inhibits capturing the entire old data distribution, leading to CF. In this regard, recent papers suggest image compression for better memory usage. However, existing methods raise two concerns: computational overhead and compression defects. On one hand, computational overhead can limit their applicability in OCIL settings, as models might miss learning opportunities from the current streaming data if computational resources are budgeted and preoccupied with compression. On the other hand, typical compression schemes demanding low computational overhead, such as JPEG, introduce noise detrimental to training. To address these issues, we propose Salient Frequency-aware Exemplar Compression (SFEC), an efficient and effective JPEG-based compression framework. SFEC exploits saliency information in the frequency domain to reduce negative impacts from compression artifacts for learning. Moreover, SFEC employs weighted sampling for exemplar elimination based on the distance between raw and compressed data to mitigate artifacts further. Our experiments employing the baseline OCIL method on benchmark datasets such as CIFAR-100 and Mini-ImageNet demonstrate the superiority of SFEC over previous exemplar compression methods in streaming scenarios.

Abstract: Federated Learning (FL) offers a decentralized approach to model training, where data remains local and only model parameters are shared between the clients and the central server. Traditional methods, such as Federated Averaging (FedAvg), linearly aggregate these parameters which are usually trained on heterogeneous data distributions, potentially overlooking the complex, highdimensional nature of the parameter space. This can result in degraded performance of the aggregated model. While personalized FL approaches can mitigate the heterogeneous data issue to some extent, the limitation of linear aggregation remains unresolved. To alleviate this issue, we investigate the generative approach of diffusion model and propose a novel generative parameter aggregation framework for personalized FL, pFedGPA. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions and propose a parameter inversion method to efficiently generate a set of personalized parameters for each client. This inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of a client's model parameters on the specific data distribution using the high-capacity diffusion model, pFedGPA can effectively decouple the complexity of the overall distribution of all clients' model parameters from the complexity of each individual client's parameter distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.

Abstract: Bayesian optimization (BO) is a key technique for solving blackbox optimization problems. This study extends the scope of BO from conventional applications (e.g., AutoML and robotics learning) to automated tuning of software systems. Despite GP (Gaussian Process) implementing a foundation formalism for exploitation and exploration in BO, its limited predictive power and unrealistic assumptions (e.g., continuity and Gaussianity) can severely affect its effectiveness and efficiency in tuning complex software systems. To overcome these limitations, we propose a BO framework CoffeeBoost, which implements exploitation and exploration with a GBDT-native distribution-free probabilistic surrogate model. CoffeeBoost constructs surrogate models via stochastic gradient boosting ensembles (SGBE) and quantifies probabilistic distributions via distribution-free conformal predictive systems. Moreover, CoffeeBoost leverages the residual paths in SGBE to improve the local adaptiveness of the resulting predictive distributions in a GBDT-native manner. Across eight auto-tuning benchmarks for database management systems (DBMS), we evaluate CoffeeBoost and show its superior learnability and optimizability against existing GP-based and tree-ensemble-based BO schemes. Detailed analysis further shows CoffeeBoost's predictive distributions excel in both coverage and tightness.

Abstract: The Weber location problem is widely used in several artificial intelligence scenarios. However, the gradient of the objective does not exist at a considerable set of singular points. Recently, a desingularity subgradient method has been proposed to fix this problem, but it can only handle the q-th-powered l_2-norm case (1<= q<2), which has only finite singular points. In this paper, we further establish the de-singularity subgradient for the q-th-powered l_p-norm case with 1<= q<= p and 1<= p<2, which includes all the rest unsolved situations in this problem. This is a challenging task because the singular set is a continuum. The geometry of the objective function is also complicated so that the characterizations of the subgradients, minimum and descent direction are very difficult. We develop a q-th-powered l_p-norm Weiszfeld Algorithm without Singularity (qPpNWAWS) for this problem, which ensures convergence and the descent property of the objective function. Extensive experiments on six real-world data sets demonstrate that qPpNWAWS successfully solves the singularity problem and achieves a linear computational convergence rate in practical scenarios.

Abstract: Balancing predictive power and interpretability has long been a challenging research area, particularly in powerful yet complex models like neural networks, where nonlinearity obstructs direct interpretation. This paper introduces a novel approach to constructing an explainable neural network that harmonizes predictiveness and explainability. Our model is designed as a linear combination of a sparse set of jointly learned features, each derived from a different trainable function applied to a single 1dimensional input feature. Leveraging the ability to learn arbitrarily complex relationships, our neural network architecture enables automatic selection of a sparse set of important features, with the final prediction being a sum of rescaled versions of these features. We demonstrate the ability to select significant features while maintaining comparable predictive performance and direct interpretability through extensive experiments on synthetic and real-world datasets. We also provide theoretical analysis on the generalization bounds of our framework, which is favorably linear in the number of selected features and only logarithmic in the number of input features. We further lift any dependence of sample complexity on the number of parameters or the architectural details under very mild conditions. Our research paves the way for further research on sparse and explainable neural networks with guarantees.

Abstract: Catastrophic forgetting is when a neural network loses previously learnt information after learning a new task sequentially. Avoiding catastrophic forgetting could reduce the resources necessary to update neural networks. Recently, Kolmogorov–Arnold Networks (KAN) gained the community's attention as preliminary experiments suggest KAN avoid catastrophic forgetting. KAN replace neural network edges with learnable Bsplines and sum incoming edges in nodes. Proponents of KAN argue they avoid forgetting, are more accurate, are interpretable, and use fewer parameters. Our work investigates the claims that KAN avoid catastrophic forgetting, finding that they fail to do so on more complex datasets containing features that overlap between tasks. We give a simple explanation as to why and how KAN catastrophically forget. Motivated by evidence suggesting KAN are superior for symbolic regression, we augment KAN in the same ways as multilayer perceptron (MLP) to perform continual learning tasks, making special accommodations to support KAN. Our experiments found that unmodified KAN often forget more than MLP, but KAN can be better than MLP when combined with continual learning strategies. We aim to highlight some of the current shortcomings and strengths associated with KAN for continual learning.

Abstract: The offline datasets for imitation learning (IL) in multiagent games typically contain player trajectories exhibiting diverse strategies, which necessitate measures to prevent learning algorithms from acquiring undesirable behaviors. Learning representations for these trajectories is an effective approach to depicting the strategies employed by each demonstrator. However, existing learning strategies often require player identification or rely on strong assumptions, which are not appropriate for multi-agent games. Therefore, in this paper, we introduce the Strategy Representation for Imitation Learning (STRIL) framework, which (1) effectively learns strategy representations in multi-agent games, (2) estimates proposed indicators based on these representations, and (3) filters out sub-optimal data using the indicators. STRIL is a plug-in method that can be integrated into existing IL algorithms. We demonstrate the effectiveness of STRIL across competitive multi-agent scenarios, including Two-player Pong, Limit Texas Hold'em, and Connect Four. Our approach successfully acquires strategy representations and indicators, thereby identifying dominant trajectories and significantly enhancing existing IL performance across these environments.

Abstract: Training generally capable agents in complex environments is a challenging task that involves identifying "right" environments at the training stage. Recent research has highlighted the potential of the Unsupervised Environment Design framework, which generates environment instances/levels adaptively at the frontier of the agent’s capabilities using regret measures. While regret approaches have shown great promise in generating feasible environments, they can produce difficult environments that are challenging for an RL agent to learn from. This is because regret represents the bestcase (upper bound) learning potential and not the actual learning potential of an environment. To address this limitation, we propose an alternative mechanism that employs marginal benefit, focusing on the improvement (in terms of generalized performance) the agent policy gets for a given environment. The advantage of this new mechanism is that it is agent-focused (and not environment focused) and generates the "right" environments depending on the agent's policy. Additionally, to improve the generalizability of the agent, we introduce representative state diversity metric that aims to generate varied experiences for the agent. Finally, we provide detailed experimental results and ablation analysis to showcase the effectiveness of our new methods. We obtain SOTA results among RL based environment generation methods.

Abstract: The rapid advancement of multiagent reinforcement learning (MARL) has given rise to diverse training paradigms to learn the policies of each agent in the multi-agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade the performance metrics, such as win rate, total reward, etc. To reduce the influence of the increasing number of agents on the performance metrics, we propose a novel training paradigm of grouped training decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large-scale multi-agent systems. Specifically, we first introduce an adaptive grouping module, which divides each agent into different groups based on their observation history. To implement end-to-end training, GTDE uses Gumbel-Sigmoid for efficient point-to-point sampling on the grouping distribution while ensuring gradient backpropagation. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 382% compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.

Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University Zhejiang Institute of Optoelectronics, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, School of Computer Science and Technology, Zhejiang Normal University Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, School of Computer Science and Technology, Zhejiang Normal University Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, School of Artificial Intelligence, and Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education, Beijing Normal University, Department of Mathematics, City University of Hong Kong, Department of Computer Science and Technology, Cambridge University

Abstract: Hypergraph neural networks (HNNs) have shown promise in handling tasks characterized by highorder correlations, achieving notable success across various applications. However, there has been limited focus on heterophilic hypergraph learning (HHL), in contrast to the increasing attention given to graph neural networks designed for graphs exhibiting heterophily. This paper aims to pave the way for HHL by addressing key gaps from multiple perspectives: measurement, dataset diversity, and baseline model development. First, we introduce metrics to quantify heterophily in hypergraphs, providing a numerical basis for assessing the homophily/heterophily ratio. Second, we develop diverse benchmark datasets across various real-world scenarios, facilitating comprehensive evaluations of existing HNNs and advancing research in HHL. Additionally, as a novel baseline model, we propose HyperUFG, a framelet-based HNN integrating both low-pass and high-pass filters. Extensive experiments conducted on synthetic and benchmark datasets highlight the challenges current HNNs face with heterophilic hypergraphs, while showcasing that HyperUFG performs competitively and often outperforms many existing models in such scenarios. Overall, our study underscores the urgent need for further exploration and development in this emerging field, with the potential to inspire and guide future research in HHL.

Abstract: Highlevel synthesis (HLS) is a widely used tool in designing Field Programmable Gate Array (FPGA). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called ``kernel'') and several pragmas that instruct hardware synthesis, such as parallelization, pipeline, etc. While it is relatively easy for software developers to design the program, it heavily relies on hardware knowledge to design the pragmas, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model on new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE), that can be flexibly adapted to any GNN model. Different expert networks can learn to deal with different regions in the representation space, and they can utilize similar patterns between the old kernels and new kernels. In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.

Abstract: The core challenge of highdimensional and expensive black-box optimization (BBO) is how to obtain better performance faster with little function evaluation cost. The essence of the problem is how to design an efficient optimization strategy tailored to the target task. This paper designs a powerful optimization framework to automatically learn the optimization strategies from the target or cheap surrogate task without human intervention. However, current methods are weak for this due to poor representation of optimization strategy. To achieve this, 1) drawing on the mechanism of genetic algorithm, we propose a deep neural network framework called B2Opt, which has a stronger representation of optimization strategies based on survival of the fittest; 2) B2Opt can utilize the cheap surrogate functions of the target task to guide the design of the efficient optimization strategies. Compared to the state-of-the-art BBO baselines, B2Opt can achieve multiple orders of magnitude performance improvement with less function evaluation cost.

Abstract: Many realworld data are inherently multi-dimensional, e.g., color images, videos, and hyperspectral images. How to effectively and compactly represent these multi-dimensional data within a unified framework is an important pursuit. Previous methods focus on tensor factorizations, convolutional networks, or diffusion models for multi-dimensional data representation, which may not fully utilize inherent data structures and may lead to redundant parameters. In this work, we propose a Deep Rank-One Tensor Functional Factorization (DRO-TFF), which internally utilizes more comprehensive data priors facilitated by much fewer parameters. Concretely, our DRO-TFF consists of three organically integrated blocks: compact rank-one factorizations in the spatial domain, a deep transform to capture underlying low-dimensional structures, and smooth factors parameterized by implicit neural representations. Through a series of theoretical analysis, we show the rich data priors encoded in the DRO-TFF structure, e.g., Lipschitz smoothness and low-rankness. Extensive experiments on multi-dimensional data recovery problems, such as image and video inpainting, image denoising, and hyperspectral mixed noise removal, showcase the effectiveness of the proposed method.

Abstract: Graph contrastive learning (GCL) aims to learn representations from unlabeled graph data in a selfsupervised manner and has developed rapidly in recent years. However, edge-level contrasts are not well explored by most existing GCL methods. Most studies in GCL only regard edges as auxiliary information while updating node features. One of the primary obstacles of edge-based GCL is the heavy computation burden. To tackle this issue, we propose a model that can efficiently learn edge features for GCL, namely Augmentation-Free Edge Contrastive Learning (AFECL) to achieve edge-edge contrast. AFECL depends on no augmentation consisting of two parts. Firstly, we design a novel edge feature generation method, where edge features are computed by embedding concatenation of their connected nodes. Secondly, an edge contrastive learning scheme is developed, where edges connecting the same nodes are defined as positive pairs, and other edges are defined as negative pairs. Experimental results show that compared with recent state-of-the-art GCL methods or even some supervised GNNs, AFECL achieves SOTA performance on link prediction and semi-supervised node classification of extremely scarce labels.

Abstract: In recent years, Federated Domain Generalization (FedDG) has succeeded in generalizing to unknown clients (domains). However, current methods only utilize training data, and when there is a significant difference between the unknown client and source client domains (domain shift), these methods cannot ensure model performance. This limitation appears to have caused research in FedDG to reach a bottleneck. On the other hand, test data is a resource that can help models adapt while previous FedDG approaches have not taken this into account. In this paper, we introduce a new framework TTAFedDG to address the FedDG problem, which leverages test-time adaptation (TTA) to adapt across different domains, thereby enhancing the generalization of the model. We propose the method Federated domain generalization based on select Strong Pseudo Label (FedSPL), which combines fast feature matching and knowledge distillation. Our method consists of two parts. Firstly, we use fast feature reordering for feature mixing during local updates on the client side, improving the robustness of the global model and enhancing its generalization ability to mitigate domain shift. Secondly, we employ a teacher-student model with contrastive learning and label selection during the testing phase, enabling the global model to better adapt to the distribution of the target client,thereby alleviating domain shift. Extensive experiments havedemonstrated the effectiveness of FedSPL in handling domain shift, outperforming existing FedDG methods across multiple datasets and model architectures.

Abstract: Previous graph neural networks (GNNs) usually assume that the graph data is with clean labels for representation learning, but it is not true in real applications. In this paper, we propose a new multiteacher distillation method based on bi-level optimization (namely BO-NNC), to conduct noisy node classification on the graph data. Specifically, we first employ multiple self-supervised learning methods to train diverse teacher models, and then aggregate their predictions through a teacher weight matrix. Furthermore, we design a new bi-level optimization strategy to dynamically adjust the teacher weight matrix based on the training progress of the student model. Finally, we design a label improvement module to improve the label quality. Extensive experimental results on real datasets show that our method achieves the best results compared to state-of-the-art methods.

Abstract: This paper investigates a challenging problem of zeroshot learning in the multi-label scenario (MLZSL), wherein the model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually resort to analyzing the relationship of various seen classes residing in a sample from the dimension of spatial or semantic characteristics and transferring the learned model to unseen ones. However, they neglect the integrity of local and global features. Although the use of the attention structure will accurately locate local features, especially objects, it will significantly lose its integrity, and the relationship between classes will also be affected. Rough processing of global features will also directly affect comprehensiveness. This neglect will make the model lose its grasp of the main components of the image. Relying only on the local existence of seen classes during the inference stage introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. In terms of spatial information, we achieve effective refinement by group aggregating image features into several semantic prompts. It can aggregate semantic information rather than class information, preserving the correlation between semantics. In terms of global semantics, we use global forward propagation to collect as much information as possible to ensure that semantics are not omitted. Experiments on large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods with large margins.

Abstract: Offline MultiAgent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to single-agent case, multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions. Theoretically, we prove InSPO guarantees monotonic policy improvement and converges to quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences School of Computer Science and Technology, University of Chinese Academy of Sciences, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, School of Computer Science and Technology, University of Chinese Academy of Sciences, School of Computer Science and Technology, University of Chinese Academy of Sciences, School of Computer Science and Technology, University of Chinese Academy of Sciences Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management, University of Chinese Academy of Sciences

Abstract: This paper addresses the challenge of Granularity Competition in finegrained classification tasks, which arises due to the semantic gap between multi-granularity labels. Existing approaches typically develop independent hierarchy-aware models based on shared features extracted from a common base encoder. However, because coarse-grained levels are inherently easier to learn than finer ones, the base encoder tends to prioritize coarse feature abstractions, which impedes the learning of fine-grained features. To overcome this challenge, we propose a novel framework called the Bidirectional Logits Tree (BiLT) for Granularity Reconcilement. The key idea is to develop classifiers sequentially from the finest to the coarsest granularities, rather than parallelly constructing a set of classifiers based on the same input features. In this setup, the outputs of finer-grained classifiers serve as inputs for coarser-grained ones, facilitating the flow of hierarchical semantic information across different granularities. On top of this, we further introduce an Adaptive Intra-Granularity Difference Learning (AIGDL) approach to uncover subtle semantic differences between classes within the same granularity. Extensive experiments demonstrate the effectiveness of our proposed method.

Abstract: Identifying the causal pathways of unfairness is a critical objective for improving policy design and algorithmic decisionmaking. Prior work in causal fairness analysis often requires knowledge of the causal graph, hindering practical applications in complex or low-knowledge domains. Moreover, global discovery methods that learn causal structure from data can display unstable performance on finite samples, preventing robust fairness conclusions. To mitigate these challenges, we introduce local discovery for direct discrimination (LD3): a method that uncovers structural evidence of direct unfairness by identifying the causal parents of an outcome variable. LD3 performs a linear number of conditional independence tests relative to variable set size, and allows for latent confounding under the sufficient condition that all parents of the outcome are observed. We show that LD3 returns a valid adjustment set (VAS) under a new graphical criterion for the weighted controlled direct effect, a qualitative indicator of direct discrimination. LD3 limits unnecessary adjustment, providing interpretable VAS for assessing unfairness. We use LD3 to analyze causal fairness in two complex decision systems: criminal recidivism prediction and liver transplant allocation. LD3 was more time-efficient and returned more plausible results on real-world data than baselines, which took 46× to 5870× longer to execute.

Abstract: Continual learning (CL) learns a sequence of tasks incrementally. This paper studies the challenging CL setting of classincremental learning (CIL). CIL has two key challenges: catastrophic forgetting (CF) and inter-task class separation (ICS). Despite numerous proposed methods, these issues remain persistent obstacles. This paper proposes a novel CIL method, called Kernel Linear Discriminant Analysis (KLDA), that can effectively avoid CF and ICS problems. It leverages only the powerful features learned in a foundation model (FM). However, directly using these features proves suboptimal. To address this, KLDA incorporates the Radial Basis Function (RBF) kernel and its Random Fourier Features (RFF) to enhance the feature representations from the FM, leading to improved performance. When a new task arrives, KLDA computes only the mean for each class in the task and updates a shared covariance matrix for all learned classes based on the kernelized features. Classification is performed using Linear Discriminant Analysis. Our empirical evaluation using text and image classification datasets demonstrates that KLDA significantly outperforms baselines. Remarkably, without relying on replay data, KLDA achieves accuracy comparable to joint training of all classes, which is considered the upper bound for CIL performance.

Abstract: Modern Hopfield networks (MHNs) have recently gained significant attention in the field of artificial intelligence because they can store and retrieve a large set of patterns with an exponentially large memory capacity. A MHN is generally a dynamical system defined with Lagrangians of memory and feature neurons,where memories associated with indistribution (ID) samples are represented by attractors in the feature space. One major problem in existing MHNs lies in managing out-of-distribution (OOD) samples because it was originally assumed that all samples are ID samples. To address this, we propose the rectified Lagrangian (RegLag), a new Lagrangian for memory neurons that explicitly incorporates an attractor for OOD samples in the dynamical system of MHNs. RecLag creates a trivial point attractor for any interaction matrix, enabling OOD detection by identifying samples that fall into this attractor as OOD. The interaction matrix is optimized so that the probability densities can be estimated to identify ID/OOD. We demonstrate the effectiveness of RecLag-based MHNs compared to energy-based OOD detection methods, including those using state-of-the-art Hopfield energies, across nine image datasets.

Abstract: Time series forecasting is crucial for various applications, such as weather forecasting, power load forecasting, and financial analysis. In recent studies, MLPmixer models for time series forecasting have been shown as a promising alternative to transformer-based models. However, the performance of these models is still yet to reach its potential. In this paper, we propose Wavelet Patch Mixer (WPMixer), a novel MLP-based model, for long-term time series forecasting, which leverages the benefits of patching, multi-resolution wavelet decomposition, and mixing. Our model is based on three key components: (i) multi-resolution wavelet decomposition, (ii) patching and embedding, and (iii) MLP mixing. Multi-resolution wavelet decomposition efficiently extracts information in both the frequency and time domains. Patching allows the model to capture an extended history with a look-back window and enhances capturing local information while MLP mixing incorporates global information. Our model significantly outperforms state-of-the-art MLP-based and transformer-based models for long-term time series forecasting in a computationally efficient way, demonstrating its efficacy and potential for practical applications.

Abstract: Offline metareinforcement learning aims to equip agents with the ability to rapidly adapt to new tasks by training on data from a set of different tasks. Context-based approaches utilize a history of state-action-reward transitions – referred to as the context – to infer a representation of the current task, and then condition the agent, i.e., the policy and value function, on this task representation. Intuitively, the better the task representation captures the underlying tasks, the better the agent can generalize to new tasks. Unfortunately, context-based approaches suffer from distribution mismatch, as the context in the offline data does not match the context at test time, limiting their ability to generalize to the test task. This leads to the task representation overfitting to the offline training data. Intuitively, the task representation should be independent of the behavior policy used to collect the offline data. To address this issue, we approximately minimize the mutual information between the distribution over the task representation and behavior policy by maximizing the entropy of behavior policy conditioned on the task representation. We validate our approach in MuJoCo environments, showing that compared to baselines, our task representation more faithfully represents the underlying tasks, leading to outperforming prior methods in both in-distribution and out-of-distribution tasks.

Abstract: Promptbased approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) Transferability: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) Resiliency: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) Authenticity: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, with further ablation studies confirming our contributions' effectiveness.

Abstract: Learning graph generative models over latent spaces has received less attention compared to models that operate on the original data space and has so far demonstrated lacklustre performance. We present GLAD a latent space graph generative model. Unlike most previous latent space graph generative models, GLAD operates on a discrete latent space that preserves to a significant extent the discrete nature of the graph structures making no unnatural assumptions such as latent space continuity. We learn the prior of our discrete latent space by adapting diffusion bridges to its structure. By operating over an appropriately constructed latent space we avoid relying on decompositions that are often used in models that operate in the original data space. We present experiments on a series of graph benchmark datasets that demonstrates GLAD as the first equivariant latent graph generative method achieves competitive performance with the state of the art baselines.

Abstract: Several recent works have focused on carrying out nonasymptotic convergence analyses for AC algorithms. Recently, a two-timescale critic-actor algorithm has been presented for the discounted cost setting in the look-up table case where the timescales of the actor and the critic are reversed and only asymptotic convergence shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity that can be made arbitrarily close to that of single-timescale AC and clearly better than the one obtained for two-timescale AC in a similar setting.. A notable feature of our analysis is that we present the asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain and show the almost sure asymptotic convergence of the (slower) critic recursion to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs the best amongst all algorithms.

Abstract: Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intrasubgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features are crucial as they vary significantly across different graph types, such as social networks where they represent communities, and biochemical networks where they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, that centers around subgraphs, taking full advantage of the subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised learning, unsupervised learning, and transfer learning settings to demonstrate the superiority of our proposed approach.

Abstract: Continual Learning (CL) is a highly relevant setting gaining traction in recent machine learning research. Among CL works, architectural and hybrid strategies are particularly effective due to their potential to adapt the model architecture as new tasks are presented. However, many existing solutions do not efficiently exploit model sparsity, and are prone to capacity saturation due to their inefficient use of available weights, which limits the number of learnable tasks. In this paper, we propose TinySubNets (TSN), a novel architectural CL strategy that addresses the issues through the unique combination of pruning with different sparsity levels, adaptive quantization, and weight sharing. Pruning identifies a subset of weights that preserve model performance, making less relevant weights available for future tasks. Adaptive quantization allows a single weight to be separated into multiple parts which can be assigned to different tasks. Weight sharing between tasks boosts the exploitation of capacity and task similarity, allowing for the identification of a better tradeoff between model accuracy and capacity. These features allow TSN to efficiently leverage the available capacity, enhance knowledge transfer, and reduce computational resources consumption. Experimental results involving common benchmark CL datasets and scenarios show that our proposed strategy achieves better results in terms of accuracy than existing state-of-the-art CL strategies. Moreover, our strategy is shown to provide a significantly improved model capacity exploitation.

Abstract: Crossmodal hashing (CMH) has appeared as a popular technique for cross-modal retrieval due to its low storage cost and high computational efficiency in large-scale data. Most existing methods implicitly assume that multi-modal data is correctly labeled, which is expensive and even unattainable due to the inevitable imperfect annotations (i.e., noisy labels) in real-world scenarios. Inspired by human cognitive learning, a few methods introduce self-paced learning to gradually train the model from easy to hard samples, which is often used to mitigate the effects of feature noise or outliers. It is a less-touched problem that how to utilize SPL to alleviate the misleading of noisy labels on the hash model. To tackle this problem, we propose a new cognitive cross-modal retrieval method called Robust Self-paced Hashing with Noisy Labels (RSHNL), which can mimic the human cognitive process to identify the noise while embracing robustness against noisy labels. Specifically, we first propose a contrastive hashing learning (CHL) scheme to improve multi-modal consistency, thereby reducing the inherent semantic gap. Afterward, we propose center aggregation learning (CAL) to mitigate the intra-class variations. Finally, we propose Noise-tolerance Self-paced Hashing (NSH) that dynamically estimates the learning difficulty for each instance and distinguishes noisy labels through the difficulty level. For all estimated clean pairs, we further adopt a self-paced regularizer to gradually learn hash codes from easy to hard. Extensive experiments demonstrate that the proposed RSHNL performs remarkably well over the state-of-the-art CMH methods.

Abstract: The performance of offline reinforcement learning (RL) suffers from the limited size and quality of static datasets. Modelbased offline RL addresses this issue by generating synthetic samples through a dynamics model to enhance overall performance. To evaluate the reliability of the generated samples, uncertainty estimation methods are often employed. However, model ensemble, the most commonly used uncertainty estimation method, is not always the best choice. In this paper, we propose a Search-based Uncertainty estimation method for Model-based Offline RL (SUMO) as an alternative. SUMO characterizes the uncertainty of synthetic samples by measuring their cross entropy against the in-distribution dataset samples, and uses an efficient search-based method for implementation. In this way, SUMO can achieve trustworthy uncertainty estimation. We integrate SUMO into several model-based offline RL algorithms including MOPO and Adapted MOReL (AMOReL), and provide theoretical analysis for them. Extensive experimental results on D4RL datasets demonstrate that SUMO can provide accurate uncertainty estimation and boost the performance of base algorithms. These indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation.

Abstract: Reinforcement learning (RL) often encounters delayed and sparse feedback in realworld applications, even with only episodic rewards. Previous approaches have made some progress in reward redistribution for credit assignment but still face challenges, including training difficulties due to redundancy and ambiguous attributions stemming from overlooking the multifaceted nature of mission performance evaluation. Hopefully, Large Language Model (LLM) encompasses fruitful decision-making knowledge and provides a plausible tool for reward redistribution. Even so, deploying LLM in this case is non-trivial due to the misalignment between linguistic knowledge and the symbolic form requirement, together with inherent randomness and hallucinations in inference. To tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic-based decision-making framework, to improve credit assignment. Key to LaRe is the concept of the Latent Reward, which works as a multi-dimensional performance evaluation, enabling more interpretable goal attainment from various perspectives and facilitating more effective reward redistribution. We examine that semantically generated code from LLM can bridge linguistic knowledge and symbolic latent rewards, as it is executable for symbolic objects. Meanwhile, we design latent reward self-verification to increase the stability and reliability of LLM inference. Theoretically, reward-irrelevant redundancy elimination in the latent reward benefits RL performance from more accurate reward estimation. Extensive experimental results witness that LaRe (i) achieves superior temporal credit assignment to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground truth rewards for certain tasks.

Abstract: We study the problem of computing deterministic optimal policies for constrained Markov decision processes (MDPs) with continuous state and action spaces, which are widely encountered in constrained dynamical systems. Designing deterministic policy gradient methods in continuous state and action spaces is particularly challenging due to the lack of enumerable stateaction pairs and the adoption of deterministic policies, hindering the application of existing policy gradient methods for constrained MDPs. To this end, we develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence. Specifically, we leverage regularization of the Lagrangian of the constrained MDP to propose a deterministic policy gradient primal-dual (D-PGPD) algorithm that updates the deterministic policy via a quadratic-regularized gradient ascent step and the dual variable via a quadratic-regularized gradient descent step. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. We instantiate D-PGPD with function approximation and prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair, up to a function approximation error. Furthermore, we demonstrate the effectiveness of our method in two continuous control problems: robot navigation and fluid control. To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.

Abstract: Existing finetuning paradigms are predominantly characterized by Full Parameter Tuning (FPT) and Parameter-Efficient Tuning (PET). FPT fine-tunes all parameters of a pre-trained model on downstream tasks, whereas PET freezes the pre-trained model and employs only a minimal number of learnable parameters for fine-tuning. However, both approaches face issues of overfitting, especially in scenarios where downstream samples are limited. This issue has been thoroughly explored in FPT, but less so in PET. To this end, this paper investigates overfitting in PET, representing a pioneering study in the field. Specifically, across 19 image classification datasets, we employ three classic PET methods (e.g., VPT, Adapter/Adaptformer, and LoRA) and explore various regularization techniques to mitigate overfitting. Regrettably, the results suggest that existing regularization techniques are incompatible with the PET process and may even lead to performance degradation. Consequently, we introduce a new framework named TTE (Two Tokens are Enough), which effectively alleviates overfitting in PET through a novel constraint function based on the learnable tokens. Experiments conducted on 24 datasets across image and few-shot classification tasks demonstrate that our fine-tuning framework not only mitigates overfitting but also significantly enhances PET's performance. Notably, our TTE framework surpasses the highest-performing FPT framework (DR-Tune), utilizing significantly fewer parameters (0.15M vs. 85.84M) and achieving an improvement of 1%.

Abstract: Prototypebased classification learning methods are known to be inherently interpretable. However, this paradigm suffers from major limitations compared to deep models, such as lower performance. This led to the development of the so-called deep Prototype-Based Networks (PBNs), also known as prototypical parts models. In this work, we analyze these models with respect to different properties, including interpretability. In particular, we focus on the Classification-by-Components (CBC) approach, which uses a probabilistic model to ensure interpretability and can be used as a shallow or deep architecture. We show that this model has several shortcomings, like creating contradicting explanations. Based on these findings, we propose an extension of CBC that solves these issues. Moreover, we prove that this extension has robustness guarantees and derive a loss that optimizes robustness. Additionally, our analysis shows that most (deep) PBNs are related to (deep) RBF classifiers, which implies that our robustness guarantees generalize to shallow RBF classifiers. The empirical evaluation demonstrates that our deep PBN yields state-of-the-art classification accuracy on different benchmarks while resolving the interpretability shortcomings of other approaches. Further, our shallow PBN variant outperforms other shallow PBNs while being inherently interpretable and exhibiting provable robustness guarantees.

Abstract: Graduated optimization is a global optimization technique that is used to minimize a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the performance of the explicit graduated optimization algorithm with an optimal noise scheduling derived from a previous study and discusses its limitations. The evaluation uses traditional benchmark functions and empirical loss functions for modern neural network architectures. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum, analyzes its convergence, and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures.

Abstract: Optical neural networks (ONNs) have attracted great attention due to their low power consumption and highspeed processing. When training an ONN implemented on a chip with possible fabrication variations, the well-known backpropagation algorithm cannot be executed accurately because the perfect information inside the chip cannot be observed. Instead, we employ a black-box optimization method such as zeroth-order (ZO) optimization. In this paper, we first discuss how ONN parameters should be perturbed to search for better values in a black-box manner. Conventionally, parameter perturbations are sampled from a normal distribution with an identity covariance matrix. This is plausible if the parameters are not interrelated in a module, like a linear module of an ordinary neural network. However, this is not the best way for ONN modules with layered parameters, which are interrelated by optical paths. We then propose to perturb the parameters by a normal distribution with a special covariance matrix computed by our novel method. The covariance matrix is designed so that the perturbations appearing at the module output caused by the parameter perturbations become as isotropic as possible to uniformly search for better values. Experimental results show that the proposed method using the special covariance matrix significantly outperformed conventional methods.

Abstract: Scorebased diffusion models have emerged as effective approaches for both conditional and unconditional generation. Still conditional generation is based on either a specific training of a conditional model or classifier guidance, which requires training a noise-dependent classifier, even when a classifier for uncorrupted data is given. We propose a method that, given a pre-trained unconditional score-based generative model, samples from the conditional distribution under arbitrary logical constraints, without requiring additional training. Differently from other zero-shot techniques, that rather aim at generating valid conditional samples, our method is designed for approximating the true conditional distribution. Firstly, we show how to manipulate the learned score in order to sample from an un-normalized distribution conditional on a user-defined constraint. Then, we define a flexible and numerically stable neuro-symbolic framework for encoding soft logical constraints. Combining these two ingredients we obtain a general, but approximate, conditional sampling algorithm. We further developed effective heuristics aimed at improving the approximation. Finally, we show the effectiveness of our approach in approximating conditional distributions for various types of constraints and data: tabular data, images and time series.

Abstract: Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. The promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large amount of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and this similarity can be linearly approximated using the inputs. To verify our demonstrations, we propose the **LazyDiT**, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, effectively trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency.

University of Science and Technology of China NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences, NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, University of Science and Technology of China, Nanjing University NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences

Abstract: Model adaptation tackles the distribution shift problem with a pretrained model instead of raw data, which has become a popular paradigm due to its great privacy protection. Existing methods always assume adapting to a clean target domain, overlooking the security risks of unlabeled samples. This paper for the first time explores the potential trojan attacks on model adaptation launched by well-designed poisoning target data. Concretely, we provide two trigger patterns with two poisoning strategies for different prior knowledge owned by attackers. These attacks achieve a high success rate while maintaining the normal performance on clean samples in the test stage. To defend against such backdoor injection, we propose a plug-and-play method named DiffAdapt, which can be seamlessly integrated with existing adaptation algorithms. Experiments across commonly used benchmarks and adaptation methods demonstrate the effectiveness of DiffAdapt. We hope this work will shed light on the safety of transfer learning with unlabeled data.

University of Science and Technology of China, School of Computer Science and Technology, Hefei, China University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China Laboratory for Advanced Computing and Intelligence Engineering, Wuxi, China University of Science and Technology of China, Hefei National Laboratory, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China University of Science and Technology of China, Hefei National Laboratory, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China University of Science and Technology of China, Hefei National Laboratory, Hefei, China, University of Science and Technology of China, School of Computer Science and Technology, Hefei, China University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China

Abstract: Enhancing defense through model ensemble is an emerging trend, where the challenge lies in how to use ensemble knowledge to counter Outof-Distribution (OOD) attacks. In this paper, we propose the Reliable Defense Ensemble (REE) to address this issue. REE optimizes the ensemble knowledge of models through aggregation and enhances multidimensional robust performance through collaboration. It employs the Dynamic Synergy Amplification for weight allocation and strategy adjustment. Furthermore, we design a new Kernel Anomaly Smoothing Detection Module, which detects anomalous attacks using a smoothing feature function based on Gaussian kernel mean embedding and a multi-layer feedback structure. Particularly, we build a framework that uses reinforcement learning to iteratively fine-tune the parameters of inter-model communication and consensus. Extensive experimental results show that REE outperforms current state-of-the-art methods by a large margin in defending against OOD attacks.

Abstract: Recently segment anything model (SAM) has shown powerful segmentation capability and has drawn great attention in computer vision fields. Massive following works have developed various applications based on the pretrained SAM and achieved impressive performance on downstream vision tasks. However, SAM consists of heavy architectures and requires massive computational capacity, which hinders the further application of SAM on computation constrained edge devices. To this end, in this paper we propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. We first propose a full-stage knowledge distillation method with hard prompt sampling and hard mask weighting strategy to distill a lightweight student model. We also adapt the post-training quantization to the prompt-based segmentation task and further reduce the computational cost. Moreover, a hierarchical segmenting everything strategy is proposed to accelerate the everything inference by 2× with almost no performance degradation. With all these proposed methods, our TinySAM leads to orders of magnitude computational reduction and pushes the envelope for efficient segment anything task. Extensive experiments on various zero-shot transfer tasks demonstrate the significantly advantageous performance of our TinySAM against counterpart methods.

Abstract: In a surge of textto-image (T2I) models and their customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by a lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new context. However, the existing methods often 1) generate images with the same pose as an input image, and 2) exhibit deterioration in the subject's identity when facing a pose variation prompt. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the pose indication in the textual embedding. Conversely, the textual embedding also harms the subject's identity which is tightly entangled with the pose in the visual embedding. As a remedy, we propose text-orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our method is both effective and robust, offering highly flexible zero-shot generation while effectively maintaining the subject's identity.

Abstract: Decision trees are a classic model for summarizing and classifying data. To enhance interpretability and generalization properties, it has been proposed to favor small decision trees. Accordingly, in the minimumsize decision tree training problem (MSDT), the input is a set of training examples in $\mathbb{R}^d$ with class labels and we aim to find a decision tree that classifies all training examples correctly and has a minimum number of nodes. MSDT is NP-hard and therefore presumably not solvable in polynomial time. Nevertheless, a promising algorithmic paradigm called witness trees which solves MSDT efficiently if the solution tree is small has been developed. In this work, we test this paradigm empirically. We provide an implementation, augment it with extensive heuristic improvements, and scrutinize it on standard benchmark instances. The augmentations achieve a mean 324-fold (median 84-fold) speedup over the naive implementation. Compared to the state of the art they achieve a mean 32-fold (median 7-fold) speedup over the dynamic programming based MurTree solver and a mean 61-fold (median 25-fold) speedup over SAT-based implementations. As a theoretical result we obtain an improved worst-case running-time bound for MSDT.

Abstract: The infrequent occurrence of overfitting in deep neural networks is perplexing: contrary to theoretical expectations, increasing model size often enhances performance in practice. But what if overfitting does occur, though restricted to specific subregions of the data space? In this work, we propose a novel score that captures the forgetting rate of deep models on validation data. We posit that this score quantifies local overfitting: a decline in performance confined to certain regions of the data space. We then show empirically that local overfitting occurs regardless of the presence of traditional overfitting. Using the framework of deep over-parametrized linear models, we offer a certain theoretical characterization of forgotten knowledge, and show that it correlates with knowledge forgotten by real deep models. Finally, we devise a new ensemble method that aims to recover forgotten knowledge, relying solely on the training history of a single network. When combined with knowledge distillation, this method will enhance the performance of a trained model without adding inference costs. Extensive empirical evaluations demonstrate the efficacy of our method across multiple datasets, contemporary neural network architectures, and training protocols.

Abstract: In recent years, the application of transformerbased models in time-series forecasting has received significant attention. While often demonstrating promising results, the transformer architecture encounters challenges in fully exploiting the temporal relations within time series data due to its attention mechanism. In this work, we design eXponential Patch (xPatch for short), a novel dual-stream architecture that utilizes exponential decomposition. Inspired by the classical exponential smoothing approaches, xPatch introduces the innovative seasonal-trend exponential decomposition module. Additionally, we propose a dual-flow architecture that consists of an MLP-based linear stream and a CNN-based non-linear stream. This model investigates the benefits of employing patching and channel-independence techniques within a non-transformer model. Finally, we develop a robust arctangent loss function and a sigmoid learning rate adjustment scheme, which prevent overfitting and boost forecasting performance.

Abstract: Temporal Graph Neural Networks (TGNNs) are a family of graph neural networks designed to model and learn dynamic information from temporal graphs. Given their substantial empirical success, there is an escalating interest in TGNNs within the research community. However, the majority of these efforts have been channelled towards algorithm and system design, with the evaluation metrics receiving comparatively less attention. Effective evaluation metrics are crucial for providing detailed performance insights, particularly in the temporal domain. This paper investigates the commonly used evaluation metrics for TGNNs and illustrates the failure mechanisms of these metrics in capturing essential temporal structures in the predictive behaviour of TGNNs. We provide a mathematical formulation of existing performance metrics and utilize an instancebased study to underscore their inadequacies in identifying volatility clustering (the occurrence of emerging errors within a brief interval). This phenomenon has profound implications for both algorithm and system design in the temporal domain. To address this deficiency, we introduce a new volatility-aware evaluation metric (termed volatility cluster statistics), designed for a more refined analysis of model temporal performance. Additionally, we demonstrate how this metric can serve as a temporal-volatility-aware training objective to alleviate the clustering of temporal errors. Through comprehensive experiments on various TGNN models, we validate our analysis and the proposed approach. The empirical results offer revealing insights: 1) existing TGNNs are prone to making errors with volatility clustering, and 2) TGNNs with different mechanisms to capture temporal information exhibit distinct volatility clustering patterns. Moreover, our empirical findings demonstrate that our proposed training objective effectively reduces volatility clusters in error.

Abstract: Federated MultiView Clustering (FMVC) aims to learn a global clustering model from heterogeneous data distributed across different devices, where each device only stores one view of all clustering samples. The key to deal with such problem lies in how to effectively fuse these heterogeneous samples while strictly preserve the data privacy across multiple devices. In this paper, we propose a novel structural graph learning framework named MGCD, which leverages both consistency and diversity of multi-view graph structure across global view-fusion server and local view-specific clients to achieve desired clustering while better preserves data privacy. Specifically, in each local client, we design a dual autoencoder to extract the latent consensuses and specificities of each view, where self-representation construction is introduced to generate the corresponding view-specific diversity graph. In the global server, the consistency implied in uploaded diversity graphs are further distilled and then incorporated into the consistency graph for subsequent cross-view contrastive fusion. During the training process, the server generates a global consistency graph and distributes it to each client for assisting in diversity graph construction, while the clients extract view-specific information and upload it to the server for more reliable consistency graph generation. The ``server-client'' interaction is conducted in an iterative manner, where the consistency implied in each local client is gradually aggregated into the global consistency graph, and the final clustering results are obtained by spectral clustering on the desired global consistency graph. Extensive experiments on various datasets have demonstrated the effectiveness of our proposed method on clustering federated multi-view data.

Abstract: In this work, we consider an online robust Markov Decision Process (MDP) where we have the information of finitely many prototypes of the underlying transition kernel. We consider an adaptively updated ambiguity set of the prototypes and propose an algorithm that efficiently identifies the true underlying transition kernel while guaranteeing the performance of the corresponding robust policy. To be more specific, we provide a sublinear regret of the subsequent optimal robust policy. We also provide an early stopping mechanism and a worstcase performance bound of the value function. In numerical experiments, we demonstrate that our method outperforms existing approaches, particularly in the early stage with limited data. This work contributes to robust MDPs by considering possible prior information about the underlying transition probability and online learning, offering both theoretical insights and practical algorithms for improved decision-making under uncertainty.

Abstract: The diversity of time series applications and scarcity of domainspecific data highlight the need for time-series models with strong few-shot learning capabilities. In this work, we propose a novel training scheme and a transformer-based architecture, collectively referred to as TimePFN, for multivariate time-series (MTS) forecasting. TimePFN is based on the concept of Prior-data Fitted Networks (PFN), which aims to approximate Bayesian inference. Our approach consists of (1) generating synthetic MTS data through diverse Gaussian process kernels and the linear coregionalization method, and (2) a novel MTS architecture capable of utilizing both temporal and cross-channel dependencies across all input patches. We evaluate TimePFN on several benchmark datasets and demonstrate that it outperforms the existing state-of-the-art models for MTS forecasting in both zero-shot and few-shot settings. Notably, fine-tuning TimePFN with as few as 500 data points nearly matches full dataset training error, and even 50 data points yield competitive results. We also find that TimePFN exhibits strong univariate forecasting performance, attesting to its generalization ability. Overall, this work unlocks the power of synthetic data priors for MTS forecasting and facilitates strong zero- and few-shot forecasting performance.

Abstract: Imperceptible adversarial attacks on 3D point clouds rely on effective constraints. While manifold constraints have notable advantages over Euclidean ones, the global parameterization used in current methods often fails to fully preserve manifold properties. In this paper, we propose to constrain latticebased barycentric coordinates during attacks from a local parametric perspective to ensure imperceptibility. Specifically, we utilize a permutohedral lattice to partition point clouds into multiple cells, and then extract barycentric coordinates for each point within these cells, forming a local parametric representation of the point clouds. By enforcing local parametric constraints that minimize the displacement of barycentric coordinates, we largely preserve the manifold properties, ultimately leading to improved imperceptibility. Extensive experiments validate that integrating these local parametric constraints into conventional adversarial attacks yields superior imperceptibility, outperforming state-of-the-art methods.

Abstract: Diffusionbased models have been recently shown to be high-quality data generators. However, their performance severely degrades when training on non-stationary changing data distributions in an online manner, due to the catastrophic forgetting. In this paper, we propose enabling the diffusion model with a novel Dynamic Expansion Memory Unit (DEMU) methodology that adaptively creates new memory buffers, to be added to a memory system, in order to preserve information deemed critical for training the model. Having a selective memory unit is essential for training diffusion networks, which are expensive to train, especially when deployed in resource-constrained environments. A Maximum Mean Discrepancy (MMD) based expansion mechanism, that evaluates probabilistic distances between each of the previously defined memory buffers and the newly given data, and uses them as expansion signals, is employed for ensuring the diversity of information learning. We propose a new model expansion mechanism to automatically add new diffusion models as experts in a mixture system, which enhances the multi-domain image generation performance. Also a novel memory compaction approach is proposed to automatically remove statistically overlapping memory units, through a graph relationship evaluation, preventing the limitless expansion of DEMU. Comprehensive results show that the proposed approach performs better than the state-of-the-art.

Abstract: Distributed machine learning (DML) is promising for training large models on large datasets. In DML, multiple workers collaborate on the training of neural networks, significantly reducing the time required for neural network training. The efficiency of DML is heavily influenced by communication, making it crucial to balance the tradeoff between communication cost and model performance in current research. Local methods are excellent at reducing communication costs, yet face degradation in accuracy and generalizability. Indeed, global knowledge is valuable for improving performance in local methods. However, the theoretical analysis of global knowledge validity is lacking, and global knowledge can currently only be used in the global aggregation of local methods due to communication limitations and staleness. To this end, in this paper, we establish the mechanism of global knowledge guidance and propose Adaptive Global Knowledge Guided Distributed Stochastic Gradient Descent (AdaGK-SGD) to extend the guidance of global knowledge to the whole distributed training process without any additional communication. Specifically, we define the maximum lifetime of global knowledge based on the mechanism, and establish a correlation between the maximum lifetime and the validity of global knowledge to circumvent the adverse effects of global knowledge staleness. The Maximum Lifetime of Global Knowledge module of our algorithm can be applied separately to other algorithms. In addition, considering the application, we provide a straightforward and efficient strategy for achieving the maximum lifetime adaptive setting. We establish the convergence rate of AdaGK-SGD for convex and non-convex scenarios. Numerically, we find that AdaGK-SGD can significantly improve the accuracy and generalizability of distributed algorithms compared with existing methods.

Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China

Abstract: VisionLanguage models (VLMs) have shown great potential in enhancing open-world visual concept comprehension. Recent researches focus on an optimum multimodal collaboration strategy that significantly advances CLIP-based few-shot tasks. However, existing prompt-based solutions suffer from unidirectional information flow and increased parameters since they explicitly condition the vision prompts on textual prompts across different transformer layers using non-shareable coupling functions. To address this issue, we propose a Dual-shared mechanism based on LoRA (DsRA) that addresses VLM adaptation in low-data regimes. The proposed DsRA enjoys several merits. First, we design an inter-modal shared coefficient that focuses on capturing visual and textual shared patterns, ensuring effective mutual synergy between image and text features. Second, an intra-modal shared matrix is proposed to achieve efficient parameter fine-tuning by combining the different coefficients to generate layer-wise adapters placed in encoder layers. Our extensive experiments demonstrate that DsRA improves the generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. Our code will be released soon.

Abstract: In this work, we propose a novel activation mechanism called LayerAct for CNNs. This approach is motivated by our theoretical and experimental analyses, which demonstrate that Layer Normalization (LN) can mitigate a limitation of existing activation functions regarding noise robustness. However, LN is known to be disadvantageous in CNNs due to its tendency to make activation outputs homogeneous. The proposed method is designed to be more robust than existing activation functions by reducing the upper bound of influence caused by input shifts without inheriting LN's limitation. We provide analyses and experiments showing that LayerAct functions exhibit superior robustness compared to ElementAct functions. Experimental results on three clean and noisy benchmark datasets for image classification tasks indicate that LayerAct functions outperform other activation functions in handling noisy datasets while achieving superior performance on clean datasets in most cases.

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Hefei National Laboratory, University of Science and Technology of China, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Hefei National Laboratory, University of Science and Technology of China, Hefei, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China Laboratory for Advanced Computing and Intelligence Engineering, Wuxi, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Hefei National Laboratory, University of Science and Technology of China, Hefei, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China

Abstract: In the context of Continual Semantic Segmentation (CSS), replaybased methods tend to achieve better performance than knowledge distillation-based ones, as the former utilizes additional data to transfer old knowledge. However, this advantage is at the cost of necessitating additional space for storing the generative model and extra time for continual training. To address this predicament, we propose a novel CSS framework, namely Adversarial Attack-based Knowledge Retention (AAKR). The AKKR framework generates specific adversarial samples by adding images, and uses them to retain old knowledge. Specifically, we leverage adversarial attacks to generate adversarial images for incremental samples. By imposing additional constraints within these attacks, we enhance the transfer of old knowledge, thereby reinforcing the understanding of previously learned information. Furthermore, we design an attack probability module that adjusts adversarial attack directions based on training feedback. This module effectively encourages the new model to learn old knowledge from poorly protected classes, significantly improving knowledge transfer effectiveness. Our comprehensive experiments demonstrate the efficacy of AAKR, and showcase that AAKR surpasses state-of-the-art competitors on benchmark datasets.

Abstract: Unsupervised domain adaptation (UDA) has emerged as a promising technique for transferring knowledge from a labeled domain to an unlabeled domain. However, existing UDA methods are severely constrained by data privacy and semantic inconsistencies. To alleviate these limitations, this work challenges the SourceFree Open-Set Domain Adaptation (SF-OSDA), where the pre-trained source model is directly leveraged on the open target domain for adaptation. For this purpose, we introduce the novel Dynamic Target Distribution Estimation (DTDE) method, which effectively performs known classification and unknown separation through self-supervised learning with prototypes. To construct known prototypes, a self-adaptive sampling strategy is employed to consider the category disparity. For unknown prototypes, we utilize a self-splitting and excluding principle to bypass the unknown semantics problem. Specifically, self-splitting is to evaluate the overall clustering distribution of the target domain. By excluding clusters resembling known prototypes, the remaining cluster centroids can serve as unknown prototypes. The superiority of our approach is validated across multiple benchmarks. Remarkably, DTDE outperforms the best competitor by 7.6% on the VisDA dataset.

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China;, School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China;

Abstract: To extract spatial information, depth estimation using conventional echobased methods typically employs models with encoder-decoder architectures, such as UNet. However, these methods may face challenges in extracting fine details from echo waveforms and handling multi-scale feature extraction with high precision. To address these challenges, we introduce EchoDiffusion, a framework that incorporates diffusion models conditioned on waveform embeddings for echo-based depth estimation. This framework employs the Multi-Scale Adaptive Latent Feature Network (MALF-Net) to extract multi-scale spatial features and perform adaptive fusion, encoding the echo spectrograms into the latent space. Additionally, we propose the Echo Waveform Detail Embedder (EWDE), which leverages a pre-trained Wav2Vec model to extract detailed spatial information from echo waveforms, using these details as conditional inputs to guide the reverse diffusion process in the latent space. By embedding the echo waveforms into the reverse diffusion process, we can more accurately guide the generation of depth maps. Our extensive evaluations on the Replica and Matterport3D datasets demonstrate that EchoDiffusion establishes new benchmarks for state-of-the-art performance in echo-based depth estimation.

Abstract: Online to batch conversion involves constructing a new batch learner by utilizing a series of models generated by an existing online learning algorithm, for achieving generalization guarantees under i.i.d assumption. However, when applied to realworld streaming applications such as streaming recommender systems, the data stream may be sampled from time-varying distributions instead of persistently being i.i.d. This poses a challenge in terms of out-of-distribution (OOD) generalization. Existing approaches employ fixed conversion mechanisms that are unable to adapt to novel testing distributions, hindering the testing accuracy of the batch learner. To address these issues, we propose AdaO2B, an adaptive online to batch conversion approach under the bandit setting. AdaO2B is designed to be aware of the distribution shifts in the testing data and achieves OOD generalization guarantees. Specifically, AdaO2B can dynamically combine the sequence of models learned by a contextual bandit algorithm and determine appropriate combination weights using a context-aware weighting function. This innovative approach allows for the conversion of a sequence of models into a batch learner that facilitates OOD generalization. Theoretical analysis provides justification for why and how the learned adaptive batch learner can achieve OOD generalization error guarantees. Experimental results have demonstrated that AdaO2B significantly outperforms state-of-the-art baselines on both synthetic and real-world recommendation datasets.

University of Science and Technology of China City University of Hong Kong, University of Science and Technology of China Suzhou Institute for Advanced Research, University of Science and Technology of China, The Hong Kong University of Science and Technology (Guangzhou), University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China Suzhou Institute for Advanced Research, University of Science and Technology of China Key Laboratory of Precision and Intelligent Chemistry, USTC

Abstract: Time series forecasting plays a crucial role in domains such as finance, healthcare, and climate science. However, as modern time series data become increasingly complex, featuring high dimensionality, intricate spatiotemporal dependencies, and multiscale evolutionary patterns, traditional analytical methods and existing predictive models face significant challenges. Although Large Language Models (LLMs) excel in capturing long-range dependencies, they still struggle with multi-scale dynamics and seasonal patterns. Moreover, while LLMs' semantic representation capabilities are rich, they often lack explicit alignment with the numerical patterns and temporal structures of time series data, leading to limitations in predictive accuracy and interpretability. To address these challenges, this paper proposes a novel framework, STEM-LTS (Semantic-TEmporal Modeling for Large-scale Time Series). STEM-LTS enhances the ability to capture complex spatiotemporal dependencies by integrating time series decomposition techniques with LLM-based modeling. The semantic-temporal alignment mechanism within the framework significantly improves LLMs' ability to interpret and forecast time series data. Additionally, we develop an adaptive multi-task learning strategy to optimize the model's performance across multiple dimensions. Through extensive experiments on various real-world datasets, we demonstrate that STEM-LTS achieves significant improvements in prediction accuracy, robustness to noise, and interpretability. Our work not only advances LLM-based time series analysis but also offers new perspectives on handling complex temporal data.

Abstract: Negative transfer (NF) is a critical challenge in personalized federated learning (pFL). Existing methods primarily focus on adapting local data distribution on the client side, which can only resist NF, rather than avoid NF itself. To tackle NF at its root, we investigate its mechanism through the lens of the global model, and argue that it is caused by update conflicts among clients during server aggregation. In light of this, we propose a conflictfree client update aggregation strategy (ConFREE), which enables us to avoid NF in pFL. Specifically, ConFREE guides the global update direction by constructing a conflict-free guidance vector through projection and utilizes the optimal local improvements of the worst-performing clients near the guidance vector to regularize server aggregation. This prevents the conflicting components of updates from transferring, achieving balanced updates across different clients. Notably, ConFREE is model-agnostic and can be straightforwardly adopted as a complement to enhance various existing NF-resistance methods implemented on the client side. Extensive experiments demonstrate substantial improvements to existing pFL algorithms by leveraging ConFREE.

Abstract: Diabetic foot neuropathy (DFN) is a critical factor leading to diabetic foot ulcers, which is one of the most common and severe complications of diabetes mellitus (DM) and is associated with high risks of amputation and mortality. Despite its significance, existing datasets do not directly derive from plantar data and lack continuous, longterm foot-specific information. To advance DFN research, we have collected a novel dataset comprising continuous plantar pressure data to recognize diabetic foot neuropathy. This dataset includes data from 94 DM patients with DFN and 41 DM patients without DFN. Moreover, traditional methods divide datasets by individuals, potentially leading to significant domain discrepancies in some feature spaces due to the absence of mid-domain data. In this paper, we propose an effective domain adaptation method to address this proplem. We split the dataset based on convolutional feature statistics and select appropriate sub-source domains to enhance efficiency and avoid negative transfer. We then align the distributions of each source and target domain pair in specific feature spaces to minimize the domain gap. Comprehensive results validate the effectiveness of our method on both the newly proposed dataset for DFN recognition and an existing dataset.

Abstract: LowRank Adaptation (LoRA) has become increasingly popular for efficiently fine-tuning large language models (LLMs) with minimal resources. However, traditional methods that serve multiple LoRA models independently result in redundant computation and low GPU utilization. This paper addresses these inefficiencies by introducing Dynamic Operator Optimization (Dop), an advanced automated optimization technique designed to dynamically optimize the Segmented Gather Matrix-Vector Multiplication (SGMV) operator based on specific scenarios. SGMV's unique design enables batching GPU operations for different LoRA models, significantly improving computational efficiency. The Dop approach leverages a Search Space Constructor to create a hierarchical search space, dividing the program space into high-level structural sketches and low-level implementation details, ensuring diversity and flexibility in operator implementation. Furthermore, an Optimization Engine refines these implementations using evolutionary search, guided by a cost model that estimates program performance. This iterative optimization process ensures that SGMV implementations can dynamically adapt to different scenarios to maintain high performance. We demonstrate that Dop can improve throughput by 1.30-1.46 times in a SOTA multi-tenant LoRA serving.

Abstract: Compared to conventional longtail learning, which focuses on addressing class-wise imbalances, generalized long-tail (GLT) learning considers that samples within each class still conform to long-tailed distributions due to varying attributes, known as attribute imbalance. In the presence of such imbalance, the assumption of equivalence between the class-conditional probability densities of the training and testing sets is no longer tenable. Existing GLT approaches typically employ regularization techniques to avoid directly modeling the class-conditional probability density (CCPD) ratio between training and test data, leading to suboptimal performance. This study aims to directly estimate this ratio, for which a novel class-attribute aware logit-adjusted (CALA) loss incorporating both the CCPD ratio and the class priors is presented. Two new GLT learning methods, named Heuristic-CALA and Meta-CALA, are then proposed, which estimate the CCPD ratio in the CALA loss by leveraging the neighborhood information of samples. Extensive experiments across diverse scenarios susceptible to class and attribute imbalances showcase the state-of-the-art performance of Meta-CALA. Furthermore, while Heuristic-CALA exhibits inferior performance compared to Meta-CALA, it incurs only negligible additional training time compared to the Cross-Entropy loss, yet surpasses existing methods by a significant margin.

Zhejiang University, Zhejiang University, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Big Graph Center, Hangzhou City University State Key Laboratory of Blockchain and Data Security, Zhejiang University, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Nanyang Technological University

Abstract: Offline MultiAgent Reinforcement Learning (MARL) aims to learn optimal joint policies from pre-collected datasets without further interaction with the environment. Despite the encouraging results achieved so far, we identify the policy mismatch problem that arises from employing diverse offline MARL datasets, a highly important ingredient for cooperative generalization yet largely overlooked by existing literature. Specifically, in the case that offline datasets exhibit various optimal joint policies, policy mismatch often occurs when individual actions from different optimal joint actions are combined in a way that results in a suboptimal joint action. In this paper, we introduce a novel Cooperative Policy Agreement (CPA) method, that not only mitigates the policy mismatch problem but also learns to generate diverse joint policies. CPA firstly introduces an autoregressive decision-making mechanism among agents during offline training. This mechanism enables agents to access the actions previously taken by other agents, thereby facilitating effective joint policy matching. Moreover, diverse joint policies can be directly obtained through sequential action sampling from the autoregressive model. Then we further incorporate a policy agreement mechanism to convert these autoregressive joint policies into decentralized policies with a non-autoregressive form, while still ensuring the diversity of the generated policies. This mechanism guarantees that the proposed CPA adheres to the Centralized Training with Decentralized Execution (CTDE) constraint. Experiments conducted on various benchmarks demonstrate that CPA yields superior performance to state-of-the-art competitors.

Abstract: Offline blackbox optimization aims to identify the optimal solution of a black-box objective function under the guidance of a surrogate model constructed solely from a pre-collected dataset. It is commonly used in industrial scenarios, which often involve constraints, i.e., constrained offline optimization (COO). Offline optimization has progressed in addressing the out-of-distribution (OOD) issue caused by its inherent inability to interact with the objective function. However, there is not enough research in addressing more difficult scenarios, which must simultaneously address OOD issues and constrained issues to find stable, high-quality (i.e., high-scoring and feasible) solutions. To bridge this gap, this paper proposes a method called constrained offline optimization via risk evaluation and management (COOREM), which is capable of consistently surpassing the offline dataset under the condition of satisfying constraints. Specifically, COOREM employs a dual-energy model to separately evaluate OOD risk and constrained risk. This separation strategy aims to distinguish and address two difficult cases: the infeasible but not OOD solutions and the feasible but OOD solutions. Moreover, COOREM effectively manages OOD risk and constrained risk, ensuring the identification of high-quality solutions. Extensive experiments on real-world tasks, e.g., space missions, process synthesis, and design problems, showcase COOREM's effectiveness in managing both OOD risk and constrained risk. Furthermore, our findings indicate that COOREM could outperform online methods that need to access the objective function in certain space missions.

Abstract: Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust musicvideo correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present General Video-to-Music Generation model (GVMGen), designed for generating high-related music to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, music quality generative diversity, and application universality.

Abstract: The centralized training for decentralized execution paradigm emerged as the stateof-the-art approach to ϵ-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely the sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of the Bellman’s principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that ϵ-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against ϵ-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.

Abstract: Anytime multiagent path finding (MAPF) is a promising approach to scalable and collision-free path optimization in multi-agent systems. MAPF-LNS, based on Large Neighborhood Search (LNS), is the current state-of-the-art approach where a fast initial solution is iteratively optimized by destroying and repairing selected paths of the solution. Current MAPF-LNS variants commonly use an adaptive selection mechanism to choose among multiple destroy heuristics. However, to determine promising destroy heuristics, MAPF-LNS requires a considerable amount of exploration time. As common destroy heuristics are stationary, i.e., non-adaptive, any performance bottleneck caused by them cannot be overcome by adaptive heuristic selection alone, thus limiting the overall effectiveness of MAPF-LNS. In this paper, we propose Adaptive Delay-based Destroy-and-Repair Enhanced with Success-based Self-learning (ADDRESS) as a single-destroy-heuristic variant of MAPF-LNS. ADDRESS applies restricted Thompson Sampling to the top-K set of the most delayed agents to select a seed agent for adaptive LNS neighborhood generation. We evaluate ADDRESS in multiple maps from the MAPF benchmark set and demonstrate cost improvements by at least 50% in large-scale scenarios with up to a thousand agents, compared with the original MAPF-LNS and other state-of-the-art methods.

Abstract: LLMbased agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) task. However, existing LLM-based methods often focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for continuous VLN task. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards selected waypoints. We further propose a high-level PathAgent which marks planned paths into the image input and reasons the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR competing with supervised methods. We establish an effective connection between LLM and 3D world, presenting novel prospects for employing foundation models in low-level motion control.

Abstract: Large Language Models (LLMs) are prone to hallucination with nonfactual or unfaithful statements, which undermines the applications in real-world scenarios. Recent researches focus on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. Whereas, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucination that spans over multiple tokens and sentences in the passage. In this paper, we propose a method to enhance uncertainty modeling with semantic graph for hallucination detection. Specifically, we first construct a semantic graph that well captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to the conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of the sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements with 19.78% in passage-level hallucination detection.

Abstract: As the size of language models notably grows, finetuning the models becomes more challenging: fine-tuning with first-order optimizers (e.g., SGD and Adam) requires high memory consumption, while fine-tuning with a memory-efficient zeroth-order optimizer (MeZO) has a significant accuracy drop and slower convergence rate. In this work, we propose a Low order Hybrid Optimizer (LoHO) which merges zeroth-order (ZO) and first-order (FO) optimizers for fine-tuning. LoHO is empowered with inter-layer hybrid optimization and intra-layer hybrid optimization, which boosts the accuracy of MeZO while keeping memory usage within a budget. The inter-layer hybrid optimization exploits the FO optimizer in deep layers and the ZO optimizer in shallow ones, therefore avoiding unnecessary gradient propagation to improve memory efficiency. The intra-layer hybrid optimization updates a proportion of parameters in a layer by the ZO optimizer, and the rest by the FO optimizer, taking advantage of gradient sparsity for high efficiency implementation. Our experimental results across common datasets on different pre-trained backbones (i.e., RoBERTa-large, OPT-13B and OPT-30B) demonstrate that LoHO can significantly improve the predictive accuracy and convergence rate of MeZO, while controlling the memory footprint during fine-tuning. Moreover, LoHO can achieve comparable performance with first-order fine-tuning using substantially fewer memory resources.

Abstract: Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: spanbased methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective finetuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner.

School of Computer Science and Technology, Anhui University, Department of Automation, Tsinghua University, School of Computer Science and Technology, Anhui University, Department of Automation, Tsinghua University Beijing National Research Center for lnformation Science and Technology, Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences, Department of Automation, Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences, School of Computer Science and Technology, Anhui University, Institute of Automation, Chinese Academy of Sciences, School of Computer Science and Technology, Anhui University, Institute of Automation, Chinese Academy of Sciences

Abstract: Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of realworld deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure important neuron regions for real and fake audio detection, dividing them into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use sample proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the increase of redundant neurons from old tasks, we further introduce the Ebbinghaus forgetting mechanism to release them, thereby promoting the model’s ability to learn more generalized discriminative features. Experimental results show our method achieves a 21.3 percent improvement in EER over the state-of-the-art continual learning approach RWM for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond the audio deepfake detection domain, showing potential significance in other tasks, such as image recognition.

Abstract: In recent years, large language models (LLMs) have demonstrated outstanding capabilities in various tasks. However, LLMs also have various drawbacks, especially hallucination. Hallucination refers to the generation of content that does not align with the user input, contradicts previously generated content or world knowledge. Current research on hallucination mainly include knowledge retrieval, prompt engineering, training data improvement, reinforcement learning, etc. However, these methods do not involve different categories of hallucinations which is important on hallucination analysis, and make detailed investigation for the internal state of LLMs which indicates the direction on hallucination occurrence. Therefore, in our research, we introduce an attribution framework to trace the origins of hallucinations based on the internal signals of LLMs. To support this framework, we develop a new benchmark named RelQACate, which includes eight categories of hallucinations for the answers generated by LLMs. After that, we present a novel Differential Penalty Decoding (DPD) strategy for reducing hallucinations through adjusting post-probabilities of each answer. We conduct a series of experiments and the performance on answer reliability has significant improvement, achieving 28.25% at most, which demonstrates the effectiveness of our proposed DPD and its generalization in mitigating hallucination in LLMs.

Abstract: As Large Language Models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the SelfEvolving Adversarial Safety (SEAS) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS.

School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, Ant Group, Shanghai, China, School of Computer Science, Peking University, Beijing, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai, China

Abstract: The capability of the reward model (RM) is crucial for the success of Reinforcement Learning from Human Feedback (RLHF) in aligning with human preferences. However, as training progresses, the output space distribution of the policy model shifts. The RM, initially trained on responses sampled from the output distribution of the early policy model, gradually loses its ability to distinguish between responses from the newly shifted distribution. This issue is further compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside of that distribution. These two issues can be united as a challenge posed by the shifted distribution of the environment. To surmount this challenge, we introduce MetaRM, a novel method leveraging metalearning to adapt the RM to the shifted environment distribution. MetaRM optimizes the RM in an alternating way, by preserving both the preferences of the original preference pairs, as well as maximizing discrimination power over new examples of the shifted distribution. Extensive experiments demonstrate that MetaRM can iteratively enhance the performance of human preference alignment by improving the RM's capacity to identify subtle differences in samples of shifted distributions.

Abstract: Documentlevel relation extraction (DocRE) is the process of identifying and extracting relations between entities that span multiple sentences within a document. Due to its realistic settings, DocRE has garnered increasing research attention in recent years. Previous research has mostly focused on developing sophisticated encoding models to better capture the intricate patterns between entity pairs. While these advancements are undoubtedly crucial, an even more foundational challenge lies in the data itself. The complexity inherent in DocRE makes the labeling process prone to errors, compounded by the extreme sparsity of positive relation samples, which is driven by both the limited availability of positive instances and the broad diversity of positive relation types. These factors can lead to biased optimization processes, further complicating the task of accurate relation extraction. Recognizing these challenges, we have developed a robust framework called COMM to better solve DocRE. COMM operates by initially employing an instance-aware reasoning method to dynamically capture pertinent information of entity pairs within the document and extract relational features. Following this, COMM takes into account the distribution of relations and the difficulty of samples to dynamically adjust the margins between prediction logits and the decision threshold, a process we call Concentrated Margin Maximization. In this way, COMM not only enhances the extraction of relevant relational features but also boosts DocRE performance by addressing the specific challenges posed by the data. Extensive experiments and analysis demonstrate the versatility and effectiveness of COMM, especially its robustness when trained on low-quality data (achieves >10% performance gains).

Department of Computer Science and Technology, Tsinghua University, Institute for Network Sciences and Cyberspace, Tsinghua University, Institute for Advanced Study, BNRist, Tsinghua University, Carnegie Mellon University, Institute for Advanced Study, BNRist, Tsinghua University, Institute for Advanced Study, BNRist, Tsinghua University Zhongguancun Laboratory National Financial Cryptography Research Center, Institute for Advanced Study, BNRist, Tsinghua University Zhongguancun Laboratory National Financial Cryptography Research Center Shandong Institute of Blockchain, Institute for Advanced Study, BNRist, Tsinghua University Zhongguancun Laboratory National Financial Cryptography Research Center Shandong Institute of Blockchain School of Cyber Science and Technology, Shandong University

Abstract: Large VisionLanguage Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains adequately underexplored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Not merely to demonstrate the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of the semantic embeddings to uncover that the reason behind the success of FigStep is the deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate the superiority of FigStep, i.e., negligible attack costs and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques.

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China Key Laboratory of AI Safety, Chinese Academy of Sciences

Abstract: Simultaneous generation models write generation results while reading streaming inputs, necessitating a policymaker to determine the appropriate output timing. Existing simultaneous generation methods generally adopt the traditional encoder-decoder architecture and learn the generation and policy-making capabilities through complex dynamic programming techniques. Although LLMs excel at text generation, they face challenges in taking on the role of policy-makers through traditional training methods, limiting their exploration in simultaneous generation. To overcome these limitations, we propose a novel LLM-driven Simultaneous Generation (LSG) framework, which allows the off-the-shelf LLM to decide the generation timing and produce output concurrently. Specifically, LSG selects the generation policy that minimizes latency as the baseline policy. Referring to the baseline policy, LSG enables the LLM to devise an improved generation policy that better balances latency and generation quality, and writes generation results accordingly. Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method can achieve state-of-the-art performance utilizing the open-source LLMs and demonstrate practicality in real-world scenarios.

Abstract: Existing automatic prompt engineering methods are typically designed for discriminative tasks, where new task prompts are iteratively refined with limited feedback from a single metric reflecting a single aspect. However, these approaches are suboptimal for generative tasks, which require more nuanced guidance beyond a single numeric metric to improve the prompt and optimize multiple aspects of the generated text. To address these challenges, we propose a novel multiaspect Critique-Suggestion-guided automatic Prompt Optimization (CriSPO) approach. CriSPO introduces a critique-suggestion module as its core component. This module spontaneously discovers aspects, and compares generated and reference texts across these aspects, providing specific suggestions for prompt modification. These clear critiques and actionable suggestions guide a receptive optimizer module to make more substantial changes, exploring a broader and more effective search space. To further improve CriSPO with multi-metric optimization, we introduce an Automatic Suffix Tuning (AST) extension to enhance the performance of task prompts across multiple metrics. We evaluate CriSPO on 4 state-of-the-art Large Language Models (LLMs) across 4 summarization and 5 Question Answering (QA) datasets. Extensive experiments show 3-4% ROUGE score improvement on summarization and substantial improvement of various metrics on QA.

Abstract: Longcontext large language models (LLMs) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but rarely explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied in long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability for the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.

Abstract: To understand a document with multiple events, eventevent relation extraction (ERE) emerges as a crucial task, aiming to discern how natural events temporally or structurally associate with each other. To achieve this goal, our work addresses the problems of temporal event relation extraction (TRE) and subevent relation extraction (SRE). The latest methods for such problems have commonly built document-level event graphs for global reasoning across sentences. However, the edges between events are usually derived from external tools heuristically, which are not always reliable and may introduce noise. Moreover, they are not capable of preserving logical constraints among event relations, e.g., coreference constraint, symmetry constraint and conjunction constraint. These constraints guarantee coherence between different relation types, enabling the generation of a unified event evolution graph. In this work, we propose a novel method named LogicERE, which performs high-order event relation reasoning through modeling logic constraints. Specifically, different from conventional event graphs, we design a logic constraint induced graph (LCG) without any external tools. LCG involves event nodes where the interactions among them can model the coreference constraint, and event pairs nodes where the interactions among them can retain the symmetry constraint and conjunction constraint. Then we perform high-order reasoning on LCG with relational graph transformer to obtain enhanced event and event pair embeddings. Finally, we further incorporate logic constraint information via a joint logic learning module. Extensive experiments demonstrate the effectiveness of the proposed method with state-of-the-art performance on benchmark datasets.

Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University, Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University, Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University, Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University, Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University

Abstract: The remarkable success of Large Language Models (LLMs) relies heavily on their substantial scale, which poses significant challenges during model deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using oneshot pruning methods. However, these methods often suffer from considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under fixed computational constraints. Additionally, a supplementary set of well-initialized parameters is integrated to further augment the model's efficacy. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, utilizing less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.

Abstract: Large language models have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of highquality and reasoning-focused training datasets. Addressing this challenge, we propose Key-PointDriven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K questionanswer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 72B range and best commercial models like GPT-4 across multiple math reasoning datasets.

Abstract: This paper presents VDAct, a dataset for a Videogrounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities that require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct displays a notably challenging characteristic due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.

Abstract: Solving tabular math word problems (TMWPs) has become a critical role in evaluating the mathematical reasoning ability of large language models (LLMs), where largescale TMWP samples are commonly required for fine-tuning. Since the collection of high-quality TMWP datasets is costly and time-consuming, recent research has concentrated on automatic TMWP generation. However, current generated samples usually suffer from issues of either correctness or diversity. In this paper, we propose a Template-driven LLM-paraphrased (TeLL) framework for generating high-quality TMWP samples with diverse backgrounds and accurate tables, questions, answers, and solutions. To this end, we first extract templates from existing real samples to generate initial problems, ensuring correctness. Then, we adopt an LLM to extend templates and paraphrase problems, obtaining diverse TMWP samples. Furthermore, we find the reasoning annotation is important for solving TMWPs. Therefore, we propose to enrich each solution with illustrative reasoning steps. Through the proposed framework, we construct a high-quality dataset TabMWP-TeLL by adhering to the question types in the TabMWP dataset, and we conduct extensive experiments on a variety of LLMs to demonstrate the effectiveness of TabMWP-TeLL in improving TMWP-solving performance.

Abstract: Emotion recognition in conversation (ERC) has been promoted with diverse approaches in the recent years. However, many studies have pointed out that emotion shift and confusing labels make it difficult for models to distinguish between different emotions. Existing ERC models suffer from these problems when the emotions are forced to be mapped into single label. In this paper, we utilize our strategies for extending single label to multilabels. We then propose a multi-label classification framework for emotion recognition in conversation (ML-ERC). Specifically, we introduce weighted supervised contrastive learning tailored for multi-label, which can easily applied to previous ERC models. The empirical results on existing task with single label support the efficacy of our approach, which is more effective in the most challenging settings: emotion shift or confusing labels. We also evaluate ML-ERC with the multi-labels we produced to support our contrastive learning scheme.

Abstract: Recent advancements have integrated Language Models (LMs) into a drug discovery pipeline. However, existing models mostly work with SMILES and SELFIES chemical string representations, which lack spatial features vital for drug discovery. Additionally, attempts to translate chemical 3D structures into text format encounter issues such as excessive length and insufficient atom connectivity information. To address these issues, we introduce nach0pc, a model combining domain-specific encoder and textual representation to handle spatial arrangement of atoms effectively. Our approach utilizes a molecular point cloud encoder for concise and order-invariant structure representation. We introduce a novel pre-training scheme for molecular point clouds to distillate the knowledge from spatial molecular structures datasets. After fine-tuning within both single-task and multi-task frameworks, nach0-pc demonstrates performance comparable with other diffusion models in terms of generated samples quality across several established spatial molecular generation tasks. Notably, our model is a multi-task approach, in contrast to diffusion models being limited to single tasks. Additionally, it is capable of processing point cloud-related data, which language models are not capable of handling due to memory limitations. These lead to our model having reduced training and inference time while maintaining on par performance.

Abstract: Large language models (LLMs) have revolutionized numerous fields of research, driving significant advancements in natural language processing, machine translation, and beyond. Although the extensive number of parameters contributes a lot to the great success, existing studies indicate that not all model parameters hold equal importance, which further leads to redundancy during the parameter update process. Recent works for reducing redundant parameter updates for LLMs either lack taskspecific data information, may leading to suboptimal model performance, or discard transformer components or insignificant parameters, limiting the model's scalability across different tasks and potentially compromising the LLM structure. To address these issues and further enhance the performance of LLMs, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters based on gradient information, which is specific to the target tasks. Specifically, after calculating gradients during back propagation, we measure their absolute values and mask those with small absolute values. Our empirical results in various training paradigms like SFT and DPO for various domains of tasks demonstrate that GMT not only preserves the original network structure but also enhances the potential performance of LLMs. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla training approach.

School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu 611130, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu 611130, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu 611130, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu 611130, China, Key Laboratory of Big Data Intelligent Computing, Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China, Key Laboratory of Big Data Intelligent Computing, Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China Engineering Research Center of Intelligent Finance, Ministry of Education, Chengdu 611130, China, National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing 401331, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China

Abstract: Open intent classification is critical for the development of dialogue systems, aiming to accurately classify known intents into their corresponding classes while identifying unknown intents. Prior boundarybased methods assumed known intents fit within compact spherical regions, focusing on coarse-grained representation and precise spherical decision boundaries. However, these assumptions are often violated in practical scenarios, making it difficult to distinguish known intent classes from unknowns using a single spherical boundary. To tackle these issues, we propose a Multi-granularity Open intent classification method via adaptive Granular-Ball decision boundary (MOGB). Our MOGB method consists of two modules: representation learning and decision boundary acquiring. To effectively represent the intent distribution, we design a hierarchical representation learning method. This involves iteratively alternating between adaptive granular-ball clustering and nearest sub-centroid classification to capture fine-grained semantic structures within known intent classes. Furthermore, multi-granularity decision boundaries are constructed for open intent classification by employing granular-balls with varying centroids and radii. Extensive experiments conducted on three public datasets demonstrate the effectiveness of our proposed method.

Abstract: Despite the rapid progress that existing automated feedback methods have made in correcting the output of large language models (LLMs), these methods cannot be well applied to the relation extraction (RE) task due to their designated feedback objectives and correction manner. To address this problem, we propose a novel automated feedback framework for RE, which presents a rationale supervisor to verify the rationale and provides reselected demonstrations as feedback to correct the initial prediction. Specifically, we first design a causal intervention and observation method to collect biased/unbiased rationales for contrastive training the rationale supervisor. Then, we present a verification-feedback-correction procedure to iteratively enhance LLMs' capability of handling the RE task. Extensive experiments prove that our proposed framework significantly outperforms existing methods.

Abstract: Retrievalaugmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task, which seamlessly integrates knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only helps retrieve documents from the knowledge base by producing identifier for each document, but it also answers visual questions based on the retrieved documents. Furthermore, we also propose a reinforced retrieval calibration module from relevance feedback to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.

Harbin Institute of Technology, Shenzhen, China Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China Research Centre on Data Science & Artificial Intelligence, Hong Kong, China, Harbin Institute of Technology, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, China

Abstract: Sexism affects both women and men, yet research often overlooks misandry and suffers from overly broad annotations that limit AI applications. To address this, we introduce BeyondGender, a dataset meticulously annotated according to the latest definitions of misogyny and misandry. It features innovative multifaceted labels encompassing aspects of sexism, gender, phrasing, misogyny, and misandry. The dataset includes 6K English and 1.7K Chinese sexism instances, alongside 13K nonsexism examples. Our evaluations of masked language models and large language models reveal that they detect misogyny in English and misandry in Chinese more effectively, with F1-scores of 0.87 and 0.62, respectively. However, they frequently misclassify hostile and mild comments, underscoring the complexity of sexism detection. Parallel corpus experiments suggest promising data augmentation strategies to enhance AI systems for nuanced sexism detection, and our dataset can be leveraged to improve value alignment in large language models.

Abstract: An essential component in Large Language Models (LLMs) is Rotary Position Encoding (RoPE) , which efficiently manages positional dependencies in longcontext modeling. However, when the number of input tokens surpasses the pretrained capacity of LLMs, their ability to process and generate text is markedly weakened. Although position interpolation techniques for RoPE can mitigate this issue, an increase in interpolations leads to a decrease in positional resolution. To tackle this challenge, drawing inspiration from the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D RoPE, with two major advantages for modeling long contexts: controllable long-term decay and improved position resolution. For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size, ensuring the modeling of relative positional information between tokens at a distant relative position. For improved position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE. We have conducted experiments on long-context Natural Language Understanding (NLU) and long sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks.

Abstract: In the era of costly pretraining of large language models, ensuring the intellectual property rights of model owners, and insuring that said models are responsibly deployed, is becoming increasingly important. To this end, we propose model watermarking via passthrough layers, which are added to existing pre-trained networks and trained using a self-supervised loss such that the model produces high-entropy output when prompted with a unique private key, and acts normally otherwise. Unlike existing model watermarking methods, our method is fully task-agnostic, and can be applied to both classification and sequence-to-sequence tasks without requiring advanced access to downstream fine-tuning datasets. We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally our watermarking method achieves a near-perfect watermark extraction accuracy and false-positive rate in most cases without damaging original model performance. Additionally, we show our method is robust to both downstream fine-tuning, fine-pruning, and layer removal attacks, and can be trained in a fraction of the time required to train the original model. Code is available in the paper.

Abstract: This paper considers the challenges Large Language Models (LLMs) face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. This type of reasoning is relevant to a variety of contexts ranging from everyday conversations to medical decisionmaking. Despite improvements in the mathematical reasoning capabilities of LLMs, they still exhibit significant difficulties when it comes to probabilistic reasoning. To deal with this problem, we introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, CAS Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, CAS University of Science and Technology of China, Shenzhen MSU-BIT University, Shenzhen MSU-BIT University, Harbin Institute of Technology (Shenzhen), Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, CAS Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University

Abstract: The success of Large Language Models (LLMs) relies heavily on the huge amount of pretraining data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under gray-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.

Abstract: Despite the great success of large language models (LLMs), efficiently controlling the length of the output sequence still remains a challenge. In this paper, we propose Hansel, an efficient framework for length control in LLMs without affecting its generation ability. Hansel utilizes periodically outputted hidden special tokens to keep track of the remaining target length of the output sequence. Together with techniques to avoid abrupt termination of the output, this seemingly simple method proved to be efficient and versatile, while not harming the coherency and fluency of the generated text. The framework can be applied to any pretrained LLMs during the finetuning stage of the model, regardless of its original positional encoding method. We demonstrate this by finetuning four different LLMs with Hansel and show that the mean absolute error of the output sequence decreases significantly in every model and dataset compared to the prompt-based length control finetuning. Moreover, the framework showed a substantially improved ability to extrapolate to target lengths unseen during finetuning, such as long dialog responses or extremely short summaries. This indicates that the model learns the general means of length control, rather than learning to match output lengths to those seen during training.

Abstract: Recent advancements in longcontext language modeling have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. To efficiently address this issue, we introduce the Structured Packing for Long Context, SPLiCe, a method that uses retrieval to collate mutually relevant documents into long training samples. We demonstrate that SPLiCe improves performance on long-context tasks, particularly by achieving perfect accuracy on the synthetic Needle in the Haystack benchmark, and effectively mitigating the ‘lost-in-the-middle’ phenomenon often observed in large language models. Notably, these long-context capabilities also extend to realistic downstream tasks, such as Qasper, across multiple model sizes—3B, 7B, and 13B—and are achieved with only brief fine-tuning on 2-6 billion tokens. We supplement these results with a detailed analysis of SPLiCe, examining the impact of hyperparameter choices, the different mixtures and proportions of SPLiCe-generated training data, and the choice of the retriever. We also study the transfer of long-context utilization skills between the modalities. An intriguing finding from our analysis is that training on a corpus of code can enhance performance on natural language tasks.

Abstract: Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs.

Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China, Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China, Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China College of Informatics, Huazhong Agricultural University, Wuhan, China

Abstract: The performance of various tasks of natural language processing has greatly improved with the emergence of large language models. However, there is still much room for improvement in understanding certain specific linguistic phenomena, such as Chinese idioms, which are usually composed of four characters. Chinese idioms are difficult to understand due to semantic gaps between their literal and actual meanings. Researchers have proposed the Chinese idiom reading comprehension task to examine the ability of large language models to represent and understand Chinese idioms. The task requires choosing the correct Chinese idiom from a list of candidates to complete the sentence. The current research mainly focuses on textbased idiom comprehension. Nevertheless, there are many idiom application scenarios that combine images and text, and we believe that the corresponding images are beneficial for the model's understanding of the idioms. Therefore, to address the above problems, we first construct a large-scale Multimodal Chinese Idiom Reading Comprehension dataset (MChIRC), which contains a total of 44,433 image-text pairs covering 2,926 idioms. Then, we propose a Dual-Contrastive Idiom Graph Network (DCIGN), which employs a dual-contrastive learning module to align the text and image features corresponding to the same Chinese idiom at both coarse and fine levels, while utilizing a graph structure to capture the semantic relationships between idiom candidates. Finally, we use a cross-attention module to fuse multimodal features with graph features of candidate idioms to predict correct answers. The authoritativeness of MChIRC and the effectiveness of DCIGN are demonstrated through a variety of experiments, which provides a new benchmark for the multimodal Chinese idiom reading comprehension task.

Abstract: Models trained on realworld data often mirror and exacerbate existing social biases. Traditional methods for mitigating these biases typically require prior knowledge of the specific biases to be addressed, and the social groups associated with each instance. In this paper, we introduce a novel adversarial training strategy that operates withour relying on prior bias-type knowledge (e.g., gender or racial bias) and protected attribute labels. Our approach dynamically identifies biases during model training by utilizing auxiliary bias detector. These detected biases are simultaneously mitigated through adversarial training. Crucially, we implement these bias detectors at various levels of the feature maps of the main model, enabling the detection of a broader and more nuanced range of bias features. Through experiments on racial and gender biases in sentiment and occupation classification tasks, our method effectively reduces social biases without the need for demographic annotations. Moreover, our approach not only matches but often surpasses the efficacy of methods that require detailed demographic insights, marking a significant advancement in bias mitigation techniques.

Abstract: Existing knowledge distillation (KD) studies for streaming automatic speech recognition (ASR) adopt a nonstreaming model as the teacher and a streaming model as the student, respectively. Since the non-streaming teacher usually has less emission latency compared to the streaming student, the teacher's prediction is typically shifted by $\tau$ frames, where the parameter $\tau$ is selected heuristically. In this paper, we observe that this manual shifting is sub-optimal and propose a novel framework, namely Heuristic-free KD. Instead of leveraging knowledge from the non-streaming teacher model, we employ a self-distillation setup, distilling the knowledge within the streaming architecture itself. Since the teacher and student share the same streaming ASR backbone, the alignment mismatch issue can be effectively mitigated without requiring any time shifting by $\tau$. Additionally, we incorporate full-context textual information as an auxiliary multi-modal input for the proposed teacher. Although the streaming architecture lacks future context, the additional linguistic input enables it to generate more accurate knowledge for self-distillation. We empirically demonstrate that the proposed KD approach significantly improves the performance of the streaming ASR model, outperforming conventional methods that rely on the offline teacher and heuristic parameter.

Abstract: Recent advancements in Large Language Models (LLMs) have led to significant breakthroughs in various natural language processing tasks. However, generating factually consistent responses in knowledgeintensive scenarios remains a challenge due to issues such as hallucination, difficulty in acquiring long-tailed knowledge, and limited memory expansion. This paper introduces SMART, a novel multi-agent framework that leverages external knowledge to enhance the interpretability and factual consistency of LLM-generated responses. SMART comprises four specialized agents, each performing a specific sub-trajectory action to navigate complex knowledge-intensive tasks. We propose a multi-agent co-training paradigm, Long-Short Trajectory Learning, which ensures synergistic collaboration among agents while maintaining fine-grained execution by each agent. Extensive experiments on five knowledge-intensive tasks demonstrate SMART's superior performance compared to widely adopted knowledge internalization and knowledge enhancement methods. Our framework can extend beyond knowledge-intensive tasks to more complex scenarios.

Abstract: In natural humanto-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user's follow-up utterances as feedback signals to assess whether it has appropriately addressed the user's request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, ``Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR's performance on reward modeling benchmarks and effectiveness in aligning the base policy model's helpfulness.

Abstract: Literature reviews play a crucial role in scientific research for understanding the current state of research, identifying gaps, and guiding future studies on specific topics. However, the process of conducting a comprehensive literature review is yet timeconsuming. This paper proposes a novel framework, collaborative knowledge minigraph agents (CKMAs), to automate scholarly literature reviews. A novel prompt-based algorithm, the knowledge minigraph construction agent (KMCA), is designed to identify relations between concepts from academic literature and automatically constructs knowledge minigraphs. By leveraging the capabilities of large language models on constructed knowledge minigraphs, the multiple path summarization agent (MPSA) efficiently organizes concepts and relations from different viewpoints to generate literature review paragraphs. We evaluate CKMAs on three benchmark datasets. Experimental results show the effectiveness of the proposed method, further revealing promising applications of LLMs in scientific research.

The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation，Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation，Chinese Academy of Sciences, Beijing, China Nanjing Artificial Intelligence Research of IA, Nanjing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation，Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: Prompt compression is increasingly studied for its potential to reduce computational costs and alleviate the burden on language models when processing lengthy prompts. Prior research has assessed token retention and removal by calculating information entropy. However, prompt compression encounters two significant challenges: (1) Information entropy, while widely used, may not be the optimal compression metric; and (2) The semantic significance of tokens is contextdependent, which renders independent token retention decisions inadequate. We posit that the solution to these challenges lies in the intrinsic mechanism of language models. Large language models (LLMs) exhibit robust contextual processing capabilities, with recent studies on their internal dynamics revealing that the attention mechanism plays a crucial role in modeling how LLMs leverage long contexts. Building on this insight, we introduce AttnComp, a novel approach that exploits the attention mechanism within language models to guide prompt compression. Our method employs causal cross-attention from the query to the context to evaluate the significance of each token, and we develop a graph-based algorithm to efficiently cluster tokens into semantic units, thus mitigating the issue of independent dependencies. We conduct experiments on datasets for retrieval-augmented generation and multiple long tasks involving single or multi-document QA. Our proposed method, AttnComp, outperforms previous baselines and validates the contributions of our components through analytical experiments. Compared to other methods that use a causal LM for prompt compression, our approach results in shorter latency and improved performance.

Abstract: In the rapidly evolving landscape of large language models (LLMs) for medical applications, ensuring the reliability and accuracy of these models in clinical settings is paramount. Existing benchmarks often focus on fixedformat tasks like multiple-choice QA, which fail to capture the complexity of real-world clinical diagnostics. Moreover, traditional evaluation metrics and LLM-based evaluators struggle with misalignment, often providing oversimplified assessments that do not adequately reflect human judgment. To address these challenges, we introduce HDCEval, a Hierarchical Divide-and-Conquer Evaluation framework tailored for fine-grained alignment in medical evaluation. HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors, encompassing Patient Question Relevance, Medical Knowledge Correctness, and Expression. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models trained through Attribute-Driven Token Optimization (ADTO) on a meticulously curated preference dataset. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.

Abstract: Multimodal Large Language Models (MLLMs) excel in solving textbased mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from textual to visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle aforementioned challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, consistent problem-solving abilities are ensured. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities.

Abstract: There is a growing trend toward AI systems interacting with humans to revolutionize a range of application domains such as healthcare and transportation. However, unsafe humanmachine interaction can lead to catastrophic failures. We propose a novel approach that predicts future states by accounting for the uncertainty of human interaction, monitors whether predictions satisfy or violate safety requirements, and adapts control actions based on the predictive monitoring results. Specifically, we develop a new quantitative predictive monitor based on Signal Temporal Logic with Uncertainty (STL-U) to compute a robustness degree interval, which indicates the extent to which a sequence of uncertain predictions satisfies or violates an STL-U requirement. We also develop a new loss function to guide the uncertainty calibration of Bayesian deep learning and a new adaptive control method, both of which leverage STL-U quantitative predictive monitoring results. We apply the proposed approach to two case studies: Type 1 Diabetes management and semi-autonomous driving. Experiments show that the proposed approach improves safety and effectiveness in both case studies.

Abstract: The rapid development of image generation models has facilitated the widespread dissemination of generated images on social networks, creating favorable conditions for provably secure image steganography. However, existing methods face issues such as low quality of generated images and lack of semantic control in the generation process. To leverage provably secure steganography with more effective and highperformance image generation models, and to ensure that stego images can accurately extract secret messages even after being uploaded to social networks and subjected to lossy processing such as JPEG compression, we propose a high-quality, provably secure, and robust image steganography method based on state-of-the-art autoregressive (AR) image generation models using Vector-Quantized (VQ) tokenizers. Additionally, we employ a cross-modal error-correction framework that generates stego text from stego images to aid in restoring lossy images, ultimately enabling the extraction of secret messages embedded within the images. Extensive experiments have demonstrated that the proposed method provides advantages in stego quality, embedding capacity, and robustness, while ensuring provable undetectability.

Abstract: Social norms are standards of behaviour common in a society. However, when agents make decisions without considering how others are impacted, norms can emerge that lead to the subjugation of certain agents. We present RAWL·E, a method to create ethical normlearning agents. RAWL·E agents operationalise maximin, a fairness principle from Rawlsian ethics, in their decision-making processes to promote ethical norms by balancing societal well-being with individual goals. We evaluate RAWL·E agents in simulated harvesting scenarios. We find that norms emerging in RAWL·E agent societies enhance social welfare, fairness, and robustness, and yield higher minimum experience compared to those that emerge in agent societies that do not implement Rawlsian ethics.

Abstract: Partially observable Markov decision processes (POMDPs) form a prominent model for uncertainty in sequential decision making. We are interested in constructing algorithms with theoretical guarantees to determine whether the agent has a strategy ensuring a given specification with probability 1. This wellstudied problem is known to be undecidable already for very simple omega-regular objectives, because of the difficulty of reasoning on uncertain events. We introduce a revelation mechanism which restricts information loss by requiring that almost surely the agent has eventually full information of the current state. Our main technical results are to construct exact algorithms for two classes of POMDPs called weakly and strongly revealing. Importantly, the decidable cases reduce to the analysis of a finite belief-support Markov decision process. This yields a conceptually simple and exact algorithm for a large class of POMDPs.

Abstract: Despite the widespread success of pattern database (PDB) heuristics in classical planning, to date there has been no application of PDBs to planning with numeric variables. In this paper we attempt to close this gap. We address optimal numeric planning involving conditions characterized by linear expressions and actions that modify numeric variables by constant quantities. Building upon prior research, we present an adaptation of PDB heuristics to numeric planning, introducing several approaches to deal with the unbounded nature of numeric variable projections. These approaches aim to restrict the initially infinite projections, thereby bounding the number of states and ultimately constraining the resulting PDBs. We show that the PDB heuristics obtained with our approach can provide strong guidance for the search.

Abstract: Assignment problems are a classic combinatorial optimization problem in which a group of agents must be assigned to a group of tasks such that maximum utility is achieved while satisfying assignment constraints. Given the utility of each agent completing each task, polynomialtime algorithms exist to solve a single assignment problem in its simplest form. However, in many modern-day applications such as satellite constellations, power grids, and mobile robot scheduling, assignment problems unfold over time, with the utility for a given assignment depending heavily on the state of the system. We apply multi-agent reinforcement learning to this problem, learning the value of assignments by bootstrapping from the known polynomial-time greedy solver and then learning from further experience. We then choose assignments using a distributed optimal assignment mechanism rather than by selecting them directly. We demonstrate that this algorithm is theoretically justified and avoids pitfalls experienced by other RL algorithms in this setting. Finally, we show that our algorithm significantly outperforms other methods in the literature, even while scaling to realistic scenarios with hundreds of agents and tasks.

Abstract: Large Language Models (LLMs) have shown promise in solving natural languagedescribed planning tasks, but their direct use often leads to inconsistent reasoning and hallucination. While hybrid LLM-symbolic planning pipelines have emerged as a more robust alternative, they typically require extensive expert intervention to refine and validate generated action schemas. It not only limits scalability but also introduces a potential for biased interpretation, as a single expert's interpretation of ambiguous natural language descriptions might not align with the user's actual intent. To address this, we propose a novel approach that constructs an action schema library to generate multiple candidates, accounting for the diverse possible interpretations of natural language descriptions. We further introduce a semantic validation and ranking module that automatically filter and rank these candidates without expert-in-the-loop. The experiments showed our pipeline maintains superiority in planning over the direct LLM planning approach. These findings demonstrate the feasibility of a fully automated end-to-end LLM-symbolic planner that requires no expert intervention, opening up the possibility for a broader audience to engage with AI planning with less prerequisite of domain expertise.

Abstract: The rapid advancement of autonomous web navigation has significantly benefited from grounding pretrained Large Language Models (LLMs) as agents. However, current research has yet to fully leverage the redundancy of HTML elements for contrastive training. This paper introduces a novel approach to LLMbased web navigation tasks, called Web Element Preference Optimization (WEPO). WEPO utilizes unsupervised preference learning by sampling distance-based non-salient web elements as negative samples, optimizing maximum likelihood objective within Direct Preference Optimization (DPO). We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns user high-level intent with output actions more effectively. The results show that our method achieved the state-of-the-art, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our findings underscore the potential of preference optimization to enhance web navigation and other web page based tasks, suggesting a promising direction for future research.

Abstract: GNNbased approaches for learning general policies across planning domains are limited by the expressive power of C2, namely; first-order logic with two variables and counting. This limitation can be overcomed by transitioning to k-GNNs, for k=3, wherein object embeddings are substituted with triplet embeddings. Yet, while 3-GNNs have the expressive power of C3, unlike 1- and 2-GNNs that are confined to C2, they require quartic time for message exchange and cubic space to store embeddings, rendering them infeasible. In this work, we introduce a parameterized version R-GNN[t] (with parameter t) of Relational GNNs. Unlike GNNs, that are designed to perform computation on graphs, Relational GNNs are designed to do computation on relational structures. When t=infty, R-GNN[t] approximates 3-GNNs over graphs, but using only quadratic space for embeddings. For lower values of t, such as t=1 and t=2, R-GNN[t] achieves a weaker approximation by exchanging fewer messages, yet interestingly, often yield the expressivity required in several planning domains. Furthermore, the new R-GNN[t] architecture is the original R-GNN architecture with a suitable transformation applied to the inputs only. Experimental results illustrate the clear performance gains of R-GNN[1] over the plain R-GNNs, and also over Edge Transformers that also approximate 3-GNNs.

Abstract: In the Clustered TSP (CTSP), we are given an edgeweighted graph satisfying the triangle inequality property, and a family of pairwise disjoint vertex groups. The goal is to find a minimum weight tour that includes all vertices, ensuring that the vertices within each group appear consecutively on the tour. The subgroup planning problem (SGPP) is an extension of CTSP by relaxing some triangle inequality requirements on edge weights. CTSP and SGPP have plentiful applications in AI and robotics. In this paper, we design three improved approximation algorithms for SGPP and CTSP. First, we propose a polynomial-time 2.167-approximation algorithm for SGPP, improving the previous ratio of 3 (IJCAI 2017). Second, we give an FPT 2.072-approximation algorithm for SGPP parameterized by the maximum group size, improving the previous ratio of 2.5 (IJCAI 2017). Third, we prove an FPT (β<1.5)-approximation algorithm for SGPP parameterized by the number of groups, which even improves the previous ratio 1.667 for CTSP (ORL 1999). We also conduct experiments to evaluate the performance of our algorithms.

Abstract: In adaptive systems, predictors are used to anticipate changes in the system’s state or behavior that may require system adaption, e.g., changing its configuration or adjusting resource allocation. Therefore, the quality of predictors is crucial for the overall reliability and performance of the system under control. This paper studies predictors in systems exhibiting probabilistic and nondeterministic behavior modelled as Markov decision processes (MDPs). Main contributions are the introduction of quantitative notions that measure the effectiveness of predictors in terms of their average capability to predict the occurrence of failures or other undesired system behaviors. The average is taken over all memoryless policies. We study two classes of such notions. One class is inspired by concepts that have been introduced in statistical analysis to explain the impact of features on the decisions of binary classifiers (such as precision, recall, f-score). Second, we study a measure that borrows ideas from recent work on probability-raising causality in MDPs and determines the quality of a predictor by the fraction of memoryless policies under which (the set of states in) the predictor is a probability-raising cause for the considered failure scenario.

Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Peng Cheng Laboratory Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Beijing Key Laboratory of Mobile Computing and Pervasive Device, Tsinghua Shenzhen International Graduate School, Tsinghua University

Abstract: The performance of multimodal models often deteriorates when modality absence occurs. The absence disrupts the learned intermodal correlations, resulting in biased multimodal representations. This challenge is especially pronounced when the absence is pervasive, affecting both the training and inference phases. Recent studies have attempted to reconstruct the missing information; however, most of them require complete supervision, which is seldom available in scenarios of pervasive absence. The quality of reconstruction remains a critical issue. Alternatively, others aim to learn robust representations from the available modalities but the substantial variations and biases are not fully addressed. This paper introduces the Multimodal Generalization and Refinement (MGR) framework to mitigate the issue of pervasive modality absence. MGR begins by acquiring generalized multimodal representations and iteratively refines them to recognize and calibrate the biased representations. Initially, multimodal samples with absence are embedded through foundation models, and MGR integrates independent unimodal features to further enhance generalization. Additionally, a novel mixed-context prompt is adopted to identify biases in both features and correlations. A redistribution operation can then refine these biases through graph pooling, culminating in robust and calibrated multimodal representations, which are suitable for downstream tasks. Comprehensive experiments on four benchmark datasets demonstrate that the proposed MGR framework outperforms state-of-the-art methods, effectively mitigating the impact of pervasive modality absence.

Abstract: Decisionfocused learning (DFL) offers an end-to-end approach to the predict-then-optimize (PO) framework by training predictive models directly on decision loss (DL), enhancing decision-making performance within PO contexts. However, the implementation of DFL poses distinct challenges. Primarily, DL can result in deviation from the physical significance of the predictions under limited data. Additionally, some predictive models are non-differentiable or black-box, which cannot be adjusted using gradient-based methods. To tackle the above challenges, we propose a novel framework, Decision-Focused Fine-tuning (DFF), which embeds the DFL module into the PO pipeline via a novel bias correction module. DFF is formulated as a constrained optimization problem that maintains the proximity of the DL-enhanced model to the original predictive model within a defined trust region. We theoretically prove that DFF strictly confines prediction bias within a predetermined upper bound, even with limited datasets, thereby substantially reducing prediction shifts caused by DL under limited data. Furthermore, the bias correction module can be integrated into diverse predictive models, enhancing adaptability to a broad range of PO tasks. Extensive evaluations on synthetic and real-world datasets, including network flow, portfolio optimization, and resource allocation problems with different predictive models, demonstrate that DFF not only improves decision performance but also adheres to fine-tuning constraints, showcasing robust adaptability across various scenarios.

Abstract: The nondominated sorting genetic algorithm II (NSGA-II) is the most popular multi-objective optimization heuristic. Recent mathematical runtime analyses have detected two shortcomings in discrete search spaces, namely, that the NSGA-II has difficulties with more than two objectives and that it is very sensitive to the choice of the population size. To overcome these difficulties, we analyze a simple tie-breaking rule in the selection of the next population. Similar rules have been proposed before, but have found only little acceptance. We prove the effectiveness of our tie-breaking rule via mathematical runtime analyses on the classic OneMinMax, LeadingOnesTrailingZeros, and OneJumpZeroJump benchmarks. We prove that this modified NSGA-II can optimize the three benchmarks efficiently also for many objectives, in contrast to the exponential lower runtime bound previously shown for OneMinMax with three or more objectives. For the bi-objective problems, we show runtime guarantees that do not increase when moderately increasing the population size over the minimum admissible size. For example, for the OneJumpZeroJump problem with representation length n and gap parameter k, we show a runtime guarantee of O(max {n^(k + 1), N n}) function evaluations when the population size is at least four times the size of the Pareto front. For population sizes larger than the minimal choice N = Θ(n), this result improves considerably over the Θ(N n^k) runtime of the classic NSGA-II.

Abstract: In recent years the understanding of optimal bidirectional heuristic search (BiHS) has progressed significantly. Yet, BiHS is relatively unexplored in unbounded suboptimal search. Front-to-end (F2E) and front-to-front (F2F) bidirectional search have been used in optimal algorithms, but adapting them for unbounded suboptimal search remains an open challenge. We introduce a framework for suboptimal BiHS, called anchor search, and use it to derive a parameterized family of algorithms. Because our new algorithms need F2F heuristic evaluations, we propose using pattern databases (PDBs) as differential heuristics (DHs) to construct F2F heuristics. Our experiments evaluate three anchor search instances across diverse domains, outperforming existing methods, particularly as the search scales.

National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment Hunan Institute of Advanced Technology, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment, National University of Defense Technology State Key Laboratory of Complex & Critical Software Environment Hunan Institute of Advanced Technology, Changsha, China

Abstract: Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decisionmaking steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Policy Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.

Abstract: The increasing parameters and expansive dataset of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance in pre-trained LLMs, how to achieve high-confidence detection and how to perform MIA on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than design an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly high level of 0.9.

Abstract: In a decisionmaking scenario, a principal could use conditional predictions from an expert agent to inform their choice. However, this approach would introduce a fundamental conflict of interest. An agent optimizing for predictive accuracy is incentivized to manipulate their principal towards more predictable actions, which prevents that principal from being able to deterministically select their true preference. We demonstrate that this impossibility result can be overcome through the joint evaluation of multiple agents. When agents are made to engage in zero-sum competition, their incentive to influence the action taken is eliminated, and the principal can identify and take the action they most prefer. We further prove that this zero-sum setup is unique, efficiently implementable, and applicable under stochastic choice. Experiments in a toy environment demonstrate that training on a zero-sum objective significantly enhances both predictive accuracy and principal utility, and can eliminate previously learned manipulative behavior.

Abstract: The emergence of VisionLanguage Models (VLMs) is significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has made VLMs vulnerable to advanced adversarial attacks, raising concerns about reliability. Objective of this paper is to assess resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate VLM's ability to maintain robustness against adversarial input perturbations, we propose novel metric called \textbf{Retention Score}. Retention Score is multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using conditional diffusion model. These pairs are then predicted for toxicity score by VLM alongside toxicity judgment classifier. By calculating margin in toxicity scores, we can quantify robustness of VLM in attack-agnostic manner. Our work has four main contributions. First, we prove that Retention Score can serve as certified robustness metric. Second, we demonstrate that most VLMs with visual components are less robust against jailbreak attacks than corresponding plain VLMs. Additionally, we evaluate black-box VLM APIs and find that security settings in Google Gemini significantly affect score and robustness. Moreover, robustness of GPT4V is similar to medium settings of Gemini. Finally, our approach offers time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.

Abstract: Training advanced AI models requires large investments in computational resources, or compute. Yet, as hardware innovation reduces the price of compute and algorithmic advances make its use more efficient, the cost of training an AI model to a given performance falls over time a concept we describe as increasing compute efficiency. We find that while an access effect increases the number of actors who can train models to a given performance over time, a performance effect simultaneously increases the performance available to each actor. This potentially enables large compute investors to pioneer new capabilities, maintaining a performance advantage even as capabilities diffuse. Since large compute investors tend to develop new capabilities first, it will be particularly important that they share information about their AI models, evaluate them for emerging risks, and, more generally, make responsible development and release decisions. Further, as compute efficiency increases, governments will need to prepare for a world where dangerous AI capabilities are widely available - for instance, by developing defenses against harmful AI models or by actively intervening in the diffusion of particularly dangerous capabilities.

Abstract: Split learning, as a distributed learning framework, effectively addresses the issue of limited computing resources. However, despite achieving a separation of data and computation, recent studies have pointed out that this framework still faces two major security challenges: privacy leakage and model security. Most current research focuses on the problem of privacy leakage, emphasizing how to prevent malicious servers from recovering or inferring the client's private data. However, the issue of model security in split learning has not received sufficient attention. This paper reveals the vulnerability of split learning to backdoor attacks. Since split learning cannot access client data directly, it can only guide the client model to incorporate backdoors through gradients. To address this issue, we design an attack framework that modifies intermediate activations to influence the gradients. We designed a parrot model that learns the client’s feature space, enabling the server to obtain the intermediate activations of poisoned data. During the forward pass, some of the intermediate activations and labels transmitted from the client to the server are replaced with poisoned activations and target labels. This replacement method effectively integrates the backdoor task into the model while partially retaining the main task. This approach ensures that the main task is preserved while seamlessly embedding the backdoor task. Our attack framework minimizes reliance on client knowledge and ensures that the attack process remains undetectable by the client. Through extensive experiments, we demonstrated high attack success rates using triggers such as BadNet, SIG, Blended, and WaNet, while minimizing the impact on the main task.

Abstract: Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also face notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the relevant top k items from offline knowledge bases, using available information from a webpage, including logos and HTML. Our empirical results, based on three realworld datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

Abstract: Captive breeding programs play a critical role in combating the ongoing biodiversity crisis by preserving the most endangered species and supporting reintroduction efforts. Maintaining the genetic health of captive populations requires careful management to prevent inbreeding and maximize the effective population size. Decisions about which males and females should be bred together are guided by the principle of minimizing relatedness between pairs. Methods to select breeding pairs are well developed, however, some species' ecology requires them to live in groups, and evaluating optimal groupings of multiple males and females that would be suitable to breed together is a more complex problem. Current computational tools to support the design of groupliving captive breeding programs suffer from challenges of scalability and flexibility. In this paper we demonstrate the applicability of constraint programming (CP) approaches to optimize breeding groups to minimize relatedness. We present the example of the Galapagos giant tortoises as the test case used to develop our approach. Exploration of the needs of this captive breeding program has informed the development of our flexible approach to capture the constraints on viable captive breeding program design. Our findings have directly informed the implementation of new group configurations at the captive breeding centre. We further demonstrate that our approach is broadly applicable in other contexts through a second case study, providing multi-objective optimisation of a breeding program of canids. Through these case studies and an ablation study using synthetic datasets, we show that our constraint optimisation approach provides an expressive and generalizable means to support captive breeding program design, including scaling to large captive populations, which are currently intractable using current computational methods.

Abstract: Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using datadriven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, resulting in a lack of reliable quality. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.

Abstract: As global populations age rapidly, incorporating agespecific considerations into urban planning has become essential to addressing the urgent demand for age-friendly built environments and ensuring sustainable urban development. However, current practices often overlook these considerations, resulting in inadequate and unevenly distributed elderly services in cities. There is a pressing need for equitable and optimized urban renewal strategies to support effective age-friendly planning. To address this challenge, we propose a novel framework, Fairness-driven Age-friendly community Planning via Conditional Diffusion generation (FAP-CD). FAP-CD leverages a conditioned graph denoising diffusion probabilistic model to learn the joint probability distribution of aging facilities and their spatial relationships at a fine-grained regional level. Our framework generates optimized facility distributions by iteratively refining noisy graphs, conditioned on the needs of the elderly during the diffusion process. Key innovations include a demand-fairness pre-training module that integrates community demand features and facility characteristics using an attention mechanism and min-max optimization, ensuring equitable service distribution across regions. Additionally, a discrete graph structure captures walkable accessibility within regional road networks, guiding model sampling. To enhance information integration, we design a graph denoising network with an attribute augmentation module and a hybrid graph message aggregation module, combining local and global node and edge information. Empirical results across multiple metrics demonstrate the effectiveness of FAP-CD in balancing age-friendly needs with regional equity, achieving an average improvement of 41% over competitive baseline models.

Abstract: Social media platforms have become vital spaces for public discourse, serving as modern agorás where a wide range of voices influence societal narratives. However, their open nature also makes them vulnerable to exploitation by malicious actors, including statesponsored entities, who can conduct information operations (IOs) to manipulate public opinion. The spread of misinformation, false news, and misleading claims threatens democratic processes and societal cohesion, making it crucial to develop methods for the timely detection of inauthentic activity to protect the integrity of online discourse. In this work, we introduce a methodology designed to identify users orchestrating information operations, a.k.a. IO drivers, across various influence campaigns. Our framework, named IOHunter, leverages the combined strengths of Language Models and Graph Neural Networks to improve generalization in supervised, scarcely-supervised, and cross-IO contexts. Our approach achieves state-of-the-art performance across multiple sets of IOs originating from six countries, significantly surpassing existing approaches. This research marks a step toward developing Graph Foundation Models specifically tailored for the task of IO detection on social media platforms.

Abstract: Rapidly changing climate conditions and the increase in extreme events are posing severe challenges to human life and infrastructure, requiring sophisticated analytical capabilities for hazard prediction and disaster risk management. Earth System Data Cubes (ESDCs) have become an essential tool in Earth System Sciences (ESS) by organizing largescale, multivariate environmental datasets into a structured, scalable and analysis-ready format. However, modern machine learning techniques are not yet being utilized to their full potential on ESDCs. This is due to the lack of proper tooling, domain-specific challenges, and high barriers of entry for practitioners. We introduce ml4xcube, an open-source Python framework designed to assist ESS domain experts in applying ML techniques on ESDCs for advanced analysis and prediction of environmental variables and impacts. Through a comprehensive suite of tools, it addresses specific challenges associated with the nature of ESS data, such as the non-uniform data distribution due to dynamic gaps, or spatio-temporal autocorrelation of environmental variables. Due to its modular architecture, it covers the complete analysis process, from data exploration, and preparation, to model development, result interpretation and evaluation. With support for distributed computing, it handles large ESDC datasets efficiently. In order to ease the adoption it includes extensive documentation and tutorial notebooks. We demonstrate ml4xcube's capabilities through three examples, showcasing its potential and capabilities for integrating machine learning with ESDC data.

Abstract: Crystal materials play an important role in the development of society. The discovery of new materials is critical to achieving sustainable development goals (SDGs), such as climate change mitigation, affordable and clean energy, and fostering innovation in industry and infrastructure. Recent advances in deep learning for crystal property prediction have accelerated material discovery, but these methods typically rely on labeled data, which is often limited and varies across different properties. This limitation hinders the full utilization of the vast amount of unlabeled data in materials science. To overcome this challenge, we introduce an unsupervised Denoising Pretraining Framework (DPF) tailored for crystal structures. DPF trains a model to reconstruct the original crystal structure by recovering the masked atom types, perturbed atom positions, and perturbed crystal lattices. Through pre-training, models learn the intrinsic features of crystal structures and capture the key features influencing crystal properties. We pre-train models on a dataset of 380,743 unlabeled crystal structures and fine-tune them on downstream property prediction tasks. Extensive experiments demonstrate the effectiveness of our framework, showing its potential to significantly advance material science and contribute to the development of society by accelerating the discovery of materials crucial for sustainable technologies.

Abstract: Individual models of infectious diseases or trajectories coming from different simulations may vary considerably, making it challenging for public communication and supporting policymaking. Therefore, it is common in public health to first create a consensus across multiple models and simulations through ensembling. However, current methods are limited to mean and median ensembles that perform aggregation of scale (cases, hospitalizations, deaths) along the time axis, which often misrepresents the underlying trajectories -- e.g., they underrepresent the peak. Instead, we wish to create an ensemble that represents aggregation simultaneously over both time and scale and thus better preserves the properties of the trajectories. This is particularly useful for public health where time-series have a sequence of meaningful local trends that are ordered, e.g., a surge to an increase to a peak to a decrease. We propose a novel alignment method DTW+SBA, which combines a representation of local trends along with dynamic time warping barycenter averaging. We prove key properties of this method that ensure appropriate alignment based on local trends. We demonstrate on real multi-model outputs that our approach preserves the properties of underlying trajectories. We also show that our alignment leads to a more sensible clustering of epidemic trajectories.

Abstract: Tropical cyclone (TC) intensity forecasting is crucial for early disaster warning and emergency decisionmaking. Numerous researchers have explored deep-learning methods to address computational and post-processing issues in operational forecasting. Regrettably, they exhibit subpar long-term forecasting capabilities. We use two strategies to enhance long-term forecasting. (1) By enhancing the matching between TC intensity and spatial information, we can improve long-term forecasting performance. (2) Incorporating physical knowledge and physical constraints can help mitigate the accumulation of forecasting errors. To achieve the above strategies, we propose the VQLTI framework. VQLTI transfers the TC intensity information to a discrete latent space while retaining the spatial information differences, using large-scale spatial meteorological data as conditions. Furthermore, we leverage the forecast from the weather prediction model FengWu to provide additional physical knowledge for VQLTI. Additionally, we calculate the potential intensity (PI) to impose physical constraints on the latent variables. In the global long-term TC intensity forecasting, VQLTI achieves state-of-the-art results for the 24h to 120h, with the MSW (Maximum Sustained Wind) forecast error reduced by 35.65%-42.51% compared to ECMWF-IFS.

Abstract: In recent years, ransomware has emerged as a formidable data security threat, causing significant data privacy breaches that inflict substantial financial, reputational, and operational damages on society. Many studies employ dynamic feature analysis for ransomware detection. However, these methods utilize neither the internal semantic information (semantic information inherent in the features), nor external semantics (the wealth of existing knowledge and expert experience with regard to ransomware detection). Moreover, conventional methods rely on training data from known ransomware families, while zeroday ransomware often has unknown data distribution patterns, posing detection challenges. In this paper, we propose a Semantics-based Ransomware Detection and family Classification (SRDC) framework that can utilize both internal and external semantics of software. To bolster semantic analysis in zero-day attacks, we also design a procedure called LLM-assisted task-adaptive pre-training (LATAP). In LATAP, ransomware semantics from human experts and LLMs are employed to pre-train the detection model (GPT-2). By fully utilizing semantics, the proposed SRDC framework outperforms the SOTA methods by 12.15% for ransomware family classification tasks, and by 4.03% for zero-day ransomware detection tasks. SRDC also exhibits excellent data efficiency, requiring only two ransom families for training, which is only 35% of the data required by existing methods, to achieve a 90%+ accuracy of zero-day ransomware detection in nine unseen ransom families.

Abstract: In this blue sky paper, we seek to stimulate the research community to pursue important new as well as existing (unsolved) AI problems in the context of a challenging, often ignored, sociosensitive application domain. We outline the key challenges in conducting elections credibly in leading democracies around the world today and identify our vision of a path forward with an overarching goal to increase voter participation with a two-pronged approach of AI-lead technological innovations and interdisciplinary community building. On the technology front, we envisage the need to transform Collation and Distribution of election information, and promote its Comprehensibility for users understanding and trust (CDC). On the community front, we need to invigorate the multi-disciplinary community consisting of, but not limited to, researchers in AI, security, journalism, political science, sociology, and business, to PROMote AI's Safe usage for Elections (PROMISE) with best-practices. This work is informed by our interdisciplinary research as well as experience in conducting three workshops at leading AI conferences and the AI Magazine special issue on AI and Elections.

Abstract: Many manufacturing companies are facing an acute shortage of qualified workers. Deploying robotic cells is a potential solution to address this challenge. Historically robots have been deployed only in mass production applications in manufacturing. A large fraction of manufacturing is classified as highmix manufacturing where a large variety of products are produced. Manually programming robots is not a viable solution in high-mix manufacturing applications. Robotic cells need to be powered by embodied AI to make them useful in high-mix manufacturing applications. This paper aims to build a bridge between smart manufacturing and AI communities to enable AI researchers to develop methods and tool that can be successfully deployed to realize smart robotic cells for high-mix manufacturing applications. This paper highlights key requirements for developing embodied AI for powering robotic cells for high-mix manufacturing applications. It also makes the case for approaches that combine model-based and data-driven methods to meet the needs of embodied AI in manufacturing applications and describes the role of generative AI approaches in smart manufacturing applications. Finally, it describes how AI can be used to enhance digital twins and augment human-machine interfaces in manufacturing applications.

Abstract: This paper surveys Machine Learning approaches to build predictive models that know what they don't know. The consequential action of this knowledge can consist of abstaining from providing an output (rejection), deferring to another model (dynamic model selection), deferring to a human expert (learning to defer), or informing the user (uncertainty estimation). We formally state the problems each approach solves and point to key references. We discuss open issues that deserve investigation from the scientific community.

Abstract: Along with the broad deployment of deep learning (DL) systems, their lack of trustworthiness, such as their lack of robustness, fairness, and numerical reliability, is raising serious social concerns, especially in safetycritical scenarios such as autonomous driving and aircraft navigation. Hence, a rigorous and accurate evaluation of the trustworthiness of DL systems is essential and would be a prerequisite for improving DL trustworthiness. The first part of the talk will be an overview of certified methods for DL trustworthiness. These methods provide computable guarantees for DL systems in terms of worst-case trustworthiness under certain realistic conditions, such as the accuracy lower bound against arbitrary tiny perturbations. Based on our taxonomy and systematization, we illustrate key methodologies, specifically semantic randomized smoothing and branch-and-bound, and their implications for certified DL trustworthiness. As a representative of recent DL breakthroughs, large language models (LLMs) are transforming our lives, but, on the other hand, posing more challenges to trustworthiness. For example, LLMs can be jailbroken with adversarial prompts to output harmful content with bias, harassment, misinformation, and more. The second part of the talk will be an overview of LLM trustworthiness. We will start with sharing hands-on experience in developing fontier LLMs, then illustrate common LLM trustworthiness issues via examples, then demonstrate evaluation challenges, take one benchmark as an example, and conclude by envisioning certifiable trustworthiness for LLMs.

Abstract: Humans have a strong intuitive understanding of the physical world. Through observations and interactions with the environment, we build mental models that predict how the world would change if we applied a specific action (i.e., intuitive physics). My research draws on these human insights to develop modelbased RL agents that learn from their interactions and build predictive models that generalize widely across a range of objects made with different materials. The core idea behind my research is to introduce novel representations and integrate structural priors into learning systems to model dynamics at different levels of abstraction. I will discuss how such structures can make model-based planning algorithms more effective, helping robots accomplish complex manipulation tasks (e.g., manipulating an object pile, shaping deformable foam into a target configuration, and making a dumpling from dough using various tools).

Abstract: Data plays an increasingly crucial role in both the performance and the safety of AI models. Data attribution is an emerging family of techniques aimed at quantifying the impact of individual training data points on a model trained on them, which has found datacentric applications such as instance-based explanation, unsafe training data detection, and copyright compensation. In this talk, I will comprehensively review our work contributing to the applications, methods, and open-source benchmarks of data attribution, and discuss open challenges in this field.

Abstract: AI for public sector research is about using AI to tackle the numerous challenges faced by public sector organizations when they are out there making our world a better place. AI for public sector research is useinspired research. It differs from traditional AI research first and foremost in its key objective being measurable societal impact. AI for public sector research contributes to the computing community by proposing new problem models, raising complexities that challenge abstractions which often leads to new methodologies, and introducing new contexts for evaluation. However, fulfilling this promise is easier said than done. This talk consists of three parts about our preliminary work in our long-term quest to make AI for public sector operationally scalable, financially sustainable, technically generalizable, and socially responsible. We will cover (1) a concrete AI for public sector project from problem scoping to field trials and deployment, (2) a generalizable algorithm applicable to various public sector domains, and (3) an overview of our work in a wide variety of applications.

Abstract: Some of the most powerful language models currently are proprietary systems, accessible only via (typically restrictive) web or software programming interfaces. This is the LanguageModelsas-a-Service (LMaaS) paradigm. In contrast with scenarios where full model access is available, as in the case of open-source models, such closed-off language models present specific challenges for evaluating, benchmarking, and testing them. This paper has two goals: on the one hand, we delineate how the aforementioned challenges act as impediments to the accessibility, reproducibility, reliability, and trustworthiness of LMaaS. We systematically examine the issues that arise from a lack of information about language models for each of these four aspects. We conduct a detailed analysis of existing solutions, put forth a number of recommendations, and highlight directions for future advancements. On the other hand, it serves as a synthesized overview of the licences and capabilities of the most popular LMaaS.

Abstract: The National Vulnerability Database (NVD) publishes over a thousand new vulnerabilities monthly, with a projected 25 percent increase in 2024, highlighting the crucial need for rapid vulnerability identification to mitigate cybersecurity attacks and save costs and resources. In this work, we propose using large language models (LLMs) to learn vulnerability evaluation from historical assessments of medical device vulnerabilities in a single manufacturer's portfolio. We highlight the effectiveness and challenges of using LLMs for automatic vulnerability evaluation and introduce a method to enrich historical data with cybersecurity ontologies, enabling the system to understand new vulnerabilities without retraining the LLM. Our LLM system integrates with the inhouse application - Cybersecurity Management System (CSMS) - to help Siemens Healthineers (SHS) product cybersecurity experts efficiently assess the vulnerabilities in our products. Also, we present guidelines for efficient integration of LLMs into the cybersecurity tool.

Abstract: Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problemsolving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre-trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi-agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM-based multi-agent system in overcoming the challenges that previous methods have encountered. Finally, through an in-depth discussion of the implications of automating knowledge tagging, we underscore the promising future of deploying LLM-based algorithms in educational contexts.

Abstract: Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multicomponent systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical ``severity'' framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted evaluation approach opens avenues for various classes of enhancements, paving the way for more robust and trustworthy AI systems.

Abstract: Artificial Intelligence (AI) has impacted the world tremendously in the last decade, causing an increased demand for accessible AI education globally. Students benefit from studying AI earlier in the curriculum; however, AI courses can require a range of prerequisites, which can be structured differently in various educational contexts. In this paper, we study the curriculum structure of AI, Machine Learning (ML), and Data Science (DS) courses in Canadian Universities and compare it with that of US Research1 institutions. There are many similarities between AI, ML, and DS courses in Canada and the US. For example, DS courses tend to be more accessible earlier in the CS curriculum compared to AI and ML. However, there are key differences between the two countries, with Canadian AI, ML, and DS courses generally being a part of a longer prerequisites chain, and Canadian CS departments offering fewer DS courses. Still, both Canadian and US institutions find innovative ways to introduce AI earlier in the curriculum, including via interdisciplinary courses and specialized courses with few prerequisites. This study corroborates earlier work in recognizing diversity in curricular frameworks in North America and recommends curricular revisions and early academic advising to ensure access to AI courses.

Abstract: Neuron Sandbox is a browserbased tool that helps middle school students grasp basic principles of neural computation. It simulates a linear threshold unit applied to binary decision problems, which students solve by adjusting the unit's threshold and/or weights. Although Neuron Sandbox provides extensive visualization aids, solving these problems is challenging for students who have not yet been exposed to algebra. We collected survey, video, and worksheet data from 21 seventh grade students in two sections of an AI elective, taught by the same teacher, that used Neuron Sandbox. We present a scaffolding strategy that proved effective at guiding these students to achieve mastery of these problems. While the amount of scaffolding required was more than we originally anticipated, by the end of the exercise students understood the computation that linear threshold units perform and were able to generalize their understanding of the worksheet’s "solve for threshold" strategy to also solve for weights.

Abstract: Despite significant advancements in solving Markov Decision Processes (MDPs) and Simple Stochastic Games (SGs), scalability remains a challenge due to the exponential growth of their state spaces. This thesis aims to push the boundaries of stateof-the-art methods by tackling this issue using 1) explainability and 2) exploiting the model structure. First, we introduce the *1-2-3-Go* approach, which learns explainable policies from small MDP models and generalizes them to larger instances, improving scalability in MDPs. We then extend *Optimistic Value Iteration (OVI)* and *Sound Value Iteration (SVI)*—originally designed for MDPs—to SGs, improving efficiency in adversarial settings. Finally, we aim to exploit the *explainable policy representations* and the *model structure* to enhance both scalability and interpretability in SGs. This thesis contributes to both theoretical advancements and practical solutions for decision-making systems under uncertainty.

Abstract: Multimodal models, namely visionlanguage models, present unique possibilities through the seamless integration of different information mediums for data generation. These models mostly act as a black-box, making them lack transparency and explicability. Reliable results require accountable and trustworthy Artificial Intelligence (AI), namely when in use for critical tasks, such as the automatic generation of medical imaging reports for healthcare diagnosis. By exploring stress-testing techniques, multimodal generative models can become more transparent by disclosing their shortcomings, further supporting their responsible usage in the medical field.

Abstract: The basic objective of my research work is to address the challenging problem of recognizing object states in a visual context by integrating datadriven and symbolic approaches. In particular, I focus on the Zero-shot variation of this task. The contributions made so far include the development of novel methods that exhibit state-of-the-art (SOTA) performance, the creation of a new object states dataset, the formulation of novel problems, the successful integration of low-level and high-level approaches, and comprehensive analyses that highlight the specific challenges posed by the problem.

Abstract: Misinformation and propaganda undermine trust in institutions, spread falsehoods, and sometimes incite violence. However, recent advancements in transformerbased AI models can help combat the proliferation of disinformation globally and in real time. In this work, I propose and develop a system using these models to scalably identify, track, and analyze the spread of narratives from over 40,000 international news websites. First, by employing novel multilingual Matryoshka embeddings and hierarchical level-wise clustering, my proposed system identifies news stories, topics, and themes across these thousands of news websites. Second, by utilizing multilingual stance detection, my system assesses the biases and factual inconsistencies in news articles, enabling the identification of websites that spread propaganda or misinformation. Finally, through network inference methods, my system uncovers connections among websites disseminating slanted or false content. My approach illustrates how AI can be utilized to mitigate the global spread of harmful misinformation and propaganda.

Abstract: The rise of large language models (LLMs) has revolutionized natural language processing, offering immense capabilities across various applications. The widespread integration of these models into commonplace technology has brought to light deep concerns about the biases they encompass, which could serve to perpetuate negative preconceptions and social injustices. The scope of my research includes social biases, brand biases, the impact of personas on bias, and stereotypes in lowresource languages. My contributions aim to deepen our understanding of these biases and develop methodologies to mitigate them, enhancing the fairness and utility of LLMs across diverse global applications.

Abstract: Healthcare information is scattered across heterogeneous data sources, such as patient medical records, clinical guidelines, research literature, and online knowledge bases. Segmented information, both structured and unstructured, when integrated together using context augmentation a knowledge fusion technique, has the ability to contextualize broader medical context. Current approaches lack knowledge aggregation that is necessary to generate personalized healthcare recommendations. I propose novel AI frameworks that leverage language models and hybrid retrieval techniques to aggregate multi source knowledge, enabling the generation of contextual and accurate medical response.

Abstract: Large Language Models (LLMs) and Generative AI (GenAI) have markedly changed the landscape of many fields, including education. While these tools have significant capabilities, they also require understanding to effectively and responsibly use them. Additionally, little work has been done to evaluate how these tools can best benefit education at the secondary level, with design insights from instructors. My work focuses on informing secondary instructors of these tools, receiving their input on how to make these tools work best for them, and finally using this input to create and evaluate an inclass Retrieval-Augmented Generation (RAG)-based chatbot for their students to use to improve learning outcomes. This work aims to bridge the gap between the latest in computing technology and secondary education classrooms.

Abstract: Deploying machine learning (ML) models in highstakes domains such as healthcare and autonomous systems requires reliable uncertainty quantification (UQ) to ensure safe and accurate decision-making. Conformal prediction (CP) offers a robust, distribution-agnostic framework for UQ, providing valid prediction sets that guarantee a specified coverage probability. However, existing CP methods are often limited by assumptions that are violated in real-world scenarios, such as non-i.i.d. data, and by a lack of integration with modern machine learning workflows, particularly in large generative models. This research aims to address these limitations by advancing CP techniques to operate effectively in non-i.i.d. settings, improving predictive efficiency without sacrificing theoretical guarantees, and integrating CP directly into model training processes. These developments will enhance the practical applicability of CP for a wide range of ML tasks, enabling more reliable and interpretable models in high-stakes applications.

Abstract: Foundation models in general domains have leveraged multimodal knowledge graphs to great effect, yet the healthcare sector lacks such comprehensive structures, presenting a significant gap in current research. Based on previous exploration with pure datadriven approaches, this proposal describes a two-stage project aiming to enhance multimodal healthcare foundation model with domain knowledge. The first stage is to construct a robust multimodal healthcare knowledge graph based on established healthcare taxonomies, such as UMLS, and enriched with data from multimodal clinical databases like MIMIC-CXR. This knowledge graph will incorporate medical images as cross-modal instances linked to healthcare terminologies, enhancing the depth and applicability of the graph. In the second stage, the knowledge graph will serve as a foundational tool in training healthcare foundation models with enhanced capabilities, particularly in reducing hallucination and managing concept ambiguity through the novel use of reinforcement learning techniques like Direct Preference Optimization (DPO). This research is expected to make significant contributions to the domain of healthcare AI by enabling more accurate, reliable, and explainable AI-driven diagnostics and interventions.

Abstract: Probing classifiers are a technique for understanding and modifying the operation of neural networks in which a smaller classifier is trained to use the model's internal representation to learn a related probing task. Similar to a neural electrode array, training probing classifiers can help researchers both discern and edit the internal representation of a neural network. This paper presents an evaluation of the use of probing classifiers to modify the internal hidden state of a chessplaying transformer. We demonstrate that intervention vector scaling should follow a negative exponential according to the length of the input to ensure model outputs remain semantically valid after editing the residual stream activations.

Abstract: Increasing student populations and diverse course offerings have led to perceived inequities in U.S. high school course scheduling. Traditional integer programming (IP) methods for the High School Scheduling Problem (HSSP) fail to address these fairness concerns. This research introduces the Fair High School Scheduling Problem (FHSSP), an extension of the HSSP that incorporates student preferences and fairness principles from market design. We develop an IP model to generate course schedules that are both feasible and equitable. Tested on real course request data from a California high school, our model successfully produces schedules that ensure fairness without compromising feasibility. These results demonstrate the potential of our approach to enhance fairness in high school scheduling and its applicability to various realworld scheduling challenges. Additionally, this study highlights the feasibility of integrating human preferences and emotions into mathematical models, promoting more inclusive and balanced allocation systems.

Abstract: Esports has rapidly emerged as a global phenomenon with an everexpanding audience on livestream platforms. However, due to the complex nature of the game, it becomes challenging for newcomers to comprehend the gaming situation. This research introduces a 3M-Game that integrates multi-modal (MM) information from the livestream platform, including chat and livestream, to uncover the event. While conventional MM models typically prioritise aligning MM data through concurrent training towards a unified objective, our framework leverages multiple independent teachers trained on different tasks to accomplish game event detection. The results show the effectiveness of the proposed framework. The code and appendix are in https://github.com/adlnlp/3m_game.

Abstract: The prevalence of multimodal content on social media complicates automated moderation strategies. This calls for an enhancement in multi-modal classification and a deeper understanding of understated meanings in images and memes. Although previous efforts have aimed at improving model performance through fine-tuning, few have explored an end-to-end optimization pipeline that accounts for modalities, prompting, labelling, and fine-tuning. In this study, we propose an end-to-end conceptual framework for model opti- mization in complex tasks. Experiments support the efficacy of this traditional yet novel framework, achieving the highest accuracy and AUROC. Ablation experiments demonstrate that isolated optimisations are not ineffective on their own.

Abstract: This paper investigates the application of Generative Flow Networks (GFlowNets) to lead optimization in drug discovery. GFlowNets provide a novel framework for generating diverse molecular structures while optimizing for desired properties, addressing the limitations of traditional methods in exploring vast chemical spaces. We adapt GFlowNets to incrementally modify lead compounds, integrating domainspecific heuristics to guide the generation process. Our method employs the trajectory balance objective on a graph neural network (GNN), to learn a policy that samples fragments based on a multi-objective reward. The reward function ensures increase in cell permeability and similarity to the starting molecule. The results on benchmark datasets of activity cliffs demonstrate that GFlowNets can generate diverse modifications, producing optimized candidate molecules with improvement in cell permeability. This work can be extended with other pharmacokinetic properties for lead optimization in early-stage drug development, potentially accelerating the discovery of novel therapeutics.

Abstract: FuSEMET addresses critical challenges in deploying human activity recognition (HAR) systems in uncontrolled environments by effectively managing noisy labels, sparse data, and undefined activity vocabularies. By integrating BERT-based word embeddings with domain-specific knowledge (i.e., MET values), FuSE-MET optimizes label merging, reducing label complexity and improving classification accuracy. Our approach outperforms the state-of-the-art techniques, including ChatGPT-4, by balancing semantic meaning and physical intensity.

Abstract: Multiagent reinforcement learning (MARL) trains multiple agents in shared environments. Recently, MARL models have significantly improved performance by leveraging sequential decision-making processes. Although these models can enhance performance, they do not explicitly con-sider the importance of the order in which agents make decisions. We propose AOAD-MAT, a novel model incorporating action decision sequence into learning. AOAD-MAT uses a Transformer-based actor-critic architecture to dynamically adjust agent action order. It introduces a subtask predicting the next agent to act, integrated into a PPO-based loss function. Experiments on StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo benchmarks show AOAD-MAT out-performs existing models, demonstrating the effectiveness of adjusting agent order in MARL.

Abstract: Autonomous robots are essential for navigating and collecting data in hazardous environments where human intervention is impractical. Current methods often result in inefficiencies, missed highquality imagery, and inadequate coverage in critical areas such as environmental monitoring, disaster response, and medical diagnostics. The absence of intelligent viewpoint selection leads to redundant data and poor image quality, limiting robotic effectiveness. This research proposes a framework that utilizes reinforcement learning and information-theoretic approaches to optimize viewpoint selection, aiming to enhance data collection efficiency and image quality while ensuring safety. This work has the potential to transform industries reliant on precise visual data and significantly improve medical robotics, enabling better diagnostics and patient care.

Abstract: Textconditioned image generation enables cross-modal comprehension. Recent emergence of many platforms have found applications in diverse domains like assisted designing and video gaming. However, there still exist challenges in existing platforms due to their expensive training and time-consuming generation processes. In this paper, we introduce an efficient text-conditioned image generation platform, termed InstantPainting. Unlike existing platforms based on large-scale pre-trained diffusion models, InstantPainting expands generative adversarial networks (GANs) to achieve efficient generation by using only about three percent pre-training data of other platforms. Compared to existing platforms, InstantPainting achieves the following functions at a very low deployment cost and approximately 4 to 5 times faster generation speeds: (1) Multi-category and multi-size image generation (2) Image stylization and controlled generation (3) Creative generation, including the generation of poetry pictures and counterfactual images. The proposed platform provides web application implementations for PC and mobile, users can create high-quality images directly through the user interface.

Abstract: Probabilistic logic shields integrate deep reinforcement learning (RL) with probabilistic logic reasoning to train agents that operate in uncertain environments while giving strong guarantees with respect to logical constraints, such as safety properties. In this demo paper, we introduce a codebase that streamlines the design of custom MiniHack environments where neurosymbolic RL agents leverage probabilistic logic shields to learn safe and interpretable policies with strong guarantees. Our framework allows expert users to easily define and train agents that integrate deep neural policies with probabilistic logic in arbitrarily complex games: from simple exploration to planning and interacting with enemies. Additionally, we provide a webbased platform that showcases our application, offering an interactive interface for the broader community to experiment with and explore the capabilities of neurosymbolic reinforcement learning. This lowers the barrier for researchers and developers, making it accessible for a wider audience to engage with safety-critical RL scenarios.

Abstract: In this demo, we present AERA Chat, an automated and explainable educational assessment system designed for interactive and visual evaluations of student responses. This system leverages large language models (LLMs) to generate automated marking and rationale explanations, addressing the challenge of limited explainability in automated educational assessment and the high costs associated with annotation. Our system allows users to input questions and student answers, providing educators and researchers with insights into assessment accuracy and the quality of LLMassessed rationales. Additionally, it offers advanced visualization and robust evaluation tools, enhancing the usability for educational assessment and facilitating efficient rationale verification.

Abstract: Reducing the environmental impact of cloud computing requires efficient workload distribution across geographically dispersed Data Center Clusters (DCCs) and simultaneously optimizing liquid and air (HVAC) cooling with time shift of workloads within individual data centers (DC). This paper introduces GreenDCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC. By incorporating factors like weather, carbon intensity, and resource availability, Green-DCC addresses realistic constraints and interdependencies. We demonstrate how the system optimizes multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics while also offering a framework and benchmark simulation for broader ML research in sustainability.

Abstract: This paper incorporates the efficiency of automatic summarization and addresses the challenge of generating personalized summaries tailored to individual users' interests and requirements. To tackle this challenge, we introduce SummPilot, an interactionbased customizable summarization system. SummPilot leverages a large language model to facilitate both automatic and interactive summarization. Users can engage with the system to understand document content and personalize summaries through interactive components such as semantic graphs, entity clustering, and explainable evaluation. Our demo and user studies demonstrate SummPilot's adaptability and usefulness for customizable summarization.

Abstract: Finding highquality representations of heterogeneous tabular datasets is crucial for their effective use in downstream machine learning tasks. Contrastive representation learning (CRL) methods have been previously shown to provide a straightforward way to learn such representations across various data domains. Current tabular CRL methods learn joint embeddings of data instances (tabular rows) by minimizing a contrastive loss between the original instance and its perturbations. Unlike existing tabular CRL methods, we propose leveraging frameworks established in multimodal representation learning, treating each tabular column as a distinct modality. A naive approach that applies a contrastive loss pairwise to tabular columns is not only prohibitively expensive as the number of columns increases, but as we demonstrate, it also fails to capture interactions between variables. Instead, we propose a novel method called ICE-T that learns each columnar embedding by contrasting it with aggregate embeddings of the complementary part of the table, thus capturing interactions and scaling linearly with the number of columns. Unlike existing tabular CRL methods, ICE-T allows for column-specific embeddings to be obtained independently of the rest of the table, enabling the inference of missing values and translation between columnar variables. We provide a comprehensive evaluation of ICE-T across diverse datasets, demonstrating that it generally surpasses the performance of the state-of-the-art alternatives.

Abstract: Longterm series forecasting aims to predict future data over long horizons based on historical information. However, existing methods struggle to effectively utilize long lookback windows due to overfitting, computational resource constraints, or information extraction challenges, thereby limiting them to using limited lookback windows for predicting long-term future series. To address these issues, this paper introduces the Input Refinement and Prediction Auxiliary (IRPA) framework, a lightweight model consisting of four linear layers designed to extract key information from ultra-long lookback windows to enhance limited lookback windows and assist prediction processes. IRPA comprises an Input Refinement Module (IRM) and a Prediction Auxiliary Module (PAM), each constructed from two linear layer sub-modules. The IRM performs effective decomposition and patching of ultra-long series, refining seasonal and trend features to increase the information density in limited lookback windows and mitigate overfitting and parameter inflation. The PAM extracts historical similarities and seasonal patterns from ultra-long lookback windows to significantly improve prediction accuracy. IRPA substantially extends the utilization of lookback windows, offering a lightweight and efficient solution with broad applicability. Experimental results on eight datasets show IRPA reduces the Mean Squared Error (MSE) by an average of 16.1% for various models.

State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China Peng Cheng Laboratory, Shenzhen, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China Peng Cheng Laboratory, Shenzhen, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Xi'an Jiaotong University, Xi'an, China, Peng Cheng Laboratory, Shenzhen, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China Peng Cheng Laboratory, Shenzhen, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: Offline preferencebased reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the trade-off between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.

Abstract: Video anomaly detection (VAD) aims at locating the abnormal events in videos. Recently, the Weakly Supervised VAD has made great progress, which only requires videolevel annotations when training. In practical applications, different institutions may have different types of abnormal videos. However, the abnormal videos cannot be circulated on the internet due to privacy protection. To train a more generalized anomaly detector that can identify various anomalies, it is reasonable to introduce federated learning into WSVAD. In this paper, we propose Global and Local Context-driven Federated Learning, a new paradigm for privacy protected weakly supervised video anomaly detection. Specifically, we utilize the vision-language association of CLIP to detect whether the video frame is abnormal. Instead of leveraging handcrafted text prompts for CLIP, we propose a text prompt generator. The generated prompt is simultaneously influenced by text and visual. On the one hand, the text provides global context related to anomaly, which improves the model's ability of generalization. On the other hand, the visual provides personalized local context because different clients may have videos with different types of anomalies or scenes. The generated prompt ensures global generalization while processing personalized data from different clients. Extensive experiments show that the proposed method achieves remarkable performance.

Abstract: Probabilistic circuits are a unifying representation of functions as computation graphs of weighted sums and products. Their primary application is in probabilistic modeling, where circuits with nonnegative weights (monotone circuits) can be used to represent and learn density/mass functions, with tractable marginal inference. Recently, it was proposed to instead represent densities as the square of the circuit function (squared circuits); this allows the use of negative weights while retaining tractability, and can be exponentially more expressive efficient than monotone circuits. Unfortunately, we show the reverse also holds, meaning that monotone circuits and squared circuits are incomparable in general. This raises the question of whether we can reconcile, and indeed improve upon the two modeling approaches. We answer in the positive by proposing Inception PCs, a novel type of circuit that naturally encompasses both monotone circuits and squared circuits as special cases, and employs complex parameters. Empirically, we validate that Inception PCs can outperform both monotone and squared circuits on a range of tabular and image datasets.

School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China, SChool of Economics and Management, Beijing University of Posts and Telecommunications, Beijing, PR China, School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China, School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China, School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China, School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China, School of Computer Science (National Pilot School of Software Engineering), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing, PR China

Abstract: Carefully selecting clients to participate in aggregation can assist the global model in achieving better performance. However, existing research on federated heterogeneous graph learning (FHGL) has shown limited attention to the client selection (CS) problem. Current CS algorithms face challenges in accurately evaluating client contributions and selecting appropriate participants in the context of FHGL, leading to a dilemma between convergence and accuracy. In this paper, we propose a Reinforcement Active client selection based Federated Heterogeneous Graph Learning (RAFHGL), which precisely evaluates the importance of local heterogeneous graph data and selects highcontributing clients for aggregation. RAFHGL employs an active learning agent to select representative nodes for local training. The statistical features of the active scores are used to assess client contributions. A client selection agent then chooses clients conducive to global model convergence for aggregation. To address heterogeneity introduced by sample and client selection, the training process stabilizes by correcting local losses based on data prototypes. Experimental results on 4 publicly available heterogeneous graph datasets show that RAFHGL outperforms existing Client Selection algorithms in federated heterogeneous graph learning scenarios in terms of performance and convergence.

Abstract: Unsupervised domain adaptation (UDA) aims at knowledge transfer from a labeled source domain to an unlabeled target domain. Most UDA techniques achieve this by reducing feature discrepancies between the two domains to learn domaininvariant feature representations. In this paper, we enhance this approach by proposing a simple yet powerful probabilistic framework (SimProF) for UDA to minimize the domain gap between the two domains. SimProF estimates the feature space distribution for each class and generates contrastive pairs by leveraging the shared categories between the source and target domains. The concept behind SimProF is inspired by the observation that normalized features in contrastive learning tend to follow a mixture of von Mises-Fisher (vMF) distributions on the unit sphere. This characteristic allows for the generation of an infinite number of contrastive pairs and facilitates an efficient optimization method using a closed-form expression for the expected contrastive loss. As a result, target semantics can be effectively used to augment source features. To implement this, we create vMF distributions based on the inter-domain feature mean difference for each class. Notably, we derive and minimize an upper bound of the expected loss, which is implicitly achieved through an estimated supervised contrastive learning loss applied to the augmented source distribution. Comprehensive experiments on cross-domain benchmarks confirm the efficacy of the proposed method.

Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen School of Science and Engineering (SSE), The Chinese University of Hong Kong, Shenzhen, Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen, Shenzhen International Graduate School, Tsinghua University, School of Computing Science, Simon Fraser University College of Computer Science, Zhejiang University, School of Science and Engineering (SSE), The Chinese University of Hong Kong, Shenzhen Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen

Abstract: In recent years, the distributed training of foundation models (FMs) has seen a surge in popularity. In particular, federated learning enables collaborative model training among edge clients while safeguarding the privacy of their data. However, federated training of FMs across resourceconstrained and highly heterogeneous edge devices encounter several challenges. These include the difficulty of deploying FMs on clients with limited computational resources and the high computation and communication costs associated with fine-tuning and collaborative training. To address these challenges, we propose FedCKMS, a Cluster-Aware Framework with Knowledge-Aware Model Search. Specifically, FedCKMS incorporates three key components. The first component is multi-factor heterogeneity-aware clustering, which groups clients based on both data distribution and resource limitations and selects an appropriate model for each cluster. The second component is knowledge-aware model architecture search, which enables each client to identify the optimal sub-model from the cluster model, facilitating adaptive deployment that accommodates highly heterogeneous computational resources across clients. The final component is cluster-aware knowledge transfer, which facilitates knowledge sharing between clusters and the server, addressing model heterogeneity, and reducing communication overhead. Extensive experiments demonstrate that FedCKMS outperforms state-of-the-art baselines by 3-10% in accuracy.

Abstract: In realworld scenarios, complex data such as multispectral images and multi-frame videos inherently exhibit robust low-rank property. This property is vital for multi-dimensional inverse problems, such as tensor completion, spectral imaging reconstruction, and multispectral image denoising. Existing tensor singular value decomposition (t-SVD) definitions rely on hand-designed or pre-given transforms, which lack flexibility for defining tensor nuclear norm (TNN). The TNN-regularized optimization problem is solved by the singular value thresholding (SVT) operator, which leverages the t-SVD framework to obtain the low-rank tensor. However, it's quite complicated to introduce SVT into deep neural network due to the numerical instability problem in solving the derivatives of the eigenvectors. In this paper, we introduce a novel data-driven generative low-rank t-SVD model based on the learnable orthogonal transform, which can be naturally solved under its representation. Prompted by the linear algebra theorem of the Householder transformation, our learnable orthogonal transform is achieved by constructing an endogenously orthogonal matrix adaptable to neural networks, optimizing it as arbitrary orthogonal matrices. Additionally, we propose a low-rank solver as a generalization of SVT, which utilizes an efficient representation of generative networks to obtain low-rank structures. Extensive experiments highlight its significant restoration enhancements.

Abstract: Federated Learning (FL) trains a shared model using data and computation power on distributed agents coordinated by a central server. Decentralized FL (DFL) utilizes local model exchange and aggregation between agents to reduce the communication and computation overheads on the central server. However, when agents are mobile, the communication opportunity between agents can be sporadic, largely hindering the convergence and accuracy of DFL. In this paper, we propose Cached Decentralized Federated Learning (CachedDFL) to investigate delay-tolerant model spreading and aggregation enabled by model caching on mobile agents. Each agent stores not only its own model, but also models of agents encountered in the recent past. When two agents meet, they exchange their own models as well as the cached models. Local model aggregation utilizes all models stored in the cache. We theoretically analyze the convergence of Cached-DFL, explicitly taking into account the model staleness introduced by caching. We design and compare different model caching algorithms for different DFL and mobility scenarios. We conduct detailed case studies in a vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and model convergence. In our experiments, Cached-DFL converges quickly, and significantly outperforms DFL without caching.

Abstract: Multivariate time series forecasting is crucial across various industries, where accurate extraction of complex periodic and trend components can significantly enhance prediction performance. However, existing models often struggle to capture these intricate patterns. To address these challenges, we propose FilterTS, a novel forecasting model that utilizes specialized filtering techniques based on the frequency domain. FilterTS introduces a Dynamic CrossVariable Filtering Module, a key innovation that dynamically leverages other variables as filters to extract and reinforce shared variable frequency components across variables in multivariate time series. Additionally, a Static Global Filtering Module captures stable frequency components, identified throughout the entire training set. Moreover, the model is built in the frequency domain, converting time-domain convolutions into frequency-domain multiplicative operations to enhance computational efficiency. Extensive experimental results on eight real-world datasets have demonstrated that FilterTS significantly outperforms existing methods in terms of prediction accuracy and computational efficiency.

School of Computer Science and Engineering, Nanjing University of Science and Technology, School of Computer Science and Engineering, Nanjing University of Science and Technology, School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Nanjing University of Science and Technology, College of Computer Science, Sichuan University, School of Computer Science and Engineering, Nanjing University of Science and Technology, Southwest Automation Research Institute, China South Industries Group Corporation, School of National Defence Science and Technology, Southwest University of Science and Technology

Abstract: In recent years, anchor and hashbased multi-view clustering methods have gained attention for their efficiency and simplicity in handling large-scale data. However, existing methods often overlook the interactions among multi-view data and higher-order cooperative relationships during projection, negatively impacting the quality of hash representation in low-dimensional spaces, clustering performance, and sensitivity to noise. To address this issue, we propose a novel approach named Tensor-Interacted Projection and Cooperative Hashing for Multi-View Clustering(TPCH). TPCH stacks multiple projection matrices into a tensor, taking into account the synergies and communications during the projection process. By capturing higher-order multi-view information through dual projection and Hamming space, TPCH employs an enhanced tensor nuclear norm to learn more compact and distinguishable hash representations, promoting communication within and between views. Experimental results demonstrate that this refined method significantly outperforms state-of-the-art methods in clustering on five large-scale multi-view datasets. Moreover, in terms of CPU time, TPCH achieves substantial acceleration compared to the most advanced current methods.

Abstract: While transformer models have been highly successful, they are computationally inefficient. We observe that for each layer, the full width of the layer may be needed only for a small subset of tokens inside a batch and that the "effective" width needed to process a token can vary from layer to layer. Motivated by this observation, we introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a pertoken basis. An ACM consists of a sequence of learners that progressively refine the output of their preceding counterparts. An additional gating mechanism determines the optimal number of learners to execute for each token. We also propose a distillation technique to replace any pre-trained model with an "ACMized" variant. Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.

Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education School of Computer Science and Technology, Beijing Jiaotong University, Beijing, 100044, China, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education School of Computer Science and Technology, Beijing Jiaotong University, Beijing, 100044, China, College of Computer and Cyber Security, Hebei Normal University, Hebei, China, College of Science and Technology, Beijing Open University, Beijing, China, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education School of Computer Science and Technology, Beijing Jiaotong University, Beijing, 100044, China

Abstract: Incomplete MultiView Clustering (IMVC) has made significant progress by optimally merging multiple pre-specified incomplete views. Most existing IMVC algorithms operate under the assumption that view alignment is known, but in practice, the coupling information between views may be absent, thereby limiting the practical applicability of these methods. Being aware of this, we propose a novel IMVC method named Kernel cOupling And eLement imputAtion induced Multi-View Clustering (KOALA), which sufficiently explores the nonlinear relationship among features and optimally processes a group of kernels with missing and unaligned elements to simultaneously resolve multi-view clustering problem under both uncoupled and incomplete scenarios. Specifically, we first introduce a cross-kernel alignment learning strategy to reconstruct the coupling relationships among multiple kernels, which effectively captures high-order nonlinear relationships among samples and enhances alignment accuracy. Additionally, a low-rank tensor constraint is imposed on the optimizable alignment kernel tensor, facilitating the effective imputation of missing kernel elements by leveraging consistency information across views. Subsequently, we develop an alternative optimization approach with promising convergence to solve the resultant optimization problem. Extensive experimental results on various multi-view datasets demonstrate that the KOALA method achieves remarkable clustering performance.

Abstract: Graph neural networks (GNNs) have achieved impressive results in various graph learning tasks. Backdoor attacks pose a significant threat to GNNs, with a focus on dirtylabel attacks. However, these attacks often necessitate the inclusion of blatantly incorrect inputs into the training set, rendering them easily detectable through simple filtering. In response to this challenge, we introduce Clean-Label Graph Backdoor Attack (CGBA). The majority of features in the generated poisoned nodes align with their true labels, significantly enhancing the difficulty of detecting the attack. Firstly, leveraging the uncertainty inherent in the GNNs, we develop a low-budget strategy for selecting poisoned nodes. This approach focuses on nodes in the target class with uncertain and low-degree classifications, allowing for efficient attacks within a limited budget while mitigating the impact on other clean nodes. Secondly, we present an innovative strategy for generating feature triggers. By boosting the confidence of poisoned samples in the target class, this tactic establishes a robust association between the trigger and the target class, even without modifying the labels of poisoned nodes. Additionally, we incorporate two constraints to reduce disruption to the graph structure. In conclusion, comprehensive experimental results unequivocally showcase CGBA's exceptional attack performance across three benchmark datasets and four GNNs models. Notably, the attack targeting the GraphSAGE model attains a 100% success rate, accompanied by a marginal benign accuracy drop of no more than 0.5%.

Abstract: Multiobjective optimization (MOO) lies at the core of many machine learning (ML) applications that involve multiple, potentially conflicting objectives (e.g., multi-task learning, multi-objective reinforcement learning, among many others). Despite the long history of MOO, recent years have witnessed a surge in interest within the ML community in the development of gradient manipulation algorithms for MOO, thanks to the availability of gradient information in many ML problems. However, existing gradient manipulation methods for MOO often suffer from long training times, primarily due to the need for computing dynamic weights by solving an additional optimization problem to determine a common descent direction that can decrease all objectives simultaneously. To address this challenge, we propose a new and efficient algorithm called Periodic Stochastic Multi-Gradient Descent (PSMGD) to accelerate MOO. PSMGD is motivated by the key observation that dynamic weights across objectives exhibit small changes under minor updates over short intervals during the optimization process. Consequently, our PSMGD algorithm is designed to periodically compute these dynamic weights and utilizes them repeatedly, thereby effectively reducing the computational overload. Theoretically, we prove that PSMGD can achieve state-of-the-art convergence rates for strongly-convex, general convex, and non-convex functions. Additionally, we introduce a new computational complexity measure, termed backpropagation complexity, and demonstrate that PSMGD could achieve an objective-independent backpropagation complexity. Through extensive experiments, we verify that PSMGD can provide comparable or superior performance to state-of-the-art MOO algorithms while significantly reducing training time.

Abstract: Visual prompt, a pair of beforeand-after edited images, can convey indescribable imagery transformations and prosper in image editing. However, current visual prompt methods rely on a pretrained text-guided image-to-image generative model that requires a triplet of text, before, and after images for retraining over a text-to-image model. Such crafting triplets and retraining processes limit the scalability and generalization of editing. In this paper, we present a framework based on any single text-to-image model without reliance on the explicit image-to-image model thus enhancing the generalizability and scalability. Specifically, by leveraging the probability-flow ordinary equation, we construct a diffusion bridge to transfer the distribution between before-and-after images under the text guidance. By optimizing the text via the bridge, the framework adaptively textualizes the editing transformation conveyed by visual prompts into text embeddings without other models. Meanwhile, we introduce differential attention control during optimization, which disentangles the text embedding from the invariance of the before-and-after images and makes it solely capture the delicate transformation and generalize to edit various images. Experiments on real images validate competitive results on the generalization, contextual coherence, and high fidelity for delicate editing with just one image pair as the visual prompt.

Abstract: Spectral clustering requires the timeconsuming decomposition of the Laplacian matrix of the similarity graph, thus limiting its applicability to large datasets. To improve the efficiency of spectral clustering, a top-down approach was recently proposed, which first divides the data into several micro-clusters (granular-balls), then splits these micro-clusters when they are not ``compact'', and finally uses these micro-clusters as nodes to construct a similarity graph for more efficient spectral clustering. However, this top-down approach is challenging to adapt to unevenly distributed or structurally complex data. This is because constructing micro-clusters as a rough ball struggles to capture the shape and structure of data in a local range, and the simplistic splitting rule that solely targets ``compactness'' is susceptible to noise and variations in data density and leads to micro-clusters with varying shapes, making it challenging to accurately measure the similarity between them. To resolve these issues and improve spectral clustering, this paper first proposes to start from local structures to obtain micro-clusters, such that the complex structural information inside local neighborhoods is well captured by them. Moreover, by noting that Euclidean distance is more suitable for convex sets, this paper further proposes a data splitting rule that couples local density and data manifold structures, so that the similarities of the obtained micro-clusters can be easily characterized. A novel similarity measure between micro-clusters is then proposed for the final spectral clustering. A series of experiments based on synthetic and real-world datasets demonstrate that the proposed method has better adaptability to structurally complex data than granular-ball based methods.

City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China

Abstract: This paper studies lexicographic online learning within the framework of multiobjective stochastic linear bandits (MOSLB), where the agent aims to simultaneously maximize multiple objectives in a hierarchical manner. Previous literature has investigated lexicographic online learning in multiobjective multiarmed bandits, a special case of MOSLB. They provided a suboptimal algorithm whose regret bound is approximately O(T^(2/3)) based on a priority-based regret metric. In this paper, we propose an algorithm for lexicographic online learning in the MOSLB model, achieving an almost optimal regret bound of approximately O(dT^(1/2)) when evaluated by the general regret metric. Here, d is the dimension of arm vectors, and T is the time horizon. Our method introduces a new arm filter and a multiple trade-offs approach to effectively balance exploration and exploitation across different objectives. Experiments confirm the merits of our algorithms and provide compelling evidence to support our analysis.

Abstract: A market maker is a specialist who provides liquidity by continuously offering bid and ask quotes for a financial asset. The market maker’s objective is to maximize profit while avoiding the accumulation of a large position in the asset to control inventory risk. To achieve modelfree results, online learning has been applied to design market-making strategies that make no assumptions on the dynamics of the limit order book and asset price. However, existing work primarily focuses on profit rather than inventory risk. To address this limitation, this paper develops market-making strategies with inventory constraints within the online learning framework. To manage inventory risk, we propose two classes of market-making strategies with fixed bid-ask spreads that serve as reference strategies. Each reference strategy can ensure that the inventory remains under control, which enables the online learning algorithms designed for each class of reference strategies to satisfy inventory constraints. Different from the standard online learning model where the gain in each period is assumed to lie within a fixed bounded interval, the gain in our model depends on a state variable (i.e., the inventory size). Thus, a key challenge in analyzing the regret bounds is to bound the difference between the gains of any two reference strategies, which becomes significantly more complicated compared with scenarios without inventory constraints. By tackling these difficulties, we show that these algorithms achieve low regrets. Experimental results illustrate the superior performance of our algorithms in inventory risk control.

Abstract: Unsupervised graph alignment aims to find corresponding nodes across different graphs without supervision. Existing methods usually leverage the graph structure to aggregate features of nodes to find relations between nodes. However, the graph structure is inherently limited in pairwise relations between nodes without considering higherorder dependencies among multiple nodes. In this paper, we take advantage of the hypergraph structure to characterize higher-order structural information among nodes for better graph alignment. Specifically, we propose an optimal transport model to learn a hypergraph to capture complex relations among nodes, so that the nodes involved in one hyperedge can be adaptively based on local geometric information. In addition, inspired by the Dirichlet energy function of a hypergraph, we further refine our model to enhance the consistency between structural and feature information in each hyperedge. After that, we jointly leverage graphs and hypergraphs to extract structural and feature information to better model the relations between nodes, which is used to find node correspondences across graphs. We conduct experiments on several benchmark datasets with different settings, and the results demonstrate the effectiveness of our proposed method.

Abstract: The Information Bottleneck (IB) principle has emerged as a promising approach for enhancing the generalization, robustness, and interpretability of deep neural networks, demonstrating efficacy across image segmentation, document clustering, and semantic communication. Among IB implementations, the IB Lagrangian method, employing Lagrangian multipliers, is widely adopted. While numerous methods for the optimizations of IB Lagrangian based on variational bounds and neural estimators are feasible, their performance is highly dependent on the quality of their design, which is inherently prone to errors. To address this limitation, we introduce Structured IB, a framework for investigating potential structured features. By incorporating auxiliary encoders to extract missing informative features, we generate more informative representations. Our experiments demonstrate superior prediction accuracy and taskrelevant information preservation compared to the original IB Lagrangian method, even with reduced network size.

Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Posttraining quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit type on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths. (2) the bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit). (3) an innovative quantization acceleration framework that reconstructs the quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, gets rid of the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component bit width gain into actual acceleration gain, maximizing performance under mixed precision(e.g., W6A6, W2A8). Based on W2*A8 quantization configuration on LLaMA-7B model, it achieved a WikiText2 perplexity of 7.59 (2.17⬇ vs 9.76 in AffineQuant). Compared to SmoothQuant, we realized 1.6x acceleration improvement and 2.7x memory compression gain.

Abstract: Precise measurements from sensors are crucial, but data is usually collected from lowcost, low-tech systems, which are often inaccurate. Thus, they require further calibrations. To that end, we first identify three requirements for effective calibration under practical low-tech sensor conditions. Based on the requirements, we develop a model called TESLA, Transformer for effective sensor calibration utilizing logarithmic-binned attention. TESLA uses a high-performance deep learning model, Transformers, to calibrate and capture non-linear components. At its core, it employs logarithmic binning, to minimize attention complexity. TESLA achieves consistent real-time calibration, even with longer sequences and finer-grained time series in hardware-constrained systems. Experiments show that TESLA outperforms existing novel deep learning and newly crafted linear models in accuracy, calibration speed, and energy efficiency.

School of Cyber Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information of Ministry of Education of China, Nanjing, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information of Ministry of Education of China, Nanjing, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information of Ministry of Education of China, Nanjing, China Purple Mountain Laboratories, Nanjing, China, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information of Ministry of Education of China, Nanjing, China Purple Mountain Laboratories, Nanjing, China, School of Cyber Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information of Ministry of Education of China, Nanjing, China Purple Mountain Laboratories, Nanjing, China Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education

Abstract: With the rapid development of the Internet, the information dissemination paradigm has changed and the efficiency has been improved greatly. While this also brings the quick spread of fake news and leads to negative impacts on cyberspace. Currently, the information presentation formats have evolved gradually, with the news formats shifting from texts to multimodal contents. As a result, detecting multimodal fake news has become one of the research hotspots. However, multimodal fake news detection research field still faces two main challenges: the inability to fully and effectively utilize multimodal information for detection, and the low credibility or static nature of the introduced external information, which limits dynamic updates. To bridge the gaps, we propose ERICFND, an external reliable information-enhanced multimodal contrastive learning framework for fake news detection. ERIC-FND strengthens the representation of news contents by entity-enriched external information enhancement method. It also enriches the multimodal news information via multimodal semantic interaction method where the multimodal constrative learning is employed to make different modality representations learn from each other. Moreover, an adaptive fusion method is taken to integrate the news representations from different dimensions for the eventual classification. Experiments are done on two commonly used datasets in different languages, X (Twitter) and Weibo. Experiment results demonstrate that our proposed model ERIC-FND outperforms existing state-of-the-art fake news detection methods under the same settings.

College of CSSE, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, College of CSSE, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, College of Intelligence and Computing, Tianjin University, School of Automation, Nanjing University of Posts and Telecommunications, College of CSSE, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University

Abstract: The increasing number of presentation attacks on reliable face matching has raised concerns and garnered attention towards face antispoofing (FAS). However, existing methods for FAS modeling commonly fuse multiple visual modalities (e.g., RGB, Depth, and Infrared) in a straightforward manner, disregarding latent feature gaps that can hinder representation learning. To address this challenge, we propose a novel multimodal FAS framework (mmFAS) that focuses on explicit alignment and fusion of latent features across different modalities. Specifically, we develop a multimodal alignment module to alleviate the latent feature gap by using instance-level contrastive learning and class-level matching simultaneously. Further, we explore a new switch-attention based fusion module to automatically aggregate complementary information and control model complexity. To evaluate the anti-spoofing performance more effectively, we adopt a challenging yet meaningful cross-database protocol involving four benchmark multimodal FAS datasets to simulate realworld scenarios. Extensive experimental results demonstrate the effectiveness of mmFAS in improving the accuracy of FAS systems, outperforming 10 representative methods.

Abstract: Cryoelectron microscopy (cryo-EM) has revolutionized the field of structural biology, determining structures of large protein machines and sharpening the understanding of fundamental biological processes. Despite cryo-EM’s unique capacity to discover novel proteins from unpurified samples and reveal the intricate structures of protein complexes within native cellular environments, the advancement of protein identification methods for cryo-EM lags behind. Without prior knowledge, such as sequence, protein identification from low-resolution density maps remains challenging. Here we introduce CryoDomain, an innovative method for identifying protein domains — conserved constituent units of proteins — from low-resolution cryo-EM density maps without requiring prior knowledge of protein sequences. CryoDomain leverages cross-modal alignment to correlate cryo-EM density maps with atomic structures, transferring the knowledge learned on a large atomic structure dataset to a sparse density map dataset. On two protein domain benchmarks constructed from CATH and SCOPe, CryoDomain significantly outperforms the state-of-the-art methods for domain identification from low-resolution density maps. CryoDomain liberates structural biologists from the tedious tasks of density inspection and database searching during protein identification. It has the potential to extend the border of unbiased structure discovery and cellular landscape investigation using cryo-EM.

Abstract: Weather forecasting is a crucial task for meteorologic research, with direct social and economic impacts. Recently, datadriven weather forecasting models based on deep learning have shown great potential, achieving superior performance compared with traditional numerical weather prediction methods. However, these models often require massive training data and computational resources. In this paper, we propose EWMoE, an effective model for accurate global weather forecasting, which requires significantly less training data and computational resources. Our model incorporates three key components to enhance prediction accuracy: 3D absolute position embedding, a core Mixture-of-Experts (MoE) layer, and two specific loss functions. We conduct our evaluation on the ERA5 dataset using only two years of training data. Extensive experiments demonstrate that EWMoE outperforms current models such as FourCastNet and ClimaX at all forecast time, achieving competitive performance compared with the state-of-the-art models Pangu-Weather and GraphCast in evaluation metrics such as Anomaly Correlation Coefficient (ACC) and Root Mean Square Error (RMSE). Additionally, ablation studies indicate that applying the MoE architecture to weather forecasting offers significant advantages in improving accuracy and resource efficiency.

Abstract: As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ridehailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficulties for passengers in certain areas, while many drivers in other areas are unable to secure orders, leading to a decline in the overall quality of urban transportation services. To address these issues, this paper introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs to capture hierarchical traffic states, and learns a dynamic reward function that accounts for individual driving behaviors. The framework further integrates a GPT model trained with a custom loss function to enable high-precision predictions and optimize dispatching policies in real-world scenarios. Experiments conducted on two real-world datasets demonstrate that GARLIC effectively aligns with driver behaviors while reducing the empty load rate of vehicles.

School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China Guangdong Provincial Key Laboratory of Mathematical and Neural Dynamical Systems, Dongguan, 523000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China School of Medicine, The Chinese University of HongKong (Shenzhen), Shenzhen, 518000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China College of Business, City University of Hong Kong, Hong Kong, 999077, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China Guangdong Provincial Key Laboratory of Mathematical and Neural Dynamical Systems, Dongguan, 523000, China, School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China

Abstract: Spatial multimodal omics technology, highlighted by Nature Methods as an advanced biological technique in 2023, plays a critical role in resolving biological regulatory processes with spatial context. Recently, graph neural networks based on K-nearest neighbor (KNN) graphs have gained prominence in spatial multi-modal omics methods due to their ability to model semantic relations between sequencing spots. However, the fixed KNN graph fails to capture the latent semantic relations hidden by the inevitable data perturbations during the biological sequencing process, resulting in the loss of semantic information. In addition, the common lack of spot annotation and class number priors in practice further hinders the optimization of spatial multi-modal omics models. Here, we propose a novel spatial multi-modal omics resolved framework, termed Prototype-aware Graph Adaptative Aggregation for Spatial Multi-modal Omics Analysis (PRAGA). PRAGA constructs a dynamic graph to capture latent semantic relations and comprehensively integrate spatial information and feature semantics. The learnable graph structure can also denoise perturbations by learning cross-modal knowledge. Moreover, a dynamic prototype contrastive learning is proposed based on the dynamic adaptability of Bayesian Gaussian Mixture Models to optimize the multi-modal omics representations for unknown biological priors. Quantitative and qualitative experiments on simulated and real datasets with 7 competing methods demonstrate the superior performance of PRAGA.

Abstract: With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated due to increased accessibility and unrestricted model training on massive data. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pretrained on the poisoned data. Backdoored LLMs behave innocuously for normal queries and generate harmful responses when the backdoor trigger is activated. Despite significant efforts paid to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs. We initially propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle scenarios where trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike other works that assume access to cleanly trained models, our safety-enhanced LLMs are able to revoke backdoors without any reference. Consequently, our safety-enhanced LLMs no longer produce targeted responses when the backdoor triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capability.

Abstract: Detecting and grounding multimodal media manipulation aims to categorize the type and localize the region of manipulation for image-text pairs in both two modalities. Existing methods have not sufficiently explored the intrinsic properties of the manipulated images, which contain both forgery and content features, leading to inefficient utilization. To address this problem, we propose an Image-Driven Decoupled Sequential Framework (IDseq), designed to decouple image features and rationally integrate them to accomplish different sub-tasks effectively. Specifically, IDseq employs two specially designed disentangled losses to guide the disentangled learning of forgery and content features. To efficiently leverage these features, we propose a Decoupled Image Manipulation Decoder (DIMD) that processes image tasks within a decoupled schema. We mitigate their exclusive competition by separating the image tasks into forgery-relevant and content-relevant components and training them without gradient interaction. Additionally, we utilize content features enhanced by the proposed Manipulation Indicator Generator (MIG) for the text tasks, which provide the maximal visual information as a reference while eliminating interference from unverified image data. Extensive experiments show the superiority of our IDseq, where it notably outperforms SOTA methods on the fine-grained classification by 3.8% in mAP and the forgery face grounding by 8.7% in IoUmean, even 1.3% in F1 on the most challenging manipulated text grounding.

School of Computer Science and Engineering, Central South University, China, School of Computer Science and Engineering, Central South University, China, School of Software and Microelectronics, Peking University, China, School of Computer Science and Engineering, Central South University, China, School of Computer Science and Engineering, Central South University, China, Department of Computer Science, Loughborough University, U.K., Centre for Interdisciplinary Methodologies, University of Warwick, U.K., Department of Computer Science, Loughborough University, U.K.

Abstract: Unauthorised face recognition (FR) systems have posed significant threats to digital identity and privacy protection. To alleviate the risk of compromised identities, recent makeup transferbased attack methods embed adversarial signals in order to confuse unauthorised FR systems. However, their major weakness is that they set up a fixed image unrelated to both the protected and the makeup reference images as the confusion identity, which in turn has a negative impact on both attack success rate and visual quality of transferred photos. In addition, the generated images cannot be recognised by authorised FR systems once attacks are triggered. To address these challenges, in this paper, we propose a Recoverable Makeup Transferred Generative Adversarial Network (RMT-GAN) which has the distinctive feature of improving its image-transfer quality by selecting a suitable transfer reference photo as the target identity. Moreover, our method offers a solution to recover the protected photos to their original counterparts that can be recognised by authorised systems. Experimental results demonstrate that our method provides significantly improved attack success rates while maintaining higher visual quality compared to state-of-the-art makeup transfer-based adversarial attack methods. Our code and supplementary materials are available on Github.

School of Information and Communication Engineering, University of Electronic Science and Technology of China, School of Information and Communication Engineering, University of Electronic Science and Technology of China, School of Information and Communication Engineering, University of Electronic Science and Technology of China, College of Computer Science, Sichuan University, Institude for Infocomm Research, Agency for Science, Technology and Research (ASTAR), School of Information and Communication Engineering, University of Electronic Science and Technology of China

Abstract: We address the challenge of WiFibased temporal activity detection and propose an efficient Dual Pyramid Network that integrates Temporal Signal Semantic Encoders and Local Sensitive Response Encoders. The Temporal Signal Semantic Encoder splits feature learning into high and low-frequency components, using a novel Signed Mask-Attention mechanism to emphasize important areas and downplay unimportant ones, with the features fused using ContraNorm. The Local Sensitive Response Encoder captures fluctuations without learning. These feature pyramids are then combined using a new cross-attention fusion mechanism. We also introduce a dataset with over 2,114 activity segments across 553 WiFi CSI samples, each lasting around 85 seconds. Extensive experiments show our method outperforms challenging baselines.

Abstract: Marvelous advances have been exhibited in recent document tampering localization (DTL) systems. However, confronted with corrupted tampered document images, their vulnerability is fatal in realworld scenarios. While robustness against adversarial attack has been extensively studied by adversarial training (AT), the robustness on natural corruptions remains under-explored for DTL. In this paper, to overcome forensic dependency, we propose the adversarial forensic regularization (AFR) based on min-max optimization to improve robustness. Specifically, we adopt mutual information (MI) to represent forensic dependency between two random variable over tampered and authentic pixels spaces, where the MI can be approximated by Jensen-Shannon-Divergence (JSD) with empirical sampling. To further enable a trade-off between predictive representations in clean tampered document pixels and robust ones in corrupted pixels, an additional regularization term is formulated with divergence between clean and perturbed pixels distribution (DDR). Following min-max optimization framework, our method can also work well against adversarial attacks. To evaluate our proposed method, we collect a dataset (i.e., TSorie-CRP) for evaluating robustness against natural corruptions in real scenarios. Extensive experiments demonstrate the effectiveness of our method against natural corruptions. Without any surprise, our method also achieves good performance against adversarial attack on DTL benchmark datasets.

Abstract: Missense mutations could affect the LiquidLiquid Phase Separation (LLPS) propensity of proteins and lead to aberrant phase-separating behaviours, which are recently found to be associated with many diseases including Alzheimer's and cancer. However, the regulatory role of mutations in LLPS remains unclear due to challenges in accurately characterizing the LLPS ability of mutants, including the high similarity in features, lack of labeled data, and vast amounts of data involved. To bridge this gap and facilitate the discovery of therapeutic strategies, we propose the first machine learning-based guider for protein phase-separating behaviour alteration, PScalpel. PScalpel leverages both structural information and an auxiliary tasks-based graph contrastive learning framework to distinguish the mutants’ LLPS ability, and incorporates a genetic algorithms-based recommendation method to identify mutants with desired LLPS properties. Comprehensive computational and biological experiments validate the effectiveness of PScalpel as a versatile tool for guiding alterations in protein phase separation behavior.

Abstract: For an extensive period, Vision Transformers (ViTs) have been deemed unsuitable for attaining robust performance on smallscale datasets, with WideResNet models maintaining dominance in this domain. While WideResNet models have persistently set the state-of-the-art (SOTA) benchmarks for robust accuracy on datasets such as CIFAR-10 and CIFAR-100, this paper challenges the prevailing belief that only WideResNet can excel in this context. We pose the critical question of whether ViTs can surpass the robust accuracy of WideResNet models. Our results provide a resounding affirmative answer. By employing ViT, enhanced with data generated by a diffusion model for adversarial training, we demonstrate that ViTs can indeed outshine WideResNet in terms of robust accuracy. Specifically, under the Infty-norm threat model with epsilon = 8/255, our approach achieves robust accuracies of 74.97% on CIFAR-10 and 44.07% on CIFAR-100, representing improvements of +3.9% and +1.4%, respectively, over the previous SOTA models. Notably, our ViT-B/2 model, with 3 times fewer parameters, surpasses the previously best-performing WRN-70-16. Our achievement opens a new avenue, suggesting that future models employing ViTs or other novel efficient architectures could eventually replace the long-dominant WRN models.

School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China Center on Frontiers of Computing Studies, Peking University, Beijing, China Peking University Information Technology Institute (Tianjin Binhai), School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, Department of Computing, The Hong Kong Polytechnic University, Hong Kong S.A.R., School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China Nanhu Laboratory, Jiaxing, China, Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China National Engineering Research Center For Software Engineering, Peking University, Beijing, China Peking University Information Technology Institute (Tianjin Binhai), School of Computer Science and School of Software & Microelectronics, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China

Abstract: Exploring the correlations between medical features is essential for extracting patient health patterns from electronic health records (EHR) data, and strengthening medical predictions and decisionmaking. To constrain the hypothesis space of pure data-driven deep learning in the context of limited annotated data, a common trend is to incorporate external knowledge, especially knowledge priors related to personalized health contexts, to optimize model training. However, most existing methods lack flexibility and are constrained by the uncertainties brought about by fixed feature correlation priors. In addition, in utilizing knowledge, these methods overlook the knowledge informative for personalized healthcare. To this end, we propose DearLLM, a novel and effective framework that leverages feature correlations deduced by large language models (LLMs) to enhance personalized healthcare. Concretely, DearLLM captures and learns quantitative correlations between medical features by calculating the conditional perplexity of LLMs’ deduction based on personalized patient backgrounds. Then, DearLLM enhances healthcare predictions by emphasizing knowledge that carries unique patient information through a feature-frequency-aware graph pooling method. Extensive experiments on two real-world benchmark datasets show significant performance gains brought by DearLLM. Furthermore, the discovered findings align well with medical literature, offering meaningful clinical interpretations.

Abstract: Recent advancements in textto-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also cause significant security challenges when maliciously misused. Therefore, there is an urgent need to detect these synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying the inconsistencies of phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to features extracted by pre-trained audio models on previously unseen deepfake datasets, we demonstrate that deepfake samples often exhibit phoneme-level inconsistencies when compared to genuine speech. To further enhance detection accuracy, we propose a deepfake detector that uses a graph attention network to model the temporal dependencies of phoneme-level features. Additionally, we introduce a random phoneme substitution augmentation technique to increase feature diversity during training. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.

Abstract: DrugDrug Interaction (DDI) prediction has attracted considerable attention in designing multi-drug combination strategies and avoiding adverse reactions. Notably, Artificial Intelligence (AI)-driven DDI prediction methods have emerged as a pivotal research paradigm. However, most AI-driven DDI prediction methods fall short in exploring intra-molecular motifs, and heavily rely on the overly idealized assumption of the complete inter-molecular topology, limiting their expressive capacities. To this end, we propose a Motif-Oriented representation learning with TOpology Refinement for DDI prediction, namely MOTOR, to exploit both the multi-granularity motif information and the topological structure of DDI networks. Specifically, MOTOR effectively captures motif internal structures, motif local contexts, and motif global semantics. Furthermore, MOTOR employs an iterative learning strategy to continuously refine the DDI topology and optimize the corresponding drug representations. Extensive experimental results demonstrate that MOTOR exhibits superior performance with interpretable insights in DDI prediction tasks across three real-world datasets, thereby opening up new avenues in AI-driven DDI prediction.

Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, National Engineering Research Center for Software Engineering, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, Institute of Geriatrics&National Clinical Research Center of Geriatrics Disease, Chinese PLA General Hospital, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China

Abstract: Coarsegrained (CG) molecular dynamics of proteins is a preferred approach to studying large molecules on extended time scales by condensing the entire atomic model into a limited number of pseudo-atoms and preserving the thermodynamic properties of the system. However, the significantly increased efficiency impedes the analysis of substantial physicochemical information, since high-resolution atomic details are sacrificed to accelerate simulation. In this paper, we propose LatCPB, a generative approach based on diffusion that enables high-resolution backmapping of CG proteins. Specifically, our model encodes an all-atom into discrete latent embeddings, aligned with learnable multimodal discrete priors for circumventing posterior collapse and maintaining the discrete properties of the protein sequence. During the generation, we further design a latent diffusion process within the continuous latent space due to the potential stochastics in the data. Moreover, LatCPB performs a contrastive learning strategy in latent space to separate feature representations of various molecules and conformations of the same molecule, thus enhancing the comprehension of molecular representational diversity. Experimental results demonstrate that LatCPB is able to backmap CG proteins effectively and achieve outstanding performance.

Abstract: Materials science text mining (MSTM), involving tasks like property extraction and synthesis action retrieval, is pivotal for advancing research by deriving critical insights from scientific literature. Descriptors, serving as essential task labels, often vary in meaning depending on researchers' usage purposes across different mining tasks. (e.g., 'Material' can refer to both synthesis components and participants in fuel cell experiment). This meaning difference makes it difficult for existing methods, finetuned to specific task, to handle the same descriptors in other tasks. To overcome above limitation, we propose MatDuck, a simple and effective approach for Zero-Shot MSTM by evoking material knowledge within Large Language Models (LLMs). Specifically, inspired by the Duck Typing principles in programming languages, we present a ClassDefinition-Style Descriptor generation method that evokes task-specific characteristics to address usage variation. Subsequently, we introduce code-style in-context learning for zero-shot tasks, reframing them into code to leverage LLMs' proficiency in code understanding. Extensive experiments on eight benchmark datasets demonstrate that MatDuck, as a plug-and-play approach, significantly improves the Zero-Shot MSTM performance of LLMs by an average of 11.3% across seven tasks.

Abstract: Predicting the future impact of newly published articles is pivotal for advancing scientific discovery in an era of unprecedented scholarly expansion. This paper introduces a promising approach, leveraging the capabilities of LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Breaking away from traditional methods heavily reliant on external data, we propose finetuning the LLM to uncover the intrinsic semantic patterns shared by highly impactful articles from a vast collection of text-score pairs. These semantic features are further utilized to predict the proposed indicator, TNCSIsp, which incorporates favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also meticulously curated a dataset containing over 12,000 entries, each annotated with titles, abstracts, and their corresponding TNCSIsp values. Experimental results reveal an MAE of 0.216 and an NDCG@20 of 0.901, setting new benchmarks in predicting the impact of newborn articles. Finally, we present a real-world application example for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.

College of Computer Science and Software Engineering, Shenzhen University, Department of Computing, The Hong Kong Polytechnic University, Department of Computing, The Hong Kong Polytechnic University, College of Computer Science and Software Engineering, Shenzhen University, Department of Computing, The Hong Kong Polytechnic University, College of Computer Science and Software Engineering, Shenzhen University, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

Abstract: Recent years have witnessed the rise of Neuralenhanced Video Streaming (NeVS), which integrates neural restoration models into video codecs for higher compression-restoration performance. Despite its benefit, existing work has not well explored the full potential of NeVS paradigm, due to: (1) post-streaming restoration by decoder while lacking the proactive collaboration of encoder, (2) end-to-end optimization based on conventional rate-distortion theory, which has been verified that low distortion is not always a synonym for high perceptual quality, and (3) coupled design for domain-specific tasks that cannot generalize to various video codecs. Observing these limitations, our objective is not to incrementally present an improved restoration model. Instead, we focus on the encoder-decoder synergy, i.e., the codec, which is non-trivial since it inherently strikes the rate-distortion-perception trade-off of NeVS. Aiming at this target, we propose the Diffusion-enhanced Neural Codec (DeNC), a plug-and-play module for current NeVS paradigm, to significantly reduce the required bitrates while preserving high perceptual quality of restored videos. Our key design is twofold. First, DeNC improves the encoder's compression efficiency by simultaneously reducing the resolution and color bit-depth of frame referencing. Second, DeNC empowers the decoder with perception-oriented restoration capability by making its diffusion-based restoration process aware of the encoder's compression conditions. Real-world evaluations show that DeNC improves compression ratios with nearly an order of magnitude and achieves much higher restoration quality (e.g., 93+ VMAF and 23% higher MOS) over the latest baselines.

Abstract: Detecting fake news in short videos is crucial for combating misinformation. Existing methods utilize topic modeling and coattention mechanism, overlooking the modality heterogeneity and resulting in suboptimal performance. To address this issue, we introduce Text-Guided Fine-grained Counterfactual Inference for Short Video Fake News detection (TGFC-SVFN). TGFC-SVFN leverages modality bias removal and teacher-model-enhanced inter-modal knowledge distillation to integrate the heterogeneous modalities in short videos. Specifically, we use causality-based reasoning prompts guided text as teacher model, which then transfers knowledge to the video and audio student models. Subsequently, a multi-head attention mechanism is employed to fuse information from different modalities. In each module, we utilize fine-grained counterfactual inference based on a diffusion model to eliminate modality bias. Experimental results on publicly available fake short video news datasets demonstrate that our method outperforms state-of-the-art techniques.

Abstract: The task of oneshot face video re-enactment aims at generating target video of faces with the same identity of one source frame and facial deformation of the driving video. To achieve high quality generation, it is essential to precisely disentangle identity-related and identity-independent characteristics, meanwhile build expressive features keeping high-frequency facial details, which still remain unaddressed for existing approaches. To deal with these two challenges, we propose a two-stage generation model based on StyleGAN, whose key novel techniques lie in better disentangling identity and deformation codes in the latent space through an identity-based modeling and manipulating intermediate StyleGAN features at the second stage for augmenting facial details of the generating targets. To further improve identity consistency, a data augmentation method is introduced during training for enhancing the key features affecting identity such as hair and wrinkles. Extensive experimental results demonstrate the superiority of our approach compared to state-of-the-art methods.

Abstract: The advancement in multimodal research has increased focus on Emotion Recognition in Conversations (ERC), targeting accurately identifying emotional changes. Methods based on graph convolution can better capture the dynamic changes of emotions and improve the accuracy and robustness of emotion recognition. However, existing methods do not distinguish the interaction patterns of a conversation, which results in limiting their ability to model contextual emotional relationships. In this paper, we propose a Dynamic Interactive Bimodal HyperGraph Convolutional Networks (DIBHGCN), which creatively constructs two types of sub-hypergraphs, i.e., the monologic sub-hypergraph and the dialogic sub-hypergraph, for modeling emotion relationships of different interaction patterns. The monologic sub-hypergraph is used to explore the contextual consistent emotions during the speaker's monologue interactions, while the dialogic sub-hypergraph focuses on capturing the emotional transfers in the dialogic interactions. Meanwhile, the single window partitioning mechanism fails to accommodate the distinct emotional velocity variations across the two interaction patterns. Therefore, we set up dynamic windows in the monologic interactions to fully utilize the information of sentence nodes with consistent emotions, and we add fragment windows to the dialogic interactions to prevent information interference caused by frequent emotional transfers. The experimental results show that our proposed method outperforms existing methods on two benchmark multimodal ERC datasets.

Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Tencent Hunyuan, Jarvis Research Center, Tencent YouTu Lab, School of Computer Science and Engineering, Central South University, CoAI Group, DCST, IAI, BNRIST, Tsinghua University, CoAI Group, DCST, IAI, BNRIST, Tsinghua University, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University

Abstract: Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories™ are traditionally crafted by psychology experts under strict constraints to address these challenges but are costly and limited in diversity. As Large Language Models (LLMs) advance, there's an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in realtime with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is a challenging issue. To this end, we propose SS-GEN, a Social Story GENeration framework with LLMs. Firstly, we develop a constraint-driven sophisticated strategy named StarSow to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce quality assessment criteria to evaluate the effectiveness of these generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models with our curated high-quality dataset, achieving comparable results at lower costs and with simpler instruction and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, which we hope can encourage future research on special groups.

Abstract: The field of music generation has seen a surge of interest from both academia and industry, with innovative platforms such as Suno, Udio, and SkyMusic earning widespread recognition. However, the challenge of music infilling—modifying specific music segments without reconstructing the entire piece—remains a significant hurdle for both audiobased and symbolic-based models, limiting their adaptability and practicality. In this paper, we address symbolic music infilling by introducing the Collaborative Music Inpainter (CMI), an advanced human-in-the-loop (HITL) model for music infilling. The CMI features the Joint Embedding Predictive Autoregressive Generative Architecture (JEP-AGA), which learns the high-level predictive representations of the masked part that needs to be infilled during the autoregressive generative process, akin to how humans perceive and interpret music. The newly developed Dynamic Interaction Learner (DIL) achieves HITL by iteratively refining the infilled output based on user interactions alone, significantly reducing the interaction cost without requiring further input. Experimental results confirm CMI’s superior performance in music infilling, demonstrating its efficiency in producing high-quality music.

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Information Science and Technology, ShanghaiTech University, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences School of Information Science and Technology, ShanghaiTech University

Abstract: Spiking Neural Networks (SNNs) are biologically inspired models that process visual inputs over multiple time steps. However, they often struggle with limited feature discrimination along the temporal dimension due to inherent spatiotemporal invariance. This limitation arises from the redundant activation of certain regions and shared supervision for multiple time steps, constraining the network’s ability to adapt and learn diverse features. To address this challenge, we propose a novel TemporalSelf-Erasing (TSE) supervision method that dynamically adapts the learning regions of interest for different time steps. The TSE method operates by identifying highly activated regions from predictions across multiple time steps and adaptively suppressing them during model training, thereby encouraging the network to focus on less activated yet potentially informative regions. This approach not only enhances the feature discrimination capability of SNNs but also facilitates more effective multi-time-step inference by exploiting more semantic information. Experimental results on benchmark datasets demonstrate that our TSE method significantly improves the classification accuracy and robustness of SNNs.

Abstract: Multimodal emotion recognition is a crucial research area in the field of affective braincomputer interfaces. However, in practical applications, it is often challenging to obtain all modalities simultaneously. To deal with this problem, researchers focus on using cross-modal methods to learn multimodal representations with fewer modalities. However, due to the significant differences in the distribution of different modalities, it is challenging to enable any modality to fully learn multimodal features. To address this limitation, we propose a Multi-to-Single (M2S) emotion recognition model, leveraging contrastive learning and incorporating two innovative modules: 1) a spatial and temporal-sparse (STS) attention mechanism that enhances the encoders' ability to extract features from data; 2) a novel Multi-to-Multi Contrastive Predictive Coding (M2M CPC) that learns and fuses features across different modalities. In the final testing, we only use a single modality for emotion recognition, reducing the dependence on multimodal data. Extensive experiments on five public multimodal emotion datasets demonstrate that our model achieves the state-of-the-art performance in the cross-modal tasks and maintains multimodal performance using only a single modality.

Abstract: We present and release MIDIGPT, a generative system based on the Transformer architecture that is designed for computer-assisted music composition workflows. MIDI-GPT supports the infilling of musical material at the track and bar level, and can condition generation on attributes including: instrument type, musical style, note density, polyphony level, and note duration. In order to integrate these features, we employ an alternative representation for musical material, creating a time-ordered sequence of musical events for each track and concatenating several tracks into a single sequence, rather than using a single time-ordered sequence where the musical events corresponding to different tracks are interleaved. We also propose a variation of our representation allowing for expressiveness. We present experimental results that demonstrate that MIDI-GPT is able to consistently avoid duplicating the musical material it was trained on, generate music that is stylistically similar to the training dataset, and that attribute controls allow enforcing various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration and evaluation of MIDI-GPT into commercial products, as well as several artistic works produced using it.

Abstract: Spiking Neural Networks (SNNs) are seen as an energyefficient alternative to traditional Artificial Neural Networks (ANNs), but the performance gap remains a challenge. While this gap is narrowing through ANN-to-SNN conversion, substantial computational resources are still needed, and the energy efficiency of converted SNNs cannot be ensured. To address this, we present a unified training-free conversion framework that significantly enhances both the performance and efficiency of converted SNNs. Inspired by the biological nervous system, we propose a novel Adaptive-Firing Neuron Model (AdaFire), which dynamically adjusts firing patterns across different layers to substantially reduce the Unevenness Error - the primary source of error of converted SNNs within limited inference timesteps. We further introduce two efficiency-enhancing techniques: the Sensitivity Spike Compression (SSC) technique for reducing spike operations, and the Input-aware Adaptive Timesteps (IAT) technique for decreasing latency. These methods collectively enable our approach to achieve state-of-the-art performance with significant energy savings of up to 70.1%, 60.3%, and 43.1% on CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. Extensive experiments across 2D, 3D, event-driven classification tasks, object detection, and segmentation tasks, demonstrate the effectiveness of our method in various domains.

Abstract: Depression can be reflected by longterm human spatio-temporal facial behaviours. While human face videos recorded in real-world usually have long and variable lengths, existing video-based depression assessment approaches frequently re-sample/down-sample such videos to short and equal-length videos, or split each video into several equal-length segments, where segment-level spatio-temporal facial behaviours are suppressed as a vector-style representations for RNN-based long-term (video-level) modelling. Both strategies lead to crucial information loss and distortion. In this paper, we propose a novel graph-style data structure called Matrixial Graph and an effective Matrixial Graph Neural Network (MGNN) for face video-based depression assessment, which can directly and end-to-end model long-term depression-specific spatio-temporal facial cues from variable-length videos without resampling/splitting videos or suppressing video segments to vectors. Importantly, the nodes in our matrixial graph are capable of including matrices of different shapes, and thus nodes of a matrix graph can directly represent all frame-level 2D facial feature maps (or images themselves) of an entire video regardless of its length. Then, our MGNN is the first GNN that can jointly process matrixial graphs containing varying numbers of nodes, which further learns matrix-style edge features, thereby facilitating to explicit model video-level multi-scale spatio-temporal facial behaviours among matrixial graph nodes for depression assessment. Experiments show that the explicit spatio-temporal modeling on 2D facial feature maps, facilitated by our matrixial graph/MGNN, provided significant benefits, leading our approach to achieve new state-of-the-art performances on AVEC2013 and AVEC2014 datasets with large advantages.

Abstract: In safetycritical domains such as medical diagnostics and autonomous driving, single-image evidence is sometimes insufficient to reflect the inherent ambiguity of vision problems. Therefore, multiple plausible assumptions that match the image semantics may be needed to reflect the actual distribution of targets and support downstream tasks. However, balancing and improving the diversity and consistency of segmentation predictions under the high-dimensional output spaces and potential multimodal distributions is still challenging. This paper presents Hierarchical Self-Regulation Diffusion (HSRDiff), a unified framework that simulates joint probability distribution over entire labels. Our model self-regulates the balance between the two modes of predicting the label and noise in a novel ``differentiation to unification" pipeline and dynamically fits the optimal path to model the aleatoric uncertainty rooted in observations. In addition, we preserve the high-fidelity reconstruction of the delicate structure in images by leveraging the hierarchical multi-scale condition priors. We validate HSRDiff in three different semantic scenarios. Experimental results show that HSRDiff is superior to the comparison method with a considerable performance gap.

Abstract: Textto-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL-base, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPS v2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.

Abstract: Large language models (LLMs) have demonstrated remarkable performance in multimodal tasks even with frozen LLM Block and only a few trainable parameters. However, the underlying mechanisms of how LLMs enhance multimodal performance remains unclear. In this work, we focus on the phenomenon that ``Merely concatenating a frozen LLM block to the Vision Transformer (ViT) encoder can yield significant performance enhancements. Moreover, the choice of LLM block and insertion position can have a substantial impact, leading to varying degrees of improvement''. We analyze the optimization of the training process from the perspective of gradient dynamics and find that frozen LLM blocks act as gradient coherence rectifiers, aligning the gradients of different samples more closely during training. Furthermore, we demonstrate that the representation similarity between the inserted LLM block and the adjacent ViT block influences performance, with greater similarity tending to yield larger positive gains. Through these findings, we can justify the selection of suitable LLM blocks to be inserted at appropriate positions, and introduce additional gradient backpropagation paths by incorporating LLM blocks, could improve the performance of vanilla ViT through the rectification effect of gradient consistency during the training process, without the need to add LLM blocks during inference. Our experiments demonstrate the effectiveness of this strategy, making the practical application of the gradient rectification effect feasible.

Abstract: Domain generalization aims to learn a representation from the source domain, which can be generalized to arbitrary unseen target domains. A fundamental challenge for visual domain generalization is the domain gap caused by the dramatic style variation whereas the image content is stable. The realm of selective state space, exemplified by VMamba, demonstrates its global receptive field in representing the content. However, the way exploiting the domaininvariant property for selective state space is rarely explored. In this paper, we propose a novel Flow Factorized State Space model, dubbed as DGFamba, for visual domain generalization. To maintain domain consistency, we innovatively map the style-augmented and the original state embeddings by flow factorization. In this latent flow space, each state embedding from a certain style is specified by a latent probability path. By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of the style differences. Extensive experiments conducted on various visual domain generalization settings show its state-of-the-art performance.

School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China, GVC Lab, Great Bay University, Dongguan, China,, Meituan, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China Jinan Inspur Data Technology Co., Ltd., Jinan, China

Abstract: Benefiting from largescale pre-training of text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from the text description. Besides, given some reference images or videos, the parameter-efficient fine-tuning method, i.e. LoRA, can generate high-quality customized concepts, e.g., the specific subject or the motions from a reference video. However, combining the trained multiple concepts from different references into a single network shows obvious artifacts. To this end, we propose CustomTTT, where we can joint custom the appearance and the motion of the given video easily. In detail, we first analyze the prompt influence in the current video diffusion model and find the LoRAs are only needed for the specific layers for appearance and motion customization. Besides, since each LoRA is trained individually, we propose a novel test-time training technique to update parameters after combination utilizing the trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.

Abstract: Existing fewshot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits the applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross-domain few-shot medical image segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the characteristics of frequency domain similarity across different domains, we propose a Frequency-aware Matching Network (FAMNet), which includes two key components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta-learning phase: 1) intra-domain variance caused by the inherent support-query bias, due to the different appearances of organs and lesions, and 2) inter-domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module, and further mitigate the impact of inter-domain variance on the model's segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation models on three cross-domain datasets, achieving state-of-the-art performance in the CD-FSMIS task.

Abstract: Diffusionbased zero-shot image restoration and enhancement models have achieved great success in various tasks of image restoration and enhancement. However, directly applying them to video restoration and enhancement results in severe temporal flickering artifacts. In this paper, we propose the first framework for zero-shot video restoration and enhancement based on the pre-trained image diffusion model. By replacing the spatial self-attention layer with the proposed short-long-range (SLR) temporal attention layer, the pre-trained image diffusion model can take advantage of the temporal correlation between frames. We further propose temporal consistency guidance, spatial-temporal noise sharing, and an early stopping sampling strategy to improve temporally consistent sampling. Our method is a plug-and-play module that can be inserted into any diffusion-based image restoration or enhancement methods to further improve their performance. Experimental results demonstrate the superiority of our proposed method.

Abstract: Reconstructing highquality 3D models from sparse 2D images has garnered significant attention in computer vision. Recently, 3D Gaussian Splatting (3DGS) has gained prominence due to its explicit representation with efficient training speed and real-time rendering capabilities. However, existing methods still heavily depend on accurate camera poses for reconstruction. Although some recent approaches attempt to train 3DGS models without the Structure-from-Motion (SfM) preprocessing from monocular video datasets, these methods suffer from prolonged training times, making them impractical for many applications. In this paper, we present an efficient framework that operates without any depth or matching model. Our approach initially uses SfM to quickly obtain rough camera poses within seconds, and then refines these poses by leveraging the dense representation in 3DGS. This framework effectively addresses the issue of long training times. Additionally, we integrate the densification process with joint refinement and propose a coarse-to-fine frequency-aware densification to reconstruct different levels of details. This approach prevents camera pose estimation from being trapped in local minima or drifting due to high-frequency signals. Our method significantly reduces training time from hours to minutes while achieving more accurate novel view synthesis and camera pose estimation compared to previous methods.

Abstract: The firstin-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.

Shanghai Jiao Tong University Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory State Key Lab of CAD&CG, Zhejiang University, Shanghai Jiao Tong University Shanghai Artificial Intelligence Laboratory, State Key Lab of CAD&CG, Zhejiang University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, State Key Lab of CAD&CG, Zhejiang University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory

Abstract: 3D Gaussian Splatting (3DGS) has shown promising performance in novel view synthesis. Previous methods adapt it to obtaining surfaces of either individual 3D objects or within limited scenes. In this paper, we make the first attempt to tackle the challenging task of largescale scene surface reconstruction. This task is particularly difficult due to the high GPU memory consumption, different levels of details for geometric representation, and noticeable inconsistencies in appearance. To this end, we propose GigaGS, the first work for high-quality surface reconstruction for large-scale scenes using 3DGS. GigaGS first applies a partitioning strategy based on the mutual visibility of spatial regions, which effectively grouping cameras for parallel processing. To enhance the quality of the surface, we also propose novel multi-view photometric and geometric consistency constraints based on Level-of-Detail representation. In doing so, our method can reconstruct detailed surface structures. Comprehensive experiments are conducted on various datasets. The consistent improvement demonstrates the superiority of GigaGS.

Abstract: 3D Referring Expression Segmentation (3DRES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we designed a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and 3D-GRES tasks, respectively.

China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute, China Mobile(Zhejiang) Research & Innovation Institute

Abstract: Recent research on universal object detection aims to introduce language in a SoTA closedset detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.

School of Artificial Intelligence, Xidian University, China, Medical College, Tianjin University, Tianjin, China Peng Cheng Lab, Shenzhen, China, School of Artificial Intelligence, Xidian University, China, College of Intelligence and Computing, Tianjin University, Tianjin, China, School of Artificial Intelligence, Xidian University, China, Medical College, Tianjin University, Tianjin, China College of Intelligence and Computing, Tianjin University, Tianjin, China, Peng Cheng Lab, Shenzhen, China, Peng Cheng Lab, Shenzhen, China

Abstract: Blind image superresolution (blind SR) aims to restore a high-resolution (HR) image from a low-resolution (LR) image with unknown degradation. Many existing methods explicitly estimate degradation information from various LR images. However, in most cases, image degradations are independent of image content. Their estimations may be influenced by the image content resulting in inaccuracy. Unlike existing works, we design a dual-encoder for degradation representation (DEDR) to preclude the influence of image content from LR images. This benefits in extracting the intrinsic degradation representation more accurately. To the best of our knowledge, this paper is the first work that estimates degradation representation through filtering out image content. Based on the degradation representation extracted by DEDR, we present a novel framework, named degradation representation aware transform network (DRAT) for blind SR. We propose global degradation aware (GDA) blocks to propagate degradation information across spatial and channel dimensions, in which a degradation representation transform module (DRT) is introduced to render features degradation-aware, thereby enhancing the restoration of LR images. Extensive experiments are conducted on three benchmark datasets (including Gaussian 8, DIV2KRK, and real-world datasets) under large scaling factors with complex degradations. The experimental results demonstrate that DRAT surpasses state-of-the-art supervised kernel estimation and unsupervised degradation representation methods.

Abstract: Unified detection of digital and physical attacks in facial recognition systems has become a focal point of research in recent years. However, current multimodal methods typically ignore the intra-class and inter-class variability across different types of attacks, leading to degraded performance. To address this limitation, we propose MoAE-CR, a framework that effectively leverages class-aware information for improved attack detection. Our improvements manifest at two levels, i.e., the feature and loss level. At the feature level, we propose Mixture-of-Attack-Experts (MoAEs) to capture more subtle differences among various types of fake faces. At the loss level, we introduce Class Regularization (CR) through the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Extensive experiments on two unified physical-digital attack datasets demonstrate the state-of-the-art performance of the proposed method.

Abstract: Knowledge distillation (KD) has recently gained great success in the field of object detection. By transferring the knowledge of the spatial or channel domain from the teacher model to the student model, it allows for a more compact representation with minimal performance loss. Despite this progress, existing KD methods typically treat knowledge from spatial or channel domains independently, ignoring the exploitation of the mutual relationship between these domains. In this work, we first explore the connection between spatial and channel domains and find there exists a strong correlation between them, i.e. the salient channels tend to contain significant object regions in the spatial domain. Motivated by this observation, we propose DCSFKD, a novel Dynamic Channel-wise Spatial Feature Knowledge Distillation framework for object detection by fully exploiting both spatial and channel knowledge. Specifically, we introduce channel-wise spatial feature distillation and global channel attention distillation, using information from both domains to improve the accuracy of the student network. Experiments demonstrate that our DCSF-KD outperforms existing detection methods on both homogeneous and heterogeneous teacher-student network pairs. For example, when using the MaskRCNN-Swin detector as the teacher, and based on RetinaNet and FCOS with ResNet-50 on MS COCO, our DCSF-KD can achieve 41.9% and 44.1% mAP, respectively.

Abstract: Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusionbased models. However, this method still requires numerous function evaluations in the sampling process. To address these limitations, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is a pioneer in achieving consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset.

Abstract: Image retouching aims to enhance the visual quality of photos. Considering the different aesthetic preferences of users, the target of retouching is subjective. However, current retouching methods mostly adopt deterministic models, which not only neglects the style diversity in the expertretouched results and tends to learn an average style during training, but also lacks sample diversity during inference. In this paper, we propose a diffusion-based method, named DiffRetouch. Thanks to the excellent distribution modeling ability of diffusion, our method can capture the complex fine-retouched distribution covering various visual-pleasing styles in the training data. Moreover, four image attributes are made adjustable to provide a user-friendly editing mechanism. By adjusting these attributes in specified ranges, users are allowed to customize preferred styles within the learned fine-retouched distribution. Additionally, the affine bilateral grid and contrastive learning scheme are introduced to handle the problem of texture distortion and control insensitivity respectively. Extensive experiments have demonstrated the superior performance of our method on visually appealing and sample diversity.

Abstract: Generative image watermarking inserts secret watermarks into generated images and plays an important role in tracing the usages of generative models. For watermarking of diffusion models, inversionbased framework emerges as an effective approach. Such framework employs a robust mechanism to embed the watermark into the starting latent before ``forward sampling'', thereby generating images with the implicit watermark. During watermark detection, inversion techniques are employed to reverse the process and obtain the watermarked latent, followed by further extraction. The robustness of this technique hinges primarily on the embedding mechanism and inversion accuracy. Previous methods predominantly focused on enhancing the robustness of the embedding mechanism but overlooked the reduction of the inversion errors. However, our results show that inversion error will significantly affect the overall robustness. Therefore, in this paper, we delve into the inversion error aspect and propose CoSDA, a compensation sampling and drift alignment-based approach. The inversion error primarily accumulated during two stages: the internal error incurred by the algorithm, and the inevitable external noise. We observe that the main source of internal error comes from the mismatch in conditions (e.g. prompt, guidance scale) between forward and backward sampling processes. Therefore, we propose a compensation-based forward sampling, compensating for certain mismatch conditions and reducing the inversion error caused by the mismatch. Addressing external error caused by inevitable image distortions (e.g. JPEG compression), we introduce a drift-alignment approach, where a neural network is trained adversarially to restore the original watermarked latent from the distorted counterpart. Experimental results show that CoSDA effectively enhances watermark robustness while maintaining the visual quality of generated images.

Abstract: Manipulating human poses based on natural language is an emerging research field that has traditionally focused on coarse commands such as “walking” or “dancing.” However, finegrained pose manipulation, like instructing “put both hands in front of the stomach,” remains underexplored. In this paper, we introduce PoseLLaVA, a pioneering model that integrates SMPL-based pose representations into the multimodal LLaVA framework. Through a novel pose encoder decoder mechanism, PoseLLaVA achieves precise alignment between pose, textual, and visual modalities, enabling detailed control over pose manipulation tasks. PoseLLaVA excels in three key tasks: pose estimation, generation, and adjustment, all driven by detailed language instructions. We further introduce a fine-grained pose adjustment dataset PosePart, where each sample contains an initial pose and a target pose, along with specific instructions for adjustments, mimicking the guidance a human instructor might provide. Extensive evaluations across these tasks demonstrate significant improvements over existing methods, including metrics such as MPJPE and PA-MPJPE, which measure SMPL reconstruction errors, and Recall rates, which assess feature alignment across modalities. Specifically, PoseLLaVA reduces MPJPE errors by more than 20% compared to state-of-the-art methods in pose adjustment and generation tasks. Additionally, we demonstrate the feasibility of combining PoseLLaVA with generative models, such as diffusion, for pose image editing, highlighting its potential applications in language-controlled pose manipulation.

Abstract: Multiple Object Tracking (MOT) is a fundamental task in computer vision. Existing methods utilize motion information or appearance information to perform object tracking. However, these algorithms still struggle with special circumstances, such as occlusion and blurring in complex scenes. Inspired by the fact that people can pinpoint objects through verbal descriptions, we explore performing longterm robust tracking using semantic features of objects. Motivated by the success of the multimodal foundation model in text-image alignment, we reconsider the appearance feature extraction module in MOT and propose a Foundation model Driven multi-object tracker (FDTracker). Specifically, we propose a two-stage trained appearance feature extractor. In the first stage, using a single image of the object as input, the model could capture the attributes of objects with the assistance of natural language instructions. In the second stage, using a sequence of images of objects as input, the model learns how to use these attributes to distinguish between different objects and connect the same object at different times. Finally, for coordinating appearance and motion information, we propose a reasonable combined strategy, which better facilitates trajectory assignment and reconnection. Extensive experiments on benchmarks demonstrate the robustness of FDTracker.

Abstract: Deep learning has excelled in medical image classification, but its clinical application is limited by poor interpretability. Capsule networks, known for encoding hierarchical relationships and spatial features, show potential in addressing this issue. Nevertheless, traditional capsule networks often underperform due to their shallow structures, and deeper variants lack hierarchical architectures, thereby compromising interpretability. This paper introduces a novel capsule network, ParseCaps, which utilizes the sparse axial attention routing and parse convolutional capsule layer to form a parsetree-like structure, enhancing both depth and interpretability. Firstly, sparse axial attention routing optimizes connections between child and parent capsules, as well as emphasizes the weight distribution across instantiation parameters of parent capsules. Secondly, the parse convolutional capsule layer generates capsule predictions aligning with the parse tree. Finally, based on the loss design that is effective whether concept ground truth exists or not, ParseCaps advances interpretability by associating each dimension of the global capsule with a comprehensible concept, thereby facilitating clinician trust and understanding of the model's classification results. Experimental results on three medical datasets show that ParseCaps not only outperforms other capsule network variants in classification accuracy and robustness, but also provides interpretable explanations, regardless of the availability of concept labels.

Abstract: Object counting is crucial for understanding the distribution of objects in different scenarios. Recently, many object counting networks have been designed to be more complex to achieve marginal improvements, leading to excessive time spent on model design. With the development of large models (LMs), various visual tasks can be accomplished by transferring pretrained weights from LMs and fine-tuning them. However, tens of millions of training data make the pre-training parameters of LMs not entirely necessary. Moreover, if unnecessary parameters in the large model are not removed, it may lead to decreased performance on the tasks to be transferred. Motivated by this, this paper proposes an Enhancing low-Rank adaptation with Recoverability-based Reinforcement Pruning (E3RP) method to balance the complexity of large model and the accuracy of counting tasks. Firstly, we design a new reward mechanism based on the feature similarity of large model before and after globally unstructured pruning of specific parameters. Additionally, we propose a Patch Query Flip Attention (PQFA) mechanism to align multi-scale features through bidirectional interaction of features. Finally, the parameters of large model are pruned utilizing the pruning rate autonomously determined by the reinforcement learning network, and the large model is fine-tuned to counting tasks by a simple decoding head. Extensive experiments on four cross-scenario datasets demonstrate that the proposed method can remove redundant network parameters while ensuring network performance, with a maximum reduction of up to 63%.

Abstract: Multicamera 3D object detection aims to detect and localize objects in 3D space using multiple cameras, which has attracted more attention due to its cost-effectiveness trade-off. However, these methods often struggle with the lack of accurate depth estimation caused by the natural weakness of the camera in ranging. Recently, multi-modal fusion and knowledge distillation methods for 3D object detection have been proposed to solve this problem, which are time-consuming during the training phase and not friendly to memory cost. In light of this, we propose PromptDet, a lightweight yet effective 3D object detection framework motivated by the success of prompt learning in 2D foundation model. Our proposed framework, PromptDet, comprises two integral components: a general camera-based detection module, exemplified by models like BEVDet and BEVDepth, and a LiDAR-assisted prompter. The LiDAR-assisted prompter leverages the LiDAR points as a complementary signal, enriched with a minimal set of additional trainable parameters. Notably, our framework is flexible due to our prompt-like design, which can not only be used as a lightweight multi-modal fusion method but also as a camera-only method for 3D object detection during the inference phase. Extensive experiments on nuScenes validate the effectiveness of the proposed PromptDet. As a multi-modal detector, PromptDet improves the mAP and NDS by at most 22.8% and 21.1% with fewer than 2% extra parameters compared with the camera-only baseline. Without LiDAR points, PromptDet still achieves an improvement of at most 2.4% mAP and 4.0% NDS with almost no impact on camera detection inference time. We will release our code.

National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Biomedical Imaging Center, Peking University, University of Washington, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Biomedical Imaging Center, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

Abstract: 3D Gaussian Splatting (3DGS) has been proven to exhibit exceptional performance in reconstructing 3D scenes. However, the effectiveness of 3DGS heavily relies on sharp images, and fulfilling this requirement presents challenges in realworld scenarios particularly when utilizing fast-moving cameras. This limitation severely constrains the practical application of 3DGS and may compromise the feasibility of real-time reconstruction. To mitigate these challenges, we proposed Spike Gaussian Splatting (SpikeGS), the first framework that integrates the Bayer-pattern spike streams into the 3DGS pipeline to reconstruct 3D scenes captured by a fast-moving high temporal color spike camera in one second. With accumulation rasterization, interval supervision, and a special designed pipeline, SpikeGS realizes continuous spatiotemporal perception while extracts detailed structure and texture from Bayer-pattern spike stream which is unstable and lacks details. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of SpikeGS compared with existing spike-based and deblur 3D scene reconstruction methods.

Abstract: We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving that can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, which cannot handle large difference in viewpoint. Especially in autonomous driving scenarios, images are typically collected from a single lane. The limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduce a novel virtual lane generation module into GSS method to enable highquality lane switching even without a multi-lane dataset. Besides, we design a diffusion loss to supervise the generation of virtual lane images to further address the problem of data lacking in the virtual lanes. Finally, we also propose a depth refinement module to optimize depth estimation in the GSS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.

Abstract: While recent works have achieved great success on oneshot 3D common object generation, high quality and fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framework, ID-Sculpt, to generate high-quality 3D heads while preserving their identities. Our work incorporates the identity information of the portrait image into three parts: 1) geometry initialization, 2) geometry sculpting, and 3) texture generation stages. Given a reference portrait image, we first align the identity features with text features to realize ID-aware guidance enhancement, which contains the control signals representing the face information. We then use the canny map, ID features of the portrait image, and a pre-trained text-to-normal/depth diffusion model to generate ID-aware geometry supervision and 3D-GAN inversion is employed to generate ID-aware geometry initialization. Furthermore, with the ability to inject identity information into 3D head generation, we use ID-aware guidance to calculate ID-aware Score Distillation (ISD) for geometry sculpting. For texture generation, we adopt the ID Consistent Texture Inpainting and Refinement which progressively expands the view for texture inpainting to obtain an initialization UV texture map. We then use the id-aware guidance to provide image-level supervision for noisy multi-view images to obtain a refined texture map. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.

School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University

Abstract: In timelapse microscopy, inherent noise significantly limits imaging sensitivity and increases measurement uncertainty. Due to the scarcity of clean data, zero-shot approaches have emerged as highly data-efficient solutions for microscopy denoising. However, existing methods typically process video frames independently, resulting in long training times and issues such as temporal noise and over-smoothing. In this paper, we introduce MDSR-Zero, a zero-shot online learning method designed for plug-and-play noise suppression and super-resolution of microscopy videos. Our approach leverages an efficient online training strategy that reuses denoising models from previous frames. By treating the video as a continuous stream, our model significantly reduces training time and ensures temporally consistent denoising. Additionally, we propose a novel loss function tailored for denoising in the context of super-resolution, which enhances the detail in the denoised results. Extensive experiments on both synthetic and real-world noise demonstrate that our method achieves state-of-the-art performance among zero-shot denoising approaches and is competitive with self-supervised methods. Notably, our method can reduce training time by up to 10x compared to the previous SOTA method.

Abstract: The development of textto-image generative models has enabled the creation of images so realistic that distinguishing between AI-generated images and real photos is becoming a challenge. This progress offers new possibilities but also raises concerns over privacy, authenticity, and security. Detecting AI-generated images is crucial to prevent misuse. To assess the generalizability and robustness of AI-generated image detection, we present a large-scale dataset, referred to as WildFake. This dataset features cutting-edge image generators, a wide variety of generator categories, and generators for various applications, organized in a hierarchical framework. WildFake collects fake images from the open-source community, enriching its diversity with a broad range of image classes and image styles. Its design significantly improves the effectiveness of detection algorithms, making it a valuable resource for enhancing AI-generated image detection in practical applications. Our evaluations offer insights into the performance of generative models at various levels, showcasing WildFake's unique hierarchical structure's benefits.

Abstract: The garment structure serves as a crucial medium for expressing the designer's creative vision and showcasing the distinctive character of clothing items. Effective editing of garment structure in fashion images allows for an advanced preview of the design, accelerating the process of garment customization to meet individualized requirements. Although largescale diffusion models have demonstrated impressive image generation and editing capabilities, no efforts have been made to exploit their potential in part-level editing of images. Unlike previous research, we define a clothing structure editing (CSE) task aimed at accurately editing the local structure of human-centered clothing images through simple instruction-based prompts while maintaining the consistency of clothing appearance. Specifically, this paper develops a new controllable triple-flow framework for structure editing named FashionTailor. An additional network called ClothingNet is proposed to extract the clothing details to address the rigid constraints of the original garment structure. Then, we propose a semantic-refined module to extract the semantic understanding of the source image and adaptively focus on the part to be edited. We also design a cross-blend attention mechanism to integrate fine-grained clothing features to guarantee precise alignment between appearance and target structure features. In addition, a garment structure dataset called StructureFashion has been collated, wherein each item of clothing is represented by multiple photos with diverse structure characteristics, containing over six million pairs. Finally, our method supports editing the structure of multiple parts on a garment simultaneously. Extensive experiments validate the effectiveness of our method for editing part-level human images in StructureFashion dataset and real-scenarios.

Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University, Academy for Engineering and Technology, Fudan University Institute of Metaverse & Intelligent Medicine, Fudan University Engineering Research Center of AI and Robotics, Ministry of Education Jilin Provincial Key Laboratory of Intelligence Science and Engineering Artificial Intelligence and Unmanned Systems Engineering Research Center of Jilin Province

Abstract: With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pretrained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.

Abstract: Implicit neural representations (INRs) have revolutionized arbitraryscale super-resolution (ASSR) by modeling images as continuous functions. Most existing INR-based ASSR networks first extract features from the given low-resolution image using an encoder, and then render the super-resolved result via a multi-layer perceptron decoder. Although these approaches have shown promising results, their performance is constrained by the limited representation ability of discrete latent codes in the encoded features. In this paper, we propose a novel ASSR method named GaussianSR that overcomes this limitation through 2D Gaussian Splatting (2DGS). Unlike traditional methods that treat pixels as discrete points, GaussianSR represents each pixel as a continuous Gaussian field. The encoded features are simultaneously refined and upsampled by rendering the mutually stacked Gaussian fields. As a result, long-range dependencies are established to enhance representation ability. In addition, a classifier is developed to dynamically assign Gaussian kernels to all pixels to further improve flexibility. All components of GaussianSR (i.e. encoder, classifier, Gaussian kernels, and decoder) are jointly learned end-to-end. Experiments demonstrate that GaussianSR achieves superior ASSR performance with fewer parameters than existing methods while enjoying interpretable and content-aware feature aggregations.

Abstract: Autonomous driving progress relies on largescale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field.

Southeast University, Nanjing 211189, Jiangsu, China, Southeast University, Nanjing 211189, Jiangsu, China, Hokkaido University, Sapporo 060-0808, Hokkaido, Japan, Nanjing Normal University, Nanjing 210023, Jiangsu, China, Southern University of Science and Technology, Shenzhen 518055, Guangdong, China, Southeast University, Nanjing 211189, Jiangsu, China, Southeast University, Nanjing 211189, Jiangsu, China, Hokkaido University, Sapporo 060-0808, Hokkaido, Japan, Hokkaido University, Sapporo 060-0808, Hokkaido, Japan

Abstract: In fewshot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the high computational complexity of mainstream Transformer-based methods limits their application. Recent Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To solve these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). Firstly, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures dependencies of timeline between these local features for implicit temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining both supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR of long sub-sequence. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly improves FSAR of long sub-sequence from multiple perspectives.

Abstract: The discriminative feature is crucial for point cloud registration. Recent methods improve the feature discriminative by distinguishing between nonoverlapping and overlapping region points. However, they still face challenges in distinguishing the ambiguous structures in the overlapping regions. Therefore, the ambiguous features they extracted resulted in a significant number of outlier matches from overlapping regions. To solve this problem, we propose a prior-guided SMoE-based registration method to improve the feature distinctiveness by dispatching the potential correspondences to the same experts. Specifically, we propose a prior-guided SMoE module by fusing prior overlap and potential correspondence embeddings for routing, assigning tokens to the most suitable experts for processing. In addition, we propose a registration framework by a specific combination of Transformer layer and prior-guided SMoE module. The proposed method not only pays attention to the importance of locating the overlapping areas of point clouds, but also commits to finding more accurate correspondences in overlapping areas. Our extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art registration recall (95.7%/79.3%) on the 3DMatch/3DLoMatch benchmark. Moreover, we also test the performance on ModelNet40 and demonstrate excellent performance.

Abstract: Depth completion, predicting dense depth maps from sparse depth measurements, is an illposed problem requiring prior knowledge. Recent methods adopt learning-based approaches to implicitly capture priors, but the priors primarily fit in-domain data and do not generalize well to out-of-domain scenarios. To address this, we propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. We use pre-trained depth diffusion models as depth prior knowledge, which implicitly understand how to fill in depth for scenes. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test-time. Our zero-shot depth completion method demonstrates generalization across various domain datasets, achieving up to a 21% average performance improvement over the previous state-of-the-art methods while enhancing spatial understanding by sharpening scene details. We demonstrate that aligning a monocular affine-invariant depth prior with sparse metric measurements is a sufficient strategy to achieve domain-generalizable depth completion without relying on extensive training datasets.

Abstract: Significant advancements have been achieved in the realm of understanding poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for realtime performance in these applications. However, current state-of-the-art models often exhibit promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given limited number of queries and decoders, we propose to optimize queries which are taken as input to the Transformer decoder, to secure better accuracy: (1) we propose to divide queries into three types (a left hand query, a right hand query and an object query) and enhance query features (2) by using the contact information between hands and an object and (3) by using three-step update of enhanced image and query features with respect to one another. With proposed methods, we achieved real-time pose estimation performance using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090TI GPU). Surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), as well as on the FPHA dataset by 5.3% (right hand) and 10.4% (object), our method excels in accuracy. Additionally, it sets the state-of-the-art in interaction recognition, maintaining real-time efficiency with an off-the-shelf action recognition module.

Abstract: The semantically interactive radiance field has always been an appealing task for its potential to facilitate userfriendly and automated real-world 3D scene understanding applications. However, it is a challenging task to achieve high quality, efficiency and zero-shot ability at the same time with semantics in radiance fields. In this work, we present FastLGS, an approach that supports real-time open-vocabulary query within 3D Gaussian Splatting (3DGS) under high resolution. We propose the semantic feature grid to save multi-view CLIP features which are extracted based on Segment Anything Model (SAM) masks, and map the grids to low dimensional features for semantic field training through 3DGS. Once trained, we can restore pixel-aligned CLIP embeddings through feature grids from rendered features for open-vocabulary queries. Comparisons with other state-of-the-art methods prove that FastLGS can achieve the first place performance concerning both speed and accuracy, where FastLGS is 98 times faster than LERF, 4 times faster than LangSplat and 2.5 times faster than LEGaussians. Meanwhile, experiments show that FastLGS is adaptive and compatible with many downstream tasks, such as 3D segmentation and 3D object inpainting, which can be easily applied to other 3D manipulation systems.

School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), State Key Laboratory of General Artificial Intelligence, Peking University, Beijing 100871, China, Institute of Artificial Intelligence, Peking University People’s Hospital, Peking University, Beijing 100871, China, Institute of Artificial Intelligence, Peking University People’s Hospital, Peking University, Beijing 100871, China, School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), State Key Laboratory of General Artificial Intelligence, Peking University, Beijing 100871, China

Abstract: Subcellular structure segmentation is a fundamental task in biological imaging. Existing selfsupervised representation learning combined with classical k-means clustering achieved unsupervised image segmentation, but it was constrained by time-consuming test-time pixel-wise feature extraction and clustering synchronization. This study introduces SCCS, a lightweight graph neural network-based spectral clustering framework for end-to-end subcellular structure segmentation upon superpixel graphs, greatly relieving the computational complexity in test-time numerical spectral clustering and inter-graph label inconsistency. Specifically, SCCS exploits the self-supervised masked autoencoder for representation learning and the construction of superpixel graphs (spG). Unlike per-graph scalar affinity-based spectral clustering, the proposed SCCS parameterizes the mapping from learned deep spG representations to coordinates in the spectral embedding space and the clustering assignments. The SCCS is optimized under unsupervised eigendecomposition and incremental clustering criteria, which synchronize the intra- and inter-graph spectral clustering. The proposed approach is evaluated on a publicly available volumetric electron microscopy dataset. Experiments demonstrate the effectiveness and performance gains of the proposed SCCS over the state-of-the-art in discovering a variety of subcellular structures.

College of Computer Science and Technology, Jilin University Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, College of Computer Science and Technology, Zhejiang Gongshang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, School of Computing, National University of Singapore, College of Computer Science and Technology, Jilin University Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, College of Computer Science and Technology, Zhejiang Gongshang University, College of Computer Science and Technology, Zhejiang Gongshang University

Abstract: Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while the current Transformerbased pose estimation methods has demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining that the model focuses solely on the regions centered at the target person body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.

Abstract: The insufficient generalization of adaptive moment estimation (Adam) has hindered its broader application. Recent studies have shown that flat minima in loss landscapes are highly associated with improved generalization. Inspired by the filtering effect of integration operations on highfrequency signals, we propose multiple integral Adam (MIAdam), a novel optimizer that integrates a multiple integral term into Adam. This multiple integral term effectively filters out sharp minima encountered during optimization, guiding the optimizer towards flatter regions and thereby enhancing generalization capability. We provide a theoretical explanation for the improvement in generalization through the diffusion theory framework and analyze the impact of the multiple integral term on the optimizer's convergence. Experimental results demonstrate that MIAdam not only enhances generalization and robustness against label noise but also maintains the rapid convergence characteristic of Adam, outperforming Adam and its variants in state-of-the-art benchmarks.

Department of Electrical and Computer Engineering, Seoul National University NextQuantum, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University NextQuantum, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University NextQuantum, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University NextQuantum, Seoul National University HodooAI Labs

Abstract: As the use of machine learning models has increased, numerous studies have aimed to enhance fairness. However, research on the intersection of fairness and explainability remains insufficient, leading to potential issues in gaining the trust of actual users. Here, we propose a novel module that constructs a fair latent space, enabling faithful explanation while ensuring fairness. The fair latent space is constructed by disentangling and redistributing labels and sensitive attributes, allowing the generation of counterfactual explanations for each type of information. Our module is attached to a pretrained generative model, transforming its biased latent space into a fair latent space. Additionally, since only the module needs to be trained, there are advantages in terms of time and cost savings, without the need to train the entire generative model. We validate the fair latent space with various fairness metrics and demonstrate that our approach can effectively provide explanations for biased decisions and assurances of fairness.

Abstract: The objective of referring expression comprehension (REC) is to accurately identify the object in an image described by a given expression. Existing REC methods, including transformerbased and graph-based approaches among others, have shown robust performance in REC tasks. In this study, we present a groundbreaking framework named DiffusionREC for REC task. This framework reimagines the REC task as a text guided bounding box denoising diffusion process, through which noisy bounding boxes are refined and distilled to pinpoint the target box. Throughout the training process, the bounding box of the target object diffuses from its ground-truth position towards a random distribution. Simultaneously, a filtering-based object decoder is introduced to reverse this diffusion of noise, conditional on the provided expression, the result from previous denoised step and the interaction between the expression and the image. At the inference stage, we begin by randomly generating a collection of boxes. Subsequently, the filtering-based object decoder is iteratively employed to refine and prune these bounding boxes, taking into account the conditions on the given expression, the results from the previous denoised step, and the interaction between the expression and the image. Extensive experiments conducted on six datasets demonstrate that DiffusionREC outperforms previous REC methods, yielding superior performances.

Abstract: Foundational visionlanguage models like CLIP are emerging as a promising paradigm in vision due to their excellent generalization. However, adapting these models for downstream tasks while maintaining their generalization remains challenging. In literature, one branch of methods adapts CLIP by learning prompts using images. While effective, these methods often rely on image-label data, which is not always practical, and struggle to generalize to new datasets due to overfitting on few-shot source data. Another approach explores training-free methods by generating class captions from large language models (LLMs) and performing prompt ensembling, but these methods often produce static, class-specific prompts that cannot be transferred to new classes and incur additional costs by generating LLM descriptions for each class separately. In this work, we aim to combine the strengths of both approaches by learning prompts using only text data derived from LLMs. As supervised training of prompts in the image-free setup is non-trivial, we develop a language-only efficient training approach that enables prompts to distill rich contextual knowledge from LLM data. Furthermore, by mapping the LLM contextual text data within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially reducing the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized and transferable prompts for image tasks using only text data. We perform evaluations on 4 benchmarks, where ProText improves over ensembling methods while being competitive with those using labeled images.

Abstract: 3D point clouds are increasingly vital for applications like autonomous driving and robotics, yet the raw data captured by sensors often suffer from noise and sparsity, creating challenges for downstream tasks. Consequently, point cloud upsampling becomes essential for improving density and uniformity, with recent approaches showing promise by projecting randomly generated query points onto the underlying surface of sparse point clouds. However, these methods often result in outliers, nonuniformity, and difficulties in handling regions with high curvature and intricate structures. In this work, we address these challenges by introducing the Progressive Local Surface Estimator (PLSE), which more effectively captures local features in complex regions through a curvature-based sampling technique that selectively targets high-curvature areas. Additionally, we incorporate a curriculum learning strategy that leverages the curvature distribution within the point cloud to naturally assess the sample difficulty, enabling curriculum learning on point cloud data for the first time. The experimental results demonstrate that our approach significantly outperforms existing methods, achieving high-quality, dense point clouds with superior accuracy and detail.

Abstract: Speechdriven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Our project page is available at https://whwjdqls.github.io/deeptalk.github.io/.

Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired realworld hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world prior. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistics information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method.

Abstract: Existing Large VisionLanguage Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability in performing grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.

Abstract: Facial expression recognition (FER) remains a challenging task due to label ambiguity caused by the subjective nature of facial expressions and noisy samples. Additionally, class imbalance, which is common in realworld datasets, further complicates FER. Although many studies have shown impressive improvements, they typically address only one of these issues, leading to suboptimal results. To tackle both challenges simultaneously, we propose a novel framework called Navigating Label Ambiguity (NLA), which is robust under real-world conditions. The motivation behind NLA is that dynamically estimating and emphasizing ambiguous samples at each iteration helps mitigate noise and class imbalance by reducing the model's bias toward majority classes. To achieve this, NLA consists of two main components: Noise-aware Adaptive Weighting (NAW) and consistency regularization. Specifically, NAW adaptively assigns higher importance to ambiguous samples and lower importance to noisy ones, based on the correlation between the intermediate prediction scores for the ground truth and the nearest negative. Moreover, we incorporate a regularization term to ensure consistent latent distributions. Consequently, NLA enables the model to progressively focus on more challenging ambiguous samples, which primarily belong to the minority class, in the later stages of training. Extensive experiments demonstrate that NLA outperforms existing methods in both overall and mean accuracy, confirming its robustness against noise and class imbalance. To the best of our knowledge, this is the first framework to address both problems simultaneously.

Abstract: Surfacefrom-gradients (SfG) aims to recover a three-dimensional (3D) surface from its gradients. Traditional methods encounter significant challenges in achieving high accuracy and handling high-resolution inputs, particularly facing the complex nature of discontinuities and the inefficiencies associated with large-scale linear solvers. Although recent advances in deep learning, such as photometric stereo, have enhanced normal estimation accuracy, they do not fully address the intricacies of gradient-based surface reconstruction. To overcome these limitations, we propose a Fourier neural operator-based Numerical Integration Network (FNIN) within a two-stage optimization framework. In the first stage, our approach employs an iterative architecture for numerical integration, harnessing an advanced Fourier neural operator to approximate the solution operator in Fourier space. Additionally, a self-learning attention mechanism is incorporated to effectively detect and handle discontinuities. In the second stage, we refine the surface reconstruction by formulating a weighted least squares problem, addressing the identified discontinuities rationally. Extensive experiments demonstrate that our method achieves significant improvements in both accuracy and efficiency compared to current state-of-the-art solvers. This is particularly evident in handling high-resolution images with complex data, achieving errors of fewer than 0.1 mm on tested objects.

Abstract: Videoto-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.

Abstract: UNet has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited to linearly modeling patterns as well as the deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape the neural network learning via the stack of non-linear learnable activation functions derived from the Kolmogorov-Anold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of UKAN by higher accuracy even with less computation cost. We further delved into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures.

Abstract: Image classification serves as the cornerstone of computer vision, traditionally achieved through discriminative models based on deep neural networks. Recent advancements have introduced classification methods derived from generative models, which offer the advantage of zeroshot classification. However, these methods suffer from two main drawbacks: high computational overhead and inferior performance compared to discriminative models. Inspired by the coordinated cognitive processes of rapid-slow pathway interactions in the human brain during visual signal recognition, we propose the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This framework seamlessly integrates discriminative and generative models in a training-free manner, leveraging discriminative models for initial predictions and endowing deep neural networks with rethinking capabilities via diffusion models. Consequently, DBMEF can effectively enhance the classification accuracy and generalization capability of discriminative models in a plug-and-play manner. We have conducted extensive experiments across 17 prevalent deep model architectures with different training methods, including both CNN-based models such as ResNet and Transformer-based models like ViT, to demonstrate the effectiveness of the proposed DBMEF.Specifically, the framework yields a 1.51% performance improvement for ResNet-50 on the ImageNet dataset and 3.02% on the ImageNet-A dataset. In conclusion, our research introduces a novel paradigm for image classification, demonstrating stable improvements across different datasets and neural networks.

Abstract: Previous virtual tryon methods have employed ControlNet architecture in exemplar-based inpainting diffusion models to guide the generation of try-on images, preserving the garment's features and enhancing the realism of the generated images. While these methods have maintained the identity of the garment and improved the naturalness of the generated images, they still face the following limitations: (1) For garments with complex features, such as intricate text, patterns, and uncommon styles, they struggle to retain these detailed features in the generated try-on images. (2) They are limited to generating try-on images at a maximum resolution of 1K, which may not meet the demands of real-world scenarios, where higher resolutions might be required. To address the aforementioned issues, in this paper, we propose a Cascaded Diffusion Model for virtual try-on to enhance both image controllability and resolution. We call it CDM-VTON. Specifically, we design two diffusion models: the Multi-Conditioned Diffusion Model (MC-DM) and the Super-Resolution Diffusion Model (SR-DM). The former generates low-resolution try-on images while preserving the garment's complex features, and the latter enhances the resolution of these images. Additionally, we incorporate a multi-control integration module in the MC-DM, which injects multiple control conditions into a frozen denoising U-Net to ensure that the generated try-on images retain complex garment features. Our experimental results demonstrate that our method outperforms previous approaches in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively.

Abstract: Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework ``Know Where You Are From" (KWYAF), which incorporates past motion cues through spatiotemporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). SMEs construct prior motion features through spatio-temporal correlation modeling for boosting final segmentation, while ER²SM adapts to identify high-confidence regions, embedding motion more precisely through local window masks and reliable region selection. A large number of experiments have demonstrated the effectiveness of our proposed framework in terms of both quantity and quality.

Abstract: Temporal action localization (TAL) involves dual tasks to classify and localize actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the feature pyramid of video, CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundaryaware boundary features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, the multiple features from cross layers are also employed to refine and align the disentangled classification and regression results. At last, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Code：https://github.com/LiQiang0307/CLTDR-GMG

Abstract: Change captioning aims to describe the differences between two similar images using natural language, significantly aiding in understanding and monitoring changes. This challenging task requires a finegrained understanding of subtle changes while resisting disturbances like viewpoint shifts and illumination variations. Existing methods often rely solely on global difference features and lack comprehensive alignment of linguistic and visual information, leading to overlooking fine-grained details and generating semantic hallucinated sentences. To address these limitations, we propose the region-aware difference distilling (RDD) network with attribute-guided contrastive regularization (ACR). The RDD uses global difference features to progressively distill regional difference features using learnable vectors, allowing for more precise identification of changed regions. The ACR enhances comprehensive alignment between linguistic and visual information by formulating Nouns-to-Objects (N2O) and Verbs-to-Actions (V2A) alignment losses to regularize the regional difference features. Promising results on three datasets demonstrate that our method outperforms the state-of-the-art change captioning methods.

Abstract: 3D occupancy perception accurately estimates the volumetric status and semantic labels of a scene, attracting significant attention in the field of autonomous driving. However, enhancing the model's ability to generalize across different driving scenarios or sensing systems, often requires redesigning the model or extraexpensive annotations. To this end, following a comprehensive analysis of the occupancy model architecture, we proposed the UGOCC method that utilizes domain adaptation to efficiently harness unlabeled autonomous driving data, thereby enhancing the model's generalizability. Specifically, we design the depth fusion module by employing self-supervised depth estimation, and propose a strategy based on semantic attention and domain adversarial learning to improve the generalizability of the learnable fusion module. Additionally, we propose an OCC-specific pseudo-label selection tailored for semi-supervised learning, which optimizes the overall network's generalizability. Our experiment results on two challenging datasets nuScenes and Waymo, demonstrate that our method not only achieves state-of-the-art generalizability but also enhances the model's perceptual capabilities within the source domain by utilizing unlabeled data.

Abstract: Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing visionbased end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate reasonable planning with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems.

Abstract: ComputerAided Design (CAD) generative modeling has a strong and long-term application in the industry. Recently, the parametric CAD sequence as the design logic of an object has been widely mined by sequence models. However, the industrial CAD models, especially in component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address the problem, we introduce Mamba-CAD, a self-supervised generative modeling for complex CAD models in the industry, which can model on a longer parametric CAD sequence. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; and then we utilize the learned representation to guide a generative adversarial network to produce the fake representation of CAD models, which would be finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences.

State Key Laboratory of Complex and Critical Software Environment, Beijing, China School of Computer Science and Engineering, Beihang University, China, State Key Laboratory of Complex and Critical Software Environment, Beijing, China School of Computer Science and Engineering, Beihang University, China, School of Artificial Intelligence, Beihang University, China Shanghai Artificial Intelligence Laboratory, Shanghai, China, State Key Laboratory of Complex and Critical Software Environment, Beijing, China School of Computer Science and Engineering, Beihang University, China

Abstract: 3D reconstruction from unconstrained image collections presents substantial challenges due to varying appearances and transient occlusions. In this paper, we introduce Micromacro Wavelet-based Gaussian Splatting (MW-GS), a novel approach designed to enhance 3D reconstruction by disentangling scene representations into global, refined, and intrinsic components. The proposed method features two key innovations: Micro-macro Projection, which allows Gaussian points to capture details from feature maps across multiple scales with enhanced diversity; and Wavelet-based Sampling, which leverages frequency domain information to refine feature representations and significantly improve the modeling of scene appearances. Additionally, we incorporate a Hierarchical Residual Fusion Network to seamlessly integrate these features. Extensive experiments demonstrate that MW-GS delivers state-of-the-art rendering performance, surpassing existing methods.

Abstract: Gaussian Splatting has emerged as a prominent 3D representation in novel view synthesis, but it still suffers from appearance variations, which are caused by various factors, such as modern camera ISPs, different time of day, weather conditions, and local light changes. These variations can lead to floaters and color distortions in the rendered images/videos. Recent appearance modeling approaches in Gaussian Splatting are either tightly coupled with the rendering process, hindering realtime rendering, or they only account for mild global variations, performing poorly in scenes with local light changes. In this paper, we propose DAVIGS, a method that decouples appearance variations in a plug-and-play and efficient manner. By transforming the rendering results at the image level instead of the Gaussian level, our approach can model appearance variations with minimal optimization time and memory overhead. Furthermore, our method gathers appearance-related information in 3D space to transform the rendered images, thus building 3D consistency across views implicitly. We validate our method on several appearance-variant scenes, and demonstrate that it achieves state-of-the-art rendering quality with minimal training time and memory usage, without compromising rendering speeds. Additionally, it provides performance improvements for different Gaussian Splatting baselines in a plug-and-play manner.

Abstract: Scene flow methods based on deep learning have achieved impressive performance. However, current topperforming methods still struggle with ill-posed regions, such as extensive flat regions or occlusions, due to insufficient local evidence. In this paper, we propose a novel global-aware scene flow estimation network with global motion propagation, named FlowMamba. The core idea of FlowMamba is a novel Iterative Unit based on the State Space Model (ISU), which first propagates global motion patterns and then adaptively integrates the global motion information with previously hidden states. As the irregular nature of point clouds limits the performance of ISU in global motion propagation, we propose a feature-induced ordering strategy (FIO). The FIO leverages semantic-related and motion-related features to order points into a sequence characterized by spatial continuity. Extensive experiments demonstrate the effectiveness of FlowMamba, with 21.9% and 20.5% EPE3D reduction from the best published results on FlyingThings3D and KITTI datasets. Specifically, our FlowMamba is the first method to achieve millimeter-level prediction accuracy in FlyingThings3D and KITTI. Furthermore, the proposed ISU can be seamlessly embedded into existing iterative networks as a plug-and-play module, improving their estimation accuracy significantly.

Abstract: Ophthalmologists typically require multimodal data sources to improve diagnostic accuracy in clinical decisions. However, due to medical device shortages, lowquality data and data privacy concerns, missing data modalities are common in real-world scenarios. Existing deep learning methods tend to address it by learning an implicit latent subspace representation for different modality combinations. We identify two significant limitations of these methods: (1) implicit representation constraints that hinder the model's ability to capture modality-specific information and (2) modality heterogeneity, causing distribution gaps and redundancy in feature representations. To address these, we propose an Incomplete Modality Disentangled Representation (IMDR) strategy, which disentangles features into explicit independent modal-common and modal-specific features by guidance of mutual information, distilling informative knowledge and enabling it to reconstruct valuable missing semantics and produce robust multimodal representations. Furthermore, we introduce a joint proxy learning module that assists IMDR in eliminating intra-modality redundancy by exploiting the extracted proxies from each class. Experiments on four ophthalmology multimodal datasets demonstrate that the proposed IMDR outperforms the state-of-the-art methods significantly.

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, SenseTime, SenseTime, SenseTime, School of Artificial Intelligence, Beijing University of Posts and Telecommunications Beijing Key Laboratory of Network System and Network Culture, China Key Laboratory of Intereactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China, School of Artificial Intelligence, Beijing University of Posts and Telecommunications Beijing Key Laboratory of Network System and Network Culture, China Key Laboratory of Intereactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China

Abstract: Recently, diffusionbased video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks.

Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China

Abstract: Languagebased localization is a crucial task in robotics and computer vision, enabling robots to understand spatial positions through language. Recent methods rely on contrastive learning to establish correspondences between global features of texts and point clouds. However, the inherent ambiguity of textual descriptions makes it difficult to convey geometric information accurately, forcing alignment of them in the feature space may compromise the expressiveness of the point clouds. Unlike previous methods, this paper proposes using language as a filter to distinguish dissimilar locations. To this end, we propose a robust framework of multi-level negative contrastive learning for language-based localization, fully leveraging the descriptive power of language for spatial localization. Our method learns multiple mismatched factors by minimizing the similarity of different locations at different levels, including global-level, instance-level and relationlevel, respectively. Extensive experiments conducted on the KITTI360Pose benchmark demonstrate that our method outperforms better that the state-of-the-art methods. Specifically, we achieve a 56.3% improvement in Top-1 retrieval recall and a 45.9% improvement in 5m localization recall.

Abstract: Fewshot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe position biases. To tackle above challenges, we adopt the meta-learning strategy to teach models ``learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, LVLMs enhance the ability to extract information from few-shot support data for classification. Additionally, we further boost LVLM's few-shot learning capabilities through label augmentation (LA) and candidate selection (CS) in the fine-tuning and inference stages, respectively. LA is implemented via a character perturbation strategy to ensure the model focuses on support information. CS leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs.

Abstract: In this paper, we explore a novel image matting task aimed at achieving efficient inference under various computational cost constraints, specifically FLOP limitations, using a single matting network. Existing matting methods which have not explored scalable architectures or pathlearning strategies, fail to tackle this challenge. To overcome these limitations, we introduce Path-Adaptive Matting (PAM), a framework that dynamically adjusts network paths based on image contexts and computational cost constraints. We formulate the training of the computational cost-constrained matting network as a bilevel optimization problem, jointly optimizing the matting network and the path estimator. Building on this formalization, we design a path-adaptive matting architecture by incorporating path selection layers and learnable connect layers to estimate optimal paths and perform efficient inference within a unified network. Furthermore, we propose a performance-aware path-learning strategy to generate path labels online by evaluating a few paths sampled from the prior distribution of optimal paths and network estimations, enabling robust and efficient online path learning. Experiments on five image matting datasets demonstrate that the proposed PAM framework achieves competitive performance across a range of computational cost constraints.

Abstract: Reconstruction under adverse rainy conditions poses significant challenges due to reduced visibility and the distortion of visual perception. These conditions can severely impair the quality of geometric maps, which is essential for applications ranging from autonomous planning to environmental monitoring. In response to these challenges, this study introduces the novel task of 3D Reconstruction in Rainy Environments (3DRRE), specifically designed to address the complexities of reconstructing 3D scenes under rainy conditions. To benchmark this task, we construct the HydroViews dataset that comprises a diverse collection of both synthesized and realworld scene images characterized by various intensities of rain streaks and raindrops. Furthermore, we propose DeRainGS, the first 3DGS method tailored for reconstruction in adverse rainy environments. Extensive experiments across a wide range of rain scenarios demonstrate that our method delivers state-of-the-art performance, remarkably outperforming existing occlusion-free methods by a large margin.

State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China School of Computer Science and Engineering, Beihang University, Beijing 100191, China, State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China School of Computer Science and Engineering, Beihang University, Beijing 100191, China, State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China School of Computer Science and Engineering, Beihang University, Beijing 100191, China, School of Computer Science and Engineering, Beihang University, Beijing 100191, China, State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Abstract: Trainingfree open-vocabulary semantic segmentation aims to explore the potential of frozen vision-language models (VLM) for segmentation tasks. Recent works reform the inference process of CLIP and utilize the features from the final layer to reconstruct dense representations for segmentation, demonstrating promising performance. However, the final layer tends to prioritize global components over local representations, leading to suboptimal robustness and effectiveness of existing methods. In this paper, we propose CLIPSeg, a novel training-free framework that fully exploits the diverse knowledge across layers in CLIP for dense predictions. Our study unveils two key discoveries: Firstly, the features in the middle layers exhibit high locality awareness and feature coherence compared to the final layer, based on which we propose the coherence enhanced residual attention module that generates semantic-aware attention. Secondly, despite not being directly aligned with the text, the deep layers capture valid local semantics that complement those in the final layer. Leveraging this insight, we introduce the deep semantic integration module to boost the patch semantics in the final block. Experiments conducted on 9 segmentation benchmarks with various CLIP models demonstrate that CLIPSeg consistently outperforms all training-free methods by substantial margins, e.g., a 7.8 % improvement in average mIoU for CLIP with a ViT-L backbone, and competes with learning-based counterparts in generalizing to novel concepts in an efficient way.

Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beihang University, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Lenovo Ltd.

Abstract: Recent advances in visionlanguage pre-training have significantly enhanced the model capabilities on grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge different granular semantics and uses specific attributes to identify the object discrepancy. Our DoGA incorporates three principle components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations deriving from the image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training to accommodate model to perform many-to-many assignments, facilitating simultaneous training and inferring with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize multi-granular prompts and surpass previous state-of-the-art approaches, yielding 50.2 on the COCO and 38.0 on the LVIS benchmarks under the zero-short setting. We will make our code publicly available upon acceptance.

Abstract: In recent years, NoReference Point Cloud Quality Assessment (NR-PCQA) research has achieved significant progress. However, existing methods mostly seek a direct mapping function from visual data to the Mean Opinion Score (MOS), which is contradictory to the mechanism of practical subjective evaluation. To address this, we propose a novel language-driven PCQA method named CLIP-PCQA. Considering that human beings prefer to describe visual quality using discrete quality descriptions (e.g., "excellent" and "poor") rather than specific scores, we adopt a retrieval-based mapping strategy to simulate the process of subjective assessment. More specifically, based on the philosophy of CLIP, we calculate the cosine similarity between the visual features and multiple textual features corresponding to different quality descriptions, in which process an effective contrastive loss and learnable prompts are introduced to enhance the feature extraction. Meanwhile, given the personal limitations and bias in subjective experiments, we further covert the feature similarities into probabilities and consider the Opinion Score Distribution (OSD) rather than a single MOS as the final target. Experimental results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches.

MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University Defense Innovation Institute, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Defense Innovation Institute, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, Defense Innovation Institute, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Chinese Academy of Military Science Intelligent Game and Decision Laboratory

Abstract: The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for realworld deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.

Abstract: Existing RGBT tracking methods often design various interaction models to perform crossmodal fusion of each layer, but can not execute the feature interactions among all layers, which plays a critical role in robust multimodal representation, due to large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions of all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, it is always challenging to build multimodal interactions in each layer due to struggling in balancing interaction capabilities and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a huge number of token sequences (3840 tokens in this work) are involved and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions of all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods. We will release the code upon acceptance of the paper.

Abstract: Video moment retrieval (VMR) aims to locate the most likely video moment(s) corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets, hindering their ability to generalise momenttext associations to queries containing novel semantic concepts (unseen both visually and textually in a training source domain). For model generalisation to novel semantics, existing methods rely heavily on assuming to have access to both video and text sentence pairs from a target domain in addition to the source domain pair-wise training data. This is neither practical nor scalable. In this work, we introduce a more generalisable approach by assuming only text sentences describing new semantics are available in model training without having seen any videos from a target domain. To that end, we propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion to facilitate fine-grained video editing from the seen source concepts to the unseen target sentences consisting of new concepts. This enables generative hypotheses of unseen video moments corresponding to the novel concepts in the target domain. This fine-grained generative video diffusion retains the original video structure and subject specifics from the source domain while introducing semantic distinctions of unseen novel vocabularies in the target domain. A critical challenge is how to enable this generative fine-grained diffusion process to be meaningful in optimising VMR, more than just synthesising visually pleasing videos. We solve this problem by introducing a hybrid selection mechanism that integrates three quantitative metrics to selectively incorporate synthetic video moments (novel video hypotheses) as enlarged additions to the original source training data, whilst minimising potential detrimental noise or unnecessary repetitions in the novel synthetic videos harmful to VMR learning. Experiments on three datasets demonstrate the effectiveness of FVE to unseen novel semantic video moment retrieval tasks

Abstract: Scene text spotting has attracted the enthusiasm of relative researchers in recent years. Most existing scene text spotters follow the detectionthen-recognition paradigm, where the vanilla detection module hardly determines the reading order and leads to failure recognition. After rethinking the auto-regressive scene text recognition method, we find that a well-trained recognizer can implicitly perceive the local semantics of all characters in a complete word or a sentence without a character-level detection module. Local semantic knowledge not only includes text content but also spatial information in the right reading order. Motivated by the above analysis, we propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters guided by the local semantics. Specifically, two effective modules are proposed in LSGSpotter. On the one hand, we design a Start Point Localization Module (SPLM) for locating text start points to determine the right reading order. On the other hand, a Multi-scale Adaptive Attention Module (MAAM) is proposed to adaptively aggregate text features in a local area. In conclusion, LSGSpotter achieves the arbitrary reading order spotting task without the limitation of sophisticated detection, while alleviating the cost of computational resources with the grid sampling strategy. Extensive experiment results show LSGSpotter achieves state-of-the-art performance on the InverseText benchmark. Moreover, our spotter demonstrates superior performance on English benchmarks for arbitrary-shaped text, achieving improvements of 0.7% and 2.5% on Total-Text and SCUT-CTW1500, respectively. These results validate our text spotter is effective for scene texts in arbitrary reading order and shape.

The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China

Abstract: Fewshot defect multi-classification (FSDMC) is an emerging trend in quality control within industrial manufacturing. However, current FSDMC research often lacks generalizability due to its focus on specific datasets. Additionally, defect classification heavily relies on contextual information within images, and existing methods fall short of effectively extracting this information. To address these challenges, we propose a general FSDMC framework called MVREC, which offers two primary advantages: (1) MVREC extracts general features for defect instances by incorporating the pre-trained AlphaCLIP model. (2) It utilizes a region-context framework to enhance defect features by leveraging mask region input and multi-view context augmentation. Furthermore, Few-shot Zip-Adapter(-F) classifiers within the model are introduced to cache the visual features of the support set and perform few-shot classification. We also introduce MVTec-FS, a new FSDMC benchmark based on MVTec AD, which includes 1228 defect images with instance-level mask annotations and 46 defect types. Extensive experiments conducted on MVTec-FS and four additional datasets demonstrate its effectiveness in general defect classification and its ability to incorporate contextual information to improve classification performance.

Abstract: Hyperspectral image (HSI) reconstruction aims to restore the original 3D HSIs from the 2D hyperspectral snapshot compressive images (SCIs). The key to highfidelity HSI reconstruction lies in designing refined spatial and spectral attention mechanisms, which are crucial for generating fine-grained representations of HSI based on the limited spatial and spectral information available in SCI. Recently, Mamba has demonstrated remarkable performance and efficiency in modeling spatial correlations. Its implicit attention mechanism generates three orders of magnitude more attention matrices than transformers, significantly raising the performance ceiling for HSI reconstruction. In this paper, we propose a novel joint SSM network named Sp3ctralMamba for HSI reconstruction. Sp3ctralMamba integrates frequency domain knowledge and physical priors to enhance reconstruction quality. Specifically, we first perform hierarchical decomposition of the 3D HSI embedding to mitigate the negative impact of distant bands on reconstruction. Next, we design a joint SSM block S3Mamba (S3MAB) to perform parallel scans of the embeddings from different bands. In addition to the conventional vanilla scan, S3MAB introduces a local scanning scheme to address the reconstruction challenges posed by the spatial sparsity of spectral information. Furthermore, a spiral scanning scheme in the frequency domain is incorporated to enhance the order correlation between different frequency signals. Finally, we introduce energy priors and structural priors to constrain the generation of spectral and spatial representations during the training process. Extensive experiments on both simulated and real datasets demonstrate that Sp3ctralMamba significantly elevates HSI reconstruction performance to a new level, surpassing SOTA methods in both quantitative and qualitative metrics.

Abstract: Temporal Action Localization (TAL) aims to accurately identify the start and end times of actions in untrimmed videos and classify them according to specific labels. However, the complexity and imbalance between target actions and background in video data make this task particularly challenging. Although relying on large amounts of finely annotated data has led to some progress in existing methods, the presence of noisy labels in largescale annotations limits their application in open-world scenarios. To address this issue, we take the perspective of the data itself, modeling the different energy patterns exhibited by the action foreground and background in video data to enhance video content inference. Specifically, we propose the Energy-Driven Meta Purifier (EDMP) method, which utilizes a meta-learning training paradigm to avoid dependence on extensive and precise manual annotations. Under this pipeline, we use energy modeling to distinguish between different actions and backgrounds from the perspective of energy differences, thereby improving the model's robustness to category noise. Additionally, these energy-based distinctions are employed to further refine action boundaries, enhancing the model's robustness to boundary noise. Experiments on THUMOS14 and ActivityNet1.3 datasets show that EDMP effectively enhances the robustness of TAL models.

Abstract: Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in imagebased virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed.

Abstract: Point cloud upsampling aims to generate dense and uniformly distributed point sets from sparse point clouds. Existing point cloud upsampling methods typically approach the task as an interpolation problem. They achieve upsampling by performing local interpolation between point clouds or in the feature space, then regressing the interpolated points to appropriate positions. By contrast, our proposed method treats point cloud upsampling as a global shape completion problem. Specifically, our method first divides the point cloud into multiple patches. Then a masking operation is applied to remove some patches, leaving visible point cloud patches. Finally, our customdesigned neural network iterative completes the missing sections of the point cloud through the visible parts. During testing, by selecting different mask sequences, we can restore various complete patches. A sufficiently dense upsampled point cloud can be obtained by merging all the completed patches. We demonstrate the superior performance of our method through both quantitative and qualitative experiments, showing overall superiority against both existing self-supervised and supervised methods.

Abstract: Federated SemiSupervised Learning (FSSL) has emerged as a crucial topic in medical image analysis, allowing multiple medical institutions to collaboratively train a global model using limited labeled data. However, existing FSSL methods focus solely on an effective combination of federated learning and semi-supervised learning, ignoring the heterogeneity of client data and the inadaptability of semi-supervised methods in diverse environments, which leads to knowledge bias in local models and impedes stable convergence. To this end, we explore the application of personalization in FSSL and propose a novel dual-calibrated co-training framework. To adapt to the unique feature distribution of client data, we consider collaborative relationships among clients to aggregate a personalized model for each client. We further build a dual-student architecture with the personalized model and private local model on the client side, which encourages model disagreement for co-training while enhancing participant privacy. Most importantly, we design dual calibration strategies that adaptively optimize the model: Local calibration improves the boundary discrimination of the local model by dynamically replacing pseudo-label boundary patches; Global calibration corrects model direction based on the real-time perception of the biases between local dual-student models. Experimental results show the effectiveness of our method on a private medical dataset and two public medical datasets.

Abstract: Robustness and generalizability in medical image segmentation are often hindered by scarcity and limited diversity of training data, which stands in contrast to the variability encountered during inference. While conventional strategiessuch as domain-specific augmentation, specialized architectures, and tailored training procedures---can alleviate these issues, they depend on the availability and reliability of domain knowledge. When such knowledge is unavailable, misleading, or improperly applied, performance may deteriorate. In response, we introduce a novel, domain-agnostic, add-on, and data-driven strategy inspired by image stacking in image denoising. Termed ``semantic stacking,'' our method estimates a denoised semantic representation that complements the conventional segmentation loss during training. This method does not depend on domain-specific assumptions, making it broadly applicable across diverse image modalities, model architectures, and augmentation techniques. Through extensive experiments, we validate the superiority of our approach in improving segmentation performance under diverse conditions.

Abstract: Diffusion models have significantly facilitated the customization of input video with target appearance while maintaining its motion patterns. To distill the motion information from video frames, existing works often estimate motion representations as frame difference or correlation in pixel/feature-space. Despite its simplicity, these methods have unexplored limitations, including lack of understanding of global motion context, and the introduction of motion-independent spatial distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion representations in the spectral domain. Specifically, SMA learns spectral motion representations, facilitating the learning of whole-frame global motion dynamics, and effectively mitigating motion-independent artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.

Abstract: Transformerbased Super-Resolution (SR) methods have demonstrated superior performance compared to convolutional neural network (CNN)-based SR approaches due to their capability to capture long-range dependencies. However, their high computational complexity necessitates the development of lightweight approaches for practical use. To address this challenge, we propose the Attention-Sharing Information Distillation (ASID) network, a lightweight SR network that integrates attention-sharing and an information distillation structure specifically designed for Transformer-based SR methods. We modify the information distillation scheme, originally designed for efficient CNN operations, to reduce the computational load of stacked self-attention layers, effectively addressing the efficiency bottleneck. Additionally, we introduce attention-sharing across blocks to further minimize the computational cost of self-attention operations. By combining these strategies, ASID achieves competitive performance with existing SR methods while requiring only around 300K parameters – significantly fewer than existing CNN-based and Transformer-based SR models. Furthermore, ASID outperforms state-of-the-art SR methods when the number of parameters is matched, demonstrating its efficiency and effectiveness.

Abstract: Aerial Action Recognition (AAR) in videos captured by Unmanned Aerial Vehicles (UAVs) plays a vital role in numerous applications. However, current methods related to traditional action recognition primarily cater to fixed or near cameras, and rarely consider the movement disturbance of UAVs, including their varying attitudes and positions. Those characteristics of aerial videos bring moving objects in small regions compared to broad backgrounds and relative movement to the motion of objects, which reflect more sparse and disturbed semantic information for AAR. To address these issues, we present a novel framework, dubbed 3DTok, to Select, Expand, and Squeeze original visual tokens for obtaining compact yet diverse semantic-enhanced tokens. In particular, we present a 3D-token selector (3TS) to select complex yet diverse tokens in three channels, capturing the semantic awareness of moving objects in comparatively small regions. Additionally, to get rid of disturbed semantic information caused by the UAV flight, we present an Expand-Squeeze Converter (ESC) to adaptively expand and squeeze the 3D-selected tokens constrained by contrastive loss, thereby suppressing the semantic-irrelevant information and reinforce semantic-relevant information via the interpolation converting. By involving the token selecting, expanding, and squeezing into an all-in-one framework, 3D-Tok shows significant improvements on the UAV-Human dataset(↑9.5%), RoCoG-v2 dataset (↑23.5%), and Drone-Action dataset (↑5.7%).

Abstract: In medical image analysis, detecting multiple structures is crucial for evaluations and diagnosis but is often limited by the lack of highquality annotations. Semi-supervised object detection emerges as a potent methodology to enhance model performance and generalization by leveraging a vast pool of unlabeled data alongside a minimal set of labeled data. A striking observation is that both unlabelled and labeled medical images contain a priori anatomical knowledge from human screening. In this work, we introduce a novel semi-supervised approach named Semi-akmm for mining and matching anatomical knowledge in ultrasound images. We develop an Adaptive Prior Knowledge Transfer (APKT) module to mine and explore the distribution and knowledge of potential proposal boxes by proposal proportion constraint. Furthermore, within a teacher-student learning framework, we put forward an Anatomical Structure Matching (ASM) module to facilitate co-learning consistent topological prior knowledge between the student and teacher models. To our knowledge, this marks the inception of an efficient semi-supervised medical multi-structure detection model. Our experiments across five publicly available ultrasound datasets demonstrate that Semi-akmm sets a new benchmark in performance with solid results that outperform existing methods.

School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, School of Computer Science and Information Engineering, Hefei University of Technology, School of Cyber Science and Technology, Zhejiang University, School of Information Science and Engineering, Lanzhou University, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: Recent works on remote PhotoPlethysmoGraphy (rPPG) estimation typically use techniques like CNNs and Transformers to encode implicit features from facial videos for prediction. These methods learn to directly map facial videos to the static values of rPPG signals, overlooking the inherent dynamic characteristics of rPPG sequence. Moreover, the rPPG signal is extremely weak and highly susceptible to interference from various sources of noise, including illumination conditions, head movements, and variations in skin tone. To address these limitations, we propose a Physiologybased dynamicity disentangled diffusion (PhysDiff) model particularly designed for robust rPPG estimation. PhysDiff leverages the diffusion model to learn the distribution of quasi-periodic rPPG signal and uses a dynamicity disentanglement strategy to capture two dynamic characteristics in temporal rPPG signal, i.e., trend and amplitude. This disentanglement is motivated by the underlying dynamic physiological processes of vasodilation and vasoconstriction, ensuring a more precise representation of the rPPG signal. The disentangled components are then used as pivotal conditions in the proposed spatial-temporal hybrid denoiser for rPPG reconstruction. Besides, we introduce a periodicity-based multi-hypothesis selection strategy in model inference, which compares the natural periodicity of multiple generated rPPG hypotheses and selects the most favorable one as the final prediction. Extensive experiments on four datasets demonstrate that our PhysDiff significantly outperforms prior methods on both intra-dataset and cross-dataset testing.

Abstract: Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms in language benchmarks and achieves stateof-the-art results in VLM Benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.

School of Computer Science and Information Engineering, Hefei University of Technology, China, Tencent Youtu Lab, Tencent Youtu Lab, School of Computer Science and Information Engineering, Hefei University of Technology, China, Tencent WeChat Pay Lab33, Tencent WeChat Pay Lab33, Tencent Youtu Lab, Tencent Youtu Lab, School of Computer Science and Information Engineering, Hefei University of Technology, China, School of Computer Science and Information Engineering, Hefei University of Technology, China

Abstract: Palm vein recognition is an emerging biometric technology that offers enhanced security and privacy. However, acquiring sufficient palm vein data for training deep learningbased recognition models is challenging due to the high costs of data collection and privacy protection constraints. This has led to a growing interest in generating pseudo-palm vein data using generative models. Existing methods, however, often produce unrealistic palm vein patterns or struggle with controlling identity and style attributes. To address these issues, we propose a novel palm vein generation framework named PVTree. First, the palm vein identity is defined by a complex and authentic 3D palm vascular tree, created using an improved Constrained Constructive Optimization (CCO) algorithm. Second, palm vein patterns of the same identity are generated by projecting the same 3D vascular tree into 2D images from different views and converting them into realistic images using a generative model. As a result, PVTree satisfies the need for both identity consistency and intra-class diversity. Extensive experiments conducted on several publicly available datasets demonstrate that our proposed palm vein generation method surpasses existing methods and achieves a higher TAR@FAR=1e-4 under the 1:1 Open-set protocol. To the best of our knowledge, this is the first time that the performance of a recognition model trained on synthetic palm vein data exceeds that of the recognition model trained on real data, which indicates that palm vein image generation research has a promising future.

Abstract: Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which primarily generate stories in a captiondependent manner, often overlook the importance of contextual consistency and the relevance of frames during sequential generation. To address this, we propose a novel Rich-contextual Conditional Diffusion Models (RCDMs), a two-stage approach designed to enhance story generation's semantic consistency and temporal consistency. Specifically, in the first stage, the frame-prior transformer diffusion model is presented to predict the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantic and temporal consistency stories. Moreover, RCDMs can generate consistent stories with a single forward inference compared to autoregressive models. Our qualitative and quantitative results demonstrate that our proposed RCDMs outperform in challenging scenarios.

Abstract: The imperative for compression of material textures emerges from the critical demand for highquality rendering, which necessitates sophisticated textures that, in turn, require substantial storage and memory resources. Thus, low-bitrate compression is crucial, especially in modern games demanding higher texture resolutions. Concurrent methodologies in texture compression predominantly employ a block-based paradigm based on color space, which inevitably leads to representational redundancies and a limited compression scope, particularly at lower bitrates. In the context of mobile devices, bandwidth during texture loading and runtime memory are major bottlenecks, making existing compression algorithms inadequate for high-resolution textures. To mitigate these limitations, we propose a novel multi-resolution texture compression scheme, Neural Block Compression (NBC), developed within the neural feature domain. Our encoding scheme is constructed on a hierarchy of multi-resolution neural feature blocks, and the key ingredient is the variable bitrates quantization scheme. It allocates higher bitrates to higher feature mip-levels and lower bitrates to lower feature mip-levels, thereby extending the concept of block compression from color domain into neural feature domain. Extensive experiments demonstrate the superior texture compression quality achieved by the proposed scheme, especially at low bitrates.

Abstract: Diffusion models excel at producing highquality images; however, scaling to higher resolutions, such as 4K, often results in structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent structure, ResMaster meticulously aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model are incorporated. This approach could significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation.

Abstract: Masked point modeling methods have recently achieved great success in selfsupervised learning for point cloud data. However, these methods are sensitive to rotations and often exhibit sharp performance drops when encountering rotational variations. In this paper, we propose a novel Rotation-Invariant Masked AutoEncoders (RI-MAE) to address two major challenges: 1) achieving rotation-invariant latent representations, and 2) facilitating self-supervised reconstruction in a rotation-invariant manner. For the first challenge, we introduce RI-Transformer, which features disentangled geometry content, rotation-invariant relative orientation and position embedding mechanisms for constructing rotation-invariant point cloud latent space. For the second challenge, a novel dual-branch student-teacher architecture is devised. It enables the self-supervised learning via the reconstruction of masked patches within the learned rotation-invariant latent space. Each branch is based on an RI-Transformer, and they are connected with an additional RI-Transformer predictor. The teacher encodes all point patches, while the student solely encodes unmasked ones. Finally, the predictor predicts the latent features of the masked patches using the output latent embeddings from the student, supervised by the outputs from the teacher. Extensive experiments demonstrate that our method is robust to rotations, achieving the state-of-the-art performance on various downstream tasks.

Abstract: Single hyperspectral image superresolution (single-HSI-SR) aims to improve the resolution of a single input low-resolution HSI. Due to the bottleneck of data scarcity, the development of single-HSI-SR lags far behind that of RGB natural images. In recent years, research on RGB SR has shown that models pre-trained on large-scale benchmark datasets can greatly improve performance on unseen data, which may stand as a remedy for HSI. But how can we transfer the pre-trained RGB model to HSI, to overcome the data-scarcity bottleneck? Because of the significant difference in the channels between the pre-trained RGB model and the HSI, the model cannot focus on the correlation along the spectral dimension, thus limiting its ability to utilize on HSI. Inspired by the HSI spatial-spectral decoupling, we propose a new framework that first fine-tunes the pre-trained model with the spatial components (known as eigenimages), and then infers on unseen HSI using an iterative spectral regularization (ISR) to maintain the spectral correlation. The advantages of our method lie in: 1) we effectively inject the spatial texture processing capabilities of the pre-trained RGB model into HSI while keeping spectral fidelity, 2) learning in the spectral-decorrelated domain can improve the generalizability to spectral-agnostic data, and 3) our inference in the eigenimage domain naturally exploits the spectral low-rank property of HSI, thereby reducing the complexity. This work bridges the gap between pre-trained RGB models and HSI via eigenimages, addressing the issue of limited HSI training data, hence the name EigenSR. Extensive experiments show that EigenSR outperforms the state-of-the-art (SOTA) methods in both spatial and spectral metrics.

Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China, Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China,, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China, Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China, Peng Cheng Laboratory, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China, Peng Cheng Laboratory

Abstract: Textdriven video editing has recently experienced rapid development. Despite this, evaluating edited videos remains a considerable challenge. Current metrics tend to fail to align with human perceptions, and effective quantitative metrics for video editing are still notably absent. To address this, we introduce VE-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes VE-Bench DB, a video quality assessment (VQA) database for video editing. VE-Bench DB encompasses a diverse set of source videos featuring various motions and subjects, along with multiple distinct editing prompts, editing results from 8 different models, and the corresponding Mean Opinion Scores (MOS) from 24 human annotators. Based on VE-Bench DB, we further propose VE-Bench QA, a quantitative human-aligned measurement for the text-driven video editing task. In addition to the aesthetic, distortion, and other visual quality indicators that traditional VQA methods emphasize, VE-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos. It introduces a new assessment network for video editing that attains superior performance in alignment with human preferences.To the best of our knowledge, VE-Bench introduces the first quality assessment dataset for video editing and proposes an effective subjective-aligned quantitative metric for this domain. All models, data, and code will be publicly available to the community.

Abstract: Indoor scene synthesis aims to automatically produce plausible, realistic, and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pretrained large language models (LLM) assist in open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which alleviates contradictory placements with dense relations and enhances the generalization ability of the network to infer fine-grained placements. Second, it naturally supports the divide-and-conquer optimization, by first arranging the sub-scenes and then the entire scene, to more effectively solve for a feasible layout. We conduct extensive comparison experiments and ablation studies with both qualitative and quantitative evaluations to validate the effectiveness of our key designs with the hierarchically structured scene representation. Our approach can generate more reasonable scene layouts while better aligned with the user requirements and LLM descriptions. We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in the applications.

Abstract: In medical image analysis, multiorgan semi-supervised segmentation faces challenges such as insufficient labels and low contrast in soft tissues. To address these issues, existing studies typically employ semi-supervised segmentation techniques using pseudo-labeling and consistency regularization. However, these methods mainly rely on individual data samples for training, ignoring the rich neighborhood information present in the feature space. In this work, we argue that supervisory information can be directly extracted from the geometry of the feature space. Inspired by the density-based clustering hypothesis, we propose using feature density to locate sparse regions within feature clusters. Our goal is to increase intra-class compactness by addressing sparsity issues. To achieve this, we propose a Density-Aware Contrastive Learning (DACL) strategy, pushing anchored features in sparse regions towards cluster centers approximated by high-density positive samples, resulting in more compact clusters. Specifically, our method constructs density-aware neighbor graphs using labeled and unlabeled data samples to estimate feature density and locate sparse regions. We also combine label-guided co-training with density-guided geometric regularization to form complementary supervision for unlabeled data. Experiments on the Multi-Organ Segmentation Challenge dataset demonstrate that our proposed method outperforms state-of-the-art methods, highlighting its efficacy in medical image segmentation tasks.

Abstract: Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multiview RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D²-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D²-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses.

Abstract: Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential in addressing sparseview computed tomography (SVCT) inverse problems. While these INR-based methods perform well on relatively dense SVCT reconstructions, they struggle to achieve comparable performance with supervised methods in sparser SVCT scenarios and are prone to being affected by noise, limiting their applicability in real clinical settings. Additionally, current methods have not fully explored the use of image domain priors for solving SVCT inverse problems. In this work, we demonstrate that imperfect reconstruction results can provide effective image domain priors for INRs to enhance performance. To leverage this, we introduce Self-prior embedding neural representation (Spener), a novel unsupervised method for SVCT reconstruction that integrates iterative reconstruction algorithms. During each iteration, Spener extracts local image prior features from the previous iteration and embeds them to constrain the solution space. Experimental results on multiple CT datasets show that our unsupervised Spener method achieves performance comparable to supervised state-of-the-art (SOTA) methods on in-domain data while outperforming them on out-of-domain datasets. Moreover, Spener significantly improves the performance of INR-based methods in handling SVCT with noisy sinograms.

Abstract: Video has emerged as a favored multimedia format on the internet. To better gain video contents, a new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and stepcaptioning. The pioneering work chooses the pre-trained CLIP-based model for video retrieval, and leverages it as a feature extractor for other three challenging tasks solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn the comprehensive cognition of user-preferred content, due to disregarding the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation and step-captioning. Specifically, we first design the modality-synergistic perception to obtain rich audio-visual content, by modeling global contrastive alignment and local fine-grained interaction between visual and audio modalities. Then, we devise the query-centric cognition that uses the deep-level query to perform the temporal-channel filtration on the shallow-level audio-visual representation. This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks. Extensive experiments show QUAG achieves the SOTA results on HIREST. Further, we test QUAG on the query-based video summarization task and verify its good generalization.

Abstract: Facial expression recognition faces challenges where labeled significant features in datasets are mixed with unlabeled redundant ones. In this paper, we introduce Cross Similarity Attention (CSA) to mine richer intrinsic information from image pairs, overcoming a limitation when the Scaled DotProduct Attention of ViT is directly applied to calculate the similarity between two different images. Based on CSA, we simultaneously minimize intra-class differences and maximize inter-class differences at the fine-grained feature level through interactions among multiple branches. Contrastive residual distillation is utilized to transfer the information learned in the cross module back to the base network. We ingeniously design a four-branch centrally symmetric network, named Quadruplet Cross Similarity (QCS), which alleviates gradient conflicts arising from the cross module and achieves balanced and stable training. It can adaptively extract discriminative features while isolating redundant ones. The cross-attention modules exist during training, and only one base branch is retained during inference, resulting in no increase in inference time. Extensive experiments show that our proposed method achieves state-of-the-art performance on several FER datasets.

The State Key Laboratory of Networking and Switching Technology Beijing University of Posts and Telecommunications, The State Key Laboratory of Networking and Switching Technology Beijing University of Posts and Telecommunications, The State Key Laboratory of Networking and Switching Technology Beijing University of Posts and Telecommunications, The State Key Laboratory of Networking and Switching Technology Beijing University of Posts and Telecommunications, The State Key Laboratory of Networking and Switching Technology Beijing University of Posts and Telecommunications

Abstract: Object reidentification (ReID) is committed to searching for objects of the same identity across cameras, and its real-world deployment is gradually increasing. Current ReID methods assume that the deployed system follows the centralized processing paradigm, i.e., all computations are conducted in the cloud server and edge devices are only used to capture images. As the number of videos experiences a rapid escalation, this paradigm has become impractical due to the finite computational resources in the cloud server. Therefore, the ReID system should be converted to fit in the cloud-edge collaborative processing paradigm, which is crucial to boost its scalability and practicality. However, current works lack relevant research on this important specific issue, making it difficult to adapt them into a cloud-edge framework effectively. In this paper, we propose a cloud-edge collaborative inference framework for ReID systems, aiming to expedite the return of the desired image captured by the camera to the cloud server by learning the spatial-temporal correlations among objects. In the system, a Distribution-aware Correlation Modeling network (DaCM) is particularly proposed to embed the spatial-temporal correlations of the camera network implicitly into a graph structure, and it can be applied 1) in the cloud to regulate the size of the upload window and 2) on the edge device to adjust the sequence of images, respectively. Notably, the proposed DaCM can be seamlessly combined with traditional ReID methods, enabling their application within our proposed edge-cloud collaborative framework. Extensive experiments demonstrate that our method obviously reduces transmission overhead and significantly improves performance.

Abstract: Recent advances in diffusion models focus on efficiently handling conditional generative tasks without extra training. The process involves decomposing the result into two components: 1. unconditional sample, generated in the absence of conditions; 2. condition correction, adjusting unconditional sample to include the guidance image. This adjustment is quantified by the pixellevel measure, where the latent is decoded back into a pixel image, and the forward operator translates the noisy image into the guidance domain for comparison with the guidance image. To enhance the fidelity of condition correction, we propose a learnable latent forward operator, focusing on latent-space consistency with the expectation that this latent-space consistency approximates the pixel-level fidelity measure. The encoder translates the guidance image into the latent space, and a correctional operator is proposed to rectify model mismatching in the latent guidance model. The determination of the condition term and the correction estimation is akin to solving a blind inverse problem. Our EMControl employs the Expectation-Maximization (EM) algorithm to solve the blind inverse problem during the reverse sampling process. This technique ensures that samples, once consistent with the guidance, are accurately mapped back onto the noisy data manifold, adhering to the data's inherent distribution. The EMControl has proven its effectiveness by delivering superior performance in conditional diffusion generation tasks compared to previous approaches. Moreover, its application to multiple-condition scenarios underscores its versatility and robustness across a range of generative tasks.

Abstract: Camerabased 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving. However, images provide limited information making the model susceptible to geometric ambiguity caused by occlusion and perspective distortion. Existing methods often lack explicit semantic modeling between objects, limiting their perception of 3D semantic context. To address these challenges, we propose a novel method VLScene: Vision-Language Guidance Distillation for Camera-based 3D Semantic Scene Completion. The key insight is to use the vision-language model to introduce high-level semantic priors to provide the object spatial context required for 3D scene understanding. Specifically, we design a vision-language guidance distillation process to enhance image features, which can effectively capture semantic knowledge from the surrounding environment and improve spatial context reasoning. In addition, we introduce a geometric-semantic sparse awareness mechanism to propagate geometric structures in the neighborhood and enhance semantic information through contextual sparse interactions. Experimental results demonstrate that VLScene achieves rank-1st performance on challenging benchmarks—SemanticKITTI and SSCBench-KITTI-360, yielding remarkably mIoU scores of 17.52 and 19.10, respectively.

School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China SJTU-Paris Elite Institute of Technology, Shanghai Jiao Tong University, Shanghai, China, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China, Institute of Cyber Science and Technology, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China, University of Minnesota Twin Cities, Saint Paul, MN, USA

Abstract: Computeraided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain, and storage costs are substantial. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.

Abstract: Textdriven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure the consistency between the movement trajectory and the textual description. (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffuion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.

School of Computer Science and Technology, Xi’an Jiaotong University, China Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University, China, Institute of Big Data, Fudan University, China, Shanghai University of Finance and Economics, China, Nanyang Technological University, Singapore, School of Continuing Education, Xi’an Jiaotong University, China Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, China, School of Computer Science and Technology, Xi’an Jiaotong University, China Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University, China

Abstract: In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on singletarget expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignments, including word-object, phrase-object, and text-image alignment. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling in multi-modal features with the same counting and pushing away those with different counting. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and also the other 4 tasks, including REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the remarkable superiority and generalizability of the proposed HieA2G.

Department of Electronic Engineering, Tsinghua University, China Beijing National Research Center for Information Science and Technology (BNRist), China, Department of Electronic Engineering, Tsinghua University, China Beijing National Research Center for Information Science and Technology (BNRist), China, Department of Electronic Engineering, Tsinghua University, China Beijing National Research Center for Information Science and Technology (BNRist), China, Department of Electronic Engineering, Tsinghua University, China Beijing National Research Center for Information Science and Technology (BNRist), China

Abstract: 3D Vision Grounding (3DVG) seeks to unravel referential language and identify targets in 3D physical world. Prevailing methods align with the 2D-VG's pipeline to pinpoint the referred object in a categorical multi-modal reasoning manner. However, the geometric complexities of 3D scenes and the nuanced syntactic structures of language, exacerbates the \textbf{granularity inconsistency} of point cloud and text features, hindering the development of 3D-VG systems in complex scenarios. Towards this issue, we propose LIBA, a Language-Instructed multi-granularity Bridge Assistant tailored for 3D-VG task. LIBA tackles this issue as follows. (1) \textit{How to establish a multi-granularity 3D vision-text feature alignment in a unified model}? We advance a bilateral Dynamic Bridge Adapter (DBA) build multi-granularity interaction of 3D vision and language backnones during feature extraction. We further develop the Language-aware Cross-scale Object Modulation (LCOM) module to integrate multi-scale point cloud features modulated by language information. (2) After aligning multi-modal features, \textit{how to fully harness language model's knowledge to bolster vision concepts understanding}? A LLM-guided Hierarchical Query Selection (LLM-HQS) module incorporates world knowledge of Large Language Model~(LLM) to ground the target referral via an Attribute-then-Relation reasoning process. In this manner, our LIBA inherits reasoning prowess and world knowledge of LLM to bridge point clouds and texts at multiple granularities. Experiments on ScanRefer and Nr3D/Sr3D benchmarks substantiate the superiority of our LIBA, trumping state-of-the-arts by a considerable margin.

School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, School of Artificial Intelligence, Anhui University Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University

Abstract: Multimodal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three object ReID benchmarks verify the effectiveness of our methods.

School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, School of Artificial Intelligence, Anhui University Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology

Abstract: Multimodal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro could extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods.

Abstract: Despite the rapid development of Chinese visionlanguage models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.

Abstract: The goal of scene text image superresolution (STISR) is to enhance the clarity of text within line images, thereby improving readability and enabling more accurate text recognition. However, existing STISR methods often rely heavily on Text Prior (TP) derived from trained recognizers, which can be unreliable and may lead to incorrect glyph restoration. Text images contain two crucial types of information: semantic content from word meanings and structural details from glyphs. When semantic information is unreliable, accurate perception of glyph structures becomes essential. This paper introduces GlyphSR, a novel STISR framework that addresses three key challenges: precise extraction, effective learning, and optimal utilization of glyph structural information. GlyphSR incorporates the Glyph Extraction Module (GEM), a training-free approach leveraging the Segment Anything Model (SAM) to accurately extract character-level glyphs. The Glyph Perception Module (GPM) models and learns glyph structures through segmentation and classification tasks, while the Glyph Fusion Module (GFM) integrates glyph information to enhance overall STISR model performance. Extensive experiments on the TextZoom dataset demonstrate that GlyphSR achieves a new state-of-the-art performance.

Abstract: Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We initially perform a comprehensive analysis of seven prominent action recognition methods across five widelyused datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods variably declines, undermining their robustness. This decline in performance poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward in uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios.

School of Computer Science and Technology, Chongqing University of Posts and Telecommunications,Chongqing, China, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications,Chongqing, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, School of Cyber Engineering, Xidian University, Xi’an, China, School of Cyber Engineering, Xidian University, Xi’an, China, School of Computer Science and Technology, Chongqing University of Posts and Telecommunications,Chongqing, China Jinan Inspur Data Technology Co., Ltd., Jinan, China

Abstract: Substitute trainingbased data-free black-box attacks pose a significant threat to enterprise-deployed models. These attacks use a generator to synthesize data and query APIs, then train a substitute model to approximate the target model's decision boundary based on the returned results. However, existing attack methods often struggle to produce sufficiently diverse data, particularly for complex target models and extensive target data domains, severely limiting their practical application. To address this gap, we design domain-augmented learning to improve the quality of the synthetic data domain (SDD) generated by the generator from two perspectives. Specifically, (1) To broaden the SDD's coverage, we introduce textual semantic embeddings into the generator for the first time. (2) For enhancing the SDD's discretization, we propose a competitive optimization strategy that forces the generator to self-compete, along with heterogeneity excitation to overcome the constraints of information entropy on diversity. Comprehensive experiments demonstrate that our method is more effective. In non-targeted attacks on the CIFAR-10 and Tiny-ImageNet datasets, our method outperforms the state-of-the-art by 14% and 7% in attack success rate, respectively.

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Software, Northwestern Polytechnical University, Xi'an 710072, China, School of Vehicle and Mobility, Tsinghua University, Beijing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Abstract: Contrastive learning has achieved great success in skeletonbased representation learning recently. However, the prevailing methods are predominantly negative-based, necessitating additional momentum encoder and memory bank to get negative samples, which increases the difficulty of model training. Furthermore, these methods primarily concentrate on learning a global representation for recognition and retrieval tasks, while overlooking the rich and detailed local representations that are crucial for dense prediction tasks. To alleviate these issues, we introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation, called USDRL, which employs feature decorrelation across temporal, spatial, and instance domains in a multi-grained manner to reduce redundancy among dimensions of the representations to maximize information extraction from features. Additionally, we design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action representations effectively, thereby enhancing the performance of dense prediction tasks. Comprehensive experiments, conducted on the benchmarks NTU-60, NTU-120, PKU-MMD I, and PKU-MMD II, across diverse downstream tasks including action recognition, action retrieval, and action detection, conclusively demonstrate that our approach significantly outperforms the current state-of-the-art (SOTA) approaches.

Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing, China Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China, Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing, China, Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing, China, Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing, China, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, School of Computing and Communication, Lancaster University, Lancaster, UK, National Key Lab for Novel Software Technology, Nanjing University, Nanjing, China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China

Abstract: Incremental fewshot semantic segmentation (IFSS) expands segmentation capacity of the trained model to segment new-class images with few samples. However, semantic meanings may shift from background to object class or vice versa during incremental learning. Moreover, new-class samples often lack representative attribute features when the new class greatly differs from the pre-learned old class. In this paper, we propose a causal framework to discuss the cause of semantic shift and incompleteness in IFSS, and we deconfound the revealed causal effects from two aspects. First, we propose a Causal Intervention Module (CIM) to resist semantic shift. CIM progressively and adaptively updates prototypes of old class, and removes the confounder in an intervention manner. Second, a Prototype Refinement Module (PRM) is proposed to complete the missing semantics. In PRM, knowledge gained from the episode learning scheme assists in fusing features of new-class and old-class prototypes. Experiments on both PASCAL-VOC 2012 and ADE20k benchmarks demonstrate the outstanding performance of our method.

Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University, Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University, Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University, Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University, Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University Guangdong Provincial Key Laboratory of Intelligent Information Processing, University of Exeter, Great Bay University, Computer Vision Institute School of Computer Science & Software Engineering Shenzhen University Guangdong Provincial Key Laboratory of Intelligent Information Processing National Engineering Laboratory for Big Data System Computing Technology Shenzhen University

Abstract: For efficient and highfidelity local facial attribute editing, most existing editing methods either require additional fine-tuning for different editing effects or tend to affect beyond the editing regions. Alternatively, inpainting methods can edit the target image region while preserving external areas. However, current inpainting methods still suffer from the generation misalignment with facial attributes description and the loss of facial skin details. To address these challenges, (i) a novel data utilization strategy is introduced to construct datasets consisting of attribute-text-image triples from a data-driven perspective, (ii) a Causality-Aware Condition Adapter is proposed to enhance the contextual causality modeling of specific details, which encodes the skin details from the original image while preventing conflicts between these cues and textual conditions. In addition, a Skin Transition Frequency Guidance technique is introduced for the local modeling of contextual causality via sampling guidance driven by low-frequency alignment. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in boosting both fidelity and editability for localized attribute editing. Our codes will be made publicly available.

Abstract: Masked Image Modeling (MIM) has garnered significant attention in selfsupervised learning, thanks to its impressive capacity to learn scalable visual representations tailored for downstream tasks. However, images inherently contain abundant redundant information, leading the pixel-based MIM reconstruction process to focus excessively on finer details such as textures, thus prolonging training times unnecessarily. Addressing this challenge requires a shift towards a compact representation of features during MIM reconstruction. Frequency domain analysis provides a promising avenue for achieving compact image feature representation. In contrast to the commonly used Fourier transform, wavelet transform not only offers frequency information but also preserves spatial characteristics and multi-level features of the image. Additionally, the multi-level decomposition process of wavelet transformation aligns well with the hierarchical architecture of modern neural networks. In this study, we leverage wavelet transform as a tool for efficient representation learning to expedite the training process of MIM. Specifically, we conduct multi-level decomposition of images using wavelet transform, utilizing wavelet coefficients from different levels to construct distinct reconstruction targets representing various frequencies and scales. These reconstruction targets are then integrated into the MIM process, with adjustable weights assigned to prioritize the most crucial information. Extensive experiments demonstrate that our method achieves comparable or superior performance across various downstream tasks while exhibiting higher training efficiency.

School of Computer Science and Engineering, South China University of Technology, School of Computer Science and Engineering, South China University of Technology, School of Computer Science and Engineering, South China University of Technology, School of Computer Science and Engineering, South China University of Technology, School of Computer Science and Engineering, South China University of Technology, Institute of Super Robotics(Huangpu), School of Computer Science and Engineering, South China University of Technology Institute of Super Robotics(Huangpu), Department of Computer Science, City University of Hong Kong

Abstract: Blind face video restoration aims to restore highfidelity details from videos subjected to complex and unknown degradations. This task poses a significant challenge of managing temporal heterogeneity while at the same time maintaining stable face attributes. In this paper, we introduce a Discrete Prior-based Temporal-Coherent content prediction transformer to address the challenge, and our model is referred to as DP-TempCoh. Specifically, we incorporate a spatial-temporal-aware content prediction module to synthesize high-quality content from discrete visual priors, conditioned on degraded video tokens. To further enhance the temporal coherence of the predicted content, a motion statistics modulation module is designed to adjust the content, based on discrete motion priors in terms of cross-frame mean and variance. As a result, the statistics of the predicted content can match with that of real videos over time. By performing extensive experiments, we verify the effectiveness of the design elements and demonstrate the superior performance of our DP-TempCoh in both synthetically and naturally degraded video restoration.

Abstract: Distinguishing spatial relations is a basic part of human cognition which requires finegrained perception on cross-instance. Although benchmarks like MME, MMBench and SEED comprehensively have evaluated various capabilities which already include visual spatial reasoning(VSR). There is still a lack of sufficient quantity and quality evaluation and optimization datasets for Vision Large Language Models(VLLMs) specifically targeting visual positional reasoning. To handle this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found current VLLMs to exhibit a contradiction of over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark from two aspects of tunning data and model structure, we mitigated this phenomenon. To our knowledge, we expanded spatially positioned image data controllably using diffusion models for the first time and integrated original visual encoding(CLIP) with other 3 powerful visual encoders(SigLIP, SAM and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert(VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set. It becomes a performant VLLM on the position reasoning of both the VSR dataset and relevant subsets of other evaluation benchmarks. We hope it will accelerate advancements in VLLM on VSR learning.

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences National innovation center for advanced medical devices, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences

Abstract: Virtual staining leverages computeraided techniques to transfer the style of histochemically stained tissue samples to other staining types. In virtual staining of pathological images, maintaining strict structural consistency is crucial, as these images emphasize structural integrity more than natural images. Even slight structural alterations can lead to deviations in diagnostic semantic information. Furthermore, the unpaired characteristic of virtual staining data may compromise the preservation of pathological diagnostic content. To address these challenges, we propose a dual-path inversion virtual staining method using prompt learning, which optimizes visual prompts to control content and style, while preserving complete pathological diagnostic content. Our proposed inversion technique comprises two key components: (1) Dual Path Prompted Strategy, we utilize a feature adapter function to generate reference images for inversion, providing style templates for input image inversion, called Style Target Path. We utilize the inversion of the input image as the Structural Target path, employing visual prompt images to maintain structural consistency in this path while preserving style information from the style Target path. During the deterministic sampling process, we achieve complete content-style disentanglement through a plug-and-play embedding visual prompt approach. (2) StainPrompt Optimization, where we only optimize the null visual prompt as ``operator'' for dual path inversion, rather than fine-tune pre-trained model. We optimize null visual prompt for structual and style trajectory around pivotal noise on each timestep, ensuring accurate dual-path inversion reconstruction. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate high structural consistency and accurate style transfer results.

Abstract: 3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeterwave radar is very attractive since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with the radar point clouds, the performance of the existing methods is still much lower than expected. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D radar-based 3D object detection. It characterizes the capability of learning the feature from a Lidar-radar-fused teacher network with semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Then, two feature distillation modules are designed to facilitate the cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD boosts the mAP by 10.38% over the baseline and outperforms the state-of-the-art works on the VoD dataset. The experiment on ZJUODset also shows 5.12% mAP improvements on the moderate difficulty level over the baseline when extra unlabeled data are available.

Abstract: The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive highquality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply Video-ChatGPT and span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

Abstract: We present OOTDiffusion, a novel network architecture for realistic and controllable imagebased virtual try-on (VTON). We leverage the power of pretrained latent diffusion models, designing an outfitting UNet to learn the detailed garment features. Without a redundant warping process, the garment features are precisely aligned with the target human body via the proposed outfitting fusion in the self-attention layers of the denoising UNet. In order to further enhance the controllability, we introduce outfitting dropout to the training process, which enables us to adjust the strength of the garment features through classifier-free guidance. Our comprehensive experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results for arbitrary human and garment images, which outperforms other VTON methods in both realism and controllability, indicating a breakthrough in virtual try-on.

Abstract: Large Language Models (LLMs) have demonstrated potential in Visionand-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for route summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards applications of MLLMs in the field of embodied intelligence.

Abstract: Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a SymmetryGuidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformation to transform key geometries of partial inputs into missing regions, thereby generating geometry-align partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistency final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.

Abstract: Image demoiréing poses one of the most formidable challenges in image restoration, primarily due to the unpredictable and anisotropic nature of moiré patterns. Limited by the quantity and diversity of training data, current methods tend to overfit to a single moiré domain, resulting in performance degradation for new domains, and restricting their robustness in realworld applications. In this paper, we propose a universal image demoiréing solution, UniDemoiré, which has superior generalization capability. Notably, we propose innovative and effective data generation and synthesis methods that can automatically provide vast high-quality moiré images to train a universal demoiréing model. Our extensive experiments demonstrate the cutting-edge performance and broad potential of our approach for generalized image demoiréing.

Abstract: Neural surface representation has demonstrated remarkable success in the areas of novel view synthesis and 3D reconstruction. However, assessing the geometric quality of 3D reconstructions in the absence of ground truth mesh remains a significant challenge, due to its renderingbased optimization process and entangled learning of appearance and geometry with photometric losses. In this paper, we present a novel framework, i.e, GURecon, which establishes a geometric uncertainty field for the neural surface based on geometric consistency. Different from existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field for the reconstructed surface, and is learned by an online distillation approach without introducing real geometric information for supervision. Moreover, in order to mitigate the interference of illumination on geometric consistency, a decoupled field is learned and exploited to finetune the uncertainty field. Experiments on various datasets demonstrate the superiority of GURecon in modeling 3D geometric uncertainty, as well as its plug-and-play extension to various neural surface representations and improvement on downstream tasks such as incremental reconstruction.

Abstract: This paper studies the complex challenge of crossdomain image retrieval under the condition of noisy labels (NCIR), a scenario that not only includes the inherent obstacles of traditional cross-domain image retrieval (CIR) but also requires alleviating the adverse effects of label noise. To address this challenge, this paper introduces a novel Robust Domain Alignment framework (RoDA), specifically designed for the NCIR task. At the heart of RoDA is the Selective Division and Adaptive Learning mechanism (SDAL), a key component crafted to shield the model from overfitting the noisy labels. SDAL effectively learns discriminative knowledge by dividing the dataset into clean and noisy parts, subsequently rectifying the labels for the latter based on information drawn from the clean one. This process involves adaptively weighting the relabeled samples and leveraging both the clean and relabeled data to bootstrap model training. Moreover, to bridge the domain gap further, we introduce the Accumulative Class Center Alignment (ACCA), a novel approach that fosters domain alignment through an accumulative domain loss mechanism.Thanks to SDAL and ACCA, our RoDA demonstrates its superiority in overcoming label noise and domain discrepancies within the NCIR paradigm. The effectiveness and robustness of our RoDA framework are comprehensively validated through extensive experiments across three multi-domain benchmarks.

Abstract: Medical Visual Question Answering (MedVQA) serves as an automated medical assistant, capable of answering patient queries and aiding physician diagnoses based on medical images and questions. Recent advancements have shown that incorporating Large Language Models (LLMs) into MedVQA tasks significantly enhances the capability for answer generation. However, for tasks requiring finegrained organ-level precise localization, relying solely on language prompts struggles to accurately locate relevant regions within medical images due to substantial background noise. To address this challenge, we explore the use of visual prompts in MedVQA tasks for the first time and propose fine-grained adaptive visual prompts to enhance generative MedVQA. Specifically, we introduce an Adaptive Visual Prompt Creator that adaptively generates region-level visual prompts based on image characteristics of various organs, providing fine-grained references for LLMs during answer retrieval and generation from the medical domain, thereby improving the model's precise cross-modal localization capabilities on original images. Furthermore, we incorporate a Hierarchical Answer Generator with Parameter-Efficient Fine-Tuning (PEFT) techniques, significantly enhancing the model's understanding of spatial and contextual information with minimal parameter increase, promoting the alignment of representation learning with the medical space. Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative MedVQA.

Abstract: Remote sensing image fusion aims to reconstruct a high spatial and spectral resolution image by integrating the spatial and spectral information from multiple remote sensing sensor data. Despite the remarkable progress of deep learningbased fusion methods, most existing methods rely on manual network architecture design and hyperparameter tuning, lacking sufficient interpretability and adaptability. To address this limitation, we propose a novel neural Ordinary Differential Equation (ODE)-inspired tuning-free proximal splitting algorithm, which splits remote sensing image fusion as two optimization problems regularized by deep priors to model the fusion of spatial and spectral. Firstly, based on the physical properties of spatial and spectral information, the two problems are optimized by two proximal splitting operators to iteratively integrate spatial-spectral complementary information, eliminating or suppressing redundant information to reduce fusion errors. Secondly, considering the efficiency of neural ODE in reducing optimization error, we utilize a high-order numerical scheme to customize the proximal operator theoretically without additional handcrafted design and parameter tuning. Finally, by incorporating the numerical scheme as a solver into the proximal optimization algorithm, we derive an ODE-inspired Tuning-free Proximal Network, dubbed OTPNet, which achieves efficient and robust fusion reconstruction. Extensive experiments on nine datasets across three different remote sensing image fusion tasks show that our OTPNet outperforms existing state-of-the-art approaches, which validates the effectiveness of our method.

Abstract: The primary challenge in continuous sign language recognition (CSLR) mainly stems from the presence of multiorientational and long-term motions. However, current research overlooks these crucial aspects, significantly impacting accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our innovative Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed: stage and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD shows SOTA performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improve the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA.

Abstract: Adversarial examples’ (AE) transferability refers to the phenomenon that AEs crafted with one surrogate model can also fool other models. Notwithstanding remarkable progress in untargeted transferability, its targeted counterpart remains challenging. This paper proposes an everywhere scheme to boost targeted transferability. Our idea is to attack a victim image both globally and locally. We aim to optimize ‘an army of targets’ in every local image region instead of the previous works that optimize a highconfidence target in the image. Specifically, we split a victim image into non-overlap blocks and jointly mount a targeted attack on each block. Such a strategy mitigates transfer failures caused by attention inconsistency between surrogate and victim models and thus results in stronger transferability. Our approach is method-agnostic, which means it can be easily combined with existing transferable attacks for even higher transferability. Extensive experiments on ImageNet demonstrate that the proposed approach universally improves the state-of-the-art targeted attacks by a clear margin, e.g., the transferability of the widely adopted Logit attack can be improved by 28.8%-300%. We also evaluate the crafted AEs on a real-world platform: Google Cloud Vision. Results further support the superiority of the proposed method.

Abstract: Gait Emotion Recognition (GER) is an emerging task within Human Emotion Recognition. Skeletonbased GER requires discriminative spatial and temporal features. However, current methods primarily focus on capturing spatial topology information but fail to effectively learn temporal features from long-distance frames. Moreover, these methods are mostly sensitive to the order of sampled sequences, resulting in significant accuracy drops when sequences are randomly sampled. In order to obtain a more robust and comprehensive spatial-temporal representation of gait, we introduce the Graph-Transformer architecture into GER for the first time, proposing a novel framework named GaitCycFormer. Specifically, we designed a Cycle Position Encoding (CPE) based on the gait cycle, which explicitly segments any gait sequence into more manageable periodic units, to enhance temporal feature modeling. Additionally, we incorporate a bi-level Transformer, consisting of an Intra-cycle Transformer and an Inter-cycle Transformer to capture local and global temporal information within each gait cycle and between gait cycles respectively. Experiments demonstrate that our GaitCycFormer achieves state-of-the-art performance on popular datasets, and proves to be more reliable and robust.

Abstract: Face AntiSpoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.

Abstract: Textto-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model’s understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing and colors.

Abstract: Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and textsemantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle this challenge, we propose a novel knowledge graph, namely Character-Graph (CG), which represents various story-related knowledge, including the characters, their attributes and the relationship. We then introduce StoryWeaver, an image generator that achieves Customization via Character-Graph (C-CG), capable of consistent story visualization with rich text semantics. To further improve the multi-character generation performance, we incorporate knowledge-enhanced spatial guidance (KE-SG) into StoryWeaver to precisely inject character semantics into generation. To validate the effectiveness of our proposed method, extensive experiments are conducted using a new benchmark called TBC-Bench. The experiments confirm that our StoryWeaver excels not only in creating vivid visual story plots but also in accurately conveying character identities across various scenarios with considerable storage efficiency, e.g., achieving an average increase of +9.03% DINO-I and +13.44% CLIP-T. Furthermore, ablation experiments are conducted to verify the superiority of each proposed module.

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China, Beijing Jingwei Hirain Technologies Co., Inc., Hangzhou Innovation Institute, Beihang University, Hangzhou, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China Hangzhou Innovation Institute, Beihang University, Hangzhou, China Zhongguancun Laboratory, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China Hangzhou Innovation Institute, Beihang University, Hangzhou, China

Abstract: Bird'sEye-View (BEV) representation has emerged as a mainstream paradigm for multi-view 3D object detection, demonstrating impressive perceptual capabilities. However, existing methods overlook the geometric quality of BEV representation, leaving it in a low-resolution state and failing to restore the authentic geometric information of the scene. In this paper, we identify the drawbacks of previous approaches that limit the geometric quality of BEV representation and propose Radial-Cartesian BEV Sampling (RC-Sampling), which outperforms other feature transformation methods in efficiently generating high-resolution dense BEV representation to restore fine-grained geometric information. Additionally, we design a novel In-Box Label to substitute the traditional depth label generated from the LiDAR points. This label reflects the actual geometric structure of objects rather than just their surfaces, injecting real-world geometric information into the BEV representation. In conjunction with the In-Box Label, Centroid-Aware Inner Loss (CAI Loss) is developed to capture the inner geometric structure of objects. Finally, we integrate the aforementioned modules into a novel multi-view 3D object detector, dubbed GeoBEV, which achieves a state-of-the-art result of 66.2% NDS on the nuScenes test set.

Abstract: Neural Radiance Fields (NeRF) has been widely used in computer vision and graphics, achieving impressive results in novel view synthesis and multiview 3D reconstruction. However, despite its excellent performance under ideal conditions, NeRF struggles in challenging environments such as hazy, foggy, and underwater scenes, primarily due to the difficulty in decoupling objects from the scattering medium. To mitigate this limitation, we proposed a novel approach for NeRF in scenes with scattering media. Specifically, we leverage pseudo-labels during the early stage of training to guide NeRF in decoupling the densities of objects and the scattering medium, guiding the model toward a more appropriate search space. Furthermore, we introduce a Cyclical Progressive Dimensional Optimization Strategy (CPDOS) that focuses on optimizing a single or a few variables during specific periods. Experimental results demonstrate that our method can effectively simulate hazy and underwater scenes, accurately decouple the scattering medium from objects, estimate atmospheric parameters, and outperform existing methods in novel view synthesis and image restoration tasks.

Abstract: Recently, state space models have exhibited strong global modeling capabilities and linear computational complexity in contrast to transformers. This research focuses on applying such architecture to more efficiently and effectively model point cloud data globally with linear computational complexity. In particular, for the first time, we demonstrate that Mambabased point cloud methods can outperform previous methods based on transformer or multi-layer perceptrons (MLPs). To enable Mamba to process 3-D point cloud data more effectively, we propose a novel Consistent Traverse Serialization method to convert point clouds into 1-D point sequences while ensuring that neighboring points in the sequence are also spatially adjacent. Consistent Traverse Serialization yields six variants by permuting the order of x, y, and z coordinates, and the synergistic use of these variants aids Mamba in comprehensively observing point cloud data. Furthermore, to assist Mamba in handling point sequences with different orders more effectively, we introduce point prompts to inform Mamba of the sequence’s arrangement rules. Finally, we propose positional encoding based on spatial coordinate mapping to inject positional information into point cloud sequences more effectively. Point Cloud Mamba surpasses the state-of-the-art (SOTA) point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, ShapeNetPart, and S3DIS datasets. It is worth mentioning that when using a more powerful local feature extraction module, our PCM achieves 79.6 mIoU on S3DIS, significantly surpassing the previous SOTA models, DeLA and PTv3, by 5.5 mIoU and 4.9 mIoU, respectively.

Abstract: Generative models are widely used to produce synthetic images with annotations, alleviating the burden of image collection and annotation for training deep visual models. However, challenges such as limited image diversity, noisy pseudo labels, and domain gaps between synthetic and real images often undermine their effectiveness in downstream visual tasks. This paper introduces the Iterative SelfTraining with Class-Aware Text-to-Image Synthesis (IST-CATS) framework, which addresses these challenges by integrating a class-aware text-to-image synthesis (CATS) component with an iterative self-training (IST) strategy. CATS innovatively introduces a class-aware chain approach to generate detailed descriptions. These descriptions act as prompts for a diffusion model, enabling the creation of a diverse of images accompanied by distinguishable objects against the background. The generated images can be easily pseudo-labeled by an unsupervised instance segmentation method, and then noisy pseudo labels can be effectively purified by a novel feature similarity-based filtering mechanism. The generated images underpin our IST, which progressively enhances vision models and refines pseudo labels through self-training and our proposed label filtering strategy (LabFilt). LabFilt meticulously improves the quality of pseudo labels by employing class-adaptive techniques at both the pixel and object levels, ensuring refined pseudo-label accuracy. IST-CATS demonstrates superior performance in object detection and semantic segmentation compared to traditional synthetic and semi/weakly-supervised methods, effectively addressing data collection and annotation challenges.

Abstract: Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of LVLMs remains an underexplored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, its internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. However, those image tokens that are less relevant to the text do not have information flow convergence, and they only get very small attention scores. To efficiently utilize the image information, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks.

The Hong Kong University of Science and Technology, SenseTime Research, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, Institute of Artificial Intelligence (TeleAI), China Telecom, Sensetime Research, The Chinese University of Hong Kong, Institute for AI Industry Research (AIR), Tsinghua University, Institute for AI Industry Research (AIR), Tsinghua University, The Hong Kong University of Science and Technology

Abstract: Existing learningbased stereo image codec adopt sophisticated transformation with simple entropy models derived from single image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies, by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets Cityscapes and InStereo2K with fast encoding and decoding speed.

Abstract: Controllable image semantic understanding tasks, such as captioning or segmentation, necessitate users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, presenting challenges such as highcost prompt input or limited information output. This paper introduces a new task ``Image Collaborative Segmentation and Captioning'' (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for correlated mask-caption prediction. Initially, we introduce a Prompt-Centric Scene Graph Adaptor to map a user's prompt to a scene graph, effectively capturing his intention. Subsequently, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss to explicitly align visual and textual entities by considering inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input.

Abstract: Textto-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments in extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Please watch all videos in supplementary materials for better view.

Abstract: Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs). These perturbations are meticulously designed to fool the target model universally across all sample classes. Unlike instancespecific adversarial examples (AEs), generating UAPs is more complex because they must be generalized across a wide range of data samples and models. Our research reveals that existing universal attack methods, which optimize UAPs using DNNs with static model parameter snapshots, do not fully leverage the potential of DNNs to generate more effective UAPs. Rather than optimizing UAPs against static DNN models with a fixed training set, we suggest using dynamic model-data pairs to generate UAPs. In particular, we introduce a dynamic maximin optimization strategy, aiming to optimize the UAP across a variety of optimal model-data pairs. We term this approach DM-UAP. DM-UAP utilizes an iterative max-min-min optimization framework that refines the model-data pairs, coupled with a curriculum UAP learning algorithm to examine the combined space of model parameters and data thoroughly. Comprehensive experiments on the ImageNet dataset demonstrate that the proposed DM-UAP markedly enhances both cross-sample universality and cross-model transferability of UAPs. Using only 500 samples for UAP generation, DM-UAP outperforms the state-of-the-art approach with an average increase in fooling ratio of 12.108%.

Abstract: Adversarial attack and defense have been extensively explored in classification tasks, but their study in semantic segmentation remains limited. Moreover, current attacks fail to act as strong underlying attacks for adversarial training (AT), making it difficult to achieve segmentation robustness against strong attacks. In this paper, we present RPPGD, a novel Region-and-Prototype based Projected Gradient Descent attack tailored to fool segmentation models. In particular, we propose a region-based attack, which leverages a spatial-temporal way to separate the pixels into three disjoint regions, and highlights the attack on the crucial True Region and Boundary Region. Moreover, we introduce a prototype-based attack to disrupt the feature space, further enhancing the attack capability. To boost the robustness of segmentation models, we inject adversaries generated by RP-PGD into the clean data and perform AT. Extensive experiments on multiple datasets showcase that RP-PGD generates adversaries with faster convergence and stronger attack effectiveness, surpassing state-of-the-art attacks by a large margin. Consequently, RP-PGD serves as a strong underlying attack for segmentation models to perform AT, assisting them in defending against a variety of strong attacks without incurring additional computational costs during inference.

Platform and Content Group, Tencent, Beijing, China, Platform and Content Group, Tencent, Beijing, China, Platform and Content Group, Tencent, Beijing, China, Platform and Content Group, Tencent, Beijing, China, Platform and Content Group, Tencent, Beijing, China, Harbin Institute of Technology, Harbin, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences., Beijing, China, Platform and Content Group, Tencent, Beijing, China

Abstract: Referring MultiObject Tracking (RMOT) aims to track multiple objects based on a provided language expression. Although prior studies have sought to accomplish this by integrating an textual module into the multi-object tracker, these methods combine text and image features in a basic way, neglecting the importance of text features. In this study, we propose a Hierarchical Fine-grained text-image Fusion tracker, named HFF-Tracker, which can perform fine-grained fusion of pixel-level visual features and text features across various semantic levels. Specifically, we have devised a Hierarchical Multi-Modal Fusion (HMMF) module to merge text and image features at an early stage in a hierarchical and detailed manner. The Text-Guided Decoder (TGD) is designed to provide the query with prior semantic information during the decoding process. Additionally, we have crafted a Text-Guided Prediction Head (TGPH) that utilizes text information to enhance the performance of the prediction head. Furthermore, we have implemented an adaptive Look-Back training strategy to maximize the utilization of valuable labeled data. Extensive experiments on the Refer-KITTI dataset and the Refer-KITTI-V2 dataset demonstrate that our proposed HFF-Tracker outperforms other state-of-the-art methods with remarkable margins.

Abstract: Neural image compression often faces a challenging tradeoff among rate, distortion and perception. While most existing methods typically focus on either achieving high pixel-level fidelity or optimizing for perceptual metrics, we propose a novel approach that simultaneously addresses both aspects for a fixed neural image codec. Specifically, we introduce a plug-and-play module at the decoder side that leverages a latent diffusion process to transform the decoded features, enhancing either low distortion or high perceptual quality without altering the original image compression codec. Our approach facilitates fusion of original and transformed features without additional training, enabling users to flexibly adjust the balance between distortion and perception during inference. Extensive experimental results demonstrate that our method significantly enhances the pretrained codecs with a wide, adjustable distortion-perception range while maintaining their original compression capabilities. For instance, we can achieve more than 150% improvement in LPIPS-BDRate without sacrificing more than 1 dB in PSNR.

Abstract: Recent years have seen substantial progress in diffusionbased controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce *TrackGo*, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users with a flexible and precise mechanism for manipulating video content. We also propose the *TrackAdapter* for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores.

Abstract: Novel class discovery(NCD) aims to cluster the unlabeled data with the help of a labeled set containing different but related classes. The key to solving NCD is the knowledge transfer between labeled and unlabeled sets.Since NCD requires that known classes and unknown classes are related, it is significant to explore classlevel relationships between known and unknown for more effective knowledge transfer. However, most existing methods either facilitate knowledge transfer by learning a shared representation space or by modeling coarse-grained or asymmetric relationships between known and unknown, neglecting class-level relationships. To tackle these challenges, we propose a symmetric class-to-class relationship modeling and knowledge transfer method, achieving bidirectional knowledge transfer at class-level. Considering that class-level modeling often overlooks the subtle distinctions between samples, we propose pairwise similarity-based relationship modeling and consistency constraint for instance-level knowledge transfer. Extensive experiments on CIFAR100 and three fine-grained datasets demonstrate that our method achieves significant performance improvements compared to state-of-the-art methods.

Abstract: In this paper, we present GaussianPainter, the first method to paint a point cloud into 3D Gaussians given a reference image. GaussianPainter introduces an innovative feedforward approach to overcome the limitations of time-consuming test-time optimization in 3D Gaussian splatting. Our method addresses a critical challenge in the field: the non-uniqueness problem inherent in the large parameter space of 3D Gaussian splatting. This space, encompassing rotation, anisotropic scales, and spherical harmonic coefficients, introduces the challenge of rendering similar images from substantially different Gaussian fields. As a result, feed-forward networks face instability when attempting to directly predict high-quality Gaussian fields, struggling to converge on consistent parameters for a given output. To address this issue, we propose to estimate a surface normal for each point to determine its Gaussian rotation. This strategy enables the network to effectively predict the remaining Gaussian parameters in the constrained space. We further enhance our approach with an appearance injection module, incorporating reference image appearance into Gaussian fields via a multiscale triplane representation. Our method successfully balances efficiency and fidelity in 3D Gaussian generation, achieving high-quality, diverse, and robust 3D content creation from point clouds in a single forward pass. A video is provided in our supplementary material for a more detailed explanation of our method.

Abstract: Efficient tracking has garnered attention for its ability to operate on resourceconstrained platforms for real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal, as incorporating template computation in every frame leads to redundancy, and pervasive semantic interaction between template and search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. AsymTrack disentangles template and search streams into separate branches, with template computing only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectional inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation in lightweight tracker. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state-of-the-arts. For instance, AsymTrack-T achieves 60.8% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0% AUC with higher speeds.

Abstract: Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs.

Department of Computer Science at School of Informatics, Xiamen University, University of Science and Technology of China, Department of Computer Science at School of Informatics, Xiamen University, Department of Computer Science at School of Informatics, Xiamen University, School of Computing and Data Science, The University of Hong Kong, EuroMov Digital Health in Motion, Univ Montpellier, IMT Mines Ales Service de Medecine Nucleaire, Centre Hospitalier Universitaire de Nimes, Universite de Montpellier, Department of Computer Science at School of Informatics, Xiamen University

Abstract: Significant disparities between the features of natural images and those inherent to histopathological images make it challenging to directly apply and transfer pretrained models from natural images to histopathology tasks. Moreover, the frequent lack of annotations in histopathology patch images has driven researchers to explore self-supervised learning methods like mask reconstruction for learning representations from large amounts of unlabeled data. Crucially, previous mask-based efforts in self-supervised learning have often overlooked the spatial interactions among entities, which are essential for constructing accurate representations of pathological entities. To address these challenges, constructing graphs of entities is a promising approach. In addition, the diffusion reconstruction strategy has recently shown superior performance through its random intensity noise addition technique to enhance the robust learned representation. Therefore, we introduce H-MGDM, a novel self-supervised Histopathology image representation learning method through the Dynamic Entity-Masked Graph Diffusion Model. Specifically, we propose to use complementary subgraphs as latent diffusion conditions and self-supervised targets respectively during pre-training. We note that the graph can embed entities' topological relationships and enhance representation. Dynamic conditions and targets can improve pathological fine reconstruction. Our model has conducted pretraining experiments on three large histopathological datasets. The advanced predictive performance and interpretability of H-MGDM are clearly evaluated on comprehensive downstream tasks such as classification and survival analysis on six datasets.

St. Petersburg State University, Steklov Mathematical Institute at St. Petersburg, Russian Academy of Sciences and ITMO University, École Polytechnique Fédérale de Lausanne, Neapolis University Pafos and JetBrains Research, Steklov Mathematical Institute at St. Petersburg, Russian Academy of Sciences, JetBrains Research, Steklov Mathematical Institute at St. Petersburg, Russian Academy of Sciences, Neapolis University Pafos, Neapolis University Pafos, Neapolis University Pafos, ITMO University

Abstract: We present an opensource tool for manipulating Boolean circuits. It implements efficient algorithms, both existing and novel, for a rich variety of frequently used circuit tasks such as satisfiability, synthesis, and minimization. We tested the tool on a wide range of practically relevant circuits (computing, in particular, symmetric and arithmetic functions) that have been optimized intensively by the community for the last three years. The tool helped us to win the IWLS 2024 Programming Contest. In 2023, it was Google DeepMind who took the first place in the competition. We were able to reduce the size of the best circuits from 2023 by 12% on average, whereas for some individual circuits, our size reduction was as large as 83%.

Abstract: This paper studies decentralized optimization over a compact submanifold within a communication network of n nodes, where each node possesses a smooth nonconvex local cost function, and the goal is to jointly minimize the sum of these local costs. We focus particularly on the online setting, where local data is processed in real-time as it streams in, without the need for full data storage. We propose a decentralized projected Riemannian stochastic recursive momentum (DPRSRM) method that employs local hybrid stochastic gradient estimators and uses the network to track the global gradient. DPRSRM achieves an oracle complexity of O(epsilon^(-3/2)), outperforming existing methods that have at most O(epsilon^(-2)) complexity. Our method requires only O(1) gradient evaluations per iteration for each local node and does not require restarting with a large batch gradient. Furthermore, we demonstrate the effectiveness of our proposed methods compared to state-of-the-art ones through numerical experiments on principal component analysis problems and low-rank matrix completion.

Abstract: We consider nonconvex optimization problem over simplex, and more generally, a product of simplices. We provide an algorithm, Langevin Multiplicative Weights Update (LMWU) for solving global optimization problems by adding a noise scaling with the nonEuclidean geometry in the simplex. Non-convex optimization has been extensively studied by machine learning community due to its application in various scenarios such as neural network approximation and finding Nash equilibrium. Despite recent progresses on provable guarantee of escaping and avoiding saddle point (convergence to local minima) and global convergence of Langevin gradient based method without constraints, the global optimization with constraints is less studied. We show that LMWU algorithm is provably convergent to interior global minima with a non-asymptotic convergence analysis. We verify the efficiency of the proposed algorithm in real data set from polynomial portfolio management, where optimization of a highly non-linear objective function plays a crucial role.

Abstract: Branchand-Bound (BB) is an exact method in integer programming that recursively divides the search space into a tree. During the resolution process, determining the next subproblem to explore within the tree—known as the search strategy—is crucial. Hand-crafted heuristics are commonly used, but none are effective over all problem classes. Recent approaches utilizing neural networks claim to make more intelligent decisions but are computationally expensive. In this paper, we introduce GP2S (Genetic Programming for Search Strategy), a novel machine learning approach that automatically generates a BB search strategy heuristic, aiming to make intelligent decisions while being computationally lightweight. We define a policy as a function that evaluates the quality of a BB node by combining features from the node and the problem; the search strategy policy is then defined by a best-first search based on this node ranking. The policy space is explored using a genetic programming algorithm, and the policy that achieves the best performance on a training set is selected. We compare our approach with the standard method of the SCIP solver, a recent graph neural network-based method, and handcrafted heuristics. Our first evaluation includes three types of primal hard problems, tested on instances similar to the training set and on larger instances. Our method is at most 2 percents slower than the best baseline and consistently outperforms SCIP, achieving an average speedup of 11.3 percents. Additionally, GP2S is tested on the MIPLIB 2017 dataset, generating multiple heuristics from different subsets of instances. It exceeds SCIP’s average performance in 7 out of 10 cases across 15 times more instances and under a time limit 15 times longer, with some GP2S methods leading on most experiments in terms of the number of feasible solutions or optimality gap.

Abstract: There has been tremendous progress in the past decade in the field of quantified Boolean formulas (QBF), both in practical solving as well as in creating a theory of corresponding proof systems and their proof complexity analysis. Both for solving and for proof complexity, it is important to have interesting formula families on which we can test solvers and gauge the strength of the proof systems. There are currently few such formula families in the literature. We initiate a general programme how to transform computationally hard problems (located in the polynomial hierarchy) into QBFs hard for the main QBF resolution systems QRes and QU-Res that relate to core QBF solvers. We illustrate this general approach on three problems from graph theory and logic. This yields QBF families that are provably hard for Q-Res and QU-Res (without any complexity assumptions).

Abstract: Spatialtemporal graph modeling is challenging due to the diverse node interactions across spatial and temporal dimensions. Recent studies typically adopt Graph Neural Networks (GNNs) to perform node-level aggregation at different time steps, acting as a series of low-pass graph spectral filters, for node interaction modeling. However, these filters, confined to the spatial dimension, are ill-suited for processing signals of nodes with inherent spatial-temporal interdependencies. Moreover, oversimplified low-pass filtering fails to fully exploit information from diverse node interactions. To address these issues, we propose a Spatial-Temporal Spectral Graph Neural Network (STSGNN), which designs specialized two-dimensional (2-D) graph spectral filters for comprehensive spatial-temporal graph modeling. First, based on the normalized Laplacian spectrum of spatial and temporal graphs, we extend the existing graph spectral theory from a univariate spatial dimension to a bivariate spatial-temporal dimension through a 2-D Discrete Graph Fourier Transform (2-D DGFT). Then, we leverage the bivariate Bernstein polynomial approximation, with learned basis coefficients, to design 2-D filters with specialized spectral properties for unified spatial-temporal signal filtering. Finally, the filtered signals, with refined spatial-temporal representations, are fed into well-designed pyramidal gated convolution modules to acquire multiple ranges of spatial-temporal dependencies. Experiments on traffic and meteorological prediction tasks demonstrate that STSGNN achieves state-of-the-art performance. Additionally, we visualize the 2-D filters learned from inputs with distinct spatial-temporal characteristics to enhance the model's interpretability.

SKLCCSE, School of Computer Science and Engineering, Beihang University Department of Data Science, City University of Hong Kong, SKLCCSE, School of Computer Science and Engineering, Beihang University MIIT Key Laboratory of Data Intelligence and Management, Beihang University School of Economics and Management, Beihang University, SKLCCSE, School of Computer Science and Engineering, Beihang University, SKLCCSE, School of Computer Science and Engineering, Beihang University, Department of Data Science, City University of Hong Kong, SKLCCSE, School of Computer Science and Engineering, Beihang University, Department of Data Science, City University of Hong Kong

Abstract: POI representation learning plays a crucial role in handling tasks related to user mobility data. Recent studies have shown that enriching POI representations with multimodal information can significantly enhance their task performance. Previously, the textual information incorporated into POI representations typically involved only POI categories or checkin content, leading to relatively weak textual features in existing methods. In contrast, large language models (LLMs) trained on extensive text data have been found to possess rich textual knowledge. However leveraging such knowledge to enhance POI representation learning presents two key challenges: first, how to extract POI-related knowledge from LLMs effectively, and second, how to integrate the extracted information to enhance POI representations. To address these challenges, we propose POI-Enhancer, a portable framework that leverages LLMs to improve POI representations produced by classic POI learning models. We first design three specialized prompts to extract semantic information from LLMs efficiently. Then, the Dual Feature Alignment module enhances the quality of the extracted information, while the Semantic Feature Fusion module preserves its integrity. The Cross Attention Fusion module then fully adaptively integrates such high-quality information into POI representations and Multi-View Contrastive Learning further injects human-understandable semantic information into these representations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our framework, showing significant improvements across all baseline representations.

Abstract: Unsupervised anomaly detection (UAD) plays an important role in modern data analytics and it is crucial to provide simple yet effective and guaranteed UAD algorithms for real applications. In this paper, we present a novel UAD method for tabular data by evaluating how much noise is in the data. Specifically, we propose to learn a deep neural network from the clean (normal) training dataset and a noisy dataset, where the latter is generated by adding highly diverse noises to the clean data. The neural network can learn a reliable decision boundary between normal data and anomalous data when the diversity of the generated noisy data is sufficiently high so that the hard abnormal samples lie in the noisy region. Importantly, we provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully, although the method does not utilize any real anomalous data in the training stage. Extensive experiments through more than 60 benchmark datasets demonstrate the effectiveness of the proposed method in comparison to 12 baselines of UAD. Our method obtains a 92.27% AUC score and a 1.68 ranking score on average. Moreover, compared to the stateof-the-art UAD methods, our method is easier to implement.

Abstract: Tabular data, widely used across industries, remains underexplored in deep learning. Selfsupervised learning (SSL) shows promise for pre-training deep neural networks (DNNs) on tabular data, but its potential is hindered by challenges in designing suitable augmentations. Unlike image and text data, where SSL leverages inherent spatial or semantic structures, tabular data lacks such explicit structure. This makes traditional input-level augmentations, like modifying or removing features, less effective due to difficulties in balancing critical information preservation with variability. To address these challenges, we propose RaTab, a novel method that shifts augmentation from input-level to representation-level using matrix factorization, specifically truncated SVD. This approach preserves essential data structures while generating diverse representations by applying dropout at various stages of the representation, thereby significantly enhancing SSL performance for tabular data.

Abstract: Multibehavior recommendation exploits auxiliary behaviors (e.g., view, cart) to help predict users' potential target behavior (e.g., purchase) on a given item. However, existing works suffer from two issues: (1) They generally consider only a single chain from auxiliary behaviors to the target behavior, referred to as a purchase chain (e.g., view -> cart -> purchase), ignoring other valuable purchase chains (e.g., view ->purchase) that are beneficial for recommendation performance. (2) Most studies presume that interacted items in auxiliary behaviors are good for recommendations, and pay little attention to the negative transfer problem. That is, some auxiliary behaviors may negatively transfer the influence to the modeling of target ones (e.g., items viewed but not purchased). To alleviate these issues, we propose a novel Multiple Purchase Chains (MPC) model with negative transfer elimination for multi-behavior recommendation. Specifically, we construct multiple purchase chains from auxiliary to target behaviors according to users' historical interactions, while the representations of a previous behavior will be fed to initialize the next behavior on the chain. Then, we construct a negative graph for the latter behavior and learn the negative representations of users and items which will be filtered out to eliminate negative transfer. Experimental results on two real datasets outperform the best baseline by 40.97% and 47.26% on average in terms of Recall@10 and NDCG@10 respectively, demonstrating the effectiveness of our method.

Abstract: Multimodal recommendation systems can learn users' preferences from existing useritem interactions as well as the semantics of multimodal data associated with items. Many existing methods model this through a multimodal user-item graph, approaching multimodal recommendation as a graph learning task. Graph Neural Networks (GNNs) have shown promising performance in this domain. Prior research has capitalized on GNNs' capability to capture neighborhood information within certain receptive fields (typically denoted by the number of hops, K) to enrich user and item semantics. We observe that the optimal receptive fields for GNNs can vary across different modalities. In this paper, we propose GNNs with Modality-Independent Receptive Fields, which employ separate GNNs with independent receptive fields for different modalities to enhance performance. Our results indicate that the optimal K for certain modalities on specific datasets can be as low as 1 or 2, which may restrict the GNNs' capacity to capture global information. To address this, we introduce a Sampling-based Global Transformer, which utilizes uniform global sampling to effectively integrate global information for GNNs. We conduct comprehensive experiments that demonstrate the superiority of our approach over existing methods.

Abstract: Sequential recommendation (SR) systems predict user preferences by analyzing timeordered interaction sequences. A common challenge for SR is data sparsity, as users typically interact with only a limited number of items. While contrastive learning has been employed in previous approaches to address the challenges, these methods often adopt binary labels, missing finer patterns and overlooking detailed information in subsequent behaviors of users. Additionally, they rely on random sampling to select negatives in contrastive learning, which may not yield sufficiently hard negatives during later training stages. In this paper, we propose Future data utilization with Enduring Negatives for contrastive learning in sequential Recommendation (FENRec). Our approach aims to leverage future data with time-dependent soft labels and generate enduring hard negatives from existing data, thereby enhancing the effectiveness in tackling data sparsity. Experiment results demonstrate our state-of-the-art performance across four benchmark datasets, with an average improvement of 6.16% across all metrics.

Abstract: Video Moment Retrieval (VMR) aims to identify a temporal segment in an untrimmed video that best matches a given textual query. Bias in VMR is a critical issue, where the model achieves favorable results even if disregarding the video input. Existing evaluation methods, such as Resplitting, have attempted to address bias by creating outof-distribution (OOD) datasets. However, these methods provide an incomplete definition of bias and do not quantify bias. To this end, we provide a comprehensive definition of bias in VMR, encompassing both data bias and model bias. Besides, our evaluation metrics can analyze the magnitude of these biases better. To address both data and model biases comprehensively, we introduce Reverse Distribution based VMR (ReDis-VMR). This novel approach dynamically generates datasets with inverse distributions tailored to different models based on Gaussian kernel estimation. As a result, it enables a more accurate evaluation of model performance. Building on ReDis-VMR, we further propose the Dynamic Expandable Adjustment (DEA) pipeline. DEA incrementally expands the model structure to enhance its focus on video and text features, and it incorporates a fair loss to minimize the influence of concentrated data distributions. The experimental results on bias ratio demonstrate that our ReDis method achieves state-of-the-art performance in bias elimination, while the results on moment retrieval confirm the effectiveness of our DEA framework across three evaluation methods, two datasets, and three baselines.

Abstract: Unsupervised deep crossmodal hash retrieval aims to map multi-modal features into binary hash codes without labels, which is of interest due to its storage efficiency, query speed and convenient applications. However, existing approaches suffer from two main limitations: (1) Slightly insufficient consideration of text instance similarity, along with independent or redundant fusion to learn multi-modal similarity information. (2) They ignore the noisy adjacent correlations between multi-modal instances, leading to a lack of discriminative power in the generated hash codes. To address these challenges, we propose a new approach called Statistical Model-driven Similarity Hashing (SMSH). Specifically, we introduce Jaccard similarity when constructing the text similarity matrix. It reduces the similarity error between text instances while better considering the asymmetry of the elements in the text features. After that, we integrate the original similarity information between various modalities to construct a unified similarity matrix. The gaps between modalities are bridged while reducing the redundant information in them. In addition, we introduce a Statistical Model-driven Similarity Enhancement (SMSE) approach, which reduces the noise of similarity relations between multi-modal instances by using a Gaussian Mixture Model to keep instances with lower semantic similarity as far away from each other as possible. Experiments on three benchmark datasets demonstrate the excellent performance of the SMSH method.

Abstract: Traffic flow prediction remains a critical issue in intelligent transport systems. Despite significant efforts in traffic flow modeling, existing approaches exhibit several notable limitations: (i) Most models fail to capture traffic flow similarities over long distances and extended periods; (ii) They struggle to account for spatiotemporal heterogeneity induced by varying traffic flow patterns; (iii) Due to their static modeling approach, they struggle to effectively capture the intricate spatio-temporal entanglement. To address these challenges, we propose a traffic flow prediction framework based on self-supervised learning spatio-temporal entanglement transformer(SSL-STMFormer). This framework adopts a self-supervised learning paradigm, leveraging a transformer architecture that captures richer spatio-temporal information to better represent traffic flow patterns. Specifically, a temporal attention module and a spatial attention module are employed to capture the spatio-temporal dependencies of traffic dynamics, respectively, and spatio-temporal entanglement-aware methods are introduced to allow the model to perceive spatio-temporal entanglement and thus better modelling of real traffic environments. Furthermore, to achieve adaptive spatio-temporal self-supervised learning, adaptive data augmentation is applied to the input traffic flow data, and the traffic flow prediction task is enhanced with temporal heterogeneity module and spatial heterogeneity module. Extensive experimental evaluations conducted on six publicly available real-world transportation datasets demonstrate that our method achieves substantial improvements across these datasets.

Abstract: Recent studies have highlighted significant fairness issues in Graph Transformer (GT) models, particularly against subgroups defined by sensitive features. Additionally, GTs are computationally intensive and memorydemanding, limiting their application to large-scale graphs. Our experiments demonstrate that graph partitioning can enhance the fairness of GT models while reducing computational complexity. To understand this improvement, we conducted a theoretical investigation into the root causes of fairness issues in GT models. We found that the sensitive features of higher-order nodes disproportionately influence lower-order nodes, resulting in sensitive feature bias. We propose Fairness-aware scalable GT based on Graph Partitioning (FairGP), which partitions the graph to minimize the negative impact of higher-order nodes. By optimizing attention mechanisms, FairGP mitigates the bias introduced by global attention, thereby enhancing fairness. Extensive empirical evaluations on six real-world datasets validate the superior performance of FairGP in achieving fairness compared to state-of-the-art methods.

Abstract: Currently, ecommerce platforms integrate ads and organic content into a mixed list for users. While platforms seek to maximize profit from advertisers, organic items enhance user experience. To ensure long-term development, platforms aim to design mechanisms that optimize both revenue and user satisfaction. Current methods rank ads and organic items separately before integrating them. Even if each part is locally optimal, the combined result may not be globally optimal. In this paper, we come up with the Joint Integrated Regret Network (JINTER Net). Unlike traditional methods, which pre-order ads and organic items separately, JINTER Net directly selects from the combined set of candidate ads and organic items to generate an optimal list. This approach aims to optimally balance platform revenue and user experience while satisfying approximate dominant strategy incentive compatibility and individual rationality. We validate the effectiveness of JINTER Net using both synthetic data and real dataset, and our experimental results show that it significantly outperforms baseline models across multiple metrics.

Abstract: Pointof-Interest (POI) recommendation plays an important role in a wide range of location-based social network ap- plications, aiming to accurately predicting users’ next visits based on their historical check-in records. Previous efforts have primarily focused on the modifications of existing sequential models, neglecting the fact that POI visiting sequences typically involve continuous state transformation of geographical and intention signals. Additionally, the diverse time span between check-ins require the model to prop- erly recognize user’s multi-granular preference. While recent advances of State Space Model (SSM) have revealed their potential in handling intricate temporal signals, we propose a state-based model that is tailored for spatio-temporal POI sequences. On top of traditional SSMs that are typically limited to linear sequences like Mamba, we propose GeoMamba, which customizes the model states to accommodate the spatio-temporal sequences, especially fitting for POI recommendations. Specifically, while the approximation operator HiPPO sets the foundation of linear SSMs, we introduce a novel GaPPO operator that extends the model’s state space into graph-represented geographical domains. This innovation allows us to construct locational SSM encoders that seamlessly integrate users’ spatio-temporal characteristics. The sequence-aware outputs of GeoMamba are further processed to generate multi-scale behavior representations. Extensive experimental results illustrate the superiority of GeoMamba over several state-of-the-art baselines.

Abstract: The representation learning of time series has a wide range of downstream tasks and applications in many practical scenarios. However, due to the complexity, spatiotemporality, and continuity of sequential stream data, compared with the representation learning of structural data such as images/videos, the time series selfsupervised representation learning is even more challenging. Besides, the direct application of existing contrastive learning and masked autoencoder based approaches to time series representation learning encounters inherent theoretical limitations, such as ineffective augmentation and masking strategies. To this end, we propose a Language Pre-training guided Masking Representation Learning (LPMRL) for times series classification. Specifically, we first propose a novel language pre-training guided masking encoder for adaptively sampling semantic spatiotemporal patches via natural language descriptions and improving the discriminability of latent representations. Furthermore, we present the dual-information contrastive learning mechanism to explore both local and global information by meticulously designing high-quality hard negative samples of time series data samples. As a result, we also design various experiments, such as visualization of masking position and distribution and reconstruction error to verify the reasonability of proposed language guided masking technique. Last, we evaluate the performance of proposed representation learning via classification task conducted on 106 time series datasets, which demonstrates the effectiveness of proposed method.

Abstract: Fusing side information in sessionbased recommendation is crucial for improving the performance of next-item prediction by providing additional context. Recent methods optimize attention weights by combining item and side information embeddings. However, semantic heterogeneity between item IDs and side information introduces computational noise in attention calculation, leading to inconsistencies in user interest modeling and reducing the accuracy of candidate item scores. These methods also often fail to leverage session-based re-interaction patterns, limiting improvements in score prediction during the decoding phase. To address these challenges, we propose ScoreNet, a consistency-driven framework with multi-side information fusion for session-based recommendation. ScoreNet explicitly models users' persistent preferences, generating consistent decoding scores for candidate items within a unified framework. It incorporates a multi-path re-engagement network to capture re-interaction behavior patterns in a semantic-agnostic manner, enhancing side information fusion while avoiding semantic interference. Additionally, a position-enhanced consistent scoring network redistributes attention scores within sessions, improving prediction accuracy, especially for items with limited interactions. Extensive experiments on three real-world datasets demonstrate that ScoreNet outperforms state-of-the-art models.

Abstract: Question answering on freeform tables (a.k.a. TableQA) is a challenging task because of the flexible structure and complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data, which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges brought by numerical values which are common in tabular data, and LLMs are known to struggle with such values. We aim to address this issue, and we propose a model named TabLaP that uses LLMs as a planner rather than an answer generator. This approach exploits LLMs' capability in multi-step reasoning while leaving the actual numerical calculations to a Python interpreter for accurate calculation. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret-aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.

Abstract: Recommender systems are increasingly prevalent to provide personalized suggestions and enhance user satisfaction. Typical recommendation models encode users and items as embeddings, and generate recommendations by assessing the similarity between these embeddings. Despite their effectiveness, these embeddingbased models struggle with modeling user uncertainty and capturing diverse user interests using a single fixed user embedding. Recent studies have begun to explore a user-distribution paradigm to learn distributions for users. However, this approach employs a single distribution per user, which fails to effectively delineate semantic boundaries, resulting in sub-optimal recommendations. To this end, we propose GCDR, a Guided Conditional Diffusion Recommender model, to learn multiple distributions for each user in this paper. Specifically, GCDR addresses two major challenges: 1) learning disentangled distributions, and 2) learning personalized distributions. GCDR captures inter-user and intra-user distribution properties through conditional and guided diffusion, respectively. It maintains user-specific embeddings to encode long-term interests for conditional diffusion, while for guided diffusion, it incorporates short-term interests encoded from recent interactions with category preferences. To align the diffusion model with the recommendation task, we train GCDR with three loss functions, included the user loss, the recommendation loss and the diffusion loss. Extensive experiments on four real-world datasets show that GCDR is able to learn effective user distributions and is superior to thirteen state-of-the-art baseline methods.

Abstract: Identifying and linking the same users across different social platforms is crucial for understanding user behavior and preferences. However, crossdomain datasets exhibit diverse characteristics, such as varying check-in frequencies, significant disparities in data precision, and distinct distributions. Existing trajectory representations rely on recurrent neural network, which fails to dynamically learn multi-dimensional feature relations and capture high-order associations. Furthermore, current methods for integrating trajectory information fails to capture the complex relations and dynamic variations among cross-domain mobility trajectories. To this end, we propose the Hierarchical Spatio-Temporal Enhanced Attention Hypergraph Network (StarNet). This model dynamically regulates the multi-dimensional features of trajectories through a locally enhanced spatiotemporal graph neural network. Meanwhile, StarNet employs a hypergraph network enhanced by a global spatiotemporal to capture high-order associations between cross-domain trajectories. The fusion enhancement association integrates local and global information, which enables this model to link user identities. Extensive experiments on two well-known LBSN cross-domain datasets reveal that StarNet outperforms state-of-the-art baselines in the accuracy of user identity linkage.

Abstract: Answering complex queries over incomplete knowledge graphs (KGs) is a challenging job. Most previous works have focused on learning entity/relation embeddings and simulating firstorder logic operators with various neural networks. However, they are bottlenecked by the inability to share world knowledge to improve logical reasoning, thus resulting in suboptimal performance. In this paper, we propose a complex reasoning schema over KG upon large language models (LLMs), containing a curriculum-based logical-aware instruction tuning framework, named LACT. Specifically, we augment the arbitrary first-order logical queries via binary tree decomposition, to stimulate the reasoning capability of LLMs. To address the difficulty gap among different types of complex queries, we design a simple and flexible logic-aware curriculum learning framework. Experiments across widely used datasets demonstrate that LACT has substantial improvements~(brings an average +5.5% MRR score) over advanced methods, achieving the new state-of-the-art.

Abstract: The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a oneclass classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD's superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues.

School of Software, Nanjing University of Information Science and Technology, China Yunnan Key Laboratory of Service Computing, Yunan University of Finance and Economics, China Jiangsu Province Engineering Research Center of Advanced Computing and Intelligent Services, China Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, China, School of Software, Nanjing University of Information Science and Technology, China, School of Software, Nanjing University of Information Science and Technology, China Jiangsu Province Engineering Research Center of Advanced Computing and Intelligent Services, China, College of Meteorology and Oceanography, National University of Defense Technology, China, School of Computing, Macquarie University, Australia, College of Computer Science and Technology, China University of Petroleum (East China), China, State Key Laboratory for Novel Software Technology, Nanjing University, China

Abstract: Graph Neural Networks (GNNs) are widely applied on graphlevel tasks, such as node classification, link prediction and graph generation. Existing GNNs mostly adopt a message-passing mechanism to aggregate node information with their neighbors, which often makes node information similar after rounds of aggregations and leads to oversmoothing. Although recent works have made improvements by combining different message aggregation methods or introducing semantic encodings as priors, these message-passing based GNNs still fail to combat oversmoothing after multiple iterations of node aggregation. Besides, the feature extraction ability of these methods is restricted because of the graph sparsity that hinders the aggregation of node information. To deal with the above two issues, we propose Neighborhood-based and Label-enhanced Graph Transformer (NLGT), a novel and effective framework for graph learning. Specifically, we present a label-enhanced feature fusion mechanism that integrate the shallow node features and label embeddings as enhanced features. Moreover, we design a neighborhood-based mask attention mechanism to alleviate the negative effects caused by the sparsity of the graph. In the predicting stage, we aggregate the prediction results from multiple sampled sub-graphs and apply voting mechanisms to enhance the accuracy and robustness of our framework. Finally, extensive experiments are conducted on four open benchmark datasets, which demonstrate the effectiveness and robustness of our proposed framework compared with existing state-of-the-art methods.

Abstract: Unsupervised anomaly detection has emerged as a powerful technique for identifying abnormal patterns in images without relying on prelabeled defective samples. Many unsupervised methods use pre-trained feature extractors from large datasets, with knowledge distillation between teacher and student models being a leading technique. However, due to the similar structures of teacher and student, these methods face challenges like excessive specialization and inadequate generalization, reducing detection performance. In this paper, we introduce a Co-Progression Knowledge Distillation (CPKD) framework, enabling bidirectional learning between teacher and student models. This innovative framework enables concurrent evolution of both models, fostering mutual improvement and enhanced adaptability. To maintain system stability and prevent overspecialization, we introduce a knowledge prototype as a regulatory mechanism for the teacher's learning process. Our method effectively addresses key challenges in anomaly detection, including insufficient learning and overadaptation, by striking a balance between acquiring new knowledge and preserving core competencies. We demonstrate significant improvements in detection accuracy, achieving SOTA performance on the MVTec dataset.

Lab of Artificial Intelligence for Education, East China Normal University, Shanghai, China Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai, China School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China, Shanghai Changning Mental Health Center, Shanghai, China, Lab of Artificial Intelligence for Education, East China Normal University, Shanghai, China Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai, China School of Computer Science and Technology, East China Normal University, Shanghai, China

Abstract: The group recommendation (GR) aims to suggest items for a group of users in social networks. Existing work typically considers individual preferences as the sole factor in aggregating group preferences. Actually, social influence is also an important factor in modeling users' contributions to the final group decision. However, existing methods either neglect the social influence of individual members or bundle preferences and social influence together as a unified representation. As a result, these models emphasize the preferences of the majority within the group rather than the actual interaction items, which we refer to as the preference bias issue in GR. Moreover, the selfsupervised learning (SSL) strategies they designed to address the issue of group data sparsity fail to account for users' contextual social weights when regulating group representations, leading to suboptimal results. To tackle these issues, we propose a novel model based on Disentangled Modeling of Preferences and Social Influence for Group Recommendation (DisRec). Concretely, we first design a user-level disentangling network to disentangle the preferences and social influence of group members with separate embedding propagation schemes based on (hyper)graph convolution networks. We then introduce a social-based contrastive learning strategy, selectively excluding user nodes based on their social importance to enhance group representations and alleviate the group-level data sparsity issue. The experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two real-world datasets.

Abstract: Outof-distribution (OOD) detection, determining whether a given sample is part of the in-distribution (ID) or not, has been newly explored by a generative model-based outlier synthesizing approach, especially with diffusion models. Nonetheless, existing diffusion models often produce outliers that are considerably distant from the ID in pixel-space, showing limited efficacy for capturing subtle distinctions between ID and OOD. To address these issues, we propose a novel framework, Semantic Outlier generation via Nuisance Awareness (SONA), which directly utilizes informative pixel-space ID images in diffusion models. Thereby, the generated outliers achieve two crucial properties: (i) they closely resemble the ID mainly in nuisances, while (ii) represent discriminative semantic information. To facilitate the separate effect on semantics and nuisances, we introduce SONA guidance, providing region-specific guidance. Extensive experiments demonstrate the effectiveness of our framework, achieving an impressive AUROC of 87% on near-OOD datasets, which surpasses the performance of baseline methods by a significant margin of approximately 6%.

Abstract: Credit risk assessment has increasingly become a prominent research field due to the dramatically increased incidents of financial default. Traditional graphbased methods have been developed to detect defaulters within user-merchant commercial payment networks. However, these methods face challenges in detecting complex risks, primarily due to their neglect of user-to-user fund transfer interactions and the under-utilization of temporal information. In this paper, we propose a novel framework named Dynamic Graph Neural Network with Static Relations (DGNN-SR) for credit risk assessment, which can encode the dynamic transaction graph and the static fund transfer graph simultaneously. To fully harness the temporal information, DGNN-SR employs a multi-view time encoder to explore the semantics of both relative and absolute time. To enhance the dynamic representations with static relations, we devise an adaptive re-weighting strategy to incorporate the static relations into the dynamic representations of time encoder, which extracts more discriminative features for risk assessment. Extensive experiments on two real-world business datasets demonstrate that our proposed method achieves a 0.85% - 2.5% improvement over existing SOTA methods.

College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, Australian Artificial Intelligence Institute, FEIT, University of Technology Sydney, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, Institute for AI Industry Research, Tsinghua University, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China

Abstract: Multifaceted user modeling aims to uncover finegrained patterns and learn representations from user data, revealing their diverse interests and characteristics, such as profile, preference, and personality. Recent studies on foundation model-based recommendation have emphasized the Transformer architecture's remarkable ability to capture complex, non-linear user-item interaction relationships. This paper aims to advance foundation model-based recommendersystems by introducing enhancements to multifaceted user modeling capabilities. We propose a novel Transformer layer designed specifically for recommendation, using the self-attention mechanism to capture sequential user-item interaction patterns. Specifically, we design a group gating network to identify user groups, enabling hierarchical discovery across different layers, thereby capturing the multifaceted nature of user interests through multiple Transformer layers. Furthermore, to broaden the data scope and further enhance multifaceted user modeling, we extend the framework to a federated setting, enabling the use of private datasets while ensuring privacy. Experimental validations on benchmark datasets demonstrate the superior performance of our proposed method.

Abstract: Massive open online courses (MOOCs) recommendation provides online courses tailored to learners' individual preferences. Existing literature is limited by: 1) Ignoring the interrelations among courses, knowledge concepts, and videos, which leads to suboptimal recommendation performance; 2) Neglecting the hierarchical interactions between learners and components like courses, knowledge concepts, and videos, which makes it difficult to capture learners' intentions accurately. To address them, we propose a novel multitype MOOCs recommendation framework, which enables multi-type educational content recommendations. This framework includes two important components: multi-relational representation and hierarchical reasoning. Regarding multi-relational representation, we first create two static course-relational and knowledge concept-relational graphs based on domain knowledge and construct a dynamic video-relational graph using learners' browsing historical sequences. Then, we capture the interactions among different components by learning the corresponding embeddings via graph neural networks. Regarding hierarchical reasoning, we implement a hierarchical beam search strategy to narrow down the candidate courses, knowledge concepts, and videos by calculating joint probability. Finally, we introduce an optional layer to increase the diversity and reasonableness of video recommendations by estimating learners' intentions. Extensive experiments are conducted to show the effectiveness, robustness, and interpretability of our method.

Abstract: To preserve user privacy in recommender systems, federated recommendation (FR) based on federated learning (FL) emerges, keeping the personal data on the local client and updating a model collaboratively. Unlike FL, FR has a unique sparse aggregation mechanism, where the embedding of each item is updated by only partial clients, instead of full clients in a dense aggregation of general FL. Recently, as an essential principle of FL, model security has received increasing attention, especially for Byzantine attacks, where malicious clients can send arbitrary updates. The problem of exploring the Byzantine robustness of FR is particularly critical since in the domains applying FR, e.g., ecommerce, malicious clients can be injected easily by registering new accounts. However, existing Byzantine works neglect the unique sparse aggregation of FR, making them unsuitable for our problem. Thus, we make the first effort to investigate Byzantine attacks on FR from the perspective of sparse aggregation, which is non-trivial: it is not clear how to define Byzantine robustness under sparse aggregations and design Byzantine attacks under limited knowledge/capability. In this paper, we reformulate the Byzantine robustness under sparse aggregation by defining the aggregation for a single item as the smallest execution unit. Then we propose a family of effective attack strategies, named Spattack, which exploit the vulnerability in sparse aggregation and are categorized along the adversary's knowledge and capability. Extensive experimental results demonstrate that Spattack can effectively prevent convergence and even break down defenses under a few malicious clients, raising alarms for securing FR systems.

Abstract: Reinforcement learning (RL) algorithms can improve recommendation performance by capturing longterm user-system interaction. However, current RL-based recommendation tasks seldom consider the dynamism of the environment, and standard RL algorithms are ineffective in recommending items dynamically. In addressing these issues, we design a novel task termed dynamic recommendation, which takes the emergence of real-world recommendable items into consideration. Meanwhile, we propose Adaptive Q-Network (AdaQN) to tackle the dynamic recommendation task. Firstly, AdaQN predicts the value of different action characteristics, particularly during the testing phase, which can capture emerging new action characteristics. The above procedure helps AdaQN in effectively adapting to the dynamic action space. Secondly, AdaQN establishes a stable mapping that projects the discrete action space onto a continuous characteristic space. Finally, AdaQN employs a lightweight Q-network design, which mitigates the complexity of the optimization process. Extensive experiments demonstrate that our approach has achieved state-of-the-art performance in the dynamic recommendation task.

Abstract: Information diffusion prediction (IDP) is a pivotal task for understanding the dynamics of information propagation within social networks. Conventional models typically adhere to a fixed learningbased paradigm, where the trained prediction model remains static during the inference phase. This paradigm presupposes that the data is independent and identically distributed, an assumption that may not hold true due to the inherently open nature of social media and the uncertainty and variability in user behavior. In this paper, we address the novel problem of out-of-distribution (OOD) shifts within IDP tasks and propose a new test-time training-based model for multi-scale IDP tasks, named Ghidorah. Our approach focuses on adapting a subset of model parameters to accommodate the unique characteristics of test samples through self-supervised learning (SSL) tasks. Ghidorah comprises three components: the macroscopic prediction branch, the microscopic prediction branch, and the auxiliary SSL branch. The auxiliary SSL task employs a masked autoencoder-based loss to fine-tune the model for specific test samples prior to prediction. Furthermore, Ghidorah integrates invariant learning to capture robust representations while mitigating spurious correlations. To our knowledge, Ghidorah is the first work to introduce a test-time training framework specifically designed to address the critical yet often overlooked OOD challenges in IDP. Experimental results across several benchmark datasets validate the superiority of our approach.

Abstract: In contemporary ecommerce platforms, search result pages display two types of items: ad items and organic items. Ad items are determined through an advertising auction system, while organic items are selected by a recommendation system. These systems have distinct optimization objectives, creating the challenge of effectively merging these two components. Recent research has explored merging mechanisms for e-commerce platforms, but none have simultaneously achieved all desirable properties: incentive compatibility, individual rationality, adaptability to multiple slots, integration of inseparable candidates, and avoidance of repeated exposure for ads and organic items. This paper addresses the design of a merging mechanism that satisfies all these properties. We first provide the necessary conditions for the optimal merging mechanisms. Next, we introduce two simple and effective mechanisms, termed the generalized fix mechanism and the generalized change mechanism. Finally, we theoretically prove that both mechanisms offer guaranteed approximation ratios compared to the optimal mechanism in both simplest and general settings.

Abstract: Motivated by applications such as online labor markets we consider a variant of the stochastic multiarmed bandit problem where we have a collection of arms representing strategic agents with different performance characteristics. The platform (principal) chooses an agent in each round to complete a task. Unlike the standard setting, when an arm is pulled it can modify its reward by absorbing it or improving it at the expense of a higher cost. The principle has to solve a mechanism design problem to incentivize the arms to give their best performance. However, since even with an effective mechanism agents may still deviate from rational behavior, the principal wants a robust algorithm that also gives a non-vacuous guarantee on the total accumulated rewards under non-equilibrium behavior. In this paper, we introduce a class of bandit algorithms that meet the two objectives of performance incentivization and robustness simultaneously. We do this by identifying a collection of intuitive properties that a bandit algorithm has to satisfy to achieve these objectives. Finally, we show that settings where the principal has no information about the arms' performance characteristics can be handled by combining ideas from second price auctions with our algorithms.

Abstract: We study the problem of determining an envyfree allocation of indivisible goods among multiple agents with additive valuations. EFX, which stands for envy-freeness up to any good, is a well-studied relaxation of the envy-free allocation problem and has been shown to exist for specific scenarios. EFX is known to exist for three agents, and for any number of agents when there are only two types of valuations. EFX allocations are also known to exist for four agents with at most one good unallocated. In this paper, we show that EFX exists with at most k-2 goods unallocated for any number of agents having k distinct valuations. Additionally, we show that complete EFX allocations exist when all but two agents have identical valuations.

Abstract: We study online fair division when there are a finite number of item types and the player values for the items are drawn randomly from distributions with unknown means. In this setting, a sequence of indivisible items arrives according to a random online process, and each item must be allocated to a single player. The goal is to maximize expected social welfare while maintaining that the allocation satisfies proportionality in expectation. When player values are normalized, we show that it is possible to with high probability guarantee proportionality constraint satisfaction and achieve O(√T) regret. To achieve this result, we present an upper confidence bound (UCB) algorithm that uses two rounds of linear optimization. This algorithm highlights fundamental aspects of proportionality constraints that allow for a UCB algorithm despite the presence of many (potentially tight) constraints. This result improves upon the previous best regret rate of O(T^(2/3)).

Abstract: Collaborative machine learning (CML) provides a promising paradigm for democratizing advanced technologies by enabling costsharing among participants. However, the potential for rent-seeking behaviors among parties can undermine such collaborations. Contract theory presents a viable solution by rewarding participants with models of varying accuracy based on their contributions. However, unlike monetary compensation, using models as rewards introduces unique challenges, particularly due to the stochastic nature of these rewards when contribution costs are privately held information. This paper formalizes the optimal contracting problem within CML and proposes a transformation that simplifies the non-convex optimization problem into one that can be solved through convex optimization algorithms. We conduct a detailed analysis of the properties that an optimal contract must satisfy when models serve as the rewards, and we explore the potential benefits and welfare implications of these contract-driven CML schemes through numerical experiments.

Abstract: Decoding natural visual scenes from brain activity has flourished, with extensive research in singlesubject tasks and, however, less in cross-subject tasks. Reconstructing high-quality images in cross-subject tasks is a challenging problem due to profound individual differences between subjects and the scarcity of data annotation. In this work, we proposed MindTuner for cross-subject visual decoding, which achieves high-quality and rich semantic reconstructions using only 1 hour of fMRI training data benefiting from the phenomena of visual fingerprint in the human visual system and a novel fMRI-to-text alignment paradigm. Firstly, we pre-train a multi-subject model among 7 subjects and fine-tune it with scarce data on new subjects, where LoRAs with Skip-LoRAs are utilized to learn the visual fingerprint. Then, we take the image modality as the intermediate pivot modality to achieve fMRI-to-text alignment, which achieves impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. The results of both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether using training data of 1 hour or 40 hours.

Abstract: Dependencyaware spatial crowdsourcing (DASC) addresses the unique challenges posed by subtask dependencies in spatial task assignment. This paper investigates the task assignment problem in DASC and proposes a two-stage Recommend and Match Optimization (RMO) framework, leveraging multi-agent reinforcement learning for subtask recommendation and a multi-dimensional utility function for subtask matching. The RMO framework primarily addresses two key challenges: credit assignment for subtasks with interdependencies and maintaining overall coherence between subtask recommendation and matching. Specifically, we employ meta-gradients to construct auxiliary policies and establish a gradient connection between two stages, which can effectively address credit assignment and joint optimization of subtask recommendation and matching, while concurrently accelerating network training. We further establish a unified gradient descent process through gradient synchronization across recommendation networks, auxiliary policies, and the matching utility evaluation function. Experiments on two real-world datasets validate the effectiveness and feasibility of our proposed approach.

Abstract: As AI chatbots increasingly incorporate empathy, understanding usercentered perceptions of chatbot empathy and its impact on conversation quality remains essential yet under-explored. This study examines how chatbot identity and perceived empathy influence users' overall conversation experience. Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. Empathy ratings from GPT-4o annotations aligned with user ratings, reinforcing the perception of lower empathy in chatbots compared to humans. Our findings underscore the critical role of perceived empathy in shaping conversation quality, revealing that achieving high-quality human-AI interactions requires more than simply embedding empathetic language; it necessitates addressing the nuanced ways users interpret and experience empathy in conversations with chatbots.

Abstract: In this paper, we offer a learning framework in which the agent's knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a lowresource visual processing scenario, in which the agent must learn to recognize distinct types of toy truck. The agent starts the learning process with no ontology about what types of truck exist nor which parts they have, and a deficient model for recognizing those parts from visual input. The teacher's feedback to the agent's explanations addresses its lack of relevant knowledge in the ontology via a generic rule (e.g., "dump trucks have dumpers"), whereas an inaccurate part recognition is corrected by a deictic statement (e.g., "this is not a dumper"). The learner utilizes this feedback not only to improve its estimate of the hypothesis space of possible domain ontologies and probability distributions over them but also to use those estimates to update its visual interpretation of the scene. Our experiments demonstrate that teacher-learner pairs utilizing explanations and corrections are more data-efficient than those without such a faculty.

The College of Computer Science and Technology, Zhejiang University, China MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University, China Affiliated Mental Health Center \& Hangzhou Seventh People's Hospital, Zhejiang University, China, MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University, China Affiliated Mental Health Center \& Hangzhou Seventh People's Hospital, Zhejiang University, China State Key Lab of Brain-Machine Intelligence, Zhejiang University, China The College of Computer Science and Technology, Zhejiang University, China, The College of Computer Science and Technology, Zhejiang University, China

Abstract: Neural decoding, which transforms neural signals into motor commands, plays a key role in braincomputer interfaces (BCIs). Existing neural decoding approaches mainly rely on the assumption of independent noises, which could perform poorly in case the assumption is invalid. However, correlations in noises have been commonly observed in neural signals. Specifically, noise in different neural channels can be similar or highly related, which could degrade the performance of those neural decoders. To tackle this problem, we propose the DeCorrNet, which explicitly removes noise correlation in neural decoding. DeCorrNet could incorporate diverse neural decoders as an ensemble module to enhance the neural decoding performance. Experiments with benchmark BCI datasets demonstrated the superiority of DeCorrNet and achieved state-of-the-art results.

Abstract: With rapid advances in generative artificial intelligence, the textto-music synthesis task has emerged as a promising direction for music generation. Nevertheless, achieving precise control over multi-track generation remains an open challenge. While existing models excel in directly generating multi-track mix, their limitations become evident when it comes to composing individual tracks and integrating them in a controllable manner. This departure from the typical workflows of professional composers hinders the ability to refine details in specific tracks. To address this gap, we propose JEN-1 Composer, a unified framework designed to efficiently model marginal, conditional, and joint distributions over multi-track music using a single model. Building upon an audio latent diffusion model, JEN-1 Composer extends the versatility of multi-track music generation. We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks while ensuring the model's generalization ability and facilitating smooth transitions between different scenarios. During inference, users can iteratively generate and select music tracks, thus incrementally composing entire musical pieces in accordance with the Human-AI co-composition workflow. Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis, marking a significant advancement in interactive AI-assisted music creation.

Abstract: To enhance the processing of complex multimodal documents (e.g. e-books, long web pages, etc.), it is an efficient way for users to take digital screenshots of key parts and reorganize them into a new collage E-Note. Existing methods for assisting collage layout design primarily employ a semantic relevance-first strategy, with arranging related contents together. Though capable, it can not ensure the visual readability of screenshots and may conflict with human natural reading patterns. In this paper, we introduce CollageNoter for real-time collage layout design that adapts to various devices (e.g. laptop, tablet, phone, etc.), offering users with visually and cognitively well-organized screenshot-based E-Notes. Specifically, we construct a novel two-stage pipeline for collage design, including 1) readability-first layout generation and 2) cognitive-driven layout adjustment. In addition, to achieve real-time response and adaptive model training, we propose a cascade transformer-based layout generator named CollageFormer and a size-aware collage layout builder for automatic dataset construction. Extensive experimental results have confirmed the effectiveness of our CollageNoter.

Abstract: The strategic behavior of users is significantly influenced by their hidden information such as private valuations, risk preferences, and price sensitivities. Contextual behavioral model learning refers to learning the dependence of users' hidden information on their observable context information. While many existing studies use offline data to learn contextual behavioral models, we study how to design sequential experiments to collect the most informative user behavioral data for learning. We propose a basic inferencethen-design method. In each experimental period, it infers a probabilistic contextual behavioral model using historical experimental data, and then designs the new experiment to maximize the gain of information about the probabilistic model. We further improve the basic method in two aspects. First, we improve the inference step by specifying a more informative prior for learning the probabilistic contextual behavioral model. Second, we integrate the inference and design steps instead of conducting them separately. Our rigorous theoretic analysis reveals that the optimization objective of the inference step can be modified to account for the downstream experimental design step. Numerical experiments show that our methods lead to more effective experiments, i.e., the collected experimental data can help in learning a more accurate behavioral model.

School of Electronic Information & Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA, School of Electronic Information & Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, School of Electronic Information & Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, School of Electronic Information & Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China School of Artificial Intelligence & MoE Lab of AI, Shanghai Jiao Tong University, Shanghai, China

Abstract: Optimal control problems (OCPs) involve finding a control function for a dynamical system such that a cost functional is optimized. It is central to physical systems in both academia and industry. In this paper, we propose a novel instancesolution control operator perspective, which solves OCPs in a one-shot manner without direct dependence on the explicit expression of dynamics or iterative optimization processes. The control operator is implemented by a new neural operator architecture named Neural Adaptive Spectral Method (NASM), a generalization of classical spectral methods. We theoretically validate the perspective and architecture by presenting the approximation error bounds of NASM for the control operator. Experiments on synthetic environments and a real-world dataset verify the effectiveness and efficiency of our approach, including substantial speedup in running time, and high-quality in- and out-of-distribution generalization.

Abstract: Learning policies from highdimensional visual inputs, such as pixels and point clouds, is crucial in various applications. Visual reinforcement learning is a promising approach that directly trains policies from visual observations, although it faces challenges in sample efficiency and computational costs. This study conducts an empirical comparison of State-to-Visual DAgger — a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy — and Visual RL across a diverse set of tasks. We evaluate both methods across 16 tasks from three benchmarks, focusing on their asymptotic performance, sample efficiency, and computational costs. Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance. In contrast, its benefits in sample efficiency are less pronounced, although it often reduces the overall wall-clock time required for training. Based on our findings, we provide recommendations for practitioners and hope that our results contribute valuable perspectives for future research in visual policy learning.

Abstract: Human cognition can leverage fundamental conceptual knowledge, like geometry and kinematic ones, to appropriately perceive, comprehend and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability through performing at the conceptual level, in order to understand and then interact with articulated objects, especially for those in novel categories, which is challenging due to the intricate geometric structures and diverse joint types of articulated objects. To achieve this goal, we propose Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet driven by AOTs is designed accordingly to equip intelligent agents with these generalized concepts, and then empower the agents to effectively discover the conceptual knowledge on the structure and affordance of articulated objects. The AOTdriven approach yields benefits in three key perspectives: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct exhaustive experiments and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.

Abstract: Enabling humanoid robots to perform longhorizon mobile manipulation planning in real-world environments based on embodied perception and comprehension abilities has been a longstanding challenge. With the recent rise of large language models (LLMs), there has been a notable increase in the development of LLM-based planners. These approaches either utilize human-provided textual representations of the real world or heavily depend on prompt engineering to extract such representations, lacking the capability to quantitatively understand the environment, such as determining the feasibility of manipulating objects. To address these limitations, we present the Instruction-Augmented Long-Horizon Planning (IALP) system, a novel framework that employs LLMs to generate feasible and optimal actions based on real-time sensor feedback, including grounded knowledge of the environment, in a closed-loop interaction. Distinct from prior works, our approach augments user instructions into PDDL problems by leveraging both the abstract reasoning capabilities of LLMs and grounding mechanisms. By conducting various real-world long-horizon tasks, each consisting of seven distinct manipulatory skills, our results demonstrate that the IALP system can efficiently solve these tasks with an average success rate exceeding 80%. Our proposed method can operate as a high-level planner, equipping robots with substantial autonomy in unstructured environments through the utilization of multi-modal sensor inputs.

Abstract: This paper presents fast and exact methods for computing the probability of an argument’s acceptance using Dung’s semantics in the Constellation paradigm of Argumentation. For (directed) SinglyConnected Graphs (SCGs), the problem can now be solved in linearithmic time instead of being exponential in the number of attacks, as reported in the literature. Moreover, in the more general case of Directed Acyclic Graphs (DAGs), we provide an algorithm whose time complexity is linearithmic in the product of the out-degree of dependent arguments, i.e., arguments reaching the argument considered for acceptance through multiple paths in the graph. We theoretically show that this complexity is lower than the lower bound of the (exact) Constellation method, which is also supported by empirical results. Our approach to DAGs is also compared with the (approximate) Monte-Carlo method, which is stopped when exact results are obtained. Within this time constraint, Monte-Carlo still outputs significant errors, underlying the fast computation of our approach.

Abstract: The profusion of knowledge encoded in large language models (LLMs) and their ability to apply this knowledge zeroshot in a range of settings makes them promising candidates for use in decision-making. However, they are currently limited by their inability to provide outputs which can be faithfully explained and effectively contested to correct mistakes. In this paper, we attempt to reconcile these strengths and weaknesses by introducing argumentative LLMs (ArgLLMs), a method for augmenting LLMs with argumentative reasoning. Concretely, ArgLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by ArgLLMs may be explained and contested. We evaluate ArgLLMs’ performance experimentally in comparison with state-of-the-art techniques, in the context of the decision-making task of claim verification. We also define novel properties to characterise contestability and assess ArgLLMs formally in terms of these properties.

Abstract: The connection between inconsistent databases and Dung’s abstract argumentation framework has recently drawn growing interest. Specifically, an inconsistent database, involving certain types of integrity constraints such as functional and inclusion dependencies, can be viewed as an argumentation framework in Dung’s setting. Nevertheless, no prior work has explored the exact expressive power of Dung’s theory of argumentation when compared to inconsistent databases and integrity constraints. In this paper, we close this gap by arguing that an argumentation framework can also be viewed as an inconsistent database. We first establish a connection between subsetrepairs for databases and extensions for AFs considering conflict-free, naive, admissible, and preferred semantics. Further, we define a new family of attribute-based repairs based on the principle of maximal content preservation. The effectiveness of these repairs is then highlighted by connecting them to stable, semi-stable, and stage semantics. Our main contributions include translating an argumentation framework into a database together with integrity constraints. Moreover, this translation can be achieved in polynomial time, which is essential in transferring complexity results between the two formalisms.

College of Computer Science and Technology, Harbin Engineering University, Heilongjiang, China, College of Computer Science and Technology, Harbin Engineering University, Heilongjiang, China, College of Computer Science and Technology, Harbin Engineering University, Heilongjiang, China, College of Computer Science and Technology, Harbin Engineering University, Heilongjiang, China, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China, College of Computer Science and Technology, Harbin Engineering University, Heilongjiang, China

Abstract: Smart contracts, closely intertwined with cryptocurrency transactions, have sparked widespread concerns about considerable financial losses of security issues. To counteract this, a variety of tools have been developed to identify vulnerability in smart contract. However, they fail to overcome two challenges at the same time when faced with smart contract bytecode: (i) strong interference caused by enormous nonrelevant instructions; (ii) missing semantics of bytecode due to incomplete data and control flow dependencies. In this paper, we propose a multi-teacher based bytecode vulnerability detection method, namely Multi-Teacher Vulnerability Hunter (MTVHunter), which delivers effective denoising and missing semantic to bytecode under multi-teacher guidance. Specifically, we first propose an instruction denoising teacher to eliminate noise interference by abstract vulnerability pattern and further reflect in contract embeddings. Secondly, we design a novel semantic complementary teacher with neuron distillation, which effectively extracts necessary semantic from source code to replenish the bytecode. Particularly, the proposed neuron distillation accelerate this semantic filling by turning the knowledge transition into a regression task. We conduct experiments on 229,178 real-world smart contracts that concerns four types of common vulnerabilities. Extensive experiments show MTVHunter achieves significantly performance gains over state-of-the-art approaches.

Abstract: Prompt optimization automatically refines prompting expressions, unlocking the full potential of LLMs in downstream tasks. However, current prompt optimization methods are costly to train and lack sufficient interpretability. This paper proposes enhancing LLMs' reasoning performance by eliciting their causal inference ability from prompting instructions to correct answers. Specifically, we introduce the SelfCausal Instruction Enhancement (SCIE) method, which enables LLMs to generate high-quality, low-quantity observational data, then estimates the causal effect based on these data, and ultimately generates instructions with the optimized causal effect. In SCIE, the instructions are treated as the treatment, and textual features are used to process natural language, establishing causal relationships through treatments between instructions and downstream tasks. Additionally, we propose applying Object-Relational (OR) principles, where the uncovered causal relationships are treated as the inheritable class across task objects, ensuring low-cost reusability. Extensive experiments demonstrate that our method effectively generates instructions that enhance reasoning performance with reduced training cost of prompts, leveraging interpretable textual features to provide actionable insights.

Abstract: With the pervasive deployment of Machine Learning (ML) models in realworld applications, verifying and auditing properties of ML models have become a central concern. In this work, we focus on three properties: robustness, individual fairness, and group fairness. We discuss two approaches for auditing ML model properties: estimation with and without reconstruction of the target model under audit. Though the first approach is studied in the literature, the second approach remains unexplored. For this purpose, we develop a new framework that quantifies different properties in terms of the Fourier coefficients of the ML model under audit but does not parametrically reconstruct it. We propose the Active Fourier Auditor (AFA), which queries sample points according to the Fourier coefficients of the ML model, and further estimates the properties. We derive high probability error bounds on AFA's estimates, along with the worst-case lower bounds on the sample complexity to audit them. Numerically we demonstrate on multiple datasets and models that AFA is more accurate and sample-efficient to estimate the properties of interest than the baselines.

Abstract: We present theory of synaptic neural balance and we show experimentally that synaptic neural balance can improve deep learning speed, and accuracy, even in datascarce environments. Given an additive cost function (regularizer) of the synaptic weights, a neuron is said to be in balance if the total cost of its incoming weights is equal to the total cost of its outgoing weights. For large classes of networks, activation functions, and regularizers, neurons can be balanced fully or partially using scaling operations that do not change their functionality. Furthermore, these balancing operations are associated with a strictly convex optimization problem with a single optimum and can be carried out in any order. In our simulations, we systematically observe that: (1) Fully balancing before training results in better performance as compared to several other training approaches; (2) Interleaving partial (layer-wise) balancing and stochastic gradient descent steps during training results in faster learning convergence and better overall accuracy (with L1 balancing converging faster than L2 balancing); and (3) When given limited training data, neural balanced models outperform plain or regularized models; and this is observed in both feedforward and recurrent networks. In short, the evidence supports that neural balancing operations could be added to the arsenal of methods used to regularize and train neural networks. Furthermore, balancing operations are entirely local and can be carried out asynchronously, making them plausible for biological or neuromorphic systems.

Abstract: Generalized Category Discovery is a significant and complex task that aims to identify both known and undefined novel categories from a set of unlabeled data, leveraging another labeled dataset containing only known categories. The primary challenges stem from model bias induced by pretraining on only known categories and the lack of precise supervision for novel ones, leading to category bias towards known categories and category confusion among different novel categories, which hinders models' ability to identify novel categories effectively. To address these challenges, we propose a novel framework named Self-Debiasing Calibration (SDC). Unlike prior methods that regard model bias towards known categories as an obstacle to novel category identification, SDC provides a novel insight into unleashing the potential of the bias to facilitate novel category learning. Specifically, we utilize the biased pre-trained model to guide the subsequent learning process on unlabeled data. The output of the biased model serves two key purposes. First, it provides an accurate modeling of category bias, which can be utilized to measure the degree of bias and debias the output of the current training model. Second, it offers valuable insights for distinguishing different novel categories by transferring knowledge between similar categories. Based on these insights, SDC dynamically adjusts the output logits of the current training model using the output of the biased model. This approach produces less biased logits to effectively address the issue of category bias towards known categories, and generates more accurate pseudo labels for unlabeled data, thereby mitigating category confusion for novel categories. Experiments on three benchmark datasets show that SDC outperforms SOTA methods, especially in the identification of novel categories.

Abstract: Although distributed machine learning (distributed ML) is gaining considerable attention in the community, prior works have independently looked at instances of distributed ML in either the training or the inference phase. No prior work has examined the combined robustness stemming from distributing both the learning and the inference process. In this work, we explore, for the first time, the robustness of distributed ML models that are fully heterogeneous in training data, architecture, scheduler, optimizer, and other model parameters. Supported by theory and extensive experimental validation using CIFAR10 and FashionMNIST, we show that such properly distributed ML instantiations achieve acrossthe-board improvements in accuracy-robustness tradeoffs against state-of-the-art transfer-based attacks that could otherwise not be realized by current ensemble or federated learning instantiations. For instance, our experiments on CIFAR10 show that for the Common Weakness attack, one of the most powerful state-of-the-art transfer-based attacks, our method improves robust accuracy by up to 40%, with a minimal impact on clean task accuracy.

Abstract: Evaluating deep reinforcement learning (DRL) agents against targeted behavior attacks is critical for assessing their robustness. These attacks aim to manipulate the victim into specific behaviors that align with the attacker’s objectives, often bypassing traditional rewardbased defenses. Prior methods have primarily focused on reducing cumulative rewards; however, rewards are typically too generic to capture complex safety requirements effectively. As a result, focusing solely on reward reduction can lead to suboptimal attack strategies, particularly in safety-critical scenarios where more precise behavior manipulation is needed. To address these challenges, we propose RAT, a method designed for universal, targeted behavior attacks. RAT trains an intention policy that is explicitly aligned with human preferences, serving as a precise behavioral target for the adversary. Concurrently, an adversary manipulates the victim's policy to follow this target behavior. To enhance the effectiveness of these attacks, RAT dynamically adjusts the state occupancy measure within the replay buffer, allowing for more controlled and effective behavior manipulation. Our empirical results on robotic simulation tasks demonstrate that RAT outperforms existing adversarial attack algorithms in inducing specific behaviors. Additionally, RAT shows promise in improving agent robustness, leading to more resilient policies. We further validate RAT by guiding Decision Transformer agents to adopt behaviors aligned with human preferences in various MuJoCo tasks, demonstrating its effectiveness across diverse tasks.

Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen, Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen, Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen, Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen, Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen, Department of Business and Economics, Berlin School of Economics and Law, Leibniz Institute for Prevention Research and Epidemiology – BIPS Faculty of Mathematics and Computer Science, University of Bremen Department of Public Health, University of Copenhagen

Abstract: This paper proposes a method for measuring conditional feature importance via generative modeling. In explainable artificial intelligence (XAI), conditional feature importance assesses the impact of a feature on a prediction model's performance given the information of other features. Modelagnostic post hoc methods to do so typically evaluate changes in the predictive performance under on-manifold feature value manipulations. Such procedures require creating feature values that respect conditional feature distributions, which can be challenging in practice. Recent advancements in generative modeling can facilitate this. For tabular data, which may consist of both categorical and continuous features, the adversarial random forest (ARF) stands out as a generative model that can generate on-manifold data points without requiring intensive tuning efforts or computational resources, making it a promising candidate model for subroutines in XAI methods. This paper proposes cARFi (conditional ARF feature importance), a method for measuring conditional feature importance through feature values sampled from ARF-estimated conditional distributions. cARFi requires only little tuning to yield robust importance scores that can flexibly adapt for conditional or marginal notions of feature importance, including straightforward extensions to condition on feature subsets and allows for inferring the significance of feature importances through statistical tests.

Abstract: Finetuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.

Abstract: Efficient anomaly detection of irregular sequences, especially those characterized by nonuniform sampling from discontinuous operations or unreliable sensors, presents challenges across various fields. In response, this paper introduces irregular-sequence classification in ''Ct-Echo Model Space''. A novel Continuous-time Echo Network (Ct-Echo) is proposed to fit irregular sequences, efficiently capturing their inherent dynamic characteristics. Ct-Echo utilizes the ''Echo'' mechanism, where history information influences the current state and diminishes over time, and employs Ordinary Differential Equation (ODE) to construct continuous-time transition of hidden states. Each sequence is individually fitted via Ct-Echo to derive a readout model. These fitted models, capturing the dynamic characteristics of the original data, serve as representations of the corresponding sequences, thus mapping the original data from the data space to the Ct-Echo model space. Anomaly detection is further performed in this model space, evaluating differences between models rather than directly on the original sequences. Our method enhances real-time processing and lessens reliance on the amount of labeled training data, as demonstrated by experimental studies.

Abstract: Large models for textto-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user requirements, particularly when the objective is to generate music that embodies a specific concept derived from a designated reference collection. In this paper, we propose a novel method for customized text-to-music generation, which can capture the concept from a two-minute reference music and generate a new piece of music conforming to the concept. We achieve this by fine-tuning a pretrained text-to-music model using the reference music. However, directly fine-tuning all parameters leads to overfitting issues. To address this problem, we propose a Pivotal Parameters Tuning method that enables the model to assimilate the new concept while preserving its original generative capabilities. Additionally, we identify a potential concept conflict when introducing multiple concepts into the pretrained model. We present a concept enhancement strategy to distinguish multiple concepts, enabling the fine-tuned model to generate music incorporating either individual or multiple concepts simultaneously. We also introduce a new dataset and evaluation protocol for this task. Our proposed JEN1-DreamStyler outperforms several baselines in both qualitative and quantitative evaluations.

Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Singapore Management University, Singapore, Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Institute of Software Chinese Academy of Sciences, Beijing, China Science & Technology on Integrated Information System Laboratory, Beijing, China State Key Laboratory of Intelligent Game, Beijing, China University of Chinese Academy of Sciences, Beijing, China

Abstract: Explaining multiagent systems (MAS) is urgent as these systems become increasingly prevalent in various applications. Previous work has provided explanations for the actions or states of agents, yet falls short in understanding the blackboxed agent’s importance within a MAS and the overall team strategy. To bridge this gap, we propose EMAI, a novel agent-level explanation approach that evaluates the individual agent’s importance. Inspired by counterfactual reasoning, a larger change in reward caused by the randomized action of agent indicates its higher importance. We model it as a MARL problem to capture interactions across agents. Utilizing counterfactual reasoning, EMAI learns the masking agents to identify important agents. Specifically, we define the optimization function to minimize the reward difference before and after action randomization and introduce sparsity constraints to encourage the exploration of more action randomization of agents during training. The experimental results in seven multi-agent tasks demonstrate that EMAI achieves higher fidelity in explanations compared to baselines and provides more effective guidance in practical applications concerning understanding policies, launching attacks, and patching policies.

Abstract: Supervised multimodal classification has been proven to outperform unimodal classification in the imagetext domain. However, this task is highly dependent on abundant labeled data. To perform multimodal classification in data-insufficient scenarios, in this study, we explore semi-supervised multimodal classification (SSMC) that only requires a small amount of labeled data and plenty of unlabeled data. Specifically, we first design baseline SSMC models by combining known semi supervised pseudo-labeling methods with the two most commonly used modal fusion strategies, i.e. feature-level fusion and label-level aggregation. Based on our investigation and empirical study of the baselines, we discover two complementarities that may benefit SSMC if properly exploited: the predictions from different modalities (modal complementarity) and modal fusion strategies for pseudo-labeling (strategic complementarity). Therefore, we propose a Modal and Strategic Complementarity (MSC) framework for SSMC. Concretely, to exploit modal complementarity, we propose to learn reliability weights for the predictions from different modalities and refine the fusion scores. To learn from strategic complementarity, we introduce a dual KL divergence loss to guide the balance of quantity and quality of pseudo-labeled data selection. Extensive empirical studies demonstrate the effectiveness of the proposed framework.

Abstract: In the realm of classincremental learning (CIL), alleviating the catastrophic forgetting problem is a pivotal challenge. This paper discovers a counter-intuitive observation: by incorporating domain shift into CIL tasks, the forgetting rate is significantly reduced. Our comprehensive studies demonstrate that incorporating domain shift leads to a clearer separation in the feature distribution across tasks and helps reduce parameter interference during the learning process. Inspired by this observation, we propose a simple yet effective method named DisCo to deal with CIL tasks. DisCo introduces a lightweight prototype pool that utilizes contrastive learning to promote distinct feature distributions for the current task relative to previous ones, effectively mitigating interference across tasks. DisCo can be easily integrated into existing state-of-the-art class-incremental learning methods. Experimental results show that incorporating our method into various CIL methods achieves substantial performance improvements, validating the benefits of our approach in enhancing class-incremental learning by separating feature representation and reducing interference. These findings illustrate that DisCo can serve as a robust fashion for future research in class-incremental learning.

Abstract: Recently, deep MultiAgent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent's current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi-Agent Novelty-GuidEd sample Reuse (MANGER). This method increases sample efficiency while promoting exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super-hard StarCraft II micromanagement tasks.

Abstract: As a fundamental problem of graph analysis, graph visualization aims to embed a set of graphs in a lowdimensional (e.g., 2D) space and provide insights into their distribution and clustering structure. Focusing on this problem, we propose a novel Wasserstein t-distributed embedding (WatE) method, leading to an information-enriched graph visualization paradigm. Our method learns a graph neural network to represent each graph as the mean and covariance of its node embedding distribution. Accordingly, our method can visualize each graph as an ellipse (determined by the mean and the covariance) rather than a single point. The positions of different ellipses reveal the relations among different graphs as traditional visualization methods do, while the size and shape of an ellipse preserve the node-level structural information of the corresponding graph. We propose a regularized t-distributed stochastic neighbor embedding (Rt-SNE) framework to learn the visualization model, deriving a Wasserstein distance-based Student's t-distribution of graph pairs and fitting the distribution to the data distribution under regularization. Both subjective and objective evaluations demonstrate that WatE achieves encouraging performance in various graph visualization and clustering tasks.

Abstract: A few recent studies have shown the benefits of using centrally pretrained models to initialize federated learning (FL). However, existing methods do not generalize well when faced with an arbitrary set of downstream FL tasks. Specifically, they often (i) achieve limited accuracy, especially with unseen downstream labels, and (ii) result in significant accuracy variance, failing to provide a balanced performance across clients. To address these challenges, we propose CoPreFL, a collaborative/distributed pre-training approach that robustly initializes for downstream FL tasks. CoPreFL leverages model-agnostic meta-learning (MAML) that tailors the global model to mimic heterogeneous and unseen FL scenarios, resulting in a pre-trained model that is rapidly adaptable to any FL task. Our MAML procedure integrates performance variance into the meta-objective function, balancing performance across clients rather than solely optimizing for accuracy. Extensive experiments show that CoPreFL significantly enhances average accuracy and reduces variance in arbitrary downstream FL tasks with unseen/seen labels, outperforming various pre-training baselines. Additionally, CoPreFL proves compatible with different well-known FL algorithms used in downstream tasks, boosting performance in each case.

Abstract: The field of video generation has expanded significantly in recent years, with controllable and compositional video generation garnering considerable interest. Most methods rely on leveraging annotations such as text, objects' bounding boxes, and motion cues, which require substantial human effort and thus limit their scalability. In contrast, we address the challenge of controllable and compositional video generation without any annotations by introducing a novel unsupervised approach. Our model is trained from scratch on a dataset of unannotated videos. At inference time, it can compose plausible novel scenes and animate objects by placing object parts at the desired locations in space and time. The core innovation of our method lies in the unified control format and the training process, where video generation is conditioned on a randomly selected subset of pretrained self-supervised local features. This conditioning compels the model to learn how to inpaint the missing information in the video both spatially and temporally, thereby learning the inherent compositionality of a scene and the dynamics of moving objects. The abstraction level and the imposed invariance of the conditioning input to minor visual perturbations enable control over object motion by simply using the same features at all the desired future locations. We call our model CAGE, which stands for visual Composition and Animation for video GEneration. We conduct extensive experiments to validate the effectiveness of CAGE across various scenarios, demonstrating its capability to accurately follow the control and to generate high-quality videos that exhibit coherent scene composition and realistic animation.

Abstract: A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be suboptimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.

Abstract: Tiny machine learning (TinyML) has attracted heightened attention for its ability to provide lowcost and instantaneous performance on edge devices. Particularly, the commonly used microcontroller unit (MCU) imposes extreme constraints on peak memory (SRAM) and storage (Flash). Existing TinyML methods often rely on a customized and hard-to-obtain inference libraries, as well as necessitate a time-consuming search for a deployable architecture using advanced Neural Architecture Search (NAS) algorithms. To solve these problems, we fully exploit the resources on MCU and deduce hardware-oriented guidelines for designing models under extreme MCU constraints. In detail, we delve into thorough information about the atom operators by collecting the runtime data of Flash, SRAM, and latency to build a dataset named AtomDB. Based on AtomDB, several critical operator guidelines are established to fully utilize limited Flash and SRAM, while minimizing latency. By transferring the guidelines to analyze blocks, we propose a hybrid pattern that organizes appropriate blocks at different network stages to form the AtomNet, a more hardware-oriented architecture, to handle the former SRAM bottleneck and the latter Flash bottleneck. Extensive experiments demonstrate the effectiveness of the exploitation of the hardware characteristics. Remarkably, AtomNet pioneeringly achieve 3.5% accuracy enhancement and more than 15% latency reduction on 320KB MCU using readily available official inference libraries for ImageNet tasks, surpassing the current state-of-the-art method.

Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, ETHZ - ETH Zurich, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing Jiaotong University, Institute of Computing Technology, Chinese Academy of Sciences, ETHZ - ETH Zurich

Abstract: Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resourceconstrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, the existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low bit quantization. Too few activation candidates are unfriendly for outlier significant weight channel quantization, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques: (1) To mitigate the quantization error caused by outlier severe weight channels, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses Kurtosis to quantify outlier salient channels and apply optimized intra-layer mixed-precision bit-width allocation to recover accuracy performance within target efficiency. (2) To robustly learn representations crossing time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latent to a unified relation space to reduce the representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58% FID decrease under W2A4 setting compared with baseline, while all other methods even collapse.

Abstract: Transfer learning for biosignals has recently become an important technique to improve prediction performance on downstream tasks with small bio-signal datasets. Recent works have shown that pre-training a neural network model on a large dataset (e.g. EEG) with a self-supervised task, replacing the self-supervised head with a linear classification head, and fine-tuning the model on different downstream bio-signal datasets (e.g., EMG or ECG) can dramatically improve the performance on those datasets. In this paper, we propose a new convolution-transformer hybrid model architecture with masked auto-encoding for low-data bio-signal transfer learning, introduce a frequency-based masked auto-encoding task, employ a more comprehensive evaluation framework, and evaluate how much and when (multimodal) pre-training improves fine-tuning performance. We also introduce a dramatically more performant method of aligning a downstream dataset with a different temporal length and sampling rate to the original pre-training dataset. Our findings indicate that the convolution-only part of our hybrid model can achieve state-of-the-art performance on some low-data downstream tasks. The performance is often improved even further with our full model. In the case of transformer-based models we find that pre-training especially improves performance on downstream datasets, multimodal pre-training often increases those gains further, and our frequency-based pre-training performs the best on average for the lowest and highest data regimes.

Abstract: The deployment of machine learning models in operational contexts represents a significant investment for any organisation. Consequently, the risk of these models being misappropriated by competitors needs to be addressed. In recent years, numerous proposals have been put forth to detect instances of model stealing. However, these proposals operate under implicit and disparate data and model access assumptions; as a consequence, it remains unclear how they can be effectively compared to one another. Our evaluation shows that a simple baseline that we introduce performs on par with existing stateof-the-art fingerprints, which, on the other hand, are much more complex. To uncover the reasons behind this intriguing result, this paper introduces a systematic approach to both the creation of model fingerprinting schemes and their evaluation benchmarks. By dividing model fingerprinting into three core components – Query, Representation and Detection (QuRD) – we are able to identify ~100 previously unexplored QuRD combinations and gain insights into their performance. Finally, we introduce a set of metrics to compare and guide the creation of more representative model stealing detection benchmarks. Our approach reveals the need for more challenging benchmarks and a sound comparison with baselines. To foster the creation of new fingerprinting schemes and benchmarks, we open-source our fingerprinting toolbox.

Abstract: Recent advancements in graph representation learning have shifted attention towards dynamic graphs, which exhibit evolving topologies and features over time. The increased use of such graphs creates a paramount need for generative models suitable for applications such as data augmentation, obfuscation, and anomaly detection. However, there are few generative techniques that handle continuously changing temporal graph data; existing work largely relies on augmenting static graphs with additional temporal information to model dynamic interactions between nodes. In this work, we propose a fundamentally different approach: We instead directly model interactions as a joint probability of an edge forming between two nodes at a given time. This allows us to autoregressively generate new synthetic dynamic graphs in a largely assumption free, scalable, and inductive manner. We formalize this approach as DGGen, a generative framework for continuous time dynamic graphs, and demonstrate its effectiveness over five datasets. Our experiments demonstrate that DG-Gen not only generates higher fidelity graphs compared to traditional methods but also significantly advances link prediction tasks.

Abstract: This paper proposes a new principled multitask representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.

Abstract: While Large Language Models (LLMs) show promise for TextAttributed Graphs (TAGs) learning, their deployment is hindered by computational demands. Graph Neural Networks (GNNs) are efficient but struggle with TAGs' complex semantics. We propose LinguGKD, a novel LLM-to-GNN knowledge distillation framework that enables transferring both local semantic details and global structural information from LLMs to GNNs. First, it introduces TAG-oriented instruction tuning, enhancing LLMs with graph-specific knowledge through carefully designed prompts. Next, it develops a layer-adaptive multi-scale contrastive distillation strategy aligning LLM and GNN features at multiple granularities, from node-level to graph-level. Finally, the distilled GNNs combine the semantic richness of LLMs with the computational efficiency of traditional GNNs. Experiments demonstrate that LinguGKD outperforms existing graph distillation frameworks, the distilled simple GNNs achieve comparable or superior performance to more complex GNNs and teacher LLMs, while maintaining computational efficiency. This work bridges the gap between LLMs and GNNs, facilitating advanced graph learning in resource-constrained environments and providing a framework to leverage ongoing LLM advancements for GNN improvement.

School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China, School of Artifcial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, P.R. China Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China International Joint Laboratory on Artificial Intelligence, Ministry of Education, China

Abstract: Drug Target Interaction (DTI) prediction has witnessed promising performance boosts accompanied by advanced multimodal feature extraction. However, existing approaches suffer from two main difficulties. First, the complex protein structures cannot be well represented by current proteinsequence-based feature extractors. Second, the gap between protein and drug features increases the vulnerability of the obtained classifier thus degrading the prediction robustness. To address these issues, we propose a novel R-DTI method by exploring the second-order relevance in both protein structural feature extraction and DTI prediction phases. Specifically, we construct a pre-trained structural feature extractor that mines the atomic relevance of each amino acid. Then, an inter-feature structure-preserved Riemannian network is designed to expand the existing protein extraction patterns. To improve the prediction robustness, we also develop a Riemannian classifier that uses the second-order protein-drug relevance with a unified feature space. Extensive experimental results demonstrate the merits and superiority of our R-DTI against the state-of-the-art, achieving 1.4% and 1.9% higher AUC-ROC on the BindingDB and DrugBank datasets, respectively.

Abstract: Trajectory prediction models that can infer both future trajectories and their associated uncertainties of the target vehicles is crucial for safe and robust navigation and path planning of autonomous vehicles. However, the majority of existing trajectory prediction models have neither considered reducing the uncertainty as one objective during the training stage nor provided reliable uncertainty quantification during inference stage, especially under potential distribution shift. Therefore, in this paper, we propose the Conformal Uncertainty Quantification under Distribution Shift framework, CUQDS, to quantify the uncertainty of the predicted trajectories of existing trajectory prediction models under potential data distribution shift, while improving the prediction accuracy of the models and reducing the estimated uncertainty during the training stage. Specifically, CUQDS includes 1) a learningbased Gaussian process regression module that models the output distribution of the base model (any existing trajectory prediction neural networks) and reduces the estimated uncertainty by an additional loss term, and 2) a statistical-based Conformal P control module to calibrate the estimated uncertainty from the Gaussian process regression module in an online setting under potential distribution shift between training and testing data. Experimental results on various state-of-the-art methods using benchmark motion forecasting datasets demonstrate the effectiveness of our proposed design.

Abstract: Time series generation models are crucial for applications like data augmentation and privacy preservation. Most existing time series generation models are typically designed to generate data from one specified domain. While leveraging data from other domain for better generalization is proved to work in other application areas, this approach remains challenging for time series modeling due to the large divergence in patterns among different real world time series categories. In this paper, we propose a multidomain time series diffusion model with domain prompts, named TimeDP. In TimeDP, we utilize a time series semantic prototype module which defines time series prototypes to represent time series basis, each prototype vector serving as "word" representing some elementary time series feature. A prototype assignment module is applied to extract the extract domain specific prototype weights, for learning domain prompts as generation condition. During sampling, we extract ``domain prompt" with few-shot samples from the target domain and use the domain prompts as condition to generate time series samples. Experiments demonstrate that our method outperforms baselines to provide the state-of-the-art in-domain generation quality and strong unseen domain generation capability.

Abstract: Multisensor fusion systems (MSFs) play a vital role as the perception module in modern autonomous vehicles (AVs). Therefore, ensuring their robustness against common and realistic adversarial semantic transformations, such as rotation and shifting in the physical world, is crucial for the safety of AVs. While empirical evidence suggests that MSFs exhibit improved robustness compared to single-modal models, they are still vulnerable to adversarial semantic transformations. In addition, although many empirical defenses have been proposed, several works show that these defenses can be further attacked by new adaptive attacks. So far, there is no certified defense proposed for MSFs. In this work, we propose the first robustness certification framework COMMIT to certify the robustness of multi-sensor fusion systems against semantic attacks. In particular, we propose a practical anisotropic noise mechanism that leverages randomized smoothing on multi-modal data and performs a grid-based splitting method to characterize complex semantic transformations. We also propose efficient algorithms to compute the certification in terms of object detection accuracy and IoU for large-scale MSF models. Empirically, we evaluate the efficacy of COMMIT in different settings and provide a comprehensive benchmark of certified robustness for different MSF models using the CARLA simulation platform. We show that the certification for MSF models is at most 48.39% higher than that of single-modal models, which validates the advantages of MSF models. We believe our certification framework and benchmark will contribute an important step towards certifiably robust AVs in practice.

Abstract: Tabular data poses unique challenges due to its heterogeneous nature, combining both continuous and categorical variables. Existing approaches often struggle to effectively capture the underlying structure and relationships within such data. We propose GFTab (Geodesic Flow Kernels for SemiSupervised Learning on Mixed-Variable Tabular Dataset), a semi-supervised framework specifically designed for tabular datasets. GFTab incorporates three key innovations: 1) Variable-specific corruption methods tailored to the distinct properties of continuous and categorical variables, 2) A Geodesic flow kernel based similarity measure to capture geometric changes between corrupted inputs, and 3) Tree-based embedding to leverage hierarchical relationships from available labeled data. To rigorously evaluate GFTab, we curate a comprehensive set of 21 tabular datasets spanning various domains, sizes, and variable compositions. Our experimental results show that GFTab outperforms existing ML/DL models across many of these datasets, particularly in settings with limited labeled data.

Abstract: Missing values in multivariate time series data can harm machine learning performance and introduce bias. These gaps arise from sensor malfunctions, blackouts, and human error and are typically addressed by data imputation. Previous work has tackled the imputation of missing data in random, complete blackouts and forecasting scenarios. The current paper addresses a more general missing pattern, which we call "partial blackout," where a subset of features is missing for consecutive time steps. We introduce a twostage imputation process using self-attention and diffusion processes to model feature and temporal correlations. Notably, our model effectively handles missing data during training, enhancing adaptability and ensuring reliable imputation and performance, even with incomplete datasets. Our experiments on benchmark and two real-world time series datasets demonstrate that our model outperforms the state-of-the-art in partial blackout scenarios and shows better scalability.

Abstract: Despite the impressive progress of multimodal generative models, videoto-audio generation still suffers from limited performance and limits the flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Especially, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency.

National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, Centre for Frontier AI Research (CFAR) & Institute of High Performance Computing (IHPC), A*STAR, Singapore, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, Centre for Frontier AI Research (CFAR) & Institute for Infocomm Research (I2R), A*STAR, Singapore, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China

Abstract: The effective utilization of data through Deep Neural Networks (DNNs) has profoundly influenced various aspects of society. The growing demand for highquality, particularly personalized, data has spurred research efforts to prevent data leakage and protect privacy in recent years. Early privacy-preserving methods primarily relied on instance-wise modifications, such as erasing or obfuscating essential features for de-identification. However, this approach highlights an inherent trade-off: minimal modification offers insufficient privacy protection, while excessive modification significantly degrades task performance. In this paper, we propose a novel Recombining for Obfuscation (FRO) approach to address this trade-off. Unlike existing methods that generate one anonymized instance by perturbing the original data on a one-to-one basis, our FRO approach generates an anonymized instance by reassembling mixed ID-related features from multiple original data sources on a many-in-one basis. Instead of introducing additional noise for de-identification, our approach leverages the existing non-polluted features from other instances to anonymize data. Extensive experiments on identity identification tasks demonstrate that FRO outperforms previous state-of-the-art methods, not only in utility performance but also in visual anonymization.

Abstract: As partial samples are often absent in certain views, incomplete multiview clustering has become a challenging task. To tackle data with missing views, current methods either utilize the data similarity relations to recover missing samples or primarily consider the available information of existing samples, typically facing some inherent limitations. Firstly, traditional solutions cannot fully explore the potential information contained in missing samples due to their omission strategy, leading to sub-optimal graphs. Moreover, most methods mainly focus on data recovery from the view level, ignoring the differences among available/missing samples in various views. To this end, we propose a collaborative Similarity Fusion and Consistency Recovery (SFCR) method, which resolves the incomplete multi-view clustering problem by learning a unified similarity graph and recovering missing samples with consistent structures. Specifically, to learn a reliable graph compatible across views, a novel view-to-sample fusion model is designed to adaptively coalesce the view-wise similarities among available samples, not only preserving the complementarity and consistency among views but also properly balancing different samples. Furthermore, the missing samples are effectively recovered under the guidance of the fused similarity graph, so as to maintain the consistent structure of recovered data across views. In this way, the similarity learning and the missing data recovery benefit from each other in a collaborative reinforcement manner. Meanwhile, SFCR can directly obtain the final clustering labels without additional post-processing. Extensive experiments demonstrate the effectiveness and superiority of SFCR.

Abstract: With the rapid development of artificial intelligence (AI), especially in the medical field, the need for its explainability has grown. In medical image analysis, a high degree of transparency and model interpretability can help clinicians better understand and trust the decisionmaking process of AI models. In this study, we propose a Knowledge Distillation (KD) based approach that aims to enhance the transparency of the AI model in medical image analysis. The initial step is to use traditional CNN to obtain a teacher model and then use KD to simplify the CNN architecture, retain most of the features of the data set, and reduce the number of network layers. It also uses the feature map of the student model to perform hierarchical analysis to identify key features and decision-making processes. This leads to intuitive visual explanations. We selected three public medical data sets (brain tumor, eye disease, and Alzheimer's disease) to test our method. It shows that even when the number of layers is reduced, our model provides a remarkable result in the test set and reduces the time required for the interpretability analysis.

Abstract: Fractionalorder differential equations (FDEs) enhance traditional differential equations by extending the order of differential operators from integers to real numbers, offering greater flexibility in modeling complex dynamic systems with nonlocal characteristics. Recent progress at the intersection of FDEs and deep learning has catalyzed a new wave of innovative models, demonstrating the potential to address challenges such as graph representation learning. However, training neural FDEs has primarily relied on direct differentiation through forward-pass operations in FDE numerical solvers, leading to increased memory usage and computational complexity, particularly in large-scale applications. To address these challenges, we propose a scalable adjoint backpropagation method for training neural FDEs by solving an augmented FDE backward in time, which substantially reduces memory requirements. This approach provides a practical neural FDE toolbox and holds considerable promise for diverse applications. We demonstrate the effectiveness of our method in several tasks, achieving performance comparable to baseline models while significantly reducing computational overhead.

Abstract: Marked Temporal Point Process (MTPP) the de-facto sequence model for continuous-time event sequences -- historically employed for modeling human-generated action sequences, lack awareness of external stimuli. In this study, we propose a novel framework developed over Transformer Hawkes Process (THP) to incorporate external stimuli in a domain-agnostic manner. Furthermore, we integrate personalization into our framework by employing language model-based representations of user and event descriptions, which is essential for modeling human-generated action sequences. Towards evaluating the efficacy, we put together a comprehensive benchmark comprising 5 datasets (2 novel additions, and 3 repurposed from existing open datasets) harvested from several domains, spanning education, e-commerce, online payment, and discussion forum. On average, we achieve 9.35% gain in type-prediction accuracy and 7.38% reduction in time-prediction RMSE across all datasets over SOTA MTPP baselines. We demonstrate the superior performance of our proposed model through extensive ablations and showcasing its ability to capture complex combinations of external stimuli in a synthetic set up.

Abstract: Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the nonstationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts.

Abstract: Monitoring a large population of dynamic processes with limited resources presents a significant challenge across various industrial sectors. This is due to 1) the inherent disparity between the available monitoring resources and the extensive number of processes to be monitored and 2) the unpredictable and heterogeneous dynamics inherent in the progression of these processes. Online learning approaches, commonly referred to as bandit methods, have demonstrated notable potential in addressing this issue by dynamically allocating resources and effectively balancing the exploitation of highreward processes and the exploration of uncertain ones. However, most online learning algorithms are designed for 1) a centralized setting that requires data sharing across processes for accurate predictions or 2) a homogeneity assumption that estimates a single global model from decentralized data. To overcome these limitations and enable online learning in a heterogeneous population under a decentralized setting, we propose a federated collaborative online monitoring method. Our approach utilizes representation learning to capture the latent representative models within the population and introduces a novel federated collaborative UCB algorithm to estimate these models from sequentially observed decentralized data. This strategy facilitates informed monitoring of resource allocation. The efficacy of our method is demonstrated through theoretical analysis, simulation studies, and its application to decentralized cognitive degradation monitoring in Alzheimer’s disease.

Abstract: Importance sampling is a rare event simulation technique used in Monte Carlo simulations to bias the sampling distribution towards the rare event of interest. By assigning appropriate weights to sampled points, importance sampling allows for more efficient estimation of rare events or tails of distributions. However, importance sampling can fail when the proposal distribution does not effectively cover the target distribution. In this work, we propose a method for more efficient sampling by updating the proposal distribution in the latent space of a normalizing flow. Normalizing flows learn an invertible mapping from a target distribution to a simpler latent distribution. The latent space can be more easily explored during the search for a proposal distribution, and samples from the proposal distribution are recovered in the space of the target distribution via the invertible mapping. We empirically validate our methodology on simulated robotics applications such as autonomous racing and aircraft ground collision avoidance.

Abstract: In social networks, people influence each other through social links, which can be represented as propagation among nodes in graphs. Influence minimization (IMIN) is the problem of manipulating the structures of an input graph (e.g., removing edges) to reduce the propagation among nodes. IMIN can represent timecritical real-world applications, such as rumor blocking, but IMIN is theoretically difficult and computationally expensive. Moreover, the discrete nature of IMIN hinders the usage of powerful machine learning techniques, which requires differentiable computation. In this work, we propose DiffIM, a novel method for IMIN with two differentiable schemes for acceleration: (1) surrogate modeling for efficient influence estimation, which avoids time-consuming simulations (e.g., Monte Carlo), and (2) the continuous relaxation of decisions, which avoids the evaluation of individual discrete decisions (e.g., removing an edge). We further propose a third accelerating scheme, gradient-driven selection, that chooses edges instantly based on gradients without optimization (spec., gradient descent iterations) on each test instance. Through extensive experiments on real-world graphs, we show that each proposed scheme significantly improves speed with little (or even no) IMIN performance degradation. Our method is Pareto-optimal (i.e., no baseline is faster and more effective than it) and typically several orders of magnitude (spec., up to 15,160X) faster than the most effective baseline, while being more effective.

Abstract: Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic labellevel private data, such as the general appearance of a target person from all training images labeled on him. Beyond label-level privacy, in this paper we show sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments. Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attack. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.

Abstract: Bayesian Optimization (BO) is a sampleefficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.

Abstract: Multiview multi-label learning has become a research focus for describing objects with rich expressions and annotations. However, real-world data often contains numerous unlabeled instances, due to the high cost and technical limitations of manual labeling. This crucial problem involves three main challenges: i) How to extract advanced semantics from available views? ii) How to build a refined classification framework with limited labeled space? iii) How to provide more high-quality supervisory information? To address these problems, we propose a Semi-Supervised Multi-View Multi-Label Learning Method with View-Specific Transformer and Enhanced Pseudo-Label named SMVTEP. Specifically, Generative Adversarial Networks are employed to extract informative shared and specific representations and their consistency and distinctiveness are ensured through the adversarial mechanism and information theory based contrastive learning. Then we build specific classifiers for each extracted feature and apply instance-level manifold constraints to reduce bias across classifiers. Moreover, we design a transformer-style fusion approach that simultaneously captures the imbalance of expressive power among views, mapping effects on specific labels, and label dependencies by incorporating confidence scores and category semantics into the self-attention mechanism. Furthermore, after using Mixup for data augmentation, category-enhanced pseudo-labels are leveraged to improve the reliability of additional annotations by aligning the label distribution of unlabeled samples with the true distribution. Finally, extensive experimental results validate the effectiveness of SMVTEP against state-of-the-art methods.

Abstract: The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study the convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel couplingbased convergence diagnostic procedure, which monitors the distance of two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown to track the transition from transience stationarity theoretically. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.

Abstract: Diffusion models (DMs) have attracted attention in generative modeling due to their ability to produce highquality, diverse outputs by progressively adding noise to data and then denoising it. However, DMs are computationally intensive due to their iterative nature, requiring numerous forward passes and high-precision operations, making them less efficient for resource-constrained environments. Recent efforts to reduce these computational demands using quantization show promise by converting high-precision parameters to lower precision, but they face challenges unique to DMs, particularly in addressing cross-timestep error propagation in the iterative process. In this paper, we analyze cross-timestep error propagation in quantized DMs, revealing that previous methods focusing only on reducing noise estimation discrepancies are insufficient. Instead, we introduce Cross-Timestep Error Correction (CTEC), where the quantized model not only approximates the full-precision model but also corrects errors from the previous timestep. A distillation method is applied to learn this correction process effectively. We conduct extensive experiments on unconditional image generation with LSUN-Churches and LSUN-Bedrooms, as well as conditional image generation with ImageNet. Our findings demonstrate the effectiveness of our method in significantly reducing accumulated quantization errors across timesteps within the quantized diffusion process. This enhancement enables the generation of high-quality images, even when constrained by reduced bitwidths.

Abstract: Spiking neural networks (SNNs), inspired by the inherent spiking computation paradigm of the biological neural systems, have exhibited superior energy efficiency in 2D classification tasks over traditional artificial neural networks (ANNs). However, the regression potential of SNNs has not been well explored, especially in 3D point cloud processing. In this paper, we propose noiseinjected spiking graph convolutional networks to leverage the full regression potential of SNNs in 3D point cloud denoising. Specifically, we first emulate the noise-injected neuronal dynamics to build noise-injected spiking neurons. On this basis, we design noise-injected spiking graph convolution for promoting disturbance-aware spiking representation learning on 3D points. Starting from the spiking graph convolution, we build two SNN-based denoising networks. One is a purely spiking graph convolutional network, which achieves low accuracy loss compared with some ANN-based alternatives, while resulting in significantly reduced energy consumption on two benchmark datasets, PU-Net and PC-Net. The other is a hybrid architecture, which integrates some ANN-based learning operations and exhibits a high performance-efficiency trade-off with only a few time steps. Our work lights up SNN’s potential for 3D point cloud denoising, injecting new perspectives of exploring the deployment on neuromorphic chips while paving the way for developing energy-efficient 3D data acquisition devices.

Abstract: In recent years, Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks. However, scaling them to large graphs is challenging due to the high computational and storage costs of repeated feature propagation and nonlinear transformation during training. One commonly employed approach to address this challenge is model-simplification, which only executes the Propagation (P) once in the pre-processing, and Combine (C) these receptive fields in different ways and then feed them into a simple model for better performance. Despite their high predictive performance and scalability, these methods still face two limitations. First, existing approaches mainly focus on exploring different C methods from the model perspective, neglecting the crucial problem of performance degradation with increasing P depth from the data-centric perspective, known as the over-smoothing problem. Second, pre-processing overhead takes up most of the end-to-end processing time, especially for large-scale graphs. To address these limitations, we present random walk with noise masking (RMask), a plug-and-play module compatible with the existing model-simplification works. This module enables the exploration of deeper GNNs while preserving their scalability. Unlike the previous model-simplification works, we focus on continuous P and found that the noise existing inside each P is the cause of the over-smoothing issue, and use the efficient masking mechanism to eliminate them. Experimental results on six real-world datasets demonstrate that model-simplification works equipped with RMask yield superior performance compared to their original version and can make a good trade-off between accuracy and efficiency.

Abstract: Backward error analysis allows finding a modified loss function, which the parameter updates really follow under the influence of an optimization method. The additional loss terms included in this modified function is called implicit regularizer. In this paper, we attempt to find the implicit regularizer for various federated learning algorithms on nonIID data distribution, and explain why each method shows different convergence behavior. We first show that the implicit regularizer of FedAvg disperses the gradient of each client from the average gradient, thus increasing the gradient variance. We also empirically show that the implicit regularizer hampers its convergence. Similarly, we compute the implicit regularizers of FedSAM and SCAFFOLD, and explain why they converge better. While existing convergence analyses focus on pointing out the advantages of FedSAM and SCAFFOLD, our approach can explain their limitations in complex non-convex settings. In specific, we demonstrate that FedSAM can partially remove the bias in the first-order term of the implicit regularizer in FedAvg, whereas SCAFFOLD can fully eliminate the bias in the first-order term, but not in the second-order term. Consequently, the implicit regularizer can provide a useful insight on the convergence behavior of federated learning from a different theoretical perspective.

Abstract: Multivariate time series forecasting (MTSF) aims to learn temporal dynamics among variables to forecast future time series. Existing statistical and deep learningbased methods suffer from limited learnable parameters and small-scale training data. Recently, large language models (LLMs) combining time series with textual prompts have achieved promising performance in MTSF. However, we discovered that current LLM-based solutions fall short in learning disentangled embeddings. We introduce TimeCMA, an intuitive yet effective framework for MTSF via cross-modality alignment. Specifically, we present a dual-modality encoding with two branches: the time series encoding branch extracts disentangled yet weak time series embeddings, and the LLM-empowered encoding branch wraps the same time series with text as prompts to obtain entangled yet robust prompt embeddings. As a result, such a cross-modality alignment retrieves both disentangled and robust time series embeddings, ``the best of two worlds'', from the prompt embeddings based on time series and prompt modality similarities. As another key design, to reduce the computational costs from time series with their length textual prompts, we design an effective prompt to encourage the most essential temporal information to be encapsulated in the last token: only the last token is passed to downstream prediction. We further store the last token embeddings to accelerate inference speed. Extensive experiments on eight real datasets demonstrate that TimeCMA outperforms state-of-the-arts.

Abstract: We consider the conditional generation of 3D druglike molecules with explicit control over molecular properties such as drug-like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effectively binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation from scratch. Extensive experiments validate our model's effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.

Abstract: Conventional multisource domain few-shot adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Considering the native language-supervised advantage of CLIP and the plug-and-play nature of prompt to transfer CLIP efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. It belongs to a decentralized edge collaborative learning in the edge-side models that must maintain a low computational load. And only a limited amount of annotations in source domain data is provided, with most of the data being unannotated. Further, this paper proposes a vision-aware multimodal prompt tuning framework (VAMP) under the decentralized schema, where the vision-aware prompt guides the text domain-specific prompt to maintain semantic discriminability and perceive the domain information. The cross-modal semantic and domain distribution alignment losses optimize each edge-side model, while text classifier consistency and semantic diversity losses promote collaborative learning among edge-side models. Extensive experiments were conducted on OfficeHome and DomainNet datasets to demonstrate the effectiveness of the proposed VAMP in the UMFDA, which outperformed the previous prompt tuning methods.

Abstract: Deep learning (e.g., Transformer) has been widely and successfully used in multivariate time series forecasting (MTSF). Unlike existing methods that focus on training models from a single modal of time series input, large language models (LLMs) based MTSF methods with crossmodal text and time series input have recently shown great superiority, especially with limited temporal data. However, current LLM-based MTSF methods usually focus on adapting and fine-tuning LLMs, while neglecting the distribution discrepancy between textual and temporal input tokens, thus leading to sub-optimal performance. To address this issue, we propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF by reducing the distribution discrepancy between textual and temporal data, which mainly consists of the temporal target branch with temporal input and the textual source branch with aligned textual input. To reduce the distribution discrepancy, we develop the cross-modal match module to first align cross-modal input distributions. Additionally, to minimize the modality distribution gap in both feature and output spaces, feature regularization loss is developed to align the intermediate features between the two branches for better weight updates, while output consistency loss is introduced to allow the output representations of both branches to correspond effectively. Thanks to the modality alignment, CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks with low computational complexity, and exhibits favorable few-shot and zero-shot abilities similar to that in LLMs.

School of Computer Science and Technology, Tongji University, Shanghai, China., School of Computer Science and Technology, Tongji University, Shanghai, China. Shanghai Eye Diseases Prevention and Treatment Center, Shanghai Eye Hospital, Shanghai, China., School of Communication and Electronic Engineering, East China Normal University, Shanghai, China., School of Computer Science and Technology, Tongji University, Shanghai, China., School of Computer Science and Technology, Tongji University, Shanghai, China.

Abstract: Current selfsupervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose the Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency to enhance fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose the Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, University of Trento, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University

Abstract: Short text classification has gained significant attention in the information age due to its prevalence and realworld applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.

Abstract: Federated Learning (FL) is notorious for its vulnerability to Byzantine attacks. Most current Byzantine defenses share a common inductive bias: among all the gradients, the densely distributed ones are more likely to be honest. However, such a bias is a poison to Byzantine robustness due to a newly discovered phenomenon in this paper gradient skew. We discover that a group of densely distributed honest gradients skew away from the optimal gradient (the average of honest gradients) due to heterogeneous data. This gradient skew phenomenon allows Byzantine gradients to hide within the densely distributed skewed gradients. As a result, Byzantine defenses are confused into believing that Byzantine gradients are honest. Motivated by this observation, we propose a novel skew-aware attack called STRIKE: first, we search for the skewed gradients; then, we construct Byzantine gradients within the skewed gradients. Experiments on three benchmark datasets validate the effectiveness of our attack.

Abstract: Multisource domain adaptation (MSDA), which utilizes multiple source domains to align the distribution of a single target domain, is a popular and challenging setting in domain adaptation (DA). However, existing MSDA approaches are difficult to obtain sufficient target domain knowledge, which serve as the transfer object. Furthermore, the target distributions are confused in the real world, i.e., the model cannot obtain the domain labels of target domains. To tackle these problems, we consider a more realistic DA setting Multi-Source Blended-Target Domain Adaptation (MBDA) and propose an Invertible Projection and Conditional Alignment (IPCA) method. Specifically, to reduce the impact of the distribution discrepancy, we construct an invertible projection for the source and blended-target domains. Then, we adopt a projection consistency regularization to our model, which makes the model more robust on the domain-specific parts. In addition, because the labels of the blended-target domain are unseen, we introduce conditional discrepancy to obtain the domain-level discriminative information and guide the classifier to serve as the discriminator, which is suitable for MBDA settings. Extensive experiment results on the ImageCLEF-DA, Office-Home, and DomainNet datasets validate the effectiveness of our method.

Key Lab. of Intelligent Information Processing, Institute of Computing Tech., CAS School of Computer Science and Tech., University of Chinese Academy of Sciences, Key Lab. of Intelligent Information Processing, Institute of Computing Tech., CAS, School of Computer Science and Tech., University of Chinese Academy of Sciences, Tencent Corporate, School of Computer Science and Tech., University of Chinese Academy of Sciences Key Lab. of Intelligent Information Processing, Institute of Computing Tech., CAS BDKM, University of Chinese Academy of Sciences

Abstract: Realworld datasets often exhibit a long-tailed distribution, where vast majority of classes known as tail classes have only few samples. Traditional methods tend to overfit on these tail classes. Recently, a new approach called Imbalanced SAM (ImbSAM) is proposed to leverage the generalization benefits of Sharpness-Aware Minimization (SAM) for long-tailed distributions. The main strategy is to merely enhance the smoothness of the loss function for tail classes. However, we argue that improving generalization in long-tail scenarios requires a careful balance between head and tail classes. We show that neither SAM nor ImbSAM alone can fully achieve this balance. For SAM, we prove that although it enhances the model's generalization ability by escaping saddle point in the overall loss landscape, it does not effectively address this for tail-class losses. Conversely, while ImbSAM is more effective at avoiding saddle points in tail classes, the head classes are trained insufficiently, resulting in significant performance drops. Based on these insights, we propose Stage-wise Saddle Escaping SAM (SSE-SAM), which uses complementary strengths of ImbSAM and SAM in a phased approach. Initially, SSE-SAM follows the majority sample to avoid saddle points of the head-class loss. During the later phase, it focuses on tail-classes to help them escape saddle points. Our experiments confirm that SSE-SAM has better ability in escaping saddles both on head and tail classes, and shows performance improvements.

Abstract: Visionlanguage foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models also display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of ``decision shortcuts'' that hinder their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both desired invariant causal features and undesired decision shortcuts. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, we propose a simple yet effective method, Spurious Feature Eraser (SEraser), to alleviate the decision shortcuts by erasing the spurious features. Specifically, we introduce a test-time prompt tuning paradigm that optimizes a learnable prompt, thereby compelling the model to exploit invariant features while disregarding decision shortcuts during the inference phase. The proposed method effectively alleviates excessive dependence on potentially misleading spurious information. We conduct comparative analysis of the proposed method against various approaches which validates the significant superiority.

Abstract: In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, and optimal transport. We extend "Knothe's rearrangement" and "triangular transport" to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss fairness at the individual level. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets.

Abstract: Realworld vision models in dynamic environments face rapid shifts in domain distributions, leading to decreased recognition performance. Using unlabeled test data, continuous test-time adaptation (CTTA) directly adjusts a pre-trained source discriminative model to these changing domains. A highly effective CTTA method involves applying layer-wise adaptive learning rates for selectively adapting pre-trained layers. However, it suffers from the poor estimation of domain shift and the inaccuracies arising from the pseudo-labels. This work aims to overcome these limitations by identifying layers for adaptation via quantifying model prediction uncertainty without relying on pseudo-labels. We utilize the magnitude of gradients as a metric, calculated by backpropagating the KL divergence between the softmax output and a uniform distribution, to select layers for further adaptation. Subsequently, for the parameters exclusively belonging to these selected layers, with the remaining ones frozen, we evaluate their sensitivity to approximate the domain shift and adjust their learning rates accordingly. We conduct extensive image classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C, demonstrating the superior efficacy of our method compared to prior approaches.

Abstract: Many approaches to program synthesis perform a combinatorial search within a large space of programs to find one that satisfies a given specification. To tame the search space blowup, previous works introduced probabilistic and neural approaches to guide this combinatorial search by inducing heuristic cost functions. Bestfirst search algorithms ensure to search in the exact order induced by the cost function, significantly reducing the portion of the program space to be explored. We present a new best-first search algorithm called Eco Search, which is the first no-delay algorithm for pre-generation cost function: the amount of compute required between outputting two programs is constant, and in particular does not increase over time. This key property yields important speedups: we observe that Eco Search outperforms its predecessors on two classical domains.

Abstract: In realworld applications, spectral Graph Neural Networks (GNNs) are powerful tools for processing diverse types of graphs. However, a single GNN often struggles to handle different graph types—such as homogeneous and heterogeneous graphs—simultaneously. This challenge has led to the manual design of GNNs tailored to specific graph types, but these approaches are limited by the high cost of labor and the constraints of expert knowledge, which cannot keep up with the rapid growth of graph data. To overcome these challenges, we introduce AutoSGNN, an automated framework for discovering propagation mechanisms in spectral GNNs. AutoSGNN unifies the search space for spectral GNNs by integrating large language models with evolutionary strategies to automatically generate architectures that adapt to various graph types. Extensive experiments on nine widely-used datasets, encompassing both homophilic and heterophilic graphs, demonstrate that AutoSGNN outperforms state-of-the-art spectral GNNs and graph neural architecture search methods in both performance and efficiency.

Abstract: Recently, image inpainting has become a common tool for manipulating nature images in a malicious manner, which has led to the rapid advancement of inpainting forensics. Although current forensics methods have shown precise location of inpainting regions and reliable robustness against image postprocessing operations, it remains unclear whether they can effectively resist the possible attacks in real-world scenarios. To identify potential flaws, we propose a novel black-box anti-forensics framework to attack inpainting forensics methods, which employs reinforcement learning to generate a query-efficient countermeasure, named RLGC. To this end, we define reinforcement learning paradigm to model the Markov Decision Process of query-based black-box anti-forensics scenario. Specifically, pixel-wise agents are used to modulate anti-forensics images based on action selection and query forensics methods to obtain corresponding outputs. Later, reward function evaluates attack effect and image distortion with these outputs. To maximize the cumulative reward, policy and value networks are integrated and trained by Asynchronous Advantage Actor-Critic algorithm. Experimental results demonstrate that, without visually detectable distortion on anti-forensics images, RLGC achieves remarkable attack effects in a highly query-effcient way against various black-box inpainting forensics methods, even outperforming the most representative white-box attack method.

Abstract: LLMsas-a-judge is a recently popularized method which replaces human judgements in task evaluation with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.

Abstract: Ordinal regression classifies an object to a class out of a given set of possible classes, where labels possess a natural order. It is relevant to a wide array of domains including risk assessment, sentiment analysis, image ranking, and recommender systems. Like common classification, the primary goal of ordinal regression is accuracy. Yet, in this context, the severity of prediction errors varies, e.g., in risk assessment, Critical Risk is more urgent than High risk and significantly more urgent than No risk. This leads to a modified objective of ensuring that the model's output is as close as possible to the correct class, considering the order of labels. Therefore, ordinal regression models should use ordinalityaware loss for training. In this work, we focus on two properties of ordinality-aware losses, namely monotonicity and balance sensitivity. We show that existing ordinal loss functions lack these properties and introduce SLACE (Soft Labels Accumulating Cross Entropy), a novel loss function that provably satisfies said properties. We demonstrate empirically that SLACE outperforms the state-of-the-art ordinal loss functions on most tabular ordinal regression benchmarks.

Abstract: ion is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses streams of stochastic problems characterized by long horizons, sparse rewards, and unknown transition and reward functions. Our approach continually learns and maintains an interpretable state abstraction, and uses it to invent highlevel options with abstract symbolic representations. These options meet three key desiderata: (1) composability for solving tasks effectively with lookahead planning, (2) reusability across problem instances for minimizing the need for relearning, and (3) mutual independence for reducing interference among options. Our main contributions are approaches for continually learning transferable, generalizable options with symbolic representations, and for integrating search techniques with RL to efficiently plan over these learned options to solve new problems. Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods.

Abstract: Unsupervised domain adaptation (UDA) refers to a domain adaptation framework in which a learning model is trained based on the labeled samples on the source domain and unlabelled ones in the target domain. The dominant existing methods in the field that rely on the classical covariate shift assumption to learn domaininvariant feature representation have yielded suboptimal performance under label distribution shift. In this paper, we propose a novel Conditional Adversarial SUpport ALignment (CASUAL) whose aim is to minimize the conditional symmetric support divergence between the source’s and target domain’s feature representation distributions, aiming at a more discriminative representation for the classification task. We also introduce a novel theoretical target risk bound, which justifies the merits of aligning the supports of conditional feature distributions compared to the existing marginal support alignment approach in the UDA settings. We then provide a complete training process for learning in which the objective optimization functions are precisely based on the proposed target risk bound. Our empirical results demonstrate that CASUAL outperforms other state-of-the-art methods on different UDA benchmark tasks under different label shift conditions.

Abstract: Open set anomaly detection (OSAD) is a crucial task that aims to identify abnormal patterns or behaviors in data sets, especially when the anomalies observed during training do not represent all possible classes of anomalies. The recent advances in quantum computing in handling complex data structures and improving machine learning models herald a paradigm shift in anomaly detection methodologies. This study proposes a Quantum Scoring Module (Qsco), embedding quantum variational circuits into neural networks to enhance the model's processing capabilities in handling uncertainty and unlabeled data. Extensive experiments conducted across eight realworld anomaly detection datasets demonstrate our model's superior performance in detecting anomalies across varied settings and reveal that integrating quantum simulators does not result in prohibitive time complexities. At the same time, the experimental results under different noise models also prove that Qsco is a noise-resilient algorithm. Our study validates the feasibility of quantum-enhanced anomaly detection methods in practical applications.

Abstract: This work presents a novel approach to lowrank matrix factorization in a federated learning context, where multiple clients collaboratively solve a matrix decomposition problem without sharing their local data. The algorithm introduces a power initialization technique for the global factorization matrix and combines it with local gradient descent updates to achieve strong theoretical and practical guarantees. Considering this power initialization, we rewrite the previous smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent potentially requiring a single step of communication at the initialization step. We provide a linear rate of convergence of the excess loss, our results improve the rates of convergence given in the literature. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.

Abstract: We propose denoising diffusion variational inference (DDVI), a blackbox variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology---inferring latent ancestry from human genomes---where it outperforms strong baselines on the Thousand Genomes dataset.

School of Artificial Intelligence, and Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education, Beijing Normal University, Beijing, China, School of Artificial Intelligence, and Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education, Beijing Normal University, Beijing, China, School of Information, Central University of Finance and Economics, Beijing, China, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, Jinhua, China Zhejiang Institute of Optoelectronics, Jinhua, China, School of Cyber Science and Technology, Sun Yat-Sen University, Shenzhen, China, School of Computer and Information Technology, Shanxi University, Taiyuan, China, Department of Computer Science, University of York, York, United Kingdom

Abstract: Graphbased representations are powerful tools for analyzing structured data. In this paper, we propose a novel model to learn Deep Hierarchical Attention-based Kernelized Representations (DHAKR) for graph classification. To this end, we commence by learning an assignment matrix to hierarchically map the substructure invariants into a set of composite invariants, resulting in hierarchical kernelized representations for graphs. Moreover, we introduce the feature-channel attention mechanism to capture the interdependencies between different substructure invariants that will be converged into the composite invariants, addressing the shortcoming of discarding the importance of different substructures arising in most existing R-convolution graph kernels. We show that the proposed DHAKR model can adaptively compute the kernel-based similarity between graphs, identifying the common structural patterns over all graphs. Experiments demonstrate the effectiveness of the proposed DHAKR model.

Abstract: Recently, there has been a focus on the streaming setting in a line of works on the MultiArmed Bandit (MAB). In this scenario, a large number of arms arrive in a streaming manner, and the algorithm scans through the stream and stores some arms in its limited processing memory. We advance this line of research by introducing the Linear Streaming Bandit setup, where the arriving arms have profile vectors observable to the algorithm. The profile of an arm has a linear correlation with the expected reward. This setup is motivated by real-world applications, such as when a company or a crowdsourcing platform hires a worker from many sequentially arriving applicants with their resumes. We address two problems in this setup: Regret Minimization and Fixed-Budget Epsilon-Best Arm Identification. For the former, we propose an algorithm whose regret is independent of the number of arms, thus it is able to handle arbitrarily long arm streams. For the latter, we present a multi-pass algorithm whose error probability is sub-linear w.r.t. the number of arms, and an algorithm identifying the exact best arm in only a single pass. We validate the effectiveness of all proposed algorithms through experiments on both synthetic and real-world datasets.

Abstract: Existing crossnetwork node classification methods are mainly proposed for closed-set setting, where the source network and the target network share exactly the same label space. Such a setting is restricted in real-world applications, since the target network might contain additional classes that are not present in the source. In this work, we study a more realistic open-set cross-network node classification (O-CNNC) problem, where the target network contains all the known classes in the source and further contains several target-private classes unseen in the source. Borrowing the concept from open-set domain adaptation, all target-private classes are defined as an additional “unknown” class. To address the challenging O-CNNC problem, we propose an unknown-excluded adversarial graph domain alignment (UAGA) model with a separate-adapt training strategy. Firstly, UAGA roughly separates known classes from unknown class, by training a graph neural network encoder and a neighborhood-aggregation node classifier in an adversarial framework. Then, unknown-excluded adversarial domain alignment is customized to align only target nodes from known classes with the source, while pushing target nodes from unknown class far away from the source, by assigning positive and negative domain adaptation coefficient to known class nodes and unknown class nodes. Extensive experiments on real-world datasets demonstrate significant outperformance of the proposed UAGA over state-of-the-art methods on O-CNNC.

Abstract: Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using nonGaussian likelihoods. To deal with various tasks in spatial modeling, we benefit from this development. Difficulties still arise when we can only access summarized data consisting of representative features, summary statistics, and data point counts. Such situations frequently occur primarily due to concerns about confidentiality and management costs associated with spatial data. This study tackles learning and inference using only summarized data within the framework of Gaussian process regression. To address this challenge, we analyze the approximation errors in the marginal likelihood and posterior distribution that arise from utilizing representative features. We also introduce the concept of sample quasi-likelihood, which facilitates learning and inference using only summarized data. Non-Gaussian likelihoods satisfying certain assumptions can be captured by specifying a variance function that characterizes a sample quasi-likelihood function. Theoretical and experimental results demonstrate that the approximation performance is influenced by the granularity of summarized data relative to the length scale of covariance functions. Experiments on a real-world dataset highlight the practicality of our method for spatial modeling.

Abstract: Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PatentDesc355K, a novel large-scale dataset containing ∼355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PatentLMM – a novel large multimodal model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Our extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts the performance, generating coherent descriptions compared to fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents.

Abstract: Crossmodal retrieval aims to retrieve relevant data across different modalities. Driven by costly massive labeled data, existing cross-modal retrieval methods achieve encouraging results. To reduce annotation costs while maintaining performance, this paper focuses on an untouched but challenging problem, i.e., cross-modal retrieval with partial labels (PLCMR). PLCMR faces the dual challenges of annotation ambiguity and modality gap. To address these challenges, we propose a novel method termed disambiguated contrastive alignment (DiCA) for cross-modal retrieval with partial labels. Specifically, DiCA proposes a novel non-candidate boosted disambiguation learning mechanism (NBDL), which elaborately balances the trade-off between the losses on candidate and non-candidate labels that eliminate label ambiguity and narrow the modality gap. Moreover, DiCA presents an instance-prototype representation learning mechanism (IPRL) to enhance the model by further eliminating the modality gap at both the instance and prototype levels. Thanks to NBDL and IPRL, our DiCA effectively addresses the issues of annotation ambiguity and modality gap for cross-modal retrieval with partial labels. Experiments on four benchmarks validate the effectiveness of our proposed method, which demonstrates enhanced performance over existing state-of-the-art methods.

Abstract: Training deep Convolutional Neural Networks (CNNs) presents unique challenges, including the pervasive issue of elimination singularities—consistent deactivation of nodes leading to degenerate manifolds within the loss landscape. These singularities impede efficient learning by disrupting feature propagation. To mitigate this, we introduce Pool Skip, an architectural enhancement that strategically combines a Max Pooling, a Max Unpooling, a 3 × 3 convolution, and a skip connection. This configuration helps stabilize the training process and maintain feature integrity across layers. We also propose the Weight Inertia hypothesis, which underpins the development of Pool Skip, providing theoretical insights into mitigating degradation caused by elimination singularities through dimensional and affine compensation. We evaluate our method on a variety of benchmarks, focusing on both 2D natural and 3D medical imaging applications, including tasks such as classification and segmentation. Our findings highlight Pool Skip's effectiveness in facilitating more robust CNN training and improving model performance.

Abstract: Generative models can enhance discriminative classifiers by constructing complex feature spaces, thereby improving performance on intricate datasets. Conventional methods typically augment datasets with more detailed feature representations or increase dimensionality to make nonlinear data linearly separable. Utilizing a generative model solely for feature space processing falls short of unlocking its full potential within a classifier and typically lacks a solid theoretical foundation. We base our approach on a novel hypothesis: the probability information (logit) derived from a single model training can be used to generate the equivalent of multiple training sessions. Leveraging the central limit theorem, this synthesized probability information is anticipated to converge toward the true probability more accurately. To achieve this goal, we propose the BernoulliGaussian Decision Block (BGDB), a novel module inspired by the Central Limit Theorem and the concept that the mean of multiple Bernoulli trials approximates the probability of success in a single trial. Specifically, we utilize Improved Denoising Diffusion Probabilistic Models (IDDPM) to model the probability of Bernoulli Trials. Our approach shifts the focus from reconstructing features to reconstructing logits, transforming the logit from a single iteration into logits analogous to those from multiple experiments. We provide the theoretical foundations of our approach through mathematical analysis and validate its effectiveness through experimental evaluation using various datasets for multiple imaging tasks, including both classification and segmentation.

Abstract: Local intrinsic dimension (LID) estimation methods have received a lot of attention in recent years thanks to the progress in deep neural networks and generative modeling. In opposition to old nonparametric methods, new methods use generative models to approximate diffused dataset density to scale the methods to high-dimensional datasets (e.g. images). In this paper, we investigate the recent state-of-the-art parametric LID estimation methods from the perspective of the Wiener process. We explore how these methods behave when their assumptions are not met. We give an extended mathematical description of those methods and their error as a function of the probability density of the data.

Key Lab of System Software at Chinese Academy of Sciences, State Key Lab of Computer Science at Institute of Software at Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing Nanyang Technological University, Continental-NTU Corporate Lab, Key Lab of System Software at Chinese Academy of Sciences, State Key Lab of Computer Science at Institute of Software at Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing Nanjing Institute of Software Technology, University of Chinese Academy of Sciences, Nanjing, Nanjing University, Zhejiang Sci-Tech University, CFAR and IHPC, A*STAR, Key Lab of System Software at Chinese Academy of Sciences, State Key Lab of Computer Science at Institute of Software at Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, Key Lab of System Software at Chinese Academy of Sciences, State Key Lab of Computer Science at Institute of Software at Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing Nanjing Institute of Software Technology, University of Chinese Academy of Sciences, Nanjing, Nanyang Technological University

Abstract: Multiobjective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining the diversity. To combat these, this paper proposes the first LLM-empowered adaptive evolutionary search algorithm to detect safety violations in MCDL systems. Inspired by the context-understanding ability of Large Language Models (LLMs), our approach promotes the LLM to comprehend the optimization problem and generate an initial population tailed to evolutionary objectives. Subsequently, it employs adaptive selection and variation to iteratively produce offspring, balancing the evolutionary efficiency and diversity. During the evolutionary process, to navigate away from the local optima, our approach integrates the evolutionary experience back into the LLM. This utilization harnesses the LLM's quantitative reasoning prowess to generate differential seeds, breaking away from current optimal solutions. We evaluate our approach in finding safety violations of MCDL systems, and compare its performance with state-of-the-art MOEA methods. Experimental results show that our approach can significantly improve the efficiency and diversity of the evolutionary search.

Abstract: Offsitetuning is a privacy-preserving method for tuning large language models (LLMs) by sharing a lossy compressed emulator from the LLM owners with data owners for downstream task tuning. This approach protects the privacy of both the model and data owners. However, current offsite tuning methods often suffer from adaptation degradation, high computational costs, and limited protection strength due to uniformly dropping LLM layers or relying on expensive knowledge distillation. To address these issues, we propose ScaleOT, a novel privacy-utility-scalable offsite-tuning framework that effectively balances privacy and utility. ScaleOT introduces a novel layerwise lossy compression algorithm that uses reinforcement learning to obtain the importance of each layer. It employs lightweight networks, termed harmonizers, to replace the raw LLM layers. By combining important original LLM layers and harmonizers in different ratios, ScaleOT generates emulators tailored for optimal performance with various model scales for enhanced privacy protection. Additionally, we present a rank reduction method to further compress the original LLM layers, significantly enhancing privacy with negligible impact on utility. Comprehensive experiments show that ScaleOT can achieve nearly lossless offsite tuning performance compared with full fine-tuning while obtaining better model privacy.

Abstract: Backdoor attacks pose a significant threat during the model's training phase. Attackers craft predefined triggers to break deep neural networks, ensuring the model accurately classifies clean samples during inference yet erroneously classifies samples added with these triggers. Recent studies have shown that speaker recognition systems trained on large-scale data are susceptible to backdoor attacks. Existing attackers employ unnoticed ambient sounds as triggers. However, these sounds are not inherently part of the training samples themselves. In essence, triggers can be designed to maintain an intrinsic connection with the original speech to enhance stealthiness. Our paper presents a novel attack methodology named Speed Master, which undermines deep neural networks by manipulating the speed of speech samples. Specifically, we execute poison-only backdoor attacks using speed or tempo adjustment. Changes in speech rate have become a common occurrence, as seen on platforms that allow users to adjust playback speed. In real-world scenarios, people naturally adjust their speaking rate depending on the context. As a result, changes in a speaker’s speech rate are typically perceived as normal and are unlikely to raise suspicion. Furthermore, detecting such subtle adjustments becomes challenging for users without reference speech. Our comprehensive experiments demonstrate that Speed Master can achieve an ASR over 99% in the digital domain, with only a 0.6% poisoning rate. Additionally, we validate the feasibility of Speed Master in the real world and its resistance to typical defensive measures.

Abstract: Existing class incremental learning is mainly designed for singlelabel classification task, which is ill-equipped for multi-label scenarios due to the inherent contradiction of learning objectives for samples with incomplete labels. We argue that the main challenge to overcome this contradiction in multi-label class-incremental learning (MLCIL) lies in the model's inability to clearly distinguish between known and unknown knowledge. This ambiguity hinders the model's ability to retain historical knowledge, master current classes, and prepare for future learning simultaneously. In this paper, we target at specifying what is known or not to accommodate Historical, Current, and Prospective knowledge for MLCIL and propose a novel framework termed as HCP. Specifically, (i) we clarify the known classes by dynamic feature purification and recall enhancement with distribution prior, enhancing the precision and retention of known information. (ii) We design prospective knowledge mining to probe the unknown, preparing the model for future learning. Extensive experiments validate that our method effectively alleviates catastrophic forgetting in MLCIL, surpassing the previous state-of-the-art by 3.3% on average accuracy for MS-COCO B0-C10 setting without replay buffers.

College of Computer Science, Inner Mongolia University, China National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, China Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, China, College of Computer Science, Inner Mongolia University, China National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, China Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, China, College of Computer Science, Inner Mongolia University, China National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, China Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, China, College of Computer Science, Inner Mongolia University, China National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, China Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, China

Abstract: The great challenge of handwritten mathematical expression recognition (HMER) is the complex structures of the expressions, which are directly related to the symbol spatial positions. Existing HMER methods typically employ attention mechanisms in the decoder of their models to implicitly perceive the symbol positions, or employ symbol counting and treebased strategies to model the symbol spatial relation. However, these methods still cannot effectively capture the structural information of formulas, thus negatively impacting the symbol decoding in HMER. To deal with this problem and enhance the HMER performance, this paper proposes a novel auxiliary task, namely predicting the symbol spatial distribution map of handwritten expression images. On such basis, this paper designs a symbol spatial-aware network (SSAN) for this task, which is jointly optimized with the HMER model. Specifically, considering the similarity of the symbol spatial positions between the handwritten mathematical expression images and their corresponding printed templates, we obtain the symbol spatial distribution map by first generating printed templates from LaTeX ground-truth for handwritten formula images and then replacing the connected components of printed templates with 2D Gaussian distribution maps of the same size. Meanwhile, due to the loose alignment of the symbol spatial positions between handwritten and printed formula images, and misclassification of similar symbols, we further propose a coarse-to-fine alignment strategy and an attention-guided symbol masking strategy in SSAN to tackle these issues. Extensive experiments demonstrate that SSAN significantly improves the recognition performance of the HMER models, and the proposed auxiliary tasks are more effective in enhancing HMER performance than existing auxiliary tasks.

School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, 300384, China, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, 300384, China, School of Information Science and Technology, Hangzhou Normal University, Hangzhou, 311121, China the Department of Technology, Management and Economics, Technical University of Denmark, Lyngby, Denmark, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, 300384, China the Department of Technology, Management and Economics, Technical University of Denmark, Lyngby, Denmark, Norwegian University of Science and Technology, School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, 300384, China

Abstract: Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance its performance. Existing methods often assume that the student model is inherently inferior to the teacher model. However, we identify that the fundamental issue affecting student performance is the bias transferred by the teacher. Current KD frameworks transmit both right and wrong knowledge, introducing bias that misleads the student model. To address this issue, we propose a novel strategy to rectify bias and greatly improve the student model's performance. Our strategy involves three steps: First, we differentiate knowledge and design a bias elimination method to filter out biases, retaining only the right knowledge for the student model to learn. Next, we propose a bias rectification method to rectify the teacher model's wrong predictions, fundamentally addressing bias interference. The student model learns from both the right knowledge and the rectified biases, greatly improving its prediction accuracy. Additionally, we introduce a dynamic learning approach with a loss function that updates weights dynamically, allowing the student model to quickly learn right knowledgebased easy tasks initially and tackle hard tasks corresponding to biases later, greatly enhancing the student model's learning efficiency. To the best of our knowledge, this is the first strategy enabling the student model to surpass the teacher model. Experiments demonstrate that our strategy, as a plug-and-play module, is versatile across various mainstream KD frameworks.

Abstract: Federated learning (FL) has garnered considerable interest for its capability to learn from decentralized data sources. Given the increasing application of FL in decisionmaking scenarios, addressing fairness issues across different sensitive groups (e.g., female, male) in FL is crucial. Current research typically focus on facilitating fairness at each client's data (local fairness) or within the entire dataset across all clients (global fairness). However, existing approaches that focus exclusively on either global or local fairness fail to address two key challenges: (CH1) Under statistical heterogeneity, global fairness does not imply local fairness, and vice versa. (CH2) Achieving fairness under model-agnostic setting. To tackle the aforementioned challenges, this paper proposes a novel post-processing framework for achieving both Local and Global Fairness in the FL context, namely LoGoFair. To address CH1, LoGoFair endeavors to seek the Bayes optimal classifier under local and global fairness constraints, which strikes the optimal accuracy-fairness balance in the probabilistic sense. To address CH2, LoGoFair employs a model-agnostic federated post-processing procedure that enables clients to collaboratively optimize global fairness while ensuring local fairness, thereby achieving the optimal fair classifier within FL. Experimental results on three real-world datasets further illustrate the effectiveness of the proposed LoGoFair framework.

Abstract: Malicious traffic detection is one of the main challenges in the field of cybersecurity. Although modern deep learning methods have made progress in identifying malicious traffic, they often overlook the persistent nature of attack behaviors, making it difficult to distinguish between malicious and normal traffic at a single observation point. To address this issue, we propose MalDetectFormer, which aims to accurately capture the spatiotemporal dynamics of malicious traffic. By incorporating a sparse attention mechanism, MalDetectFormer can efficiently focus on key characteristics of traffic nodes while overcoming the challenges faced by traditional longsequence processing. Additionally, by adopting a time-cyclic attention mechanism, the model can identify and capture persistent attack patterns of malicious traffic. Experiments conducted on benchmark datasets demonstrate the advantages of the proposed MalDetectFormer in both malicious traffic detection and malicious attack recognition tasks.

Abstract: Unsupervised federated learning for crossmodal retrieval has received increasing attention in recent years as it can free the requirement for annotations and avoid uploading original clients’ data to servers. Most existing methods focus on how to learn better local models and their aggregation to overcome data distribution drift across clients. Unlike prior works, we propose to address the data distribution problem by generating synthetic data, which can benefit existing federated learning methods. Specifically, we train a WGAN generator with three newly designed loss constraints on each client to improve the quality of the generated data. We first compute cluster prototypes to address the problem of lack of labels. Then, a direct contrastive loss between generated image and text features, an indirect contrastive loss with reference to cluster prototypes, and a Jensen-Shannon Divergence (JSD) loss also with reference to cluster prototypes work together to constrain the WGAN. The locally trained generators and local prototypes are sent to the server to generate and filter synthetic data with consideration of data distribution across all clients. The filtered data are used to train the aggregated global retrieval model, which is later sent to clients. The final global model becomes robust to all clients after several rounds of client-server iteration. Extensive experiments using four baselines across three datasets demonstrate that our method performs favourably against state-of-the-art methods.

Abstract: Multiview clustering (MVC) aims to integrate information from diverse data sources to facilitate the clustering process, which has achieved considerable success in various real-world applications. However, previous MVC methods typically employ one of two strategies: (1) designing separate feature extraction pipelines for each view, which restricts their ability to fully exploit collaborative potential; or (2) employing a single shared representation module, which hinders the capture of diverse, view-specific representations. To tackle these challenges, we introduce Deep Multi-View Clustering via Collaborative Experts (DMVC-CE), a novel MVC approach that employs the Mixture of Experts (MoE) framework. DMVC-CE incorporates a gating network that dynamically selects multiple experts for handling each data sample, capturing diverse and complementary information from different views. Additionally, to ensure balanced expert utilization and maintain their diversity, we introduce an equilibrium loss and a multi-expert distinctiveness enhancer. The equilibrium loss prevents excessive reliance on specific experts, while the distinctiveness enhancer encourages each expert to specialize in different aspects of the data, thereby promoting diversity in learned representations. Comprehensive experiments on various multi-view benchmark datasets demonstrate the superiority of DMVC-CE compared to state-of-the-art MVC baselines.

Abstract: Accurate trajectory prediction has prominent significance in autonomous driving scenarios. Most existing methods predict the trajectory of an agent by learning its interaction with other agents and the map within the scenario. However, the heterogeneous distribution of these elements across different geographical scenarios is always ignored. Thus, trajectory predictors might struggle to generalize well when deployed in different geographical scenarios. To bridge the crossgeography gap, in this paper, we propose a plug-and-play self-training pipeline, termed STraj, for cross-geography trajectory prediction. STraj comprises three progressive steps: pseudo label (i.e., time-series trajectory) generation, update, and utilization. First, to generate pseudo labels that generalize to the cross-geography scenarios, STraj pre-trains the predictor through the complementary agent and map augmentations. Second, to facilitate the stable training of the predictor, we design a specific pseudo label update strategy. This strategy selects high-consistency pseudo trajectories from the current and historical epochs to supervise the target domain samples. Third, with generated pseudo trajectories, we introduce trajectory-induced contrastive learning to mitigate the representation bias of cross-geography agents. Extensive experiment results on various cross-geography trajectory prediction benchmarks demonstrate the effectiveness of STraj.

Abstract: In this work, we study the logarithmic regret for reinforcement learning (RL) with linear function approximation and adversarial corruptions, in the formulation of linear Markov decision processes (MDPs). Specifically, we consider the case where there exist adversarial corruptions over the reward functions, and the total amount of the corruptions of each step h across all episodes K is bounded by a corruption level C ≥ 0. We propose an algorithm, doubleweighted least-squares value iteration with UCB (DW-LSVI-UCB), which leverages weighted linear regressions to learn the (corrupted) unknown reward parameters and unknown transition parameters simultaneously. We prove that DW-LSVI-UCB attains an O( d2H4 log2(1+K/δ) gapmin + CdH2) regret (omitting the dependence on lower order terms), where d is the ambient dimension of the feature mapping, H is the horizon length, gapmin is the minimal sub-optimality gap, and K is the number of episodes. Additionally, when there are no adversarial corruptions over reward functions, the regret of our algorithm improves the previous best result by an O(dH/ log K) factor.

Abstract: Due to its effectiveness and efficiency, graphbased multi-view clustering has recently attracted much attention. However, the multi-view data are often incomplete and unpaired in real-world applications as a consequence of data loss or corruption. Although efforts have been made through a series of methods to address the problems of incomplete or unpaired multi-view data, the following issues still persist: 1) Most existing methods only focus on the incomplete multi-view data or unpaired multi-view data, and exhibit weaknesses when addressing both incomplete and unpaired multi-view data simultaneously. 2) Some methods neglect the graph information of the data from different views during the learning process. To tackle these issues, we propose the Multi-view Graph Clustering framework with Cross-view Feature Fusion (MGCCFF), a novel approach for clustering incomplete and unpaired multi-view data. Specifically, MGCCFF learns soft clustering label information from complete data and utilizes this to capture category-level cross-view correspondences. It then learns latent representation enriched with cross-view information based on the established mappings. To obtain a multi-view graph structure under conditions of incomplete and unpaired data, MGCCFF innovatively integrates the concept of self-expression with the autoencoder architecture and exploits the latent relationships between labels and the graph structure, thereby enabling the generation of sparse and accurate graphical structure under multi-view conditions for the final clustering task. The experiments on incomplete and unpaired multi-view datasets demonstrate that MGCCFF outperforms state-of-the-art methods.

Abstract: We study convex optimization problems under differential privacy (DP). With heavytailed gradients, existing works achieve suboptimal rates. The main obstacle is that existing gradient estimators have suboptimal tail property, resulting in a superfluous factor of d in the union bound. In this paper, we explore algorithms achieving optimal rates of DP optimization with heavy-tailed gradients. Our first method is a simple clipping approach. Under bounded p-th order moments of gradients, with n samples, it achieves minimax optimal population risk with epsilon less than 1/d. We then propose an iterative updating method, which is more complex but achieves this rate for all epsilon smaller than 1. The results significantly improve over existing methods. Such improvement relies on a careful treatment of the tail behavior of gradient estimators. Our results match the minimax lower bound, indicating that the theoretical limit of stochastic convex optimization under DP is achievable.

Abstract: Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resourceconstrained hardware, such as Internet-of-Things (IoT) devices requires non-trivial efforts and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks, compared to the 350M parameter OPT while being 4.9x faster on NVIDIA Jetson Nano with 5.5x reduction in model size.

Abstract: We consider a general and realistic scenario involving nonstationary time series, consisting of several offline intervals with different distributions within a fixed offline time horizon, and an online interval that continuously receives new samples. For non-stationary time series, the data distribution in the current online interval may have appeared in previous offline intervals. We theoretically explore the feasibility of applying knowledge from offline intervals to the current online interval. To this end, we propose the Mixture of Online and Offline Experts (MOOE). MOOE learns static offline experts from offline intervals and maintains a dynamic online expert for the current online interval. It then adaptively combines the offline and online experts using a meta expert to make predictions for the samples received in the online interval. Specifically, we focus on theoretical analysis, deriving parameter convergence, regret bounds, and generalization error bounds to prove the effectiveness of the algorithm.

Abstract: In practical applications, it is often necessary to transfer knowledge from large pretrained models to small ones with various architectures for tackling different tasks. The Learngene framework, proposed recently, firstly extracts one compact module termed as learngene from a large welltrained model, after which learngene is used to build descendant models for handling diverse tasks. In this paper, we aim to explore extracting and inheriting learngene which can be generalized across different model architectures and tasks, remaining understudied in previous works. Inspired by the existing observations that large kernel convolutional neural networks (CNNs) exhibit significant generalization potential across various architectures and tasks, we propose a novel two-stage Learngene method termed CLKG (Convolutional Learngene for Knowledge Generalization), which inherits convolutional kernels containing generalized knowledge as learngene to build diverse models for multiple tasks. Specifically, we construct an auxiliary model comprised of small kernels and train it through dense feature distillation to inherit the feature extraction ability from large kernel CNNs. After distillation, we select certain kernels from the auxiliary model as learngene based on three criteria: direct kernel extraction, priority to edge kernels, and continuous kernel selection. Subsequently, we adapt learngene according to the width of the descendant models and use it to initialize the backbone part of descendant models. Experiments on diverse vision tasks such as image classification, object detection and semantic segmentation demonstrate the superiority of CLKG. For example, compared with from scratch training, it brings 2.89% improvements on VOC12+SBD, and reduces around 2x training data volume and training epochs to achieve better results. Furthermore, compared to knowledge distillation method, CLKG significantly reduces negative transfer on certain datasets, e.g., achieves 1.88% performance improvements on NAO dataset despite domain differences.

Abstract: We consider the problem of learning stable matchings with unknown preferences in a decentralized and uncoordinated manner, where ``decentralized" means that players make decisions individually without the influence of a central platform, and ``uncoordinated" means that players do not need to synchronize their decisions using prespecified rules. First, we provide a game formulation for this problem with known preferences, where the set of pure Nash equilibria (NE) coincides with the set of stable matchings, and mixed NE can be rounded to a stable matching. Then, we show that for hierarchical markets, applying the exponential weight (EXP) learning algorithm to the stable matching game achieves logarithmic regret in a fully decentralized and uncoordinated fashion. Moreover, we show that EXP converges locally and exponentially fast to a stable matching in general matching markets. We complement our results by introducing another decentralized and uncoordinated learning algorithm that globally converges to a stable matching with arbitrarily high probability.

The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, School of Artificial Intelligence, Shandong University, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: A key challenge in multiagent collaborative tasks is reducing uncertainty about teammates to enhance cooperative performance. Explicit communication methods can reduce uncertainty about teammates, but the associated high communication costs limit their practicality. Alternatively, implicit consensus learning can promote cooperation without incurring communication costs. However, its performance declines significantly when local observations are severely limited. This paper introduces a novel multi-agent learning framework that combines the strengths of these methods. In our framework, agents generate a consensus about the group based on their local observations and then use both the consensus and local observations to produce messages. Since the consensus provides a certain level of global guidance, communication can be disabled when not essential, thereby reducing overhead. Meanwhile, communication can provide supplementary information to the consensus when necessary. Experimental results demonstrate that our algorithm significantly reduces inter-agent communication overhead while ensuring efficient collaboration.

Clermont Auvergne University, Clermont Auvergne INP, CNRS, LIMOS, F-63000 Clermont-Ferrand, France., Univ Lyon, UCBL, CNRS, INSA Lyon, Centrale Lyon, Univ Lyon 2, LIRIS, UMR5205, Lyon, France, Laboratory of Applied Mathematics, Faculty of Exact Sciences, University of Bejaia, Bejaia, Algeria, TCG Centres for Research and Education in Science and Technology, Kolkata, India, Carnegie Mellon University, Computer Science Department, Pittsburgh, USA Strategy Robot, Inc. Strategic Machine, Inc. Optimized Markets, Inc.

Abstract: Coalition structure generation (CSG), i.e. the problem of optimally partitioning a set of agents into coalitions to maximize social welfare, is a fundamental computational problem in multiagent systems. This problem is important for many applications where small run times are necessary, including transportation and disaster response. In this paper, we develop SALDAE, a multiagent path finding algorithm for CSG that operates on a graph of coalition structures. Our algorithm utilizes a variety of heuristics and strategies to perform the search and guide it. It is an anytime algorithm that can handle large problems with hundreds and thousands of agents. We show empirically on nine standard value distributions, including disaster response and electric vehicle allocation benchmarks, that our algorithm enables a rapid finding of highquality solutions and compares favorably with other state-of-the-art methods.

Abstract: Multiagent path finding (MAPF) is a safety-critical scenario where the goal is to secure collision-free trajectories from initial to desired locations. However, due to system complexity and uncertainty, integrating learning-based controllers with MAPF is challenging and cannot theoretically guarantee the safety of the learned controllers. In response, our study proposes a verified safe multi-agent neural control (VSMANC) approach for MAPF, focusing on the unified training of Decentralized Control Barrier Functions (DCBF) and controllers to enhence safety. VSMANC enables all agents to concurrently learn controllers and DCBFs using a unified loss function designed to maximize safety, adhere to standard control policies, and incorporate path-finding-related heuristics. We also propose a formal verification-guided retraining process to both verify the properties of the learned DCBFs and generate counterexamples for retraining, thereby providing a verified safety guarantee. We validate our approach through shape formation experiments and UAV simulations, demonstrating significant improvements in safety and effectiveness in complex multi-agent environments.

Abstract: Large Language Models (LLMs) excel in linguistic tasks but struggle with mathematical reasoning, particularly in nonEnglish languages like Hindi. This research aims to en- hance the mathematical reasoning skills of smaller, resource- efficient open-source LLMs in both Hindi and English. We evaluate models like OpenHathi 7B, LLaMA-2 7B, Wizard- Math 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B, Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods, and supervised fine-tuning. Our approach in- corporates curriculum learning, progressively training mod- els on increasingly difficult problems, a novel Decompo- sition Strategy to simplify complex arithmetic operations, and a Structured Solution Design that divides solutions into phases. Our experiments result in notable performance en- hancements. WizardMath 7B exceeds Gemini’s accuracy on English datasets by +6% and matches Gemini’s performance on Hindi datasets. Adopting a bilingual approach that com- bines English and Hindi samples achieves results comparable to individual language models, demonstrating the capability to learn mathematical reasoning in both languages. This re- search highlights the potential for improving mathematical reasoning in open-source LLMs.

Abstract: The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates the research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a singlehop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and relations to verify the given claim in real-world situations. Recent studies attempt to improve both understanding and reasoning abilities to enhance the performance, but they overlook the crucial relations between entities that benefit models to understand better and facilitate the prediction. To emphasize the significance of relations, we resort to Large Language Models (LLMs) considering their excellent understanding ability. Instead of other methods using LLMs as the predictor, we take them as relation extractors, for they do better in understanding rather than reasoning according to the experimental results. Thus, to solve the challenges above, we propose a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) for multi-hop fact verification. Specifically, we utilize an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complicated relations. Besides, we leverage a Knowledge-Augmented Relation Graph Fusion module to interact with each node and learn better claim-evidence representations comprehensively. The experimental results on four common-used datasets demonstrate the effectiveness and superiority of our model.

Abstract: Large Language Models (LLMs) possess vast amounts of knowledge within their parameters, prompting research into methods for locating and editing this knowledge. Previous work has largely focused on locating entityrelated (often single-token) facts in smaller models. However, several key questions remain unanswered: (1) How can we effectively locate query-relevant neurons in contemporary autoregressive LLMs, such as Llama and Mistral? (2) How can we address the challenge of long-form text generation? (3) Are there localized knowledge regions in LLMs? In this study, we introduce Query-Relevant Neuron Cluster Attribution (QRNCA), a novel architecture-agnostic framework capable of identifying query-relevant neurons in LLMs. QRNCA allows for the examination of long-form answers beyond triplet facts by employing the proxy task of multi-choice question answering. To evaluate the effectiveness of our detected neurons, we build two multi-choice QA datasets spanning diverse domains and languages. Empirical evaluations demonstrate that our method outperforms baseline methods significantly. Further, analysis of neuron distributions reveals the presence of visible localized regions, particularly within different domains. Finally, we show potential applications of our detected neurons in knowledge editing and neuron-based prediction.

Abstract: An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in blackbox LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.

Abstract: This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) with feedback from a formal inference engine, on logic theory induction. The analysis is complexitygraded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules, poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning, are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle than theory complexity for LLMs.

Abstract: Singing voice synthesis has made remarkable progress in generating natural and highquality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thus limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over various techniques. To enhance the diversity of training data, we develop a technique detection model that automatically annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control.

Abstract: Multimodal aspectbased sentiment analysis (MABSA) integrates text and images to perform fine-grained sentiment analysis on specific aspects, enhancing the understanding of user opinions in various applications. Existing methods use modality alignment for information interaction and fusion between images and text, but an inherent gap between these two modalities necessitates a more direct bridging mechanism to effectively connect image understanding with text content. For this, we propose the Descriptions Enhanced Question-Answering Framework (DEQA), which generates descriptions of images using GPT-4, leveraging the multimodal large language model to provide more direct semantic context of images. In DEQA, to help the model better understand the task's purpose, we frame MABSA as a multi-turn question-answering problem to add semantic guidance and hints. We input text, image, and description into separate experts in various combinations, allowing each expert to focus on different features and thereby improving the comprehensive utilization of input information. By integrating these expert outputs within a multi-turn question-answering format, we employ a multi-expert ensemble decision-making approach to produce the final prediction results. Experimental results on two widely-used datasets demonstrate that our method achieves state-of-the-art performance. Furthermore, our framework substantially outperforms GPT-4o and other multimodal large language models, showcasing its superior effectiveness in multimodal sentiment analysis.

Abstract: Early exiting is an effective paradigm for improving the inference efficiency of pretrained language models (PLMs) by dynamically adjusting the number of executed layers for each sample. However, in most existing works, easy and hard samples are treated equally by each classifier during training, which neglects the test-time early exiting behavior, leading to inconsistency between training and testing. Although some methods have tackled this issue under a fixed speed-up ratio, the challenge of flexibly adjusting the speed-up ratio while maintaining consistency between training and testing is still under-explored. To bridge the gap, we propose a novel Consistency-Oriented Signal-based Early Exiting (COSEE) framework, which leverages a calibrated sample weighting mechanism to enable each classifier to emphasize the samples that are more likely to exit at that classifier under various acceleration scenarios. Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency.

Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosodyaware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.

Abstract: Correct labels are indispensable for training effective machine learning models. However, creating highquality labels is expensive, and even professionally labeled data contains errors and ambiguities. Filtering and denoising can be applied to curate labeled data prior to training, at the cost of additional processing and loss of information. An alternative is on-the-fly sample reweighting during the training process to decrease the negative impact of incorrect or ambiguous labels, but this typically requires clean seed data. In this work we propose unsupervised on-the-fly meta loss rescaling to reweight training samples. Crucially, we rely only on features provided by the model being trained, to learn a rescaling function in real time without knowledge of the true clean data distribution. We achieve this via a novel meta learning setup that samples validation data for the meta update directly from the noisy training corpus by employing the rescaling function being trained. Our proposed method consistently improves performance across various NLP tasks with minimal computational overhead. Further, we are among the first to attempt on-the-fly training data reweighting on the challenging task of dialogue modeling, where noisy and ambiguous labels are common. Our strategy is robust in the face of noisy and clean data, handles class imbalance, and prevents overfitting to noisy labels. Our self-taught loss rescaling improves as the model trains, showing the ability to keep learning from the model's own signals. As training progresses, the impact of correctly labeled data is scaled up, while the impact of wrongly labeled data is suppressed.

Abstract: Mathematical reasoning ability objectively reflects a language model's understanding of implicit knowledge in contexts, with logic being a prerequisite for exploring, articulating and establishing effective reasoning. Large language models (LLMs) have shown great potential in complex reasoning tasks represented by mathematical reasoning. However, existing mathematical datasets either focus on commonsense reasoning, assessing the model's knowledge application ability, or arithmetic problems with fixed calculation rules, evaluating the model's rapid learning capability. There is a lack of datasets that require solving problems solely through logical reasoning. As a result, the performance of LLMs in accurately understanding the implicit logical relationships in problems and deriving conclusions based solely on given conditions is hindered. To address this challenge, we construct a dataset specifically for multiple step reasoning tasks: ReasoningMath (RMath). This dataset focuses on evaluating logical reasoning abilities with mathematical reasoning problems, covering typical problem types, including direct reasoning problems, hypothetical reasoning problems, and nested reasoning problems. Additionally, we design a standardized annotation scheme that transforms natural language descriptions of conditions into formal propositions. Other annotation contents include problem categories, proposition truth values, and proposition relationship types. This not only reduces biases caused by semantic misunderstandings during problem-solving, but also facilitates the incorporation of theoretically grounded logical reasoning methods to enhance reasoning abilities. Furthermore, we propose a normalization problem-solving framework based on propositional logic for RMath and design the problem-solving process for prompt tuning to guide LLMs to absorb mathematical logical theories and improving reasoning abilities. Finally, we evaluate RMath on several popular LLMs and present the corresponding results.

Abstract: Selflearning of Large Language Models (LLMs) facilitates their advancement towards super-intelligence by training with self-synthesized experiences. However, a critical challenge is the amplification of hallucinations in generated data during iterative self-learning, underscoring the need for reliable data selection. To address this, we investigate the mechanism of Inner Knowledge Explicitation, which involves explicitly extracting the inner knowledge from memory of LLMs, to concurrently improves reasoning, and enables reliable self-learning data selection. This paper introduces a Self Knowledge Explicitation Learning (SKE-Learn) framework, which equips the LLMs with meta-skills to explicitly extract, verify and utilize inner knowledge for reasoning. By leveraging these meta-skills, SKE-Learn establishes a self-learning approach that ensures reliable selection of self-synthetic data. This approach enhances performance through iterative self-learning while mitigating the problem of hallucinations. Empirical results from six benchmarks demonstrate that Inner Knowledge Explicitation improves reasoning by serving as a more effective prompting method. Additionally, SKE-Learn, based on the verifiability of explicit knowledge, shows consistent performance improvements over multiple self-training iterations, with an average performance increase from 52.79% to 56.54% across all benchmarks. Furthermore, Inner Knowledge Explicitation provides explanation and intervention space during LLM's generation process.

Department of Electrical and Computer Engineering, Seoul National University, Department of Mathematics, Chung-Ang University, College of Liberal Studies, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University, NVIDIA, Department of Electrical and Computer Engineering, Seoul National University Interdisciplinary Program in Artificial Intelligence, Seoul National University, Department of Electrical and Computer Engineering, Seoul National University Interdisciplinary Program in Artificial Intelligence, Seoul National University

Abstract: In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearingimpaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i 'side' of x), instead of the concise LaTeX format, which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured LaTeX representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for LaTeX translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.

Abstract: Prompt tuning is a promising method to finetune a pre-trained language model without retraining its large-scale parameters. Instead, it attaches a soft prompt to the input text, whereby downstream tasks can be well adapted by merely learning the embeddings of prompt tokens. Nevertheless, existing methods still suffer from two challenges: (i) they are hard to balance accuracy and efficiency. A longer (shorter) soft prompt generally leads to a better (worse) accuracy but at the cost of more (less) training time. (ii) The performance may not be consistent when adapting to different downstream tasks. We attribute it to the same embedding space but responsible for different requirements of downstream tasks. To address these issues, we propose an Efficient Prompt Tuning method (EPT) by multi-space projection and prompt fusion. Specifically, it decomposes a given soft prompt into a shorter prompt and two low-rank matrices, significantly reducing the training time. Accuracy is also enhanced by leveraging low-rank matrices and the short prompt as additional knowledge sources to enrich the semantics of the original short prompt. In addition, we project the soft prompt into multiple subspaces to improve the performance consistency, and then adaptively learn the combination weights of different spaces through a gating network. Experiments on 13 natural language processing downstream tasks show that our method significantly and consistently outperforms 11 comparison methods with the relative percentage of improvements up to 12.9%, and training time decreased by 14%.

Abstract: To address catastrophic forgetting in Continual Relation Extraction (CRE), many current approaches rely on memory buffers to rehearse previously learned knowledge while acquiring new tasks. Recently, promptbased methods have emerged as potent alternatives to rehearsal-based strategies, demonstrating strong empirical performance. However, upon analyzing existing prompt-based approaches for CRE, we identified several critical limitations, such as inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in shared parameters, and suboptimal handling of cross-task and within-task variances. To overcome these challenges, we draw inspiration from the relationship between prefix tuning and mixture of experts, proposing a novel approach that employs a prompt pool for each task, capturing variations within each task while enhancing cross-task variances. Furthermore, we incorporate a generative model to consolidate prior knowledge within shared parameters, eliminating the need for explicit data storage. Extensive experiments validate the efficacy of our approach, demonstrating superior performance over state-of-the-art prompt-based and rehearsal-free methods in continual relation extraction.

Abstract: Large Language Models (LLMs) like ChatGPT and GPT4 are versatile and capable of addressing open-domain question-answering(QA) tasks effectively. However, general LLMs, which are developed on open-domain data, may lack the domain-specific knowledge essential for tasks in vertical domains, such as legal, medical, etc. To address this issue, previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs in handling QA tasks. Unfortunately, these strategies are either cost-intensive or unreliable in practical applications. To this end, we present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models. BLADE consists of a black-box LLM and a small domain-specific LM. The small LM preserves domain-specific knowledge and offers specialized insights, while the general LLM contributes robust language comprehension and reasoning capabilities. Specifically, our method involves three steps: 1) pre-training the small LM with domain-specific data, 2) fine-tuning this model using knowledge instruction data, and 3) joint Bayesian optimization of the general LLM and the small LM. In our experiments, we verify the effectiveness of BLADE on diverse LLMs and datasets across different domains. This shows the potential of BLADE as an effective and cost-efficient solution in adapting general LLMs for vertical domains.

Abstract: Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs) via either computing a holistic score or employing another LLM as a judge. However, these approaches suffer from data leakage due to the open access of the benchmark and inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmarkfree evaluation method for LLMs that let a high-performance LLM host an irreproducible evaluation session and essentially avoids the data leakage. Moreover, this LLM performs as an examiner to raise up a series of questions under a topic with a tree planing strategy, which considers the current evaluation status to decide the next question generation and ensures the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 34B, and ultimately achieved the highest correlation coefficient with AlpacaEval2.0 using only around 45 questions. We also conduct more analysis to show the robustness and reliability of TreeEval.

Abstract: Word order difference between source and target languages is a major obstacle to crosslingual transfer, especially in the dependency parsing task. Current works are mostly based on order-agnostic models or word reordering to mitigate this problem. However, such methods either do not leverage grammatical information naturally contained in word order or are computationally expensive as the permutation space grows exponentially with the sentence length. Moreover, the reordered source sentence with an unnatural word order may be a form of noising that harms the model learning. To this end, we propose an Implicit Word Reordering framework with Knowledge Distillation (IWR-KD). This framework is inspired by that deep networks are good at learning feature linearization corresponding to meaningful data transformation, e.g. word reordering. To realize this idea, we introduce a knowledge distillation framework composed of a word-reordering teacher model and a dependency parsing student model. We verify our proposed method on Universal Dependency Treebanks across 31 different languages and show it outperforms a series of competitors, together with experimental analysis to illustrate how our method works towards training a robust parser.

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, Independent Researcher, Zhejiang University, Hangzhou, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China

Abstract: Complex question answering (QA) is a challenging task in artificial intelligence research which requires reasoning based on related knowledge. The retrievalaugmented generation (RAG) based on large language models (LLMs) have become one promising solution in QA. To facilitate RAG more effectively, the LLM needs to precisely evaluate knowledge required in QA. That is, first, the LLM needs to examine its knowledge boundary (what the LLM does not know) to retrieve external knowledge as supplement. Second, the LLM needs to evaluate the utility of the retrieved knowledge (whether it helps in reasoning) for robust RAG. To this end, in this paper, we propose a novel Question Answering with Knowledge Evaluation (KEQA) framework to promote the effectiveness and efficiency of RAG in QA. First, inspired by quizzes in classroom, we propose a quiz-based method to precisely examine the knowledge state of the uninterpretable LLM for QA. We ask indicative quizzes on each required knowledge, and inspect whether the LLM can consistently answer the quiz to examine its knowledge boundary. Second, we retrieve the unknown knowledge from external source, and evaluate its utility to pick the helpful ones for reasoning. We design a reasoning-based metric to evaluate utility, and construct a demonstration set in training data for reference to guide knowledge picking in inference. We conduct extensive experiments on four widely-used QA datasets, and the results demonstrate the effectiveness of the proposed method.

Abstract: Visual Textto-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of an spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address the issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal aims to take both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation.

Abstract: Multihop question answering (MHQA) poses a significant challenge for large language models (LLMs) due to the extensive knowledge demands involved. Knowledge editing, which aims to precisely modify the LLMs to incorporate specific knowledge without negatively impacting other unrelated knowledge, offers a potential solution for addressing MHQA challenges with LLMs. However, current solutions struggle to effectively resolve issues of knowledge conflicts. Most parameter-preserving editing methods are hindered by inaccurate retrieval and overlook secondary editing issues, which can introduce noise into the reasoning process of LLMs. In this paper, we introduce KEDKG, a novel knowledge editing method that leverages a dynamic knowledge graph for MHQA, designed to ensure the reliability of answers. KEDKG involves two primary steps: dynamic knowledge graph construction and knowledge graph augmented generation. Initially, KEDKG autonomously constructs a dynamic knowledge graph to store revised information while resolving potential knowledge conflicts. Subsequently, it employs a fine-grained retrieval strategy coupled with an entity and relation detector to enhance the accuracy of graph retrieval for LLM generation. Experimental results on benchmarks show that KEDKG surpasses previous state-of-the-art models, delivering more accurate and reliable answers in environments with dynamic information.

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Langboat Technology, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Langboat Technology, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

Abstract: Large Language Modelbased Dense Retrieval (LLM-DR) optimizes over numerous heterogeneous fine-tuning collections from different domains. However, the discussion about its training data distribution is still minimal. Previous studies rely on empirically assigned dataset choices or sampling ratios, which inevitably lead to sub-optimal retrieval performances. In this paper, we propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for LLM-DR fine-tuning, targeted at improving the universal domain generalization ability by end-to-end reweighting the data distribution of each task. The tDRO parameterizes the domain weights and updates them with scaled domain gradients. The optimized weights are then transferred to the LLM-DR fine-tuning to train more robust retrievers. Experiments show optimal improvements in large-scale retrieval benchmarks and reduce up to 30% dataset usage after applying our optimization algorithm with a series of different-sized LLM-DR models.

Abstract: In recent years, diffusion modeling has shown great potential for image generation and editing. Beyond singlemodel approaches, various drawing workflows now exist to handle diverse drawing tasks. However, few solutions effectively identify user intentions through dialogue and progressively complete drawings. We introduce DialogDraw, which facilitates image generation and editing through continuous dialogue interaction. DialogDraw enables users to create and refine drawings using natural language and integrates with numerous open-source drawing workflows and models. The system accurately recognizes intentions and extracts user inputs via parameterization, adapts to various drawing function parameters, and provides an intuitive interaction mode. It effectively executes user instructions, supports dozens of image generation and editing methods, and offers robust scalability. Moreover, we employ SFT and RLHF to iterate the Intention Recognition and Parameter Extraction Model (IRPEM). To evaluate DialogDraw's functionality, we propose DrawnConvos, a dataset rich in drawing functions and command dialogue data collected from the open-source community. Our evaluation demonstrates that DialogDraw excels in command compliance, identifying and adapting to user drawing intentions, thereby proving the effectiveness of our method.

Abstract: In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Despite the growing body of research in this area, we find that many crucial design decisions in LLMbased ASR systems are often inadequately justified. This lack of clarity impedes the field's progress, making it challenging to pinpoint which design choices truly improve model performance. To address these challenges, we conduct a comprehensive series of experiments that explore various aspects, leading to the optimal LLM-based ASR system. We found that delicate designs are not necessary, while a clean setup with little task-specific design is competent. The models achieve strong performance on the Librispeech and Gigaspeech datasets, compared to both LLM-based models and non-LLM-based models. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.

Abstract: This paper investigates how hallucination rates in Large Language Models (LLMs) may be controlled via a symbolic data generation framework, exploring a fundamental relationship between the rate of certain mathematical errors and types of input intervention. Specifically, we systematically generate data for a derivation generation task using a symbolic engine, applying targeted interventions to prompts to perturb features of mathematical derivations such as the surface forms of symbols, equational tree structures, and mathematical context. We then evaluate the effect of prompt interventions across a range of LLMs including finetuned T5 models, GPT, and LLaMa-based models. Our experiments suggest that T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework. However, an extensive evaluation based on human analysis, template-based error detection, and text generation metrics reveals model weaknesses beyond what the reference-based metrics singularly describe. We use these results to tie characteristic distributional footprints of interventions to the human evaluation of LLM derivation quality, potentially leading to significant control over fine-grained mathematical capabilities of language models with respect to specific types of errors.

Faculty of Artificial Intelligence in Education, Central China Normal University, Faculty of Artificial Intelligence in Education, Central China Normal University Central China Normal University Wollongong Joint Institute, Central China Normal University, Faculty of Artificial Intelligence in Education, Central China Normal University Central China Normal University Wollongong Joint Institute, Central China Normal University, Faculty of Artificial Intelligence in Education, Central China Normal University, Faculty of Artificial Intelligence in Education, Central China Normal University

Abstract: Large language models (LLMs) have made significant advancements in math problem solving, but their large size and high latency render them impractical for realworld applications in intelligent mathematics solvers. Recently, task-agnostic compact models have been developed to replace LLMs in general natural language processing tasks. However, these models often struggle to acquire sufficient math-related knowledge from LLMs, leading to unsatisfactory performance in solving math word problems (MWPs). To develop a specialized compact model for representing MWPs, we develop the knowledge distillation (KD) technique to extract mathematical semantics knowledge from the large pre-trained model BERT. Effective knowledge types and distillation strategies are explored through extensive experiments. Our KD algorithm employs multi-knowledge distillation to extract fundamental knowledge from hidden states in the middle to lower layers, while also incorporating knowledge of mathematical relations and symbol constraints from higher-layer outputs and math decoder outputs, by leveraging bottleneck networks. Pre-training tasks on MWP datasets, such as masked language modeling and part-of-speech tagging, are also utilized to enhance the generalization of the compact model for MWP understanding. Additionally, a simple parameter mixing strategy is employed to prevent catastrophic forgetting of acquired knowledge. Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming the mainstream BERT-based task-agnostic compact models. The efficacy of each component has been validated through ablation studies.

Abstract: In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark, to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark is designed to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when given diverse and complex optimization problems. The dataset, crafted by OR experts, presents realworld optimization problems that require multistep reasoning to build their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral reveal their modest performance, indicating a gap in their aptitude to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs’ generalization capabilities, providing insights for future research in this area. The dataset and evaluation code are publicly available.

Abstract: Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language modelbased token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.

Abstract: Existing modificationbased linguistic steganography methods primarily perform linguistic manipulations within a single embedding space to conceal secret information. However, these methods are stringently constrained by the original semantics of the cover text, making it struggle to achieve a satisfactory embedding capacity in a single embedding space. In this paper, we propose a novel Multi-granularity Modification-based Linguistic Steganography framework (MMLS) that hides secret information in both syntactic space and symbolic space, enhancing syntactic naturalness and semantic coherence while further increasing embedding capacity. Specifically, MMLS utilizes a paraphrase generation model to automatically modify the syntactic structure of the given original sentence, which enables the generation of paraphrases and the preservation of semantics simultaneously. Moreover, MMLS employs a distance-aware syntactic bins coding strategy to embed part of secret information into the syntactic space. This strategy utilizes a cluster-based way to partition the implicit syntactic space into a finite number of separate zones, thus increasing the number of candidate paraphrases and avoiding the selection of semantically distorted steganographic texts. Finally, the pre-trained BERT is used to replace some words in candidate paraphrases with their synonyms. Such a design embeds the remaining secret information into symbolic space while ensuring syntactic and semantic naturalness. Experimental results demonstrate that MMLS significantly outperforms existing methods in terms of semantic coherence, embedding capacity, and security.

Abstract: Developing intelligent negotiation dialogue systems that resolve conflicts and promote equitable, inclusive, and sustainable outcomes is at the forefront of advancing automated negotiation technology for social good. Negotiation involves balancing cooperation and competition to maximize value without causing offense. Using polite language fosters mutual understanding and creates a respectful and collaborative environment essential for successful negotiations in various domains. Considering this, in this paper, we propose a polite negotiation dialogue system, GENTEELNEGOTIATOR for social good applications to boost the overall quality of negotiation outcomes. We focus on developing a negotiation dialogue system for two key application areas, namely tourism and e-commerce. We begin by curating a unique negotiation dialogue dataset, NEGOCHAT for tourism. We further enrich the NEGOCHAT and Integrative Negotiation Dataset (IND) for e-commerce with various negotiation strategies. These datasets are then used to develop the GENTEEL-NEGOTIATOR, leveraging the Large Language Model (LLM) and mixture-of-expert (MoE)-based reinforcement learning approach. The proposed MoE-based method employs heuristic experts dedicated to negotiation, politeness, and dialogue coherence to facilitate the learning of diverse semantics by analyzing the dialogue context. A novel reward function with negotiation strategy congruence, politeness, dialogue coherence, and engagingness rewards is designed to guide the policy’s learning for generating responses. Automatic and human evaluations on NEGOCHAT and IND datasets validate the effectiveness of GENTEEL-NEGOTIATOR in generating polite responses during negotiation while maintaining conversation goals, including coherence and engagingness.

Abstract: The language model (LM) approach based on acoustic and linguistic prompts, such as VALLE, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms baselines in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies.

Abstract: Large Language Models (LLMs) have brought significant advances across various NLP tasks through fewshot or zero-shot prompting, bypassing the need for parameter tuning. However, the "black-box" nature behind their massive parameter sizes increases the "hallucination" concerns, especially in high-stakes applications (e.g., healthcare), where decision mistakes can lead to severe consequences. In contrast, human decision-making relies on complex cognitive processes, such as the ability to sense and adaptively correct mistakes through conceptual understanding. Drawing inspiration from human cognition, we propose an innovative metacognitive approach CLEAR, to equip LLMs with capabilities for self-aware error identification and correction. Our framework constructs concept-specific sparse subnetworks that indicate decision processes. This provides a novel interface for model {intervention} after deployment. The benefits include: (i) at inference time, our metacognitive LLMs can self-consciously identify potential mispredictions with minimum human involvement, (ii) the model can self-correct its errors efficiently without additional tuning, and (iii) the correction procedure is not only self-explanatory but also user-friendly, enhancing model interpretability and accessibility. With these metacognitive features, our approach pioneers a new path toward the trustworthiness of LLMs.

Abstract: Recently, Large Language Models (LLMs) with incontext learning have demonstrated remarkable potential in handling neural machine translation. However, existing evidence shows that LLMs are prompt-sensitive and it is sub-optimal to apply the fixed prompt to any input for downstream machine translation tasks. To address this issue, we propose an adaptive few-shot prompting (AFSP) framework to automatically select suitable translation demonstrations for various source input sentences to further elicit the translation capability of an LLM for better machine translation. First, we build a translation demonstration retrieval module based on LLM's embedding to retrieve top-k semantic-similar translation demonstrations from aligned parallel translation corpus. Rather than using other embedding models for semantic demonstration retrieval, we build a hybrid demonstration retrieval module based on the embedding layer of the deployed LLM to build better input representation for retrieving more semantic-related translation demonstrations. Then, to ensure better semantic consistency between source inputs and target outputs, we force the deployed LLM itself to generate multiple output candidates in the target language with the help of translation demonstrations and rerank these candidates. Besides, to better evaluate the effectiveness of our AFSP framework on the latest language and extend the research boundary of neural machine translation, we construct a high-quality diplomatic Chinese-English parallel dataset that consists of 5,528 parallel Chinese-English sentences. Finally, extensive experiments on the proposed diplomatic Chinese-English parallel dataset and the United Nations Parallel Corpus (Chinese-English part) show the effectiveness and superiority of our proposed AFSP.

School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong MoE Key Lab. of High Confidence Software Technologies(Hong Kong), China, School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China, School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China, School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China, School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China, School of Computer Science, Peking University MoE Key Lab. of High Confidence Software Technologies(PKU), China, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong MoE Key Lab. of High Confidence Software Technologies(Hong Kong), China, Beihang University, Huawei Noah's Ark Lab, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong MoE Key Lab. of High Confidence Software Technologies(Hong Kong), China

Abstract: Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and needs to deal with the diversity of the interevent relations and the reasoning paradigms. The extent to which LLMs excel in event reasoning across various relations and reasoning paradigms has not been thoroughly investigated. Additionally, it is still unclear whether LLMs utilize event knowledge in the same way humans do. To mitigate this disparity, we comprehensively evaluate the abilities of event reasoning of LLMs on different relations, paradigms, and levels of abstraction. We introduce a novel benchmark EV2 for EValuation of EVent reasoning. EV2 consists of two levels of evaluation on schema and instance and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that 1) LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory. 2) There are imbalances of event reasoning abilities on different relations and paradigms. 3) LLMs have event schema knowledge, however, they're not aligned with humans on how to utilize the knowledge. Based on these findings, we guide the LLMs in utilizing the event schema knowledge as memory leading to improvements in event reasoning.

Abstract: Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with realworld tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.

Abstract: Despite the widespread use of LLMs due to their superior performance in various tasks, their high computational costs often lead potential users to opt for the pretrainingfinetuning pipeline. However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating so-called shortcuts and hindering the generalizability of fine-tuned models. Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is challenging to acquire a priori. We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation. RAZOR leverages LLMs to iteratively rewrite potentially biased text segments by replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information. This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns. Compared with unsupervised SoTA models, RAZOR improves by 3.5% on the FEVER and 6.5% on MNLI and SNLI datasets according to the F1 score. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by x2 without requiring prior bias information, a result that is on par with SoTA models that leverage prior information. Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness. This research contributes to developing more robust evaluation benchmarks for debiasing methods by incorporating metrics for bias reduction and overall model efficacy.

Abstract: Following formatting instructions to generate wellstructured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench involves a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is attached with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on caption segmentation task), while keep the general quality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.

Abstract: Zeroshot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using language model-based or diffusion-based approaches, several challenges remain: 1) current approaches primarily focus on adapting timbre from unseen speakers and are unable to transfer style and timbre to different unseen speakers independently; 2) these approaches often suffer from slower inference speeds due to the autoregressive modeling methods or the need for numerous sampling steps; 3) the quality and similarity of the converted samples are still not fully satisfactory. To address these challenges, we propose a Style controllable zero-shot VC approach named StableVC, which aims to transfer timbre and style from source speech to different unseen target speakers. Specifically, we decompose speech into linguistic content, timbre, and style, and then employ a conditional flow matching module to reconstruct the high-quality mel-spectrogram based on these decomposed features. To effectively capture timbre and style in a zero-shot manner, we introduce a novel dual attention mechanism with an adaptive gate, rather than using conventional feature concatenation. With this non-autoregressive design, StableVC can efficiently capture the intricate timbre and style from different unseen speakers and generate high-quality speech significantly faster than real-time. Experiments demonstrate that our proposed StableVC outperforms state-of-the-art baseline systems in zero-shot VC and achieves flexible control over timbre and style from different unseen speakers. Moreover, StableVC offers approximately 25x and 1.65x faster sampling compared to autoregressive and diffusion-based baselines.

Abstract: Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge the gap, this paper first introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanationaugmented samples featuring the design of hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite by leveraging existing automatic metrics and conducting human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. Our experiments reveal the effectiveness of evaluating free-text explanations using traditional metrics like METEOR and ROUGE, and the inferior performance of multi-task models compared to the pipeline solution, indicating its challenges to establish positive effects in learning both tasks.

College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, AI Center, Guangdong OPPO Mobile Telecommunications Corp., Ltd., College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University Innovation Center of Yangtze River Delta, Zhejiang University

Abstract: Lyricto-melody generation aims to automatically create melodies based on given lyrics, requiring the capture of complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while others have the problem of low alignment accuracy; 2) lyric-melody harmony modeling, which usually relies heavily on intermediates or strict rules, limiting model's capabilities and generative diversity. In this paper, we propose SongGLM, a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training based on the General Language Model (GLM) to guarantee the alignment and harmony between lyrics and melodies. Specifically, 1) we introduce a unified symbolic song representation for lyrics and melodies with word-level and phrase-level (2D) alignment encoding to capture the lyric-melody alignment; 2) we design a multi-task pre-training framework with hierarchical blank infilling objectives (n-gram, phrase, and long span), and incorporate lyric-melody relationships into the extraction of harmonized n-grams to ensure the lyric-melody harmony. We also construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning. The objective and subjective results indicate that SongGLM can generate melodies from lyrics with significant improvements in both alignment and harmony, outperforming all the previous baseline methods.

Abstract: Highquality, large-scale instructions are crucial for aligning large language models (LLMs), however, there is a severe shortage of instruction in the field of natural language understanding (NLU). Previous works on constructing NLU instructions mainly focus on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has led to a decreased generalization ability of trained LLMs in other NLU tasks and a noticeable decline in the fundamental model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either close IE or open IE), machine reading comprehension, text classification, and instruction generalist tasks, thereby enriching task diversity. Additionally, we introduce a human-LLMs collaborative mechanism to synthesize instructions, which enriches instruction diversity by incorporating guidelines, preference rules, and format variants. We conduct extensive experiments on 5 NLU tasks and 28 general capability evaluation datasets for LLMs. Experimental results show that Hum enhances the NLU capabilities of six LLMs by an average of 3.1%, with no significant decline observed in other general capabilities.

Abstract: Natural language question answering (QA) over structured data sources such as tables and knowledge graphs have been widely investigated, especially with Large Language Models (LLMs) in recent years. The main solutions include question to formal query parsing and retrievalbased answer generation. However, current methods of the former often suffer from weak generalization, failing to dealing with multi-types of sources, while the later is limited in trustfulness. In this paper, we propose TrustUQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated TrustUQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods. In comparison with the baselines that are specific to one data type, it achieves state-of-the-art on 2 of the datasets. Further more, we have demonstrated the potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.

Abstract: The automatic generation of brain CT reports has gained widespread attention, given its potential to assist radiologists in diagnosing cranial diseases. However, brain CT scans involve extensive medical entities, such as diverse anatomy regions and lesions, exhibiting highly inconsistent spatial patterns in 3D volumetric space. This leads to biased learning of medical entities in existing methods, resulting in repetitiveness and inaccuracy in generated reports. To this end, we propose a Medical Entitybalanced Prompting Network (MEPNet), which harnesses the large language model (LLM) to fairly interpret various entities for accurate brain CT report generation. By introducing the visual embedding and the learning status of medical entities as enriched clues, our method prompts the LLM to balance the learning of diverse entities, thereby enhancing reports with comprehensive findings. First, to extract visual embedding of entities, we propose Knowledge-driven Joint Attention to explore and distill entity patterns using both explicit and implicit medical knowledge. Then, a Learning Status Scorer is designed to evaluate the learning of entity visual embeddings, resulting in unique learning status for individual entities. Finally, these entity visual embeddings and status are elaborately integrated into multi-modal prompts, to guide the text generation of LLM. This process allows LLM to self-adapt the learning process for biased-fitted entities, thereby covering detailed findings in generated reports. We conduct experiments on two brain CT report generation benchmarks, showing the effectiveness in clinical accuracy and text coherence.

Abstract: The pervasive spread of misinformation on social networks highlights the critical necessity for effective fact verification systems. Traditional approaches primarily focus on pairwise correlations between claims and evidence, often neglecting comprehensive multihop retrieval and reasoning, which results in suboptimal performance when dealing with complex claims. In this paper, we propose MRR-FV, a generative retrieval-enhanced model designed to address the novel challenge of Multi-hop Retrieval and Reasoning for Fact Verification, which integrates two core modules: Generative Multi-hop Retriever and the Hierarchical Interaction Reasoner. MRR-FV utilizes an autoregressive model for iterative multi-hop evidence retrieval, complemented by a pre-trained compressor to address the challenge of intention shift across retrieval hops. For claim verification, we propose a hierarchical interaction reasoner that conducts intra-sentence reasoning to capture long-term semantic dependencies and inter-sentence reasoning across multi-hop evidence subgraphs to reveal complex evidence interactions. Experimental evaluations on the FEVER and HOVER datasets demonstrate the superior performance of our model in both claim verification and evidence retrieval tasks.

Laboratory for Big Data and Decision, National University of Defense Technology, Department of Computer Science and Technology, Tsinghua University College of Information and Communication, National University of Defense Technology, Laboratory for Big Data and Decision, National University of Defense Technology, Laboratory for Big Data and Decision, National University of Defense Technology, Department of Computer Science and Technology, Tsinghua University, Laboratory for Big Data and Decision, National University of Defense Technology, Department of Computer Science and Technology, Tsinghua University

Abstract: In real life, many dynamic events, such as major disasters and largescale sports events, evolve continuously over time. Obtaining an overview of these events can help people quickly understand the situation and respond more effectively. This is challenging because the key information of the event is often scattered across multiple documents, involving complex event knowledge understanding and reasoning, which is under-explored in previous work. Therefore, we proposed the Event-Centric Multi-Document Summarization task, which aims to generate concise and comprehensive summaries of a given event based on multiple related news documents. Based on this, we constructed the EventSum dataset, which was constructed using Baidu Baike entries and underwent extensive human annotation, to facilitate relevant research. It is the first large-scale Chinese multi-document summarization dataset, containing 5,100 events and a total of 57,984 news documents, with an average of 11.4 input news documents and 13,471 characters per event. To ensure data quality and mitigate potential data leakage, we adopted a multi-stage annotation approach for manually labeling the test set. Given the complexity of event-related information, existing metrics struggle to comprehensively assess the quality of generated summaries. We designed specific metrics including Event Recall, Argument Recall, Causal Recall, and Temporal Recall along with corresponding calculation methods for evaluation. We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information.

Abstract: We introduce PokerBench a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench leads to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios.

Abstract: Backdoor attacks present a serious security threat to deep neuron networks (DNNs). Although numerous effective defense techniques have been proposed in recent years, they inevitably rely on the availability of either clean or poisoned data. In contrast, datafree defense techniques have evolved slowly and still lag significantly in performance. To address this issue, different from the traditional approach of pruning followed by fine-tuning, we propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work. This method, based on our findings on neuron weight changes (NWCs) of random unlearning, uses optimal transport (OT)-based model fusion to combine the advantages of both pruned and backdoored models. Specifically, we first demonstrate our findings that the NWCs of random unlearning are positively correlated with those of poison unlearning. Based on this observation, we propose a random-unlearning NWC pruning technique to eliminate the backdoor effect and obtain a backdoor-free pruned model. Then, motivated by the OT-based model fusion, we propose the pruned-to-backdoored OT-based fusion technique, which fuses pruned and backdoored models to combine the advantages of both, resulting in a model that demonstrates high clean accuracy and a low attack success rate. To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense. Extensive experiments show that our method successfully defends against all seven backdoor attacks across three benchmark datasets, outperforming both state-of-the-art (SOTA) data-free and data-dependent methods.

Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, College of Computer Science and Software Engineering, Shenzhen University, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Department of Computing, The Hong Kong Polytechnic University, Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology

Abstract: Perturbationbased mechanisms, such as differential privacy, mitigate gradient leakage attacks by introducing noise into the gradients, thereby preventing attackers from reconstructing clients' private data from the leaked gradients. However, can gradient perturbation protection mechanisms truly defend against all gradient leakage attacks? In this paper, we present the first attempt to break the shield of gradient perturbation protection in Federated Learning for the extraction of private information. We focus on common noise distributions, specifically Gaussian and Laplace, and apply our approach to DNN and CNN models. We introduce Mjölnir, a perturbation-resilient gradient leakage attack that is capable of removing perturbations from gradients without requiring additional access to the original model structure or external data. Specifically, we leverage the inherent diffusion properties of gradient perturbation protection to develop a novel diffusion-based gradient denoising model for Mjölnir. By constructing a surrogate client model that captures the structure of perturbed gradients, we obtain crucial gradient data for training the diffusion model. We further utilize the insight that monitoring disturbance levels during the reverse diffusion process can enhance gradient denoising capabilities, allowing Mjölnir to generate gradients that closely approximate the original, unperturbed versions through adaptive sampling steps. Extensive experiments demonstrate that Mjölnir effectively recovers the protected gradients and exposes the Federated Learning process to the threat of gradient leakage, achieving superior performance in gradient denoising and private data recovery.

Abstract: Large language models have repeatedly shown outstanding performance across diverse applications. However, deploying these models can inadvertently risk user privacy. The significant memory demands during training pose a major challenge in terms of resource consumption. This substantial size places a heavy load on memory resources, raising considerable practical concerns. In this paper, we introduce DPMemArc, a novel training framework aimed at reducing the memory costs of large language models while emphasizing the protection of user data privacy. DP-MemArc incorporates side network or reversible network designs to support a variety of differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves about 2.5 times in memory optimization but also ensures robust privacy protection, keeping user data secure and confidential. Extensive experiments have demonstrated that DP-MemArc effectively provides differential privacy-efficient fine-tuning across different task scenarios.

Abstract: With Large Language Model (LLM) agents taking on more evaluation responsibilities in decisionmaking, it is essential to recognize their possible biases to guarantee fair and trustworthy AI-supported decisions. This study is the first to thoroughly examine the choice-supportive bias in LLM agents, a cognitive bias that is known to impact human decision-making and evaluation. We conduct experiments across 19 open/unopen-source LLM models in five scenarios at maximum, employing both memory-based and evaluation-based tasks adapted and redesigned from human cognitive studies. Our findings show that LLM agents may exhibit biased attribution or evaluation that supports their initial choices, and such bias may persist even if contextual hallucination is not observable. Key findings show that bias manifestation can differ greatly depending on prompt construction and context preservation, and the bias may be mitigated in larger models. Significantly, we observe that the bias increases when the agents perceive they are in control. Our extensive study involving 284 well-educated humans shows that, despite bias, certain LLM agents can still perform better than humans in similar evaluation tasks. This research contributes to the growing area of AI psychology, and the findings underscore the importance of addressing cognitive biases in LLM Agent systems, with wide-ranging implications spanning from improving AI-assisted decision-making to advancing AI safety and ethics.

Abstract: Constrained Markov decision processes (CMDPs), in which the agent optimizes expected payoffs while keeping the expected cost below a given threshold, are the leading framework for safe sequential decision making under stochastic uncertainty. Among algorithms for planning and learning in CMDPs, methods based on Monte Carlo tree search (MCTS) have particular importance due to their efficiency and extendibility to more complex frameworks (such as partially observable settings and games). However, current MCTSbased methods for CMDPs either struggle with finding safe (i.e., constraint-satisfying) policies, or are too conservative and do not find valuable policies. We introduce Threshold UCT (T-UCT), an online MCTS-based algorithm for CMDP planning. Unlike previous MCTS-based CMDP planners, T-UCT explicitly estimates Pareto curves of cost-utility trade-offs throughout the search tree, using these together with a novel action selection and threshold update rule to seek safe and valuable policies. Our experiments demonstrate that our approach significantly outperforms state-of-the-art methods from the literature.

Department of Mathematics and Computer Science, Eindhoven University of Technology Eindhoven Artificial Intelligence Systems Institute, Eindhoven University of Technology, Eindhoven Artificial Intelligence Systems Institute, Eindhoven University of Technology Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Delmia R&D, Dassault Systèmes, Eindhoven Artificial Intelligence Systems Institute, Eindhoven University of Technology Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Department of Mathematics and Computer Science, Eindhoven University of Technology Eindhoven Artificial Intelligence Systems Institute, Eindhoven University of Technology

Abstract: Neural combinatorial optimization (NCO) has gained significant attention due to the potential of deep learning to efficiently solve combinatorial optimization problems. NCO has been widely applied to job shop scheduling problems (JSPs) with the current focus predominantly on deterministic problems. In this paper, we propose a novel attentionbased scenario processing module (SPM) to extend NCO methods for solving stochastic JSPs. Our approach explicitly incorporates stochastic information by an attention mechanism that captures the embedding of sampled scenarios (i.e., an approximation of stochasticity). Fed with the embedding, the base neural network is intervened by the attended scenarios, which accordingly learns an effective policy under stochasticity. We also propose a training paradigm that works harmoniously with either the expected makespan or Value-at-Risk objective. Results demonstrate that our approach outperforms existing learning and non-learning methods for the flexible JSP problem with stochastic processing times on a variety of instances. In addition, our approach holds significant generalizability to varied numbers of scenarios and disparate distributions.

Abstract: In this paper, we augment online algorithms for the knapsack problem using the total weight information. The conventional optimal online algorithm achieves the ln(U/L)+1 competitive ratio where L and U are the upper and lower bounds of the valueto-weight ratio. However, it does not consider that decision makers can know the total weight information or obtain it through machine-learned predictions. To fill this gap, we first propose the Known Weight Algorithm (KWA) which uses the exact total weight information to achieve a competitive ratio of W((U-L)/(eL))+1, where W denotes the Lambert-W function. We prove that it is optimal and tight. After that, we extend KWA to the Predicted Weight Algorithm (PWA), a learning-augmented online algorithm that uses predicted total weight. We show the consistency and robustness of PWA, and prove that its competitive ratio degrades gracefully as the prediction error grows. Finally, we introduce the Limited Volume Algorithm (LWA), which achieves a better competitive ratio than ln(U/L)+1 when the total weight is less than twice the capacity.

Abstract: Large Language Models (LLMs) gain substantial reasoning and decisionmaking capabilities from thought structures. However, existing methods such as Tree of Thought and Retrieval Augmented Thoughts often fall short in complex tasks due to the limitations of insufficient local retrieval of factual knowledge and inadequate global selection of strategies. These limitations make it challenging for these methods to balance factual accuracy and comprehensive logical optimization effectively. To address these limitations, we introduce the Retrieval Augmented Thought Tree (RATT), a novel thought structure that considers both overall logical soundness and factual correctness at each step of the thinking process. Specifically, at every point of a thought branch, RATT performs planning and lookahead to explore and evaluate multiple potential reasoning steps, and integrate the fact-checking ability of Retrieval-Augmented Generation (RAG) with LLM's ability to assess overall strategy. Through this combination of factual knowledge and strategic feasibility, the RATT adjusts and integrates the thought tree structure to search for the most promising branches within the search space. This thought structure significantly enhances the model's coherence in logical inference and efficiency in decision-making, and thus increases the limit of the capacity of LLM to generate reliable inferences and decisions based on thought structures. A broad range of experiments on different types of tasks showcases that the RATT structure significantly outperforms existing methods in factual correctness and logical coherence.

Abstract: Binary Neural Networks (BNNs) have garnered significant attention due to their immense potential for deployment on edge devices. However, the nondifferentiability of the quantization function poses a challenge for the optimization of BNNs, as its derivative cannot be backpropagated. To address this issue, hypernetwork based methods, which utilize neural networks to learn the gradients of non-differentiable quantization functions, have emerged as a promising approach due to their adaptive learning capabilities to reduce estimation errors. However, existing hypernetwork based methods typically rely solely on current gradient information, neglecting the influence of historical gradients. This oversight can lead to accumulated gradient errors when calculating gradient momentum during optimization. To incorporate historical gradient information, we design a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization. To further enhance gradient generation in hypernetworks, we propose a Fast and Slow Gradient Generation (FSG) method. Additionally, to produce more precise gradients, we introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients. Extensive comparative experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that our method achieves faster convergence and lower loss values, outperforming existing baselines.

Abstract: We present Learn2Aggregate, a machine learning (ML) framework for optimizing the generation of ChvatalGomory (CG) cuts in mixed integer linear programming (MILP). The framework trains a graph neural network to classify useful constraints for aggregation in CG cut generation. The ML-driven CG separator selectively focuses on a small set of impactful constraints, improving runtimes without compromising the strength of the generated cuts. Key to our approach is the formulation of a constraint classification task which favours sparse aggregation of constraints, consistent with empirical findings. This, in conjunction with a careful constraint labeling scheme and a hybrid of deep learning and feature engineering, results in enhanced CG cut generation across five diverse MILP benchmarks. On the largest test sets, our method closes roughly twice as much of the integrality gap as the standard CG method while running 40% faster. This performance improvement is due to our method eliminating 75% of the constraints prior to aggregation.

Abstract: Randomized search heuristics have been applied successfully to a plethora of problems. This success is complemented by a large body of theoretical results. Unfortunately, the vast majority of these results regard problems with binary or continuous decision variables the theoretical analysis of randomized search heuristics for unbounded integer domains is almost nonexistent. To resolve this shortcoming, we start the runtime analysis of multi-objective evolutionary algorithms, which are among the most successful randomized search heuristics, for unbounded integer search spaces. We analyze single- and full-dimensional mutation operators with three different mutation strengths, namely changes by plus/minus one (unit strength), random changes following a law with exponential tails, and random changes following a power-law. The performance guarantees we prove on a recently proposed natural benchmark problem suggest that unit mutation strengths can be slow when the initial solutions are far from the Pareto front. When setting the expected change right (depending on the benchmark parameter and the distance of the initial solutions), the mutation strength with exponential tails yields the best runtime guarantees in our results -- however, with a wrong choice of this expectation, the performance guarantees quickly become highly uninteresting. With power-law mutation, which is an essentially parameter-less mutation operator, we obtain good results uniformly over all problem parameters and starting points. We complement our mathematical findings with experimental results that suggest that our bounds are not always tight. Most prominently, our experiments indicate that power-law mutation outperforms the one with exponential tails even when the latter uses a near-optimal parametrization. Hence, we suggest to favor power-law mutation for unknown problems in integer spaces.

Abstract: Recent advances in Metalearning for Black-Box Optimization (MetaBBO) have shown the potential of using neural networks to dynamically configure evolutionary algorithms (EAs), enhancing their performance and adaptability across various BBO instances. However, they are often tailored to a specific EA, which limits their generalizability and necessitates retraining or redesigns for different EAs and optimization problems. To address this limitation, we introduce ConfigX, a new paradigm of the MetaBBO framework that is capable of learning a universal configuration agent (model) for boosting diverse EAs. To achieve so, our ConfigX first leverages a novel modularization system that enables the flexible combination of various optimization sub-modules to generate diverse EAs during training. Additionally, we propose a Transformer-based neural network to meta-learn a universal configuration policy through multitask reinforcement learning across a designed joint optimization task space. Extensive experiments verify that, our ConfigX, after large-scale pre-training, achieves robust zero-shot generalization to unseen tasks and outperforms state-of-the-art baselines. Moreover, ConfigX exhibits strong lifelong learning capabilities, allowing efficient adaptation to new tasks through fine-tuning. Our proposed ConfigX represents a significant step toward an automatic, all-purpose configuration agent for EAs.

Abstract: Bayesian Optimization (BO) is a widelyused method for optimizing expensive-to-evaluate black-box functions. Traditional BO assumes that the learner has full control over all query variables without additional constraints. However, in many real-world scenarios, controlling certain query variables may incur costs. Therefore, the learner needs to balance the selection of informative subsets for targeted learning against leaving some variables to be randomly sampled to minimize costs. This problem is known as Bayesian Optimization with cost-varying variable subsets (BOCVS). While the goal of BOCVS is to identify the optimal solution with minimal cost, previous works have only guaranteed finding the optimal solution without considering the total costs incurred. Moreover, these works assume precise knowledge of the cost for each subset, which is often unrealistic. In this paper, we propose a novel algorithm for the extension of the BOCVS problem with random and unknown costs that separates the process into exploration and exploitation phases. The exploration phase will filter out low-quality variable subsets, while the exploitation phase will leverage high-quality ones. Furthermore, we theoretically demonstrate that our algorithm achieves a sub-linear rate in both quality regret and cost regret, addressing the objective of the BOCVS problem more effectively than previous analyses. Finally, we show that our proposed algorithm outperforms comparable baselines across a wide range of benchmarks.

Abstract: In social online platforms, identifying influential seed users to maximize influence spread is a crucial as it can greatly diminish the cost and efforts required for information dissemination. While effective, traditional methods for Multiplex Influence Maximization (MIM) have reached their performance limits, prompting the emergence of learningbased approaches. These novel methods aim for better generalization and scalability for more sizable graphs but face significant challenges, such as (1) inability to handle unknown diffusion patterns and (2) reliance on high-quality training samples. To address these issues, we propose the Reinforced Expert Maximization framework (REM). REM leverages a Propagation Mixture of Experts technique to encode dynamic propagation of large multiplex networks effectively in order to generate enhanced influence propagation. Noticeably, REM treats a generative model as a policy to autonomously generate different seed sets and learn how to improve them from a Reinforcement Learning perspective. Extensive experiments on several real-world datasets demonstrate that REM surpasses state-of-the-art methods in terms of influence spread, scalability, and inference time in influence maximization tasks.

Abstract: Biological oscillations are periodic changes in various signaling processes crucial for the proper functioning of living organisms. These oscillations are modeled by ordinary differential equations, with coefficient variations leading to diverse periodic behaviors, typically measured by oscillatory frequencies. This paper explores sampling techniques for neural networks to model the relationship between system coefficients and oscillatory frequency. However, the scarcity of oscillations in the vast coefficient space results in many samples exhibiting nonperiodic behaviors, and small coefficient changes near oscillation boundaries can significantly alter oscillatory properties. This leads to non-oscillatory bias and boundary sensitivity, making accurate predictions difficult. While existing importance and uncertainty sampling approaches partially mitigate these challenges, they either fail to resolve the sensitivity problem or result in redundant sampling. To address these limitations, we propose the Hierarchical Gradient-based Genetic Sampling (HGGS) framework, which improves the accuracy of neural network predictions for biological oscillations. The first layer, Gradient-based Filtering, extracts sensitive oscillation boundaries and removes redundant non-oscillatory samples, creating a balanced coarse dataset. The second layer, Multi-grid Genetic Sampling, utilizes residual information to refine these boundaries and explore new high-residual regions, increasing data diversity for model training. Experimental results demonstrate that HGGS outperforms seven comparative sampling methods across four biological systems, highlighting its effectiveness in enhancing sampling and prediction accuracy.

Abstract: We develop a method for the efficient verification of neural networks against convolutional perturbations such as blurring or sharpening. To define input perturbations, we use wellknown camera shake, box blur and sharpen kernels. We linearly parameterise these kernels in a way that allows for a variation of the perturbation strength while preserving desired kernel properties. To facilitate their use in neural network verification, we develop an efficient way of convolving a given input with the parameterised kernels. The result of this convolution can be used to encode the perturbation in a verification setting by prepending a linear layer to a given network. This leads to tight bounds and a high effectiveness in the resulting verification step. We add further precision by employing input splitting as a branching strategy. We demonstrate that we are able to verify robustness on a number of standard benchmarks where the baseline is unable to provide any safety certificates. To the best of our knowledge, this is the first solution for verifying robustness against specific convolutional perturbations such as camera shake.

Abstract: Significant efforts have been made to analyze the political stance or bias in news articles, especially as political polarization intensifies over the years. Recent advancements in machine learning have enabled researchers to develop various bias prediction models, which typically learn features not only from the text of the news articles but also from external knowledge. However, when training these models, the political bias label assigned to a news article is often based solely on the news source which published it. This approach can be problematic, as a news outlet with a particular political stance might publish an article that reflects a different political perspective. To address this issue, we first identify distinct text patterns associated with specific news sources or publishers, that are minimally relevant to predicting the political bias of a news article. We then conduct comprehensive experiments to investigate (i) whether existing models trained to predict political bias can also accurately predict the source, and (ii) whether these models change their predictions when a distinct pattern from a source with a different political stance is incorporated into a news article. Our experimental results reveal that all existing models tend to predict the source, even when trained solely to predict bias. Based on these findings, we propose a new deep learning model for political bias prediction that avoids learning sourceindicative patterns specific to a given news source.

Abstract: As AI algorithms are deployed extensively, the need to ensure the fairness of their outputs is critical. Most existing work is on “fairness by design” approaches that incorporate limited tests for fairness into a limited number of algorithms. Here, we explore a framework that removes these limitations and can be used with any algorithm’s output that allocates instances to one of K categories/classes such as outlier detection (OD), clustering and classification. The framework can encode standard and novel fairness types beyond simple counting, and importantly, it can detect intersectional unfairness without being specifically told what to look for. Our experimental results show that both standard and novel types of unfairness exist extensively in the outputs of fairby-design algorithms and the counter-intuitive result that they can actually increase intersectional unfairness.

Abstract: Existing work on the alignment problem has focused mainly on (1) qualitative descriptions of the alignment problem; (2) attempting to align AI actions with human interests by focusing on value specification and learning; and/or (3) focusing on a single agent or on humanity as a monolith. Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We address this gap by adapting a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals across various problem areas. Misalignment scores in our framework depend on the observed agent population, the domain in question, and conflict between agents' weighted preferences. Through simulations, we demonstrate how our model captures intuitive aspects of misalignment across different scenarios. We then apply our model to two case studies, including an autonomous vehicle setting, showcasing its practical utility. Our approach offers enhanced explanatory power for complex sociotechnical environments and could inform the design of more aligned AI systems in realworld applications.

Abstract: Warning: This paper contains offensive content that may disturb some readers. Visionlanguage models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content compared to their backbone large language models (LLMs). Our investigation reveals that the integration of images significantly shifts the model's internal activations during the forward pass, diverging from those triggered by textual input. Moreover, the safety alignments of LLMs embedded within VLMs are not sufficiently robust to handle the activations discrepancies, making the models vulnerable to even the simplest jailbreaking attacks. To address this issue, we propose an internal activation revision approach that efficiently revises activations during generation, steering the model toward safer outputs. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity. In addition, we explore three strategies for constructing positive and negative samples and two approaches for extracting revision vectors, resulting in different variants of our method. Comprehensive experiments demonstrate that the internal activation revision method significantly improves the safety of widely used VLMs, reducing attack success rates by an average of 48.94%, 34.34%, 43.92%, and 52.98% on SafeBench, Safe-Unsafe, Unsafe, and MM-SafetyBench, respectively, while minimally impacting model helpfulness.

Abstract: The superalignment problem of how humans can effectively supervise super-human AI has garnered increasing attention. Recent research has focused on investigating the weak-to-strong generalization (W2SG) scenario as an analogy for super-alignment. This scenario examines how a pre-trained strong model, supervised by an aligned weak model, can outperform its weak supervisor. Despite good progress, current W2SG methods face two main issues: 1) The annotation quality is limited by the knowledge scope of the weak model; 2) It is risky to position the strong model as the final corrector. To tackle these issues, we propose a ``Strong Empowered and Aligned Weak Mastered'' (SEAM) framework for weak annotations in W2SG. This framework can leverage the vast intrinsic knowledge of the pre-trained strong model to empower the annotation and position the aligned weak model as the annotation master. Specifically, the pre-trained strong model first generates principle fast-and-frugal trees for samples to be annotated, encapsulating rich sample-related knowledge. Then, the aligned weak model picks informative nodes based on the tree's information distribution for final annotations. Experiments on six datasets for preference tasks in W2SG scenarios validate the effectiveness of our proposed method.

Institute of Automation, Chinese academy of science School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese academy of science School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing Baichuan Intelligent Technology Co., Beijing Baichuan Intelligent Technology Co., Beijing Baichuan Intelligent Technology Co., Institute of Automation, Chinese academy of science School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multidimensionality of human preferences (e.g. helpfulness and harmlessness) or struggle with the complexity of managing multiple reward models. To address these issues, we propose Sequential Preference Optimization (SPO), a method that sequentially fine-tunes LLMs to align with multiple dimensions of human preferences. SPO avoids explicit reward modeling, directly optimizing the models to align with nuanced human preferences. We theoretically derive closed-form optimal SPO policy and loss function. Gradient analysis is conducted to show how SPO manages to fine-tune the LLMs while maintaining alignment on previously optimized dimensions. Empirical results on LLMs of different size and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences and significantly outperforms the baselines.

Abstract: The alignment of large language models (LLMs) with human values is critical as these models become increasingly integrated into various societal and decisionmaking processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters, but these approaches are often computationally expensive and impractical when models are frozen or inaccessible for parameter modification. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment. While the existing literature has shown empirical promise of prompt optimization, its theoretical underpinning remains under-explored. We address this gap by formulating prompt optimization as an optimization problem and try to provide theoretical insights into the optimality of such a framework. To analyze the performance of the prompt optimization, we study theoretical suboptimality bounds and provide insights in terms of how prompt optimization depends upon the given prompter and target model. We also provide empirical validation through experiments on various datasets, demonstrating that prompt optimization can effectively align LLMs, even when parameter fine-tuning is not feasible.

Abstract: Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Textcentric alignment method leverages the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation, thereby enabling downstream models to effectively interpret various modal inputs. This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities, revealing that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to improve the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimate the model's confidence to the question based on the internal states during decoding as well as to the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings.

Abstract: Aligning the behavior of Large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and finetuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. It means RM fails to provide feedback that accurately aligns with human preference, causing LLMs to explore unexpected generalizations, and failing to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replaced the reward modeling target from binary maximum likelihood estimation (MLE) with sequence MLE. This method enables richer and fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrated its effectiveness, specifically, reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. We provide further analysis that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9%. We further show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.

Abstract: Fraudulent shopping websites pose a significant threat to online consumers and legitimate businesses: in 2023, victims of such scams reported $392 million in losses to the Federal Trade Commission. This alarming trend not only impacts individuals but also erodes societal trust in ecommerce, necessitating urgent countermeasures. While previous studies have attempted to identify these fraudulent websites at scale, they face limitations such as potential bias in data collection, overreliance on easily manipulated features, and the lack of explainable results. This study explores the potential of Large Language Models (LLMs) in identifying fraudulent shopping websites, revealing that current LLMs underperform compared to existing machine learning models. To address this, we propose ScamNet, a fine-tuned LLM for explainable fraudulent shopping website detection. Our experimental results on real-world datasets demonstrate a breakthrough in detection performance from 22.35% detection rate to 95.59%, particularly in identifying subtle deceptive tactics such as using a legitimate-looking website template. ScamNet offers interpretable insights into its decision-making process, enhancing transparency and overcoming a key limitation of previous approaches.

Abstract: The global economy relies on the flow of goods over supply chain networks, with nodes as firms and edges as transactions between firms. While we may observe these external transactions, they are governed by unseen production functions, which determine how firms internally transform the input products they receive into output products that they sell. In this setting, it can be extremely valuable to infer these production functions, to better understand and improve supply chains, and to forecast future transactions more accurately. However, existing graph neural networks (GNNs) cannot capture these hidden relationships between nodes’ inputs and outputs. Here, we introduce a new class of models for this setting, by combining temporal GNNs with a novel inventory module, which learns production functions via attention weights and a special loss function. We evaluate our models extensively on real supply chains data, along with data generated from our new opensource simulator, SupplySim. Our models successfully infer production functions, outperforming the strongest baseline by 6-50% (across datasets), and forecast future transactions, outperforming the strongest baseline by 11-62%.

Abstract: A recent report from the World Meteorological Organization (WMO) highlights that waterrelated disasters have caused the highest human losses among natural disasters over the past 50 years, with over 91\% of deaths occurring in low-income countries. This disparity is largely due to the lack of adequate ground monitoring stations, such as weather surveillance radars (WSR), which are expensive to install. For example, while the US and Europe combined possess over 600 WSRs, Africa, despite having almost one and half times their landmass, has fewer than 40. To address this issue, satellite-based observations offer a global, near-real-time monitoring solution. However, they face several challenges like accuracy, bias, and low spatial resolution. This study leverages the power of diffusion models and residual learning to address these limitations in a unified framework. We introduce the first diffusion model for correcting the inconsistency between different precipitation products. Our method demonstrates the effectiveness in downscaling satellite precipitation estimates from 10 km to 1 km resolution. Extensive experiments conducted in the Seattle region demonstrate significant improvements in accuracy, bias reduction, and spatial detail. Importantly, our approach achieves these results using only precipitation data, showcasing the potential of a purely computer vision-based approach for enhancing satellite precipitation products and paving the way for further advancements in this domain.

Abstract: Social segregation in cities, spanning racial, residential, and income dimensions, is becoming increasingly diverse and severe. As urban spaces and social dynamics grow more complex, residents experience varying levels of segregation, which, if left unaddressed, could exacerbate crime rates, fuel social tensions, and lead to other societal challenges. Effectively addressing these issues requires a comprehensive analysis of the underlying structures of urban spaces and resident interactions. While previous studies have primarily focused on surfacelevel indicators of segregation, they often fail to explore the complexity of urban structure and mobility dynamics, leaving gaps in understanding modern segregation patterns. To fill this gap, we propose the Motif-Enhanced Graph Prototype Learning (MotifGPL) framework, offering a novel approach to studying urban segregation. The framework consists of three key modules: prototype-based graph structure extraction, motif distribution discovery, and urban graph reconstruction. Specifically, we use prototype-based learning to extract key urban graph prototypes from both spatial and origin-destination graphs, incorporating attributes such as points of interest, street images, and flow indices. The motif distribution discovery module enhances interpretability by matching each prototype to similar motifs, which represent simplified graph structures that reflect local patterns. These motifs are then used to guide the reconstruction of urban graphs, enabling a more detailed exploration of spatial structures and mobility patterns. By identifying critical motifs influencing urban segregation, MotifGPL offers insights to guide the design of urban environments that can help reduce segregation. Experimental results demonstrate that MotifGPL effectively uncovers these key motifs and provides actionable insights for mitigating segregation.

Abstract: Limited accessibility to neurological care leads to underdiagnosed Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD detection methods primarily focus on unimodal analysis of motor or speech tasks, overlooking the multifaceted nature of the disease. To address this, we introduce a large-scale, multi-task video dataset of 1102 sessions (each containing videos of finger tapping, facial expression, and speech tasks captured via webcam) from 845 participants (272 with PD). We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal data to enhance diagnostic accuracy. UFNet employs independent task-specific networks, trained with Monte Carlo Dropout for uncertainty quantification, followed by self-attended fusion of features, with attention weights dynamically adjusted based on task-specific uncertainties. We randomly split the participants into training (60%), validation (20%), and test (20%) sets to ensure patient-centered evaluation. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity. Withholding uncertain predictions further boosted the performance, achieving 88.0 +- 0.3% accuracy, 93.0 +- 0.2% AUROC, 79.3 +- 0.9% sensitivity, and 92.6 +- 0.3% specificity, at the expense of not being able to predict for 2.3 +- 0.3% data (+- denotes 95% confidence interval). Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. By merely requiring a webcam and microphone, our approach facilitates accessible home-based PD screening, especially in regions with limited healthcare resources.

Abstract: The proliferation of malicious users on social platforms poses significant financial and psychological threats, with activities ranging from scams to the dissemination of illicit content. Existing malicious user prediction comprises supervised and selfsupervised learning methods. However, the former relies on extensive labeled malicious users for training, while the latter typically focuses on one form of malicious activity and depends heavily on manually crafted rules and features during pre-training. Moreover, existing pre-training methods fail to effectively capture the crucial repetitive and sporadic behavior patterns of malicious users. To address these limitations, we propose a Malicious User Behavior Pre-training framework (MaP) to build pre-trained behavior models. MaP integrates malicious pattern recognition with behavior consistency augmentation and local disruption augmentation strategies for contrastive learning to capture repetitive and sporadic malicious patterns, respectively. We instantiate MaP on a billion-level behavior pre-training scenario within an industry context. Both online and offline evaluations validate the superior performance of MaP in malicious user detection and classification.

Abstract: Statistical agencies rely on sampling techniques to collect sociodemographic data crucial for policy-making and resource allocation. This paper shows that surveys of important societal relevance introduce sampling errors that unevenly impact group-level estimates, thereby compromising fairness in downstream decisions. To address these issues, this paper introduces an optimization approach modeled on real-world survey design processes, ensuring sampling costs are optimized while maintaining error margins within prescribed tolerances. Additionally, privacy-preserving methods used to determine sampling rates can further impact these fairness issues. This paper explores the impact of differential privacy on the statistics informing the sampling process, revealing a surprising effect: not only is the expected negative effect from the addition of noise for differential privacy negligible, but also this privacy noise can in fact reduce unfairness as it positively biases smaller counts. These findings are validated over an extensive analysis using datasets commonly applied in census statistics.

Abstract: The propagation of aggressive behavior in online social networks presents a growing threat to digital wellbeing and social harmony. While existing research focuses on modeling aggression diffusion or detecting aggressive content, forecasting individual user aggression remains an open challenge. This work fills this gap by introducing Temporal Social Graph Attention Network (TSGAN), a social-aware sequence-to-sequence architecture designed to forecast aggressive behavior in dynamic social networks. The core of TSGAN is an adaptive socio-temporal attention module that dynamically models social influence and temporal dynamics. To capture global social influence, TSGAN employs a graph contrastive learning approach to generate global network context embeddings. TSGAN utilizes an aggression intensity metric derived from a proposed hybrid aggression content detection model (92.87% F1), combining a fine-tuned transformer with a large language model to quantify user aggression over time. TSGAN uniquely addresses user inactivity, models dynamic follower relationship impacts, and accounts for temporal behavioral decay while scaling to large networks. Experiments on real-world datasets (X for aggression forecasting and Flickr for popularity prediction) demonstrate TSGAN’s versatility and effectiveness. TSGAN outperforms baselines in forecasting across hourly, daily, and weekly temporal intervals, showing up to 24.8% improvement in daily aggression predictions.

Abstract: In recent years, there has been growing interest in leveraging machine learning for homeless service assignment. However, the categorical nature of administrative data recorded for homeless individuals hinders the development of accurate machine learning methods for this task. This work asserts that deriving latent representations of such features, while at the same time leveraging underlying relationships between instances is crucial in algorithmically enhancing the existing assignment decisionmaking process. Our proposed approach learns temporal and functional relationships between services from historical data, as well as unobserved but relevant relationships between individuals to generate features that significantly improve the prediction of the next service assignment compared to the state-of-the-art.

Abstract: Largescale, volunteer-collected datasets of community-identified natural world imagery like iNaturalist have enabled marked performance gains for fine-grained visual classification of species using machine learning methods. However, such data---sometimes referred to as citizen science data---are opportunistic and lack a structured sampling strategy. This volunteer-collected biodiversity data contains geographic, temporal, taxonomic, observers, and sociopolitical biases that can have significant effects on biodiversity model performance, but whose impacts are unclear for fine-grained species recognition performance. Here we introduce Diversity Shift (DivShift), a framework for quantifying the effects of domain-specific distribution shifts on machine learning model performance. To diagnose the performance effects of biases specific to volunteer-collected biodiversity data, we also introduce DivShift - North American West Coast (DivShift-NAWC), a curated dataset of almost 7.5 million iNaturalist images across the western coast of North America partitioned across five types of expert-verified bias. We compare species recognition performance across these bias partitions using a diverse variety of species- and ecosystem-focused accuracy metrics. We observe that these biases confound model performance less than expected from the underlying label distribution shift, and that more data leads to better model performance but the magnitude of these improvements are bias-specific. These findings imply that while the structure within natural world images provides generalization improvements for biodiversity monitoring tasks, the biases present in volunteer-collected biodiversity data can also affect model performance; thus these models should be used with caution in downstream biodiversity monitoring tasks.

Abstract: Biophysical models offer valuable insights into climatephenology relationships in both natural and agricultural settings. However, there are substantial structural discrepancies across models which require site-specific recalibration, often yielding inconsistent predictions under similar climate scenarios. Machine learning methods offer data-driven solutions, but often lack interpretability and alignment with existing knowledge. We present a phenology model describing dormancy in fruit trees, integrating conventional biophysical models with a neural network to address their structural disparities. We evaluate our hybrid model in an extensive case study predicting cherry tree phenology in Japan, South Korea and Switzerland. Our approach consistently outperforms both traditional biophysical and machine learning models in predicting blooming dates across years. Additionally, the neural network's adaptability facilitates parameter learning for specific tree varieties, enabling robust generalization to new sites without site-specific recalibration. This hybrid model leverages both biophysical constraints and data-driven flexibility, offering a promising avenue for accurate and interpretable phenology modeling.

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA Regenstrief Institute, Indianapolis, IN, USA, J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, FL, USA, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Abstract: Adverse clinical events related to unsafe care are among the top ten causes of death in the U.S. Accurate modeling and prediction of clinical events from electronic health records (EHRs) play a crucial role in patient safety enhancement. An example is modeling de facto care pathways that characterize common stepby-step plans for treatment or care. However, clinical event data pose several unique challenges, including the irregularity of time intervals between consecutive events, the existence of cycles, periodicity, multi-scale event interactions, and the high computational costs associated with long event sequences. Existing neural temporal point processes (TPPs) methods do not effectively capture the multi-scale nature of event interactions, which is common in many real-world clinical applications. To address these issues, we propose the cross-temporal-scale transformer (XTSFormer), specifically designed for irregularly timed event data. Our model consists of two vital components: a novel Feature-based Cycle-aware Time Positional Encoding (FCPE) that adeptly captures the cyclical nature of time, and a hierarchical multi-scale temporal attention mechanism, where different temporal scales are determined by a bottom-up clustering approach. Extensive experiments on several real-world EHR datasets show that our XTSFormer outperforms multiple baseline methods.

Abstract: Scientific discovery is a complex cognitive process that has driven human knowledge and technological progress for centuries. While artificial intelligence (AI) has made significant advances in automating aspects of scientific reasoning, simulation, and experimentation, we still lack integrated AI systems capable of performing autonomous longterm scientific research and discovery. This paper examines the current state of AI for scientific discovery, highlighting recent progress in large language models and other AI techniques applied to scientific tasks. We then outline key challenges and promising research directions toward developing more comprehensive AI systems for scientific discovery, including the need for science-focused AI agents, improved benchmarks and evaluation metrics, multimodal scientific representations, and unified frameworks combining reasoning, theorem proving, and data-driven modeling. Addressing these challenges could lead to transformative AI tools to accelerate progress across disciplines towards scientific discovery.

Abstract: Reinforcement learning (RL), particularly its combination with deep neural networks referred to as deep RL (DRL), has shown tremendous promise across a wide range of applications, suggesting its potential for enabling the development of sophisticated robotic behaviors. Robotics problems, however, pose fundamental difficulties for the application of RL, stemming from the complexity and cost of interacting with the physical world. These challenges notwithstanding, recent advances have enabled DRL to succeed at some realworld robotic tasks. However, state-of-the-art DRL solutions’ maturity varies significantly across robotic applications. In this talk, I will review the current progress of DRL in real-world robotic applications based on our recent survey paper (with Tang, Abbatematteo, Hu, Chandra, and Martı́n-Martı́n), with a particular focus on evaluating the real-world successes achieved with DRL in realizing several key robotic competencies, including locomotion, navigation, stationary manipulation, mobile manipulation, human-robot interaction, and multi-robot interaction. The analysis aims to identify the key factors underlying those exciting successes, reveal underexplored areas, and provide an overall characterization of the status of DRL in robotics. I will also highlight several important avenues for future work, emphasizing the need for stable and sample-efficient real-world RL paradigms, holistic approaches for discovering and integrating various competencies to tackle complex long-horizon, open-world tasks, and principled development and evaluation procedures. The talk is designed to offer insights for RL practitioners and roboticists toward harnessing RL’s power to create generally capable real-world robotic systems.

Abstract: Artificial Intelligence (AI) and Machine Learning hold immense potential to accelerate scientific discovery and engineering design. A fundamental challenge in these domains involves efficiently exploring a large space of hypotheses using expensive experiments in a resourceefficient manner. My research focuses on developing novel adaptive experimental design methods to address this broad challenge. Specifically, I develop new probabilistic modeling and decision making tools that operate in small data settings. These approaches have yielded substantial improvements in sample-efficiency, particularly for black-box optimization over high-dimensional combinatorial spaces (e.g., sequences and graphs). This cover letter outlines key methods I have developed and their real-world sustainability applications in areas such as nano-porous materials discovery, hardware design, and additive manufacturing. Additionally, I highlight my initiatives to foster collaboration between Science/Engineering and AI communities.

Abstract: The development of large language models has demonstrated robust performance on Englishcentric benchmarks, which predominantly reflect majority opinions and dominant cultural norms. However, successful deployment in real-world applications requires the ability to handle context-specific and diverse knowledge, which is often underrepresented in training data. Addressing a plurality of perspectives is therefore essential. My research focuses on developing pluralistic evaluation methods to assess the diversity of LLM outputs, with a particular focus on culturally rich common-sense reasoning. Additionally, I work on advancing models that integrate diverse knowledge into LLMs, aiming to bridge the gap between human and AI understanding through the incorporation of varied perspectives using innovative probabilistic frameworks. In this talk, I will emphasize two key directions of my previous work: the probabilistic box model for representing diverse knowledge and probabilistic evaluation for assessing diversity in LLMs, with a focus on distributional aspects. Additionally, I will discuss my efforts to understand model behavior in long-tail scenarios.

Abstract: Persuasion is important in numerous situations like healthy habit promotion, and emotional support. As AI gets more involved in our daily life, it becomes critical to study how they can persuade humans and how persuasive they are. In this talk, I will cover (1) how to build such persuasive AI systems that can persuade, negotiate, and cooperate with other humans in the game of Diplomacy. (2) I will also discuss how humans perceive such specialized AI systems. This study validates the necessity of California's Autobot Law and proposes guidance to regulate such systems. (3) As these systems become more powerful, AI safety problems become more important. So I will describe how to persuade AI models to jailbreak them and study AI safety problems. Finally, I will conclude with my longterm vision to further study persuasion from a multi-angle approach that combines Artificial Intelligence, Human-Computer Interaction, and social sciences.

Abstract: Design generation requires tight integration of neural and symbolic reasoning, as good design must meet explicit user needs and honor implicit rules for aesthetics, utility, and convenience. Current automated design tools driven by neural networks produce appealing designs, but cannot satisfy user specifications and utility requirements. Symbolic reasoning tools, such as constraint programming, cannot perceive lowlevel visual information in images or capture subtle aspects such as aesthetics. We introduce Spatial Reasoning Integrated Generator (SPRING) for design generation. SPRING embeds a neural and symbolic integrated spatial reasoning module inside the deep generative network. The spatial reasoning module samples the set of locations of objects to be generated from a backtrack-free distribution. This distribution modifies the implicit preference distribution, which is learned by a recursive neural network to capture utility and aesthetics. Sampling from the backtrack-free distribution is accomplished by a symbolic reasoning approach, SampleSearch, which zeros out the probability of sampling spatial locations violating explicit user specifications. Embedding symbolic reasoning into neural generation guarantees that the output of SPRING satisfies user requirements. Furthermore, SPRING offers interpretability, allowing users to visualize and diagnose the generation process through the bounding boxes. SPRING also handles novel user specifications not encountered during its training with zero-shot constraint transfer. Quantitative evaluations and a human study show that SPRING outperforms baseline generative models, delivering high design quality and better meeting user specifications.

Abstract: The transfer of patients between two aircraft using an underway watercraft increases medical evacuation reach and flexibility in maritime environments. The selection of any one of multiple underway watercraft for patient exchange is complicated by participating aircraft utilization histories and participating watercraft positions and velocities. The selection problem is modeled as a semiMarkov decision process with an action space including both fixed land and moving watercraft exchange points. Monte Carlo tree search with root parallelization is used to select optimal exchange points and determine aircraft dispatch times. Model parameters are varied in simulation to identify representative scenarios where watercraft exchange points reduce incident response times. We find that an optimal policy with watercraft exchange points outperforms an optimal policy without watercraft exchange points and a greedy policy by 35% and 40%, respectively. In partnership with the United States Army, we deploy for the first time the watercraft exchange point by executing a mock patient transfer with a manikin between two HH-60M medical evacuation helicopters and an underway Army Logistic Support Vessel south of the Hawaiian island of Oahu. Both helicopters were dispatched in accordance with our optimized decision strategy.

Abstract: Medical quality control (MQC) indicators are essential for evaluating the performance of healthcare institutions to ensure highquality patient care. In this paper, we report the design, implementation, and deployment of the Intelligent EMR-LLM platform for Medical Quality Control (IMQC), a large language model (LLM)-empowered system for automatically computing MQC indicators for enhancing the quality of medical services in Shanghai. It consists of an LLM (i.e., EMR-LLM) for processing electronic medical records (EMRs). With EMR-LLM, IMQC translates existing MQC indicators into a standardized representation language and automatically computes them based on EMRs. Since its deployment in February 2024, IMQC has been adopted by the Shanghai Medical Quality Management Center and associated hospitals. So far, it has processed 1,245 medical quality indicators for secondary- and tertiary-level hospitals, achieving an MQC evaluation accuracy of 93.31%, which is comparable to human experts. It has significantly improved efficiency, increasing from 10 EMRs per hour per human expert to over 1,000 EMRs per hour on average using one single H800 GPU. Over the first round of deployment in Shanghai, it is estimated that IMQC saves around 3.42 million RMB per month in manpower costs compared to traditional reporting methods. The successful deployment of IMQC sets a precedence for other regions to adopt similar AI-driven solutions to enhance medical quality control.

Abstract: Cotton is a critical agricultural product and industrial raw material, playing a key role in the national economies and people's living conditions, particularly in developing countries. However, cotton picking and processing often result in the contamination with various foreign fibers, such as hair, hemp rope, plastic film, and polypropylene rope. These contaminants are difficult to remove during textile processing and tend to break into small fragments, significantly reducing the quality of cotton products and negatively impacting the cotton industry. In this paper, we present an AIenabled hardware-software integrated system--XCotton, for identifying and removing foreign fibers. Our system has been deployed in actual cotton production environments in the multiple regions in China, Central Asia, and Africa. XCotton achieves a cleaning efficiency of 1000kg/h, representing a 43% improvement, with only 14 kWh energy consumption (63% less). Moreover, XCotton brings significant business values to its manufacturer and clients. XCotton not only enhances the quality of cotton products but also contributes to the value-adding and upgrading of the cotton industry in developing regions, supporting economic growth and improving living conditions.

Abstract: We present a novel Automatic Target Recognition (ATR) system using openvocabulary object detection and classification models. A primary advantage of this approach is that target classes can be defined just before runtime by a non-technical end user, using either a few natural language text descriptions of the target, or a few image exemplars, or both. Nuances in the desired targets can be expressed in natural language, which is useful for unique targets with little or no training data. We also implemented a novel combination of several techniques to improve performance, such as leveraging the additional information in the sequence of overlapping frames to perform tubelet identification (i.e., sequential bounding box matching), bounding box re-scoring, and tubelet linking. Additionally, we developed a technique to visualize the aggregate output of many overlapping frames as a mosaic of the area scanned during the aerial surveillance or reconnaissance, and a kernel density estimate (or heatmap) of the detected targets. We initially applied this ATR system to the use case of detecting and clearing unexploded ordinance on airfield runways and we are currently extending our research to other real-world applications.

Abstract: Infusing artificial intelligence algorithms into production aerospace systems can be challenging due to costs, timelines, and a riskaverse industry. We introduce the Onboard Artificial Intelligence Research (OnAIR) platform, an open-source software pipeline and cognitive architecture tool that enables full life cycle AI research for on-board intelligent systems. We begin with a description and user walk-through of the OnAIR tool. Next, we describe four use cases of OnAIR for both research and deployed onboard applications, detailing their use of OnAIR and the benefits it provided to the development and function of each respective scenario. We conclude with remarks on future work, future planned deployments, and goals for the forward progression of OnAIR as a tool to enable a larger AI and aerospace research community.

Abstract: The rapid and nearly pervasive impact of artificial intelligence on fields as diverse as medicine, law, banking, and the arts has made many students who would never enroll in a computer science class become interested in understanding elements of artificial intelligence. Fueled by questions about how this technology would change their own fields, these students are not seeking to become experts in building AI systems but instead are searching for a sufficient understanding to be safe, effective, and informed users. In this paper, we describe a firstof-its-kind course offering, "Artificial Intelligence for Future Presidents" designed and taught during the spring of 2024. We share rationale on the design and structure of the course, consider how best to convey complex technical information to students without the background in programming or mathematics, and consider methods for supporting an understanding of the limits of this technology.

Abstract: Understanding how robots plan and execute tasks is crucial in today's world, where they are becoming more prevalent in our daily lives. However, teaching nonexperts, such as K-12 students, the complexities of robot planning can be challenging. This work presents an open-source platform, JEDAI.Ed, that simplifies the process using a visual interface that abstracts the details of various planning processes that robots use for performing complex mobile manipulation tasks. Using principles developed in the field of explainable AI, this intuitive platform enables students to use a high-level intuitive instruction set to perform complex tasks, visualize them on an in-built simulator, and to obtain helpful hints and natural language explanations for errors. Finally, JEDAI.Ed, includes an adaptive curriculum generation method that provides students with customized learning ramps. This platform's efficacy was tested through a user study with university students who had little to no computer science background. Our results show that JEDAI.Ed is highly effective in increasing student engagement, teaching robotics programming, and decreasing the time need to solve tasks as compared to baselines.

Abstract: Responsible AI (RAI) encompasses the science and practice of ensuring that AI design, development, and use are socially sustainable—maximizing the benefits of technology while mitigating its risks. Industry practitioners play a crucial role in achieving the objectives of RAI, yet there is a persistent a shortage of consolidated educational resources and effective methods for teaching RAI to practitioners. In this paper, we present a stakeholder-first educational approach using interactive case studies to foster organizational and practitioner-level engagement and enhance learning about RAI. We detail our partnership with Meta, a global technology company, to co-develop and deliver RAI workshops to a diverse company audience. Assessment results show that participants found the workshops engaging and reported an improved understanding of RAI principles, along with increased motivation to apply them in their work.

Abstract: The growing impact of AI on various fields, including art, highlights the importance of integrating AI learning into art education. This work investigates whether traditional art lessons can be adapted to meaningfully incorporate AI, focusing on its application to artmaking practices. We adapted a character design activity to incorporate AI at different stages, such as using AI for creating references, getting feedback, visual design, animation, and personality design. We developed a character design learning activity which was supplemented by a code notebook and a front-end character design tool. 39 middle and high school students participated in this activity during two in-person Art and AI workshops. Analysis of creative outputs, knowledge surveys, and classroom discussions showed that students showed significant shifts in their understanding of AI as a creative collaborator, their art making practice, and their confidence with using AI tools. Learners demonstrated different creative styles while adopting AI into their character design. This approach demonstrates the potential for integrating AI into art lessons and offers a scalable framework for other non-CS subjects.

Abstract: With the rapid rise of AI technologies such as ChatGPT, understanding and integrating AI into K12 education has become increasingly important. However, teachers often lack the AI literacy necessary to navigate these tools, which can lead to the perpetuation of misconceptions and biases in the classroom. This study seeks to identify K-12 teachers’ self-identified needs regarding AI education and compare them with existing research on professional development (PD) for AI integration. We surveyed 34 K-12 teachers to assess their knowledge of AI, identify areas where they require further support, and evaluate the relevance of current PD offerings. Our findings reveal a significant disconnect between the top-down assumptions of expert-driven PD initiatives and the practical needs articulated by teachers. Key themes emerged, including a diverse range of AI understanding among educators, a strong preference for hands-on, practical training, and a demand for ongoing institutional support. Additionally, teachers expressed a desire for collaborative learning environments to share strategies and experiences related to AI. This study underscores the importance of tailoring PD programs to address the unique contexts and challenges faced by educators, advocating for a more personalized approach that fosters confidence and competence in AI integration. By aligning PD offerings with teachers’ needs, we aim to enhance their ability to effectively utilize AI tools in the classroom, ultimately enriching the educational experience for students.

Abstract: Time sequences are essential in fields such as finance, healthcare, and environmental science, where understanding temporal dependencies and making accurate predictions are crucial. These sequences often exhibit complexities like nonlinearity, noise, and concept drift. Traditional models struggle to capture the intricate dynamics of multivariate and coevolving sequences, particularly in contexts where relationships between variables shift unpredictably. This thesis introduces a range of Kernel Representation Learning (KRL) methodologies to address these challenges. We develop kernel self-representation learning to capture the temporal dependencies and hidden structures, while identifying concept drift in co-evolving sequences. Additionally, we explore theoretical connections between KRL and advanced deep-learning models. The proposed methods are validated through real-world applications, showing improvements in predictive accuracy, interpretability, and robustness.

Abstract: While advances in machine learning and the expansion of massive datasets have significantly improved predictive accuracy, the translation of these predictions into actionable decisions—alongside a robust understanding of associated risks—remains underexplored. My research focuses on developing methodology and theory in datadriven decision-making and uncertainty quantification that effectively address core data challenges. This paper presents two connected pillars of my research: data-driven contextual optimization, uncertainty quantification and reduction.

Abstract: Large language models (LLMs) now turn their attention to search. Recently, Thought of Search (ToS) proposed defining the search space with code, having an LLM produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test, achieving impressive 100% accuracy on all the tested datasets. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.

Abstract: The increasing adoption of conversational agents powered by large language models (LLMs) raises questions about its effects across culturally diverse interactions. While these agents are linguistically versatile and multilingual, their ability to adapt along cultural dimensionsdefined as geographically and communally nurtured sets of values and behavioral norms--lacks close scrutiny of both their design and deployment. To achieve inclusive conversational AI, it is essential to understand how agents adapt to users from diverse cultural backgrounds. In this study, we analyze dialogues between human users from different countries and LLM-powered agents to examine how both parties adapt their word use, a salient aspect of linguistic styles, toward one another throughout casual conversations. Our analysis reveals that LLMs exhibit varying degrees of style matching based on users' national cultures and demonstrate asymmetric adaptation when interacting with culturally diverse users. Moreover, we observe a reciprocal dynamic where both the LLMs and users from certain cultures adjust their styles in response to one another. Additionally, our findings support the hypothesis that LLMs and users naturally converge in conversational styles over the course of interactions, mirroring the dynamics of human conversations that accommodate and converge. To develop localized and culturally aware agents, there's a potential to utilize such cross-cultural convergence process during fine-tuning to align LLMs.

Abstract: Stateof-the-art large language models (LLMs) are designed with robust safeguards to prevent the disclosure of harmful information and dangerous procedures. However, "jailbreaking" techniques can circumvent these protections by exploiting vulnerabilities in the models. This paper introduces a novel method, Hex Injection, which leverages a specific weakness in LLMs' ability to decode encoded text to uncover concealed dangerous instructions. Hex Injection distinguishes itself from traditional methods by combining encoded instructions with plaintext prompts to reveal unsafe content more effectively. Our approach involves encoding potentially malicious prompts in hexadecimal and integrating them. We observe a 94% average success rate (ASR) with a combination of plaintext, encoded, and role-play for Llama 3 and 3.1 models, and an 86% ASR for the Gemma 2 model. This research not only advances the understanding of LLM security but also offers valuable insights for improving safety mechanisms in artificial intelligence systems.

Abstract: This paper addresses contention window optimization for multiaccess scenarios. Our investigation into state-of-the-art models revealed that a limited number of nodes dominate the communication channels. Such monopolization issues are critical in networks as they can lead to significant disruptions. To mitigate this monopolization problem, we propose an imitation learning-based backoff mechanism. The proposed model is a reinforcement learning-based contention window optimization method. It imitates the expert's policy to ensure fair policy convergence for the agent and includes opportunities for weight adjustment to boost performance. The proposed model shows a fairness improvement of approximately 20% to 41% across various scenarios.

Abstract: Transfer learning enhances model performance in financial time series by leveraging data from related domains. The selection of appropriate source domains is crucial to avoid negative transfer. We propose using Gramian Angular Field (GAF) transformations to improve time series similarity functions for better domain alignment. Extensive experiments with DNN and LSTM models show that GAFbased similarity functions, specifically Coral (GAF) for DNN and CMD (GAF) for LSTM, significantly reduce prediction errors, demonstrating their effectiveness in complex financial environments.

Abstract: This abstract presents a simulated annealing based approach that constructs hyperspectral images from the frequency spectrums of a distributed acoustic sensing system and iteratively improves them through the training of learnable filters. The aim is to construct an image that represents features of signals from events while repressing noise. Hyper-spectral images are specifically created for downstream computer vision tasks such as object detection. Hyper-spectral images are images with more than three channels that are derived from a frequency spectrum to obtain the spectrum for each image pixel. Simulated annealing is used to train the filters to automatically select frequencies and bin them into frequency bands. Each frequency band is mapped into an image channel. We fully integrate our filtering method with an object detection network so that filters are trained in conjunction with the neural network. The detection model serves as both the measure and the selector. Our simulated annealing approach significantly outperforms current state-of-the-art methods by a margin of 22%. Limitations include a dependency on randomness and excluding parts of the search space prematuraly due to the design of the local moves.

Abstract: Intimate Partner Violence is a global, lifethreatening public health issue that can be prevented by recognizing emotionally aggressive behaviors that signal the potential for future relationship abuse. To help identify these precursory unhealthy behaviors, this study proposes a Multi-task Learning framework for training robust models capable of detecting not only physically abusive behaviors but also emotionally abusive behaviors, such as belittling or manipulation, which historically precede physical abuse. Preliminary results indicate that Multi-task Learning can improve detection of emotional abuse and help tune detection models to particular kinds of relationship abuse.

Abstract: On social media, it is easy to see how people are connected and find the leader, or mastermind of a network. The mastermind is responsible for the planning of the activities in the network. Hiding the mastermind is important to carry out these activities. This raises the question for the mastermind: How effectively can the mastermind hide his connections to avoid being found? We propose an efficient heuristic algorithm called HERMES (Hide Exposures by Removing Mastermind’s External Sources) to address this. Experiments on Facebook and Google networks show that HERMES hides the mastermind more effectively than the stateof-the-art, achieving time gains of 103 and 1397 seconds, respectively, and improving influence value by up to 11.11%.

Abstract: Phone scams pose a significant threat to individuals and communities, causing substantial financial losses and emotional distress. Despite ongoing efforts to combat these scams, scammers continue to adapt and refine their tactics, making it imperative to explore innovative countermeasures. This research explores the potential of large language models (LLMs) to provide detection of fraudulent phone calls. By analyzing the conversational dynamics between scammers and victims, LLMbased detectors can identify potential scams as they occur, offering immediate protection to users. While such approaches demonstrate promising results, we also acknowledge the challenges of biased datasets, relatively low recall, and hallucinations that must be addressed for further advancement in this field.

Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Peking University, Beijing, China

Abstract: In recent years, reinforcement learning has been widely applied in the field of games. However, most studies focus on assisting agents to achieve victory, with less attention paid to whether the agents exhibit humanlike characteristics. In order to build human-like agents with high performance, we propose a method for learning the strategies of human players in modern three-dimensional video games. Our method utilizes a hierarchical framework, learning basic behaviors and intentions of human players at the lower level through imitation learning, and generalized policies at the high level through reinforcement learning. Compared with other existing methods, our method demonstrates significant advantages in learning human-like strategies in complex environments.

Abstract: This paper proposes extended Long ShortTerm Memory (LSTM) networks for the knowledge tracing task and employs explainable AI methods to address interpretability issues. Specifically, we developed an extended LSTM-based model to automatically diagnose students' knowledge states. We then leveraged three interpreting methods—gradient sensitivity, gradient*input, and Deep SHAP—to explain the model's predictions by computing input contributions. The results demonstrate that the proposed model outperforms DKT, and the three methods effectively explain its predictions. Additionally, we identified three key insights into the model's working mechanisms.

Abstract: In multivariate time series classification, although current sequence analysis models have excellent classification capabilities, they show significant shortcomings when dealing with long sequence multivariate data. This paper focuses on optimizing model performance for longsequence multivariate data by mitigating the impact of extended time series and multiple variables on the model. We propose a principal component analysis (PCA)-based temporal streaming compression and dimensionality reduction algorithm for time series data (temporal streaming batch PCA, TSBPCA), which continuously updates the compact representation of the entire sequence through streaming PCA time estimation with time block updates, enhancing the data representation capability of a range of sequence analysis models.We evaluated this method using various models on five datasets, and the experimental results show that our method demonstrates outstanding performance in both classification accuracy and time efficiency.

Abstract: Leaf based diseases in tomatoes such as early blight, late blight, and septoria leaf spot, pose a significant threat to global food security and have substantial economic impacts. Early detection of these diseases is crucial for improving crop yields. This paper explores the use of visionlanguage models (VLMs) for detecting tomato leaf diseases by fine-tuning a pre-trained model on a large dataset of tomato leaf images with corresponding disease annotations. This approach enhances disease detection accuracy and enables multi-modal learning, real-time monitoring, and automated diagnosis, offering promising applications in precision farming and food production.

Abstract: Artificial intelligence (AI) has improved significantly in recent decades, and, along with it, its applications to realworld scenarios. AI has been used within a wide variety of fields like health care and e-commerce, however, AI has yet to integrate with the agriculture industry. With the help of machine learning, AI can begin to integrate with the industry via a research assistant. The model will assist researchers conduct experiments by giving treatment methods that are best suited for the experiment rather than relying on the expertise of the researcher. This will help research within the industry to become more efficient and less error prone. To accomplish this, the model will use a Knowledge Graph created by the IDIR lab that converts the large CSV files into a graph that can be queried and then summarized by the model.

Abstract: Agentic systems interleave large language model (LLM) reasoning, tool usage, and tool observations over multiple iterations to tackle complex tasks. The raw data from an agent's problemsolving process (the agents' trajectory) is not an ideal format for human analysis and oversight. There is a need for tooling that converts this primary data into an easily navigable and understandable visual format for better human feedback. To address this opportunity, we developed the Agent Trajectory Explorer, a tool designed to help AI developers and researchers visualize, annotate, and demonstrate agent behavior.

Abstract: We present QGen Studio: an adaptive questionanswer generation, training, and evaluation platform. QGen Studio enables users to leverage large language models (LLMs) to create custom question-answer datasets and fine-tune models on this synthetic data. It features a dataset viewer and model explorer to streamline this process. The dataset viewer provides key metrics and visualizes the context from which the QA pairs are generated, offering insights into data quality. The model explorer supports model comparison, allowing users to contrast the performance of their trained LLMs against other models, supporting performance benchmarking and refinement. QGen Studio delivers an interactive, end-to-end solution for generating QA datasets and training scalable, domain-adaptable models. The studio will be open-sourced soon, allowing users to deploy it locally.

Abstract: Scalable Vector Graphics (SVG) have become integral to modern image rendering applications due to their infinite scalability and versatility, especially in graphic design and web development. SVGs are essentially long strings of code that adhere to a structured syntax with validity constraints. With the rise of large language models, which excel at generating code in various languages, we aim to generate SVG code in a similar way. Our findings show that a visionlanguage model can be conditioned to produce valid SVG code that closely resembles input images, effectively enabling vectorization. Additionally, we harness the rich SVG syntax, encompassing all possible primitives—such as lines, paths, polygons, text, and effects like color gradients—that previous methods often missed. We briefly explain how the StarVector model operates, primarily leveraging a vision-language transformer architecture to generate SVG code. We also detail our training and inference procedures. Finally, we provide an interactive demo that allows users to input an image and generate its SVG code autoregressively, featuring real-time rendering that visually demonstrates the SVG generation process.

Abstract: In an era where diverse and complex data are increasingly accessible, the precise prediction of individual treatment effects (ITE) becomes crucial across fields such as healthcare, economics, and social policy. Current stateof-the-art approaches, while providing valid prediction intervals through Conformal Quantile Regression (CQR) and related techniques, often yield overly conservative prediction intervals. In this work, we introduce a conformal inference approach to ITE using the conditional density of the outcome given the covariates. We leverage the reference distribution technique to efficiently estimate the conditional densities as the score functions under a two-stage conformal ITE framework. We show that our prediction intervals are not only marginally valid but are narrower than existing methods. Experimental results further validate the usefulness of our method.

Abstract: Partially linear models (PLM) have attracted much attention in the field of statistical machine learning. Specially, the ability of variable selection of PLM has been studied extensively due to the high requirement of model interpretability. However, few of the existing works concerns the false discovery rate (FDR) controllability of variable selection associated with PLM. To address this issue, we formulate a new Knockoffs Inference scheme for Linear And Nonlinear Discoverer (called KILAND), where FDR is controlled with respect to both linear and nonlinear variables for automatic structure discovery. For the proposed KI-LAND, theoretical guarantees are established for both FDR controllability and power, and experimental evaluations are provided to validate its effectiveness.

Abstract: The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed using inefficient floatingpoint formats, which renders the system challenging to optimize software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP version, overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing. (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations with DBFP format. (iii) An RTL-level DBFP-based engine is implemented to support DB-Attn, applicable to FPGA and ASIC. Results show that DB-Attn provides significant performance improvements with negligible accuracy loss, achieving 74% GPU speedup on Softmax of LLaMA and 10x low-overhead performance improvement over SOTA designs.

Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University School of Cyber Engineering, Xidian University, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University Guangzhou Institute of Technology, Xidian University, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University School of Cyber Engineering, Xidian University, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University Academy of Artificial Intelligence, College of Mathematics Science, Inner Mongolia Normal University

Abstract: Federated fewshot learning (FedFSL) aims to enable the clients to obtain personalized generalization models for unseen categories with only a small number of referenceable samples in the distributed collaborative training paradigm. Most existing FedFSL-related algorithms suffer from domain bias and feature coupling in the presence of data heterogeneity and sample scarcity. In this work, we propose a collaborative feature representation disentanglement (CFRD) scheme for FedFSL to address these issues. After each client receives the global aggregation parameters, the original feature representation is decoupled into global communal features and local personality features with personalized bias representation, to maintain both global consistency and local relevance in the first feature representation disentanglement. On the few-shot metric space about the second feature representation disentanglement, category-independent information is encoded by class-specific and class-irrelevant reconstructions to separate the discriminative features. The proposed scheme collaboratively accomplishes global domain bias feature disentanglement and local category degradation feature disentanglement from client-wise and class-wise. Experiments on three few-shot benchmark datasets conforming to the FedFSL paradigm demonstrate that our proposed method outperforms state-of-the-art approaches in both global generality and local specificity.

Abstract: Graph Neural Networks (GNNs) has been widely used in a variety of fields because of their great potential in representing graphstructured data. However, lacking of rigorous uncertainty estimations limits their application in high-stakes. Conformal Prediction (CP) can produce statistically guaranteed uncertainty estimates by using the classifier's probability estimates to obtain prediction sets, which contains the true class with a user-specified probability. In this paper, we propose a Rank-based CP during training framework to GNNs (RCP-GNN) for reliable uncertainty estimates to enhance the trustworthiness of GNNs in the node classification scenario. By exploiting rank information of the classifier's outcome, prediction sets with desired coverage rate can be efficiently constructed. The strategy of CP during training with differentiable rank-based conformity loss function is further explored to adapt prediction sets according to network topology information. In this way, the composition of prediction sets can be guided by the goal of jointly reducing inefficiency and probability estimation errors. Extensive experiments on several real-world datasets show that our model achieves any pre-defined target marginal coverage while significantly reducing the inefficiency compared with state-of-the-art methods.

Abstract: Incomplete multiview clustering presents significant challenges due to missing views. Although many existing graph-based methods aim to recover missing instances or complete similarity matrices with promising results, they still face several limitations: (1) Recovered data may be unsuitable for spectral clustering, as these methods often ignore guidance from spectral analysis; (2) Complex optimization processes require high computational burden, hindering scalability to large-scale problems; (3) Most methods do not address the rotational mismatch problem in spectral embeddings. To address these issues, we propose a highly efficient rotation-invariant spectral embedding (RISE) method for scalable incomplete multi-view clustering. RISE learns view-specific embeddings from incomplete bipartite graphs to capture the complementary information. Meanwhile, a complete consensus representation with second-order rotation-invariant property is recovered from these incomplete embeddings in a unified model. Moreover, we design a fast alternating optimization algorithm with linear complexity and promising convergence to solve the proposed formulation. Extensive experiments on multiple datasets demonstrate the effectiveness, scalability, and efficiency of RISE compared to the state-of-the-art methods.

Abstract: Graph Neural Networks are powerful tools for modeling graphstructured data but their interpretability remains a significant challenge. Existing model-agnostic GNN explainers aim to identify critical subgraphs or node features relevant to task predictions but often rely on GNN predictions for supervision, lacking ground-truth explanations. This limitation can introduce biases, causing explanations to fail in accurately reflecting the GNN's decision-making processes. To address this, we propose a novel explainer for GNNs with graph segmentation and contrastive learning. Our model introduces a graph segmentation learning module to divide the input graph into explanatory and redundant subgraphs. Next, we implement edge perturbation to augment these subgraphs, generating multiple positive and negative pairs for contrastive learning between explanatory and redundant subgraphs. Finally, we develop a contrastive learning module to guide the learning of explanatory and redundant subgraphs by pulling positive pairs with the same explanatory subgraphs closer while pushing negative pairs with different explanatory subgraphs far away. This approach allows for a clearer distinction of critical subgraphs, enhancing the fidelity of the explanations. We conducted extensive experiments on graph classification and node classification tasks, demonstrating the effectiveness of the proposed method.

Abstract: Graph neural networks (GNNs) have demonstrated their effectiveness in various tasks supported by their generalization capabilities. However, the current analysis of GNN generalization relies on the assumption that training and testing data are independent and identically distributed (i.i.d). This imposes limitations on the cases where a model mismatch exists when generating testing data. In this paper, we examine GNNs that operate on geometric graphs generated from manifold models, explicitly focusing on scenarios where there is a mismatch between manifold models generating training and testing data. Our analysis reveals the robustness of the GNN generalization in the presence of such model mismatch. This indicates that GNNs trained on graphs generated from a manifold can still generalize well to unseen nodes and graphs generated from a mismatched manifold. We attribute this mismatch to both node feature perturbations and edge perturbations within the generated graph. Our findings indicate that the generalization gap decreases as the number of nodes grows in the training graph while increasing with larger manifold dimension as well as larger mismatch. Importantly, we observe a tradeoff between the generalization of GNNs and the capability to discriminate high-frequency components when facing a model mismatch. The most important practical consequence of this analysis is to shed light on the filter design of generalizable GNNs robust to model mismatch. We verify our theoretical findings with experiments on multiple real-world datasets.

Abstract: Action advising endeavors to leverage supplementary guidance from expert teachers to alleviate the issue of sampling inefficiency in Deep Reinforcement Learning (DRL). Previous agentspecific action advising methods are hindered by imperfections in the agent itself, while agent-agnostic approaches exhibit limited adaptability to the learning agent. In this study, we propose a novel framework called Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) to strike a balance between the two. The underlying concept of A7 revolves around utilizing the similarity of state features as an indicator for soliciting advice. However, unlike prior methodologies, the measurement of state feature similarity is performed by neither the error-prone learning agent nor the agent-agnostic advisor. Instead, we employ a proxy model to extract state features that are both discriminative (adaptive to the agent) and generally applicable (robust to agent noise). Furthermore, we utilize behavior cloning to train a model for reusing advice and introduce an intrinsic reward for the advised samples to incentivize the utilization of expert guidance. Experiments are conducted on the GridWorld, LunarLander, and six prominent scenarios from Atari games. The results demonstrate that A7 significantly accelerates the learning process and surpasses existing methods (both agent- specific and agent-agnostic) by a substantial margin. Our code will be made publicly available.

Abstract: Deep reinforcement learning (DRL) has achieved remarkable success in various domains, yet its reliance on neural networks results in a lack of transparency, which limits its practical applications in safetycritical and human-agent interaction domains. Decision trees, known for their notable explainability, have emerged as a promising alternative to neural networks. However, decision trees often struggle in long-horizon continuous control tasks with high-dimensional observation space due to their limited expressiveness. To address this challenge, we propose SkillTree, a novel hierarchical framework that reduces the complex continuous action space of challenging control tasks into discrete skill space. By integrating the differentiable decision tree within the high-level policy, SkillTree generates discrete skill embeddings that guide low-level policy execution. Furthermore, through distillation, we obtain a simplified decision tree model that improves performance while further reducing complexity. Experiment results validate SkillTree’s effectiveness across various robotic manipulation tasks, providing clear skill-level insights into the decision-making process. The proposed approach not only achieves performance comparable to neural network based methods in complex long-horizon control tasks but also significantly enhances the transparency and explainability of the decision-making process.

Abstract: Pairwise learning includes various machine learning tasks, with ranking and metric learning serving as the primary representatives. While randomized coordinate descent (RCD) is popular in various problems, there is much less theoretical analysis on the generalization behavior of models trained by RCD, especially under the pairwise learning framework. In this paper, we consider the generalization of RCD for pairwise learning. We measure the onaverage argument stability for both convex and strongly convex objective functions, based on which we develop generalization bounds in expectation. The early-stopping strategy is adopted to quantify the balance between estimation and optimization. Our analysis further incorporates the low-noise setting into the excess risk bounds to achieve the optimistic bound as O(1/n), where n is the sample size.

Abstract: Most existing graph clustering methods primarily focus on exploiting topological structure, often neglecting the "missinghalf" node feature information, especially how these features can enhance clustering performance. This issue is further compounded by the challenges associated with high-dimensional features. Feature selection in graph clustering is particularly difficult because it requires simultaneously discovering clusters and identifying the relevant features for these clusters. To address this gap, we introduce a novel paradigm called "one node one model", which builds an exclusive model for each node and defines the node label as a combination of predictions for node groups. Specifically, the proposed "Feature Personalized Graph Clustering (FPGC)" method identifies cluster-relevant features for each node using a squeeze-and-excitation block, integrating these features into each model to form the final representations. Additionally, the concept of feature cross is developed as a data augmentation technique to learn low-order feature interactions. Extensive experimental results demonstrate that FPGC outperforms state-of-the-art clustering methods. Moreover, the plug-and-play nature of our method provides a versatile solution to enhance GNN-based models from the feature perspective.

Abstract: Classifiers often learn to be biased corresponding to the classimbalanced dataset under the semi-supervised learning (SSL) set. While previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image's logit, we further utilize a cheaper form of consistency gradients, which can be widely applicable to various class-imbalanced SSL (CISSL) models. We theoretically analyze that the process of refining pseudo-labels with a baseline image (solid color image without any patterns) in the basic SSL algorithm implicitly utilizes integrated gradient flow training, which can improve the attribution ability. Based on the analysis, we propose a consistently conflicting gradient-based debiasing scheme dubbed LCGC, by encouraging biased class predictions during training. We intentionally update the pseudo-labels whose gradient conflicts with the debiased logits, which is represented as the optimization direction offered by the over-imbalanced classifier predictions. Then, we debias the predictions by subtraction the baseline image logits during testing. Extensive experiments demonstrate that our method can significantly improve the prediction accuracy of existing CISSL models on public benchmarks.

Abstract: Recently, anchorbased incomplete multi-view clustering (IMVC) has been widely adopted for fast clustering, but most existing approaches still encounter some issues: (1) They generally rely on the observed samples to construct anchor graphs, ignoring the potentially useful information of missing instances. (2) Most methods attempt to learn a consensus anchor graph, failing to fully excavate the complementary information and high-order correlations across views. (3) They generally apply post-processing on learned anchor graph to seek latent embeddings, making them not globally-optimal. To address these issues, this paper proposes a novel fast IMVC approach with Adaptive Similarity Completion and Reconstruction (ASCR), which unifies anchor learning, anchor-sample similarity construction and completion, and latent multi-view embedding learning in a joint framework. Specifically, ASCR learns an anchor-sample similarity graph for each view, and the missing values are fulfilled to mitigate the adverse effects. To explore the consistent and complementary information across views, ASCR simultaneously seeks the view-specific anchor embeddings and sample embeddings in a latent subspace by similarity reconstruction, which not only preserves the semantic information into latent embeddings but also enhances the low-rank property of similarity graphs, achieving a reliable graph completion process. Furthermore, the high-order cross-view correlations are explored with tensor-based regularization. Extensive experimental results demonstrate the superiority and efficiency of ASCR compared with SOTA approaches.

Abstract: The challenges tied to unstructured graph data are manifold, primarily falling into node, edge, and graphlevel problem categories. Graph Neural Networks (GNNs) serve as effective tools to tackle these issues. However, individual tasks often demand distinct model architectures, and training these models typically requires abundant labeled data, a luxury often unavailable in practical settings. Recently, various "prompt tuning" methodologies have emerged to empower GNNs to adapt to multi-task learning with limited labels. The crux of these methods lies in bridging the gap between pre-training tasks and downstream objectives. Nonetheless, a prevalent oversight in existing studies is the homophily-centric nature of prompt tuning frameworks, disregarding scenarios characterized by high heterogeneity. To remedy this oversight, we introduce a novel prompting strategy named HeterGP tailored for highly heterophilic scenarios. Specifically, we present a dual-view approach to capture both homophilic and heterophilic information, along with a prompt graph design that encompasses token initialization and insertion patterns. Through extensive experiments conducted in a few-shot context encompassing node and graph classification tasks, our method showcases superior performance in highly heterophilic environments compared to state-of-the-art prompt tuning techniques.

Abstract: We investigate the online nonsubmodular optimization with delayed feedback in the bandit setting, where the loss function is αweakly DR-submodular and β-weakly DR-supermodular. Previous work has established an (α,β)-regret bound of O(nd^⅓T^⅔), where n is the dimensionality and d is the maximum delay. However, its regret bound relies on the maximum delay and is thus sensitive to irregular delays. Additionally, it couples the effects of delays and bandit feedback as its bound is the product of the delay term and the O(nT^⅔) regret bound in the bandit setting without delayed feedback. In this paper, we develop two algorithms to address these limitations, respectively. Firstly, we propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator and utilizes all the available estimated gradients in each round to update the decision. It achieves a better O(nd̅^⅓T^⅔) regret bound, which is relevant to the average delay d̅ = 1/T ∑ₜ₌₁ᵀ dₜ <= d. Secondly, we extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback, which enjoys an O(n(T^⅔+ √(dT))) regret bound. When d = O(T^⅓), our regret bound matches the O(nT^⅔) bound in the bandit setting without delayed feedback. Compared to our first O(nd̅^⅓T^⅔) bound, it is more advantageous when the maximum delay d = o(d̅^⅔T^⅓). Finally, we conduct experiments on structured sparse learning to demonstrate the superiority of our methods.

Abstract: Representation learning is a powerful tool that enables learning over large multitudes of agents or domains by enforcing that all agents operate on a shared set of learned features. However, many robotics or controls applications that would benefit from collaboration operate in settings with changing environments and goals, whereas most guarantees for representation learning are stated for static settings. Toward rigorously establishing the benefit of representation learning in dynamic settings, we analyze the regret of multitask representation learning for linear-quadratic control. This setting introduces unique challenges. Firstly, we must account for and balance the misspecification introduced by an approximate representation. Secondly, we cannot rely on the parameter update schemes of single-task online LQR, for which least-squares often suffices, and must devise a novel scheme to ensure sufficient improvement. We demonstrate that for settings where exploration is "benign", the regret of any agent after T timesteps scales with the square root of T/H, where H is the number of agents. In settings with "difficult" exploration, the regret scales as the square root of the input dimension times the parameter dimension multiplied by T, plus a term which scales with T to the three quarters divided by H to the one fifth. In both cases, by comparing to the minimax single-task regret, we see a benefit of a large number of agents. Notably, in the difficult exploration case, by sharing a representation across tasks, the effective task-specific parameter count can often be small. Lastly, we validate the trends we predict.

Abstract: Reentrancy vulnerabilities in smart contracts have been exploited to steal enormous amounts of money, thus detecting reentrancy vulnerabilities is a hotspot issue in security research. However, a new attack is emerging in which attackers continuously release new reentrancy patterns to exploit fresh vulnerabilities and obfuscate existing ones. Existing detection methods neglect the timeseries evolution of vulnerabilities across different smart contract versions, leading to a gradual decline in their effectiveness over time. We investigate the time-series correlations among vulnerabilities in various versions and refer to these as Evolutionary Reentrancy Vulnerabilities (ERVs). We summarize that ERVs detection faces two key challenges: (i) capturing the evolving pattern of ERVs along a complete evolutionary chain and (ii) detecting fresh reentrancy vulnerabilities in new versions. To address these challenges, we propose CLEP, a novel Contrastive Learning with Evolving Pairs detection method. It can effectively capture the evolving patterns by discerning similarities and differences across versions. Specifically, we first modified the sample distribution by incorporating version declarations as time-series evolution information. Then, leveraging the hierarchical similarity, we design an evolving pairs scheme to form negative and positive contract pairs across versions. Finally, we build a complete evolutionary chain by proposing a version-aware contrastive sampler. Our experimental results show that CLEP not only outperforms state-of-the-art baselines in version-specific scenarios but also shows promising performance in cross-version evolution scenarios.

Abstract: Organic Solar Cells (OSCs) are a promising technology for sustainable energy production. However, the identification of molecules with desired OSC properties typically involves laborious experimental research. To accelerate progress in the field, it is crucial to develop machine learning models capable of accurately predicting the properties of OSC molecules. While graph representation learning has demonstrated success in molecular property prediction, it remains underexplored for OSCspecific tasks. Existing methods fail to capture the unique structural features of OSC molecules, particularly the intricate ring systems that critically influence OSC properties, leading to suboptimal performance. To fill the gap, we present RingFormer, a novel graph transformer framework specially designed to capture both atom and ring level structural patterns in OSC molecules. RingFormer constructs a hierarchical graph that integrates atomic and ring structures and employs a combination of local message passing and global attention mechanisms to generate expressive graph representations for accurate OSC property prediction. We evaluate RingFormer's effectiveness on five curated OSC molecule datasets through extensive experiments. The results demonstrate that RingFormer consistently outperforms existing methods, achieving a 22.77% relative improvement over the nearest competitor on the CEPDB dataset.

Abstract: Cognitive Diagnosis Models (CDMs) are designed to assess students' cognitive states by analyzing their performance across a series of exercises. However, existing CDMs often struggle with diagnosing infrequent students and exercises due to a lack of rich prior knowledge. With the advancement in large language models (LLMs), which possess extensive domain knowledge, their integration into cognitive diagnosis presents a promising opportunity. Despite this potential, integrating LLMs with CDMs poses significant challenges. LLMs are not wellsuited for capturing the fine-grained collaborative interactions between students and exercises, and the disparity between the semantic space of LLMs and the behavioral space of CDMs hinders effective integration. To address these issues, we propose a novel Knowledge-enhanced Cognitive Diagnosis (KCD) framework, which is a model-agnostic framework utilizing LLMs to enhance CDMs and compatible with various CDM architectures. The KCD framework operates in two stages: LLM Diagnosis and Cognitive Level Alignment. In the LLM Diagnosis stage, both students and exercises are diagnosed to achieve comprehensive and detailed modeling. In the Cognitive Level Alignment stage, we bridge the gap between the CDMs' behavioral space and the LLMs' semantic space using contrastive learning and mask-reconstruction approaches. Experiments on several real-world datasets demonstrate the effectiveness of our proposed framework.

Abstract: Infectious diseases have historically had profound effects on global health, economies, and social structures. Effective tracing of infectious diseases is essential not only for immediate public health responses but also for shaping future prevention strategies. Traditional tracing methods often emphasize homogeneous networks, overlooking the diverse transmission characteristics of heterogeneous populations. This research addresses two critical challenges: the heterogeneity of transmission across various media and modes, and the significant yet underexplored influence of community structures on epidemic spread and tracing.We propose a Heterogeneous Hypergraph Attention Network (HHAN) modelthat accounts for multiple transmission pathways and patterns within heterogeneous networks. HHAN integrates a heterogeneous graph neural network module to handle the complexity of communication among different populations, and an AgentBased Modeling Module that combines agent-based ideas to model individual behaviors. This approach effectively captures complex interactions within community structures and addresses individual variability. Experimental results on three real-world datasets demonstrate that the HHAN model significantly outperforms other state-of-the-art methods in tackling the complex challenge of tracing infectious diseases in heterogeneous populations.

Abstract: Membership Inference Attack (MIA) aims to determine if a specific sample is present in the training dataset of a target machine learning model. Previous MIAs against finetuned Large Language Models (LLMs) either fail to address the unique challenges in the fine-tuned setting or rely on strong assumption of the training data distribution. This paper proposes a distribution-free MIA framework tailored for fine-tuned LLMs, named DF-MIA. We recognize that samples await to test can serve as a valuable reference dataset for fine-tuning reference models. By enhancing the signals of non-member samples within this reference dataset, we can achieve a more reliable and practical calibration of probabilities, improving the differentiation between members and non-members. Leveraging these insights, we have developed a two-stage framework that employs specially designed data augmentation and perturbation techniques to prioritize the significance of non-members and mitigate the influence of potential members within the reference dataset. We evaluate our method on three representative LLM models ranging from 1B to 8B on three datasets. The results demonstrate that the DF-MIA significantly enhances the performance of MIA.

Abstract: With the diversification of online social platforms, news dissemination has become increasingly complex, heterogeneous, and multimodal, making the fake news detection task more challenging and crucial. Previous works mainly focus on obtaining social relationships of news via retweets, limiting the accurate detection when real cascades are inaccessible. Given the proven assessment of the spreading influence of events, this paper proposes a method called HML (Complex Heterogeneous Multimodal Fake News Detection method via Latent Network Inference). Specifically, an improved social latent network inference strategy is designed to estimate the maximum likelihood of news influences under the same event. Meanwhile, a novel heterogeneous graph is built based on social attributes for multimodal news under different events. Further, to better aggregate the relationships among heterogeneous multimodal features, this paper proposes a selfsupervised-based multimodal content learning strategy, to enhance, align, fuse and compare heterogeneous modal contents. Based above, a personalized heterogeneous graph representation learning is designed to classify fake news. Extensive experiments demonstrate that the proposed method outperforms the SOTA in real social media news datasets.

Abstract: Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequencelevel information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.

Abstract: Accurate drugtarget affinity (DTA) prediction holds significant potential in the field of artificial intelligence (AI)-based drug discovery. However, existing methods primarily operate at a single scale, specifically at the macro (residue) scale for target proteins and the micro (atom) scale for drugs, which limits their ability to provide information at micro (atom) scale for targets and macro (functional group, FG) scale for drugs. This limitation hinders a comprehensive understanding of the binding patterns and properties of drug-target pairs. In this paper, we propose a progressive Macro-to-Micro 3D Modeling Network (M²N) that enables macro (residue/FG) to micro (atom) scale unified modeling, termed cross-scale, to predict DTA. Specifically, M²N operates drugs by learning their chemical properties and structural characteristics from a 3D FG graph to a 3D atom graph. Correspondingly, M²N encodes proteins from a 3D residue graph to a 3D atom graph to exploit their sequence, evolutionary, and geometric representations. Such cross-scale 3D modeling scheme allows for coarse-to-fine embedding optimization, followed by an adaptive fusion module to dynamically integrate the refined features by end-to-end learning. Extensive experiments on two datasets indicate that M²N not only outperforms state-of-the-art methods under various conditions, but also provides a new paradigm for target and drug unified modeling.

Abstract: Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3Dstructures or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes the function prediction to gene and gene products. Specifically, two pre-train tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. Specified masking and recovering task helps GoBERT in finding implicit patterns among functions. The pre-trained GoBERT possess the ability to predict novel functions for various gene and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.

Abstract: Heterogeneous Graph Neural Networks (HGNNs) have achieved stateof-the-art performance in classifying molecular graphs, capitalizing on their ability to capture rich semantics. However, HGNNs for molecule property prediction exhibit significant susceptibility to adversarial attacks—a challenge that prior research has entirely overlooked. To fill this gap, this paper introduces the first study focused on robust graph-level representation learning tailored for heterogeneous molecular graphs. To achieve this goal, we propose a comprehensive Robust Heterogeneous Graph Classification (RHGC) framework grounded in the Information Bottleneck principle, which aims to identify the most informative and least noisy heterogeneous subgraphs to derive robust, holistic representations. This is specifically accomplished through a dedicated Node Semantic Purifier, which enhances node-level and semantic-level robustness by eliminating label-irrelevant interference using graph stochastic attention and the Hilbert-Schmidt Independence Criterion, along with a Global Graph Disentanglement method, which improves graph-level robustness by addressing information leak. Experiments on three molecular benchmarks demonstrate that RHGC enhances accuracy by an average of 5.06% under all three attack settings and meanwhile by 4.33% on clean data.

Abstract: Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on programming assignments requires substantial time and effort from faculty, personalized automated feedback often involves suggesting repairs to students' programs. Symbolic semantic repair approaches, which rely on Formal Methods (FM), check a program's execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for program repair but often make extensive rewrites instead of minimal adjustments. This tends to lead to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FMbased fault localization excel at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a Counterexample Guided Inductive Synthesis (CEGIS) loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM for revised synthesis. Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.

Abstract: Traffic classification is crucial for network management and security. Recently, deep learningbased methods have demonstrated good performance in traffic classification. However, they primarily capture features from raw packet bytes, overlooking the significance of inter-packet correlations within flows from a global perspective. Additionally, effectively handling both packet-length and temporal information, while extracting the structural relationships from a graph into the model, remains a challenge for enhancing the performance of traffic prediction. In this paper, we propose DigTraffic, a novel dual-channel interactive graph transformer to address these limitations. DigTraffic employs a message-level graph-structured flow representation combined with message-aware structural aggregation. To learn intrinsic flow representations, DigTraffic constructs traffic interaction graphs, by incorporating three well-designed heterogeneous types of edges to capture client-server interactions. After that, we separately encode lengthy and temporal flow sequences using a dual-channel network and fuse these modalities within a Transformer architecture. Furthermore, DigTraffic introduces a message-aware Graph Transformer that leverages both node embeddings and edge spatial relations to capture complex graph structures and rich structural information. Experimental results demonstrate that our method significantly outperforms the state-of-the-art methods on four real-world traffic datasets.

Abstract: The rapid advancements of generative AI have fueled the potential of generative text image editing, meanwhile escalating the threat of misinformation spreading. However, existing forensics methods struggle to detect unseen forgery types that they have not been trained on, underscoring the need for a model capable of generalized detection of tampered scene text. To tackle this, we propose a novel task: openset tampered scene text detection, which evaluates forensics models on their ability to identify both seen and previously unseen forgery types. We have curated a comprehensive, high-quality dataset, featuring the texts tampered by eight text editing models, to thoroughly assess the open-set generalization capabilities. Further, we introduce a novel and effective pre-training paradigm that subtly alters the texture of selected texts within an image and trains the model to identify these regions. This approach not only mitigates the scarcity of high-quality training data but also enhances models' fine-grained perception and open-set generalization abilities. Additionally, we present DAF, a novel framework that improves open-set generalization by distinguishing between the features of authentic and tampered text, rather than focusing solely on the tampered text's features. Our extensive experiments validate the remarkable efficacy of our methods. For example, our zero-shot performance can even beat the previous state-of-the-art full-shot model by a large margin.

BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence School of Future Technology, University of Chinese Academy of Sciences, BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence, BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence, BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence, BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence School of Future Technology, University of Chinese Academy of Sciences, BrainCog Lab, Institute of Automation, Chinese Academy of Sciences Beijing Institute of Al Safety and Governance Beijing Key Laboratory of AI Safety and Superalignment Center for Long-term Artificial Intelligence School of Future Technology, University of Chinese Academy of Sciences

Abstract: Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stressinducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.

Abstract: Diffusionbased generative models have recently excelled in generating molecular conformations but struggled with the generalization issue -- models trained on one dataset may produce meaningless conformations on out-of-distribution molecules. On the other hand, distance geometry serves as a generalizable tool for the traditional computational chemistry methods of molecular conformation, which is predicated on the assumption that it is possible to adequately define the set of all potential conformations of any non-rigid molecular system using purely geometric constraints. In this work, we for the first time explicitly incorporate distance geometry constraints into pretraining phase of diffusion-based molecular generation models to improve the generalizability. Inspired by the classical distance geometry solution designed for solving the molecular distance geometry problem, we propose MiGDiff, a Metrization-Informed Geometric Diffusion framework. MiGDiff injects distance geometry constraints by pretraining the deep geometric diffusion backbone within the Metrization sampling approach, yielding a "Metrization-driven pretraining + Data-driven finetuning" paradigm. Experimental results demonstrate that MiGDiff outperforms state-of-the-art methods and possesses strong generalization capabilities, particularly on generating previously unseen molecules, revealing the vast untapped potential of combining traditional computational methods with deep generative models for 3D molecular generation.

Abstract: The development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies. Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process, significantly impacting the reliability of the resulting antibodies. To bridge this gap, we introduce dyAb, a flexible framework that incorporates AlphaFold2driven predictions to model pre-binding antigen structures and specifically addresses the dynamic nature of antigen conformation changes. Our dyAb model leverages a unique combination of coarse-grained interface alignment and fine-grained flow matching techniques to simulate the interaction dynamics and structural evolution of the antigen-antibody complex, providing a realistic representation of the binding process. Extensive experiments show that dyAb significantly outperforms existing models in antibody design involving changing antigen conformations. These results highlight dyAb's potential to streamline the design process for therapeutic antibodies, promising more efficient development cycles and improved outcomes in clinical applications.

Abstract: Accurate trajectory prediction is essential for the safety and efficiency of autonomous driving. Traditional models often struggle with realtime processing, capturing non-linearity and uncertainty in traffic environments, efficiency in dense traffic, and modeling temporal dynamics of interactions. We introduce NEST (Neuromodulated Small-world Hypergraph Trajectory Prediction), a novel framework that integrates Small-world Networks and hypergraphs for superior interaction modeling and prediction accuracy. This integration enables the capture of both local and extended vehicle interactions, while the Neuromodulator component adapts dynamically to changing traffic conditions. We validate the NEST model on several real-world datasets, including nuScenes, MoCAD, and HighD. The results consistently demonstrate that NEST outperforms existing methods in various traffic scenarios, showcasing its exceptional generalization capability, efficiency, and temporal foresight. Our comprehensive evaluation illustrates that NEST significantly improves the reliability and operational efficiency of autonomous driving systems, making it a robust solution for trajectory prediction in complex traffic environments.

Abstract: This paper explores Machine Unlearning (MU), an emerging field that is gaining increased attention due to concerns about neural models unintentionally remembering personal or sensitive information. We present SeUL, a novel method that enables selective and finegrained unlearning for language models. Unlike previous work that employs a fully reversed training objective in unlearning, SeUL minimizes the negative impact on the capability of language models, particularly in terms of generation. Furthermore, we introduce two innovative evaluation metrics, sensitive extraction likelihood (S-EL) and sensitive memorization accuracy (S-MA), specifically designed to assess the effectiveness of forgetting sensitive information. In support of the unlearning framework, we propose efficient automatic online and offline sensitive span annotation methods. The online selection method, based on language probability scores, ensures computational efficiency, while the offline annotation involves a two-stage LLM-based process for robust verification. In summary, this paper contributes a novel selective unlearning method (SeUL), introduces specialized evaluation metrics (S-EL and S-MA) for assessing sensitive information forgetting, and proposes automatic online and offline sensitive span annotation methods to support the overall unlearning framework and evaluation.

Abstract: Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a modelagnostic semi-supervised learning approach to visibility data reconstruction in radio astronomy. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse visibility examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec is applicable to various models, and outperforms all baseline methods in terms of reconstruction quality, robustness, and generalizability.

Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University

Abstract: Antibodies are Yshaped proteins that protect the host by binding to specific antigens, and their binding is mainly determined by the Complementary Determining Regions (CDRs) in the antibody. Despite the great progress made in CDR design, existing computational methods still encounter several challenges: 1) poor capability of modeling complex CDRs with long sequences due to insufficient contextual information; 2) conditioned on pre-given antigenic epitopes and their static interaction with the target antibody; 3) neglect of specificity during antibody optimization leads to non-specific antibodies. In this paper, we take into account a variety of node features, edge features, and edge relations to include more contextual and geometric information. We propose a novel Relation-Aware Antibody Design (RAAD) framework, which dynamically models antigen-antibody interactions for co-designing the sequences and structures of antigen-specific CDRs. Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies. Extensive experiments have demonstrated the superior capability of RAAD in terms of antibody modeling, generation, and optimization across different CDR types, sequence lengths, pre-training strategies, and input contexts.

Abstract: Understanding and predicting the popularity of online UserGenerated Content (UGC) is critical for various social and recommendation systems. Existing efforts have focused on extracting predictive features and using pre-trained deep models to learn and fuse multimodal UGC representations. However, the dissemination of social UGCs is not an isolated process in social network; rather, it is influenced by contextual relevant UGCs and various exogenous factors, including social ties, trends, user interests, and platform algorithms. In this work, we propose a retrieval-based framework to enhance the popularity prediction of multimodal UGCs. Our framework extends beyond a simple semantic retrieval, incorporating a meta retrieval strategy that queries a diverse set of relevant UGCs by considering multimodal content semantics, and metadata from user and post. Moreover, to eliminate irrelevant and noisy UGCs in retrieval, we introduce a new measure called Relative Retrieval Contribution to Prediction (RRCP), which selectively refines the retrieved UGCs. We then aggregate the contextual UGC knowledge using vision-language graph neural networks, and fuse them with an RRCP-Attention-based prediction network. Extensive experiments on three large-scale social media datasets demonstrate significant improvements ranging from 26.68% to 48.19% across all metrics compared to strong baselines.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating code. However, the misuse of LLMgenerated (synthetic) code has raised concerns in both educational and industrial contexts, underscoring the urgent need for synthetic code detectors. Existing methods for detecting synthetic content are primarily designed for general text and struggle with code due to the unique grammatical structure of programming languages and the presence of numerous ``low-entropy'' tokens. Building on this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our method is based on the observation that differences between LLM-rewritten and original code tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and evaluate our approach on two synthetic code detection benchmarks. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, delivering notable gains in both performance and robustness on the APPS and MBPP benchmarks.

School of Computer Science and Engineering, Ministry of Education Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China, School of Computer Science and Engineering, Ministry of Education Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China, School of Computer Science and Engineering, Ministry of Education Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China, State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China, State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau 999078, China

Abstract: Multimodal fake news detection aims to automatically identify real or fake news, thereby mitigating the adverse effects caused by such misinformation. Although prevailing approaches have demonstrated their effectiveness, challenges persist in crossmodal feature fusion and refinement for classification. To address this, we present a residual-aware compensation network with multi-granularity constraints (RaCMC) for fake news detection, that aims to sufficiently interact and fuse cross-modal features while amplifying the differences between real and fake news. First, a multiscale residual-aware compensation module is designed to interact and fuse features at different scales, and ensure both the consistency and exclusivity of feature interaction, thus acquiring high-quality features. Second, a multi-granularity constraints module is implemented to limit the distribution of both the news overall and the image-text pairs within the news, thus amplifying the differences between real and fake news at the news and feature levels. Finally, a dominant feature fusion reasoning module is developed to comprehensively evaluate news authenticity from the perspectives of both consistency and inconsistency. Experiments on three public datasets, including Weibo17, Politifact and GossipCop, reveal the superiority of the proposed method.

Abstract: Businesses using thirdparty LLMs face privacy risks from exposed prompts. This paper presents Portcullis, a privacy-preserving gateway that safeguards sensitive data while supporting efficient and accurate LLM responses. Portcullis functions as a mediator, anonymizing sensitive data in prompts through parallel substitution, securely interacting with LLMs, and accurately reconstructing responses. It ensures all data processing occurs within secure encrypted memory. The gateway is attested to ensure trustworthiness and protect user privacy. Portcullis is the first of its kind, offering a verifiable and scalable privacy gateway for third-party LLM inferences. We assess Portcullis's efficiency as a confidential container platform, demonstrating that its startup time scales linearly, ensuring scalability. Additionally, we evaluate its runtime performance using the PII and Enron Email Dataset. For masking and unmasking workloads, Portcullis outperforms Hide-and-Seek by 96x speed up, while maintaining equal or better false positive and false negative rates compared to existing solutions. On the Enron dataset, Portcullis achieves notably higher accuracy, surpassing Hide-and-Seek by over 0.1 for GPT-4o mini.

Abstract: The questionable responses caused by knowledge hallucination may lead to LLMs' unstable ability in decisionmaking. However, it has never been investigated whether the LLMs' hallucination is possibly usable for generating negative reasoning to assist fake news detection. In this paper, we propose a novel supervised self-reinforced reasoning rectification approach - SR^3 that not only yields common reasonable reasoning for news but also forces LLMs to generate the wrong understandings of news via LLMs reflection for semantic consistency learning. Upon that, we construct a negative reasoning-based news learning model called - NRFE, which leverages positive or negative news-reasoning pairs for learning the semantic consistency between them. To avoid the impact of label-implicated reasoning, we deploy a student model - NRFE-D that only takes news content as input to inspect the performance of our method by distilling the knowledge from NRFE. The experimental results verified on three popular fake news datasets demonstrate the superiority of our method compared with three kinds of baselines including prompting-based LLMs, fine-tuning-based PLMs, and other representative fake news detection methods.

Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China, Nanjing University of Posts and Telecommunications, Nanjing, China, Peng Cheng Laboratory, Shenzhen, China, Southeast University, Nanjing, China, National University of Singapore, Singapore

Abstract: With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional bytebased traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.

Abstract: Series Section Electron Microscopy (ssEM) is a crucial technique for visualizing threedimensional (3D) biological structures, which involves collecting electron microscopy images from a series of biological sections along the z-axis and reconstructing the 3D structure. 3D registration is an essential step in ssEM, designed to eliminate axial misalignment and nonlinear distortions introduced during sample sectioning. A significant challenge in 3D registration is eliminating nonlinear distortions while preserving natural deformations. In this paper, we present a new formulation of the 3D registration problem from a frequency domain perspective and propose a Gaussian filtering-based 3D registration method, which defines 3D registration as a superposition problem of high-frequency and low-frequency components. We extend the concept of a one-dimensional Gaussian filter to three-dimensional image stacks and integrate it with optical flow networks to consolidate the deformation field within the receptive field. Extensive experiments demonstrate that our method can successfully decouple nonlinear distortions and natural deformations in the frequency domain, proving superior to existing methods in rapidly and accurately eliminating nonlinear distortions and restoring biological structures, and has the potential to be extended to large datasets.

Abstract: Elucidating the functional mechanisms of the primary visual cortex (V1) remains a fundamental challenge in systems neuroscience. Current computational models face two critical limitations, namely the challenge of crossmodal integration between partial neural recordings and complex visual stimuli, and the inherent variability in neural characteristics across individuals, including differences in neuron populations and firing patterns. To address these challenges, we present a multi-modal identifiable variational autoencoder (miVAE) that employs a two-level disentanglement strategy to map neural activity and visual stimuli into a unified latent space. This framework enables robust identification of cross-modal correlations through refined latent space modeling. We complement this with a novel score-based attribution analysis that traces latent variables back to their origins in the source data space. Evaluation on a large-scale mouse V1 dataset demonstrates that our method achieves state-of-the-art performance in cross-individual latent representation and alignment, without requiring subject-specific fine-tuning, and exhibits improved performance with increasing data size. Significantly, our attribution algorithm successfully identifies distinct neuronal subpopulations characterized by unique temporal patterns and stimulus discrimination properties, while simultaneously revealing stimulus regions that show specific sensitivity to edge features and luminance variations. This scalable framework offers promising applications not only for advancing V1 research but also for broader investigations in neuroscience.

College of Computer and Data Science, Fuzhou University; Guangdong Institute of Intelligence Science and Technology, Guangdong Institute of Intelligence Science and Technology University of Electronic Science and Technology of China, Center for Brain Inspired Computing Research, Department of Precision Instrument, Tsinghua University, Guangdong Institute of Intelligence Science and Technology, College of Computer and Data Science, Fuzhou University, School of Computer Science and Technology, Dalian University of Technology, College of Computer and Data Science, Fuzhou University, Guangdong Institute of Intelligence Science and Technology Center for Brain Inspired Computing Research, Department of Precision Instrument, Tsinghua University

Abstract: The rapid evolution of multimedia technology has revolutionized human perception, paving the way for multiview learning. However, traditional multi-view learning approaches are tailored for scenarios with fixed data views, falling short of emulating the intricate cognitive procedures of the human brain processing signals sequentially. Our cerebral architecture seamlessly integrates sequential data through intricate feed-forward and feedback mechanisms. In stark contrast, traditional methods struggle to generalize effectively when confronted with data spanning diverse domains, highlighting the need for innovative strategies that can mimic the brain's adaptability and dynamic integration capabilities. In this paper, we propose a bio-neurologically inspired multi-view incremental framework named MVIL aimed at emulating the brain's fine-grained fusion of sequentially arriving views. MVIL lies two fundamental modules: structured Hebbian plasticity and synaptic partition learning. The structured Hebbian plasticity reshapes the structure of weights to express the high correlation between view representations, facilitating a fine-grained fusion of view representations. Moreover, synaptic partition learning is efficient in alleviating drastic changes in weights and also retaining old knowledge by inhibiting partial synapses. These modules bionically play a central role in reinforcing crucial associations between newly acquired information and existing knowledge repositories, thereby enhancing the network's capacity for generalization. Experimental results on six benchmark datasets show MVIL's effectiveness over state-of-the-art methods.

Abstract: Despite multimodal sentiment analysis being a fertile research ground that merits further investigation, current approaches take up high annotation cost and suffer from label ambiguity, nonamicable to high-quality labeled data acquisition. Furthermore, choosing the right interactions is essential because the significance of intra- or inter-modal interactions can differ among various samples. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal Interaction learning Network for multimodal sentiment analysis. Semi-IIN integrates masked attention and gating mechanisms, enabling effective dynamic selection after independently capturing intra- and inter-modal interactive information. Combined with the self-training approach, Semi-IIN fully utilizes the knowledge learned from unlabeled data. Experimental results on two public datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN, establishing a new state-of-the-art on several metrics.

Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artifcial Intelligence, Renmin University of China, Beijing, China, Department of Psychology, Renmin University of China, Beijing, China

Abstract: Imitating how humans move their gaze in a visual scene is a vital research problem for both visual understanding and psychology, kindling crucial applications such as building alive virtual characters. Previous studies aim to predict gaze trajectories when humans are freeviewing an image, searching for required targets, or looking for clues to answer questions in an image. While these tasks focus on visual-centric scenarios, humans move their gaze also along with audio signal inputs in more common scenarios. To fill this gap, we introduce a new task that predicts human gaze trajectories in a visual scene with synchronized audio inputs and provide a new dataset containing 20k gaze points from 8 subjects. To effectively integrate audio information and simulate the dynamic process of human gaze motion, we propose a novel learning framework called EyEar (Eye moving while Ear listening) based on physics-informed dynamics, which considers three key factors to predict gazes: eye inherent motion tendency, vision salient attraction, and audio semantic attraction. We also propose a probability density score to overcome the high individual variability of gaze trajectories, thereby improving the stabilization of optimization and the reliability of the evaluation. Experimental results show that EyEar outperforms all the baselines in the context of all evaluation metrics, thanks to the proposed components in the learning model.

The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences The School of Artificial Intelligence, University of Chinese Academy of Sciences, The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, The School of Artificial Intelligence, University of Chinese Academy of Sciences The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Department of Information Systems, College of Business, City University of Hong Kong, The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences The School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: Personality identification plays important roles in understanding user behavior and offering foresight ability for downstream applications. The key challenge is how to address the scarcity of labeled personality data. Recently, some studies have adopted data augmentation and prompt learning to perform personality identification. However, they still heavily require a large amount of labeled data to learn an appropriate distance strategy, which limits the generalization and flexibility of the model. This study proposes a knowledgeenhanced hierarchical heterogeneous graph model, which adopts a global multi-view graph node encoding to acquire comprehensive personality features and their inherent associations, where three types of knowledge including part-of-speech (POS) tag, entity, and Linguistic Inquiry and Word Count (LIWC) are introduced. Then, a hierarchical heterogeneous graph with a “post-word-diverse knowledge” structure is constructed for each post to obtain enhanced representation. Finally, a relation guided representation optimization that considers intra-user relationships and inter-label relationships is further developed to learn more discriminative semantic representation. Experimental results on three widely used datasets demonstrate that the model outperforms state-of-the-art methods when training with only 100 samples (approximately 1% of the total data set).

Abstract: AI and ML are poised to provide new insights into mathematical cognition and development. Here, we focus on the domains of geometry and topology (GT). According to one prominent developmental perspective, infants possess core knowledge of GT concepts, presumably underwritten by dedicated neural circuitry. We use the alignment between human cognition and computer vision models to evaluate an alternate proposal: that these concepts are learned “for free” through experience with the visual world. Specifically, we measure the sensitivity of five convolutional neural network (CNN) models to 43 GT concepts that aggregate into seven classes. We focus on CNNs over other architectures (e.g., vision transformers) because their neural plausibility has been established through studies mapping their layers to areas of the brain’s ventral visual stream. We find evidence that the CNNs are sensitive to some classes (e.g., Euclidean Geometry) but not others (e.g., Geometric Transformations). The models’ sensitivity is generally lower at lower layers and maximal at the final fullyconnected layer. Experiments with models from the ResNet family show that increasing model depth does not necessarily increase sensitivity to GT concepts. The models’ profiles of sensitivity to the seven classes roughly align with the profile shown by humans, with ResNet-18 corresponding best to Western adults and DenseNet to Western children ages 3-6 years. This case study shows how CNNs can provide sufficiency proofs for the learnability of mathematical concepts and thus inform theoretical debates in cognitive and developmental science. These findings set the stage for future experiments with other vision model architectures.

School of New Media and Communication, Tianjin University, Tianjin, China Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Abstract: Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear similar. In this paper, our objective is to spotlight emotionrelevant attributes of audio and visual modalities to facilitate multimodal fusion in the context of nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions aimed at accentuating emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to transmute raw audio and visual data into textualized sentiment descriptions, thereby amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. Furthermore, DEVA incorporates the Text-guided Progressive Fusion Module (TPF), leveraging varying levels of text as a core modality guide. This module progressively fuses visual-audio minor modalities to alleviate disparities between text and visual-audio modalities. Experimental results on widely used sentiment analysis benchmark datasets, including MOSI, MOSEI, and CH-SIMS, underscore significant enhancements compared to state-of-the-art models. Moreover, fine-grained emotion experiments corroborate the robust sensitivity of DEVA to subtle emotional variations.

University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China

Abstract: Event cameras encode visual information by generating asynchronous and sparse event streams, which hold great potential for low latency and low power consumption. Despite many successful implementations of event camerabased applications, most of them accumulate the events into frames and then utilize conventional frame-based computer vision algorithms. These frame-based methods, though typically effective, diminish the inherent advantages of the event camera's low latency and low power consumption. To solve the above problems, we propose ASGCN, which efficiently processes data on an event-by-event basis and dynamically evolves into a corresponding dynamic representation, enabling low latency and high sparsity of data representation. The sparsity computation is further improved by introducing brain-inspired spiking neural networks, resulting in low power consumption for ASGCN. Extensive and diverse experiments demonstrate the energy efficiency and low latency advantages of our processing pipeline. Especially on real-world event camera datasets, our pipeline consumes more than 10,000 times less energy and achieves similar performance compared to current frame-based methods.

Abstract: Symbolic Regression of Integer Sequences (SRIS) aims to discover precise mathematical formulas from integer sequences. The neural machine translationbased method of SRIS trains the model using randomly generated data, and directly utilizes the trained model for inference on target sequences. However, the method often fails to effectively generalize to the target sequence, since the randomly generated data can not adequately cover the distributions of target data, i.e., there are distribution differences between them. In this work, we propose a progressive self-learning (PSL) method to explicitly capture sequence-formula distributions of the target domain. Specifically, a source domain dataset is generated by incorporating initial terms of the target domain to reduce the sequence distribution gap between the source domain and the target domain. Meanwhile, a self-learning loop strategy is adopted to improve the ability of the model to capture the sequence-formula distribution of the target domain. In this strategy, a neural machine translation model is used to learn the mappings from sequences to formulas in an end-to-end fashion. Then, this model is employed to explore candidate formulas of the target sequence using beam search. After verifying these candidate formula correctness, some of them are retained as training data for the next learning. Experimental results on OEIS datasets demonstrate that the proposed method surpasses current state-of-the-art methods in accuracy, and also discovers new formulas.

Abstract: Iterative selfimprovement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multimodal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach enhances the self-judge’s focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, the ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.

Abstract: Highresolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. Processing such a large number of visual tokens poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget for each partition accordingly. The most informative visual tokens from each partition within the allocated budget are then selected and passed to the subsequent Large Language Model (LLM). We showed that HiRED achieves superior accuracy and performance, compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits.

Abstract: With the rapid advancement of 3D scanning technology, point clouds have become a crucial data type in computer vision and machine learning. However, learning robust representations for point clouds remains a significant challenge due to their irregularity and sparsity. In this paper, we propose a novel Dual Manifold Regularization (DMR) framework that makes full use of the properties of positive and negative curvature in manifolds to improve the representation of point clouds. Specifically, we leverage DMR based on hyperbolic and hyperspherical manifolds to address the limitations of traditional singlemanifold regularization techniques, including inadequate generalization ability and adaptability to data diversity, as well as the difficulty of capturing complex relationships between data. To begin, we utilize the tree-like structure of the hyperbolic manifold to model the part-whole hierarchical relationships within point clouds. This allows for a more comprehensive representation of the data, improving the model's capability to understand complex shapes. Additionally, we construct positive samples through topological consistency augmentation and employ contrastive learning techniques in the hyperspherical manifold to capture more discriminative features within the data. Our experimental results show that our method outperforms traditional supervised learning and single-manifold regularization techniques in point cloud analysis. Specifically, for shape classification, DMR achieves a new State-Of-The-Art (SOTA) performance with 94.8% Overall Accuracy (OA) on ModelNet40 and 90.7% OA on ScanObjectNN, surpassing the recent SOTA model without increasing the baseline parameters.

Abstract: Textto-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC) that quantifies this variability and propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temp, cross, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning but enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.

Abstract: Goaloriented visual dialogue involves multi-round interaction between artificial agents, which has been of remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide the generation of questions, resulting in the randomness in the search process and inconvergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides the question generation by excluding half of the current candidate objects in each round. The above process is implemented by maximizing a binary reward inspired by the ``divide-and-conquer'' paradigm. We further design a candidate-minimization reward which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method can enable the agents to achieve high task-oriented accuracy with fewer repeating questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE facilitates agents to generate higher-quality questions.

Abstract: Human skeletonbased action recognition has long been an indispensable aspect of artificial intelligence. Current state-of-the-art methods tend to consider only the dependencies between connected skeletal joints, limiting their ability to capture non-linear dependencies between physically distant joints. Moreover, most existing approaches distinguish action classes by estimating the probability density of motion representations, yet the high-dimensional nature of human motions invokes inherent difficulties in accomplishing such measurements. In this paper, we seek to tackle these challenges from two directions: (1) We propose a novel dependency refinement approach that explicitly models dependencies between any pair of joints, effectively transcending the limitations imposed by joint distance. (2) We further propose a framework that utilizes the Hilbert-Schmidt Independence Criterion to differentiate action classes without being affected by data dimensionality, and mathematically derive learning objectives guaranteeing precise recognition. Empirically, our approach sets the state-of-the-art performance on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.

Abstract: Acute lymphoblastic leukemia is a childhood cancer prevalent worldwide, which can prove fatal within weeks or months. However, current diagnosis models based on machine learning and deep learning methods fail to consider device noise (pixellevel perturbations) and rotation/translation (spatial-transformed perturbations), which can undermine the model's robustness. Adversarial training is a potential solution to this issue. This paper presents a hybrid perturbation adversarial training (HPAT) strategy that leverages two types of adversarial samples: pixel-level adversarial samples and spatial adversarial samples. This work generates these hybrid adversarial samples through Projected Gradient Descent (PGD) in couple with spatial transformation based on the Bayesian optimization (STBO) algorithm, respectively. This work introduced the Mixed Batch Normalization (MixBN) module to handle both adversarial samples and clean samples, alleviating the problem of clean accuracy degradation due to adversarial training. The proposed hybrid adversarial training strategy is tested on the public acute lymphoblastic leukemia dataset and found that it outperformed existing acute lymphoblastic cell classification models.

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Science, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences The Hong Kong Polytechnic University, The Hong Kong University of Science and Technology, Shanghai Artificial Intelligence Laboratory Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Shanghai Artificial Intelligence Laboratory

Abstract: With the prevalence of Multimodal Large Language Models(MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multimodal video understanding is critical to interactively analyze what will happen in the procedure of autonomous driving. However, videos in such a dynamical scene that often contains complex spatial-temporal movements, which restricts the generalization capacity of the existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, including Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structure state space models, which can effectively capture multi-granularity video context for different temporal resolution. Second, Q-Mamba flexibly transforms the current frame as the learnable query, and attentively select multi-granularity video context into query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows the remarkable performance on multi-modal video tasks in autonomous driving, e.g., for risk object detection, it outperforms the previous SOTA method with 5.5% mIoU improvement.

State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China Zhengzhou University Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou, China, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China, Department of Management Science and Systems, University at Buffalo, Buffalo, New York, United States, State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China

Abstract: Dense video captioning (DVC) aims to describe multiple events within a video, and its performance is greatly affected by the accuracy of video event detection. Video event detection involves predicting the proposal boundaries (start and end times) and the classification score of each event in a video. Recently, a few methods have applied diffusion models originally designed for image object detection to detect events in DVC. These methods add noise to the groundtruth event proposal boundaries, and subsequently learn the denoising process. However, these methods often overlook the fundamental differences between videos and images. We observe that, whereas in images the important information for object classification is normally around the boundaries of the ground-truth boxes, in videos the key information for event classification is typically centered in the middle of ground-truth event proposals. As a result, the classification module in these existing diffusion models becomes insensitive to boundary changes introduced by the added noise, leading to sub-optimal performance. This paper introduces DiffDVC, an innovative diffusion model for DVC. The core of DiffDVC is a boundary-sensitive detector. The detector increases the sensitivity of the classification module to boundary changes by focusing on frames within a specific range around the start and end times of noisy event proposals. Additionally, this range is dynamically adjusted to suit different event proposals. Comprehensive experiments on ActivityNet-1.3, ActivityNet Captions, and YouCook2 datasets show DiffDVC achieving superior performance.

Abstract: The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and location (IMDL). However, the lack of a largescale data foundation makes the IMDL task unattainable. In this paper, we build a local manipulation data generation pipeline that integrates the powerful capabilities of SAM, LLM, and generative models. Upon this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale, GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content, GIM encompasses a broad range of image classes. 3) Diverse generative manipulation, the images are manipulated images with state-of-the-art generators and various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce the GIM benchmark with two settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on the GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.

Abstract: In structured light systems, the accuracy of measurement notably diminishes when assessing complex texture objects, especially encountering boundaries between various colors. To address this challenge, this paper meticulously analyzes and establishes an error model, elaborating the correlation between phase errors and the gradients of phase and grayscale. Based on this analysis, a novel high-precision method is proposed for measuring complex texture objects via bidirectional fringe projection. This approach firstly leverages horizontal and vertical fringe projections to derive bidirectional phase information and calculates the angles between the tangent of the texture edges and the phase gradient. Subsequently, a refined temporal phase correction algorithm is formulated based on the epipolar matching algorithm and the devised error model, effectively mitigating numerical instability issues within the algorithm and significantly reducing errors of bidirectional phases. Ultimately, corrected point clouds are calculated based on bidirectional phases, and the obtained point clouds are merged to further diminish phase errors. Comparison experiments indicate that this method can reduce Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by 65.74% and 67.75%, respectively. Compared to existing methods, it improves performance by 27.29% and 33.74%, respectively, demonstrating superior performance.

Abstract: Selfsupervised video denoising aims to remove noise from videos without relying on ground truth data, leveraging the video itself to recover clean frames. Existing methods often rely on simplistic feature stacking or apply optical flow without thorough analysis. This results in suboptimal utilization of both inter-frame and intra-frame information, and it also neglects the potential of optical flow alignment under self-supervised conditions, leading to biased and insufficient denoising outcomes. To this end, we first explore the practicality of optical flow in the self-supervised setting and introduce a SpatioTemporal Blind-spot Network (STBN) for global frame feature utilization. In the temporal domain, we utilize bidirectional blind-spot feature propagation through the proposed blind-spot alignment block to ensure accurate temporal alignment and effectively capture long-range dependencies. In the spatial domain, we introduce the spatial receptive field expansion module, which enhances the receptive field and improves global perception capabilities. Additionally, to reduce the sensitivity of optical flow estimation to noise, we propose an unsupervised optical flow distillation mechanism that refines fine-grained inter-frame interactions during optical flow alignment. Our method demonstrates superior performance across both synthetic and real-world video denoising datasets.

Abstract: Diffusion models have demonstrated remarkable synthesis quality and diversity in generating cospeech gestures. However, the computationally intensive sampling steps associated with diffusion models hinder their practicality in real-world applications. Hence, we present DIDiffGes, for a Decoupled Semi-Implicit Diffusion model-based framework, that can synthesize high-quality, expressive gestures from speech using only a few sampling steps. Our approach leverages Generative Adversarial Networks (GANs) to enable large-step sampling for diffusion model. We decouple gesture data into body and hands distributions and further decompose them into marginal and conditional distributions. GANs model the marginal distribution implicitly, while L2 reconstruction loss learns the conditional distributions exciplictly. This strategy enhances GAN training stability and ensures expressiveness of generated full-body gestures. Our framework also learns to denoise root noise conditioned on local body representation, guaranteeing stability and realism. DIDiffGes can generate gestures from speech with just 10 sampling steps, without compromising quality and expressiveness, reducing the number of sampling steps by a factor of 100 compared to existing methods. Our user study reveals that our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness.

Abstract: Temporally locating objects with arbitrary class texts is the primary pursuit of openvocabulary Video Instance Segmentation (VIS). Because of the insufficient vocabulary of video data, previous methods leverage the image-text pretraining model for recognizing object instances by separately aligning each frame with class texts. As a result, the separation breaks the instance movement context of videos and requires a lot of inference overhead. To tackle these issues, we propose BridgeText Alignment (BTA) to link frame-level instance representations as a Brownian Bridge. On one hand, we can calculate the global descriptor of a Brownian bridge for capturing instance dynamics, which enables extra considering temporal information rather than only static information of each frame for aligning with texts. On the other hand, according to the goal-conditioned property of the Brownian bridge, we can estimate the middle frame features via the start and the end frame features so the global feature calculation of a Brownian bridge only needs to infer a few frames, which largely reduces inference overhead. We term our overall pipeline as BriVIS. Following the training settings of previous works, BriVIS surpasses the SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary datasets (BURST, LVVIS), BriVIS achieves 5.7 and 20.9 mAP, which exhibits +2.2∼+6.7 mAP improvement compared to OV2Seg. Furthermore, after training via BTA, using only the head and the tail frames for alignment improves the speed by 32% (2.77 → 1.88 s/iter) while just decreasing the performance by 0.2 mAP (21.1 → 20.9 mAP).

Abstract: Smartphone cameras are ubiquitous in daily life, yet their performance can be severely impacted by dirty lenses, leading to degraded image quality. This issue is often overlooked in image restoration research, which assumes ideal or controlled lens conditions. To address this gap, we introduced SIDL (Smartphone Images with Dirty Lenses), a novel dataset designed to restore images captured through contaminated smartphone lenses. SIDL contains diverse realworld images taken under various lighting conditions and environments. These images feature a wide range of lens contaminants, including water drops, fingerprints, and dust. Each contaminated image is paired with a clean reference image, enabling supervised learning approaches for restoration tasks. To evaluate the challenge posed by SIDL, various state-of-the-art restoration models were trained and compared on this dataset. Their performances achieved some level of restoration but did not adequately address the diverse and realistic nature of the lens contaminants in SIDL. This challenge highlights the need for more robust and adaptable image restoration techniques for restoring images with dirty lenses.

Abstract: The common occurrence of occlusioninduced incompleteness in point clouds has made point cloud completion (PCC) a highly-concerned task in the field of geometric processing. Existing PCC methods typically produce complete point clouds from partial point clouds in a coarse-to-fine paradigm, with the coarse stage generating entire shapes and the fine stage improving texture details. Though diffusion models have demonstrated effectiveness in the coarse stage, the fine stage still faces challenges in producing high-fidelity results due to the ill-posed nature of PCC. The intrinsic contextual information for texture details in partial point clouds is the key to solving the challenge. In this paper, we propose a high-fidelity PCC method that digs into both short and long-range contextual information from the partial point cloud in the fine stage. Specifically, after generating the coarse point cloud via a diffusion-based coarse generator, a mixed sampling module introduces short-range contextual information from partial point clouds into the fine stage. A surface freezing module safeguards points from noise-free partial point clouds against disruption. As for the long-range contextual information, we design a similarity modeling module to derive similarity with rigid transformation invariance between points, conducting effective matching of geometric manifold features globally. In this way, the high-quality components present in the partial point cloud serve as valuable references to refine the coarse point cloud with high fidelity. Extensive experiments have demonstrated the superiority of the proposed method over SOTA competitors.

Abstract: Implicit Neural Representation (INR) methods have demonstrated great potential in arbitraryscale super-resolution tasks. This success is primarily due to their ability to continuously represent images using coordinates. In the task of remote sensing image fusion, INR methods have also shown promising applications. However, the previous INR methods neglect channel-wise modeling, while sharing a single kernel across all channels at each position, resulting in a lack of sensitivity to data specificity. To address these issues, we propose the OcTree Implicit Adaptive Sampling (OTIAS) method, which innovatively applies the octree structure to restore data from both horizontal and vertical directions, effectively incorporating spatial and spectral information from hyperspectral data. Additionally, we introduce a novel method to adaptively generate interpolation kernels based on coordinates. This approach efficiently produces customized interpolation kernel parameters for octree nodes, tailored to different spectral information. Overall, our method achieves state-of-the-art performance on the CAVE and Harvard datasets with 4× and 8× scaling factors, outperforming existing approaches.

University of Chinese Academy of Sciences Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing Institute of Technology, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences Peng Cheng Laboratory, Huawei Technologies Ltd., Huawei Technologies Ltd., Huawei Technologies Ltd., University of Chinese Academy of Sciences Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences Peng Cheng Laboratory

Abstract: Personalized image generation enables customized content creation based on the textto-image diffusion models.However, existing personalization methods focus on fine-tuning generative models to learn to generate specific single individuals or concepts, such as an image of a specific Corgi, but are unable to generate data for multiple individuals or concepts with common characteristics, such as images of multiple different Corgis. In this work, we focus on personalizing a diffusion model to generated varied data usually containing multiple subjects, which has a more diverse and complex data distribution. Our basic assumption is that the varied data distribution is composed of the common features shared among all samples, as well as the reasonable variations within it. Accordingly, we are capable to decompose the learning process of complex data distributions into two simpler sub-tasks, employing a divide-and-conquer approach. To this end we propose Dis2Booth, a framework that can learn complex image Distribution by Disentangling data distribution in an unsupervised manner.Specifically, Dis2Booth contains two modules, Anchor LoRA and Delta LoRA, that are tasked with learning the common features and variational features constrained by Contextual Loss and Delta Loss unsupervisedly. Besides, the Asynchronous Optimization Strategy is proposed to ensure the collaborative training of the two modules. Extensive experiments suggest that Dis2Booth is able to learn the data distribution with higher diversity and complexity while maintaining the same level of flexibility as LoRA.

School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing

Abstract: 3D object detection in point clouds is critical in 3D computer vision, autonomous driving, and robotics. Existing pointbased detectors, tailored to handle unstructured raw point clouds, often rely on simplistic sampling strategies to select a subset of points for local representation learning and detection. However, the diverse patterns exhibited by multiple types of point cloud data present a significant challenge to the universality of current detectors, particularly those captured by varied sensors (e.g., LiDAR and 4D Imaging Radar). In response to this challenge, we introduce an adaptable point-based single-stage 3D detector, AS-Det, engineered to excel on both LiDAR and 4D Radar point clouds. Specifically, we propose a novel active sampling strategy that actively mines object-related information to achieve efficient sampling and representation across different types of point clouds through end-to-end training. Additionally, we introduce a lightweight multi-scale center feature aggregation module to exploit multi-scale object context for precise and low-cost detection. By integrating the abovementioned modules, AS-Det achieves highly adaptive detection on various point clouds, encompassing different sensors and scales. Experimental results demonstrate the superior performance and adaptability of AS-Det on both LiDAR and 4D Radar point clouds.

Abstract: A good garment tryon model should learn the transfer between different types of garments while satisfying: 1) high fidelity and 2) low inference speed. Existing methods address either of these two issues, limited processing speed or low generation quality. We directly use a lightweight encoder-decoder, ensuring faster speeds. To tackle the problem of lower image quality typically generated by lighter models, we present GarFast, a simplified, parser-free framework that optimizes the same lightweight network through a two-stage transformation of real data roles (from input to supervision), thereby greatly promoting model convergence. Specifically, first, we propose a correction strategy to prevent the difficulty of convergence caused by the lack of ground truth in the first stage. Second, we propose a fine-grained domain consistency to ensure that the results generated in the unsupervised first stage are highly realistic clothed human images. Finally, we propose a skin-variant refinement loss and a skinMix regularization to amplify texture differences and enhance the realism of skin-variant regions, thereby improving the quality of the generated skin. Extensive experiments thoroughly demonstrate that our method achieves high resolution, near real-time performance, and superior reconstruction quality compared to state-of-the-art approaches, with processing times of less than 0.03 seconds on an Nvidia A100.

Abstract: Crossmodal retrieval, as an emerging field within multimedia research, has gained significant attention in recent years. Unsupervised cross-modal hashing methods are attractive due to their ability to capture latent relationships within the data without label supervision and to produce compact hash codes for high search efficiency. However, the text modality exhibits worse representation ability compared with the image modality, leading to weak guidance to construct the joint similarity matrix. Moreover, most unsupervised cross-modal hashing methods are based on pairwise similarities for training, resulting in non-aggregating data distribution in the hash space. In this paper, we propose a novel Vision-guided Text Mining for Unsupervised Cross-modal Hashing via Community Similarity Quantization, termed VTM-UCH. Specifically, we first find the one-to-one correspondence between each word and each vision (image or object) based on the Contrastive Language-Image Pre-training (CLIP) model and compute the text similarities according to the clustering of their corresponding visions. Then, we define the fine-grained object-level image similarities and design the joint similarity matrix based on the text and image similarities. Accordingly, we construct an undirected graph to compute the communities as the pseudo-centers and adjust the pairwise similarities to improve the hash codes distribution. The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines.

Abstract: It is widely known that stateof-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly relevant to develop capabilities to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certify the performance of machine learning models in the presence of adversarial attacks with population level risk guarantees. In particular, we introduce the notion of (α,ζ)-safe machine learning model. We propose a hypothesis testing procedure, based on the availability of a calibration set, to derive statistical guarantees providing that the probability of declaring that the adversarial (population) risk of a machine learning model is less than α (i.e. the model is safe), while the model is in fact unsafe (i.e. the model adversarial population risk is higher than α), is less than ζ. We also propose Bayesian optimization algorithms to determine efficiently whether a machine learning model is (α,ζ)-safe in the presence of an adversarial attack, along with statistical guarantees. We apply our framework to a range of machine learning models - including various sizes of vision Transformer (ViT) and ResNet models - impaired by a variety of adversarial attacks, such as PGDAttack, MomentumAttack, GenAttack and BanditAttack, to illustrate the operation of our approach. Importantly, we show that ViT's are generally more robust to adversarial attacks than ResNets, and large models are generally more robust than smaller models. Our approach goes beyond existing empirical adversarial risk-based certification guarantees. It formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.

Abstract: Visual cross view geolocalization is generally approached within a joint retrieval-and-calibration framework. However, existing methods overlook semantic ambiguities arising from query and reference images characterized by low overlap, dynamic foregrounds, viewpoint changes, and perceptual aliasing. This makes it challenging to automatically control the relative importance of the two tasks, potentially compromising the retrieval task in favor of the offset regression. Consequently, the model may encounter conflicting dominating gradients during joint training. To address this, we propose to model the semantic ambiguity during the offset regression process by integrating associated uncertainty scores, represented as 2D Gaussian distributions, to mitigate negative transfer effects within the joint tasks. We further introduce an uncertainty-aware similarity metric to enhance similarity assessment between query and reference images, accounting for their semantic ambiguities. This metric propagates uncertainty scores into the retrieval task, focusing on certain samples and learning discriminative feature embeddings, allowing the model to adaptively handle conflicting dominating gradients during joint training. Extensive experiments demonstrate that our method improves the overall performance of the joint tasks, achieving state-of-the-art results on the VIGOR and CVACT datasets.

Abstract: Multimodal Federated Learning (MFL) is a distributed machine learning paradigm that enables multiple participants with multi-modal data to collaboratively train a global model for multi-modal tasks without sharing their local data. MFL typically deploys the trained global model as an Embedding-as-a-Service (EaaS), allowing participants to obtain embeddings for downstream tasks. However, it increases the risk of unauthorized copying and leakage of the model. Protecting the ownership of the MFL model while maintaining model performance is challenging. In this paper, we propose the first general model ownership protection framework for MFL, named MFL-Owner. MFL-Owner decouples the watermarking process from the model training process and addresses both ownership verification and traceability, effectively safeguarding the interests of the MFL collective. MFL-Owner leverages the concept of orthogonal transformations by incorporating a linear transformation matrix with orthogonal constraints into the model, achieving high-quality ownership verification and traceability with minimal impact on model performance. To enhance the practicality of the watermark and prevent conflicts among multiple clients during tracing, we propose a trigger dataset selection method based on out-of-distribution data combined with Gaussian noise perturbation. Our experiments on multiple datasets demonstrate that MFL-Owner is effective for model ownership verification and traceability for MFL.

Abstract: Incontext learning (ICL) advances Large Language Models (LLMs) exhibiting emergent ability on downstream tasks without updating billions of parameters. However, in the area of multimodal Large Language Models (MLLMs), two problems hinder the application of multimodal ICL: (1) Most primary MLLMs are only trained on single-image datasets, making them unable to read extra multimodal demonstrations. (2) With the demonstrations increasing, thousands of visual tokens highly challenge hardware and degrade ICL performance. During preliminary explorations, we discovered that the inner LLM focuses more on the linguistic modality within multimodal demonstrations during generation. Therefore, we propose a general and lightweight framework AIM to tackle the mentioned problems through Aggregating Image information of Multimodal demonstrations to the latent space of the corresponding textual labels. After aggregation, AIM substitutes each demonstration with generated fused virtual tokens whose length is reduced to the same as its texts. Except for shortening input length, AIM further upgrades MLLMs pre-trained on image-text pairs to support multimodal ICL, as images from demonstrations are disregarded. Furthermore, benefiting from aggregating different demonstrations independently, AIM configures Demonstration Bank (DB) to avoid repeated aggregation, which significantly boosts model efficiency. We build AIM upon QWen-VL and LLaVA-Next, and AIM is comprehensively evaluated on image caption, VQA, and hateful speech detection. Outstanding results reveal that AIM provides an efficient and effective solution in upgrading MLLMs for multimodal ICL.

Abstract: WeaklySupervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a great challenge in accurately locating the temporal location of event, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm by complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module generates differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that captions generated from positively and negatively masked videos are complementary, thereby forming a complete video description. In this way, even under weak supervision, the event location and event caption can be aligned implicitly. Extensive experiments on the public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.

Abstract: In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective selfsupervised pre-training strategies. For example, MAMP shows that instead of following the prevalent masked joint reconstruction, explicit masked motion reconstruction is key to the success of learning effective feature representation for 3D action recognition. However, we find that if we make a simple and effective change to the reconstructed target of masked joint reconstruction, masked joint reconstruction can achieve the same results as masked motion reconstruction. The devil is in the special characteristic of 3D skeleton data and the normalization process of training targets. We need to dig for all effective information of targets during normalization. Besides, considering that mask data reconstruction focuses more on learning local relations in input data for fulfilling the reconstruction task, instead of modeling the relation among samples, we further employ contrastive learning to learn more discriminative 3D action representations. We show that contrastive learning can consistently boost the performance of model pre-trained by masked joint prediction under various settings, especially in the semi-supervised setting that has a very limited number of labeled samples. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed pre-training strategy achieves state-of-the-art results without bells and whistles.

Abstract: Point clouds have become the preferred data format for a variety of tasks in 3D vision and graphics. However, raw point clouds often contain significant noise. This paper introduces the Adaptive Stop Denoising Network (ASDN), a novel approach aimed at restoring highquality point clouds from noisy data. Our method is built upon a pivotal observation: during the denoising phase, high-noise points draw more focus from the network, which may suppress the points that have already been effectively denoised. This observation has led us to develop an adaptive strategy that ceases denoising already cleaned points to prevent over-denoising, while continuing to refine points that remain noisy. We employ a U-Net architecture complemented by an adaptive classifier, which utilizes a recoverability factor to assess the completion of denoising and make dynamic decisions about when to halt the process. Our method not only demonstrates superior noise removal efficiency but also preserves geometric details more effectively, reducing over- or under-denoising artifacts. Extensive experiments and evaluations demonstrate that our method outperforms the state-of-the-art both qualitatively and quantitatively.

Abstract: Current collaborative perception methods often rely on fully annotated datasets, which can be expensive to obtain in practical situations. To reduce annotation costs, some works adopt sparsely supervised learning techniques and generate pseudo labels for the missing instances. However, these methods fail to achieve an optimal confidence threshold that harmonizes the quality and quantity of pseudo labels. To address this issue, we propose an endto-end Collaborative perception Dual Teacher-Student framework (CoDTS), which employs adaptive complementary learning to produce both high-quality and high-quantity pseudo labels. Specifically, the Main Foreground Mining (MFM) module generates high-quality pseudo labels based on the prediction of the static teacher. Subsequently, the Supplement Foreground Mining (SFM) module ensures a balance between the quality and quantity of pseudo labels by adaptively identifying missing instances based on the prediction of the dynamic teacher. Additionally, the Neighbor Anchor Sampling (NAS) module is incorporated to enhance the representation of pseudo labels. To promote the adaptive complementary learning, we implement a staged training strategy that trains the student and dynamic teacher in a mutually beneficial manner. Extensive experiments demonstrate that the CoDTS effectively ensures an optimal balance of pseudo labels in both quality and quantity, establishing a new state-of-the-art in sparsely supervised collaborative perception.

Abstract: Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied realworld scenarios, and the performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been confirmed to maintain the capability to generate diverse, high-quality images. This success suggests a strong potential of the models to effectively understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsic then can be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance the performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrates that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsic and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.

Abstract: Occupancy prediction plays a pivotal role in autonomous driving (AD) due to its capabilities of finegrained 3D perception and general object recognition. However, existing methods often incur high computational costs, which conflict with AD's real-time demand. To this end, we redirect the focus from accuracy only to both accuracy and efficiency. By conducting a head-to-head comparison of existing methods, we find it challenging to balance accuracy and efficiency. We identify a core issue for this challenge: the strong coupling between geometry and semantics. Specifically, the predicted geometric structure (e.g., depth) guides the projection of 2D image features into 3D voxel space, which significantly affects feature discriminability and subsequent semantic learning. To address this issue, we focus on two key aspects: model design and learning strategies. 1) For model design, we propose a dual-branch network that disentangles the representation of geometry and semantics. The voxel branch utilizes a novel re-parameterized large-kernel 3D convolution to refine geometric structure efficiently, while the BEV branch employs temporal fusion and BEV encoding for efficient semantic learning. 2) For learning strategies, we propose to separate geometric learning from semantic learning by the mixup of ground-truth and predicted depths. Our method achieves 39.4% mIoU at 20 FPS on Occ3D-nuScenes, showcasing a state-of-the-art balance between accuracy and efficiency.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China., Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China., Department of Information Science, Tsinghua university., Tencent Youtu Lab, Shanghai 200233, China., Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.

Abstract: Superresolution (SR) techniques are critical for enhancing image quality, particularly in scenarios where high-resolution imagery is essential yet limited by hardware constraints. Existing diffusion models for SR have relied predominantly on Gaussian models for noise generation, which often fall short when dealing with the complex and variable texture inherent in natural scenes. To address these deficiencies, we introduce the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF). BUFF distinguishes itself by incorporating a Bayesian network to generate high-resolution uncertainty masks. These masks guide the diffusion process, allowing for the adjustment of noise intensity in a manner that is both context-aware and adaptive. This novel approach not only enhances the fidelity of super-resolved images to their original high-resolution counterparts but also significantly mitigates artifacts and blurring in areas characterized by complex textures and fine details. The model demonstrates exceptional robustness against complex noise patterns and showcases superior adaptability in handling textures and edges within images. Empirical evidence, supported by visual results, illustrates the model's robustness, especially in challenging scenarios, and its effectiveness in addressing common SR issues such as blurring. Experimental evaluations conducted on the DIV2K dataset reveal that BUFF achieves a notable improvement, with a +0.61 increase compared to baseline in SSIM on BSD100, surpassing traditional diffusion approaches by an average additional +0.20dB PSNR gain. These findings underscore the potential of Bayesian methods in enhancing diffusion processes for SR, paving the way for future advancements in the field.

Abstract: Recent advancements in online Video Instance Segmentation (VIS) methods show notable performance improvements across benchmarks. However, the leading methods in the trackingby-detection paradigm often result in temporally inconsistent predictions at both instance-level and pixel-level that lead to visually unsatisfactory outcomes. To address these challenges, we propose RoCoVIS, a simple yet effective approach that integrates segmentation and tracking to provide consistent online VIS. Our approach is an end-to-end sequential learning where object queries are propagated through mask predictions, improving the accuracy of temporal instance mapping at the pixel level. Additionally, we propose a new label assignment criterion in harmony with our approach. We also examine the limitations and challenges presented by the current standard evaluation protocol (AP) and suggest adopting additional metrics, Tube-Boundary AP and AP_Pool. RoCoVIS demonstrates superior performance on challenging VIS benchmarks with a Swin-L backbone and shows competitive results when employing a ResNet-50 backbone. By employing Tube-Boundary AP and AP_Pool as metrics to measure mask accuracy and consistency, RoCoVIS outperforms its counterpart, GenVIS, on the HQ-YTVIS and VIPSeg.

Abstract: Adversarial attacks poses a significant threat to the security of AIbased systems. To counteract these attacks, adversarial training (AT) and ensemble learning (EL) have emerged as widely adopted methods for enhancing model robustness. However, a counter-intuitive phenomenon arises where the simple combination of these approaches may potentially compromising adversarial robustness of ensemble models. In this paper, we propose a novel method called Alignment and Unlearning for Training Ensembles (AUTE), aiming to effectively integrate AT and EL to maximize their benefits. Specifically, AUTE incorporates two key components. Firstly, AUTE divides the ensemble into a big peer model and a single member in a loop manner, aligning their outputs for boosting robustness of each member. Secondly, AUTE introduces the concept of unlearning, actively forgetting specific data with over-confident properties to preserve model capacity to learn more robust features. Extensive experiments across various datasets and networks illustrate that AUTE achieves superior performance compared to baselines. For instance, a 5-member AUTE with ResNet-20 networks outperforms state-of-the-art method by 2.1% and 3.2% in classifying clean and adversarial data. Additionally, AUTE can easily extend to non-adversarial training paradigm, surpassing current standard ensemble learning methods by a large margin.

School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University

Abstract: Chart understanding enables automated data analysis for humans, which requires models to achieve highly accurate visual comprehension. While existing Visual Language Models (VLMs) have shown progress in chart understanding, the lack of highquality training data and comprehensive evaluation benchmarks hinders VLM chart comprehension. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs' capabilities in real-world chart comprehension. We also propose EvoChart-QA, a noval benchmark for measuring models' chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a unique self-training data synthesis approach that simultaneously produces high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions that focus on chart understanding. Experimental results on various open-source and proprietary VLMs tested on EvoChart-QA demonstrate that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy. Moreover, the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding tasks, achieving 54.2% accuracy on EvoChart-QA.

Abstract: In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on imagelevel understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric.

Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China School of Automation, Northwestern Polytechnical University, Xi’an, China, Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China School of Automation, Northwestern Polytechnical University, Xi’an, China, Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China, School of Computer Science, Northwestern Polytechnical University, Xi’an, China, School of Software, Northwestern Polytechnical University, Xi’an, China, Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China

Abstract: Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Finegrained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., ``salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes, as well as two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by SeFAR could largely promote the ability of multimodal models to understand fine-grained and domain-specific semantics.

Abstract: Humanobject interaction (HOI) detectors with popular query-transformer architecture have achieved promising performance. However, accurately identifying uncommon visual patterns and distinguishing between ambiguous HOIs continue to be difficult for them. We observe that these difficulties may arise from the limited capacity of traditional detector queries to represent diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates category distributions from various prompts. It then incorporates HOI queries with category distributions, making them capable of representing near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector demonstrates competitive performance on HICO-DET and vcoco benchmarks. Additionally, our method can be integrated into most transformer-based HOI detectors, significantly enhancing their performance with minimal additional parameters.

Abstract: Diffusion models are prominent in image generation for producing detailed and realistic images from Gaussian noises. However, they often encounter instability issues in image restoration tasks, e.g., superresolution. Existing methods typically rely on multiple runs to find an initial noise that produces a reasonably restored image. Unfortunately, these methods are computationally expensive and time-consuming without guaranteeing stable and consistent performance. To address these challenges, we propose a novel Predictive Noise Fusion Strategy (PNFS) that predicts pixel-wise errors in the restored image and combines different noises to generate a more effective noise. Extensive experiments show that PNFS significantly improves the stability and performance of diffusion models in super-resolution, both quantitatively and qualitatively. Furthermore, PNFS can be flexibly integrated into various diffusion models to enhance their stability.

Abstract: Portable Document Format (PDF) files are dominantly used for storing and disseminating scientific research, legal documents, and tax information. LaTeX is a popular application for creating PDF documents. Despite its advantages, LaTeX is not WYSWYGwhat you see is what you get, i.e., the LaTeX source and rendered PDF images look drastically different, especially for formulae and tables. This gap makes it hard to modify or export LaTeX sources for formulae and tables from PDF images, and existing work is still limited. First, prior work generates LaTeX sources in a single iteration and struggles with complex LaTeX formulae. Second, existing work mainly recognizes and extracts LaTeX sources for formulae; and is incapable or ineffective for tables. This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition. Specifically, we propose delta-view as feedback, which compares and pinpoints the differences between a pair of rendered images of the extracted LaTeX source and the expected correct image. Such delta-view feedback enables our fault localization model to localize the faulty parts of the incorrect recognition more accurately and enables our LaTeX refinement model to repair the incorrect extraction more accurately. LATTE improves the LaTeX source extraction accuracy of both LaTeX formulae and tables, outperforming existing techniques as well as GPT-4V by at least 7.07% of exact match, with a success refinement rate of 46.08% (formula) and 25.51% (table).

Abstract: The rapid advancement of pretrained textdriven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing increases, new challenges emerge especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistakes mask as the object shape prior, which struggle to achieve a seamless integration result. The mostly used inversion noise initialization also hinders the identity consistency towards the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as the energy function guidance conditioned by reference text-image pairs. A coarse-to-fine strategy is proposed that employs text energy guidance at the early stage to achieve a natural transition toward the target class and uses point-to-point feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce the latent space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels in object replacement even with a large domain gap, highlighting its potential for high-quality, personalized image editing.

Abstract: It is vital to recover 3D geometry from multiview RGB images in many 3D computer vision tasks. The latest methods infer the geometry represented as a signed distance field by minimizing the rendering error on the field through volume rendering. However, it is still challenging to explicitly impose constraints on surfaces for inferring more geometry details due to the limited ability of sensing surfaces in volume rendering. To resolve this problem, we introduce a method to infer signed distance functions (SDFs) with a better sense of surfaces through volume rendering. Using the gradients and signed distances, we establish a small surface patch centered at the estimated intersection along a ray by pulling points randomly sampled nearby. Hence, we are able to explicitly impose surface constraints on the sensed surface patch, such as multi-view photo consistency and supervision from depth or normal priors, through volume rendering. We evaluate our method by numerical and visual comparisons on scene benchmarks. Our superiority over the latest methods justifies our effectiveness.

Abstract: Neural Radiance Fields (NeRF) have achieved huge success in effectively capturing and representing 3D objects and scenes. However, to establish an ubiquitous presence in everyday media formats, such as images and videos, we need to fulfill three key objectives: 1. fast encoding and decoding time, 2. compact model sizes, and 3. highquality renderings. Despite recent advancements, a comprehensive algorithm that adequately addresses all objectives has yet to be fully realized. In this work, we present CodecNeRF, a neural codec for NeRF representations, consisting of an encoder and decoder architecture that can generate a NeRF representation in a single forward pass. Furthermore, inspired by the recent parameter-efficient finetuning approaches, we propose a finetuning method to efficiently adapt the generated NeRF representations to a new test instance, leading to high-quality image renderings and compact code sizes. The proposed CodecNeRF, a newly suggested encoding-decoding-finetuning pipeline for NeRF, achieved unprecedented compression performance of more than 100x and a remarkable reduction in encoding time while maintaining (or improving) the image quality on widely used 3D object datasets.

Abstract: Generalized zeroshot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.

Abstract: Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with finegrained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task.

Abstract: Recent research on LiDARbased 3D object detectors has shown strong performance; however, evaluations typically focus on dominant classes, overlooking rare classes, such as strollers, which could be critical in real autonomous driving scenarios. This oversight is problematic because state-of-the-art 3D object detectors show significantly lower performance on rare classes compared to dominant ones when trained on both. To address this issue and achieve accurate 3D rare object detection using only LiDAR data, we propose the Neighbor-Based confidence Adjustment for 3D rare class predictions (NBA3D). NBA3D utilizes a graph neural network to analyze the surrounding environment of rare class prediction boxes, enabling a more effective distinction between true positives and false positives based on their local context. Our approach utilizes both 3D prediction box characteristics and CLIP-based class semantic information to better contextualize neighboring objects. Various experiments demonstrate that NBA3D effectively improves the detection performance of rare class objects, regardless of the type of 3D object detectors used.

Abstract: Medical image segmentation plays an important role in clinical decision making, treatment planning, and disease tracking. However, it still faces two major challenges. On the one hand, there is often a "soft boundary" between foreground and background in medical images, with poor illumination and low contrast further reducing the distinguishability of foreground and background within the image. On the other hand, cooccurrence phenomena are widespread in medical images, and learning these features is misleading to the model's judgment. To address these challenges, we propose a general framework called Contrast-Driven Medical Image Segmentation (ConDSeg). First, we develop a contrastive training strategy called Consistency Reinforcement. It is designed to improve the encoder's robustness in various illumination and contrast scenarios, enabling the model to extract high-quality features even in adverse environments. Second, we introduce a Semantic Information Decoupling module, which is able to decouple features from the encoder into foreground, background, and uncertainty regions, gradually acquiring the ability to reduce uncertainty during training. The Contrast-Driven Feature Aggregation module then contrasts the foreground and background features to guide multi-level feature fusion and key feature enhancement, further distinguishing the entities to be segmented. We also propose a Size-Aware Decoder to solve the scale singularity of the decoder. It accurately locate entities of different sizes in the image, thus avoiding erroneous learning of co-occurrence features. Extensive experiments on five datasets across three scenarios demonstrate the state-of-the-art performance of our method, proving its advanced nature and general applicability to various medical image segmentation scenarios.

Abstract: Visible watermark removal which involves watermark cleaning and background content restoration is pivotal to evaluate the resilience of watermarks. Existing deep neural network (DNN)based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, for relieving the dependence on high-quality watermark masks, we introduce a new training paradigm by utilizing coarse watermark masks to guide the inference process. This contributes to a visible image removal model which is insensitive to the quality of watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.

Abstract: Recently, zeroshot object customization generation methods have rapidly developed and shown tremendous potential for applications. For instance, in the e-commerce domain, consumers can observe the visual effect of furniture placed within their personal living spaces or clothes worn on their own bodies. Many existing approaches perform object customization generation based on diffusion models and extracted reference object features. However, the generated object significantly diverges from the original reference object in details such as patterns and curves. Particularly for asymmetrical reference objects, the absence of comprehensive multi-viewpoint information prevents the generation of object poses that harmonize with the background scene. To address these shortcomings, we have constructed a novel dataset comprising multi-angle images of furniture and indoor scenes. Based on diffusion models, we introduce HomeDiffusion, which can leverage multi-viewpoint images of the same reference object to accurately generate visually harmonious object poses within specified areas of the background scene. During the diffusion process, we further extract high-fidelity details of the reference object and perform cross-attention with the noise latents in the latent space, thereby ensuring the preservation of details in the customized object generation. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance over other existing zero-shot as well as few-shot object customization approaches.

Abstract: Certified robustness is a critical measure for assessing the reliability of machine learning systems. Traditionally, the computational burden associated with certifying the robustness of machine learning models has posed a substantial challenge, particularly with the continuous expansion of model sizes. In this paper, we introduce an innovative approach to expedite the verification process for L2norm certified robustness through sparse transfer learning. Our approach is both efficient and effective. It leverages verification results obtained from pre-training tasks and applies sparse updates to these results. To enhance performance, we incorporate dynamic sparse mask selection and introduce a novel stability-based regularizer called DiffStab. Empirical results demonstrate that our method accelerates the verification process for downstream tasks by as much as 70-80%, with only slight reductions in certified accuracy compared to dense parameter updates. We further validate that this performance improvement is even more pronounced in the few-shot transfer learning scenario.

Abstract: Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation costs high, limiting the number of points that can be processed simultaneously and impeding the modeling of longrange dependencies between objects in a single scene. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, we introduce Mamba, an SSM-based architecture, to the point cloud domain and propose Pamba, a novel architecture with strong global modeling capability under linear complexity. Specifically, to make the disorderness of point clouds fit in with the causal nature of Mamba, we propose a multi-path serialization strategy applicable to point clouds. Besides, we propose the ConvMamba block to compensate for the shortcomings of Mamba in modeling local geometries and in unidirectional modeling. Pamba obtains state-of-the-art results on several 3D point cloud segmentation tasks, including ScanNet v2, ScanNet200, S3DIS and nuScenes, while its effectiveness is validated by extensive experiments.

Abstract: The objective of Composed Image Retrieval (CIR) is to identify a target image that meets the requirement based on a multimodal query (including the reference image and the modification text) provided by the user. Despite the notable success of existing approaches, they fail to adequately address the modification relation between visual entities and modification actions. This limitation is nontrivial due to three challenges: 1) irrelevant factor perturbation, 2) vague semantic boundaries, and 3) implicit modification relations. To address the above challenges, we propose an Entity miNing and modifiCation relatiOn binDing nEtwoRk (ENCODER), which has been designed to mine visual entities and modification actions, and then bind modification relations. Among the various components of the proposed ENCODER, we have initially designed the Latent Factor Filter (LFF) module to filter visual and textual latent factors related to modification semantics based on a threshold gating mechanism. Secondly, we propose Entity-Action Binding (EAB), which comprises modality-shared Learnable Relation Queries (LRQ) that are capable of mining visual entities and modification actions, as well as learning implicit modification relations for entity-action binding. Finally, the Multi-scale Composition module is introduced to achieve multi-scale feature composition, with guidance provided by entity-action binding. Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed method.

Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China School of Computer Science and Engineering, University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China Sichuan Provincial Key Laboratory for Human Disease Gene Study and the Center for Medical Genetics, Department of Laboratory Medicine, Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital, UESTC

Abstract: Learningbased methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions on simple yet explicit scene representations using generative models. However, due to the oversimplified explicit representations that overlook detailed information and the lack of guidance from multimodal relationships within the scene, most learning-based methods struggle to generate indoor scenes with realistic object arrangements and styles. In this paper, we introduce a new method, Scene Implicit Neural Field (S-INF), for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships, to enhance the realism of indoor scene synthesis. S-INF assumes that the scene layout is often related to the object-detailed information. It disentangles the multimodal relationships into scene layout relationships and detailed object relationships, fusing them later through implicit neural fields (INFs). By learning specialized scene layout relationships and projecting them into S-INF, we achieve a realistic generation of scene layout. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Through extensive experiments on the benchmark 3D-FRONT dataset, we demonstrate that our method consistently achieves state-of-the-art performance under different types of ISS.

Abstract: Glass surfaces are becoming increasingly ubiquitous as modern buildings tend to use a lot of glass panels. This, however, poses substantial challenges to the operations of autonomous systems such as robots, selfdriving cars, and drones, as the glass panels can become transparent obstacles to navigation. Existing works attempt to exploit various cues, including glass boundary context or reflections, as a prior. However, they are all based on input RGB images. We observe that the transmission of 3D depth sensor light through glass surfaces often produces blank regions in the depth maps, which can offer additional insights to complement the RGB image features for glass surface detection. In this work, we propose a large-scale RGB-D glass surface detection dataset, RGB-D GSD, for rigorous experiments and future research. It contains 3,009 images offering a wide range of real-world RGB-D glass surface categories, paired with precise annotations. Moreover, we propose a novel glass surface detection framework combining RGB and depth information, with two novel modules: a cross-modal context mining (CCM) module to adaptively learn individual and mutual context features from RGB and depth information, and a depth-missing aware attention (DAA) module to explicitly exploit spatial locations where missing depths occur to help detect the presence of glass surfaces. Experimental results show that our proposed model outperforms state-of-the-art methods.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, The Hong Kong University of Science and Technology (Guangzhou), China, The Hong Kong University of Science and Technology (Guangzhou), China, Tsinghua University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, University of Washington, USA, The Hong Kong University of Science and Technology (Guangzhou), China, The Hong Kong University of Science and Technology (Guangzhou), China, The Hong Kong University of Science and Technology (Guangzhou), China The Hong Kong University of Science and Technology, Hong Kong SAR, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China

Abstract: Existing lowlight image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes, such as image exposure, structure and color of normal-light images. These attributes are readily available and impose no assumptions about the degradation process, which guides the diffusion sampling process to a reliable high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and it performs well even in sophisticated wild degradation.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, Tsinghua University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, The Hong Kong University of Science and Technology (Guangzhou), China, The Hong Kong University of Science and Technology (Guangzhou), China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fudan University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China School of Informatics,Xiamen University, China

Abstract: Lowlight image enhancement (LIE) aims at precisely and efficiently recovering an image degraded in poor illumination environments. Recent advanced LIE techniques are using deep neural networks, which require lots of low-normal light image pairs, network parameters, and computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses. It aims at predicting pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the amplified noise after the light brightens. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.

Abstract: Due to the successful development of deep image generation technology, forgery detection plays a more important role in social and economic security. Racial bias has not been explored thoroughly in the deep forgery detection field. In the paper, we first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public stateof-the-art (SOTA) methods. Different from existing forgery detection datasets, the self-constructed FairFD dataset contains a balanced racial ratio and diverse forgery generation images with the largest-scale subjects. Additionally, we identify the problems with naive fairness metrics when benchmarking forgery detection models. To comprehensively evaluate fairness, we design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates. Extensive experiments conducted with 12 representative forgery detection models demonstrate the value of the proposed dataset and the reasonability of the designed fairness metrics. By applying the BPFA to the existing fairest detector, we achieve a new SOTA. Furthermore, we conduct more in-depth analyses to offer more insights to inspire researchers in the community.

Abstract: Unknown Object Detection (UOD) aims to identify objects of unseen categories, differing from the traditional detection paradigm limited by the closedworld assumption. A key component of UOD is learning a generalized representation, i.e. objectness for both known and unknown categories to distinguish and localize objects from the background in a class-agnostic manner. However, previous methods obtain supervision signals for learning objectness in isolation from either localization or classification information, leading to poor performance for UOD. To address this issue, we propose a transformer-based UOD framework, UN-DETR. Based on this, we craft Instance Presence Score (IPS) to represent the probability of an object's presence. For the purpose of information complementarity, IPS employs a strategy of joint supervised learning, integrating attributes representing general objectness from the positional and the categorical latent space as supervision signals. To enhance IPS learning, we introduce a one-to-many assignment strategy to incorporate more supervision. Then, we propose Unbiased Query Selection to provide premium initial query vectors for the decoder. Additionally, we propose an IPS-guided post process strategy to filter redundant boxes and correct classification predictions for known and unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner, in order to obtain objectness prior. Our UN-DETR is comprehensively evaluated on multiple UOD and known detection benchmarks, demonstrating its effectiveness and achieving state-of-the-art performance.

Abstract: Textto-Image generation (TTI) technologies are advancing rapidly, especially in the English language communities. However, apart from the user input language barrier problem, English-native TTI models inherently carry biases from their English world centric training data, which creates a dilemma for development of other language-native TTI models. One common choice is to fine-tune the English-native TTI model with translated samples. It falls short of fully addressing the model bias problem. Alternatively, training non-English language native models from scratch can effectively resolve the English world bias, but model trained this way would diverge from the English TTI communities, thus not able to utilize the strides continuously gaining in the English TTI communities any more. To build Chinese TTI model meanwhile keep compatibility with the English TTI communities, we propose a novel model structure referred as "Bridge Diffusion Model" (BDM). The proposed BDM employs a backbone-branch network structure to learn the Chinese semantics while keep the latent space compatible with the English-native TTI backbone, in an end-to-end manner. The unique advantages of the proposed BDM are that it's not only adept at generating images that precisely depict Chinese semantics, but also compatible with various English-native TTI plugins, such as different checkpoints, LoRA, ControlNet, Dreambooth, and Textual Inversion, etc. Moreover, BDM can concurrently generate content seamlessly combining both Chinese-native and English-native semantics within a single image, fostering cultural interaction.

Abstract: Image forgeries can entirely change the semantic information of an image, and can be used for unscrupulous purposes. In this paper, we propose a novel image forgery localization network named as MUN, which consists of an M^3 encoder and a UN decoder. Firstly, the M^3 encoder is constructed based on a Multiscale Max-pooling query module to extract Multi-clue forged features. Noiseprint++ is adopted to assist the RGB clue, and its deployment methodology is discussed. A Multi-scale Max-pooling Query (MMQ) module is proposed to integrate RGB and noise features. Secondly, a novel UN decoder is proposed to extract hierarchical features from both top-down and bottom-up directions, reconstructing both high-level and low-level features at the same time. Thirdly, we formulate an IoU-recalibrated Dynamic Cross-Entropy (IoUDCE) loss to dynamically adjust the weights on forged regions according to IoU which can adaptively balance the influence of authentic and forged regions. Last but not least, we propose a data augmentation method, i.e., Deviation Noise Augmentation (DNA), which acquires accessible prior knowledge of RGB distribution to improve the generalization ability. Extensive experiments on publicly available datasets show that MUN outperforms the state-of-the-art works.

Abstract: Lowrank adaptation (LoRA) is an efficient strategy for adapting latent diffusion models (LDMs) on a private dataset to generate specific images by minimizing the adaptation loss. However, the LoRA-adapted LDMs are vulnerable to membership inference (MI) attacks that can judge whether a particular data point belongs to the private dataset, thus leading to the privacy leakage. To defend against MI attacks, we first propose a straightforward solution: Membership-Privacy-preserving LoRA (MP-LoRA). MP-LoRA is formulated as a min-max optimization problem where a proxy attack model is trained by maximizing its MI gain while the LDM is adapted by minimizing the sum of the adaptation loss and the MI gain of the proxy attack model. However, we empirically find that MP-LoRA has the issue of unstable optimization, and theoretically analyze that the potential reason is the unconstrained local smoothness, which impedes the privacy-preserving adaptation. To mitigate this issue, we further propose a Stable Membership-Privacy-preserving LoRA (SMP-LoRA) that adapts the LDM by minimizing the ratio of the adaptation loss to the MI gain. Besides, we theoretically prove that the local smoothness of SMP-LoRA can be constrained by the gradient norm, leading to improved convergence. Our experimental results corroborate that SMP-LoRA can indeed defend against MI attacks and generate high-quality images.

College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University.Changchun, China, College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University.Changchun, China, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China, Institute for Infocomm Research (I2R), A*STAR, Singapore, Tsinghua Shenzhen International Graduate School, Nanshan District, Shenzhen, China, College of Computer Science and Technology, Jilin University, Changchun, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University.Changchun, China

Abstract: Grasping the intricacies of human motion, which involve perceiving spatiotemporal dependence and multi-scale effects, is essential for predicting human motion. While humans inherently possess the requisite skills to navigate this issue, it proves to be markedly more challenging for machines to emulate. To bridge the gap, we propose the Human-like Vision and Inference System (HVIS) for human motion prediction, which is designed to emulate human observation and forecast future movements. HVIS comprises two components: the human-like vision encode (HVE) module and the human-like motion inference (HMI) module. The HVE module mimics and refines the human visual process, incorporating a retina-analog component that captures spatiotemporal information separately to avoid unnecessary crosstalk. Additionally, a visual cortex-analogy component is designed to hierarchically extract and treat complex motion features, focusing on both global and local features of human poses. The HMI is employed to simulate the multi-stage learning model of the human brain. The spontaneous learning network simulates the neuronal fracture generation process for the adversarial generation of future motions. Subsequently, the deliberate learning network is optimized for hard-to-train joints to prevent misleading learning. Experimental results demonstrate that our method achieves new state-of-the-art performance, significantly outperforming existing methods by 19.8 % on Human3.6M, 15.7 % on CMU Mocap, and 11.1 % on G3D.

Abstract: Recent advances in textto-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilize a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100, 000 images. This dataset contains single and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China Institute of Artificial Intelligence, Xiamen University, Fujian, China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

Abstract: Domain Generalized Semantic Segmentation (DGSS) seeks to utilize source domain data exclusively to enhance the generalization of semantic segmentation across unknown target domains. Prevailing studies predominantly concentrate on feature normalization and domain randomization, these approaches exhibit significant limitations. Feature normalizationbased methods tend to confuse semantic features in the process of constraining the feature space distribution, resulting in classification misjudgment. Domain randomization-based methods frequently incorporate domain-irrelevant noise due to the uncontrollability of style transformations, resulting in segmentation ambiguity. To address these challenges, we introduce a novel framework, named SCSD for Semantic Consistency prediction and Style Diversity generalization. It comprises three pivotal components: Firstly, a Semantic Query Booster is designed to enhance the semantic awareness and discrimination capabilities of object queries in the mask decoder, enabling cross-domain semantic consistency prediction. Secondly, we develop a Text-Driven Style Transform module that utilizes domain difference text embeddings to controllably guide the style transformation of image features, thereby increasing inter-domain style diversity. Lastly, to prevent the collapse of similar domain feature spaces, we introduce a Style Synergy Optimization mechanism that fortifies the separation of inter-domain features and the aggregation of intra-domain features by synergistically weighting style contrastive loss and style aggregation loss. Extensive experiments demonstrate that the proposed SCSD significantly outperforms existing state-of-theart methods. Notably, SCSD trained on GTAV achieved an average of 49.11 mIoU on the four unseen domain datasets, surpassing the state-of-the-art method by +4.08 mIoU.

Abstract: Semisupervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which alleviates the burden of pixel-wise manual annotations. However, it often suffers from error supervision from low-quality pseudo labels. Vision-Language Model (VLM) has great potential to enhance pseudo labels by introducing text prompt guided multimodal supervision information. It nevertheless faces the cross-modal problem: the obtained messages tend to correspond to multiple targets. To address aforementioned problems, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) a Dual Contrastive Learning (DCL) is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision strategy (SSS) is proposed and injected into each contrastive learning process in DCL, supervising semantic similarity via the distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo-labels. It utilizes the pretrained VLM to generate text prompt guided supervision information, refining the pseudo label for better consistency regularization. Experimental results demonstrate that our DuSSS achieves outstanding performance with Dice of 82.52%, 74.61% and 78.03% on three public datasets (QaTa-COV19, BM-Seg and MoNuSeg).

Abstract: Procedure planning in instructional videos, producing a structured and plannable action sequence facilitating the transition from the start to the goal states, has achieved significant progress. The dominant singlebranch non-autoregressive planning paradigm guides action sequence generation through action labels, overlooking the limitation of the absence of intermediate visual information. Hence, we introduce the procedure knowledge decoupled distillation strategy to address the above issue. This innovative strategy deliberately lets the teacher model see the real visual information among the start and goal states to enhance its action semantic understanding and relationship modeling ability, producing the potential probability distribution containing the real action class and other action classes that may occur. Accordingly, we introduce a decoupled intermediate information knowledge distillation loss, which comprises single action knowledge distillation and sequence distribution knowledge distillation for the student model. The former improves the student model's precise inference ability for individual actions by transferring knowledge of a single action target category using binary classification loss. Conversely, the latter uses MSE loss to constrain the student model to learn the action sequence probability distribution from the teacher model, thereby enhancing the student model's global planning capability. Extensive experiments on three datasets demonstrate that our strategy can improve the performance of multiple weakly supervised models, achieving promising procedure knowledge modeling ability and plug-and-play flexibility.

Abstract: Humans navigate unfamiliar environments using episodic simulation and episodic memory, which facilitate a deeper understanding of the complex relationships between environments and objects. Developing an imaginative memory system inspired by human mechanisms can enhance the navigation performance of embodied agents in unseen environments. However, existing Visionand-Language Navigation (VLN) agents lack a memory mechanism of this kind. To address this, we propose a novel architecture that equips agents with a reality-imagination hybrid memory system. This system enables agents to maintain and expand their memory through both imaginative mechanisms and navigation actions. Additionally, we design tailored pre-training tasks to develop the agent's imaginative capabilities. Our agent can imagine high-fidelity RGB images for future scenes, achieving state-of-the-art results in a Success rate weighted by Path Length (SPL).

Abstract: To follow regulations on individual data privacy and safety, machine learning models must systematically remove information learned from specific subsets of a user's training data that can no longer be utilized. To address this problem, machine unlearning has emerged as an important area of research, that helps remove information learned from specific subsets of training data from a pretrained model without needing to retrain the whole model from scratch. The principal aim of this study is to formulate a methodology aimed for the purposeful elimination of information linked to a specific class of data from a pre-trained classification network. This intentional removal decreases the model's performance specifically concerning the unlearned data class while simultaneously minimizing any detrimental impacts on the model's performance in other classes. To achieve this goal, we frame the class unlearning problem from a Bayesian perspective, which yields a loss function that minimizes the log-likelihood associated with the unlearned data with a stability regularization in parameter space. This stability regularization incorporates Mohalanobis distance with respect to the Fisher Information matrix and L2 distance from the pre-trained model parameters. Our novel approach, termed Partially-Blinded Unlearning (PBU), surpasses existing state-of-the-art class unlearning methods, demonstrating superior effectiveness. Notably, PBU achieves this efficacy without requiring information about the entire training dataset but only of the unlearned data points, marking a distinctive feature of its performance.

Abstract: Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. Stateof-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.

Abstract: Visible spectrum images capture limited information from just three discrete bands, often resulting in suboptimal performance in underwater depth estimation (UDE) due to significant information loss from water absorption. In contrast, HSIs, which include hundreds of continuous bands, provide abundant spectral information that offers greater resilience against the adverse effects of water absorption. In this paper, we conduct a comprehensive study to investigate how spectral information can enhance remote sensing UDE through two key aspects: the benchmark dataset and the general framework. For the benchmark dataset, we construct a realworld hyperspectral UDE (HUDE) dataset ATR-HUDE, comprising approximately 500 synchronized hyperspectral and LiDAR data pairs collected from diverse coastal scenes and flight altitudes. Regarding the general framework, we integrate recent advances in state space models and physical imaging models to design a novel HUDE framework named HUDEMamba that estimates underwater depth using both model-driven and data-driven approaches. Experimental results on the constructed benchmark dataset validate the potential of HUDE and the effectiveness of HUDEMamba.

Abstract: In the coded aperture snapshot spectral imaging system, Deep Unfolding Networks (DUNs) have made impressive progress in recovering 3D hyperspectral images (HSIs) from a single 2D measurement. However, the inherent nonlinear and illposed characteristics of HSI reconstruction still pose challenges to existing methods in terms of accuracy and stability. To address this issue, we propose a Mamba-inspired Joint Unfolding Network (MiJUN), which integrates physics-embedded DUNs with learning-based HSI imaging. Firstly, leveraging the concept of trapezoid discretization to expand the representation space of unfolding networks, we introduce an accelerated unfolding network scheme. This approach can be interpreted as a generalized accelerated half-quadratic splitting with a second-order differential equation, which reduces the reliance on initial optimization stages and addresses challenges related to long-range interactions. Crucially, within the Mamba framework, we restructure the Mamba-inspired global-to-local attention mechanism by incorporating a selective state space model and an attention mechanism. This effectively reinterprets Mamba as a variant of the Transformer architecture, improving its adaptability and efficiency. Furthermore, we refine the scanning strategy with Mamba by integrating the tensor mode-k unfolding into the Mamba network. This approach emphasizes the low-rank properties of tensors along various modes, while conveniently facilitating 12 scanning directions. Numerical and visual comparisons on both simulation and real datasets demonstrate the superiority of our proposed MiJUN, and achieving overwhelming detail representation.

Abstract: Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive pointbased interactions, arising from the lack of fixed correspondences between views such as range view and Bird's-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170x speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation.

Abstract: Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural language queries. However, existing VMR methods that employ various strategies for crossmodal alignment still face challenges such as limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose a novel Concept Decomposition Transformer (CDTR) model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations, reflecting the relevance between various concepts and capturing fine-grained semantics which is crucial for cross-modal matching. These decomposed concept representations are then used as pseudo-labels, determined as positive or negative samples by adaptive concept-specific thresholds. Subsequently, fine-grained concept alignment is performed in video intra-modal and textual-visual cross-modal, aligning different conceptual components within features, enhancing the model's ability to distinguish fine-grained semantics, and alleviating issues related to semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of the CDTR, outperforming state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.

Abstract: Transferable adversarial examples are known to cause threats in practical, blackbox attack scenarios. A notable approach to improving transferability is using integrated gradients (IG), originally developed for model interpretability. In this paper, we find that existing IG-based attacks have limited transferability due to their naive adoption of IG in model interpretability. To address this limitation, we focus on the IG integration path and refine it in three aspects: multiplicity, monotonicity, and diversity, supported by theoretical analyses. We propose the Multiple Monotonic Diversified Integrated Gradients (MuMoDIG) attack, which can generate highly transferable adversarial examples on different CNN and ViT models and defenses. Experiments validate that MuMoDIG outperforms the latest IG-based attack by up to 37.3% and other state-of-the-art attacks by 8.4%. In general, our study reveals that migrating established techniques to improve transferability may require non-trivial efforts.

Abstract: Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet with two key challenges remaining. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations to the image. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. Our experiments demonstrate that FunEditor significantly outperforms recent inferencetime optimization methods and fine-tuned models, either quantitatively across various metrics or through visual comparisons or both, on complex tasks like object movement and object pasting. In the meantime, with only 4 steps of inference, FunEditor achieves 5--24 times inference speedups over existing popular methods.

Abstract: Visual Information Extraction (VIE) plays a crucial role in the comprehension of semistructured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that vision and layout modality hold invariance among images with different languages. If decoupling language bias from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm LDP (Language Decoupled Pre-training) for better utilization of monolingual pre-training data. Our proposed model LDM (Language Decoupled Model) is first pre-trained on the language-independent data, where the language knowledge is decoupled by a diffusion model, and then the LDM is fine-tuned on the downstream languages. Extensive experiments show that the LDM outperformed all SOTA multilingual pre-trained models, and also maintains competitiveness on downstream monolingual/English benchmarks.

Abstract: Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces TopologyAware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete initial geometric coverage, and inadequate feature-level integrity from insufficient topological constraints during optimization. To overcome these limitations, Topology-GS incorporates a novel interpolation strategy, Local Persistent Voronoi Interpolation (LPVI), and a topology-focused regularization term based on persistent barcodes, named PersLoss. LPVI utilizes persistent homology to guide adaptive interpolation, enhancing point coverage in low-curvature areas while preserving topological structure. PersLoss aligns the visual perceptual similarity of rendered images with ground truth by constraining distances between their topological features. Comprehensive experiments on three novel-view synthesis benchmarks demonstrate that Topology-GS outperforms existing methods in terms of PSNR, SSIM, and LPIPS metrics, while maintaining efficient memory usage. This study pioneers the integration of topology with 3D-GS, laying the groundwork for future research in this area.

Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, Alibaba Group, Alibaba Group, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, Alibaba Group, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, Alibaba Group

Abstract: Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on document visual question answering (VQA) task, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial in constructing instruction data with high efficacy, which, in turn, facilitates the training of LLMs and MLLMs for document VQA. However, most existing evaluation methods for instruction data are limited to the textual content of the instructions themselves, thereby hindering the effective assessment of document instruction datasets and constraining their construction. In this paper, we propose ProcTag, a dataoriented method that assesses the efficacy of document instruction data. ProcTag innovatively performs tagging on the execution process of instructions rather than the instruction text itself. By leveraging the diversity and complexity of these tags to assess the efficacy of the given dataset, ProcTag enables selective sampling or filtering of document instructions. Furthermore, DocLayPrompt, a novel semi-structured layout-aware document prompting strategy, is proposed for effectively representing documents. Experiments demonstrate that sampling existing open-sourced and generated document VQA/instruction datasets with ProcTag significantly outperforms current methods for evaluating instruction data. Impressively, with ProcTag-based sampling in the generated document datasets, only 30.5 percent of the document instructions are required to achieve 100 percent efficacy compared to the complete dataset.

Huazhong University of Science and Technology, Huazhong University of Science and Technology National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Air Force Early Warning Academy, Chinese People’s Liberation Army 95841 troops, Chinese People’s Liberation Army 95841 troops, Chinese People’s Liberation Army 95841 troops

Abstract: The introduction of Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges remain in detecting tiny objects, as their features occupy only a very small proportion of the feature maps. Although FPN integrates multiscale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) with two innovative modules. First, we designed a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we developed a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.

Abstract: Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multimodal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks—AVA, UCF101-24, and JHMDB51-21—demonstrates that incorporating multi-modal information significantly enhances performance, setting new state-of-the-art performances in the field.

Abstract: Recently, the success of textto-image synthesis has greatly advanced the development of identity customization techniques, whose main goal is to produce realistic identity-specific photographs based on text prompts and reference face images. However, it is difficult for existing identity customization methods to simultaneously meet the various requirements of different real-world applications, including the identity fidelity of small face, the control of face location, pose and expression, as well as the customization of multiple persons. To this end, we propose a scale-robust and fine-controllable method, namely RealisID, which learns different control capabilities through the cooperation between a pair of local and global branches. Specifically, by using cropping and up-sampling operations to filter out face-irrelevant information, the local branch concentrates the fine control of facial details and the scale-robust identity fidelity within the face region. Meanwhile, the global branch manages the overall harmony of the entire image. It also controls the face location by taking the location guidance as input. As a result, RealisID can benefit from the complementarity of these two branches. Finally, by implementing our branches with two different variants of ControlNet, our method can be easily extended to handle multi-person customization, even only trained on single-person datasets. Extensive experiments and ablation studies indicate the effectiveness of RealisID and verify its ability in fulfilling all the requirements mentioned above.

Abstract: Adversarial training (AT) incurs significant computational overhead, leading to growing interest in designing inherently robust architectures. We demonstrate that a carefully designed first layer of the neural network can serve as an implicit adversarial noise filter (ANF). This filter is created using a combination of large kernel size, increased convolution filters, and a maxpool operation. We show that integrating this filter as the first layer in architectures such as ResNet, VGG, and EfficientNet results in adversarially robust networks. Our approach achieves higher adversarial accuracies than existing natively robust architectures without AT and is competitive with adversarialtrained architectures across a wide range of datasets. Supporting our findings, we show that (a) the decision regions for our method have better margins, (b) the visualized loss surfaces are smoother, (c) the modified peak signal-to-noise ratio (mPSNR) values at the output of the ANF are higher, (d) high-frequency components are more attenuated, and (e) architectures incorporating ANF exhibit better denoising in Gaussian noise compared to baseline architectures.

Abstract: Human preference alignment can significantly enhance the capabilities of Multimodal Large Language Models (MLLMs). However, collecting highquality preference data remains costly. One promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. Current multimodal self-evolution techniques, nevertheless, still need human- or GPT-annotated data. Some methods even require extra models or ground truth answers to construct preference data. To overcome these limitations, we propose a novel multimodal self-evolution framework that empowers the model to autonomously generate high-quality questions and answers using only unannotated images. First, in the question generation phase, we implement an image-driven self-questioning mechanism. This approach allows the model to create questions and evaluate their relevance and answerability based on the image content. If a question is deemed irrelevant or unanswerable, the model regenerates it to ensure alignment with the image. This process establishes a solid foundation for subsequent answer generation and optimization. Second, while generating answers, we design an answer self-enhancement technique to boost the discriminative power of answers. We begin by captioning the images and then use the descriptions to enhance the generated answers. Additionally, we utilize corrupted images to generate rejected answers, thereby forming distinct preference pairs for effective optimization. Finally, in the optimization step, we incorporate an image content alignment loss function alongside the Direct Preference Optimization (DPO) loss to mitigate hallucinations. This function maximizes the likelihood of the above generated descriptions in order to constrain the model's attention to the image content. As a result, model can generate more accurate and reliable outputs. Experiments demonstrate that our framework is competitively compared with previous methods that utilize external information, paving the way for more efficient and scalable MLLMs.

Abstract: Classical Transformerbased line segment detection methods have delivered impressive results. However, we observe that some accurately detected line segments are assigned low confidence scores during prediction, causing them to be ranked lower and potentially suppressed. Additionally, these models often require prolonged training periods to achieve strong performance, largely due to the necessity of bipartite matching. In this paper, we introduce RANK-LETR, a novel Transformer-based line segment detection method. Our approach leverages learnable geometric information to refine the ranking of predicted line segments by enhancing the confidence scores of high-quality predictions in a posterior verification step. We also propose a new line segment proposal method, wherein the feature point nearest to the centroid of the line segment directly predicts the location, significantly improving training efficiency and stability. Moreover, we introduce a line segment ranking loss to stabilize rankings during training, thereby enhancing the generalization capability of the model. Experimental results demonstrate that our method outperforms other Transformer-based and CNN-based approaches in prediction accuracy while requiring fewer training epochs than previous Transformer-based models.

Abstract: Existing face forgery detection methods achieve promising performance when training and testing forgery data are from identical manipulation types, while they fail to generalize well to unseen samples. In this paper, we experimentally investigate and find that the poor generalization of the methods mainly arises from their overfitting on the known fake patterns. Excessively focused on seen fakes, those detectors fail to effectively learn imageintrinsic information and the distributional disparity between real and fake images. Then, to address this issue, we redefine fake learning as real-fake distributional disparity learning. We propose a novel deepfake detection framework learning distributional disparity based on the differentiated reconstruction on real and fake images for improved generalization. Specifically, distributional disparity learning on differentiated reconstruction of the real and fake images, enforces the model to learn image-invariant intrinsic representations. The reconstruction on real and fake images forces the decoders to learn the distribution of real and fake images, respectively. Moreover, to avoid the influence from the specificalization of the known fake patterns, we further propose the information interaction learning on the encoded intrinsic information and the pixel disparity between the input image and its reconstruction to distinguish face forgeries that are even unknown. Extensive experiments on large-scale benchmark datasets demonstrated the effectiveness of addressing the overfitting issue of the classification network, and verified the superior performance of our method.

Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, MAIS, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, MAIS, Institute of Automation, Chinese Academy of Sciences, Tongji University, Tongji University, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Fuzhou University, Shandong University, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing University of Posts and Telecommunications

Abstract: Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For VPR task, often fewer discriminative local regions in an image produce important effects while mundane background regions do not contribute or even cause perceptual aliasing because of easy overlap. However, existing methods lack precisely modeling and full exploitation of these discriminative regions. In addition, the lack of pixellevel correspondence supervision in the VPR dataset hinders further improvement of the local feature matching capability in the re-ranking stage. In this paper, we propose the Focus on Local (FoL) approach to stimulate the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondences ground truth for the VPR task. Third, we suggest an efficient re-ranking pipeline that is efficiently and precisely based on discriminative region guidance. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency.

Abstract: Consistency distillation methods have demonstrated significant success in accelerating generative tasks of diffusion models. However, since previous consistency distillation methods use simple and straightforward strategies in selecting target timesteps, they usually struggle with blurs and detail losses in generated images. To address these limitations, we introduce TargetDriven Distillation (TDD), which (1) adopts a delicate selection strategy of target timesteps, increasing the training efficiency; (2) utilizes decoupled guidances during training, making TDD open to post-tuning on guidance scale during inference periods; (3) can be optionally equipped with non-equidistant sampling and x0 clipping, enabling a more flexible and accurate way for image sampling. Experiments verify that TDD achieves state-of-the-art performance in few-step generation, offering a better choice among consistency distillation models.

Abstract: The advancement of Spatial Transcriptomics (ST) has facilitated the spatiallyaware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition cost remains expensive. Therefore, directly predicting the ST expressions from digital pathology images is desired. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images, and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of the pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs).

Abstract: Benefiting from their powerful generative capabilities, pretrained diffusion models have garnered significant attention for realworld image super-resolution (Real-SR). Existing diffusion-based SR approaches typically utilize semantic information from degraded images and restoration prompts to activate prior for producing realistic high-resolution images. However, general-purpose pretrained diffusion models, not designed for restoration tasks, often have suboptimal prior, and manually defined prompts may fail to fully exploit the generated potential. To address these limitations, we introduce RAP-SR, a novel restoration prior enhancement approach in pretrained diffusion models for Real-SR. First, we develop the High-Fidelity Aesthetic Image Dataset (HFAID), curated through a Quality-Driven Aesthetic Image Selection Pipeline (QDAISP). Our dataset not only surpasses existing ones in fidelity but also excels in aesthetic quality. Second, we propose the Restoration Priors Enhancement Framework, which includes Restoration Priors Refinement (RPR) and Restoration-Oriented Prompt Optimization (ROPO) modules. RPR refines the restoration prior using the HFAID, while ROPO optimizes the unique restoration identifier, improving the quality of the resulting images. RAP-SR effectively bridges the gap between general-purpose models and the demands of Real-SR by enhancing restoration prior. Leveraging the plug-and-play nature of RAP-SR, our approach can be seamlessly integrated into existing diffusion-based SR methods, boosting their performance. Extensive experiments demonstrate its broad applicability and state-of-the-art results.

Abstract: Hateful meme detection aims to prevent the proliferation of hateful memes on various social media platforms. Considering its impact on social environments, this paper introduces a previously ignored but significant threat to hateful meme detection: backdoor attacks. By injecting specific triggers into meme samples, backdoor attackers can manipulate the detector to output their desired outcomes. To explore this, we propose the Meme Trojan framework to initiate backdoor attacks on hateful meme detection. Meme Trojan involves creating a novel CrossModal Trigger (CMT) and a learnable trigger augmentor to enhance the trigger pattern according to each input sample. Due to the cross-modal property, the proposed CMT can effectively initiate backdoor attacks on hateful meme detectors under an automatic application scenario. Additionally, the injection position and size of our triggers are adaptive to the texts contained in the meme, which ensures that the trigger is seamlessly integrated with the meme content. Our approach outperforms the state-of-the-art backdoor attack methods, showing significant improvements in effectiveness and stealthiness. We believe that this paper will draw more attention to the potential threat posed by backdoor attacks on hateful meme detection.

Abstract: Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of textand image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. To begin with, we first collect and filter a standardized image prompt set from DALL·E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a more fair and transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences.

Abstract: Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semidense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.

Abstract: In this paper, we present CAD2Program, a new method for reconstructing 3D parametric models from 2D CAD drawings. Our proposed method is inspired by recent successes in visionlanguage models (VLMs), and departs from traditional methods which rely on task-specific data representations and/or algorithms. Specifically, on the input side, we simply treat the 2D CAD drawing as a raster image, regardless of its original format, and encode the image with a standard ViT model. We show that such an encoding scheme achieves competitive performance against existing methods that operate on vector-graphics inputs, while imposing substantially fewer restrictions on the 2D drawings. On the output side, our method auto-regressively predicts a general-purpose language describing 3D parametric models in text form. Compared to other sequence modeling methods for CAD which use domain-specific sequence representations with fixed-size slots, our text-based representation is more flexible, and can be easily extended to arbitrary geometric entities and semantic or functional properties. Experimental results on a large-scale dataset of cabinet models demonstrate the effectiveness of our method.

Abstract: Inthe-wild dynamic facial expression recognition (DFER) encounters a significant challenge in recognizing emotion-related expressions, which are often temporally and spatially diluted by emotion-irrelevant expressions and global context. Most prior DFER methods directly utilize coupled spatiotemporal representations that may incorporate weakly relevant features with emotion-irrelevant context bias. Several DFER methods highlight dynamic information for DFER, but following explicit guidance that may be vulnerable to irrelevant motion. In this paper, we propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD). Through expanding wavelet lifting scheme to fully learnable framework, IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner, i.e., without exploit operations and external guidance. The disentanglement process contains two stages. The first is Inter-frame Static-dynamic Splitting Module (ISSM) for rough disentanglement estimation, which explores inter-frame correlation to generate content-aware splitting indexes on-the-fly. We utilize these indexes to split frame features into two groups, one with greater global similarity, and the other with more unique dynamic features. The second stage is Lifting-based Aggregation-Disentanglement Module (LADM) for further refinement. LADM first aggregates two groups of features from ISSM to obtain fine-grained global context features by an updater, and then disentangles emotion-related facial dynamic features from the global context by a predictor. Extensive experiments on in-the-wild datasets have demonstrated that IFDD outperforms prior supervised DFER methods with higher recognition accuracy and comparable efficiency.

Abstract: Despite the rapid and substantial advancements in object detection, it continues to face limitations imposed by predefined category sets. Current methods for visual grounding primarily focus on how to better leverage the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, \ie, matching-based method, build upon and extend the functionality of existing object detectors by enabling them to localize an object based on free-form linguistic expressions, which have good application potential. However, the untapped potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations that exist in the current matching-based method (\ie, mismatch problem and complicated fusion mechanisms), and then present a simple yet effective matching-based method, namely RefDetector. To tackle the above issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we have explored the visual grounding based on the modern detector DETR, and achieved significant performance improvement. Extensive experiments on three REC benchmark datasets, \ie, RefCOCO, RefCOCO+, and RefCOCOg validate the effectiveness of the proposed method.

Abstract: The task of composed image retrieval aims to match the multimodal query composed of a reference image and a modification sentence with the target image. Most current approaches narrow the distances between the composed queries and targets by investigating matched correspondences in positive triplets. Nevertheless, they are inclined to exhibit heavy reliance on partial correlations. As the negative correspondences are underestimated, semantic clues that distinguish the target from mismatched candidates are obscured by incomplete associations. Moreover, the correlations between the modification textual features and the visual variations from the reference to candidates are imperative to further strengthen the semantic discriminations. In this paper, we propose DIscriminative Perception from NEgative Correspondences (DIPNEC) to address the aforementioned issues. To encourage awareness of the differences between matched and mismatched correspondences, DIPNEC introduces optimal transport with semantic preservation for reassignments on hard negative triplets. Besides, Difference Quantization Alignments (DQA) and Composed Word-level Alignments (CWA) jointly determine the matching scores between multi-modal queries and candidates. Specifically, DQA concentrates on the correlations of textual features with source-to-target visual differences, and CWA further emphasizes the differentiated semantics. DIPNEC has demonstrated competitive performances on the experimental results and ablation studies on widely-used datasets FashionIQ and CIRR.

Abstract: We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in fullbody MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where visual-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for various applications.

School of Biomedical Engineering, Tsinghua University, Beijing, China, School of Biomedical Engineering, Tsinghua University, Beijing, China, School of Biomedical Engineering, Tsinghua University, Beijing, China, School of Biomedical Engineering, Tsinghua University, Beijing, China, School of Biomedical Engineering, and Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China, School of Biomedical Engineering, Tsinghua University, Beijing, China School of Biomedical Engineering, and Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China

Abstract: Drafting radiology reports is a complex task requiring flexibility, where radiologists tail content to available information and particular clinical demands. However, most current radiology report generation (RRG) models are constrained to a fixed task paradigm, such as predicting the full ''finding'' section from a single image, inherently involving a mismatch between inputs and outputs. The trained models lack the flexibility for diverse inputs and could generate harmful, inputagnostic hallucinations. To bridge the gap between current RRG models and the clinical demands in practice, we first develop a data generation pipeline to create a new MIMIC-RG4 dataset, which considers four common radiology report drafting scenarios and has perfectly corresponded input and output. Secondly, we propose a novel large language model (LLM) based RRG framework, namely LLM-RG4, which utilizes LLM's flexible instruction-following capabilities and extensive general knowledge. We further develop an adaptive token fusion module that offers flexibility to handle diverse scenarios with different input combinations, while minimizing the additional computational burden associated with increased input volumes. Besides, we propose a token-level loss weighting strategy to direct the model's attention towards positive and uncertain descriptions. Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem.

Abstract: Existing approaches aiming to remove adverse weather degradations compromise the image quality and incur the long processing time. To this end, we introduce a multiaxis prompt and multi-dimension fusion network (MPMF-Net). Specifically, we develop a multi-axis prompts learning block (MPLB), which learns the prompts along three separate axis planes, requiring fewer parameters and achieving superior performance. Moreover, we present a multi-dimension feature interaction block (MFIB), which optimizes intra-scale feature fusion by segregating features along height, width and channel dimensions. This strategy enables more accurate mutual attention and adaptive weight determination. Additionally, we propose the coarse-scale degradation-free implicit neural representations (CDINR) to normalize the degradation levels of different weather conditions. Extensive experiments demonstrate the significant improvements of our model over the recent well-performing approaches in both reconstruction fidelity and inference time.

Abstract: Customized video generation aims to generate highquality videos guided by text prompts and subject's reference images. However, since it is only trained on static images, the fine-tuning process of subject learning disrupts abilities of video diffusion models (VDMs) to combine concepts and generate motions. To restore these abilities, some methods use additional video similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and conceptual combination abilities without additional video and fine-tuning to recovery. For preserving conceptual combination ability, we design a plug-and-play module to update few parameters in VDMs, enhancing the model's ability to capture the appearance details and the ability of concept combinations for new subjects. For motion generation, we observed that VDMs tend to restore the motion of video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of this module on motion generation in the early stage of denoising, preserving the ability to generate motion of VDMs. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method has a significant improvement compared to previous methods.

Abstract: The demand for producing shortform videos for sharing on social media platforms has experienced significant growth in recent times. Despite notable advancements in the fields of video summarization and highlight detection, which can create partially usable short films from raw videos, these approaches are often domain-specific and require an in-depth understanding of real-world video content. To tackle this predicament, we propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips aimed at resolving the video long-to-short task. Recognizing the inherent constraints posed by untrained human annotators, which can result in inaccurate annotations for repurposed videos, we propose a two-stage solution to obtain annotations from real-world user-generated content. Furthermore, we offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects through a cross-modal fusion and alignment framework. We aspire for our work to ignite groundbreaking research in the lesser-explored realms of video repurposing.

Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, School of Software, Tsinghua University, Beijing, China

Abstract: In recent years, reconstructing indoor scene geometry from multiview images has achieved encouraging accomplishments. Current methods incorporate monocular priors into neural implicit surface models to achieve high-quality reconstructions. However, these methods require hundreds of images for scene reconstruction. When only a limited number of views are available as input, the performance of monocular priors deteriorates due to scale ambiguity, leading to the collapse of the reconstructed scene geometry. In this paper, we propose a new method, named Sparis, for indoor surface reconstruction from sparse views. Specifically, we investigate the impact of monocular priors on sparse scene reconstruction, introducing a novel prior based on inter-image matching information. Our prior offers more accurate depth information while ensuring cross-view matching consistency. Additionally, we employ an angular filter strategy and an epipolar matching weight function, aiming to reduce errors due to view matching inaccuracies, thereby refining the inter-image prior for improved reconstruction accuracy. The experiments conducted on widely used benchmarks demonstrate superior performance in sparse-view scene reconstruction.

Abstract: SourceFree Domain Adaptation (SFDA) aims to transfer a pre-trained source model to the unlabeled target domain without accessing the source data, thereby effectively solving labeled data dependency and domain shift problems. However, the SFDA setting faces a bottleneck due to the absence of supervisory information. To mitigate this problem, Active Learning (AL) is introduced to combine with SFDA, endeavoring to actively label a small set of the most high-quality target points so that models with satisfactory performance can be obtained at an acceptable cost. Nevertheless, several issues remain unresolved, namely when to query new labels during training, what kind of samples deserve labeling to ensure rich information, and where the labels should be distributed to guarantee diversity. Thus we elaborate OmniQuery to omnibearing address the “When, What, and Where” problems about active points querying in source-free domain adaptation for cross-modal 3D semantic segmentation. The method consists of three main components: Query Decider, Point Ranker, and Budget Slicer. The Query Decider determines the optimal timing to query new points by fitting the validation curves during training. The Point Ranker nominates points for annotation by calculating the ambiguity of neighboring points in the feature space. The Budget Slicer allocates the annotation quota, i.e., labeling percentage of the point cloud, to different semantic regions by utilizing the advanced 2D semantic segmentation capabilities of the Segment Anything Model (SAM). Extensive experiments demonstrate the effectiveness of our proposed method, achieving up to 99.64% of fully supervised performance with only 3% of labels, and consistently outperforming comparison methods across various scenarios.

School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China

Abstract: Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called MultiConcept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level and leverage these concepts to provide temporal event cues; and (2) establish cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, weakly supervised concept detection is performed for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to produce more discriminative concept embeddings. In the captioning network, a cyclic co-learning strategy is proposed, where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.

Abstract: The rapid advancement of 3D Generative Adversarial Networks (GANs) has significantly enhanced the diversity and quality of generated 3D images. Despite these breakthroughs, the manipulation capabilities of 3D GANs remain unexplored, presenting substantial challenges for practical applications where user interaction and modification are essential. Current manipulation methods often lack the precision needed for finegrained attribute manipulation, and struggle to maintain multi-view consistency during the editing process. To address these limitations, we propose 3DHumanEdit, a novel approach for 3D human body part-aware manipulation. 3DHumanEdit leverages multi-modal feature fusion and body part-aware feature alignment to achieve precise manipulation of individual body parts based on detailed text inputs and segmentation images. By exploring 3D prior for accurate editing and enforcing correspondence in latent space, 3DHumanEdit ensures coherence across multiple views. Experiments demonstrate that 3DHumanEdit outperforms existing methods in both editing fidelity and multi-view consistency, offering a robust solution for fine-grained 3D manipulation.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding—accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on finetuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.

Abstract: Although semisupervised learning has made significant advances in the field of medical image segmentation, fully annotating a volumetric sample slice by slice remains a costly and time-consuming task. Even worse, most of the existing approaches pay much attention to image-level information and ignore semantic features, resulting in the inability to perceive weak boundaries. To address these issues, we propose a novel Semantic-Guided Triplet Co-training (SGTC) framework, which achieves high-end medical image segmentation by only annotating three orthogonal slices of a few volumetric samples, significantly alleviating the burden of radiologists. Our method consist of two main components. Specifically, to enable semantic-aware, fine-granular segmentation and enhance the quality of pseudo-labels, a novel semantic-guided auxiliary learning mechanism is proposed based on the pretrained CLIP. In addition, focusing on a more challenging but clinically realistic scenario, a new triple-view disparity training strategy is proposed, which uses sparse annotations (i.e., only three labeled slices of a few volumes) to perform co-training between three sub-networks, significantly improving the robustness. Extensive experiments on three public medical datasets demonstrate that our method outperforms most state-of-the-art semi-supervised counterparts under sparse annotation settings.

Abstract: Deep neural networks (DNNs) are susceptible to Universal Adversarial Perturbations (UAPs), which are instanceagnostic perturbations that can deceive a target model across a wide range of samples. Unlike instance-specific adversarial examples, UAPs present a greater challenge as they must generalize across different samples and models. Generating UAPs typically requires access to numerous examples, which is a strong assumption in real-world tasks. In this paper, we propose a novel data-free method called Intrinsic UAP(IntriUAP), by exploiting the intrinsic vulnerabilities of deep models. We analyze a series of popular deep models composed of linear and nonlinear layers with a Lipschitz constant of 1, revealing that the vulnerability of these models is predominantly influenced by their linear components. Based on this observation, we leverage the ill-conditioned nature of the linear components by aligning the UAP with the right singular vectors corresponding to the maximum singular value of each linear layer. Remarkably, our method achieves highly competitive performance in attacking popular image classification deep models without using any image samples. We also evaluate the black-box attack performance of our method, showing that it matches the state-of-the-art baseline for data-free methods on models that conform to our theoretical framework. Beyond the data-free assumption, IntriUAP also operates under a weaker assumption, where the adversary only can access a few of the victim model's layers. Experiments demonstrate that the attack success rate decreases by only 4% when the adversary has access to just 50% of the linear layers in the victim model.

Abstract: Industrial anomaly detection achieves progress thanks to datasets such as MVTecAD and VisA. However, they suffer from limitations in terms of the number of defect samples, types of defects, and availability of real-world scenes. These constraints inhibit researchers from further exploring the performance of industrial detection with higher accuracy. To this end, we propose a new large-scale anomaly detection dataset called 3CAD, which is derived from real 3C production lines. Specifically, the proposed 3CAD includes eight different types of manufactured parts, totaling 27,039 high-resolution images labeled with pixel-level anomalies. The key features of 3CAD are that it covers anomalous regions of different sizes, multiple anomaly types, and the possibility of multiple anomalous regions and multiple anomaly types per anomaly image. This is the largest and first anomaly detection dataset dedicated to 3C product quality control for community exploration and development. Meanwhile, we introduce a simple yet effective framework for unsupervised anomaly detection: a Coarse-to-Fine detection paradigm with Recovery Guidance (CFRG). To detect small defect anomalies, the proposed CFRG utilizes a coarse-to-fine detection paradigm. Specifically, we utilize a heterogeneous distillation model for coarse localization and then fine localization through a segmentation model. In addition, to better capture normal patterns, we introduce recovery features as guidance. Finally, we report the results of our CFRG framework and popular anomaly detection methods on the 3CAD dataset, demonstrating strong competitiveness and providing a highly challenging benchmark to promote the development of the anomaly detection field.

Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China, Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract: Prior work employing deep neural networks (DNNs) with explainable techniques has identified human visual cortical selective representation to specific categories. However, constructing highperforming encoding models that accurately capture brain responses to coexisting multi-semantics remains elusive. Here, we used CLIP models combined with CLIP Dissection to establish a multi-semantic mapping framework (CLIP-MSM) for hypothesis-free analysis in human high-level visual cortex. First, we utilize CLIP models to construct voxel-wise encoding models for predicting visual cortical responses to natural scene images. Then, we apply CLIP Dissection and normalize the semantic mapping score to achieve the mapping of single brain voxels to multiple semantics. Our findings indicate that CLIP Dissection applied to DNNs modeling the human high-level visual cortex demonstrates better interpretability accuracy compared to Network Dissection. In addition, to demonstrate how our method enables fine-grained discovery in hypothesis-free analysis, we quantify the accuracy between CLIP-MSM’s reconstructed brain activation in response to categories of faces, bodies, places, words and food, and the ground truth of brain activation. We demonstrate that CLIP-MSM provides more accurate predictions of visual responses compared to CLIP Dissection. Our results have been validated using two large natural image datasets: the Natural Scenes Dataset (NSD) and the Natural Object Dataset (NOD).

School of Information and Engineering, Southwest University of Science and Technology, School of Information and Engineering, Southwest University of Science and Technology, School of Electronic and Optical Engineering, Nanjing University of Science and Technology, School of Information and Engineering, Southwest University of Science and Technology, School of Information and Engineering, Southwest University of Science and Technology, School of Information and Engineering, Southwest University of Science and Technology

Abstract: These recent years have witnessed that convolutional neural network (CNN)based methods for detecting infrared small targets have achieved outstanding performance. However, these methods typically employ standard convolutions, neglecting to consider the spatial characteristics of the pixel distribution of infrared small targets. Therefore, we propose a novel pinwheel-shaped convolution (PConv) as a replacement for standard convolutions in the lower layers of the backbone network. PConv better aligns with the Gaussian-like spatial distribution of infrared small target, improves feature extraction, significantly expands the receptive field, and introduces only a minimal increase in parameters. Additionally, while recent loss functions combine scale and location losses, they do not adequately account for the varying sensitivity of these losses across different target scales, limiting detection performance on dim-small targets. To overcome this, we propose a scale-based dynamic (SD) Loss that dynamically adjusts the influence of scale and location losses based on target size, improving the network's ability to detect targets of varying scales. We construct a new benchmark, SIRST-UAVB, which is the largest and most challenging dataset to date for real-shot single-frame infrared small target detection. Lastly, by integrating PConv and SD Loss into the latest small target detection algorithms, we achieved significant performance improvements on IRSTD-1K and our SIRST-UAVB dataset, validating the effectiveness and generalizability of our approach.

Abstract: Diffusion models have demonstrated superior performance in portrait animation. However, current approaches relied on either visual or audio modality to control character movements, failing to exploit the potential of mixedmodal control. This challenge arises from the difficulty in balancing the weak control strength of audio modality and the strong control strength of visual modality. To address this issue, we introduce MegActor-Sigma: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual modality control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a ``Modality Decoupling Control" training strategy to balance the control strength between visual and audio modalities, along with the ``Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter out public datasets and solely use this filtered dataset for training. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations.

Abstract: Existing semantic segmentation methods face challenges when processing input images degraded by raindrops on the lens or windshield. Unlike other adverse conditions such as fog and nighttime, which degrade visual quality, raindrops not only impair visual appearances but also introduce misleading occlusion, leading to significant performance drops in current models. The novelty of our approach lies in our twostage, dual teacher-student framework. We tackle the complex problem of raindrop degradation by dividing it into two distinct challenges: degraded visual appearance and raindrop occlusion. These challenges are then addressed individually in two stages, utilizing two pairs of teacher-student networks. This division enables the networks to develop specialized expertise in handling each aspect of raindrop degradation, enabling their collaboration to achieve superior performance. In the first stage, one teacher-student pair focuses on learning to extract information from visual degraded areas. Building on this, the second teacher-student pair focuses specially on the raindrop occlusion. As such, unlike the existing methods, our approach employs a collaborative approach to decompose and address raindrop-induced degradations. In the second stage, we introduce a mask-based recovery technique to identify and rectify areas that likely contain misleading information, thus further refining the predictions. Additionally, this stage encourages both pairs to expand knowledge by swapping their specialized expertise. Our method achieves a performance of 60.3 mIoU on Rainy WCity and 72.8 mIoU on ACDC Rainy, representing an improvement of +4.4 mIoU and +2.3 mIoU over the existing state-of-the-art methods, respectively.

Academy for Engineering and Technology, Fudan University, Shanghai 200433, China Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China, Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China, Shandong Computer Science Center (National Supercomputer Center in Jinan), Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China, Academy for Engineering and Technology, Fudan University, Shanghai 200433, China Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China

Abstract: Weakly Supervised Semantic Segmentation (WSSS) with imagelevel labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments on PASCAL VOC and MS COCO validate that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods.

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China AIER Eye Hospital Group Co., Ltd., Changsha 410015, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China AIER Eye Hospital Group Co., Ltd., Changsha 410015, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Peng Cheng Laboratory Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China AIER Eye Hospital Group Co., Ltd., Changsha 410015, China

Abstract: Textto-image (T2I) diffusion models have achieved remarkable progress in generating realistic images from textual descriptions. However, ensuring consistent high-quality image generation with complete backgrounds, object appearance, and optimal texture rendering remains challenging. This paper presents a novel fine-grained pixel-level image editing method based on pre-trained diffusion models. The proposed dual-branch architecture, consisting of Guidance and Generation branches, employs U-Net Denoisers and Self-Attention mechanisms. An improved DDIM-like inversion method obtains the latent representation, followed by multiple denoising steps. Cross-branch interactions, such as KV Replacement, Classifier Guidance, and Feature Correspondence, enable precise control while preserving image fidelity. The iterative refinement and reconstruction process facilitates finegrained editing control, supporting attribute modification, image outpainting, style transfer, and face synthesis with Clickand-Drag style editing using masks. Experimental results demonstrate the effectiveness of the proposed approach in enhancing the quality and controllability of T2I-generated images, surpassing existing methods while maintaining attractive computational complexity for practical real-world applications.

Abstract: In this paper, we present a sequential joint dependency aware model for monocular 2Dto-3D human pose estimation. While existing estimators leverage the (bi)directional joint dependency with graph convolutions and attention, we further propose to exploit the sequential dependency between joints with state space model (SSM). Our sequential dependency takes into consideration the information of kinematic chain, joint hierarchy and the body part. We design a sequential dependency aware representation to transform the pose data into sequential data for our pose SSM module. We tailor the SSM layer in the pose SSM module for pose estimation by learning joint-dependent parameters and introducing pose aware hidden state initialization. Extensive experiments are conducted on two datasets to validate the effectiveness of our proposed SSM module, and the results demonstrate that our pose estimator can deliver impressive performance.

Abstract: Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. However, existing efforts are still struggling to generate highquality images with consistent pose alignment, resulting in unsatisfactory output. In this paper, we propose a framework that delves into the graph relations of pose priors to provide control information for human image generation. The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models to capture the intrinsic associations between different pose parts. A Progressive Graph Integrator (PGI) is designed to learn the spatial relationships of the pose priors with the graph structure, adopting a hierarchical strategy within an Adapter to gradually propagate information across different pose parts. Besides, a pose perception loss is introduced based on a pretrained pose estimation network to minimize the pose differences. Extensive qualitative and quantitative experiments conducted on the Human-Art and LAION-Human datasets clearly demonstrate that our model can achieve significant performance improvement over the latest benchmark models.

Abstract: In the pursuit of efficient vision architectures, substantial efforts have been devoted to optimizing operator efficiency. Depthwise separable operators, such as DWConv, are found cheap in both FLOPs and parameters. As a result, they are increasingly incorporated into efficient backbones, trading for deeper and wider architectures to enhance performance. However, separable operators are not really fast on devices due to the discontinuous memory access requirements. In this paper, we propose FreeNets, a family of simple and efficient backbones that free the separable operation to further accelerate the running speed. We introduce sparse sampling mixers (S2-Mixer) to supersede existing separable token mixers. The S2-Mixer samples multiple segments of partially continuous signals across spatial and channel dimensions for convolutional processing, achieving extremely fast on-device speed. The sparse sampling also enables S2-Mixer to capture long-range pixel relationships from dynamic receptive fields. Furthermore, we introduce a Shift Feed-Forward Network (ShiftFFN) as a faster alternative to existing channel mixers. It utilizes a shift neck architecture that aggregates global information to shift features, enabling faster channel mixing while incorporating global pixel information. Extensive experiments demonstrate that FreeNet offers a superior accuracy-efficiency tradeoff compared to the latest efficient models. On ImageNet-1k, FreeNet-S2 outperforms the StarNet-S4 by 0.4% in top-1 accuracy, while running around 40% faster on desktop GPU and 15% faster on Mobile GPU.

Abstract: Zeroshot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively constructing an Action Semantic Knowledge Graph (ASKG) to derive nuanced text prompts. The ASKG elaborates on static and dynamic concepts and their interrelations, based on the idea of decomposing actions into spatial appearances and temporal motions. During the training phase, the frame-level video representations are meticulously aligned with prompt-level nuanced text representations, which are concurrently regulated by the video representations from the frozen CLIP to enhance generalizability. Extensive experiments validate the effectiveness of our approach, which consistently surpasses state-of-the-art approaches on popular video benchmarks (i.e., Kinetics-600, UCF101, and HMDB51) under challenging ZSAR settings.

MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Electronic Engineering, Xidian University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University Academy of Artificial Intelligence, College of Mathematics Science, Inner Mongolia Normal University, MoE Key Lab of Collaborative Intelligence Systems, Xidian University School of Computer Science and Technology, Xidian University, School of Artificial Intelligence, Xidian University

Abstract: We propose a transformation diffusion model for point cloud registration to balance precision and efficiency. Our method formulates point cloud registration as a denoising diffusion process from noisy transformation to object transformation, which is represented by quaternion and translation. Specifically, in training stage, object transformation diffuses from groundtruth transformation to random distribution, and the model learns to reverse this noising process. In sampling stage, the model refines randomly generated transformation to the optimal transformation in a progressive way. We derive the variational bound in closed form for training and provide instantiation of the model. Our diffusion model maps transformation into latent space, and splits the transformation into two components (rotation and translation) based on the fact that they belong to different solution spaces. In addition, our work provides the following crucial findings: (i) Point cloud registration, one of the representative discriminative tasks, can be solved by a generative way and mapped into latent space to obtain new unified probabilistic formulation. (ii) Our model, Transformation Diffusion Model (TDM) can be a plug-and-play agent for point cloud registration, making our method applicable to different deep registration networks. Experimental results on synthetic and real-world datasets demonstrate that, in correspondence-free and correspondence-based scenarios, TDM can both achieve exceeding 60% performance improvements and higher efficiency simultaneously.

Abstract: Compared with previous 3D reconstruction methods like Nerf, recent Generalizable 3D Gaussian Splatting (G3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for the scenes that have many non-overlapping areas between various views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability.

Abstract: The excellent performance of diffusion models in image generation is always accompanied by overlarge computation costs, which have prevented the application of diffusion models in edge devices and interactive applications. Previous works mainly focus on using fewer sampling steps and compressing the denoising network of diffusion models, while this paper proposes to accelerate diffusion models by introducing SiTo, a similaritybased token pruning method that adaptive prunes the redundant tokens in the input data. SiTo is designed to maximize the similarity between model prediction with and without token pruning by using cheap and hardware-friendly operations, leading to significant acceleration ratios without performance drop, and even sometimes improvements in the generation quality. For instance, the zero-shot evaluation shows SiTo leads to 1.90x and 1.75x acceleration on COCO30K and ImageNet with 1.33 and 1.15 FID reduction at the same time. Besides, SiTo has no training requirements and does not require any calibration data, making it plug-and-play in real-world applications.

Abstract: Visual Question Answering (VQA) is a multifaceted task that integrates computer vision and natural language processing to produce textual answers from images and questions. Existing VQA benchmarks predominantly adhere to a closedset paradigm, limiting their ability to address arbitrary, unseen answers, and thus falling short in real-world scenarios. To address this limitation, we introduce the Open-Vocabulary Visual Question Answering (OVVQA) benchmark, specifically designed to evaluate models under open-world conditions by assessing their performance on both base classes (seen, common answers) and novel classes (unseen, rare answers). In conjunction with this benchmark, we propose a model-agnostic Causal Adapter to combat the inherent bias found in current VQA tasks. Our approach leverages front-door adjustment to enhance causal reasoning, significantly improving model performance on novel categories while maintaining accuracy on base classes. Additionally, we introduce an adaptive transfer loss to facilitate the transfer of more knowledge from the pretrained model to our OVVQA task. Extensive experiments across multiple datasets validate the superiority of our method over existing state-of-the-art approaches, demonstrating its robust generalization and adaptability in open-world VQA scenarios.

School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, Southwest Jiaotong University, School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, Southwest Jiaotong University, Queen Mary university of London School of Computing and Artificial Intelligence, Southwest Jiaotong University

Abstract: Face recognition has made remarkable strides, driven by the expanding scale of datasets, advancements in various backbone and discriminative losses. However, face recognition performance is heavily affected by the label noise, especially closedset noise. While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. This paper identifies this challenge as training isn't robust to noise at the early-stage training, and necessitating an appropriate learning strategy for samples with low confidence, which are often misclassified as closed-set noise in later training phases. To address these issues, we propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous and noisy groups which are devised with separate training strategies. Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training. Subsequently, we introduce how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers. Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results.

Abstract: To break through the limitations of pretraining models on fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally utilize text as a prompt, achieving remarkable performance. Following SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover detection or segmentation targets. Despite these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe characteristics of specialized category using textual description. On the other hand, existing visual prompt paradigms heavily rely on multi-round human interaction, which hinders them being applied to fully automated pipeline. To address the above issues, we propose a novel prompt paradigm in OSOD and OSS, that is, Image Prompt Paradigm. This brand new prompt paradigm enables to detect or segment specialized categories without multi-round human intervention. To achieve this goal, the proposed image prompt paradigm uses just a few image instances as prompts, and we propose a novel framework named MI Grounding for this new paradigm. In this framework, high-quality image prompts are automatically encoded, selected and fused, achieving the single-stage and non-interactive inference. We conduct extensive experiments on public datasets, showing that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared to text prompt paradigm methods and visual prompt paradigm methods. Moreover, MI Grounding can greatly outperform existing method on our constructed specialized ADR50K dataset.

Abstract: An efficient and precise diagnosis of retinal diseases is a fundamental goal for auxiliary diagnostic systems in ophthalmology. Inspired by the importance of scattered subtle lesions in manual retinal disease diagnosis, recent research has achieved stateof-the-art performance by mining information related to subtle lesions, including their texture and shape. However, the spatial distribution patterns of subtle lesion areas, which are also crucial in manual diagnosis, have been overlooked in existing research. Neglecting these spatial distribution patterns (e.g., the ring distribution of microaneurysms in diabetic macular edema) may negatively impact the diagnostic process. In this paper, we introduce the Saliency-Image-Graph (SIGraph) network to capture the spatial distribution patterns of lesion areas. We first employ saliency-based perception to identify latent lesion pixels. Subsequently, we propose a novel image-graph block to efficiently capture the global distribution of abundant lesion pixels with minimal information loss. By leveraging additional distribution patterns, SIGraph achieves state-of-the-art performance with at least a 1.5% performance gain across three datasets. Furthermore, ablation studies demonstrate that our image-graph block can be integrated into other visual backbones and effectively boost performance.

Abstract: Image feature matching is a cardinal problem in computer vision, aiming to establish accurate correspondences between twoview images. Existing methods are constrained by the performance of feature extractors and struggle to capture local information affected by sparse texture or occlusions. Recognizing that human eyes consider not only similar local geometric features but also high-level semantic information of scene objects when matching images, this paper introduces SemaGlue. This novel algorithm perceives and incorporates semantic information into the matching process. In contrast to recent approaches that leverage semantic consistency to narrow the scope of matching areas, SemaGlue achieves semantic amalgamation with the designed Semantic-Aware Fusion (SAF) Block by injecting abundant semantic features from the pre-trained segmentation model. Moreover, the Cross-Domain Alignment (CDA) Block is proposed to address domain alignment issues, bridging the gaps between semantic and geometric domains to ensure applicable semantic amalgamation. Extensive experiments demonstrate that SemaGlue outperforms state-of-the-art methods across various applications such as homography estimation, relative pose estimation, and visual localization.

Abstract: Currently, 3D rendering for largescale free camera trajectories, namely, arbitrary input camera trajectories, poses significant challenges: 1) The distribution and observation angles of the cameras are irregular, and various types of scenes are included in the free trajectories; 2) Processing the entire point cloud and all images at once for large-scale scenes requires a substantial amount of GPU memory. This paper presents a Toy-GS method for accurately rendering large-scale free camera trajectories. Specifically, we propose an adaptive spatial division approach for free trajectories to divide cameras and the sparse point cloud of the entire scene into various regions according to camera poses. Training each local Gaussian in parallel for each area enables us to concentrate on texture details and minimize GPU memory usage. Next, we use the multi-view constraint and position-aware point adaptive control (PPAC) to improve the rendering quality of texture details. In addition, our regional fusion approach combines local and global Gaussians to enhance rendering quality with an increasing number of divided areas. Extensive experiments have been carried out to confirm the effectiveness and efficiency of Toy-GS, leading to state-of-the-art results on two public large-scale datasets as well as our SCUTic dataset. Our proposal demonstrates an enhancement of 1.19 dB in PSNR and conserves 7 G of GPU memory when compared to various benchmarks.

School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University, School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University, School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University, School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University, School of Electrical and Information Engineering, Changsha University of Science and Technology, School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University, School of Astronautics, Beihang University; Tianmushan Laboratory, Beihang University

Abstract: Vanilla convolution and windowbased self-attention have shown significant success in image dehazing. However, they are constrained by limited receptive fields and ignore frequency gaps between dehazed and clear images. The former hampers the modeling of global dependencies, while the latter impedes the learning of high-frequency features, leading to suboptimal performance. In this paper, we propose the Joint Spatial and Fourier Convolutional Network (JSFC-Net), which leverages Fourier transformation to simultaneously address the two aforementioned problems with low computational overhead. We introduce the Frequency-Spatial Promoted and Physical Learning Block, which extracts high-level features from the spatial domain and frequency domain in parallel. We design a simple yet effective solution that uses spatial features to promote and modulate frequency features in a multi-scale manner, achieving refinement of frequency features and addressing robustness issue caused by global sensitivity. Additionally, we present the Receptive Field Selection Module to facilitate improved fusion of spatial and frequency domain features. Finally, we introduce frequency loss to further narrow frequency gaps. Comprehensive experiments on multiple datasets demonstrate that JSFC-Net is significantly superior to SOTA dehazing methods.

Abstract: Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on userprovided prompts to segment objects. This often leads to suboptimal performance due to the inclusion of irrelevant pixels. In addition, the scarcity of data also makes the application of LVLMs in industrial scenarios remain unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO) containing 80K+ samples. MMIO contains diverse industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scenes pre-training dataset for industrial zero-shot learning, and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides a RTVP specifically for industrial zero-shot tasks. RTVP has two significant advantages: First, this paper designs an expert-guided large model domain adaptation mechanism and designs an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers text-visual prompt interactions ignored by previous LVLM, improving visual and textual content understanding. RTVP achieves SOTA with 42.2% and 24.7% AP in zero-shot and closed scenes of MMIO.

Abstract: In recent years, applying multimodal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, MLLMs comprise the well-known Transformer network, which has a less efficient quadratic computation complexity. In this study, we introduce Cobra, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability concerning sequence length. Specifically, Cobra involves replacing Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3× ∼ 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced thanks to the implementation of linear sequential modeling. (ii) Cobra fine-tunes a small parameter (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.

State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China Key Laboratory of Collaborative sensing and autonomous unmanned systems of Zhejiang Province, Hangzhou, China

Abstract: Implicit Neural Representation (INR) has shown great potential in constructing the complex nature signal as a continuous implicit function. However, the representation results are incomplete since different components of the signal correspond to different frequencies and neural network inherently tends to lowfrequency convergence. In this paper, we propose the adaptive Wavelet-Positional Encoding (WPE) to precisely represent content under different frequency distributions for coordinate-based implicit representations. The High-Frequency Perception (HFP) method is first proposed to query locations of high-frequency components from input signals, which can be indicated as local centers of WPE. Then, motivated by wavelet series regression, we present to embed these queried low-dimensional coordinate inputs into wavelet-frequency space by WPE to represent fine details of target signals. Experiments demonstrate that the proposed method can be integrated into various INR methods without modifying training frameworks while significantly improving their performance in 1D signal fitting, 2D image regression, and even 3D scene representation.

School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism, School of Cyber Science and Technology, University of Science and Technology of China Anhui Province Key Laboratory of Digital Security the CCCD Key Lab of Ministry of Culture and Tourism

Abstract: Openvocabulary semantic segmentation (OVSS) aims to segment images of arbitrary categories specified by class labels. While previous approaches relied on extensive image-text pairs or dense semantic annotations, recent training-free methods attempted to overcome these limitations by constructing semantic prototypes in the construction stage and image-to-image matching (i.e., prototype matching) during testing. However, these methods often struggle to effectively capture the visual characteristics of categories and fail to utilize local features during prototype matching. To deal with these problems, we propose a novel training-free framework for OVSS that constructs diverse prototypes and performs fine-grained sub-region matching. Specifically, our method leverages Large Language Models (LLMs) to guide support image generation by descriptions of different attributes of categories and employs coarse-fine clustering to obtain diverse and robust part-level prototypes in the construction stage. During testing, we propose a sub-region matching method, which assigns part-level prototypes to sub-regions utilizing optimal transport, to fully utilize local image features among part-level prototypes. Extensive experiments demonstrate the effectiveness of our method and show that our method achieves state-of-the-art performance, outperforming previous methods across five datasets.

Abstract: The AudioVisual Question Answering (AVQA) task involves extracting question-related audio-visual clues from both temporal and spatial perspectives to answer questions accurately. Despite the promising performance of existing multi-modal AVQA models, thanks to large-scale pre-trained models, challenges remain in the field. Firstly, aligning audio-visual information across temporal and spatial dimensions is difficult. Secondly, the fusion of audio-visual information is often weighted inadequately, limiting model performance. To address the above issues, we design the Audio-Visual Adaptive Fusion Network (AVAF-Net), which uses contrastive learning to align audio-visual information temporally and spatially and adaptively adjusts fusion weights based on the question. Specifically, we initially align visual and audio information temporally through a temporal-alignment contrastive loss. This is followed by an audio-visual clue-mining module that highlights question-related cues, aligning them with the vocal region spatially using spatial alignment contrastive loss. Additionally, a question-oriented adaptive fusion module assigns different weights to audio and visual modalities based on the question content and then fuses them. The fused audio-visual cues are finally used to predict the answer. Extensive experiments on the MUSIC-AVQA dataset show that AVAF-Net surpasses all baseline models, with a maximum improvement of 15.90% in average accuracy and an average improvement of 9.80%.

Abstract: In crossmodal retrieval, comprehensive image understanding is vital while the scene text in images can provide fine-grained information to understand visual semantics. Current methods fail to make full use of scene text. They suffer from the semantic ambiguity of independent scene text and overlook the heterogeneous concepts in image-caption pairs. In this paper, we propose a heterogeneous prompt-guided entity inferring and distilling (HOPID) network to explore the nature connection of scene text in images and captions and learn a property-centric scene text representation. Specifically, we propose to align scene text in images and captions via heterogeneous prompt, which consists of visual and text prompt. For text prompt, we introduce the discriminative entity inferring module to reason key scene text words from captions, while visual prompt highlights the corresponding scene text in images. Furthermore, to secure a robust scene text representation, we design a perceptive entity distilling module that distills the beneficial information of scene text at a fine-grained level. Extensive experiments show that the proposed method significantly outperforms existing approaches on two public cross-modal retrieval benchmarks.

Abstract: Deep unfolding network (DUN) has shed new light on multisequence MRI reconstruction, providing both high interpretability and acceptable performance. However, current approaches still suffer from the plight of information isolation, i.e., learning features of multi-suquences individually and leaving the mask departed from model updating. In this work, we propose a new unfolding solution, namely Information-coupled MRI Acceleration (IMA), to address the isolation issue. Concretely, two specific mechanisms are presented. On the one hand, the latent connections across different sequences are explicitly molded via two auxiliary matrices. While the first matrix is meticulously engineered to assemble the spatial details, the second one hammers at capturing the depth information conditioned on the enriched channels. On the other hand, following a deep analysis on the non-uniform distribution in low- and high-frequency components of the given mask, we elaborate a new unfolding flow using a progressive masking scheme, featuring a dilation-contraction mechanism during forward propagation of successive stages. Massive experiments are conducted under various sampling patterns and acceleration rates, whose results demonstrate that, without any sophisticated architectures, our IMA outperforms the current cutting-edge methods both visually and numerically.

Abstract: Point cloud completion aims to reconstruct the complete 3D shape from incomplete point clouds, and it is crucial for tasks such as 3D object detection and segmentation. Despite the continuous advances in point cloud analysis techniques, feature extraction methods are still confronted with apparent limitations. The sparse sampling of point clouds, used as inputs in most methods, often results in a certain loss of global structure information. Meanwhile, traditional local feature extraction methods usually struggle to capture the intricate geometric details. To overcome these drawbacks, we introduce PointCFormer, a transformer framework optimized for robust global retention and precise local detail capture in point cloud completion. This framework embraces several key advantages. First, we propose a relationbased local feature extraction method to perceive local delicate geometry characteristics. This approach establishes a fine-grained relationship metric between the target point and its k-nearest neighbors, quantifying each neighboring point's contribution to the target point's local features. Secondly, we introduce a progressive feature extractor that integrates our local feature perception method with self-attention. Starting with a denser sampling of points as input, it iteratively queries long-distance global dependencies and local neighborhood relationships. This extractor maintains enhanced global structure and refined local details, without generating substantial computational overhead. Additionally, we develop a correction module after generating point proxies in the latent space to reintroduce denser information from the input points, enhancing the representation capability of the point proxies. PointCFormer demonstrates state-of-the-art performance on several widely used benchmarks.

Abstract: A recent endeavor in one class of video anomaly detection is to leverage diffusion models and posit the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various formations of anomaly and predict normal samples at the feature level regardless that abnormal objects in surveillance videos are often relatively small. To address this, a novel patchbased diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To address this, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

Abstract: In the field of audiovisual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.

Abstract: This paper introduces MultiBooth, a method that generates images from texts containing various concepts from users.Despite diffusion models bringing significant advancements for customized textto-image generation, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the creation of individual concepts within their specified regions, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, showcasing its superior performance and computational efficiency.

Abstract: Recent advancements in textto-3D generation can generate neural radiance fields (NeRFs) with score distillation sampling, enabling 3D asset creation without real-world data capture. With the rapid advancement in NeRF generation quality, protecting the copyright of the generated NeRF has become increasingly important. While prior works can watermark NeRFs in a post-generation way, they suffer from two vulnerabilities. First, a delay lies between NeRF generation and watermarking because the secret message is embedded into the NeRF model post-generation through fine-tuning. Second, generating a non-watermarked NeRF as an intermediate creates a potential vulnerability for theft. To address both issues, we propose Dreamark to embed a secret message by backdooring the NeRF during NeRF generation. In detail, we first pre-train a watermark decoder. Then, Dreamark generates backdoored NeRFs in a way that the target secret message can be verified by the pre-trained watermark decoder on an arbitrary trigger viewport. We evaluate the generation quality and watermark robustness against image- and model-level attacks. Extensive experiments show that the watermarking process will not degrade the generation quality, and the watermark achieves 90+% accuracy among both image-level attacks (e.g., Gaussian noise) and model-level attacks (e.g., pruning attack).

College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, The Hong Kong Polytechnic University, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, Computer and Information Science, Faculty of Science and Technology, University of Macau, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China

Abstract: The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing gaps overlooked by both. Image manipulation localization (IML), a crucial technique to pursue truth from fake images, has long relied on lowlevel (microscopic-level) traces. However, in practice, most tampering aims to deceive the audience by altering image semantics. As a result, manipulation commonly occurs at the object level (macroscopic level), which is equally important as microscopic traces. Therefore, integrating these two levels into the mesoscopic level presents a new perspective for IML research. Inspired by this, our paper explores how to simultaneously construct mesoscopic representations of micro and macro information for IML and introduces the Mesorch architecture to orchestrate both. Specifically, this architecture i) combines Transformers and CNNs in parallel, with Transformers extracting macro information and CNNs capturing micro details, and ii) explores across different scales, assessing micro and macro information seamlessly. Additionally, based on the Mesorch architecture, the paper introduces two baseline models aimed at solving IML tasks through mesoscopic representation. Extensive experiments across four datasets have demonstrated that our models surpass the current state-of-the-art in terms of performance, computational complexity, and robustness.

School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, School of Biomedical Engineering, Shanghai Jiao Tong University Shanghai United Imaging Intelligence Co., Ltd., School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, School of Biomedical Engineering, Shanghai Jiao Tong University Shanghai United Imaging Intelligence Co., Ltd., School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University Shanghai Clinical Research and Trial Center, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University Shanghai Clinical Research and Trial Center

Abstract: Multiple cameras can provide comprehensive multiview video coverage of a person. Fusing this multi-view data is crucial for tasks like behavioral analysis, although it traditionally requires camera calibration—a process that is often complex. Moreover, previous studies have overlooked the challenges posed by self-occlusion under multiple views and the continuity of human body shape estimation. In this study, we introduce a method to reconstruct the 3D human body from multiple uncalibrated camera views. Initially, we utilize a pre-trained human body encoder to process each camera view individually, enabling the reconstruction of human body models and parameters for each view along with predicted camera positions. Rather than merely averaging the models across views, we develop a neural network trained to assign weights to individual views for all human body joints, based on the estimated distribution of joint distances from each camera. Additionally, we focus on the mesh surface of the human body for dynamic fusion, allowing for the seamless integration of facial expressions and body shape into a unified human body model. Our method has shown excellent performance in reconstructing the human body on two public datasets, advancing beyond previous work from the SMPL model to the SMPL-X model. This extension incorporates more complex hand poses and facial expressions, enhancing the detail and accuracy of the reconstructions. Crucially, it supports the flexible ad-hoc deployment of any number of cameras, offering significant potential for various applications.

Abstract: Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis of the MLLM attention mechanisms remains shallow, leading to coarsegrain token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming (ST3), a framework designed to accelerate MLLM inference without retraining. ST3 consists of two primary components: 1) Progressive Visual Token Pruning (PVTP), which eliminates inattentive visual tokens across layers, and 2) Visual Token Annealing (VTA), which dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques deliver around 2x faster inference with only about 30% KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Crucially, ST3 can be seamlessly integrated into existing pre-trained MLLMs, providing a plug-and-play solution for efficient inference.

Abstract: Video inpainting is a crucial task with diverse applications, including finegrained video editing, video recovery, and video dewatermarking. However, most existing video inpainting methods primarily focus on visual content completion while neglecting text information. There are only a limited number of text-guided video inpainting techniques, and these techniques struggle with maintaining visual quality and exhibit poor semantic representation capabilities. In this paper, we introduce CoCoCo, a text-guided video inpainting diffusion framework. To address the aforementioned challenges, we enhance both the training data and model structure. Specifically, we devise an instance-aware region selection strategy for masked area sampling and develop a novel motion block that incorporates efficient 3D full attention and textual cross attention. Additionally, our CoCoCo framework can be seamlessly integrated with various personalized text-to-image diffusion models through a delicate training-free transfer mechanism. Comprehensive experiments demonstrate that CoCoCo can create high-quality visual content with enhanced temporal consistency, improved text controllability, and better compatibility with personalized image models.

Abstract: Although stateof-the-art (SOTA) SAT solvers based on conflict-driven clause learning (CDCL) have achieved remarkable engineering success, their sequential nature limits the parallelism that may be extracted for acceleration on platforms such as the graphics processing unit (GPU). In this work, we propose FastFourierSAT, a highly parallel hybrid SAT solver based on gradient-driven continuous local search (CLS). This is achieved by a parallel algorithm inspired by the fast Fourier transform (FFT)-based convolution for computing the elementary symmetric polynomials (ESPs), which is the major computational task in previous CLS methods. The complexity of our algorithm matches the best previous result. Furthermore, the substantial parallelism inherent in our algorithm can leverage the GPU for acceleration, demonstrating significant improvement over the previous CLS approaches. FastFourierSAT is compared with a wide set of SOTA parallel SAT solvers on extensive benchmarks including combinatorial and industrial problems. Results show that FastFourierSAT computes the gradient 100+ times faster than previous prototypes on CPU. Moreover, FastFourierSAT solves most instances and demonstrates promising performance on larger-size instances.

Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Data Science & School of Computing, National University of Singapore, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: Spatiotemporal Graph Learning (SGL) under ZeroInflated Distribution (ZID) is crucial for urban risk management tasks, including crime prediction and traffic accident profiling. However, SGL models are vulnerable to adversarial attacks, compromising their practical utility. While adversarial training (AT) has been widely used to bolster model robustness, our study finds that traditional AT exacerbates performance disparities between majority and minority classes under ZID, potentially leading to irreparable losses due to underreporting critical risk events. In this paper, we first demonstrate the smaller top-k gradients and lower separability of minority class are key factors contributing to this disparity. To address these issues, we propose MinGRE, a framework for Minority Class Gradients and Representations Enhancement. MinGRE employs a multi-dimensional attention mechanism to reweight spatiotemporal gradients, minimizing the gradient distribution discrepancies across classes. Additionally, we introduce an uncertainty-guided contrastive loss to improve the inter-class separability and intra-class compactness of minority representations with higher uncertainty. Extensive experiments demonstrate that the MinGRE framework not only significantly reduces the performance disparity across classes but also achieves enhanced robustness compared to existing baselines. These findings underscore the potential of our method in fostering the development of more equitable and robust models.

Abstract: Given the ubiquity of multitask in practical systems, Multi-Task Learning (MTL) has found widespread application across diverse domains. In real-world scenarios, these tasks often have different priorities. For instance, In web search, relevance is often prioritized over other metrics, such as click-through rates or user engagement. Existing frameworks pay insufficient attention to the prioritization among different tasks, which typically adjust task-specific loss function weights to differentiate task priorities. However, this approach encounters challenges as the number of tasks grows, leading to exponential increases in hyper-parameter tuning complexity. Furthermore, the simultaneous optimization of multiple objectives can negatively impact the performance of high-priority tasks due to interference from lower-priority tasks. In this paper, we introduce a novel multi-task learning framework employing Lagrangian Differential Multiplier Methods for step-wise multi-task optimization. It is designed to boost the performance of high-priority tasks without interference from other tasks. Its primary advantage lies in its ability to automatically optimize multiple objectives without requiring balancing hyper-parameters for different tasks, thereby eliminating the need for manual tuning. Additionally, we provide theoretical analysis demonstrating that our method ensures optimization guarantees, enhancing the reliability of the process. We demonstrate its effectiveness through experiments on multiple public datasets and its application in Taobao search, a large-scale industrial search ranking system, resulting in significant improvements across various business metrics.

Abstract: Outlier detection (OD) is the task of identifying unusual observations (or outliers) from a given or upcoming data by learning unique patterns of normal observations (or inliers). Recently, a study introduced a powerful unsupervised OD (UOD) solver based on a new observation of deep generative models, called inliermemorization (IM) effect, which suggests that generative models memorize inliers before outliers in early learning stages. In this study, we aim to develop a theoretically principled method to address UOD tasks by maximally utilizing the IM effect. We begin by observing that the IM effect is observed more clearly when the given training data contain fewer outliers. This finding indicates a potential for enhancing the IM effect in UOD regimes if we can effectively exclude outliers from mini-batches when designing the loss function. To this end, we introduce two main techniques: 1) increasing the mini-batch size as the model training proceeds and 2) using an adaptive threshold to calculate the truncated loss function. We theoretically show that these two techniques effectively filter out outliers from the truncated loss function, allowing us to utilize the IM effect to the fullest. Coupled with an additional ensemble technique, we propose our method and term it Adaptive Loss Truncation with Batch Increment (ALTBI). We provide extensive experimental results to demonstrate that ALTBI achieves state-of-the-art performance in identifying outliers compared to other recent methods, even with lower computation costs. Additionally, we show that our method yields robust performances when combined with privacy-preserving algorithms.

Abstract: We propose an energy amplification technique to address the issue that existing models easily overlook lowenergy components in time series forecasting. This technique comprises an energy amplification block and an energy restoration block. The energy amplification block enhances the energy of low-energy components to improve the model's learning efficiency for these components, while the energy restoration block returns the energy to its original level. Moreover, considering that the energy-amplified data typically displays two distinct energy peaks in the frequency spectrum, we integrate the energy amplification technique with a seasonal-trend forecaster to model the temporal relationships of these two peaks independently, serving as the backbone for our proposed model, Amplifier. Additionally, we propose a semi-channel interaction temporal relationship enhancement block for Amplifier, which enhances the model's ability to capture temporal relationships from the perspective of the commonality and specificity of each channel in the data. Extensive experiments on eight time series forecasting benchmarks consistently demonstrate our model's superiority in both effectiveness and efficiency compared to state-of-the-art methods.

Abstract: Detecting anomalies in business processes is crucial for ensuring operational success. While many existing methods rely on statistical frequency to detect anomalies, it's important to note that infrequent behavior doesn't necessarily imply undesirability. To address this challenge, detecting anomalies from a semantic viewpoint proves to be a more effective approach. However, current semantic anomaly detection methods treat a trace (i.e., process instance) as multiple event pairs, disrupting longdistance dependencies. In this paper, we introduce DABL, a novel approach for detecting semantic anomalies in business processes using large language models (LLMs). We collect 143,137 real-world process models from various domains. By generating normal traces through the playout of these process models and simulating both ordering and exclusion anomalies, we fine-tune Llama 2 using the resulting log. Through extensive experiments, we demonstrate that DABL surpasses existing state-of-the-art semantic anomaly detection methods in terms of both generalization ability and learning of given processes. Users can directly apply DABL to detect semantic anomalies in their own datasets without the need for additional training. Furthermore, DABL offers the ability to interpret anomalies' causes in natural language, providing valuable insights into the detected anomalies.

College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, Ant Group, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, Ant Group, Ant Group, School of Software Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph, College of Computer Science and Technology, Zhejiang University ZJU-Ant Group Joint Lab of Knowledge Graph Zhejiang Key Laboratory of Big Data Intelligent Computing

Abstract: Recent advancements in large language models (LLMs) have significantly improved various natural language processing (NLP) tasks. Typically, LLMs are trained to predict the next token, aligning well with many NLP tasks. However, in knowledge graph (KG) scenarios, entities are the fundamental units and identifying an entity requires at least several tokens. This leads to a granularity mismatch between KGs and natural languages. To address this issue, we propose KON, which integrates KG knowledge into the LLM by employing multiple head layers for next k-step prediction. K-ON can not only generate entity-level results in one step, but also enables contrastive loss against entities, which is the most powerful tool in KG representation learning. Experimental results show that K-ON outperforms state-of-the-art methods that incorporate text and even the other modalities.

Abstract: The recent emergence of extreme climate events has significantly raised awareness about sustainable living. In addition to developing energysaving materials and technologies, existing research mainly relies on traditional methods that encourage behavioral shifts towards sustainability, which can be overly demanding or only passively engaging. In this work, we propose to employ recommendation systems to actively nudge users toward more sustainable choices. We introduce Green Recommender Aligned with Personalized Eating (GRAPE), which is designed to prioritize and recommend sustainable food options that align with users’ evolving preferences. We also design two innovative Green Loss functions that cater to green indicators with either uniform or differentiated priorities, thereby enhancing adaptability across a range of scenarios. Extensive experiments on a real-world dataset demonstrate the effectiveness of our GRAPE.

Abstract: Traditional recurrent neural network architectures, such as long shortterm memory neural networks (LSTM), have historically held a prominent role in time series forecasting (TSF) tasks. While the recently introduced sLSTM for Natural Language Processing (NLP) introduces exponential gating and memory mixing that are beneficial for long term sequential learning, its potential short memory issue is a barrier to applying sLSTM directly in TSF. To address this, we propose a simple yet efficient algorithm named P-sLSTM, which is built upon sLSTM by incorporating patching and channel independence. These modifications substantially enhance sLSTM's performance in TSF, achieving state-of-the-art results. Furthermore, we provide theoretical justifications for our design, and conduct extensive comparative and analytical experiments to fully validate the efficiency and superior performance of our model.

Abstract: Defect detection aims to detect and localize regions out of the normal distribution. The previous approaches often explicitly incorporate the defect detection concept, such as by utilizing selfsupervised ground truth or manually defined feature comparison. The aforementioned processes involve modeling the distribution of normal samples, and they rely on the modeled normality for accurate inference. This reliance may hinder their ability to generalize to unseen test scenarios or the test set that deviates from the training distribution. In this paper, we propose a one-stage framework that detects defective patterns directly without the modeling process. This ability is adopted through the joint efforts of three parties: a generative adversarial network (GAN), a newly proposed scaled pattern loss, and a dynamic correction mechanism that allows the network to self-correct. In training, explicit information that could indicate the position of defects is intentionally excluded to avoid learning any direct mapping. Experimental results show that the proposed method performs superior in comparison with the previous SOTA methods in various test scenarios.

Software College, Northeastern University, New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China, Software College, Northeastern University, Software College, Northeastern University, New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China, Software College, Northeastern University, Software College, Northeastern University, School of Computer Science and Engineering, Northeastern University

Abstract: Involving collaborative information in Large Language Models (LLMs) is a promising technique for adapting LLMs for recommendation. Existing methods achieve this by concatenating collaborative features with text tokens into a unified sequence input and then finetuning to align these features with LLM's input space. Although effective, in this work, we identify two limitations when adapting LLMs to recommendation tasks, which hinder the integration of general knowledge and collaborative information, resulting in sub-optimal recommendation performance. (1) Fine-tuning LLM with recommendation data can undermine its inherent world knowledge and fundamental competencies, which are crucial for interpreting and inferring recommendation text. (2) Incorporating collaborative features into textual prompts disrupts the semantics of the original prompts, preventing LLM from generating appropriate outputs. In this paper, we propose a new paradigm, Collaborative LoRA (CoRA), with a collaborative query generator. Rather than input space alignment, this method aligns collaborative information with LLM's parameter space, representing them as incremental weights to update LLM's output. This way, LLM perceives collaborative information without altering its general knowledge and text inference capabilities. Specifically, we employ a collaborative filtering model to extract user and item embeddings and inject them into a set number of learnable queries. We then convert collaborative queries into collaborative weights with low-rank properties and merge the collaborative weights into LLM's weights, enabling LLM to perceive the collaborative signals and generate personalized recommendations without fine-tuning or extra collaborative tokens in prompts. Extensive experiments confirm that CoRA effectively integrates collaborative information into LLM, enhancing recommendation performance.

Abstract: Spatiotemporal data imputation plays a crucial role in various fields such as traffic flow monitoring, air quality assessment, and climate prediction. However, spatiotemporal data collected by sensors often suffer from temporal incompleteness, and the sparse and uneven distribution of sensors leads to missing data in the spatial dimension. Among existing methods, autoregressive approaches are prone to error accumulation, while simple conditional diffusion models fail to adequately capture the spatiotemporal relationships between observed and missing data. To address these issues, we propose a novel twostage Refined Diffusion Probability Impuation (RDPI) framework based on an initial network and a conditional diffusion model. In the initial stage, deterministic imputation methods are used to generate preliminary estimates of the missing data. In the refinement stage, residuals are treated as the diffusion target, and observed values are innovatively incorporated into the forward process. This results in a conditional diffusion model better suited for spatiotemporal data imputation, bridging the gap between the preliminary estimates and the true values. Experiments on multiple datasets demonstrate that RDPI not only achieves state-of-the-art imputation performance but also significantly reduces sampling computational costs.

Abstract: Existing efforts to boost multimodal fusion of 3D anomaly detection (3DAD) primarily concentrate on devising more effective multimodal fusion strategies. However, little attention was devoted to analyzing the role of multimodal fusion architecture (topology) design in contributing to 3D-AD. In this paper, we aim to bridge this gap and present a systematic study on the impact of multimodal fusion architecture design on 3D-AD. This work considers the multimodal fusion architecture design at the intra-module fusion level, i.e., independent modality-specific modules, involving early, middle or late multimodal features with specific fusion operations, and also at the inter-module fusion level, i.e., the strategies to fuse those modules. In both cases, we first derive insights through theoretically and experimentally exploring how architectural designs influence 3D-AD. Then, we extend SOTA neural architecture search (NAS) paradigm and propose 3D-ADNAS to simultaneously search across multimodal fusion strategies and modality-specific modules for the first time. Extensive experiments show that 3D-ADNAS obtains consistent improvements in 3D-AD across various model capacities in terms of accuracy, frame rate, and memory usage, and it exhibits great potential in dealing with few-shot 3D-AD tasks.

Abstract: Multidomain recommendation (MDR) aims to enhance recommendation performance across various domains. However, real-world recommender systems in online platforms often need to handle dozens or even hundreds of domains, far exceeding the capabilities of traditional MDR algorithms, which typically focus on fewer than five domains. Key challenges include a substantial increase in parameter count, high maintenance costs, and intricate knowledge transfer patterns across domains. Furthermore, minor domains often suffer from data sparsity, leading to inadequate training in classical methods. To address these issues, we propose Adaptive REcommendation for All Domains with counterfactual augmentation (AREAD). AREAD employs a hierarchical structure with a limited number of expert networks at several layers, to effectively capture domain knowledge at different granularities. To adaptively capture the knowledge transfer pattern across domains, we generate and iteratively prune a hierarchical expert network selection mask for each domain during training. Additionally, counterfactual assumptions are used to augment data in minor domains, supporting their iterative mask pruning. Our experiments on two public datasets, each encompassing over twenty domains, demonstrate AREAD's effectiveness, especially in data-sparse domains.

College of Computer Science and Electronic Engineering, Hunan University, China, College of Computer Science and Electronic Engineering, Hunan University, China, NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, College of Computer Science, Xiangtan University, College of Computer Science and Electronic Engineering, Hunan University, China, College of Computer Science and Electronic Engineering, Hunan University, China

Abstract: Inductive Knowledge Graph Completion (KGC) aims to infer missing facts between newly emerged entities within knowledge graphs (KGs), posing a significant challenge. While recent studies have shown promising results in inferring such entities through knowledge subgraph reasoning, they suffer from (i) the semantic inconsistencies of similar relations, and (ii) noisy interactions inherent in KGs due to the presence of unconvincing knowledge for emerging entities. To address these challenges, we propose a Semantic Structureaware Denoising Network (S2DN) for inductive KGC. Our goal is to learn adaptable general semantics and reliable structures to distill consistent semantic knowledge while preserving reliable interactions within KGs. Specifically, we introduce a semantic smoothing module over the enclosing subgraphs to retain the universal semantic knowledge of relations. We incorporate a structure refining module to filter out unreliable interactions and offer additional knowledge, retaining robust structure surrounding target links. Extensive experiments conducted on three benchmark KGs demonstrate that S2DN surpasses the performance of state-of-the-art models. These results demonstrate the effectiveness of S2DN in preserving semantic consistency and enhancing the robustness of filtering out unreliable interactions in contaminated KGs.

Abstract: SessionBased Recommendation (SBR) based on Graph Neural Networks (GNN) has become a new paradigm for recommender systems, and plays a fundamental role in e-commerce and other relevant domains. Existing graph aggregation methods primarily form node representations by capturing basic relationships between neighboring and central nodes. Despite their encouraging results, the global relationships of items and user intentions within sessions typically change over time, which degrades the effectiveness of existing embedding schemes. To resolve this challenge, we propose a Long and Short-Term Temporal Graph Neural Network (LS-TGNN) for SBR. LS-TGNN employs a novel temporal session graph to aggregate neighborhood information, and models user interests from both long and short-term perspectives. Specifically, we design long-term and short-term encoders to model the long and short-term interests of users, respectively. In order to better model the interests of users in different time dimensions, we introduce an item-granularity method that distinguishes between long and short-term interests. Extensive experiments on three widely used datasets demonstrate that LS-TGNN outperforms existing methods with a large margin.

Abstract: Recommender systems in various applications often encounter the challenge of coldstart, which refers to how to provide recommendations for completely new users. Cross-domain recommendation offers a solution to address this cold-start issue by leveraging user interaction information from other domains and providing recommendations for users in the target domain. However, applying the classic two-tower model in cross-domain scenarios for pure cold-start users proves challenging, and most existing cross-domain cold-start recommendation models adopt an embedding-mapping framework that lacks end-to-end efficiency. The parallel training recommendation method lacks consideration of the domain-level intrinsic characteristics of cross-domain information. In this paper, we propose a generalized framework that Domain-level Disentanglement framework based on information enhancement for Cross-domain Cold-start Recommendation. On one hand, we achieve deep utilization of domain-level information through independent extraction of domain knowledge and fusion using heuristic strategies. On the other hand, our model is incorporated with an information enhancement network based on user attention and a user personalized adaptor. We introduce measures to assess user variability and immutability in cross-domain recommendation, aiming to eliminate inter-domain bias and highlight individual user preferences. Experimental results on widely used cross-domain recommendation datasets demonstrate that our proposed model outperforms state-of-the-art methods, validating its effectiveness.

School of Computer Science and Technology, University of the Chinese Academy of Sciences, Beijing, China, Renaissance Era Investment Management Co., Ltd, Beijing, China Financial Development and Credit Management Research Center, Hunan University, Changsha, China Business School of Hunan University, Hunan University, Changsha, China, School of Computer Science and Technology, University of the Chinese Academy of Sciences, Beijing, China, Shangqiu Normal University, Shangqiu, China, International College, University of the Chinese Academy of Sciences, Beijing, China, Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China CASMINO Ltd., Suzhou, China, York University, Toronto, Canada, University of Toronto, Toronto, Canada

Abstract: The complexity of financial data, characterized by its variability and low signalto-noise ratio, necessitates advanced methods in quantitative investment that prioritize both performance and interpretability.Transitioning from early manual extraction to genetic programming, the most advanced approach in the alpha factor mining domain currently employs reinforcement learning to mine a set of combination factors with fixed weights. However, the performance of resultant alpha factors exhibits inconsistency, and the inflexibility of fixed factor weights proves insufficient in adapting to the dynamic nature of financial markets. To address this issue, this paper proposes a two-stage formulaic alpha generating framework AlphaForge, for alpha factor mining and factor combination. This framework employs a generative-predictive neural network to generate factors, leveraging the robust spatial exploration capabilities inherent in deep learning while concurrently preserving diversity. The combination model within the framework incorporates the temporal performance of factors for selection and dynamically adjusts the weights assigned to each component alpha factor. Experiments conducted on real-world datasets demonstrate that our proposed model outperforms contemporary benchmarks in formulaic alpha factor mining. Furthermore, our model exhibits a notable enhancement in portfolio returns within the realm of quantitative investment and real money investment.

School of Computer Science and Engineering, Northeastern University, China School of Computing, Macquarie University, Australia, School of Computer Science and Engineering, Northeastern University, China, Singapore University of Technology and Design, Singapore, School of Computing, Macquarie University, Australia, School of Computer Science and Engineering, Northeastern University, China National Frontiers Science Center for Industrial Intelligence and Systems Optimization, China Key Laboratory of Data Analytics and Optimization for Smart Industry (Northeastern University), Ministry of Education, China, Bytedance(Seed), Singapore

Abstract: Sequential Recommenders (SRs) are trained to predict the next item as the target given its preceding items as the input, assuming every inputtarget pair is matched and is reliable for training. However, users can be induced by external distractions to click on items inconsistent with their true preferences, resulting in unreliable training instances with mismatched input-target pairs. To resist unreliable data, researchers attempt to develop Robust SRs (RSRs). However, our data analysis unveils that existing RSRs are data-driven. That is, for most instances formed by infrequently co-occurred items, existing RSRs are uncertain about their reliability. To fill this gap, we propose a generic framework -- LLM4RSR (Large Language Models for Robust Sequential Recommendation) to semantically complement data-driven RSRs by correcting uncertain instances into reliable ones based on LLMs' semantic comprehension of items beyond co-occurrence. In this way, RSRs can be re-trained with the corrected data for better accuracy. This is a selective knowledge distillation procedure, where the LLM acts as a teacher guiding student RSRs via uncertain instances. To align LLMs with the data correction task and mitigate inherent hallucinations, we equip the LLM with profile, plan, and memory modules, which are automatically optimized via textual gradient descent, eliminating the need for human effort and expertise. Experiments on four real-world datasets spanning eight backbones verify the generality, effectiveness, and efficiency of LLM4RSR.

Abstract: Computerized adaptive testing(CAT) is a crucial task in computeraided education, which aims to adaptively select suitable question to diagnose examinees' ability status. Existing CAT approaches enhance selection performance by exploring examinee-question(E-Q) relation. These approaches either exclusively utilize explicit E-Q relation. For instance, policy-based approaches determine question selection based on predefined criteria. While effective in adapting to changes in question banks, these methods often entail significant computational costs in searching for suitable questions. Conversely, some studies focus solely on implicit E-Q relation. For example, learning-based approaches train agents to efficiently select questions by learning from large-scale datasets. However, they may struggle with newly introduced questions. Additionally, most of these existing question selectors are based on greedy strategies, which potentially overlooks promising quuestions. To bridge the above two types of approaches, we propose a novel framework named Relation Exploiting-based CAT(RECAT) by exploring and exploiting the implicit and explicit examinee-question relation. Specifically, we first define an examinee true ability-oriented selection objective to select more suitable questions. Then, to learn the implicit E-Q relation, we design a question selector, which explores the examinee ability and generates best-fitting questions for specific examinee ability from two aspects, including generation consistency and knowledge matching. The former aims to maximize the likelihood estimation of the implicit E-Q relation learning process, while the latter is employed to fit the distribution of real questions. To fully exploit explicit E-Q relation, we generate a high-quality candidate set for the given examinee's ability using implicit E-Q relation, which streamlines the search process, minimizing selection latency. We demonstrate the effectiveness and efficiency of our framework through comprehensive experiments on real-world datasets.

Abstract: In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an endto-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model's capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5% bits per pixel.

Key Laboratory of Big Data Intelligent Computing Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing, China Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing, China Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing, China Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing, China Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Key Laboratory of Big Data Intelligent Computing Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing, China Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing 401331, China

Abstract: Graph Neural Networks (GNNs) have demonstrated significant achievements in processing graph data, yet scalability remains a substantial challenge. To address this, numerous graph coarsening methods have been developed. However, most existing coarsening methods are trainingdependent, leading to lower efficiency, and they all require a predefined coarsening rate, lacking an adaptive approach. In this paper, we employ granular-ball computing to effectively compress graph data. We construct a coarsened graph network by iteratively splitting the graph into granular-balls based on a purity threshold and using these granular-balls as super vertices. This granulation process significantly reduces the size of the original graph, thereby greatly enhancing the training efficiency and scalability of GNNs. Additionally, our algorithm can adaptively perform splitting without requiring a predefined coarsening rate. Experimental results demonstrate that our method achieves accuracy comparable to training on the original graph. Noise injection experiments further indicate that our method exhibits robust performance. Moreover, our approach can reduce the graph size by up to 20 times without compromising test accuracy, substantially enhancing the scalability of GNNs.

Abstract: The superiority of graph contrastive learning (GCL) has prompted its application to anomaly detection tasks for more powerful risk warning systems. Unfortunately, existing GCLbased models tend to excessively prioritize overall detection performance while neglecting robustness to structural imbalance, which can be problematic for many real-world networks following power-law degree distributions. Particularly, GCL-based methods may fail to capture tail anomalies (abnormal nodes with low degrees). This raises concerns about the security and robustness of current anomaly detection algorithms and therefore hinders their applicability in a variety of realistic high-risk scenarios. To the best of our knowledge, research on the robustness of graph anomaly detection to structural imbalance has received little scrutiny. To address the above issues, this paper presents a novel GCL-based framework named AD-GCL. It devises the neighbor pruning strategy to filter noisy edges for head nodes and facilitate the detection of genuine tail nodes by aligning from head nodes to forged tail nodes. Moreover, AD-GCL actively explores potential neighbors to enlarge the receptive field of tail nodes through anomaly-guided neighbor completion. We further introduce intra- and inter-view consistency loss of the original and augmentation graph for enhanced representation. The performance evaluation of the whole, head, and tail nodes on multiple datasets validates the comprehensive superiority of the proposed AD-GCL in detecting both head anomalies and tail anomalies.

Abstract: Fraud detection that aims to discern frauds from the majority of benigns has become an increasingly prominent research field. Recently, Graph Neural Networks (GNNs) have been widely applied in graphbased fraud detection due to their outstanding data analysis and mining capabilities. However, owing to the inherent homophily-heterophily mixture and class imbalance of fraud graphs, most GNNs with homophily assumption inevitably suffer from local abnormal signal loss during information propagation, posing significant challenges in situations where frauds are rare and valuable. To address the aforementioned issues, we present a novel dynamic neighborhood modeling via node-subgraph contrastive learning for graph-based fraud detection, dubbed DCL-GFD. Specifically, we first design a node abnormality estimation module from the perspective of feature, which analyses the likelihood of a node belonging to fraud or benign by comparing the feature similarity between the target node and its corresponding subgraph. We then present a dynamic neighborhood modeling mechanism guided by the abnormal probability of a node to adaptively group and aggregate neighborhood information. By this means, the target node can effectively aggregate the neighbor information from the perspective of fraud or benign, thereby preserving as much fraud characteristics that occupy minority population as possible. Extensive experiments across four real-world fraud detection datasets demonstrate the superiority and effectiveness of our proposed DCL-GFD over state-of-the-art baselines.

Abstract: Large Language Models (LLMs) offer groundbreaking advancements in recommender systems through superior text analysis and decisionmaking support. However, integrating LLMs into recommender systems still suffers from the problems of identifier uninterpretability and lack of transparency. To address these issues and fully leverage the capabilities of LLMs, we propose a chain of thought (CoT) based recommendation framework called CoT4Rec which employs LLMs as data enhancers for user preference analysis. Initially, we design a CoT reasoning strategy that can derive more behaviorally-aligned user preference features by clustering users’ historical interactions. Subsequently, we propose a two-stage recommendation model that not only makes full use of the world knowledge embedded in LLMs but also generates a logically transparent reasoning path. By integrating a user preference analyzer early in the recommendation pipeline, the model deeply analyzes users' historical interactions, helping to enhance the personalization and transparency of the recommender system. CoT4Rec demonstrates superior performance over existing state-of-the-art models in recommendation tasks across four public datasets, achieving improvements ranging from 2.2% to 12.2%.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable ability in semisupervised node classification. However, most existing GNNs rely heavily on a large amount of labeled data for training, which is labor-intensive and requires extensive domain knowledge. In this paper, we first analyze the restrictions of GNNs generalization from the perspective of supervision signals in the context of few-shot semi-supervised node classification. To address these challenges, we propose a novel algorithm named NormProp, which utilizes the homophily assumption of unlabeled nodes to generate additional supervision signals, thereby enhancing the generalization against label scarcity. The key idea is to efficiently capture both the class information and the consistency of aggregation during message passing, via decoupling the direction and Euclidean norm of node representations. Moreover, we conduct a theoretical analysis to determine the upper bound of Euclidean norm, and then propose homophilous regularization to constraint the consistency of unlabeled nodes. Extensive experiments demonstrate that NormProp achieve state-of-the-art performance under low-label rate scenarios with low computational complexity.

Abstract: Predicting ClickThrough Rates is a crucial function within recommendation and advertising platforms, as the output of CTR prediction determines the order of items shown to users. The Embedding and MLP paradigm has become a standard approach for industrial recommendation systems and has been widely deployed. However, this paradigm suffers from cold-start problems, where there is either no or only limited user action data available, leading to poorly learned ID embeddings. The cold-start problem hampers the performance of new items. To address this problem, we design a novel diffusion model to generate a warmed-up embedding for new items. Specifically, we define a novel diffusion process between the ID embedding space and the side information space. In addition, we can derive a sub-sequence from the diffusion steps to expedite training, given that our diffusion model is non-Markovian. Our diffusion model is supervised by both the variational inference and binary cross-entropy objectives, enabling it to generate warmed-up embeddings for items in both the cold-start and warm-up phases. Additionally, we have conducted extensive experiments on three recommendation datasets. The results confirmed the effectiveness of our approach.

Abstract: Spectral Graph Neural Networks effectively handle graphs with different homophily levels, with lowpass filter mining feature smoothness and high-pass filter capturing differences. When these distinct filters could naturally form two opposite views for self-supervised learning, the commonalities between these counterparts for the same node remain unexplored, leading to suboptimal performance. In this paper, a simple yet effective self-supervised contrastive framework, LOHA, is proposed to address this gap. LOHA optimally leverages low-pass and high-pass views by embracing "harmony in diversity". Rather than solely maximizing the difference between these distinct views, which may lead to feature separation, LOHA harmonizes the diversity by treating the propagation of graph signals from both views as a composite feature. Specifically, a novel high-dimensional feature named spectral signal trend is proposed to serve as the basis for the composite feature, which remains relatively unaffected by changing filters and focuses solely on original feature differences. LOHA achieves an average performance improvement of 2.8% over runner-up models on 9 real-world datasets with varying homophily levels. Notably, LOHA even surpasses fully-supervised models on several datasets, which underscores the potential of LOHA in advancing the efficacy of spectral GNNs for diverse graph structures.

Abstract: A fundamental task in multiagent systems is to match n agents to n alternatives (e.g., resources or tasks). This is often done by eliciting agents' ordinal rankings over the alternatives rather than their exact numerical utilities. While this simplifies elicitation, the incomplete information leads to inefficiency, captured by a worst-case measure called distortion. Recent work shows that making just a few cardinal utility queries per agent can significantly improve the distortion, with Amanatidis et al. (2024) achieving O(√n) distortion with two queries per agent. We generalize their result by achieving O(n^(1/λ)) distortion with λ queries per agent, for any constant λ, which is optimal up to a constant factor given a previous lower bound by Amanatidis et al. (2022). We extend this finding to the general social choice problem of selecting one of m alternatives based on n agents' preferences, achieving O((min{n, m})^(1/λ)) distortion with λ queries per agent, for any constant λ, which is also optimal given prior results. Thus, our work settles open questions regarding the optimal distortion achievable with a fixed number of cardinal value queries in both settings.

Abstract: We adopt a parametric approach to analyze the worstcase degradation in social welfare when the allocation of indivisible goods is constrained to be fair. Specifically, we are concerned with cardinality-constrained allocations, which require that each agent has at most k items in their allocated bundle. We propose the notion of the price of cardinality, which captures the worst-case multiplicative loss of utilitarian or egalitarian social welfare resulting from imposing the cardinality constraint. We then characterize tight or almost-tight bounds on the price of cardinality as exact functions of the instance parameters, demonstrating how the social welfare improves as k is increased. In particular, one of our main results refines and generalizes the existing asymptotic bound of Θ(√n) on the price of balancedness. We also further extend our analysis to the problem where the items are partitioned into disjoint categories, and each category has its own cardinality constraint. Through a parametric study of the price of cardinality, we provide a framework which aids decision makers in choosing an ideal level of cardinality-based fairness, using their knowledge of the potential loss of utilitarian and egalitarian social welfare.

Abstract: Consider public health officials aiming to spread awareness about a new vaccine in a community interconnected by a social network. How can they distribute information with minimal resources, so as to avoid polarization and ensure communitywide convergence of opinion? To tackle such challenges, we initiate the study of sample complexity of opinion formation in networks. Our framework is built on the recognized opinion formation game, where we regard each agent’s opinion as a data-derived model, unlike previous works that treat opinions as data-independent scalars. The opinion model for every agent is initially learned from its local samples and evolves game-theoretically as all agents communicate with neighbors and revise their models towards an equilibrium. Our focus is on the sample complexity needed to ensure that the opinions converge to an equilibrium such that every agent’s final model has low generalization error. Our paper has two main technical results. First, we present a novel polynomial time optimization framework to quantify the total sample complexity for arbitrary networks, when the underlying learning problem is (generalized) linear regression. Second, we leverage this optimization to study the network gain which measures the improvement of sample complexity when learning over a network compared to that in isolation. Towards this end, we derive network gain bounds for various network classes including cliques, star graphs, and random regular graphs. Additionally, our framework provides a method to study sample distribution within the network, suggesting that it is sufficient to allocate samples inversely to the degree. Empirical results on both synthetic and real-world networks strongly support our theoretical findings.

Abstract: Search and recommendation ecosystems exhibit competition among content creators. This competition has been tackled in a variety of gametheoretic frameworks. Content creators generate documents with the aim of being recommended by a content ranker for various information needs. In order for the ecosystem, modeled as a content ranking game, to be effective and maximize user welfare, it should guarantee stability, where stability is associated with the existence of pure Nash equilibrium in the corresponding game. Moreover, if the contents' ranking algorithm possesses a game in which any best-response learning dynamics of the content creators converge to equilibrium of high welfare, the system is considered highly attractive. However, as classical content ranking algorithms, employed by search and recommendation systems, rank documents by their distance to information needs, it has been shown that they fail to provide such stability properties. As a result, novel content ranking algorithms have been devised. In this work, we offer an alternative approach: corpus enrichment with a small set of fixed dummy documents. It turns out that, with the right design, such enrichment can lead to pure Nash equilibrium and even to the convergence of any best-response dynamics to a high welfare result, where we still employ the classical/current content ranking approach. We show two such corpus enrichment techniques with tight bounds on the number of documents needed to obtain the desired results. Interestingly, our study is a novel extension of Borel's Colonel Blotto game.

Abstract: Reciprocity plays a crucial role in maintaining cooperation in human societies and AI systems. In this paper, we focus on reciprocity within multichannel games and examine how cooperation evolves in this context. We propose a unified framework that allows us to evaluate the reputations of interdependent actions across multiple channels while simultaneously exploring both direct and indirect reciprocity mechanisms. We identify partner and semipartner strategies under both forms of reciprocity, with the former leading to full cooperation and the latter resulting in partial cooperation. Through equilibrium analysis, we characterize the conditions under which full cooperation and partial cooperation emerge. Moreover, we show that when players can link multiple interactions, they learn to coordinate their behavior across different games to maximize overall cooperation. Our findings provide new insights into the maintenance of cooperation across various reciprocity mechanisms and interaction patterns.

Abstract: We consider the Coalition Structure Learning (CSL) problem in multiagent systems, motivated by the existence of coalitions in many real-world systems, e.g., trading platforms and auction systems. In this problem, there is a hidden coalition structure within a set of n agents, which affects the behavior of the agents in games. Our goal is to actively design a sequence of games for the agents to play, such that observations in these games can be used to learn the hidden coalition structure. In particular, we consider the setting where in each round, we design and present a game together with a strategy profile to the agents, and receive a multiple-bit observation -- for each agent, we observe whether or not they would like to deviate from the specified strategy. We show that we can learn the coalition structure in O(log n) rounds if we are allowed to design any normal-form game, matching the information-theoretical lower bound. For practicality, we extend the result to settings where we can only choose games of a specific format, and design algorithms to learn the coalition structure in these settings. For most settings, our complexity matches the theoretical lower bound up to a constant factor.

Abstract: Deep learning models have demonstrated exceptional performance in a variety of realworld applications. These successes are often attributed to strong base models that can generalize to novel tasks with limited supporting data while keeping prior knowledge intact. However, these impressive results are based on the availability of a large amount of high-quality data, which is often lacking in specialized biomedical applications. In such fields, models are often developed with limited data that arrive incrementally with novel categories. This requires the model to adapt to new information while preserving existing knowledge. Few-Shot Class-Incremental Learning (FSCIL) methods offer a promising approach to addressing these challenges, but they also depend on strong base models that face the same aforementioned limitations. To overcome these constraints, we propose AnchorInv following the straightforward and efficient buffer-replay strategy. Instead of selecting and storing raw data, AnchorInv generates synthetic samples guided by anchor points in the feature space. This approach protects privacy and regularizes the model for adaptation. When evaluated on three public physiological time series datasets, AnchorInv exhibits efficient knowledge forgetting prevention and improved adaptation to novel classes, surpassing state-of-the-art baselines.

Abstract: Quality control is a crucial issue of label data collection by crowdsourcing. Typically, aggregation methods to redundant crowd labels are proposed for estimating highquality labels from noisy crowd labels. Most of the existing works concentrate on the label aggregation for Single Crowd Tasks (SCTs) which have a single object set with homogeneous question types. However, it is useful for a requester to combine multiple relevant but different crowd tasks into a Composite Crowd Task (CCT) which have heterogeneous question types and (or) multiple object sets for diverse purposes. Instead of the label aggregation on each crowd task respectively, label aggregation methods by bridging multiple SCTs in CCTs can potentially improve the label quality of all tasks. In this paper, we propose a general label aggregation approach for such CCTs by worker ability constraint satisfaction and relaxed optimization. We collected real crowd datasets of CCTs with diverse task settings based on heterogeneous question types, including categorization, pairwise preference comparisons, and pairwise similarity comparisons. The results demonstrate that our approach can effectively bridge the worker information of CCTs to improve the quality of aggregated labels and outperforms the baselines proposed for SCTs.

Abstract: WiFibased human activity recognition (HAR) holds significant application potential across various fields. To handle dynamic environments where new activities are continuously introduced, WiFi-based HAR systems must adapt by learning new concepts without forgetting previously learned ones. Furthermore, retaining knowledge from old activities by storing historical exemplar is impractical for WiFi-based HAR due to privacy concerns and limited storage capacity of edge devices. In this work, we propose ConSense, a lightweight and fast-adapted exemplar-free class incremental learning framework for WiFi-based HAR. The framework leverages the transformer architecture and involves dynamic model expansion and selective retraining to preserve previously learned knowledge while integrating new information. Specifically, during incremental sessions, small-scale trainable parameters that are trained specifically on the data of each task are added in the multi-head self-attention layer. In addition, a selective retraining strategy that dynamically adjusts the weights in multilayer perceptron based on the performance stability of neurons across tasks is used. Rather than training the entire model, the proposed strategies of dynamic model expansion and selective retraining reduce the overall computational load while balancing stability on previous tasks and plasticity on new tasks. Evaluation results on three public WiFi datasets demonstrate that ConSense not only outperforms several competitive approaches but also requires fewer parameters, highlighting its practical utility in class-incremental scenarios for HAR.

Abstract: Despite significant advancements in image and text conditional image editing, the exploration of using brain signals, which are more direct and personalized to reflect user intentions, remains limited. An intuitive method is to convert implicit brain signals into explicit representations such as images, which can then serve as prompts for editing. However, such twostage method suffers from low inference efficiency, inaccurate brain interpretation, and unnatural editing results. In this paper, we apply brain signals of visual perception as prompts and propose a cross-modal self-supervised learning for natural image painting (MindPainter). This method achieves efficient and natural brain-conditioned image editing in a straightforward manner. MindPainter is trained for reconstruction from masked images directly with pseudo-brain signals, which is simulated by the proposed Pseudo Brain Generator. It facilitates efficient cross-modal integration. The proposed Brain Adapter further eliminates the gap in implicit space between modalities, ensuring accurate semantic interpretation of brain signals and coherent consolidation. Besides, the designed Multi-Mask Generation Policy enhances the generalization, realizing high-quality editing in various painting scenarios, including inpainting and outpainting. To the best of our knowledge, MindPainter is the first work to achieve efficient brain-conditioned image painting, providing potential for direct brain control in creative AI. The code and the link to the extended version will be available on GitHub.

Abstract: Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their noninvasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address the limitation, this paper proposes a unified framework that fully leverages multimodal data to represent EEG signals, named CognitionCapturer. Specifically, CognitionCapturer trains modality expert encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space, followed by using a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively.

Abstract: With the advancements of artificial intelligence (AI), emerging scenarios involving close collaboration between AI and other unknown agents are becoming increasingly common. This requires sometimes training AI agents to collaborate with unknown agents in the absence of a reward function which may be unavailable to the AI agents or even undefined by the unknown agents themselves -- thus posing news challenges to existing learning algorithms that often require knowing the shared reward. In this paper, we show that effective teaming with unknown agents can be achieved in the absence of a reward function, through actively modeling other unknown agents and reasoning about their latent rewards from available interaction/observation history. In particular, we propose a novel framework that leverages a kernel density Bayesian inverse learning method for active reward/goal inference and prove that multi-agent reinforcement learning guided by the inferred reward signals can converge to an optimal policy teaming with unknown agents. The result enables us to develop an adaptive policy update strategy, through the use of a family of pre-trained, goal-conditioned policies, further eliminating the need for online retraining. The proposed solution is evaluated using a wide range of diverse unknown agents of latent and even non-stationary reward. Our solution significantly increases the teaming performance between AI and unknown agents in the absence of reward.

Abstract: Motion planning is a crucial component in autonomous driving. Stateof-the-art motion planners are trained on meticulously curated datasets, which are not only expensive to annotate but also insufficient in capturing rarely seen critical scenarios. Failing to account for such scenarios poses a significant risk to motion planners and may lead to incidents during testing. An intuitive solution is to manually compose such scenarios by programming and executing a simulator (e.g., CARLA). However, this approach incurs substantial human costs. Motivated by this, we propose an inexpensive method for generating diverse critical traffic scenarios to train more robust motion planners. First, we represent traffic scenarios as scripts, which are then used by the simulator to generate traffic scenarios. Next, we develop a method that accepts user-specified text descriptions, which a Large Language Model translates into scripts using in-context learning. The output scripts are sent to the simulator that produces the corresponding traffic scenarios. As our method can generate abundant safety-critical traffic scenarios, we use them as synthetic training data for motion planners. To demonstrate the value of generated scenarios, we train existing motion planners on our synthetic data, real-world datasets, and a combination of both. Our experiments show that motion planners trained with our data significantly outperform those trained solely on real-world data, showing the usefulness of our synthetic data and the effectiveness of our data generation method.

Abstract: Motion planning is a critical module in autonomous driving, with the primary challenge of uncertainty caused by interactions with other participants. As most previous methods treat prediction and planning as separate tasks, it is difficult to model these interactions. Furthermore, since the route path navigates ego vehicles to a predefined destination, it provides relatively stable intentions for ego vehicles and helps constrain uncertainty. On this basis, we construct Int2Planner, an Intentionbased Integrated motion Planner achieves multi-modal planning and prediction. Instead of static intention points, Int2Planner utilizes route intention points for ego vehicles and generates corresponding planning trajectories for each intention point to facilitate multi-modal planning. The experiments on the private dataset and the public nuPlan benchmark show the effectiveness of route intention points, and Int2Planner achieves state-of-the-art performance. We also deploy it in real-world vehicles and have conducted autonomous driving for hundreds of kilometers in urban areas. It further verifies that Int2Planner can continuously interact with the traffic environment.

Abstract: Humans achieve contactrich dexterous grasping through the synergy of visual and tactile information. However, the high-dimensional action space of high DoF multi-fingered hands poses significant challenges to this operation. In this study, we address this complexity by controlling the robotic hand at the reduced dimensional level of individual fingers instead of the entire hand, and develop a finger-based multi-agent deep reinforcement learning strategy by regarding the wrist, arm, and each finger of the hand as intelligent agents. We commence by applying a single-agent reinforcement learning algorithm to guide the whole hand to reach the feasible approaching direction and distance to the object. Then, we develop neuroscience-inspired visuo-tactile fusion networks to train multiple agents to control their assigned fingers by effectively leveraging visual and tactile feedback. This enables dynamic and collaborative adjustments of finger-object interactions, ultimately achieving precise contact with specific areas of the objects. The grasping results on 8 objects show that our approach can achieve stable and compliant grasps. To the best of our knowledge, this is the first work that employs a finger-based multi-agent reinforcement learning approach to control the dexterous grasping process under the guidance of both visual and tactile feedback.

Abstract: Reinforcement learning (RL) has shown promising performance in tackling robotic manipulation tasks (RMTs), which require learning a prolonged sequence of manipulation actions to control robots efficiently. However, most RL algorithms often suffer from two problems when solving RMTs: inefficient exploration due to the extremely large action space and catastrophic forgetting due to the poor sampling efficiency. To alleviate these problems, this paper introduces an Evolutionary Reinforcement Learning algorithm with parameterized Action Primitives, called ERLAP, which combines the advantages of an evolutionary algorithm (EA) and hierarchical RL (HRL) to solve diverse RMTs. A library of heterogeneous action primitives is constructed in HRL to enhance the exploration efficiency of robots and dual populations with new evolutionary operators are run in EA to optimize these primitive sequences, which can diversify the distribution of replay buffer and avoid catastrophic forgetting. The experiments show that ERLAP outperforms four stateof-the-art RL algorithms in simulated RMTs with dense rewards and can effectively avoid catastrophic forgetting in a set of more challenging simulated RMTs with sparse rewards.

Abstract: Perception and interaction with articulated objects present a unique challenge for service robots. Although recent research has emphasized understanding articulated shapes and affordance proposals, existing methods only address isolated aspects, failing to develop comprehensive strategies for robotic perception and manipulation of articulated objects. To bridge this gap, we propose GMAP, which systematically integrates the entire process from command to perception and manipulation. Specifically, we first perform precise partlevel segmentation of the object and identify the geometric and kinematic parameters of articulated joints. Then, by evaluating point-level affordance proposals, we determine the interaction poses for the robot's end-effector. Finally, the robot's execution trajectory is dynamically computed by combining commands with joint parameters and interaction points. Additionally, a key innovation of GMAP is addressing the scarcity of annotated data. We designed a multi-scale point cloud feature extraction module and introduced pre-training and fine-tuning techniques, significantly enhancing the generalization capability of the perception model. Extensive experiments demonstrate that GMAP achieves state-of-the-art (SOTA) performance in both the perception and manipulation of articulated objects and adapts to real-world scenarios.

Abstract: Given a graph representing the workspace, MultiAgent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start vertex to their respective goal vertex while minimizing path costs. Although many MAPF algorithms were developed and can handle up to thousands of agents, they usually rely on the assumption that each action of the agent takes a time unit, and the actions of all agents are synchronized in a sense that the actions of agents start at the same discrete time step, which may limit their use in practice. Only a few algorithms have been developed to address asynchronous actions, and they all lie on one end of the spectrum, focusing on finding optimal solutions with limited scalability. This paper develops new planners that lie on the other end of the spectrum, trading off solution quality for scalability, by finding an unbounded sub-optimal solution for many agents. Our method leverages both search-based methods in handling asynchronous actions and techniques in rule-based planning for MAPF. We analyze the properties of our method and test it against several baselines with up to a thousand agents with asynchronous actions in various maps. Given a runtime limit, our method can handle an order of magnitude more agents than the existing methods with about 25% longer makespan.

Abstract: In the logic of theory change, the AGM model has acquired the status of a standard model. However, the AGM model does not seem adequate for some contexts and application domains. This inspired many researchers to propose extensions and generalizations to AGM. Among these extensions, one of the most important are belief bases. Belief bases have more expressivity than belief sets, as explicit and implicit beliefs have different statuses. In this paper, we present reformulation, a belief change operation that allows us to reformulate a belief base making some particular sentences explicit without modifying the consequences of the belief base. We provide a constructive method and its axiomatic characterization.

Abstract: We study Consistent Query Answering (CQA) over knowledge bases with existential rules. Specifically, we propose a novel framework for CQA that combines previous approaches, allowing for the simultaneous presence of both open and closed predicates, i.e. predicates interpreted under openand closed-world assumption, respectively. We establish the data complexity of answering unions of conjunctive queries in such a new framework under the so-called AR semantics and for different classes of existential rules. We also provide new complexity results for the standard (i.e. non-inconsistency tolerant) query answering in the presence of both open and closed predicates. Our results show that, for certain classes of rules, the complexity of CQA matches that of non-inconsistency-tolerant query answering.

Abstract: Knowledgebased questions are typically employed to evaluate LLM's knowledge boundaries; meanwhile, numerous studies focus on question generation as a means to enhance the capabilities of both models and individuals. However, there is a lack of in-depth exploration about what constitutes a good question from the perspective of knowledge cognition. This paper proposes aligning the complete knowledge underlying questions with educational criteria effectively employed in physics courses, thereby developing novel knowledge-intensive metrics of question quality. To this end, we propose Meta-Fact Checking (MFC), which transforms questions into knowledge graph (KG) triples utilizing LLMs through few-shot prompting, thereby quantifying question quality based on the patterns observed within these triples. MFC introduces a novel interaction mechanism for KGs that communicates meta-facts, illustrating the types of knowledge that KGs can offer to the LLM for reasoning questions, rather than relying solely on the original triples. This strategy ensures that MFC remains unaffected by unexplored triples that LLM has not yet encountered within KGs compared to the retrieve-while-reasoning routine. Experiments across multiple datasets and LLMs demonstrate that MFC significantly improves the accuracy and efficiency of both question answering and assessing. This research marks a pioneering effort to automate the evaluation of question quality based on cognitive capabilities.

Abstract: Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert's dataset nor any information about the transition dynamics. Unlike stateof-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.

Abstract: Conceptbased methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or assume that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method - MuCIL - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2x the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.

Abstract: In this paper, our goal is to generate synthetic data for heterogeneous (mixedtype) tabular datasets with high machine learning utility (MLu). Since the MLu performance depends on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We introduce MaCoDE by redefining the consecutive multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our approach enables the estimation of conditional densities across arbitrary combinations of target and conditional variables. We bridge the theoretical gap between distributional learning and MLM by demonstrating that minimizing the orderless multi-class classification loss leads to minimizing the total variation distance between conditional distributions. To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without re-training. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries.

Abstract: Wearable sensing devices, such as Holter monitors, will play a crucial role in the future of digital health. Unsupervised learning frameworks such as SelfSupervised Learning (SSL) are essential to map these single-lead electrocardiogram (ECG) signals with their anticipated clinical outcomes. These signals are characterized by a tempo-variant component whose patterns evolve through the recording and an invariant component with patterns that remain unchanged. However, existing SSL methods only drive the model to encode the invariant attributes, leading the model to neglect tempo-variant information which reflects subject-state changes through time. In this paper, we present Parallel-Learning of Invariant and Tempo-variant Attributes (PLITA), a novel SSL method designed for capturing both invariant and tempo-variant ECG attributes. The latter are captured by mandating closer representations in space for closer inputs on time. We evaluate both the capability of the method to learn the attributes of these two distinct kinds, as well as PLITA ’s performance compared to existing SSL methods for ECG analysis. PLITA performs significantly better in the set-ups where tempo-variant attributes play a major role.

Abstract: Predicting cellular responses to various perturbations is a critical focus in drug discovery and personalized therapeutics, with deep learning models playing a significant role in this endeavor. Singlecell datasets contain technical artifacts that may hinder the predictability of such models, which poses quality control issues highly regarded in this area. To address this, we propose Cradle-VAE, a causal generative framework tailored for single-cell gene perturbation modeling, enhanced with counterfactual reasoning-based artifact disentanglement. Throughout training, Cradle-VAE models the underlying latent distribution of technical artifacts and perturbation effects present in single-cell datasets. It employs counterfactual reasoning to effectively disentangle such artifacts by modulating the latent basal spaces and learns robust features for generating cellular response data with improved quality. Experimental results demonstrate that this approach improves not only treatment effect estimation performance but also generative quality as well.

Abstract: Recent advances in machine learning have led to a surge in adoption of neural networks for various tasks, but lack of interpretability remains an issue for many others in which an understanding of the features influencing the prediction is necessary to ensure fairness, safety, and legal compliance. In this paper we consider one class of such tasks, tabular dataset classification, and propose a novel neurosymbolic architecture, Neural Reasoning Networks (NRN), that is scalable and generates logically sound textual explanations for its predictions. NRNs are connected layers of logical neurons that implement a form of real valued logic. A training algorithm (R-NRN) learns the weights of the network as usual using gradient descent optimization with backprop, but also learns the network structure itself using a bandit-based optimization. Both are implemented in an extension to PyTorch that takes full advantage of GPU scaling and batched training. Evaluation on a diverse set of 22 open-source datasets for tabular classification demonstrates performance (measured by ROC AUC) which improves over Multilayer Perceptron (MLP) and is statistically similar to other state-of-the-art approaches such as Random Forest, XGBoost and Gradient Boosted Trees, while offering 43% faster training and a more than 2 orders of magnitude reduction in the number of parameters required, on average. Furthermore, R-NRN explanations are shorter than the compared approaches while producing more accurate feature importance scores.

Abstract: The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments could be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. Thus, it is important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general, class of unknown perturbations, and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in rewardperturbed environments.

Abstract: Since its introduction, the transformer has shifted the development trajectory away from traditional models (e.g., RNN, MLP) in time series forecasting, which is attributed to its ability to capture global dependencies within temporal tokens. Followup studies have largely involved altering the tokenization and self-attention modules to better adapt Transformers for addressing special challenges like non-stationarity, channel-wise dependency, and variable correlation in time series. However, we found that the expressive capability of sequence representation is a key factor influencing Transformer performance in time forecasting after investigating several representative methods, where there is an almost linear relationship between sequence representation entropy and mean square error, with more diverse representations performing better. In this paper, we propose a novel attention mechanism with Sequence Complementors and prove feasible from an information theory perspective, where these learnable sequences are able to provide complementary information beyond current input to feed attention. We further enhance the Sequence Complementors via a diversification loss that is theoretically covered. The empirical evaluation of both long-term and short-term forecasting has confirmed its superiority over the recent state-of-the-art methods.

Abstract: The rapid development of diffusion models has significantly advanced AIgenerated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models’ performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models’ effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model’s strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

Abstract: The nonlinear twotime-scale stochastic approximation is widely studied under conditions of bounded variances in noise. Motivated by recent advances that allow for variability linked to the current state or time, we consider state- and time-dependent noises. We show that the Lyapunov function exhibits polynomial convergence rates in both cases, with the rate of polynomial delay depending on the parameters of state- or time-dependent noises. Notably, if the state noise parameters fully approach their limiting value, the Lyapunov function achieves an exponential convergence rate. We provide two numerical examples to illustrate our theoretical findings in the context of stochastic gradient descent with Polyak-Ruppert averaging and stochastic bilevel optimization.

Abstract: Endogeneity, i.e. the dependence of noise and covariates, is a common phenomenon in real data due to omitted variables, strategic behaviours, measurement errors etc. In contrast, the existing analyses of stochastic online linear regression with unbounded noise and linear bandits depend heavily on exogeneity, i.e. the independence of noise and covariates. Motivated by this gap, we study the overand just-identified Instrumental Variable (IV) regression, specifically Two-Stage Least Squares, for stochastic online learning, and propose to use an online variant of Two-Stage Least Squares, namely O2SLS. We show that O2SLS achieves O(dx dz log^2(T)) identification and O(γ √dz √T ) oracle regret after T interactions, where dx and dz are the dimensions of covariates and IVs, and γ is the bias due to endogeneity. For γ = 0, i.e. under exogeneity, O2SLS exhibits O(dx^2 log^2 T ) oracle regret, which is of the same order as that of the stochastic online ridge. Then, we leverage O2SLS as an oracle to design OFUL-IV, a stochastic linear bandit algorithm to tackle endogeneity. OFUL-IV yields O(√dx √dz √T ) regret that matches the regret lower bound under exogeneity. For different datasets with endogeneity, we experimentally show efficiencies of O2SLS and OFUL-IV.

Abstract: Testtime adaptation (TTA) aims to fine-tune a trained model online using unlabeled testing data to adapt to new environments or out-of-distribution data, demonstrating broad application potential in real-world scenarios. However, in this optimization process, unsupervised learning objectives like entropy minimization frequently encounter noisy learning signals. These signals produce unreliable gradients, which hinder the model’s ability to converge to an optimal solution quickly and introduce significant instability into the optimization process. In this paper, we seek to resolve these issues from the perspective of optimizer design. Unlike prior TTA using manually designed optimizers like SGD, we employ a learning-to-optimize approach to automatically learn an optimizer, called Meta Gradient Generator (MGG). Specifically, we aim for MGG to effectively utilize historical gradient information during the online optimization process to optimize the current model. To this end, in MGG, we design a lightweight and efficient sequence modeling layer -- gradient memory layer. It exploits a self-supervised reconstruction loss to compress historical gradient information into network parameters, thereby enabling better memorization ability over a long-term adaptation process. We only need a small number of unlabeled samples to pre-train MGG, and then the trained MGG can be deployed to process unseen samples. Promising results on ImageNet-C/R/Sketch/A indicate that our method surpasses current state-of-the-art methods with fewer updates, less data, and significantly shorter adaptation times. Compared with a previous SOTA SAR, we achieve 7.4% accuracy improvement and 4.2x faster adaptation speed on ImageNet-C.

Abstract: Computational complexity of Bayesian learning is impeding its adoption in practical, largescale tasks, despite demonstrations of significant merits such as improved robustness and resilience to unseen or out-of-distribution inputs over their non-Bayesian counterparts. Although, Deep ensemble methods (Seligmann et al. 2024; Lakshminarayanan, Pritzel, and Blundell 2017) have proven to be highly effective for Bayesian deep learning, their practical application is hindered by substantial computational cost. In this study, we introduce an innovative framework to mitigate the computational burden of ensemble Bayesian deep learning. We explore a more feasible alternative, inspired by the recent success of low-rank adapters, we introduce Bayesian Low-Rank LeArning (Bella). We show, i) Bella achieves a dramatic reduction in the number of trainable parameters required to approximate a Bayesian posterior; and ii) it not only maintains, but in some instances, surpasses the performance–in accuracy and out-of-distribution generalisation–of conventional Bayesian learning methods and non-Bayesian baselines. Our extensive empirical evaluation in large-scale tasks such as ImageNet, CAMELYON17, DomainNet, VQA with CLIP, LLaVA demonstrate the effectiveness and versatility of Bella in building highly scalable and practical Bayesian deep models for real-world applications.

Abstract: Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decisionmaking in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but often overlook fully mining and utilizing the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data under the limited data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimating method is Lipschitz continuous with respect to data distribution and requires smaller sample sizes than histogram binning. Also, a new calibration metric has been designed, leveraging the estimated calibration curve to estimate the true calibration error, and it has been proven to be a consistent calibration measure. Furthermore, realistic calibration datasets can be generated by the binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric are verified in real-world and simulated data. We believe our exploration of integrating prior distributions with empirical data will guide the development of better-calibrated models, contributing to trustworthy AI.

Abstract: In the field of mixedmotive games, extensive multi-agent learning studies have explored the balance between egoism (individual interest), utilitarianism (collective interest), and egalitarianism (fairness). Traditional approaches often rely on manually designed reward functions, social norms, and alliance/federation mechanisms to transition agents from individualistic behaviors toward cooperative strategies. However, these methods typically require all agents to share private local information or to mandatorily participate in federations, which is impractical in real-world applications. To address these issues, this paper proposes a Flexible-Participation Federation (FPF) framework that allows agents to participate in the federation voluntarily. Furthermore, we extend the federation from a global to a Local Multi-Federation (LMF) framework, enabling agents to form multiple localized federations, thereby promoting more efficient and adaptive cooperation. Theoretical evidence demonstrates that the global FPF model, along with the discrepancy between decentralized egoistic policies and federated utilitarian policies, achieves an O(1/T) convergence rate. Agents in the LMF framework also reach consensus within a sublinear gap. Extensive experiments show that agents opting out of federation participation experience a reduction in egoism, and our approach outperforms multiple baselines in terms of both utilitarianism and egalitarianism.

Department of Mathematics and Industrial Engineering, Polytechnique Montréal Canada Excellence Research Chair in Data-Science for Real-time Decision-Making (CERC), Department of Mathematics and Industrial Engineering, Polytechnique Montréal CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains, Department of Mathematics and Industrial Engineering, Polytechnique Montréal CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains, Department of Mathematics and Industrial Engineering, Polytechnique Montréal CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains

Abstract: Tree ensembles, including boosting methods, are highly effective and widely used for tabular data. However, large ensembles lack interpretability and require longer inference times. We introduce a method to prune a tree ensemble into a reduced version that is "functionally identical" to the original model. In other words, our method guarantees that the prediction function stays unchanged for any possible input. As a consequence, this pruning algorithm is lossless for any aggregated metric. We formalize the problem of functionally identical pruning on ensembles, introduce an exact optimization model, and provide a fast yet highly effective method to prune large ensembles. Our algorithm iteratively prunes considering a finite set of points, which is incrementally augmented using an adversarial model. In multiple computational experiments, we show that our approach provides a "free lunch", significantly reducing the ensemble size without altering the model's behavior. Thus, we can preserve stateof-the-art performance at a fraction of the original model's size.

Abstract: Contrastive learning underpins most current selfsupervised time series representation methods. The strategy for constructing positive and negative sample pairs significantly affects the final representation quality. However, due to the continuous nature of time series semantics, the modeling approach of contrastive learning struggles to accommodate the characteristics of time series data. This results in issues such as difficulties in constructing hard negative samples and the potential introduction of inappropriate biases during positive sample construction. Although some recent works have developed several scientific strategies for constructing positive and negative sample pairs with improved effectiveness, they remain constrained by the contrastive learning framework. To fundamentally overcome the limitations of contrastive learning, this paper introduces Frequency-masked Embedding Inference (FEI), a novel non-contrastive method that completely eliminates the need for positive and negative samples. The proposed FEI constructs 2 inference branches based on a prompting strategy: 1) Using frequency masking as prompts to infer the embedding representation of the target series with missing frequency bands in the embedding space, and 2) Using the target series as prompts to infer its frequency masking embedding. In this way, FEI enables continuous semantic relationship modeling for time series. Experiments on 8 widely used time series datasets for classification and regression tasks, using linear evaluation and end-to-end fine-tuning, show that FEI significantly outperforms existing contrastive-based methods in terms of generalization. This study provides new insights into self-supervised representation learning for time series.

Abstract: Federated Graph Learning (FGL) enables multiple clients to jointly train powerful graph learning models, e.g., Graph Neural Networks (GNNs), without sharing their local graph data for graphrelated downstream tasks, such as graph property prediction. In the real world, however, the graph data can suffer from significant distribution shifts across clients as the clients may collect their graph data for different purposes. In particular, graph properties are usually associated with invariant label-relevant substructures (i.e., subgraphs) across clients, while label-irrelevant substructures can appear in a client-specific manner. The issue of distribution shifts of graph data hinders the efficiency of GNN training and leads to serious performance degradation in FGL. To tackle the aforementioned issue, we propose a novel FGL framework entitled FedVN that eliminates distribution shifts through client-specific graph augmentation strategies with multiple learnable Virtual Nodes (VNs). Specifically, FedVN lets the clients jointly learn a set of shared VNs while training a global GNN model. To eliminate distribution shifts, each client trains a personalized edge generator that determines how the VNs connect local graphs in a client-specific manner. Furthermore, we provide theoretical analyses indicating that FedVN can eliminate distribution shifts of graph data across clients. Comprehensive experiments on four datasets under five settings demonstrate the superiority of our proposed FedVN over nine baselines.

Abstract: Multimodal information plays an important role in the advanced Internet of Things (IoT) in the era of 6G, which provides reliable and comprehensive assistance for downstream tasks through further fusion and analysis via federated learning (FL). One of the primary challenges in FL is data heterogeneity, which may lead to domain shifts and sharply different local longtailed category distribution across nodes. These issues hinder the large-scale deployment of FL in IoT applications equipped with multiple various multimodal sensors due to performance deterioration. In this paper, we propose a novel multimodal fusion framework to tackle the aforementioned coupled problems arising during the cooperative fusion of multimodal information without privacy exposure among decentralized nodes equipped with diverse sensors. Specifically, we introduce a flexible global logit alignment (GLA) method based on multi-view domains. This method enables the fusion of diverse multimodal information with the consideration of domain shifts caused by modality-based data heterogeneity. Furthermore, we propose a novel local angular margin (LAM) scheme, which dynamically adjusts decision boundaries for locally seen categories while preserving global decision boundaries for unseen categories. This effectively mitigates severe model divergence caused by significantly different category distributions. Extensive simulations demonstrate the superiority of the proposed framework, which exhibits significant merits in tackling model degeneration caused by data heterogeneity and enhancing modality-based generalization for heterogeneous scenarios.

Abstract: Large Language Models (LLMs) have shown impressive performance across a wide range of tasks. However, the size of LLMs is steadily increasing, hindering their application on computationally constrained environments. On the other hand, despite their general capabilities, there are many situations where only one specific task is performed, rendering all other capabilities unnecessary and wasteful. This leads us to the following question: Is it possible to extract the minimal subset from an LLM that is able to perform a specific task in a faster, standalone manner? Recent works on Mechanistic Interpretability (MI) have shown that specific tasks are performed by a localized subset of components, or circuit. However, current techniques used to identify the circuit cannot be used to extract it for its standalone usage. In this work, we propose a novel approach to automatically extract the subset of the LLM that properly performs a targeted task requiring no additional training and a small amount of data samples. We evaluate our approach on different tasks and show that the resulting models are (i) considerably smaller, reducing the number of parameters up to 82.77% and (ii) more interpretable, as they focus on the circuit that is used to carry out the specific task, and can therefore be understood using MI techniques.

Abstract: Forecasting relations between entities is paramount in the current era of data and AI. However, it is often overlooked that realworld relationships are inherently directional, involve more than two entities, and can change with time. In this paper, we provide a comprehensive solution to the problem of forecasting directional relations in a general setting, where relations are higher-order, i.e., directed hyperedges in a hypergraph. This problem has not been previously explored in the existing literature. The primary challenge in solving this problem is that the number of possible hyperedges is exponential in the number of nodes at each event time. To overcome this, we propose a sequential generative approach that segments the forecasting process into multiple stages, each contingent upon the preceding stages, thereby reducing the search space involved in predictions of hyperedges. The first stage involves a temporal point process-based node event forecasting module that identifies the subset of nodes involved in an event. The second stage is a candidate generation module that predicts hyperedge sizes and adjacency vectors for nodes observing events. The final stage is a directed hyperedge predictor that identifies the truth by searching over the set of candidate hyperedges. To validate the effectiveness of our model, we compiled five datasets and conducted an extensive empirical study to assess each downstream task. Our proposed method achieves a performance gain of 32% and 41% compared to the state-of-the-art pairwise and hyperedge event forecasting models, respectively, for the event type prediction.

Abstract: Due to the superior ability of global dependency, transformer and its variants have become the primary choice in Masked Timeseries Modeling (MTM) towards time-series classification task. In this paper, we experimentally analyze that existing transformer-based MTM methods encounter with two under-explored issues when dealing with time series data: (1) they encode features by performing long-dependency ensemble averaging, which easily results in rank collapse and feature homogenization as the layer goes deeper; (2) they exhibit distinct priorities in fitting different frequency components contained in the time-series, inevitably leading to spectrum energy imbalance of encoded feature. To tackle these issues, we propose an auxiliary content-aware balanced decoder (CBD) to optimize the encoding quality in the spectrum space within masked modeling scheme. Specifically, the CBD iterates on a series of fundamental blocks, and thanks to two tailored units, each block could progressively refine the masked representation via adjusting the interaction pattern based on local content variations of time-series and learning to recalibrate the energy distribution across different frequency components. Moreover, dual-constraint loss is devised to enhance the mutual optimization of vanilla decoder and our CBD. Extensive experimental results on ten time-series classification datasets show that our method nearly surpasses a bunch of baselines. Meanwhile, a series of explanatory results are showcased to sufficiently demystify the behaviors of our method.

Abstract: Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift. Its main challenge lies in identifying common class samples and aligning them. Current methods typically obtain target domain semantics centers from an unconstrained continuous image representation space. Due to domain shift and the unknown number of clusters, these centers often result in complex and less robust alignment algorithm. In this paper, based on visionlanguage models, we search for semantic centers in a semantically meaningful and discrete text representation space. The constrained space ensures almost no domain bias and appropriate semantic granularity for these centers, enabling a simple and robust adaptation algorithm. Specifically, we propose TArget Semantics Clustering (TASC) via Text Representations, which leverages information maximization as a unified objective and involves two stages. First, with the frozen encoders, a greedy search-based framework is used to search for an optimal set of text embeddings to represent target semantics. Second, with the search results fixed, encoders are refined based on gradient descent, simultaneously achieving robust domain alignment and private class clustering. Additionally, we propose Universal Maximum Similarity (UniMS), a scoring function tailored for detecting open-set samples in UniDA. Experimentally, we evaluate the universality of UniDA algorithms under four category shift scenarios. Extensive experiments on four benchmarks demonstrate the effectiveness and robustness of our method, which has achieved state-of-the-art performance.

Abstract: Spiking Neural Networks (SNNs) are promising for lowpower computation due to their event-driven mechanism but often suffer from lower accuracy compared to Artificial Neural Networks (ANNs). ANN-to-SNN knowledge distillation can improve SNN performance, but previous methods either focus solely on label information, missing valuable intermediate layer features, or use a layer-wise approach that neglects spatial and temporal semantic inconsistencies, leading to performance degradation. To address these limitations, we propose a novel method called self-attentive spatio-temporal calibration (SASTC). SASTC uses self-attention to identify semantically aligned layer pairs between ANN and SNN, both spatially and temporally. This enables the autonomous transfer of relevant semantic information. Extensive experiments show that SASTC outperforms existing methods, effectively solving the mismatching problem. Superior accuracy results include 95.12% on CIFAR-10, 79.40% on CIFAR-100 with 2 time steps, and 68.69% on ImageNet with 4 time steps for static datasets, and 97.92% on DVS-Gesture and 83.60% on DVS-CIFAR10 for neuromorphic datasets. This marks the first time SNNs have outperformed ANNs on both CIFAR-10 and CIFAR-100, shedding the new light on the potential applications of SNNs.

School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China, School of Computer Science and Technology, Hainan University, Haikou 570228, China, School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China, Computer School, Beijing Information Science and Technology University, Beijing 100101, China, School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract: Unsupervised multiview feature selection involves selecting a subset of crucial features across diverse views to diminish feature dimensionality without leveraging label information. While numerous studies have delved into this area, current solutions predominantly rely on linear multi-view data or employ weakly supervised learning to aid in feature selection. These approaches may risk losing semantic information when applied to real-world multi-view datasets. In this study, we introduce a novel model, Unsupervised Kernel-based Multi-view Feature selection with Robust self-representation and Binary hashing (UKMFS), which aims to identify robust consistent graph representation across views and leverage binary hashing codes to guide feature selection. Specifically, we first explore the underlying geometry by unifying the dimension of multi-view data with non-linear kernel mapping. Then, we search the consistent graph across views by fusing unique graph representations of each view in a self-representation manner. Additionally, we impose low-rank constraints on the graph of each view to mitigate noise and unimportant parts for preserving the main structures and patterns. Furthermore, we design an unsupervised hashing feature selection model to exploit reliable binary labels across views and weighted matrices from each view. Finally, an effective optimization method is customised to solve the formulated problem iteratively. Comprehensive experiments on public multi-view datasets indicate that our proposed method achieves state-of-the-art performance compared with the representative comparison methods regarding the clustering and the feature selection task.

Abstract: The symbolic discovery of Ordinary Differential Equations (ODEs) from trajectory data plays a pivotal role in AIdriven scientific discovery. Existing symbolic methods predominantly rely on fixed, pre-collected training datasets, which often result in suboptimal performance, as demonstrated in our case study in Figure 1. Drawing inspiration from active learning, we investigate strategies to query informative trajectory data that can enhance the evaluation of predicted ODEs. However, the butterfly effect in dynamical systems reveals that small variations in initial conditions can lead to drastically different trajectories, necessitating the storage of vast quantities of trajectory data using conventional active learning. To address this, we introduce Active Symbolic Discovery of Ordinary Differential Equations via Phase Portrait Sketching (APPS). Instead of directly selecting individual initial conditions, our APPS first identifies an informative region within the phase space and then samples a batch of initial conditions from this region. Compared to traditional active learning methods, APPS mitigates the gap of maintaining a large amount of data. Extensive experiments demonstrate that APPS consistently discovers more accurate ODE expressions than baseline methods using passively collected datasets.

Abstract: Federated learning (FL) is a promising technology for data privacy and distributed optimization, but it suffers from data imbalance and heterogeneity among clients. Existing FL methods try to solve the problems by aligning client with server model or by correcting client model with control variables. These methods excel on IID and general NonIID data but perform mediocrely in Simpson's Paradox scenarios. Simpson's Paradox refers to the phenomenon that the trend observed on the global dataset disappears or reverses on a subset, which may lead to the fact that global model obtained through aggregation in FL does not accurately reflect the distribution of global data. Thus, we propose FedCFA, an novel FL framework employing counterfactual learning to generate counterfactual samples by replacing local data critical factors with global average data, aligning local data distributions with the global and mitigating Simpson's Paradox effects. In addition, to improve the counterfactual samples quality, we introduce factor decorrelation (FDC) loss to reduce the correlation among features and thus improve the independence of extracted factors. We conduct extensive experiments on six datasets and verify that our method outperforms other FL methods in terms of efficiency and global model accuracy under limited communication rounds.

Abstract: The Mixtureof-Experts (MoE) structure scales the Transformer-based large language models (LLMs) and improves their performance with only the sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure is proposed, which can further improve the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts, hence contributing to heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but with high communication efficiency. The innovation of BigMac is mainly due to that we abandon the Communicate-Descend-Ascend-Communicate (CDAC) manner used by fine-grained MoE, which leads to the All-to-All communication always taking place at the highest dimension. Instead, BigMac designs an efficient Descend-Communicate-Communicate-Ascend (DCCA) manner. Specifically, we add a descending and ascending projection at the entrance and exit of the expert, respectively, which enables the communication to perform at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of small experts, ensuring that the expert in BigMac has enough complexity to address tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09 x for training and increases the throughput by up to 3.11 x for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.

Abstract: While pruning methods effectively maintain model performance without extra training costs, they often focus solely on preserving crucial connections, overlooking the impact of pruned weights on subsequent finetuning or distillation, leading to inefficiencies. Moreover, most compression techniques for generative models have been developed primarily for GANs, tailored to specific architectures like StyleGAN, and research into compressing Diffusion models has just begun. Even more, these methods are often applicable only to GANs or Diffusion models, highlighting the need for approaches that work across both model types. In this paper, we introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types. Our analysis reveals that pruned weights often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance compared to random initialization. Our method enhances weight initialization by minimizing the disparities between singular values of pruned weights, thereby improving the fine-tuning process. This approach not only guides the compressed model toward superior solutions but also significantly speeds up fine-tuning. Extensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS improves compression performance across model types without additional training costs.

Department of Computer Science, City University of Hong Kong Centre for Artificial Intelligence and Robotics, HK Institute of Science & Innovation, Chinese Academy of Sciences City University of Hong Kong Shenzhen Research Institute, Department of Computer Science, City University of Hong Kong University of Science and Technology of China, Centre for Artificial Intelligence and Robotics, HK Institute of Science & Innovation, Chinese Academy of Sciences, Department of Computer Science, City University of Hong Kong, Department of Computer Science, City University of Hong Kong City University of Hong Kong Shenzhen Research Institute, Centre for Artificial Intelligence and Robotics, HK Institute of Science & Innovation, Chinese Academy of Sciences State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences

Abstract: Continual learning aims to learn multiple tasks sequentially. A key challenge in continual learning is balancing between two objectives: retaining knowledge from old tasks (stability) and adapting to new tasks (plasticity). Experience replay methods, which store and replay past data alongside new data, have become a widely adopted approach to mitigate catastrophic forgetting. However, these methods neglect the dynamic nature of the stabilityplasticity trade-off and aim to find a fixed and unchanging balance, resulting in suboptimal adaptation during training and inference. In this paper, we propose Pareto Continual Learning (ParetoCL), a novel framework that reformulates the stability-plasticity trade-off in continual learning as a multi-objective optimization (MOO) problem. ParetoCL introduces a preference-conditioned model to efficiently learn a set of Pareto optimal solutions representing different trade-offs and enables dynamic adaptation during inference. From a generalization perspective, ParetoCL can be seen as an objective augmentation approach that learns from different objective combinations of stability and plasticity. Extensive experiments across multiple datasets and settings demonstrate that ParetoCL outperforms state-of-the-art methods and adapts to diverse continual learning scenarios.

Abstract: We propose HYBOOD, a hybrid outof-distribution model based on normalizing flow followed by a simple linear classification model. In real-world settings, it is known that data corruption has a strong influence on model degradation; for example image quality like noise, blur and image geometry like translation, scaling, rotation. MNIST-C, CIFAR10-C are the general synthesized datasets to measure model performance and corruption difficulty in terms of covariate and semantic shifts. HYBOOD shows that the separability between in-distribution, covariate shift, and semantic shift can be represented by distribution distance and log-scale density. We also find out the attributes of covariate shifts are ordered by corruption difficulty ranking (CDR) for the datasets. To the best of our knowledge, this is the first method to measure data corruption difficulty with generative models using Wasserstein Distance, Mutual Information and Minimal Description Length. In this paper, we pose interesting experimental results that the MNIST-C trained generative model is most deteriorated by fog, impulse noise and stripe corruption types. This can be interpreted that those attributes are challenging corruptions to the generative model in uncertainty and complexity. By training in-distribution data only, HYBOOD achieves out-of-distribution detection performance for distinguishable covariate and semantic shifts, and quantifying covariate shift ranking.

Abstract: Visual Reinforcement Learning (RL) facilitates learning directly from raw images; however, the domain gap between training and testing environments frequently leads to a decline in performance within unseen environments. In this paper, we propose Fourier Guided Adaptive Adversarial Augmentation (FGA3), a novel augmentation method that maintains semantic consistency. We focus on style augmentation in the frequency domain by keeping the phase and altering the amplitude to preserve the state of the original data. For adaptive adversarial perturbation, we reformulate the worstcase problem to RL by employing adversarial example training, which leverages value loss and cosine similarity within a semantic space. Moreover, our findings illustrate that cosine similarity is effective in quantifying feature distances within a semantic space. Extensive experiments on DMControl-GB and Procgen have shown that FGA3 is compatible with a wide range of visual RL algorithms, both off-policy and on-policy, and significantly improves the robustness of the agent in unseen environments.

Abstract: Selfsupervised learning (SSL) methods based on the instance discrimination tasks with InfoNCE have achieved remarkable success. Despite their success, SSL models often struggle to generate effective representations for unseen-domain data. To address this issue, research on unsupervised domain generalization (UDG), which aims to develop SSL models that can generate domain-irrelevant features, has been conducted. Most UDG approaches utilize contrastive learning with InfoNCE to generate representations, and perform feature alignment based on strong assumptions to generalize domain-irrelevant common features from multi-source domains. However, existing methods that rely on instance discrimination tasks are not effective at extracting domain-irrelevant common features. This leads to the suppression of domain-irrelevant common features and the amplification of domain-relevant features, thereby hindering domain generalization. Furthermore, strong assumptions underlying feature alignment can lead to biased feature learning, reducing the diversity of common features. In this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive Learning with Prototype Mixup. We explore how InfoNCE suppresses domain-irrelevant common features and amplifies domain-relevant features. Based on this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance domain-irrelevant common features. We also propose Prototype Mixup Learning (PMix) to generalize domain-irrelevant common features across multiple domains without relying on strong assumptions. The proposed method consistently outperforms state-of-the-art methods on the PACS and DomainNet datasets across various label fractions, showing significant improvements.

Abstract: Recently, a stateof-the-art series of algorithms—Goal-Conditioned Weighted Supervised Learning (GCWSL) methods—has been introduced to address the challenges inherent in offline goal-conditioned reinforcement learning (RL). GCWSL optimizes a lower bound on the goal-conditioned RL objective and has demonstrated exceptional performance across a range of goal-reaching tasks, offering a simple, effective, and stable solution. Nonetheless, researches has revealed a critical limitation in GCWSL: the absence of trajectory stitching capabilities. In response, goal data augmentation strategies have been proposed to enhance these methods. However, existing techniques often fail to effectively sample appropriate augmented goals for GCWSL. In this paper, we establish unified principles for goal data augmentation, emphasizing goal diversity, action optimality, and goal reachability. Building on these principles, we propose a Model-based Goal Data Augmentation (MGDA) approach, which leverages a dynamics model to sample more appropriate augmented goals. MGDA uniquely incorporates the local Lipschitz continuity assumption within the learned model to mitigate the effects of compounding errors. Empirical results demonstrate that MGDA significantly improves the performance of GCWSL methods on both state-based and vision-based maze datasets, outperforming previous goal data augmentation techniques in their ability to enhancing stitching capabilities.

Abstract: In this paper, we propose a novel Temporal SequenceAware-Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework, to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings. Different from the existing fine-tuning approaches that capture temporal information by exploring the relationships among all the frames, our perceiver-based adapter recurrently captures the sequential dynamics alongside the timeline, which could perceive the frame order change. To obtain the discriminative representations for each class, we extend a textual corpus for each class derived from the large language models (LLMs) and enrich the visual prototypes by integrating the contextual semantic information. Besides, We introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method establishes a new benchmark, outperforming the second-best competitors.

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and AstronauticsMIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence Department of Computer Science, Hong Kong Baptist University, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence

Abstract: In recent Openset Recognition (OSR) community, a prevailing belief is that enhancing the discriminative boundaries of closed-set classes can improve the robustness of Deep Neural Networks (DNNs) against open data during testing. Typical studies validate this *implicitly* by empirical evidence, without a formalized understanding of *how DNNs help the closed-set features obtain more discriminative boundaries?* For this, we provide an answer from the Neural Collapse (NC) perspective: DNNs align the closed-set with a *Simplex Equiangular Tight Frame* (ETF) structure that has geometric and mathematical interpretability. Regrettably, although NC naturally occurs in DNNs, we discover that typical studies cannot guarantee the features being learned to strictly align with the ETF. Thus, we introduce a novel concept, Fixed ETF Template (FiT), which holds an ideal structure associated with closed-set classes. To force class means and classifier vectors to align with FiT, we further design a Dual ETF (DEF) loss involving two components. Specifically, *F*-DEF loss is designed to align class means with FiT strictly, yielding optimal inter-class separability. Meanwhile, we extend a dual form to classifier vectors, termed *C*-DEF loss, which guides class means and classifier vectors to satisfy self-duality. Our theoretical analysis proves the validity of the proposed approach, and extensive experiments demonstrate that DEF achieves comparable or superior results with reduced computational resources on standard OSR benchmarks.

Abstract: Oneshot methods have significantly advanced the field of neural architecture search (NAS) by adopting weight-sharing strategy to reduce search costs. However, the accuracy of performance estimation can be compromised by co-adaptation. Few-shot methods divide the entire supernet into individual sub-supernets by splitting edge by edge to alleviate this issue, yet neglect relationships among edges and result in performance degradation on huge search space. In this paper, we introduce HEP-NAS, a hierarchy-wise partition algorithm designed to further enhance accuracy. To begin with, HEP-NAS treats edges sharing the same end node as a hierarchy, permuting and splitting edges within the same hierarchy to directly search for the optimal operation combination for each intermediate node. This approach aligns more closely with the ultimate goal of NAS. Furthermore, HEP-NAS selects the most promising sub-supernet after each segmentation, progressively narrowing the search space in which the optimal architecture may exist. To improve performance evaluation of sub-supernets, HEP-NAS employs search space mutual distillation, stabilizing the training process and accelerating the convergence of each individual sub-supernet. Within a given budget, HEP-NAS enables the splitting of all edges and gradually searches for architectures with higher accuracy. Experimental results across various datasets and search spaces demonstrate the superiority of HEP-NAS compared to state-of-the-art methods.

Abstract: Federated learning is a decentralized machine learning approach that consists of servers and clients. It protects data privacy during model training by keeping the training data locally in each client. However, the requirement for the server and clients to frequently synchronize the parameters of the model brings a heavy burden to the communication links, especially when the model size has grown drastically in recent years. Several methods have been proposed to compress the model size by sparsification to reduce the communication overhead, albeit with significant accuracy degradation. In this work, we propose methods to better tradeoff between model accuracy and training efficiency in federated learning. Our first proposed method is a novel sparse mask readjustment rule on the server and the second is a parameter-freezing method during training on the clients. Experimental results show that the model accuracy has significantly improved when combining our proposed methods. For example, compared with the previous state-of-the-art methods with the same total amount of communication cost and computation FLOPs, the accuracy increases on average by 4% and 6% in our methods for CIFAR-10 and CIFAR-100 datasets on ResNet-18, respectively. On the other hand, when targeting the same accuracy, the proposed method can reduce the communication cost by 4-8 times for different datasets with different sparsity levels.

Abstract: Recently, prompt tuning methods for pretrained models have demonstrated promising performance in Class Incremental Learning (CIL). These methods typically involve learning task-specific prompts and predicting the task ID to select the appropriate prompts for inference. However, inaccurate task ID predictions can cause severe inconsistencies between the prompts used during training and inference, leading to knowledge forgetting and performance degradation. Additionally, existing prompt tuning methods rely solely on the pre-trained model to predict task IDs, without fully leveraging the knowledge embedded in the learned prompt parameters, resulting in inferior prediction performance. To address these issues, we propose a novel Cyclic Prompt Aggregation (CAPrompt) method that eliminates the dependency on task ID prediction by cyclically aggregating the knowledge from different prompts. Specifically, rather than predicting task IDs, we introduce an innovative prompt aggregation strategy during both training and inference to overcome prompt inconsistency by utilizing a weighted sum of different prompts. Thorough theoretical analysis demonstrates that under concave conditions, the aggregated prompt achieves lower error compared to selecting a single task-specific prompt. Consequently, we incorporate a concave constraint and a linear constraint to guide prompt learning, ensuring compliance with the concave condition requirement. Furthermore, to fully exploit the prompts and achieve more accurate prompt weights, we develop a cyclic weight prediction strategy. This strategy begins with equal weights for each task and automatically adjusts them to more appropriate values in a cyclical manner. Experiments on various datasets demonstrate that our proposed CAPrompt outperforms state-of-the-art methods by 2%-3%.

Abstract: Deep reinforcement learning (DRL) has revolutionized quantitative trading (Qtrading) by achieving decent performance without significant human expert knowledge. Despite its achievements, we observe that the current state-of-the-art DRL models are still ineffective in identifying the market trends, causing them to miss good trading opportunity or suffer from large drawdowns when encountering market crashes. To address this limitation, a natural approach is to incorporate human expert knowledge in identifying market trends. Whereas, such knowledge is abstract and hard to be quantified. In order to effectively leverage abstract human expert knowledge, in this paper, we propose a universal logic-guided deep reinforcement learning framework for Q-trading, called Logic-Q. In particular, Logic-Q adopts the program synthesis by sketching paradigm and introduces a logic-guided model design that leverages a lightweight, plug-and-play market trend-aware program sketch to determine the market trend and correspondingly adjusts the DRL policy in a post-hoc manner. Extensive evaluations of two popular quantitative trading tasks demonstrate that Logic-Q can significantly improve the performance of previous state-of-the-art DRL trading strategies.

Abstract: Endto-end training with global optimization have popularized graph neural networks (GNNs) for node classification, yet inadvertently introduced vulnerabilities to adversarial edge-perturbing attacks. Adversaries can exploit the inherent opened interfaces of GNNs' input and output, perturbing critical edges and thus manipulating the classification results. Current defenses, due to their persistent utilization of global-optimization-based end-to-end training schemes, inherently encapsulate the vulnerabilities of GNNs. This is specifically evidenced in their inability to defend against targeted secondary attacks. In this paper, we propose the Graph Agent Network (GAgN) to address the aforementioned vulnerabilities of GNNs. GAgN is a graph-structured agent network in which each node is designed as an 1-hop-view agent. Through the decentralized interactions between agents, they can learn to infer global perceptions to perform tasks including inferring embeddings, degrees and neighbor relationships for given nodes. This empowers nodes to filtering adversarial edges while carrying out classification tasks. Furthermore, agents' limited view prevents malicious messages from propagating globally in GAgN, thereby resisting global-optimization-based secondary attacks. We prove that single-hidden-layer multilayer perceptrons (MLPs) are theoretically sufficient to achieve these functionalities. Experimental results show that GAgN effectively implements all its intended capabilities and, compared to state-of-the-art defenses, achieves optimal classification accuracy on the perturbed datasets.

Abstract: Code search is a highly required technique for software development. In recent years, the rapid development of transformerbased language models has made it increasingly more popular to adapt a pre-trained language model to a code search task, where contrastive learning is typically adopted to semantically align user queries and codes in an embedding space. Considering that the same semantic meaning can be presented using diverse language styles in user queries and codes, the representation of queries and codes in an embedding space may thus be non-deterministic. To address the above-specified point, this paper proposes an uncertainty-aware contrastive learning approach for code search. Specifically, for both queries and codes, we design an uncertainty learning strategy to produce diverse embeddings by learning to transform the original inputs into Gaussian distributions and then taking a reparameterization trick. We also design a hard negative sampling strategy to construct query-code pairs for improving the effectiveness of uncertainty-aware contrastive learning. The experimental results indicate that our approach outperforms 10 baseline methods on a large code search dataset with six programming languages. The results also show that our strategies of uncertainty learning and hard negative sampling can really help enhance the representation of queries and codes leading to an improvement of the code search performance.

Abstract: Phenotypic drug discovery has attracted widespread attention because of its potential to identify bioactive molecules. Transcriptomic profiling provides a comprehensive reflection of phenotypic changes in cellular responses to external perturbations. In this paper, we propose XTransferCDR, a novel generative framework designed for feature decoupling and transferable representation learning across domains. Given a pair of perturbed expression profiles, our approach decouples the perturbation representations from basal states through domain separation encoders and then crosstransfers them in the latent space. The transferred representations are then used to reconstruct the corresponding perturbed expression profiles via a shared decoder. This cross-transfer constraint effectively promotes the learning of transferable drug perturbation representations. We conducted extensive evaluations of our model on multiple datasets, including single-cell transcriptional responses to drugs and single- and combinatorial genetic perturbations. The experimental results show that XTransferCDR achieved better performance than current state-of-the-art methods, showcasing its potential to advance phenotypic drug discovery.

Abstract: Fewshot learning (FSL) aims to recognize new concepts using a limited number of visual samples. Existing methods attempt to incorporate semantic information into the limited visual data for category understanding. However, these methods often enrich class-level feature representations with abstract category names, failing to capture nuanced features essential for effective generalization. To address this issue, we propose a novel framework for FSL, which incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs), to enhance the representation of the class prototypes. Specifically, our framework composes a Semantic-guided Visual Pattern Extraction (SVPE) module and a Prototype-Calibration (PC) module, where the SVPE meticulously extracts semantic-aware visual patterns across diverse scales, while the PC module seamlessly integrates these patterns to refine the visual prototype, enhancing its representativeness. Extensive experiments on four few-shot classification benchmarks and the BSCD-FSL cross-domain benchmark showcase remarkable advancements over the current state-of-the-art methods. Notably, for the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an impressive average improvement of 1.95% over the second-best competitor.

School of Computer Science, Beijing University of Posts and Telecommunications, China, School of Science, Beijing University of Posts and Telecommunications, China, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China, School of Computer Science, Beijing University of Posts and Telecommunications, China, School of Computer Science, Beijing University of Posts and Telecommunications, China, School of Computer Science, Beijing University of Posts and Telecommunications, China, School of Computer Science, Beijing University of Posts and Telecommunications, China, School of Computer Science, Beijing University of Posts and Telecommunications, China

Abstract: Crossmodal retrieval maps data under different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore the widely existing annotation noise, i.e., noisy correspondence (NC). Consequently, it inevitably causes performance degradation. Despite attempts that employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, the differences between these architectures primarily stem from random initialization. Thus, the model becomes increasingly homogeneous along with the training process. Consequently, the additional information brought by this paradigm is severely limited. In order to resolve this problem, we introduce Tripartite Learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model's noisy label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even at increasing noise ratios, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.

School of Artificial Intelligence, University of Chinese Academy of Science Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Science Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Science Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Science Institute of Automation, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences

Abstract: Guiding the policy of multiagent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, to enhance policy alignment. The existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapts to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select suitable potential function from a pre-designed pool. Besides, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.

Abstract: CrossDomain Few-Shot Learning (CDFSL) requires the model to transfer knowledge from the data-abundant source domain to data-scarce target domains for fast adaptation, where the large domain gap makes CDFSL a challenging problem. Masked Autoencoder (MAE) excels in effectively using unlabeled data and learning image’s global structures, enhancing model generalization and robustness. However, in the CDFSL task with significant domain shifts, we find MAE even shows lower performance than the baseline supervised models. In this paper, we first delve into this phenomenon for an interpretation. We find that MAE tends to focus on low-level domain information during reconstructing pixels while changing the reconstruction target to token features could mitigate this problem. However, not all features are beneficial, as we then find reconstructing high-level features can hardly improve the model’s transferability, indicating a trade-off between filtering domain information and preserving the image’s global structure. In all, the reconstruction target matters for the CDFSL task. Based on the above findings and interpretations, we further propose Domain-Agnostic Masked Image Modeling (DAMIM) for the CDFSL task. DAMIM includes an Aggregated Feature Reconstruction module to automatically aggregate features for reconstruction, with balanced learning of domain-agnostic information and images’ global structure, and a Lightweight Decoder module to further benefit the encoder’s generalizability. Experiments on four CDFSL datasets demonstrate that our method achieves state-of-the-art performance.

Harbin Institute of Technology Pengcheng Laboratory Shenzhen University of Advanced Technology, Harbin Institute of Technology Pengcheng Laboratory, Northwestern Polytechnical University, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Guangdong Provincial Key Laboratory of Computility Microelectronics Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Harbin Institute of Technology Pengcheng Laboratory, Shenzhen University of Advanced Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Guangdong Provincial Key Laboratory of Computility Microelectronics Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Abstract: Large multimodal language models (MLLMs) have revolutionized natural language processing and visual understanding, but often contain outdated or inaccurate information. Current multimodal knowledge editing evaluations are limited in scope and potentially biased, focusing on narrow tasks and failing to assess the impact on indomain samples. To address these issues, we introduce ComprehendEdit, a comprehensive benchmark comprising eight diverse tasks from multiple datasets. We propose two novel metrics: Knowledge Generalization Index (KGI) and Knowledge Preservation Index (KPI), which evaluate editing effects on in-domain samples without relying on AI-synthetic samples. Based on insights from our framework, we establish Hierarchical In-Context Editing (HICE), a baseline method employing a two-stage approach that balances performance across all metrics. This study provides a more comprehensive evaluation framework for multimodal knowledge editing, reveals unique challenges in this field, and offers a baseline method demonstrating improved performance. Our work opens new perspectives for future research and provides a foundation for developing more robust and effective editing techniques for MLLMs.

Abstract: Text summarization task extracts salient information from a large amount of text for productivity enhancement. However, most existing methods heavily rely on training models from ample and centrally stored data which is infeasible to collect in practice, due to privacy concerns and data scarcity nature under several settings (e.g., edge computing or cold starting). The main challenge lies in constructing the privacypreserving and well-behaved summarization model under the data scarcity scenario, where the data scarcity nature will lead to the knowledge shortage of the model while magnifying the impact of data bias, causing performance degeneration. To tackle this challenge, previous studies attempt to complement samples or improve the efficiency of data. The former is usually associated with high computing costs or has a large dependence on empirical settings, while the latter might not effective due to the lack of consideration of data bias. In this work, we propose FedSum which extends the standard FL framework from depth and breadth to further extract prime and diversified knowledge from limited resources for text summarization. For depth extension, we introduce a Data Partition method to cooperatively recognize the prime samples that are more significant and unbiased, and the Data skip mechanism is introduced to help the model further focus on those prime samples during the local training process. For breadth extension, FedSum extends the source of knowledge and develops the summarization model by extracting knowledge from the data samples, hidden spaces, and globally received parameters. Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness.

Abstract: Multicriteria decision making (MCDM) and preference learning (PL) are crucial subfields of intelligent decision-making, both aiming to aid decision-makers (DMs) in selecting, classifying, or ranking alternatives. While MCDM and PL can complement each other to some extent, existing approaches combining MCDM and PL often struggle with large data volumes and complex relational information. To address this, we propose a novel approach called ID-GMLM that integrates graph models and large language models (LLMs) for intelligent decision-making. It reformulates decision-making as a high-parallelism ranking function in the graph domain, using graph neural networks (GNNs) to learn and understand complex relationships between alternatives or criteria, and LLMs to parse and quantify the preferences of DMs. ID-GMLM features a multi-task learning framework that optimizes the primary task of predicting alternative rankings while modeling criterion interactions through the auxiliary task. Additionally, ID-GMLM incorporates a parameter tuning network based on criterion weights and an attention network, allowing the model to adaptively adjust to the context of the current task and the evolving preferences of DMs. Experiments on benchmark datasets demonstrate that ID-GMLM achieves significant performance improvements, inheriting the interpretability and intuitive appeal of MCDM while leveraging the computational efficiency and high accuracy of PL.

Abstract: Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with selfattention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point clouds generation—Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model's scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring 256-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

Abstract: In recent years, employing Shapley values to compute feature importance has gained considerable attention. Calculating these values inherently necessitates managing an exponential number of parameters—a challenge commonly mitigated through an additivity assumption coupled with linear regression. This paper proposes a novel approach by modeling supervised learning as a multilinear game, incorporating both direct and interaction effects to establish the requisite values for Shapley value computation. To efficiently handle the exponentially increasing parameters intrinsic to multilinear games, we introduce a support vector machine (SVM)based method for parameter estimation, its complexity is predominantly contingent on the number of samples due to the implementation of a dual SVM formulation. Additionally, we unveil an optimized dynamic programming algorithm capable of directly computing the Shapley value and interaction index from the dual SVM. Our proposed methodology is versatile and we demonstrate that it can be applied to local explanation and feature selection. Experiments underscore the competitive efficacy of our proposed methods in terms of feature selection and explanation.

Abstract: Obtaining high certainty in predictive models is crucial for making informed and trustworthy decisions in many scientific and engineering domains. However, extensive experimentation required for model accuracy can be both costly and timeconsuming. This paper presents an adaptive sampling approach designed to reduce epistemic uncertainty in predictive models. Our primary contribution is the development of a metric that estimates potential epistemic uncertainty leveraging prediction interval-generation neural networks. This estimation relies on the distance between the predicted upper and lower bounds and the observed data at the tested positions and their neighboring points. Our second contribution is the proposal of a batch sampling strategy based on Gaussian processes (GPs). A GP is used as a surrogate model of the networks trained at each iteration of the adaptive sampling process. Using this GP, we design an acquisition function that selects a combination of sampling locations to maximize the reduction of epistemic uncertainty across the domain. We test our approach on three unidimensional synthetic problems and a multi-dimensional dataset based on an agricultural field for selecting experimental fertilizer rates. The results demonstrate that our method consistently converges faster to minimum epistemic uncertainty levels compared to Normalizing Flows Ensembles, MC-Dropout, and simple GPs.

Abstract: ActorCritic (AC) algorithms like SAC and TD3 were shown to perform well in a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ remains unestablished, raising questions about the specific class of policies they are implementing. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to illustrate that both pessimistic and non-pessimistic RL objectives can be interpreted through expected utility maximization using an exponential utility function. This approach reveals that pessimistic policies effectively maximize value certainty equivalent, aligning them with the optimization of risk-aware objectives. Furthermore, we propose Decoupled Policy Actor-Critic (DAC). DAC is a model-free algorithm that features two distinct actor networks: a pessimistic actor for temporal-difference learning and an optimistic actor for exploration. Our evaluations of DAC across various locomotion and manipulation tasks demonstrate improvements in sample efficiency and final performance. Remarkably, DAC, while requiring significantly fewer computational resources, matches the performance of leading model-based methods in the complex dog and humanoid domains.

Abstract: Time series forecasting is an important application in various domains such as energy management, traffic planning, financial markets, meteorology, and medicine. However, realtime series data often present intricate temporal variability and sharp fluctuations, which pose significant challenges for time series forecasting. Previous models that rely on 1D time series representations usually struggle with complex temporal variations. To address the limitations of 1D time series, this study introduces the Times2D method that transforms the 1D time series into 2D space. Times2D consists of three main parts: first, a Periodic Decomposition Block (PDB) that captures temporal variations within a period and between the same periods by converting the time series into a 2D tensor in the frequency domain. Second, the First and Second Derivative Heatmaps (FSDH) capture sharp changes and turning points, respectively. Finally, an Aggregation Forecasting Block (AFB) integrates the output tensors from PDB and FSDH for accurate forecasting. This 2D transformation enables the utilization of 2D convolutional operations to effectively capture long and short characteristics of the time series. Comprehensive experimental results across large-scale data in the literature demonstrate that the proposed Times2D model achieves state-of-the-art performance in both short-term and long-term forecasting.

Abstract: When it comes to expensive blackbox optimization problems, Bayesian Optimization (BO) is a well-known and powerful solution. Many real-world applications involve a large number of dimensions, hence scaling BO to high dimension is of much interest. However, state-of-the-art high-dimensional BO methods still suffer from the curse of dimensionality, highlighting the need for further improvements. In this work, we introduce BOIDS, a novel high-dimensional BO algorithm that guides optimization by a sequence of one-dimensional direction lines using a novel tailored line-based optimization procedure. To improve the efficiency, we also propose an adaptive selection technique to identify most optimal lines for each round of line-based optimization. Additionally, we incorporate a subspace embedding technique for better scaling to high-dimensional spaces. We further provide theoretical analysis of our proposed method to analyze its convergence property. Our extensive experimental results show that BOIDS outperforms state-of-the-art baselines on various synthetic and real-world benchmark problems.

Abstract: Personalized Federated Learning (PFL) is widely employed in the Internet of Things (IoT) to handle highvolume, non-iid client data while ensuring data privacy. However, heterogeneous edge devices owned by clients may impose varying degrees of resource constraints, causing computation and communication bottlenecks for PFL. Federated Dropout has emerged as a popular strategy to address this challenge, wherein only a subset of the global model, i.e. a sub-model, is trained on a client's device, thereby reducing computation and communication overheads. Nevertheless, the dropout-based model-pruning strategy may introduce bias, particularly towards non-iid local data. When biased sub-models absorb highly divergent parameters from other clients, performance degradation becomes inevitable. In response, we propose federated learning with stochastic parameter update (FedSPU). Unlike dropout that tailors local models to small-size sub-models, FedSPU maintains the full model architecture on each device but randomly freezes a certain percentage of neurons in the local model during training while updating the remaining neurons. This approach ensures that a portion of the local model remains personalized, thereby enhancing the model's robustness against biased parameters from other clients. Experimental results demonstrate that FedSPU outperforms federated dropout by 4.45% on average in terms of accuracy. Furthermore, an introduced early stopping scheme leads to a significant reduction of the training time in FedSPU by 25%~71% while maintaining high accuracy.

Abstract: We address the problem of federated domain generalization in an unsupervised setting for the first time. We first theoretically establish a connection between domain shift and alignment of gradients in unsupervised federated learning and show that aligning the gradients at both client and server levels can facilitate the generalization of the model to new (target) domains. Building on this insight, we propose a novel method named FedGaLA, which performs gradient alignment at the client level to encourage clients to learn domaininvariant features, as well as global gradient alignment at the server to obtain a more generalized aggregated model. To empirically evaluate our method, we perform various experiments on four commonly used multi-domain datasets, PACS, OfficeHome, DomainNet, and TerraInc. The results demonstrate the effectiveness of our method which outperforms comparable baselines. Ablation and sensitivity studies demonstrate the impact of different components and parameters in our approach.

Abstract: Deep imbalanced regression (DIR), where the target values have a highly skewed distribution and are also continuous, is an intriguing yet underexplored problem in machine learning. While recent works have already shown that incorporating various classification-based regularizers can produce enhanced outcomes, the role of classification remains elusive in DIR. Moreover, such regularizers (e.g., contrastive penalties) merely focus on learning discriminative features of data, which inevitably results in ignorance of either continuity or similarity across the data. To address these issues, we first bridge the connection between the objectives of DIR and classification from a Bayesian perspective. Consequently, this motivates us to decompose the objective of DIR into a combination of classification and regression tasks, which naturally guides us toward a divide-and-conquer manner to solve the DIR problem. Specifically, by aggregating the data at nearby labels into the same groups, we introduce an ordinal group-aware contrastive learning loss along with a multi-experts regressor to tackle the different groups of data thereby maintaining the data continuity. Meanwhile, considering the similarity between the groups, we also propose a symmetric descending soft labeling strategy to exploit the intrinsic similarity across the data, which allows classification to facilitate regression more effectively. Extensive experiments on real-world datasets also validate the effectiveness of our method.

Abstract: Classincremental learning (CIL) aims to continuously introduce novel categories into a classification system without forgetting previously learned ones, thus adapting to evolving data distributions. Researchers are currently focusing on leveraging the rich semantic information of pre-trained models (PTMs) in CIL tasks. Prompt learning has been adopted in CIL for its ability to adjust data distribution to better align with pre-trained knowledge. This paper critically examines the limitations of existing methods from the perspective of prompt learning, which heavily rely on input information. To address this issue, we propose a novel PTM-based CIL method called Input-Agnostic Prompt Enhancement with NegAtive Feedback ReguLation (PEARL). In PEARL, we implement an input-agnostic global prompt coupled with an adaptive momentum update strategy to reduce the model's dependency on data distribution, thereby effectively mitigating catastrophic forgetting. Guided by negative feedback regulation, this adaptive momentum update addresses the parameter sensitivity inherent in fixed-weight momentum updates. Furthermore, it fosters the continuous enhancement of the prompt for new tasks by harnessing correlations between different tasks in CIL. Experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance.

Abstract: Recent advancements in graph neural networks (GNNs) have highlighted the critical need of calibrating model predictions, with neighborhood prediction similarity recognized as a pivotal component. Existing studies suggest that nodes with analogous neighborhood prediction similarity often exhibit similar calibration characteristics. Building on this insight, recent approaches incorporate neighborhood similarity into nodewise temperature scaling techniques. However, our analysis reveals that this assumption does not hold universally. Calibration errors can differ significantly even among nodes with comparable neighborhood similarity, depending on their confidence levels. This necessitates a re-evaluation of existing GNN calibration methods, as a single, unified approach may lead to sub-optimal calibration. In response, we introduce Simi-Mailbox, a novel approach that categorizes nodes by both neighborhood similarity and their own confidence, irrespective of proximity or connectivity. Our method allows fine-grained calibration by employing group-specific temperature scaling, with each temperature tailored to address the specific miscalibration level of affiliated nodes, rather than adhering to a uniform trend based on neighborhood similarity. Extensive experiments demonstrate the effectiveness of our Simi-Mailbox across diverse datasets on different GNN architectures, achieving up to 13.79% error reduction compared to uncalibrated GNN predictions.

Abstract: Multiview document clustering (MvDC) aims to improve the accuracy and robustness of clustering by fully considering the complementarity of different views. However, in real-world clustering applications, most existing works suffer from the following challenges: 1) They primarily align multi-view data based on a single perspective, such as features and classes, thus ignoring the diversity and comprehensiveness of representations. 2) They treat each instance equally in cross-view contrastive learning without considering ambiguous ones, which weakens the model's discriminative ability. To address these problems, we propose an ambiguous instance-aware contrastive network with multi-level matching (AICN-MLM) for MvDC tasks. This model contains two key modules: a multi-level matching module and an ambiguous instance-aware contrastive learning module. The former attempts to align multi-view data from different perspectives, including features, pseudo-labels, and prototypes. The latter dynamically adjusts instance weights through a weight modulation function to highlight ambiguous instance pairs. Thus, our proposed method can effectively explore the consistency of multi-view document data and focus on ambiguous instances to enhance the model's discriminative ability. Extensive experimental results on several multi-view document datasets verify the effectiveness of our proposed method.

Abstract: Federated learning (FL) enables collaborative learning among decentralized clients while safeguarding the privacy of their local data. Existing studies on FL typically assume offline labeled data available at each client when the training starts. Nevertheless, the training data in practice often arrive at clients in a streaming fashion without groundtruth labels. Given the expensive annotation cost, it is critical to identify a subset of informative samples for labeling on clients. However, selecting samples locally while accommodating the global training objective presents a challenge unique to FL. In this work, we tackle this conundrum by framing the data querying process in FL as a collaborative decentralized decision-making problem and proposing an effective solution named LeaDQ, which leverages multi-agent reinforcement learning algorithms. In particular, under the implicit guidance from global information, LeaDQ effectively learns the local policies for distributed clients and steers them towards selecting samples that can enhance the global model's accuracy. Extensive simulations on image and text tasks show that LeaDQ advances the model performance in various FL scenarios, outperforming the benchmarking algorithms.

Abstract: Deep learning models often suffer from performance degradation in unseen domains, posing a risk for safetycritical applications such as autonomous driving. To tackle this problem, recent studies have leveraged pre-trained Visual Foundation Models (VFMs) to enhance generalization. However, exsiting works mainly focus on designing intricate networks for VFMs, neglecting their inherent strong generalization potential. Moreover, these methods typically perform inference on low-resolution images. The loss of detail hinders accurate predictions in unseen domains, especially for small objects. In this paper, we argue that simply fine-tuning VFMs and leveraging high-resolution images unleash the power of VFMs for generalizable semantic segmentation. Therefore, we design a VFM-based segmentation network (VFMNet) that adapts VFMs to this task with minimal fine-tuning, preserving their generalizable knowledge. Then, to fully utilize high-resolution images, we train a Mask-guided Refinement Network (MGRNet) to refine VFMNet's predictions combining detailed image features. Furthermore, we adopt a two-stage coarse-to-fine inference approach. MGRNet is used to refine the low-confidence regions predicted by VFMNet to obtain fine-grained results. Extensive experiments demonstrate the effectiveness of our method, outperforming state-of-the-art methods by 3.3% on the average mIoU in synthetic-to-real domain generalization.

Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence State Key Laboratory for Novel Software Technology, Nanjing University, Huazhong University of Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, HoHai University, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence

Abstract: Longtailed (LT) data distribution is common in multi-label image classification (MLC) and can significantly impact the performance of classification models. One reason is the challenge of learning unbiased instance representations (i.e. features) for imbalanced datasets. Additionally, the co-occurrence of head/tail classes within the same instance, along with complex label dependencies, introduces further challenges. In this work, we delve into this problem through the lens of neural collapse (NC). NC refers to a phenomenon where the last-layer features and classifier of a deep neural network model exhibit a simplex Equiangular Tight Frame (ETF) structure during its terminal training phase. This structure creates an optimal linearly separable state. However, this phenomenon typically occurs in balanced datasets but rarely applies to the typical imbalanced problem. To induce NC properties under Long-tailed multi-label classification (LT-MLC) conditions, we propose an approach named MLC-NC, which aims to learn high-quality data representations and improve the model’s generalization ability. Specifically, MLC-NC accounts for the fact that different labels correspond to different feature parts located in images. MLC-NC extracts class-wise features from each instance through a cross-attention mechanism. To guide the features toward the ETF structure, we introduce visual-semantic feature alignment with a fixed ETF structured label embedding, which helps to learn evenly distributed class centers. To reduce within-class feature variation, we introduce collapse calibration within a lower-dimensional feature space. To mitigate classification bias, we concatenate features and feed them into a binarized fixed ETF classifier. As an orthogonal approach to existing methods, MLC-NC can be seamlessly integrated into various frameworks. Extensive experiments on widely-used benchmarks demonstrate the effectiveness of our method.

Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT) University of Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT) University of Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT) University of Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT) University of Chinese Academy of Sciences School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Aerospace Information Research Institute, Chinese Academy of Sciences Key Laboratory of Target Cognition and Application Technology (TCAT)

Abstract: Longterm Multivariate Time Series (LMTS) forecasting aims to predict extended future trends based on channel-interrelated historical data. Considering the elusive channel correlations, most existing methods compromise by treating channels as independent or tentatively modeling pairwise channel interactions, making it challenging to handle the characteristics of both higher-order interactions and time variation in channel correlations. In this paper, we propose HyperMixer, a novel specializable hypergraph channel mixing plugin which introduces versatile hypergraph structures to capture group channel interactions and time-varying patterns for long-term multivariate time series forecasting. Specifically, to encode the higher-order channel interactions, we structure multiple channels into a hypergraph, achieving a two-phase message-passing mechanism: channel-to-group and group-to-channel. Moreover, the functionally specializable hypergraph structures are presented to boost the capability of hypergraph to capture the time-varying patterns across periods, further refining modeling of channel correlations. Extensive experimental results on seven available benchmark datasets demonstrate the effectiveness and generalization of our plugin in LMTS forecasting. The visual analysis further illustrates that HyperMixer with specializable hypergraphs tailors channel interactions specific to certain periods.

Abstract: Multitask learning (MTL) is widely utilized across a variety of real-world applications, including recommendation systems. For instance, in the field of e-commerce, MTL is commonly employed to simultaneously model click, conversion, and user dwelling time. Among a various of MTL models, the Multi-gate Mixture-of-Experts (MMoE) has gained significant popularity. However, MMoE suffers from the polarization issue during training, where the weights of certain experts tend to converge towards 0. To address this issue, we propose a novel method called Bagging-Expert network (BEnet) for multi-task learning. BEnet effectively mitigates the problem of polarization and achieves excellent performance in multi-task learning. It incorporates a bagging layer and an attention mechanism to encourage experts focusing on diverse knowledge domains. Simultaneously, polarization is avoided as different experts execute respective duties and specialize in distinct domains. Experimental results on real-world datasets demonstrate that BEnet has strong robustness and outperforms other state-of-the-art (SOTA) MTL methods.

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, School of Intelligence Science and Technology, Nanjing University, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China

Abstract: Spherical SlicedWasserstein (SSW) has recently been proposed to measure the discrepancy between spherical data distributions in various fields, such as geology, medical domains, computer vision, and deep representation learning. However, in the original SSW, all projection directions are treated equally, which is too idealistic and cannot accurately reflect the importance of different projection directions for various data distributions. To address this issue, we propose a novel data-adaptive Discriminative Spherical Sliced-Wasserstein (DSSW) distance, which utilizes a projected energy function to determine the discriminative projection direction for SSW. In our new DSSW, we introduce two types of projected energy functions to generate the weights for projection directions with complete theoretical guarantees. The first type employs a non-parametric deterministic function that transforms the projected Wasserstein distance into its corresponding weight in each projection direction. This improves the performance of the original SSW distance with negligible additional computational overhead. The second type utilizes a neural network-induced function that learns the projection direction weight through a parameterized neural network based on data projections. This further enhances the performance of the original SSW distance with less extra computational overhead. Finally, we evaluate the performance of our proposed DSSW by comparing it with several state-of-the-art methods across a variety of machine learning tasks, including gradient flows, density estimation on real earth data, and self-supervised learning.

Abstract: Multilayer Perceptron (MLP) is a simple practice of Neural Network (NN) and the cornerstone of research and development of deep learning. Each neuron is connected to all neurons in the previous layer and implements a nonlinear mapping through activation functions. MLP can learn complex non-linear relationships among features through the superposition of multiple hidden layers, but it still cannot discover the inherent strong correlation among features. The reason is that each neuron uses a simple weighted summation method to organize all the neurons in the previous layer. Inspired by quantum theory, this paper builds a non-linear NN layer that can mine strong correlations among features based on multi-body quantum systems, and then constructs a multi-layer perceptron, called Quantum-inspired MLP (QiMLP). It is conceivable that QiMLP will have important inspirational significance in reshaping machine learning, deep learning and large language models. We theoretically analyzed the basis for QiMLP to mine strong correlations among features, and implemented experiments on multiple classic deep learning datasets. Experimental results verify that QiMLP not only learns strong correlations among features, but also significantly reduces the number of parameters with hundreds of times improvement.

Abstract: Fewshot learning has emerged as an important problem on graphs to combat label scarcity, which can be approached by current trends in pre-trained graph neural networks (GNNs) and meta-learning. Recent efforts integrate both paradigms in a white-box setting, leaving the more realistic black-box setting largely underexplored, where the parameters and gradients in the pre-trained GNNs are inaccessible. In this paper, we study the critical problem: Leveraging black-box pre-trained GNNs for graph few-shot learning. Despite its appeal, two key issues hinder the unlocking of its potential: the inherent task gap between pre-training and downstream stages, which can introduce irrelevant knowledge and undermine the generalizability of a pre-trained black-box GNN on downstream tasks; and the inaccessibility of parameters and gradients, which limits the model's adaptation to novel tasks. To effectively leverage the black-box pre-trained GNNs and improve generalization, we propose a lightweight graph meta-learner to extract relevant knowledge from a black-box pre-trained GNN, meanwhile harnessing knowledge from related tasks for rapid adaptation on novel tasks. Furthermore, we prune the graph meta-learner to enhance its generalization on novel tasks. Extensive experiments on real-world datasets for few-shot node classification validate the effectiveness of our proposed method in the black-box setting.

Abstract: Pretrained model assessment for transfer learning aims to identify the optimal candidate for the downstream tasks from a model hub, without the need of time-consuming fine-tuning. Existing advanced works mainly focus on analyzing the intrinsic characteristics of the entire features extracted by each pre-trained model or how well such features fit the target labels. This paper proposes a novel perspective for pre-trained model assessment through the Distribution of Spectral Components (DISCO). Through singular value decomposition of features extracted from pre-trained models, we investigate different spectral components and observe that they possess distinct transferability, contributing diversely to the fine-tuning performance. Inspired by this, we propose an assessment method based on the distribution of spectral components which measures the proportions of their corresponding singular values. Pre-trained models with features concentrating on more transferable components are regarded as better choices for transfer learning. We further leverage the labels of downstream data to better estimate the transferability of each spectral component and derive the final assessment criterion. Our proposed method is flexible and can be applied to both classification and regression tasks. We conducted comprehensive experiments across three benchmarks and two tasks including image classification and object detection, demonstrating that our method achieves state-of-the-art performance in choosing proper pre-trained models from the model hub for transfer learning.

Abstract: Scaling Deep Neural Networks (DNNs) requires significant computational resources in terms of GPU quantity and compute capacity. In practice, there usually exists a large number of heterogeneous GPU devices due to the rapid release cycle of GPU products. It is highly needed to efficiently and economically harness the power of heterogeneous GPUs, so that it can meet the requirements of DNN research and development. The paper introduces Poplar, a distributed training system that extends Zero Redundancy Optimizer (ZeRO) with heterogeneousaware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity and a combination of them. In order to achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and a search algorithm to optimize the utilization of heterogeneous GPUs clusters. Furthermore, Poplar implements fully automated parallelism, eliminating the need for deploying heterogeneous hardware and finding suitable batch size. Extensive experiments on three heterogeneous clusters, comprising six different types of GPUs, demonstrate that Poplar achieves a training throughput improvement of 1.02-3.92x over current state-of-the-art heterogeneous training systems.

Abstract: Numerous deep learningbased works focusing on 3D semantic segmentation have been proposed and have achieved impressive performance. However, due to the catastrophic forgetting, existing methods will degrade dramatically in a real-world scenario where new 3D semantic categories are arriving continually. Straightforwardly applying typical class-incremental learning methods on 3D data even aggravates forgetting due to the irregular and noisy geometric structure. Aiming to address this realistic challenge, from the perspective of capturing local topological characteristics and mitigating global semantic shift, we propose a unified framework named Local topological Alignment and Global semantic Deconstruction (LAGD) to incrementally learn semantic knowledge of novel 3D categories while maintaining performance on previously learned knowledge. Specifically, we develop a novel Interaction Topological-aware Alignment (ITA) to maintain the learned knowledge efficiently by capturing the local geometric characteristics with interacted adjacent state-specific knowledge. Besides, to mitigate the forgetting caused by the global semantic shift, we deconstruct the logits into positive and negative parts which are distilled separately, achieving an elaborate distillation process in terms of Semantic-knowledge Deconstruction Distillation (SDD). With the cooperation of ITA and SDD, LAGD achieves a sota performance, especially in the long-term incremental learning scenario. Extensive experimental results illustrate the superiority of our proposed LAGD.

Abstract: Deep neural networks (DNNs) typically employ an endto-end (E2E) training paradigm which presents several challenges, including high GPU memory consumption, inefficiency, and difficulties in model parallelization during training. Recent research has sought to address these issues, with one promising approach being local learning. This method involves partitioning the backbone network into gradient-isolated modules and manually designing auxiliary networks to train these local modules. Existing methods often neglect the interaction of information between local modules, leading to myopic issues and a performance gap compared to E2E training. To address these limitations, we propose the Multilaminar Leap Augmented Auxiliary Network (MLAAN). Specifically, MLAAN comprises Multilaminar Local Modules (MLM) and Leap Augmented Modules (LAM). MLM captures both local and global features through independent and cascaded auxiliary networks, alleviating performance issues caused by insufficient global features. However, overly simplistic auxiliary networks can impede MLM's ability to capture global information. To address this, we further design LAM, an enhanced auxiliary network that uses the Exponential Moving Average (EMA) method to facilitate information exchange between local modules, thereby mitigating the shortsightedness resulting from inadequate interaction. The synergy between MLM and LAM has demonstrated excellent performance. Our experiments on the CIFAR-10, STL-10, SVHN, and ImageNet datasets show that MLAAN can be seamlessly integrated into existing local learning frameworks, significantly enhancing their performance and even surpassing end-to-end (E2E) training methods, while also reducing GPU memory consumption.

Abstract: Quantization stands as a pivotal technique for large language model (LLM) serving, yet it poses significant challenges particularly in achieving effective lowbit quantization. The limited numerical mapping makes the quantized model produce a non-trivial error, bringing out intolerable performance degration. This paper is anchored in the basic idea of model compression objectives, and delves into the layer-wise error distribution of LLMs during post-training quantization. Subsequently, we introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to gain smooth activation and better error compensation. ASER is capable of quantizing typical LLMs to low-bit ones, particularly preserving accuracy even in W4A8 per-channel setup. Experimental results show that ASER is competitive among the state-of-the-art quantization algorithms, showing potential to activation quantization, with minor overhead.

Abstract: Exploration in cooperative multiagent reinforcement learning (MARL) remains challenging for value-based agents due to the absence of an explicit policy. Existing approaches include individual exploration based on uncertainty towards the system and collective exploration through behavioral diversity among agents. However, the introduction of additional structures often leads to reduced training efficiency and infeasible integration of these methods. In this paper, we propose Adaptive exploration via Identity Recognition~(AIR), which consists of two adversarial components: a classifier that recognizes agent identities from their trajectories, and an action selector that adaptively adjusts the mode and degree of exploration. We theoretically prove that AIR can facilitate both individual and collective exploration during training, and experiments also demonstrate the efficiency and effectiveness of AIR across various tasks.

Abstract: Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher but with comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both very inflexible and inefficient. In this paper, we argue that a SSLpretrained model can effectively act as the teacher and its dark knowledge can be captured by the coordinate system or linear subspace where the features lie in. We then need only one forward pass of the teacher, and then tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free and applies to diverse architectures, works well for KD and practical few-shot learning, allows cross-architecture distillation with large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while only requiring roughly half of their training time and GPU memory costs.

Abstract: Transferable neural architecture search (TNAS) has been introduced to design efficient neural architectures for multiple tasks, to enhance the practical applicability of NAS in realworld scenarios. In TNAS, architectural knowledge accumulated in previous search processes is reused to warm up the architecture search for new tasks. However, existing TNAS methods still search in an extensive search space, necessitating the evaluation of numerous architectures. To overcome this challenge, this work proposes a novel transfer paradigm, i.e., design principle transfer. In this work, the linguistic description of various structural components' effects on architectural performance is termed design principles. They are learned from established architectures and then can be reused to reduce the search space tasks by discarding unpromising architectures. Searching in the refined search space can boost both the search performance and efficiency for new NAS tasks. To this end, a large language model (LLM)-assisted design principle transfer (LAPT) framework is devised. In LAPT, LLM is applied to automatically reason the design principles from a set of given architectures, and then a principle adaptation method is applied to refine these principles progressively based on the search results. Experimental results demonstrate that LAPT can beat the state-of-the-art TNAS methods on most tasks and achieve comparable performance on the remainder.

Abstract: Decision trees are widely used in machine learning due to their simplicity and interpretability, but they often lack robustness to adversarial attacks and data perturbations. The paper proposes a novel islandbased coevolutionary algorithm (ICoEvoRDF) for constructing robust decision tree ensembles. The algorithm operates on multiple islands, each containing populations of decision trees and adversarial perturbations. The populations on each island evolve independently, with periodic migration of top-performing decision trees between islands. This approach fosters diversity and enhances the exploration of the solution space, leading to more robust and accurate decision tree ensembles. ICoEvoRDF utilizes a popular game theory concept of mixed Nash equilibrium for ensemble weighting, which further leads to improvement in results. ICoEvoRDF is evaluated on 20 benchmark datasets, demonstrating its superior performance compared to state-of-the-art methods in optimizing both adversarial accuracy and minimax regret. The flexibility of ICoEvoRDF allows for the integration of decision trees from various existing methods, providing a unified framework for combining diverse solutions. Our approach offers a promising direction for developing robust and interpretable machine learning models.

Abstract: Quantitative requirements play an important role in the context of multiagent systems, where there is often a trade-off between the tasks of individual agents and the constraints that the agents must jointly adhere to. We study multi-agent systems whose requirements are formally specified in the quantitative temporal logic LTL[F] as a combination of local task specifications for the individual agents and a shared safety constraint, The intricate dependencies between the individual agents entailed by their local and shared objectives make the design of multi-agent systems error-prone, and their verification time-consuming. In this paper we address this problem by proposing a novel notion of quantitative assume-guarantee contracts, that enables the compositional design and verification of multi-agent systems with quantitative temporal specifications. The crux of these contracts lies in their ability to capture the coordination between the individual agents to achieve an optimal value of the overall specification under any possible behavior of the external environment. We show that the proposed framework improves the scalability and modularity of formal verification of multi-agent systems against quantitative temporal specifications.

School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, School of Computing and Artificial Intelligence, Southwest Jiaotong University Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education

Abstract: Multiagent collaborative perception is expected to significantly improve perception performance by overcoming the limitations of single-agent perception through exchanging complementary information. However, training a robust collaborative perception model requires collecting sufficient training data that covers all possible collaboration scenarios, which is impractical due to intolerable deployment costs. Hence, the trained model is not robust against new traffic scenarios with inconsistent data distribution and fundamentally restricts its real-world applicability. Further, existing methods, such as domain adaptation, have mitigated this issue by exposing the deployment data during the training stage but incur a high training cost, which is infeasible for resource-constrained agents. In this paper, we propose a Parameter-Efficient Fine-Tuning-based lightweight framework, CoPEFT, for fast adapting a trained collaborative perception model to new deployment environments under low-cost conditions. CoPEFT develops a Collaboration Adapter and Agent Prompt to perform macro-level and micro-level adaptations separately. Specifically, the Collaboration Adapter utilizes the inherent knowledge from training data and limited deployment data to adapt the feature map to new data distribution. The Agent Prompt further enhances the Collaboration Adapter by inserting fine-grained contextual information about the environment. Extensive experiments demonstrate that our CoPEFT surpasses existing methods with less than 1\% trainable parameters, proving the effectiveness and efficiency of our proposed method.

Abstract: MultiAgent Motion Planning (MAMP) finds various applications in fields such as traffic management, airport operations, and warehouse automation. In many of these environments, differential drive robots are commonly used. These robots have a kinodynamic model that allows only in-place rotation and movement along their current orientation, subject to speed and acceleration limits. However, existing Multi-Agent Path Finding (MAPF)-based methods often use simplified models for robot kinodynamics, which limits their practicality and realism. In this paper, we introduce a three-level framework called MASS to address these challenges. MASS combines MAPF-based methods with our proposed stationary state search planner to generate high-quality kinodynamically-feasible plans. We further extend MASS using an adaptive window mechanism to address the lifelong MAMP problem. Empirically, we tested our methods on the single-shot grid map domain and the lifelong warehouse domain. Our method shows up to 400% improvements in terms of throughput compared to existing methods.

Cyberspace Institute of Advanced Technology, Guangzhou University, China Huangpu Research School of Guangzhou University, China, CCSE, Beihang University, China, University of the Chinese Academy of Sciences, China, CCSE, Beihang University, China, CCSE, Beihang University, China, Cyberspace Institute of Advanced Technology, Guangzhou University, China Huangpu Research School of Guangzhou University, China, Cyberspace Institute of Advanced Technology, Guangzhou University, China Huangpu Research School of Guangzhou University, China

Abstract: Large visionlanguage models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM's word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM's word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.

Abstract: Instruction tuning has emerged as an effective approach that notably improves large language models (LLMs) performance, showing particular promise in natural language generation tasks by producing more diverse, coherent, and taskrelevant outputs. However, extending instruction tuning to natural language understanding (NLU) tasks presents significant challenges, primarily due to the difficulty in achieving high-precision responses and the scarcity of large-scale, high-quality instruction data necessary for effective tuning. In this work, we introduce Adversarial Noisy Instruction Tuning (ANIT) to improve NLU performance on LLMs. First, we leverage low-resource techniques to construct noisy instruction datasets. Second, we employ semantic distortion-aware techniques to quantify the intensity of noise within these instructions. Last, we devise an adversarial training method that incorporates a noise response strategy to achieve noisy instruction tuning. ANIT enhances LLMs capability to detect and accommodate semantic distortions in noisy instructions, thereby augmenting their comprehension of task objectives and ability to generate more accurate responses. We evaluate our approach across diverse noisy instructions and semantic distortion quantification methods on multiple NLU tasks. Comprehensive empirical results demonstrate that our method consistently outperforms existing approaches across various experimental settings.

Abstract: The unwavering disparity in labeled resources between resourcerich languages and those considered low-resource remains a significant impediment for Large Language Models (LLMs). Recent strides in cross-lingual in-context learning (X-ICL), mainly through semantically aligned examples retrieved from multilingual pre-trained transformers, have shown promise in mitigating this issue. However, our investigation reveals that LLMs intrinsically reward in-language semantically aligned cross-lingual instances over direct cross-lingual semantic alignments, with a pronounced disparity in handling time–sensitive queries in the X-ICL setup. Such queries demand sound temporal reasoning ability from LLMs, yet the advancements have predominantly focused on English. This study aims to bridge this gap by improving temporal reasoning capabilities in low-resource languages. To this end, we introduce mTEMPREASON, a temporal reasoning dataset aimed at the varied degrees of low-resource languages and propose Cross-Lingual Time-Sensitive Semantic Alignment (CLiTSSA), a novel method to improve temporal reasoning in these contexts. To facilitate this, we construct an extension of mTEMPREASON comprising pairs of parallel cross–language temporal queries along with their anticipated in-language semantic similarity scores. Our empirical evidence underscores the superior performance of CLiTSSA compared to established baselines across three languages -- Romanian, German, and French, encompassing three temporal tasks and including a diverse set of four contemporaneous LLMs. This marks a significant step forward in addressing resource disparity in the context of temporal reasoning across languages.

Abstract: The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preferencebased human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation facilitates intermediate reward modeling and direct reward modeling on each individual demonstration, which enhances the utilization of training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.

Department of Computer Science and Engineering, Shanghai Jiao Tong University Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3, Department of Computer Science and Engineering, Shanghai Jiao Tong University Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3, Department of Computer Science and Engineering, Shanghai Jiao Tong University Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3

Abstract: Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. However, recent researches reveal safetyaligned LLMs tend to reject benign queries due to the exaggerated safety issue, limiting their helpfulness. In this paper, we propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns in aligned LLMs. First, SCANS extracts the refusal steering vectors within the activation space and utilizes vocabulary projection to anchor some specific safety-critical layers which influence model refusal behavior. Second, by tracking the hidden state transition, SCANS identifies the steering direction and steers the model behavior accordingly, achieving a balance between exaggerated safety and adequate safety. Experiments show that SCANS achieves new state-of-the-art performance on XSTest and OKTest benchmarks, without impairing their defense capability against harmful queries and maintaining almost unchanged model capability.

Fudan University Stanford University, South China University of Technology NUS (Chongqing) Research Institute, University of California, San Diego, University of Illinois at Urbana-Champaign, Wuhan University Fenz AI, Carnegie Mellon University, The Hong Kong Polytechnic University, Wuhan University, Fudan University South China University of Technology, University of California, San Diego, Georgia Institute of Technology, University of California, San Diego, Fudan University, Wuhan University

Abstract: Large Language Models (LLMs) have revolutionized text generation, making detecting machinegenerated text increasingly challenging. Although past methods have achieved good performance on detecting pure machine-generated text, those detectors have poor performance on distinguishing machine-revised text (rewriting, expansion, and polishing), which can have only minor changes from its original human prompt. As the content of text may originate from human prompts, detecting machine-revised text often involves identifying distinctive machine styles, e.g., worded favored by LLMs. However, existing methods struggle to detect machine-style phrasing hidden within the content contributed by humans. We propose the “Imitate Before Detect” (ImBD) approach, which first imitates the machine-style token distribution, and then compares the distribution of the text to be tested with the machine-style distribution to determine whether the text has been machine-revised. To this end, we introduce Style Preference Optimization (SPO), which aligns a scoring LLM model to the preference of text styles generated by machines. The aligned scoring model is then used to calculate the style-conditional probability curvature (Style-CPC), quantifying the log probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across various scenarios, encompassing text revisions by six LLMs, four distinct text domains, and three machine revision types. Compared to existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs, and improves performance by 5% and 19% for detecting GPT-3.5 and GPT-4o revised text, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce singleround instance-level prompt optimization, referred to as question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks.

Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the MultiLevel Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.

Abstract: Representation Misdirection for Unlearning (RMU), which steers model representation in the intermediate layer to a target random representation, is an effective method for large language model (LLM) unlearning. Despite its high performance, the underlying cause and explanation remain underexplored. In this paper, we theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsense responses. We investigate how the coefficient influences the alignment of forgetsample representations with the random direction and hint at the optimal coefficient values for effective unlearning across different network layers. We show that RMU unlearned models are robust against adversarial jailbreak attacks. Furthermore, our empirical analysis shows that RMU is less effective when applied to the middle and later layers in LLMs. To resolve this drawback, we propose Adaptive RMU---a simple yet effective alternative method that makes unlearning effective with most layers. Extensive experiments demonstrate that Adaptive RMU significantly improves the unlearning performance compared to prior art while incurring no additional computational cost.

Abstract: RetrievalAugmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents. However, the misinformation in external documents may mislead LLMs' generation. To address this issue, we explore the task of "credibility-aware RAG", in which LLMs automatically adjust the influence of retrieved documents based on their credibility scores to counteract misinformation. To this end, we introduce a plug-and-play method named Credibility-aware Attention Modification (CrAM). CrAM identifies influential attention heads in LLMs and adjusts their attention weights based on the credibility of the documents, thereby reducing the impact of low-credibility documents. Experiments on Natual Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen1.5-7B show that CrAM improves the RAG performance of LLMs against misinformation pollution by over 20%, even surpassing supervised fine-tuning methods.

Abstract: Alignment, endowing a pretrained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets can not be guaranteed due to the high cost and intensive labor for the creation and maintenance in practice. To overcome the limitations associated with the quality of SFT datasets, we introduce a novel preference-oriented supervised fine-tuning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: favoring the target model over aligned LLMs on the same SFT data. This preference encourages the target model to predict a higher likelihood than that predicted by the aligned LLMs, incorporating assessment information on data quality (i.e., predicted likelihood by the aligned LLMs) into the training process. Extensive experiments are conducted, and the results validate the effectiveness of the proposed method. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models. Moreover, we prove that PoFT can be integrated with existing SFT data filtering methods to achieve better performance, and further improved by following preference optimization procedures, such as DPO.

Abstract: Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods typically rely on finetuning or adversarial training to correct errors, which have achieved significant success. However, these methods exhibit poor generalization performance due to the difference in data distribution between training data and real-world scenarios, known as the exposure bias problem. In this paper, we propose a self-correct adversarial training framework for learning from mistakes (LIMIT), which is a task- and model-independent framework to correct unnatural errors or mistakes. Specifically, we fully utilize errors generated by the model that are actively exposed during the inference phase, i.e., predictions that are inconsistent with the target. This training method not only simulates potential errors in real application scenarios, but also mitigates the exposure bias of the traditional training process. Meanwhile, we design a novel decoding intervention strategy to maintain semantic consistency. Extensive experimental results on Chinese unnatural text error correction datasets show that our proposed method can correct multiple forms of errors and outperforms the state-of-the-art text correction methods. In addition, extensive results on Chinese and English datasets validate that LIMIT can serve as a plug-and-play defense module and can extend to new models and datasets without further training.

Abstract: Supervised finetuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model’s performance on primary tasks.

Abstract: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a LanguageAgnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

Abstract: Denoising diffusion probabilistic models (DDPMs) have gained popularity in devising neural vocoders and obtained outstanding performance. However, existing DDPMbased neural vocoders struggle to handle the prosody diversities due to their susceptibility to mode-collapse issues confronted with imbalanced data. We introduced Cauchy Diffusion, a model incorporating the Cauchy noises to address this challenge. The heavy-tailed Cauchy distribution exhibits better resilience to imbalanced speech data, potentially improving prosody modeling. Our experiments on the LJSpeech and VCTK datasets demonstrate that Cauchy Diffusion achieved state-of-the-art speech synthesis performance. Compared to existing neural vocoders, our Cauchy Diffusion notably improved speech diversity while maintaining superior speech quality. Remarkably, Cauchy Diffusion surpassed neural vocoders based on generative adversarial networks (GANs) that are explicitly optimized to improve diversity.

The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, National Science Library, Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China Shanghai Artificial Intelligence Laboratory, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: In this paper, we propose NeuralSymbolic Collaborative Distillation (NesyCD), a novel knowledge distillation method for learning the complex reasoning abilities of Large Language Models (LLMs, e.g., \textgreater 13B). We argue that complex reasoning tasks are difficult for Small Language Models (SLMs, e.g., $\leq$ 7B), as these tasks demand not only general cognitive abilities but also specialized knowledge, which is often sparse and difficult for these neural-based SLMs to effectively capture. Therefore, NesyCD distills the general capabilities and specialized knowledge in LLMs using different manners.On the one hand, we distill only general abilities from teacher LLMs into the student SLMs of parameterized neural networks. On the other hand, for the specialized abilities and uncommon knowledge of a complex reasoning task, we employ a symbolic knowledge distillation approach to obtain and store the specialized knowledge within a symbolic knowledge base (KB).By decoupling general and specialized capabilities, the proposed NesyCD can achieve superior performance cost-effectively, utilizing smaller models and blending parameterized neural networks with symbolic KB. Moreover, the specialized KB generalizes well and is comprehended and manipulated by humans.Our experiments show that NesyCD significantly boosts SLMs' complex reasoning performance on in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets. Notably, our approach enabled the LLaMA3-8B and Qwen2-7B to surpass GPT-3.5-turbo in performance and come close to matching LLaMA3-70B, despite the latter having nine times more parameters.

Abstract: Pretrained language models enhanced parsers have achieved outstanding performance in rich-resource languages. Cross-lingual dependency parsing aims to learn useful knowledge from high-resource languages to alleviate data scarcity in low-resource languages. However, effectively reducing the syntactic structure distributional bias and excavating the commonalities among languages is the key challenge for cross-lingual dependency parsing. To address this issue, we propose novel dynamic syntactic feature filtering and injecting networks based on the typical shared-private model that employs one shared and two private encoders to separate source and target language features. Concretely, a Language-Specific Filtering Network (LSFN) on private encoders emphasizes helpful information and ignores the irrelevant or harmful parts of it from the source language. Meanwhile, a Language-Invariant Injecting Network (LIIN) on the shared encoder integrates the advantages of BiLSTM and improved Transformer encoders to transcend language boundaries, thus amplifying syntactic commonalities across languages. Experiments on seven benchmark datasets show that our model achieves an average absolute gain of 1.84 UAS and 3.43 LAS compared with the shared-private model. Comparative experiments validate that both LSFN and LIIN components are complementary in transferring beneficial knowledge from source to target languages. Detailed analyses highlight that our model can effectively capture linguistic commonalities and mitigate the effect of distributional bias, showcasing its robustness and efficacy.

Abstract: Clinical diagnosis prediction models, when provided with a patient's medical history, aim to detect potential diseases early, facilitating timely intervention and improving prognostic outcomes. However, the inherent scarcity of patient data and large disease candidate space often pose challenges in developing satisfactory models for this intricate task. The exploration of leveraging Large Language Models (LLMs) for encapsulating clinical decision processes has been limited. We introduce MERA, a clinical diagnosis prediction model that bridges pertaining natural language knowledge with medical practice. We apply hierarchical contrastive learning on a disease candidate ranking list to alleviate the large decision space issue. With concept memorization through finetuning, we bridge the natural language clinical knowledge with medical codes. Experimental results on MIMIC-III and IV datasets show that MERA achieves the state-of-the-art diagnosis prediction performance and dramatically elevates the diagnosis prediction capabilities of generative LMs.

Abstract: Augmenting large language models (LLMs) with tools significantly enhances their problemsolving potential across multifaceted tasks. However, current tools automatically created by LLMs often serve as a mere summary of specific problems or solutions, which face two main issues: 1) Low reusability: The tools are overly problem-specific and struggle to handle new problems. 2) Limited diversity: The toolsets are too narrow, limiting their application to address a broader range of different problems. In this paper, we propose the Knowledge-grounded Tool Creation with Evolution (KTCE) framework, which aims to craft reusable and comprehensive toolsets for LLMs in a two-stage process. In the first stage (Knowledge-based Tool Creation), we conceptualize tools as a form of executable domain knowledge and propose a problem-knowledge-tool paradigm. Specifically, we leverage LLMs to abstract "knowledge" from "problems" and create a three-layer knowledge tree of topics, concepts, and key points. This hierarchical structure serves as a foundation for inducing atomic "tools" from "knowledge", grounding them in fundamental concepts and enhancing their usability. In the second stage (Tool Evolutionary Search), we evolve the toolsets through several actions including tool selection, mutation, and crossover. This stage mimics the biological evolution process, aiding toolsets in discovering new tools or updating existing ones, thereby increasing the diversity of the toolset. Experiments on challenging mathematical/tabular/scientific reasoning tasks demonstrate that our approach achieves substantial accuracy improvements ranging from 6.23% to 18.49% on average. Moreover, in-depth analyses reveal the superior characteristics of our toolkit, including high reusability, high diversity, and high generalizability on cross-data/LLM performance with low complexity.

Abstract: With the outstanding capabilities of Large Language Models (LLMs), solving math word problems (MWP) has greatly progressed, achieving higher performance on several benchmark datasets. However, it is more challenging to solve plane geometry problems (PGPs) due to the necessity of understanding, reasoning and computation on two modality data including both geometry diagrams and textual questions, where MultiModal Large Language Models (MLLMs) have not been extensively explored. Previous works simply regarded a plane geometry problem as multi-modal QA task, which ignored the importance of explicit parsing geometric elements from problems. To tackle this limitation, we propose to solve plane Geometry problems by Neural-Symbolic reasoning with MLLMs (GNS). We first leverage an MLLM to understand PGPs through knowledge prediction and symbolic parsing, next perform mathematical reasoning to obtain solutions, last adopt a symbolic solver to compute answers. Correspondingly, we introduce the largest PGPs dataset GNS-260K with multiple annotations including symbolic parsing, understanding, reasoning and computation. In experiments, our Phi3-Vision-based MLLM wins the first place on the PGPs solving task of MathVista benchmark, outperforming GPT-4o, Gemini Ultra and other much larger MLLMs. While LLaVA-13B-based MLLM markedly exceeded other close-source and open-source MLLMs on the MathVerse benchmark and also achieved the new SOTA on GeoQA dataset.

Abstract: AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous highlevel actions, existing embodied simulators offer a limited set of domain-salient APIs. This naturally brings up the questions: how many primitive actions (APIs) are needed for a versatile embodied agent, and how should they look like? We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instruction to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we propose a few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricate new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their excitability. We apply the proposed pipeline on instructions from wiki- How tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review revealed that existing simulators support only a small subset of the induced APIs (9 of the top 50 frequent APIs), motivating the development of action-rich embodied environments.

Abstract: Zeroshot multi-intent detection is capable of capturing multiple intents within a single utterance without any training data, which gains increasing attention. Building on the success of large language models (LLM), dominant approaches in the literature explore prompting techniques to enable zero-shot multi-intent detection. While significant advancements have been witnessed, the existing prompting approaches still face two major issues: lacking explicit reasoning and lacking interpretability. Therefore, in this paper, we introduce a Divide-Solve-Combine Prompting (DSCP) to address the above issues. Specifically, DSCP explicitly decomposes multi-intent detection into three components including (1) single-intent division prompting is utilized to decompose an input query into distinct sub-sentences, each containing a single intent; (2) intent-by-intent solution prompting is applied to solve each sub-sentence recurrently; and (3) multi-intent combination prompting is employed for combining each sub-sentence result to obtain the final multi-intent result. By decomposition, DSCP allows the model to track the explicit reasoning process and improve the interpretability. In addition, we propose an interactive divide-solve-combine prompting (Inter-DSCP) to naturally capture the interaction capabilities of large language models. Experimental results on two standard multi-intent benchmarks (i.e., MixATIS and MixSNIPS) reveal that both DSCP and Inter-DSCP obtain substantial improvements over baselines, achieving superior performance and higher interpretability.

School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security Wuhan JinYinHu Laboratory

Abstract: Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ comprise half of the opensource Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies Structure-level Naturalization to decompose complex programs, followed by Token-level Naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results demonstrate that CLNX substantially improves the ability of LLMs to detect C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art performance and identifies 38 OSS vulnerabilities in the real world.

Abstract: Constructing highquality query-response pairs from custom corpora is crucial for supervised fine-tuning (SFT) large language models (LLMs) in many applications, like creating domain-specific AI assistants or roleplaying agents. However, sourcing this data through human annotation is costly, and existing automated methods often fail to capture the diverse range of contextual granularity and tend to produce homogeneous data. To tackle these issues, we introduce a novel method named AUGCON, capable of automatically generating context-driven SFT data across multiple levels of granularity with high diversity, quality and fidelity. AUGCON begins by generating queries using the Context-SplitTree (CST), an innovative approach for recursively deriving queries and splitting context to cover full granularity. Then, we train a scorer through contrastive learning to collaborate with CST to rank and refine queries. Finally, a synergistic integration of self-alignment and self-improving is introduced to obtain high-fidelity responses. Extensive experiments are conducted incorporating both automatic and human evaluations, encompassing four widely-used benchmarks and a test scenario in English and Chinese. The results highlight the significant advantages of AUGCON in producing high diversity, quality, and fidelity SFT data against several state-of-the-art methods.

Abstract: Recent advancements in question generation (QG) have been significantly propelled by reinforcement learning (RL). Although extensive reward models have been designed to capture the attributes of ideal questions, their associated learning challenges, particularly in sample efficiency and diversity, remain underexplored. This paper introduces a bilevel policy decomposition (BPD) framework and a diversityseeking RL (DSRL) objective to address these issues. The BPD framework utilizes two cascading policies to divide QG into two more manageable sub-tasks: answer-centric summary generation and summary-augmented QG, facilitating exploration and accelerating policy learning. Concurrently, the DSRL objective preserves the inherent diversity of QG by ensuring the bilevel policies align probabilistically with their reward models rather than merely maximizing returns. Our integrated approach, named BPD-DSRL, demonstrates superior performance over existing baselines on multiple question quality and diversity metrics across various QG benchmarks.

Abstract: Facebased Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the speaker's voice identity information, and (2) inadequacy in decoupling content and speaker identity information from the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations. More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. Besides, unlike prior works, our method can accept either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity.

Abstract: The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or finetune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a "min-bounded" average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.

Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Dept. Electrical and Computer Engineering & Ingenuity Labs Research Institute, Queen’s University, Quebec Artificial Intelligence Institute (Mila), McGill University Canada CIFAR AI Chair, Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta Canada CIFAR AI Chair

Abstract: We address unsupervised dependency parsing by building an ensemble of diverse existing models through post hoc aggregation of their output dependency parse structures. We observe that these ensembles often suffer from low robustness against weak ensemble components due to error accumulation. To tackle this problem, we propose an efficient ensembleselection approach that considers error diversity and avoids error accumulation. Results demonstrate that our approach outperforms each individual model as well as previous ensemble techniques. Additionally, our experiments show that the proposed ensemble-selection method significantly enhances the performance and robustness of our ensemble, surpassing previously proposed strategies, which have not accounted for error diversity.

School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China, State Key Laboratory of Intelligent Game, Institute of Software, Chinese Academy of Sciences, Beijing, China, School of Atmospheric Sciences, Sun Yat-sen University, Zhuhai, China, School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China Pazhou Lab, Guangzhou, China, School of Electronics and Communication Engineering, Sun Yat-sen University, Guangzhou, China, School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China

Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentenceimage pairs data. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utilizing human-designed type queries, struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, following the one-by-one decoding order, suffers from exposure bias issues. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction, guiding models to establish appropriate inter-entity relationships from a optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to boost better alignment of two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performances in widely used benchmarks.

School of Informatics, Xiamen University, China Shanghai Artificial Intelligence Laboratory, China, Tencent AI Lab, Bellevue, WA, Tencent AI Lab, Bellevue, WA, Tencent AI Lab, Bellevue, WA, Tencent AI Lab, Bellevue, WA, Tencent AI Lab, Bellevue, WA, School of Informatics, Xiamen University, China Shanghai Artificial Intelligence Laboratory, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Tencent AI Lab, Bellevue, WA

Abstract: Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with a goaldirected heuristic function and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K, TabMWP, and MATH datasets demonstrate that our method not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.

Abstract: Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior researches have primarily focused on enhancing logical reasoning capabilities through Chainof-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Addtionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model's capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning path between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0).

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China Yunnan Key Laboratory of Artificial Intelligence, Kunming, China

Abstract: With the rapid advancement of large language models (LLMs), discrete speech representations have become crucial for integrating speech into LLMs. Existing methods for speech representation discretization rely on a predefined codebook size and Euclidean distancebased quantization. However, 1) the size of codebook is a critical parameter that affects both codec performance and downstream task training efficiency. 2) The Euclidean distance-based quantization may lead to audio distortion when the size of the codebook is controlled within a reasonable range. In fact, in the field of information compression, structural information and entropy guidance are crucial, but previous methods have largely overlooked these factors. Therefore, we address the above issues from an information-theoretic perspective, we present SECodec, a novel speech representation codec based on structural entropy (SE) for building speech language models. Specifically, we first model speech as a graph, clustering the speech features nodes within the graph and extracting the corresponding codebook by hierarchically and disentangledly minimizing 2D SE. Then, to address the issue of audio distortion, we propose a new quantization method. This method still adheres to the 2D SE minimization principle, adaptively selecting the most suitable token corresponding to the cluster for each incoming original speech node. Furthermore, we develop a Structural Entropy-based Speech Language Model (SESLM) that leverages SECodec. Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks.

Abstract: Openset speaker recognition is to identify whether the voices are from the same speaker. One challenge of speaker recognition is collecting large amounts of high-quality data. Based on the promising results of image classification, one intuitively feasible solution is semi-supervised learning (SSL) which uses confidence thresholds to assign pseudo labels for unlabeled data. However, we empirically demonstrated that applying SSL methods to speaker recognition is non-trivial. These methods focus solely on inter-class discrepancy as thresholds to select pseudo labels, overlooking intra-class compactness, which is particularly important for open-set speaker recognition tasks. Motivated by this, we propose Int*-Match, a semi-supervised speaker recognition method selecting reliable pseudo labels with intra-class compactness and inter-class discrepancy for speaker recognition. In particular, we use the inter-class discrepancy of labeled data as the threshold for pseudo-label selection and adjust the threshold based on the intra-class compactness of the pseudo labels dynamically and adaptively. Our systematic experiments demonstrate the superiority of Int*-Match, presenting an outstanding Equal Error Rate (EER) of 1.00% on the VoxCeleb1 original test set, which is merely 0.06% below the performance achieved by fully supervised learning.

Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, Institute of Computing Technology, Chinese Academy of Sciences, School of Artificial Intelligence, Beihang University, China, Beijing Academy of Blockchain and Edge Computing, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China

Abstract: In a realworld RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe user's information needs. However, traditional context-based rewriting has minimal enhancement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but this sparse rewards provide little guidance in most cases, leading to unstable training and generation results.We find that user's needs are also reflected in the gold documents, retrieved documents and ground-truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use PPO algorithm to explore the optimal query rewriting strategy.Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.

Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing School of Artificial Intelligence, Beihang University, China, School of Artificial Intelligence, Beihang University, China

Abstract: Federated learning is susceptible to model poisoning attacks, especially those meticulously crafted for servers. Traditional defense methods mainly focus on updating assessments or robust aggregation against manually crafted myopic attacks. When facing advanced attacks, their defense stability is notably insufficient. Therefore, it is imperative to develop adaptive defenses against such advanced poisoning attacks. We find that benign clients exhibit significantly higher data distribution stability than malicious clients in federated learning in both CV and NLP tasks. Therefore, the malicious clients can be recognized by observing the stability of their data distribution. In this paper, we propose AdaAggRL, an RLbased Adaptive Aggregation method, to defend against sophisticated poisoning attacks. Specifically, we first utilize distribution learning to simulate the clients' data distributions. Then, we use maximum mean discrepancy (MMD) to calculate the pairwise similarity of the current local model data distribution, its historical data distribution, and global model data distribution. Finally, we use policy learning to adaptively determine the aggregation weights based on the above similarities. Experiments on four real-world datasets demonstrate that the proposed defense model significantly outperforms widely adopted defense models for sophisticated attacks.

Abstract: The ability of zeroshot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but all components are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations, as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-distilled model can even outperform EBBS as it learns from the ensemble knowledge.

Abstract: LLMpowered personalization agent systems employ Large Language Models (LLMs) to predict users’ behavior from their past activities. However, their effectiveness often hinges on the ability to effectively leverage extensive, long user historical data due to its inherent noise and length of such data. Existing pre-trained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% and achieving an up to 84.59% win rate on Factuality, iveness, and Readability. RLPF also achieves a remarkable 74% reduction while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.

Abstract: Selfcorrection is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3cMath, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We proposed a method, which employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.

The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China Shenzhen Research Institute of Big Data, Tencent AI Lab, Tencent AI Lab, Tencent AI Lab, Tencent AI Lab, Tsinghua University, National Key Laboratory of Novel Software Technology, Nanjing University, X-LANCE Lab, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China Shenzhen Research Institute of Big Data

Abstract: The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although stateof-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research about partial adjustments or editing of existing songs is still underexplored, which allows for more flexible and effective production. In this paper, we present SongEditor, the first song editing paradigm that introduces the editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as synthesizing songs from scratch. The core components of SongEditor include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling generating an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that the proposed SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics.

Abstract: Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, as such steps would evoke LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated to form a comprehensive understanding, in a way that does not necessarily follow a stepby-step fashion. To verify the validity of this argument, we introduce a new prompting framework (called SarcasmCue) containing four sub-methods, viz. chain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC) and tensor of cues (ToC), which elicits LLMs to detect human sarcasm by considering sequential and non-sequential prompting methods. Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC show superior performance with more advanced models like GPT-4 and Claude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) Our proposed framework consistently pushes the state-of-the-art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across four datasets. This demonstrates the effectiveness and stability of the proposed framework.

Abstract: The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the fine-tuning process, leading to a compromised alignment state. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings indicate that safety-critical neurons exhibit significant regional variations after fine-tuning, which can be effectively corrected through neuron transplantation from the reference model without the need for additional training.

Abstract: Conversational Emotion Recognition (CER) has recently been explored through conversational context modeling to learn the emotion distribution, i.e., the likelihood over emotion categories associated with each utterance. While these methods have shown promising results in emotion classification, they often focus on the interactions between utterances (utteranceview) and overlook shifts in the speaker's emotions (emotion-view). This emphasis on homogeneous view modeling limits their overall effectiveness. To address this limitation, we propose DVL-CER, a novel Dual-View Learning approach for CER. DVL-CER integrates both the utterance-view and emotion-view using two projection heads, enabling cross-view projection of emotion distributions. Our approach offers several key advantages: (1) We introduce an emotion-view that captures shifts in a speaker's emotions from initial to subsequent states within a conversation. This view enriches the conversation modeling and supports seamless integration with various CER baseline models. (2) Our dual-view projection learning strategy ﬂexibly balances consistency and independence between the two heterogeneous views, promoting view-specific adaptation learning and incorporating the emotion verification capability within CER. We validate DVL-CER through extensive experiments on two widely-used datasets, IEMOCAP and EmoryNLP. The results demonstrate that DVL-CER achieves state-of-the-art performance, delivering robust and high-quality emotion distributions compared with existing CER methods and other dual-view learning strategies.

School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China, Department of Informatics, King’s College London, UK, School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China, Department of Informatics, King’s College London, UK The Alan Turing Institute, UK, School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China

Abstract: Despite the notable advancements of existing prompting methods, such as InContext Learning and Chain-of-Thought for Large Language Models (LLMs), they still face challenges related to various biases. Traditional debiasing methods primarily focus on the model training stage, including approaches based on data augmentation and reweighting, yet they struggle with the complex biases inherent in LLMs. To address such limitations, the causal relationship behind the prompting methods is uncovered using a structural causal model, and a novel causal prompting method based on front-door adjustment is proposed to effectively mitigate LLMs biases. In specific, causal intervention is achieved by designing the prompts without accessing the parameters and logits of LLMs. The chain-of-thought generated by LLM is employed as the mediator variable and the causal effect between input prompts and output answers is calculated through front-door adjustment to mitigate model biases. Moreover, to accurately represent the chain-of-thoughts and estimate the causal effects, contrastive learning is used to fine-tune the encoder of chain-of-thought by aligning its space with that of the LLM. Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets on both open-source and closed-source LLMs.

Abstract: Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the laborintensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment. WARNING: This paper contains examples of potentially harmful text.

Abstract: Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the orientation of the samples' representations while ignoring the features of their module lengths. To address this issue, we propose a new training objective that optimizes the training of unsupervised contrastive learning by constraining the module length features between positive samples. We combine the training objective of Tensor's Norm Constraints with ensemble learning to propose a new Sentence Embedding representation framework, TNCSE. We evaluate seven semantic text similarity tasks, and the results show that TNCSE and derived models are the current stateof-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.

Abstract: A reproducibility crisis has been reported in science, but the extent to which it affects AI research is not yet fully understood. Therefore, we performed a systematic replication study including 30 highly cited AI studies relying on original materials when available. In the end, eight articles were rejected because they required access to data or hardware that was practically impossible to acquire as part of the project. Six articles were successfully reproduced, while five were partially reproduced. In total, 50% of the articles included was reproduced to some extent. The availability of code and data correlate strongly with reproducibility, as 86% of articles that shared code and data were fully or partly reproduced, while this was true for 33% of articles that shared only data. The quality of the data documentation correlates with successful replication. Poorly documented or missspecified data will probably result in unsuccessful replication. Surprisingly, the quality of the code documentation does not correlate with successful replication. Whether the code is poorly documented, partially missing, or not versioned is not important for successful replication, as long as the code is shared. This study emphasizes the effectiveness of open science and the importance of properly documenting data work.

Wuhan National Laboratory For Optoelectronics, Huazhong University of Science and Technology Ping An Technology (Shenzhen) Co., Ltd, Wuhan National Laboratory For Optoelectronics, Huazhong University of Science and Technology, Ping An Technology (Shenzhen) Co., Ltd, Wuhan National Laboratory For Optoelectronics, Huazhong University of Science and Technology, Wuhan National Laboratory For Optoelectronics, Huazhong University of Science and Technology, Wuhan National Laboratory For Optoelectronics, Huazhong University of Science and Technology, Ping An Technology (Shenzhen) Co., Ltd, Ping An Technology (Shenzhen) Co., Ltd

Abstract: Enabling object detectors to recognize outof-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.

Abstract: We consider infinitehorizon Markov Decision Processes where parameters, such as transition probabilities, are unknown and estimated from data. The popular distributionally robust approach to addressing the parameter uncertainty can sometimes be overly conservative. In this paper, we utilize the recently proposed formulation, Bayesian risk Markov Decision Process (BR-MDP), to address parameter (or epistemic) uncertainty in MDPs. To solve the infinite-horizon BR-MDP with a class of convex risk measures, we propose a computationally efficient approach called approximate bilevel difference convex programming (ABDCP). The optimization is performed offline and produces the optimal policy that is represented as a finite state controller with desirable performance guarantees. We also demonstrate the empirical performance of the BR-MDP formulation and the proposed algorithm.

Abstract: Despite significant recent advances in the field of Deep Reinforcement Learning (DRL), such methods typically incur high cost of training to learn effective policies, thus posing cost and safety challenges in many practical applications. To improve the learning efficiency of (D)RL methods, transfer learning (TL) has emerged as a promising approach to leverage prior experience on a source domain to speed learning on a new, but related, target domain. In this paper, we take a novel modelinformed approach to TL in DRL by assuming that we have knowledge of both the source and target domain models (which would be the case in the prevalent setting of DRL with simulators). While directly solving either the source or target MDP via solution methods like value iteration is computationally prohibitive, we exploit the fact that if the target and source MDPs differ only due to a small structural change in their rewards, we can apply structured value iteration methods in a procedure we term ModelDiff to solve the much smaller target-source ``Diff'' MDP for a reasonable horizon. This ModelDiff approach can then be integrated into extensions of standard DRL algorithms like ModelDiff (MD) DQN, where it provides enhanced provable lower bound guidance to DQN that often speeds convergence for the positive transfer case while critically avoiding decelerated learning in the negative transfer case. Experiments show that MD-DQN matches or outperforms existing TL methods and baselines in both positive and negative transfer settings.

Abstract: In causal inference, a randomized experiment is a de facto method to overcome various theoretical issues in observational study. However, the experimental design requires expensive costs, so an efficient experimental design is necessary. We propose ABC3, a Bayesian active learning policy for causal inference. We show a policy minimizing an estimation error on conditional average treatment effect is equivalent to minimizing an integrated posterior variance, similar to Cohn criteria. We theoretically prove ABC3 also minimizes an imbalance between the treatment and control groups and the type 1 error probability. Imbalanceminimizing characteristic is especially notable as several works have emphasized the importance of achieving balance. Through extensive experiments on real-world data sets, ABC3 achieves the highest efficiency, while empirically showing the theoretical results hold.

Abstract: For a datagenerating process for random variables that can be described with a linear structural equation model, we consider a situation in which (i) a set of covariates satisfying the back-door criterion cannot be observed or (ii) such a set can be observed, but standard statistical estimation methods cannot be applied to estimate causal effects because of multicollinearity/high-dimensional data problems. We propose a novel two-stage penalized regression approach, the penalized covariate-mediator selection operator (PCM Selector), to estimate the causal effects in such scenarios. Unlike existing penalized regression analyses, when a set of intermediate variables is available, PCM Selector provides a consistent or less biased estimator of the causal effect. In addition, PCM Selector provides a variable selection procedure for intermediate variables to obtain better estimation accuracy of the causal effects than does the back-door criterion.

Abstract: Horiyama et al. (AAAI 2024) considered the problem of generating instances with a unique minimum vertex cover under certain conditions. The Preassignment for Uniquification of Minimum Vertex Cover problem (shortly PAU-VC) is the problem, for given a graph G, to find a minimum set S of vertices in G such that there is a unique minimum vertex cover of G containing S. We show that PAU-VC is fixed parameter tractable parameterized by clique-width, which improves an exponential algorithm for trees given by Horiyama et al. Among natural graph classes with unbounded clique-width, we show that the problem can be solved in polynomial time on split graphs and unit interval graphs.

Abstract: The clique partitioning problem (CPP) aims to find a partition of vertices of a complete graph in order to maximize the sum of edge weights within each partition (clique), which has been proven to be NPhard and has wide real-world applications. In this paper, we propose an elite-guided weighted simulated annealing algorithm called EWSA to solve the CPP. First, EWSA employs two specific configurations and alternates between them via an oscillation strategy, which balances the exploitation and exploration of the search. Second, a weighting strategy is introduced to improve the scoring function in traditional simulated annealing, which is able to guide the search to explore diverse solutions. Finally, a partition restriction strategy is adopted to reduce search space and increase the search efficiency. Experiments on 255 instances demonstrate the competitiveness of EWSA. For 130 open instances, EWSA discovers new upper bounds in 32 cases and matches the best known results for the others. For the remaining 125 closed instances, EWSA achieves the best known objective values within a short computational time.

Abstract: Discovering elements of a hidden set, also known as Group Testing (GT), is a wellestablished area in which one party tries to discover elements hidden by the other party by asking queries and analyzing feedback. The feedback is a function of the intersection of the query with the hidden set - in our case, it is a classical double-threshold function, which returns i if the intersection is a singleton i and "null" otherwise (i.e., when the intersection is empty or of size at least 2). In this work, we enhance GT by two features. First, we introduce a local feedback framework to this problem: each hidden element is an "autonomous" element and can analyze feedback itself, but only for the queries to which it belongs. The goal is to design a deterministic non-adaptive sequence of queries that enables each non-hidden element to learn about all other hidden elements. We show that, surprisingly, this task requires substantially more queries than the classic group testing -- by proving a super-cubic (in terms of the number of hidden elements) lower bound and by constructing a specific query sequence of slightly longer length. Such a query system is also an extension of a well-known superimposed code, in a way that the decoding can be done only by the owners of the codewords. Second, we extend the results to the model where elements may belong to certain clusters and retrieving them could be done only via queries avoiding elements from "interfering" clusters. The main challenge is in not knowing which interfering clusters are non-empty (and thus, need to be avoided) and how to speed up the retrieval process by asking queries across many clusters. Our algorithms can be generalized to other feedback functions, to adversarial/stochastic fault-prone scenarios, implemented in a distributed setting and applied to the information theory and codes.

Institute of Semiconductors, Chinese Academy of Sciences School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences Zhongguancun Academy, Institute of Semiconductors, Chinese Academy of Sciences School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences Zhongguancun Academy School of Integrated Circuits, University of Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences, Institute of Semiconductors, Chinese Academy of Sciences

Abstract: Mathematical formulas are the language of communication between humans and nature. Discovering latent formulas from observed data is an important challenge in artificial intelligence, commonly known as symbolic regression(SR). The current mainstream SR algorithms regard SR as a combinatorial optimization problem and use Genetic Programming (GP) or Reinforcement Learning (RL) to solve the SR problem. These methods perform well on simple problems, but poorly on slightly more complex tasks. In addition, this class of algorithms ignores an important aspect: in SR tasks, symbols have explicit numerical meaning. So can we take full advantage of this important property and try to solve the SR problem with more efficient numerical optimization methods? Extrapolation and Learning Equation (EQL) replaces activation functions in neural networks with basic symbols and sparsifies connections to derive a simplified expression from a large network. However, EQL's fixed network structure can't adapt to the complexity of different tasks, often resulting in redundancy or insufficient, limiting its effectiveness. Based on the above analysis, we propose MetaSymNet, a treelike network that employs the PANGU meta-function as its activation function. PANGU meta-function can evolve into various candidate functions during training. The network structure can also be adaptively adjusted according to different tasks. Then the symbol network evolves into a concise, interpretable mathematical expression. To evaluate the performance of MetaSymNet and five baseline algorithms, we conducted experiments across more than ten datasets, including SRBench. The experimental results show that MetaSymNet has achieved relatively excellent results on various evaluation metrics.

Abstract: By exploiting the correlation between the structure and the solution of MixedInteger Linear Programming (MILP), Machine Learning (ML) has become a promising method for solving large-scale MILP problems. Existing ML-based MILP solvers mainly focus on end-to-end solution learning, which suffers from the scalability issue due to the high dimensionality of the solution space. Instead of directly learning the optimal solution, this paper aims to learn a reduced and equivalent model of the original MILP as an intermediate step. The reduced model often corresponds to interpretable operations and is much simpler, enabling us to solve large-scale MILP problems much faster than existing commercial solvers. However, current approaches rely only on the optimal reduced model, overlooking the significant preference information of all reduced models. To address this issue, this paper proposes a preference-based model reduction learning method, which considers the relative performance (i.e., objective cost and constraint feasibility) of all reduced models on each MILP instance as preferences. We also introduce an attention mechanism to capture and represent preference information, which helps improve the performance of model reduction learning tasks. Moreover, we propose a SetCover based pruning method to control the number of reduced models (i.e., labels), thereby simplifying the learning process. Evaluation on real-world MILP problems shows that 1) compared to the state-of-the-art model reduction ML methods, our method obtains nearly 20% improvement on solution accuracy, and 2) compared to the commercial solver Gurobi, two to four orders of magnitude speedups are achieved.

Abstract: We introduce Limited Rollout Beam Search (LRBS), a beam search strategy for deep reinforcement learning (DRL) based combinatorial optimization improvement heuristics. Utilizing pretrained models on the Euclidean Traveling Salesperson Problem, LRBS significantly enhances both in-distribution performance and generalization to larger problem instances, achieving optimality gaps that outperform existing improvement heuristics and narrowing the gap with state-of-the-art constructive methods. We also extend our analysis to two pickup and delivery TSP variants to validate our results. Finally, we employ our search strategy for offline and online adaptation of the pre-trained improvement policy, leading to improved search performance and surpassing recent adaptive methods for constructive heuristics.

Abstract: The exact nadir objective vector of a multiobjective discrete optimization problem (MODOP) is crucial for decision-making but remains challenging to find. Existing methods for tackling this issue have limitations in theoretical guarantees or high computational costs. This paper applies boundary decomposition to the MODOP and proposes an exact algorithm called BDNC. BDNC is designed to address a bilevel optimization problem for each objective with finite-time convergence guarantees. The lower-level optimization problem, termed the boundary subproblem, is a scalarization of the MODOP. It can be solved using any suitable single-objective exact solver. According to the theoretical foundations of boundary decomposition, some specific settings of the boundary subproblem can ensure alignment with the nadir objective vector under mild conditions. The upper-level optimization problem evaluates a potential setting using the optimal solution to the lower-level one. It employs our proposed novel pruning method to efficiently identify the specific settings. Moreover, BDNC can leverage a trade-off provided by the decision-makers, potentially facilitating the decision-making process. Experiments on various MODOPs demonstrate that BDNC exhibits superior and reliable performance in terms of runtime compared to existing exact methods.

Abstract: This paper proposes a dual divideand-optimize algorithm (DualOpt) for solving the large-scale traveling salesman problem (TSP). DualOpt combines two complementary strategies to improve both solution quality and computational efficiency. The first strategy is a grid-based divide-and-conquer procedure that partitions the TSP into smaller sub-problems, solving them in parallel and iteratively refining the solution by merging nodes and partial routes. The process continues until only one grid remains, yielding a high-quality initial solution. The second strategy involves a path-based divide-and-optimize procedure that further optimizes the solution by dividing it into sub-paths, optimizing each using a neural solver, and merging them back to progressively improve the overall solution. Extensive experiments conducted on two groups of TSP benchmark instances, including randomly generated instances with up to 100,000 nodes and real-world datasets from TSPLIB, demonstrate the effectiveness of DualOpt. The proposed DualOpt achieves highly competitive results compared to 10 state-of-the-art algorithms in the literature. In particular, DualOpt achieves an improvement gap up to 1.40% for the largest instance TSP100K with a remarkable 104x speed-up over the leading heuristic solver LKH3. Additionally, DualOpt demonstrates strong generalization on TSPLIB benchmarks, confirming its capability to tackle diverse real-world TSP applications.

Abstract: Language models aligned for safety often exhibit fragile and imbalanced mechanisms, increasing the chances of producing unsafe content. In addition, editing techniques to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SafeInfer, a contextadaptive, decoding-time safety alignment strategy for generating safe responses to user queries. safeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model’s hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies.

Abstract: Researchers, policymakers, and developers of artificial intelligence (AI) are actively collaborating to establish trustworthy AI standards that align with broader societal values, particularly in the context of large language models (LLMs). However, the critical discourse on bridging the vast knowledge gap between experts who shape and implement standards for LLMs and users whose values are at stake remains largely unaddressed. Taking a "bottomup" perspective and using a mixed-method approach, we first conducted interviews (N = 12) to engage with users' perceptions of normative standards in the context of LLMs. We thereby identified 68 specific criteria that users' consider when evaluating whether their values are fulfilled. Second, we conducted an online survey (N = 379) to further investigate how users prioritize these standards and the identified criteria in conversational LLM-based applications. Our findings reveal opportunities for strategic communication measures, the importance of transparent governance mechanisms and the necessity of non-technical complements to technical solutions for bridging the knowledge gap. We discuss actionable steps to effectively communicate trustworthy AI standards.

Abstract: Obtaining highquality explanations of a model's output enables developers to identify and correct biases, align the system's behavior with human values, and ensure ethical compliance. Explainable Artificial Intelligence (XAI) practitioners rely on specific measures to gauge the quality of such explanations. These measures assess key attributes, such as how closely an explanation aligns with a model's decision process (faithfulness), how accurately it pinpoints the relevant input features (localization), and its consistency across different cases (robustness). Despite providing valuable information, these measures do not fully address a critical practitioner's concern: how does the quality of a given explanation compare to other potential explanations? Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE). The QGE method offers a direct comparison to what can be viewed as the `inverse' explanation, one that conceptually represents the antithesis of the original explanation. Our extensive testing across multiple model architectures, datasets, and established quality metrics demonstrates that the QGE method is superior to the traditional approach. Furthermore, we show that QGE enhances the statistical reliability of these quality assessments. This advance represents a significant step toward a more insightful evaluation of explanations that enables a more effective inspection of a model's behavior.

Abstract: The success of the reward model in distinguishing between responses with subtle safety differences depends critically on the highquality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop the datasets involving preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote the margin-enhanced preference dataset development. Our framework, Legend, Leverages rEpresentation enGineering to annotate preferENce Datasets. It constructs the specific direction within the LLM's embedding space that represents safety. By leveraging this safety direction, Legend can then leverage the semantic distances of paired responses along this direction to annotate margins automatically. We experimentally demonstrate our effectiveness in both reward modeling and harmless alignment for LLMs. Legend also stands out for its efficiency, requiring only the inference time rather than additional training. This efficiency allows for easier implementation and scalability, making Legend particularly valuable for practical applications in aligning LLMs with safe conversations.

Abstract: Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries. To mitigate potential harm and prevent misuse, there have been concerted efforts to align the LLMs with human values and legal compliance by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF), into the training of the LLMs. However, recent research has exposed that even aligned LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query. Token Highlighter introduced a concept called Affirmation Loss to measure the LLM's willingness to answer the user query. It then uses the gradient of Affirmation Loss for each token in the user query to locate the jailbreakcritical tokens. Further, Token Highlighter exploits our proposed Soft Removal technique to mitigate the jailbreak effects of critical tokens via shrinking their token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions of the AlpacaEval benchmark. In addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute the Affirmation Loss and can highlight the critical tokens upon refusal.

Abstract: Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instructionfollowing capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We study two attacks to exploit the ChatBug vulnerability. Additionally, we demonstrate that the success of multiple existing attacks can be attributed to the ChatBug vulnerability. We show that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research.

Abstract: Open source is a driving force behind scientific advancement. However, this openness is also a doubleedged sword, with the inherent risk that innovative technologies can be misused for purposes harmful to society. What is the likelihood that an open source AI model or dataset will be used to commit a real-world crime, and if a criminal does exploit it, will the people behind the technology be able to escape legal liability? To address these questions, we explore a legal domain where individual choices can have a significant impact on society. Specifically, we build the EVE-v1 dataset that comprises 200 question-answer pairs related to criminal offenses based on 200 Korean precedents first to explore the possibility of malicious models emerging. We further developed EVE-v2 using 600 fraud-related precedents to confirm the existence of malicious models that can provide harmful advice on a wide range of criminal topics to test the domain generalization ability. Remarkably, widely used open-source large-scale language models (LLMs) provide unethical and detailed information about criminal activities when fine-tuned with \oursall. We also take an in-depth look at the legal issues that malicious language models and their builders could realistically face. Our findings highlight the paradoxical dilemma that open source accelerates scientific progress, but requires great care to minimize the potential for misuse.

Abstract: We present a Reinforcement Learning Platform for Adversarial Blackbox untargeted and targeted attacks, RLAB, that allows users to select from various distortion filters to create adversarial examples. The platform uses a Reinforcement Learning agent to add minimum distortion to input images while still causing misclassification by the target model. The agent uses a novel dual-action method to explore the input image at each step to identify sensitive regions for adding distortions while removing noises that have less impact on the target model. This dual action leads to faster and more efficient convergence of the attack. The platform can also be used to measure the robustness of image classification models against specific distortion types. Also, retraining the model with adversarial samples significantly improved robustness when evaluated on benchmark datasets. The proposed platform outperforms state-of-the-art methods in terms of the average number of queries required to cause misclassification. This advances trustworthiness with a positive social impact.

Abstract: The aim of inverse reinforcement learning (IRL) is to infer an agent's preferences from observing their behaviour. Usually, preferences are modelled as a reward function, R, and behaviour is modelled as a policy, pi. One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour. That is, R is typically underdetermined by pi, which means that R is only partially identifiable. Recent work has characterised the extent of this partial identifiability for different types of agents, including optimal and Boltzmannrational agents. However, work so far has only considered agents that discount future reward exponentially: this is a serious limitation, especially given that extensive work in the behavioural sciences suggests that humans are better modelled as discounting hyperbolically. In this work, we newly characterise partial identifiability in IRL for agents with non-exponential discounting: our results are in particular relevant for hyperbolical discounting, but they also more generally apply to agents that use other types of (non-exponential) discounting. We significantly show that generally IRL is unable to infer enough information about R to identify the correct optimal policy, which entails that IRL alone can be insufficient to adequately characterise the preferences of such agents.

Abstract: Direct Preference Optimization (DPO) has recently expanded its successful application from aligning large language models (LLMs) to aligning textto-image models with human preferences, which has generated considerable interest within the community. However, we have observed that these approaches rely solely on minimizing the reverse Kullback-Leibler divergence during alignment process between the fine-tuned model and the reference model, neglecting incorporation of other divergence constraints. In this study, we focus on extending reverse Kullback-Leibler divergence in the alignment paradigm of text-to-image models to f-divergence, which aims to garner better alignment performance as well as good generation diversity. We provide the generalized formula of text-to-image alignment paradigm under f-divergence condition and thoroughly analyze the impact of different divergence constraints on alignment process from the perspective of gradient fields. We conduct comprehensive evaluation on text-image alignment performance, human value alignment performance and generation diversity performance under different divergence constraints, and the results indicate that text-to-image alignment based on Jensen-Shannon divergence achieves the best trade-off among them. The option of divergence employed for aligning text-to-image models significantly impacts the trade-off between alignment performance (especially human value alignment) and generation diversity, which highlights the necessity of selecting an appropriate divergence for practical applications.

Abstract: Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of underobserving species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.

School of Computing, Korea Advanced Institute of Science and Technology (KAIST), School of Computing, Korea Advanced Institute of Science and Technology (KAIST), School of Computing, Korea Advanced Institute of Science and Technology (KAIST), College of Business, Korea Advanced Institute of Science and Technology (KAIST) School of Computing, Korea Advanced Institute of Science and Technology (KAIST) Graduate School of Data Science, Korea Advanced Institute of Science and Technology (KAIST), Hong Kong University of Science and Technology (HKUST), Max Planck Institute for Security and Privacy (MPI-SP) School of Computing, Korea Advanced Institute of Science and Technology (KAIST)

Abstract: The increasing frequency and intensity of natural disasters call for rapid and accurate damage assessment. In response, disaster benchmark datasets from highresolution satellite imagery have been constructed to develop methods for detecting damaged areas. However, these methods face significant challenges when applied to previously unseen regions due to the limited geographical and disaster-type diversity in the existing datasets. We introduce DAVI (Disaster Assessment with VIsion foundation model), a novel approach that addresses domain disparities and detects structural damage at the building level without requiring ground-truth labels for target regions. DAVI combines task-specific knowledge from a model trained on source regions with task-agnostic knowledge from an image segmentation model to generate pseudo labels indicating potential damage in target regions. It then utilizes a two-stage refinement process, which operate at both pixel and image levels, to accurately identify changes in disaster-affected areas. Our evaluation, including a case study on the 2023 Türkiye earthquake, demonstrates that our model achieves exceptional performance across diverse terrains (e.g., North America, Asia, and the Middle East) and disaster types (e.g., wildfires, hurricanes, and tsunamis). This confirms its robustness in disaster assessment without dependence on ground-truth labels and highlights its practical applicability.

Abstract: During the COVID19 pandemic, a major driver of new surges has been the emergence of new variants. When a new variant emerges in one or more countries, other nations monitor its spread in preparation for its potential arrival. The impact of the new variant and the timings of epidemic peaks in a country highly depend on when the variant arrives. The current methods for predicting the spread of new variants rely on statistical modeling, however, these methods work only when the new variant has already arrived in the region of interest and has a significant prevalence. Can we predict when a variant existing elsewhere will arrive in a given region? To address this question, we propose a variant-dynamics-informed Graph Neural Network (GNN) approach. First, we derive the dynamics of variant prevalence across pairs of regions (countries) that apply to a large class of epidemic models. The dynamics motivate the introduction of certain features in the GNN. We demonstrate that our proposed dynamics-informed GNN outperforms all the baselines, including the currently pervasive framework of Physics-Informed Neural Networks (PINNs). To advance research in this area, we introduce a benchmarking tool to assess a user-defined model's prediction performance across 87 countries and 36 variants.

Abstract: Accurate detection of dust storms is challenging due to complex meteorological interactions. With the development of deep learning, deep neural networks have been increasingly applied to dust storm detection, offering better learning and generalization capabilities compared to traditional physical modeling. However, existing methods face some limitations, leading to performance bottlenecks in dust storm detection. From the task perspective, existing research focuses on occurrence detection while neglecting intensity detection. From the data perspective, existing research fails to explore the utilization of multisource data. From the model perspective, most models are built on convolutional neural networks, which have an inherent limitation in capturing long-range dependencies. To address the challenges mentioned, this study proposes Dust-Mamba. To the best of our knowledge, this study is the first attempt to accomplish both the occurrence and intensity detection of dust storms with advanced deep learning technology. In Dust-Mamba, multi-source data is introduced to provide a comprehensive perspective, Mamba and attention are applied to boost feature selection while maintaining long-range modeling capability. Additionally, this study proposes Structure Sharing Transfer Learning Strategies for intensity detection, which further enhances the performance of Dust-Mamba with minimal time cost. As shown by experiments, Dust-Mamba achieves Dice scores of 0.963 for occurrence detection and 0.560 for intensity detection, surpassing several baseline models. In conclusion, this study offers valuable baselines for dust storm detection, with significant reference value and promising application potential.

Abstract: Correctly assessing the malignancy of breast lesions identified during ultrasound examinations is crucial for effective clinical decisionmaking. However, the current "gold standard" relies on manual BI-RADS scoring by clinicians, often leading to unnecessary biopsies and a significant mental health burden on patients and their families. In this paper, we introduce PersonalizedUS, an interpretable machine learning system that leverages recent advances in conformal prediction to provide precise and personalized risk estimates with local coverage guarantees and sensitivity, specificity, and predictive values above 0.9 across various threshold levels. In particular, we identify meaningful lesion subgroups where distribution-free, model-agnostic conditional coverage holds, with approximately 90% of our prediction sets containing only the ground truth in most lesion subgroups, thus explicitly characterizing for which patients the model is most suitably applied. Moreover, we make available a curated tabular dataset of 1936 biopsied breast lesions from a recent observational multicenter study and benchmark the performance of several state-of-the-art learning algorithms. We also report a successful case study of the deployed system in the same multicenter context. Concrete clinical benefits include up to a 65% reduction in requested biopsies among BI-RADS 4a and 4b lesions, with minimal to no missed cancer cases.

Abstract: In recent years, the rapid development of Large Language Models has highlighted the urgent need for largescale, high-quality, and diverse data. We have launched an LLM data co-creation platform aimed at bringing together a wide range of participants to contribute data. Within six months, the platform has attracted over 10,000 participants who contributed more than 150,000 data entries across more than 200 tasks. An observable user cohort was constructed around the question, "Who is the best data contributor?" along with sub-questions concerning user preferences, task competence, and more. Through a detailed analysis of data contributors, this paper reveals several data collection patterns related to human factors. It reveals that contributors who provide high-quality data often do not meet initial expectations, as their behavior exhibits typical characteristics of the Dunning-Kruger effect. This paper examined the cognitive bias between users' self-assessment and actual abilities, where individuals tend to overestimate their capabilities in certain tasks, leading to a decreased willingness to continue contributing and a consequent waste of human resources. To address this issue, we propose a task reassignment method based on multi-task fine-tuning of small language models (SLMs) to better align user groups with appropriate task types. After the reallocation, we observed a significant increase in user engagement and platform benefits, along with improved overall platform efficiency. The versatility of this method makes it applicable to broader data collection scenarios.

Abstract: This work introduces a novel graph neural networks (GNNs)based method to predict stream water temperature and reduce model bias across locations of different income and education levels. Traditional physics-based models often have limited accuracy because they are necessarily approximations of reality. Recently, there has been an increasing interest of using GNNs in modeling complex water dynamics in stream networks. Despite their promise in improving the accuracy, GNNs can bring additional model bias through the aggregation process, where node features are updated by aggregating neighboring nodes. The bias can be especially pronounced when nodes with similar sensitive attributes are frequently connected. We introduce a new method that leverages physical knowledge to represent the node influence in GNNs, and then utilizes physics-based influence to refine the selection and weights over the neighbors. The objective is to facilitate equitable treatment over different sensitive groups in the graph aggregation, which helps reduce spatial bias over locations, especially for those in underprivileged groups. The results on the Delaware River Basin demonstrate the effectiveness of the proposed method in preserving equitable performance across locations in different sensitive groups.

Xi'an Key Laboratory of Big Data and Intelligent Vision School of Computer Science and Technology, Xidian University, Xi'an Key Laboratory of Big Data and Intelligent Vision School of Computer Science and Technology, Xidian University, Xi'an Key Laboratory of Big Data and Intelligent Vision School of Computer Science and Technology, Xidian University, School of Computer Science and Technology, Xidian University, School of Computer Science and Technology, Xidian University, Xi'an Key Laboratory of Big Data and Intelligent Vision School of Computer Science and Technology, Xidian University

Abstract: Our world faces the challenge of efficiently and responsibly managing the evergrowing volume of urban waste. Many countries and regions have implemented categorized trash bins and require residents to sort their waste according to specified criteria. Proper waste classification by residents significantly reduces the workload in the waste disposal process. However, due to the lack of effective supervision during classification, the quality of waste sorting is often compromised. This misclassification can lead to higher pollution risks, lower recycling rates, and increased waste management costs and difficulties. To address this issue, we propose using images captured from within trash bins to supervise garbage delivery. We introduce UrbanWaste, an image dataset specifically designed for in-the-bin waste detection and segmentation. The dataset includes 25,254 RGB images and 140,008 annotated items, featuring dense annotations and multi-granularity labels across 193 distinct waste categories. We evaluated state-of-the-art segmentation models to understand their generalization and performance on UrbanWaste. Based on this dataset, we developed a comprehensive workflow for waste classification inspection, which has been deployed in real-world districts to assess the system's effectiveness. We hope UrbanWaste will inspire new directions in AI research for environmental sustainability.

Abstract: Illegal, Unreported, and Unregulated (IUU) fishing aggravates the global crisis caused by overfishing, threatening the sustainability of marine ecosystems and fisheries worldwide. Distinctive operational characteristics of fishing vessels result in unique footprints on marine environments and socioeconomic structures, depending on their fishing method and gear type such as trawlers with non-selective gear that disrupts the seabed, purse seiners using Fish Aggregating Devices (FADs), and longliners notorious for high bycatch rates. As these vessels play an essential role in commercial fishing and the industry, effective monitoring, regulation, and enforcement are critical to mitigate the devastating consequences of overfishing and promote sustainable fishing practices. To this end, this paper introduces a novel multi-stage method for Gear type Identification by Spatiotemporal trajectory Transformation (GIST). This method proposes a data-centric approach that employs domain knowledge to facilitate the deployment of an efficient and accurate analysis of operational patterns of fishing vessels derived from Automatic Identification System (AIS) data. Our method first extracts fishing patterns from vessel trajectories to refine data integrity and isolate only the most relevant activities, thereby ensuring a more accurate result. Next, it encapsulates the distributional insights of fishing activities into fixed-sized "images" as actionable input for a multi-class CNN-based classifier. Utilizing GIST bypasses complicated linear analyses of time series data for rendering lengthy trajectories, advancing an efficient gear type identification with 97% accuracy. To the best of our knowledge, GIST is the first to use a multi-stage method to distinguish three principal gear types widely used globally. Our experiments confirm GIST's practicability and effectiveness, marking a significant advancement towards stricter enforcement of regulations in the fight against IUU fishing.

Abstract: Monitoring realtime air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce AirRadar, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar’s efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data.

Abstract: Physicsguided machine learning (PGML) has become a prevalent approach in studying scientific systems due to its ability to integrate scientific theories for enhancing machine learning (ML) models. However, most PGML approaches are tailored to isolated and relatively simple tasks, which limits their applicability to complex systems involving multiple interacting processes and numerous influencing features. In this paper, we propose a Physics-Guided Foundation Model (PGFM) that combines pre-trained ML models and physics-based models and leverages their complementary strengths to improve the modeling of multiple coupled processes. To effectively conduct pre-training, we construct a simulated environmental system that encompasses a wide range of influencing features and various simulated variables generated by physics-based models. The model is pre-trained in this system to adaptively select important feature interactions guided by multi-task objectives. We then fine-tune the model for each specific task using true observations, while maintaining consistency with established physical theories, such as the principles of mass and energy conservation. We demonstrate the effectiveness of this methodology in modeling water temperature and dissolved oxygen dynamics in real-world lakes. The proposed PGFM is also broadly applicable to a range of scientific fields where physics-based models are being used.

Abstract: Estimating service capabilities for logistics terminal stations is essential for guiding operations adjustments to enhance customer experience. However, existing studies often focus on isolated metrics like ontime delivery or complaint rates, each reflecting a specific aspect of service capabilities. To provide a more comprehensive evaluation, we design AdaService, an Adaptive multi-faceted Service capabilities co-estimation framework. We begin by constructing Multi-faceted Hypergraph to encode stations using multiple performance metrics. We then introduce a Multi-faceted Hypergraph Convolution Network (MHCN) to capture the heterogeneous service capabilities across stations, providing a comprehensive capabilities representation. Finally, we apply an Adaptive Multi-faceted Estimation module that uses multi-task learning to model dynamic interactions among these metrics, enhancing predictive accuracy. Extensive evaluation with real-world data collected from nationwide stations in a leading logistics company in China demonstrates that AdaService significantly outperforms state-of-the-art methods, improving estimation accuracy for on-time delivery, on-time pick-up, and complaint rates by up to 18.98%, 9.30%, and 39.62%.

Abstract: Ensuring that AI systems do what we, as humans, actually want them to do, is one of the biggest open research challenges in AI alignment and safety. My research seeks to directly address this challenge by enabling AI systems to interact with humans to learn aligned and robust behaviors. The way in which robots and other AI systems behave is often the result of optimizing a reward function. However, manually designing good reward functions is highly challenging and error prone, even for domain experts. Consider trying to write down a reward function that describes good driving behavior or how you like your bed made in the morning. While reward functions for these tasks are difficult to manually specify, human feedback in the form of demonstrations or preferences are often much easier to obtain. However, human data is often difficult to interpret, due to ambiguity and noise. Thus, it is critical that AI systems take into account epistemic uncertainty over the human's true intent. My talk will give an overview of my lab's progress along the following fundamental research areas: (1) efficiently maintaining uncertainty over human intent, (2) directly optimizing behavior to be robust to uncertainty over human intent, and (3) actively querying for additional human input to reduce uncertainty over human intent.

Abstract: Machine Learning (ML) algorithms are increasingly used in our daily lives, yet often exhibit discrimination against protected groups. In this talk, I discuss the growing concern of bias in ML and overview existing approaches to address fairness issues. Then, I present three novel approaches developed by my research group. The first leverages generative AI to eliminate biases in training datasets, the second tackles nonconvex problems arise in fair learning, and the third introduces a matrix decomposition-based post-processing approach to identify and eliminate unfair model components.

Abstract: A prevalent assumption in humanrobot and human-AI teaming is that artificial teammates should be compliant and obedient. In this talk, I will question this assumption by presenting the Guide Robot Grand Challenge and discussing the components required to design and build a service robot that can intelligently disobey. This challenge encompasses a variety of research problems, as I will exemplify via three challenges: reasoning about the goals of other agents, choosing when to interrupt, and interacting in a tightly coupled physical environment.

Abstract: Modelbased reasoning agents are ill-equipped to act in novel situations in which their model of the environment no longer sufficiently represents the world. We propose HYDRA, a framework for designing model-based agents operating in mixed discrete-continuous worlds that can autonomously detect when the environment has evolved from its canonical setup, understand how it has evolved, and adapt the agents' models to perform effectively. HYDRA is based upon PDDL+, a rich modeling language for planning in mixed, discrete-continuous environments. It augments the planning module with visual reasoning, task selection, and action execution modules for closed-loop interaction with complex environments. HYDRA implements a novel meta-reasoning process that enables the agent to monitor its own behavior from a variety of aspects. The process employs a diverse set of computational methods to maintain expectations about the agent's own behavior in an environment. Divergences from those expectations are useful in detecting when the environment has evolved and identifying opportunities to adapt the underlying models. HYDRA builds upon ideas from diagnosis and repair and uses a heuristics-guided search over model changes such that they become competent in novel conditions. The HYDRA framework has been used to implement novelty-aware agents for three diverse domains - CartPole++ (a higher dimension variant of a classic control problem), Science Birds (an IJCAI competition problem), and PogoStick (a specific problem domain in Minecraft). We report empirical observations from these domains to demonstrate the efficacy of various components in the novelty meta-reasoning process.

Abstract: As artificial intelligence (AI) systems become increasingly deployed across the world, they are also increasingly implicated in AI incidents – harm events to individuals and society. As a result, industry, civil society, and governments worldwide are developing best practices and regulations for monitoring and analyzing AI incidents. The AI Incident Database (AIID) is a project that catalogs AI incidents and supports further research by providing a platform to classify incidents for different operational and researchoriented goals. This study reviews the AIID’s dataset of 750+ AI incidents and two independent taxonomies applied to these incidents to identify common challenges to indexing and analyzing AI incidents. We find that certain patterns of AI incidents present structural ambiguities that challenge incident databasing and explore how epistemic uncertainty in AI incident reporting is unavoidable. We therefore report mitigations to make incident processes more robust to uncertainty related to cause, extent of harm, severity, or technical details of implicated systems. With these findings, we discuss how to develop future AI incident reporting practices.

Abstract: This paper presents an experiential learning pedagogy that teaches undergraduate business management information systems students handson AI skills through the lens of sustainability. The learning modules aim to empower undergraduate business students to gain interest and confidence in AI knowledge, skills, and careers, to sharpen their higher order thinking abilities, and to help them gain a deeper understanding of sustainability issues. Students learn AI through developing chatbots that address pressing sustainability issues within their own communities. Results of the pilot study indicate that students have increased self-efficacy in AI, more positive attitudes towards AI learning and AI-related careers, enhanced sustainability awareness, and more confidence in their ability to innovate.

Abstract: As Artificial Intelligence (AI) continues to integrate into more aspects of society, equipping younger generations with foundational AI knowledge becomes increasingly critical. This paper presents Word2Vec4Kids (W2V4K), an interactive application designed to familiarize middle school students with word embeddings, a key aspect of Natural Language Processing (NLP). W2V4K leverages the Word2Vec model, allowing students to explore word associations, similarity, and vector arithmetic through engaging game modes. The application was tested with 38 middle school students aged 1114 at a Science Technology Engineering Math (STEM)-focused charter school. Data were collected on students' interactions with the application, including screen recordings, audio, and survey responses. Results demonstrated that W2V4K effectively introduces NLP concepts to students. Qualitative observations revealed high levels of engagement with students expressing excitement and curiosity about word relationships. As they progressed through the game modes, students showed increasing confidence in predicting word associations, brainstorming relevant words, and connecting the concepts to real-world applications. Quantitative data from post-interaction surveys indicated positive learning outcomes with 44.5% of students achieving perfect scores on concept-related items. Additionally, students demonstrated an ability to critically think about language representation. This study suggests that W2V4K provides an effective and engaging method for introducing NLP concepts to middle school students, contributing to the broader goal of enhancing AI literacy among younger generations.

Abstract: Developing trustworthy AI requires advancing methods that meet key requirements such as privacy or fairness while maintaining strong utility, as well as understanding the intricate interdependencies between these dimensions, which often manifest as tradeoffs. My PhD research focuses on differential privacy, which is widely regarded as the state-of-the-art for protecting privacy in data analysis and machine learning. I investigate the relationships between differential privacy, utility and fairness, with the goal of advancing the adoption of differentially private machine learning in real-world settings.

Abstract: Aging biomarkers play a crucial role in uncovering the biological mechanisms behind aging and in developing strategies to support healthy aging. However, the search for reliable aging biomarkers is particularly challenging due to the intricate and multifactorial nature of the aging process. Furthermore, biomarker names and categories are not wellstandardized in the current literature. While, a formal definition of a biomarker is nonexistent in the current literature, formally defining biomarkers and standardizing the vocabulary for biomarkers can help accelerate AI research around this concept which can lead to better, faster and more accurate analyses of the existing data and literature. Thus, in this work, we generated Knowledge Graphs that can help us define and standardize biomarkers. We present our Knowledge Graphs (KGs) generated using both an LLM and expert-curated datasets. We compare both KGs to understand why systematic integration between these two models is needed. The integration of Knowledge Graphs (KGs) and Large Language Models (LLMs) presents a promising approach to advancing aging biomarker research through the inherent structured and standardized nature of ontology schemas in knowledge graphs. We showcase that the accuracy of LLM-generated KGs remains questionable but systematic methods such as KNARM can help us with the accuracy of these efforts. In future work, we will propose a synergistic framework where KGs and LLMs interact iteratively to improve both the comprehensiveness and accuracy of aging biomarker information.

Abstract: Phishing emails are an escalating threat, underscoring the need for precise detection methods. While large language models (LLMs) have gained attention for their potential in this area, their reliance on extensive data for finetuning poses practical challenges. This paper introduces DualLM for phishing detection with minimal data, which distills the reasoning ability from a large LM to enhance a small target LM and integrates trainable perturbations to improve the small LM's inference capabilities. Experiments demonstrate that DualLM can benefit from dual LMs, which reduces training parameters and data required, while maintaining high performance in phishing email detection with limited data.

Abstract: The quality of interactions between parents and children is a critical factor in child development. Recent years have seen programs to improve parenting behaviors through evidencebased approaches, such as attachment-based interventions. A vital element of these programs is to assess the quality of parenting behaviors via video recordings of parent-child interactions, which is often time-intensive. In our previous work, we explored machine learning models to predict expert ratings of parenting behaviors from video recordings of semi-structured parent-child play. However, the large set of low-level multimodal features struggled to provide explainable insights, which created barriers to communicating with domain experts and improving the models further. In this work, we developed a machine learning pipeline that combines sparse multiple canonical correlation analysis with causal discovery techniques to uncover explainable causal relationships between nine categories of behavioral features and the quality ratings of parent-child interactions. This approach offers valuable insights into the otherwise black-box models and contributes to the growing body of work on transparent and trustworthy machine learning models of parenting behaviors.

Abstract: The task of 3D object detection is crucial for various applications that rely on identifying objects in threedimensional space using inputs like LiDAR point clouds and images. However, LiDAR-based detection faces challenges due to the sparsity of point clouds, especially at greater distances. To address this, depth completion models have been used to generate virtual points from RGB images, but they struggle with real-time applications due to high computational costs. Our work eliminates the depth completion process, significantly improving processing speed while minimizing performance degradation. Consequently, our method has achieved an optimal balance between speed and accuracy on the KITTI leaderboard.

Abstract: In the rapidly advancing field of AIassisted medical diagnosis, the generation of medical reports for Chest X-rays (CXR) has significantly improved with the increased availability of radiographs and their corresponding reports. However, these reports often contain complex medical terminology, making them difficult for patients and non-healthcare professionals to understand. In this study, we introduce a strategy called Chained Prompting for Improved Readability of Medical Reports (CPIR-MR), which translates original medical reports into more comprehensible language. Our primary contribution is the creation of a new extension to the IU X-Ray dataset, providing Simplified Medical Reports (SMRs) generated by CPIR-MR. Additionally, we demonstrate that standard methodologies can effectively produce these simplified reports by proposing a multi-modal text decoder (MTD) that combines BLIP with a classification network to generate simplified medical explanations (SMEs) when fine-tuned on SMRs.

Abstract: Achieving optimal design is a crucial aspect of any design process for safe and efficient operation. Such tasks typically require numerous simulations over many iterations, which can become computationally expensive. This paper proposes a novel method that combines Physicsinformed Neural Networks (PINNs) with a Genetic Algorithm to optimize the parameters of an airfoil that aims to achieve favourable aerodynamic conditions. Traditional solvers are computationally expensive for performing such tasks, but using PINNs can significantly reduce this while keeping accuracy high. The proposed approach shows the advantage of using PINNs in optimizing complex engineering problems.

Abstract: In federated learning, frequent parameter transmission between clients and the server results in significant communication overhead, particularly due to redundancy within the parameters. To address this issue, we propose a Complementary Pruning for Deviceto-Device Communication (FedCPD) method. This approach effectively reduces the amount of transmitted parameters by applying complementary pruning techniques on both the server and clients. Additionally, we decrease the communication frequency between clients and the server by employing chain updates among clients (i.e., device-to-device communication). We conducted experiments on the MNIST, FMNIST, CIFAR-10, and CIFAR-100 datasets, and the results demonstrate that our method significantly reduces communication costs while improving model accuracy.

Abstract: Large language models (LLMs) are trained on vast amounts of publicly available text. However, the current training frameworks take for granted that these annotations are accurate reflections of the authors’ true intents. This study questions that assumption by examining the gaps between writers’ actual psychological states and the inferences made by thirdparty annotators. We explore how readers interpret psychological cues in text and demonstrate that third-person annotations often fail to align with first-person realities. By integrating both first- and third-person annotations, we develop computational models that reveal significant biases in how psychological states are perceived and the downstream effects these perceptions have on reader behavior. Our findings challenge the foundational assumptions of LLM training, suggesting that the reliance on potentially flawed third-person annotations could impact model accuracy and real-world applications.

Abstract: Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in realworld, enterprise-level interactions, where context and domain-specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi-agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real-world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few-shot methods.

Abstract: In this paper, we propose SDAS, a new motion data assessment and storage system designed to acquire new motion data with reduced redundancy and maximizing diversity. SDAS collects data in the field, retrieves the most similar data from the database in realtime, and provides visualization tools that allow for the comparison of differences between the capture data and the stored data. Through this system, researchers can efficiently build and manage a database. The demonstration video is available at https://youtu.be/vqW0uMDnZTw.

Abstract: The increasing demand for realtime analysis in video streaming has driven significant advancements in object detection and motion prediction. This paper presents SkelAI, an innovative application that combines YOLOv8, OpenCV, OpenAI API, and our own innovative algorithms to achieve real-time object detection and medial axis skeletonization tailored explicitly for live video streaming environments. In addition, SkelAI integrates AI-generated image capabilities through the DALL-E 3 model, enabling the extraction of skeletons from synthetic content that simulates streaming scenarios. The application supports exporting skeleton data in PyTorch-compatible formats, facilitating the training of sequence predicting deep learning models. Comprehensive evaluations demonstrate SkelAI’s enhanced accuracy, efficiency, and versatility compared to existing tools, underscoring its potential applications in digital animation, biomechanical research and robotics, human-computer interaction, and video compression within streaming platforms.

Abstract: In this paper, we introduce AutoMV, an autonomous agent framework designed for generating real estate marketing videos. The framework integrates a diverse set of existing models into a tool library, allowing the agent to intelligently select and execute the appropriate tools. Given property images and text, the agent decomposes the task into manageable subtasks, generating storyline directives and corresponding camera movement trajectories to guide the video production process. By automatically applying video synthesis techniques and incorporating multimedia elements such as subtitles and background music, the agent transforms static real estate images into dynamic, visually appealing videos, thereby optimizing their impact for digital marketing purposes.

Abstract: In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the finetuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models will be released.

Abstract: Dealing with the distribution shift is a significant challenge when building offline reinforcement learning (RL) models that can generalize from a static dataset to outof-distribution (OOD) scenarios. Previous approaches have employed pessimism or conservatism strategies. More recently, data-driven work has taken a distributional perspective, treating offline data as a domain adaptation problem. However, these methods use heuristic techniques to simulate distribution shifts, resulting in a limited diversity of artificially created distribution gaps. In this paper, we propose a novel perspective: offline datasets inherently contain multiple latent distributions, with behavior data from diverse policies potentially following different distributions and data from the same policy across various time phases also exhibiting distribution variance. We introduce the Latent Distribution Representation Learning (LAD) framework, which aims to characterize the multiple latent distributions within offline data and reduce the distribution gaps between any pair of them. LAD consists of a min-max adversarial process: it first identifies the "worst-case" distributions to enlarge the diversity of distribution gaps and then reduces these gaps to learn invariant representations for generalization. We derive a generalization error bound to support LAD theoretically and verify its effectiveness through extensive experiments.

Abstract: Pareto Front Learning (PFL) has been one of the effective means to resolve multiobjective optimization problems through exploring all optimal solutions to learn the entire Pareto front. Pareto Hypernetwork (PHN) is a new promising way to generate the sequence of Pareto-optimal solutions that can be further used as potential solutions to constitute the Pareto front. However, the existing PHN-based approaches suffer from two performance issues: They take as inputs human-crafted preference vector or chunk embedding, rather than the input data samples, and thus vulnerable to data distribution shifts. Such approaches cannot optimize all potential solutions when forming the Pareto front, as they merely optimize the loss pertaining to one single input at a time of optimization round. To improve the quality of the Pareto front, we propose IOP, a novel Idempotent-like Optimization method to learn the entire Pareto front accurately and enhance Hypernetwork's adaptability to distribution shifts. In particular, IOP performs idempotent-like optimization by exploiting manifold space mapping, so that the target networks generated by the optimized Hypernetwork can effectively handle samples with similar distributions of the input samples, without the pre-defined human-crafted inputs. IOP maximizes the Hypervolume indicator that is composed of all potential solutions at a higher level. Experimental results demonstrate that IOP outperforms the state-of-the-art methods by 4.7% on average in producing the Pareto front and has a 10.5% improvement in adaptability.

Abstract: The codesign of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes the three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifical, the primary challenges include: (1) Memory overhead in software-side: Low-precision quantization-aware training can lead to significant memory usage due to storing large intermediate features and latent weights for backpropagation, potentially causing memory exhaustion. (2) Search time-consuming in hardware-side: The discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.

Abstract: Partial label learning (PLL) addresses situations where each training example is associated with a set of candidate labels, among which only one corresponds to the true class label. As the candidate labels often come from crowdsourced workers, their generation is inherently dependent on the features of the instance. Existing PLL methods primarily aim to resolve these ambiguous labels to enhance classification accuracy, overlooking the opportunity to use this feature dependency for causal representation learning. This focus on accuracy can make PLL systems vulnerable to stylistic variations and shifts in domain. In this paper, we explore the learning of causal representations within an instancedependent PLL framework, introducing a new approach that uncovers identifiable latent representations. By separating content from style in the identified causal representation, we introduce CausalPLL+, an algorithm for instance-dependent PLL based on causal representation. Our algorithm performs exceptionally well in terms of both classification accuracy and generalization robustness. Qualitative and quantitative experiments on instance-dependent PLL benchmarks and domain generalization tasks verify the effectiveness of our approach.

Abstract: Most of the federated learning techniques are limited to homogeneous model fusion. With the rapid growth of smart applications on resourceconstrained edge devices, it becomes a barrier to accommodate their heterogeneous computing power and memory in the real world. Federated Distillation is a promising alternative to enable aggregation from heterogeneous models. However, the effectiveness of knowledge transfer still remains elusive under the shadow of distinct representation power from heterogeneous models. In this paper, we approach from an adversarial perspective to characterize the decision boundaries during distillation. By leveraging K-step PGD attacks, we successfully model the dynamics of the closest boundary points and establish a quantitative connection between the predictive uncertainty and boundary margin. Based on these findings, we further propose a new loss function to make the distillation attend to samples close to the decision boundaries, thus learning from more informed logit distributions. The extensive experiments over CIFAR-10/100 and Tiny-ImageNet demonstrate about 0.5-3.5% improvement of accuracy under different IID and non-IID settings, with only a small increment of computational overhead.

Abstract: ClassIncremental Learning (CIL) requires an artificial intelligence system to learn different tasks without class overlaps continually. To achieve CIL, some methods introduce the Pre-Trained Model (PTM) and leverage the generalized feature representation of PTM to learn downstream incremental tasks continually. However, the generalized feature representations of PTM are not adaptive and discriminative for these various incremental classes, which may be out of distribution for the pre-trained dataset. In addition, since the incremental classes cannot be learned at once, the class relationship cannot be constructed optimally, leading to undiscriminating feature representation for understream tasks. Thus, we propose a novel Pre-Trained Model-based Class-Incremental Learning (PTM-CIL) method to explore the potential of PTM and obtain optimal class relationships. Inspired by Neural Collapse theory, we introduce the frozen Equiangular Tight Frame classifier to construct optimal classifier structure for all seen classes, guiding the feature representation adaptation for downstream continual tasks. Specifically, Task-Related Adaptation is proposed to modulate the generalized feature representation to bridge the gap between the pre-trained dataset and various downstream datasets. Then, the Feature Compression Module is introduced to compress various features to the specific classifier weights, constructing the feature transfer pattern and satisfying the characteristic of Neural Collapse. Optimal Structural Alignment is designed to supervise the feature compression process, assisting in achieving optimal class relationships across different tasks. Sufficient experiments on seven datasets prove the effectiveness of our method.

Abstract: Anomaly detection on attributed graphs has applications in various domains such as finance and email spam detection, thus gaining substantial attention. Distributed scenarios can also involve issues related to anomaly detection in attribute graphs, such as in medical scenarios. However, most of the existing anomaly detection methods are designed for centralized scenarios, and directly applying them to distributed settings may lead to reduced performance. One possible reason for this issue is that, when graph data are distributed across multiple clients, federated graph learning may struggle to fully exploit the potential of the dispersed data, leading to suboptimal performance. Building on this insight, we propose FedCLGN, a federated graph anomaly detection framework that leverages contrastive selfsupervised learning. First, we put forward an augmentation method to maintain global negative pairs on the server. This involves identifying anomalous nodes using pseudo-labels, extracting embedding representations of the negative pairs corresponding to these anomalous nodes from clients, and uploading them to the server. Then, we adopt graph diffusion to enhance the feature representation of nodes, capturing the global structure and local connection patterns. This strategy can strengthen the differentiation between positive and negative instance pairs. Finally, the effectiveness of our approach is verified by experimental results on four real graph datasets.

Abstract: In an era overwhelmed by vast amounts of data, the effective curation of webcrawl datasets is essential for optimizing model performance. This paper tackles the challenges associated with the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often inadequately capture complex features, resulting in biases and the exclusion of relevant data. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators, called EcoDatum, which employs a novel quality-guided deduplication method to balance feature distribution. EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework, utilizing automated optimization to effectively score each data point. EcoDatum, which significantly improves the data curation quality and efficiency, outperforms existing state-of-the-art (SOTA) techniques, ranking 1st on the DataComp leaderboard with an average performance score of 0.182 across 38 diverse evaluation datasets. This represents a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.

Abstract: Learning optimal policies in multiagent cooperative settings with visual observations is significant and challenging. Agents must first perform state representation learning for their image observations and then learn policies in the abstracted state space. Aiming at this problem, we propose a novel model-based MARL method named Contrastive Latent World for Policy Optimization (CLWPO). In CLWPO, we first design a state representation model to facilitate learning in the latent state space. With the support of this model, we construct the latent world and introduce a contrastive variational bound (CVB) to optimize it. Subsequently, we develop a heuristic policy optimization (HPO) scheme, incorporating model-free learning with model-based planning to obtain robust policies that predict future behaviors. In particular, in the planning, we maintain a queue of teammate models and calculate an adaptive rollout length for each agent to support their self-imagination and reduce the model-based return discrepancy. Finally, we conducted extensive experiments in the PettingZoo benchmark, and results show that CLWPO significantly enhances learning efficiency and improves agent performance compared to state-of-the-art MARL methods.

Hebei Province Key Laboratory of Big Data Calculation Hebei University of Technology, Hebei Province Key Laboratory of Big Data Calculation Hebei University of Technology, Hebei Province Key Laboratory of Big Data Calculation Hebei University of Technology, Hebei Province Key Laboratory of Big Data Calculation Hebei University of Technology, Hebei Province Key Laboratory of Big Data Calculation Hebei University of Technology, Beijing JiaoTong University, Northwestern Polytechnical University, Sun Yat-sen University

Abstract: As an essential technique for Graph Contrastive Learning (GCL), Graph Augmentation (GA) improves the generalization capability of the GCLs by introducing different forms of the same graph. To ensure information integrity, existing GA strategies have been designed to simultaneously process the two types of information available in graphs: node attributes and graph topology. Nonetheless, these strategies tend to augment the two types of graph information separately, ignoring their correlation, resulting in limited representation ability. To overcome this drawback, this paper proposes a novel GCL framework with a Joint spectrAl augMentation, named GCLJAM. Motivated the equivalence between the graph learning objective on an attribute graph and the spectral clustering objective on the attribute-interpolated graph, the node attributes are first abstracted as another type of node to harmonize the node attributes and graph topology. The newly constructed graph is then utilized to perform spectral augmentation to capture the correlation during augmentation. Theoretically, the proposed joint spectral augmentation is proved to perturb more inter-class edges and noise attributes compared to separate augmentation methods. Extensive experiments on homophily and heterophily graphs validate the effectiveness and universality of GCL-JAM.

Abstract: In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position embedding is simply added to the token embedding. A layerwise method that delivers PE to each layer and applies independent LNs for token embedding and PE has been adopted to overcome this limitation. In this paper, we identify the conflicting result that occurs in a layer-wise structure when using the global average pooling (GAP) method instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances token embedding values at each layer in a layer-wise structure. Furthermore, we recognize that the counterbalancing role of PE is insufficient in the layer-wise structure, and we address this by maximizing the effectiveness of PE through MPVG. Through experiments, we demonstrate that PE performs a counterbalancing role and that maintaining this counterbalancing directionality significantly impacts vision transformers. As a result, the experimental results show that MPVG outperforms existing methods across vision transformers on various tasks.

Abstract: Talking face generation (TFG) allows for producing lifelike talking videos of any character using only facial images and accompanying text. Abuse of this technology could pose significant risks to society, creating the urgent need for research into corresponding detection methods. However, research in this field has been hindered by the lack of public datasets. In this paper, we construct the first largescale multi-scenario talking face dataset (MSTF), which contains 22 audio and video forgery techniques, filling the gap of datasets in this field. The dataset covers 11 generation scenarios and more than 20 semantic scenarios, closer to the practical application scenario of TFG. Besides, we also propose a TFG detection framework, which leverages the analysis of both global and local coherence in the multimodal content of TFG videos. Therefore, a region-focused smoothness detection module (RSFDM) and a discrepancy capture-time frame aggregation module (DCTAM) are introduced to evaluate the global temporal coherence of TFG videos, aggregating multi-grained spatial information. Additionally, a visual-audio fusion module (V-AFM) is designed to evaluate audiovisual coherence within a localized temporal perspective. Comprehensive experiments demonstrate the reasonableness and challenges of our datasets, while also indicating the superiority of our proposed method compared to the state-of-the-art deepfake detection approaches.

Abstract: Protein structure prediction is pivotal for understanding the structurefunction relationship of proteins, advancing biological research, and facilitating pharmaceutical development and experimental design. While deep learning methods and the expanded availability of experimental 3D protein structures have accelerated structure prediction, the dynamic nature of protein structures has received limited attention. This study introduces an innovative 4D diffusion model incorporating molecular dynamics (MD) simulation data to learn dynamic protein structures. Our approach is distinguished by the following components: (1) a unified diffusion model capable of generating dynamic protein structures, including both the backbone and side chains, utilizing atomic grouping and side-chain dihedral angle predictions; (2) a reference network that enhances structural consistency by integrating the latent embeddings of the initial 3D protein structures; and (3) a motion alignment module aimed at improving temporal structural coherence across multiple time steps. To our knowledge, this is the first diffusion-based model aimed at predicting protein trajectories across multiple time steps simultaneously. Validation on benchmark datasets demonstrates that our model exhibits high accuracy in predicting dynamic 3D structures of proteins containing up to 256 amino acids over 32 time steps, effectively capturing both local flexibility in stable states and significant conformational changes.

Abstract: Automated Program Repair (APR) is a task to automatically generate patches for the buggy code. However, most research focuses on generating correct patches while ignoring the consistency between the fixed code and the original buggy code. How to conduct adaptive bug fixing and generate patches with minimal modifications have seldom been investigated. To bridge this gap, we first introduce a novel task, namely AdaPR (Adaptive Program Repair). We then propose a twostage approach AdaPatcher (Adaptive Patch Generator) to enhance program repair while maintaining the consistency. In the first stage, we utilize a Bug Locator with self-debug learning to accurately pinpoint bug locations. In the second stage, we train a Program Modifier to ensure consistency between the post-modified fixed code and the pre-modified buggy code. The Program Modifier is enhanced with a location-aware repair learning strategy to generate patches based on identified buggy lines, a hybrid training strategy for selective reference and an adaptive preference learning to prioritize fewer changes. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our two-stage framework for the newly proposed AdaPR task.

Abstract: In science and engineering, machine learning techniques are increasingly successful in physical systems modeling (predicting future states of physical systems). Effectively integrating PDE loss as a constraint of system transition can improve the model's prediction by overcoming generalization issues due to data scarcity, especially when data acquisition is costly. However, in many realworld scenarios, due to sensor limitations, the data we can obtain is often only partial observation, making the calculation of PDE loss seem to be infeasible, as the PDE loss heavily relies on high-resolution states. We carefully study this problem and propose a novel framework named Re-enable PDE Loss under Partial Observation (RPLPO). The key idea is that although enabling PDE loss to constrain system transition solely is infeasible, we can re-enable PDE loss by reconstructing the learnable high-resolution state and constraining system transition simultaneously. Specifically, RPLPO combines an encoding module for reconstructing learnable high-resolution states with a transition module for predicting future states. The two modules are jointly trained by data and PDE loss. We conduct experiments in various physical systems to demonstrate that RPLPO has significant improvement in generalization, even when observation is sparse, irregular, noisy, and PDE is inaccurate.

School of Computing and Information Technology, University of Wollongong, Australia ARC Training Centre for Innovative Composites for the Future of Sustainable Mining, University of Wollongong, Australia, Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, USA, Centre for Nutrition and Food Sciences, University of Queensland, Australia, School of Electrical and Computer Engineering, University of Sydney, Australia, Department of Business Strategy and Innovation, Griffith University, Australia, Department of Biological Sciences, Purdue University, USA, NVIDIA, USA, School of Computing and Information Technology, University of Wollongong, Australia, Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, USA

Abstract: CryoElectron Tomography (cryo-ET) is a 3D imaging technology that facilitates the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in the biological sector. However, existing methods heavily rely on manually labeled data, which requires highly professional skills, thereby hindering the adoption of fully-supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance the segmentation network performance using unlabeled data. However, applying these methods directly to cryo-ET image segmentation tasks remains challenging due to two main issues: 1) the source dataset, usually obtained through simulation, contains a fixed level of noise, while the target dataset, directly collected from raw-data from the real-world scenario, have unpredictable noise levels. 2) the source data used for training typically consists of known macromoleculars. In contrast, the target domain data are often unknown, causing the model to be biased towards those known macromolecules, leading to a domain shift problem. To address such challenges, in this work, we introduce a voxel-wise unsupervised domain adaptation approach, termed Vox-UDA, specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise generation module to simulate target-like noises in the source dataset for cross-noise level adaptation. Additionally, we propose a denoised pseudo-labeling strategy based on the improved Bilateral Filter to alleviate the domain shift problem. More importantly, we construct the first UDA cryo-ET subtomogram segmentation benchmark on three experimental datasets. Extensive experimental results on multiple benchmarks and newly curated real-world datasets demonstrate the superiority of our proposed approach compared to state-of-the-art UDA methods.

Shanghai Artificial Intelligence Laboratory Shanghai Jiaotong University, Shanghai Artificial Intelligence Laboratory Fudan University, Shanghai Artificial Intelligence Laboratory Nankai University, Shanghai Artificial Intelligence Laboratory University of Science and Technology of China, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory University of Science and Technology of China, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory Shanghai Jiaotong University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory

Abstract: Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an opensource chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks.

Abstract: Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model’s hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario -- where new classes are introduced continuously over multiple tasks -- demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems.

Abstract: Enhancing the intelligence of smart systems, such as smart homes, smart vehicles, and smart grids, critically depends on developing sophisticated planning capabilities that can anticipate the next desired function based on historical interactions. While existing methods view user behaviors as sequential data and apply models like RNNs and Transformers to predict future actions, they often fail to incorporate domain knowledge and capture personalized user preferences. In this paper, we propose a novel approach that incorporates LLMenhanced logs and personalized prompts. Our approach first constructs a graph that captures individual behavior preferences derived from their interaction histories. This graph effectively transforms into a soft continuous prompt that precedes the sequence of user behaviors. Then our approach leverages the vast general knowledge and robust reasoning capabilities of a pretrained LLM to enrich the oversimplified and incomplete log records. By enhancing these logs semantically, our approach better understands the user's actions and intentions, especially for those rare events in the dataset. We evaluate the method across four real-world datasets from both smart vehicle and smart home settings. The findings validate the effectiveness of our LLM-enhanced description and personalized prompt, shedding light on potential ways to advance the intelligence of smart space.

Abstract: Ceramic artworks with elegant patterns present enormous collectible value and profits. To claim the copyright, the builder usually pastes their conspicuous stamp on the bottom or side of the ceramic artworks, which inevitably affects the external image of the artwork. In addition, the stamp is weak in resisting forgery attacks due to its visible nature. To address the above issues, we propose in this paper a novel framework for embedding invisible watermarking into patterns of the ceramic artworks. In the framework, a templatebased watermarking embedding scheme is designed to map the watermark to an invisible template, which is added to the ceramic pattern to create its watermarked version. A distortion layer is further proposed to model the distortion of ceramic patterns in the ceramic manufacturing process, where a color-halftoning and an adaptive brightness adjustment strategy are developed to counter the print and firing operations that introduce the most significant distortions. Finally, a deep decoder is learned to extract the watermarking from the distorted pattern. Various experiments have been conducted to demonstrate the advantage of our proposed method for protecting the copyright of the ceramic artworks, which provides reliable watermark extraction accuracy without the need for a conspicuous stamp.

Abstract: Currently, most Hyperspectral (HS) pansharpening methods have two problems, namely the lack of consideration the spatial variations of HS images and inaccurate feature reconstruction in multichannel complex mapping relationships, leading to spectral and spatial distortions in the fusion results. To address these issues, we propose a dynamic network based on feature modulation and probability mask (FMPM-DNet) for HS pansharpening, including two stages of spectral-spatial feature modulation and feature reconstruction. In the first stage, to increase the feature representation ability of the model, a wave function is defined based on complex transformation to convert spatial features into wave-like features. On this basis, considering the spatial variations of HS images, a dynamic feature modulation unit (DFMU) is constructed to achieve adaptive modulation and coarse fusion of features by dynamically generating spectral-spatial correction matrix. In the second stage, a feature probability mask unit (FPMU) is designed to realize global feature embedding at different depths and local feature embedding at the same depth to obtain refined fused features. Extensive experiments on three widely used datasets demonstrate that the proposed FMPM-Net achieves significant improvements in both spatial and spectral quality metrics compared to some state-of-the-art (SOTA) methods.

Abstract: Personalized textto-image synthesis models, such as DreamBooth, have demonstrated significant potential in creating lifelike images tailored to a specific individual by fine-tuning from a limited set of face images and simple prompts. However, if misused, these model could pose a serious risk of privacy infringement by generating harmful images containing violent or pornographic content. To tackle this issue, this paper introduces MYOPIA, a method that renders facial images unlearnable by incorporating error-minimizing perturbations. These meticulously designed perturbations enables the model to quickly overfit to them, resulting in a swift reduction in loss and the cessation of model fine-tuning, effectively preventing the model from capturing genuine facial features. Moreover, to ensure the imperceptibility and robustness of the perturbations, we utilize the Just-Noticeable-Difference and Expectation-of-Transformation techniques to regulate both their location and intensity. Evaluation on two face dataset, i.e., VGGFace2 and CelebA-HQ, with various model versions illustrates the effectiveness of our approach in preserving personal privacy. Furthermore, our method showcases robust transferability across diverse model versions and demonstrates resilience against various image pre-processing techniques.

Abstract: Although Split Federated Learning (SFL) effectively enables knowledge sharing among resourceconstrained clients, it suffers from low training performance due to the neglect of data heterogeneity and catastrophic forgetting problems. To address these issues, we propose a novel SFL approach named MultiSFL, which adopts i) an effective multi-model aggregation mechanism to alleviate gradient divergence caused by heterogeneous data and ii) a novel knowledge replay strategy to deal with the catastrophic forgetting problem. MultiSFL adopts two servers (i.e., the fed server and main server) to maintain multiple branch models for local training and an aggregated master model for knowledge sharing among branch models. To mitigate catastrophic forgetting, the main server of MultiSFL selects multiple assistant devices for knowledge replay according to the training data distribution of each full branch model. Experimental results obtained from various non-IID and IID scenarios demonstrate that MultiSFL significantly outperforms conventional SFL methods by up to a 23.25% test accuracy improvement.

Abstract: Spatialtemporal data collected across different geographic locations often suffer from missing values, posing challenges to data analysis. Existing methods primarily leverage fixed spatial graphs to impute missing values, which implicitly assume that the spatial relationship is roughly the same for all features across different locations. However, they may overlook the different spatial relationships of diverse features recorded by sensors in different locations. To address this, we introduce the multi-scale Graph Structure Learning framework for spatial-temporal Imputation (GSLI) that dynamically adapts to the heterogeneous spatial correlations. Our framework encompasses node-scale graph structure learning to cater to the distinct global spatial correlations of different features, and feature-scale graph structure learning to unveil common spatial correlation across features within all stations. Integrated with prominence modeling, our framework emphasizes nodes and features with greater significance in the imputation process. Furthermore, GSLI incorporates cross-feature and cross-temporal representation learning to capture spatial-temporal dependencies. Evaluated on six real incomplete spatial-temporal datasets, GSLI showcases the improvement in data imputation and downstream applications.

Abstract: Image forgery detection and localization (IFDL) is of vital importance as forged images can spread misinformation that poses potential threats to our daily life. However, previous methods still struggled to effectively handle forged images processed with diverse forgery operations in realworld scenarios. In this paper, we propose a novel Reinforced Multi-teacher Knowledge Distillation (Re-MTKD) framework for the IFDL task, structured around an encoder-decoder ConvNeXt-UperNet along with Edge-Aware Module, named Cue-Net. First, three Cue-Net models are separately trained for the three main types of image forgeries, i.e., copy-move, splicing and inpainting, which then serve as the multi-teacher models to train the target student model with Cue-Net through self-knowledge distillation. A Reinforced Dynamic Teacher Selection (Re-DTS) strategy is developed to dynamically assign weights to the involved teacher models, which facilitates specific knowledge transfer and enables the student model to effectively learn both the common and specific natures of diverse tampering traces. Extensive experiments demonstrate that, compared with other state-of-the-art methods, the proposed method achieves superior performance on several recently emerged datasets comprised of various kinds of image forgeries.

AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, College of Intelligence and Computing, Tianjin University, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, College of Intelligence and Computing, Tianjin University, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University School of Computer Science, Research and Development Institute of Northwestern Polytechnical University in Shenzhen Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology

Abstract: Translation elongation is essential for cellular proteostasis and is implicated in cancer and neurodegeneration. Accurately predicting the rate of ribosome elongation in each codon (also called ribosomal A site) on mRNA is important for understanding and modulating protein synthesis. However, predicting elongation rates is challenging due to the tradeoff between capturing distal codon interactions and focusing on proximal codon effects at the A site. Approaches capturing distal codon interactions in the coding sequences (CDS) of mRNA fail to effectively differentiate critical regions (codons near the A site) due to insufficient effective mechanisms for focusing on these regions. Conversely, due to the limitations of models when handling long mRNA sequences, some methods simplify inputs by conditioning solely on proximal codons surrounding the A site, leading to the loss of important information from distal codons. To address this issue, we leverage Mamba's success in capturing long-range dependencies to enable the consideration of distant codons' impact on the A site. Additionally, we introduce a sliding window attention mechanism to emphasize the proximal codons around the A site during ribosome elongation. Building on these advancements, we present Sliding Window Attention Mamba (SWAMamba), a novel framework that simultaneously leverages both proximal and distal codon effects on the A site. We conduct comprehensive evaluations on ribosome data across four species and find that SWAMamba significantly outperformed current state-of-the-art methods in predicting translation elongation rates.

Abstract: Messenger RNA (mRNA)based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives a competitive performance in mRNA stability and protein production level tasks in CDS such as CodonBERT.

School of Future Technology, University of Chinese Academy of Sciences Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences Center for Long-term Artificial Intelligence, School of Artificial Intelligence, University of Chinese Academy of Sciences Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, School of Future Technology, University of Chinese Academy of Sciences Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences Center for Long-term Artificial Intelligence, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences Center for Long-term Artificial Intelligence, School of Artificial Intelligence, University of Chinese Academy of Sciences Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, School of Future Technology, University of Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences Center for Long-term Artificial Intelligence Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, CAS

Abstract: Dynamic Vision Sensors (DVS) capture event data with high temporal resolution and low power consumption, presenting a more efficient solution for visual processing in dynamic and realtime scenarios compared to conventional video capture methods. Event data augmentation serves as an essential method for overcoming the limitation of scale and diversity in event datasets. Our comparative experiments demonstrate that the two factors, spatial integrity and temporal continuity, can significantly affect the capacity of event data augmentation, which guarantee the maintenance of the sparsity and high dynamic range characteristics unique to event data. However, existing augmentation methods often neglect the preservation of spatial integrity and temporal continuity. To address this, we developed a novel event data augmentation strategy EventZoom, which employs a temporal progressive strategy, embedding transformed samples into the original samples through progressive scaling and shifting. The scaling process avoids the spatial information loss associated with cropping, while the progressive strategy prevents interruptions or abrupt changes in temporal information. We validated EventZoom across various supervised learning frameworks. The experimental results show that EventZoom consistently outperforms existing event data augmentation methods with SOTA performance. For the first time, we have concurrently employed Semi-supervised and Unsupervised learning to verify feasibility on event augmentation algorithms, demonstrating the applicability and effectiveness of EventZoom as a powerful event-based data augmentation tool in handling real-world scenes with high dynamics and variability environments.

The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Cixi Institute of Biomedical Engineering, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, School of Computer Science, University of Nottingham Malaysia, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, Cixi Institute of Biomedical Engineering, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China, School of Electrical & Electronic Engineering, Nanyang Technological University

Abstract: visual reasoning (AVR) is a critical ability of humans, and it has been widely studied, but arithmetic visual reasoning, a unique task in AVR to reason over number sense, is less studied in the literature. To facilitate this research, we construct a Machine Number Reasoning (MNR) dataset to assess the model's ability in arithmetic visual reasoning over number sense and spatial layouts. To solve the MNR tasks, we propose a Dualbranch Arithmetic Regression Reasoning (DARR) framework, which includes an Intra-Image Arithmetic Regression Reasoning (IIARR) module and a Cross-Image Arithmetic Regression Reasoning (CIARR) module. The IIARR includes a set of Intra-Image Regression Blocks to identify the correct number orders and the underlying arithmetic rules within individual images, and an Order Gate to determine the correct number order. The CIARR establishes the arithmetic relations across different images through a `3-to-1' regressor and a set of `2-to-1' regressors, with a Selection Gate to select the most suitable `2-to-1' regressor and a gated fusion to combine the two kinds of regressors. Experiments on the MNR dataset show that the DARR outperforms state-of-the-art models for arithmetic visual reasoning.

Abstract: Binary Spiking Neural Networks (BSNNs) inherit the eventdriven paradigm of SNNs, while also adopting the reduced storage burden of binarization techniques. These distinct advantages grant BSNNs lightweight and energy-efficient characteristics, rendering them ideal for deployment on resource-constrained edge devices. However, due to the binary synaptic weights and non-differentiable spike function, effectively training BSNNs remains an open question. In this paper, we conduct an in-depth analysis of the challenge for BSNN learning, namely the frequent weight sign flipping problem. To mitigate this issue, we propose an Adaptive Gradient Modulation Mechanism (AGMM), which is designed to reduce the frequency of weight sign flipping by adaptively adjusting the gradients during the learning process. The proposed AGMM can enable BSNNs to achieve faster convergence speed and higher accuracy, effectively narrowing the gap between BSNNs and their full-precision equivalents. We validate AGMM on both static and neuromorphic datasets, and results indicate that it achieves state-of-the-art results among BSNNs. This work substantially reduces storage demands and enhances SNNs' inherent energy efficiency, making them highly feasible for resource-constrained environments.

Guangdong Institute of Intelligence Science and Technology Department of Information and Communications Engineering, Tokyo Institute of Technology, School of Computer Science and Technology, Dalian University of Technology, Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, Department of Information and Communications Engineering, Tokyo Institute of Technology, Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, School of Computer Science and Technology, Dalian University of Technology, School of Computer Science and Technology, Dalian University of Technology, Guangdong Institute of Intelligence Science and Technology, Guangdong Institute of Intelligence Science and Technology Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University

Abstract: Considering the importance of capturing both global conversational topics and local speaker dependencies for multimodal emotion recognition in conversations, current approaches first utilize sequence models like Transformer to extract global context information, then apply Graph Neural Networks to model local speaker dependencies for local context information extraction, coupled with Graph Contrastive Learning (GCL) to enhance node representation learning. However, this sequential design introduces potential biases: the extracted global context information inevitably influences subsequent processing, compromising the independence and diversity of the original local features; current graph augmentation methods in GCL cannot consider both global and local context information in conversations to evaluate the node importance, hindering the learning of key information. Inspired by the human brain excels at handling complex tasks by efficiently integrating local and global information processing mechanisms, we propose an aligned globallocal context fusion framework for sequence-based design to address these problems. This design includes a dual-attention Transformer and a dual-evaluation method for graph augmentation in GCL. The dual-attention Transformer combines global attention for overall context extraction with sliding-window attention for local context capture, both enhanced by spiking neuron dynamics. The dual-evaluation method in GCL comprises global importance evaluation to identify nodes crucial for overall conversation context, and local importance evaluation to detect nodes significant for local semantics, generating augmented graph views that preserve both global and local information. This approach ensures balanced information processing throughout the pipeline, enhancing biological plausibility and achieving superior emotion recognition.

School of Electrical Engineering, Guangxi University, Nanning, China, School of Electrical Engineering, Guangxi University, Nanning, China, School of Electrical Engineering, Guangxi University, Nanning, China, School of Electrical Engineering, Guangxi University, Nanning, China, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China, Key Laboratory of Big Data and Intelligent Robot of Ministry of Education, SCUT, Guangzhou, China School of Software Engineering, South China University of Technology, Guangzhou, China, School of Electrical Engineering, Guangxi University, Nanning, China Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, China

Abstract: As a longterm challenge and fundamental requirement in vision and language tasks, visual grounding aims to localize a target referred by a natural language query. The regional annotations form a superficial correlation between the subject of expression and some common visual entities, which hinder models from comprehending the linguistic content and structure. However, current one-stage methods struggle to uniformly model the visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding to gradually comprehend the linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to obtain modulation grounding by linguistic structure. Through a soft split mechanism, our method can destructure the expression into a fixed semi-structure (i.e., subject and context) while ensuring the completeness of linguistic content. Our method is thus capable of building a semi-structured reasoning system to effectively comprehend the linguistic content and structure by content alignment and structure modulated grounding. Experimental results on five widely-used datasets validate the performance improvements of our proposed method.

Abstract: Face recognition in the presence of age and quality variations poses a formidable challenge. While recent marginbased loss functions have shown promise in addressing these variations individually, real-world scenarios such as selfie versus ID face matching often involve simultaneous variations of both age and quality. In response, we propose a comprehensive framework aimed at mitigating the impact of these variations while preserving vital identity-related information crucial for accurate face recognition. The proposed adaptive margin-based loss function AQUAFace exhibits adaptiveness towards hard samples characterized by significant age and quality variations. This loss function is meticulously designed to prioritize the preservation of identity-related features while simultaneously mitigating the adverse effects of age and quality variations on face recognition accuracy. To validate the effectiveness of our approach, we focus on the specific task of selfie versus ID document matching. Our results demonstrate that AQUAFace effectively handles age and quality differences, leading to enhanced recognition performance. Additionally, we explore the benefits of fine-tuning the recognition model with synthetic data, further boosting performance. As a result, our proposed model, AQUAFace, achieves state-of-the-art performance on six benchmark datasets (CALFW, CPLFW, CFP-FP, AgeDB, IJB-C, and TinyFace), each exhibiting diverse age and quality variations.

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology

Abstract: With the rapid development of autonomous driving, LiDARbased 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights both on modeling and augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including IMU-annotated LidarHuman26M, SLOPER4D, and manually annotated Waymo Open Dataset v2.0 (Waymo), HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by 10.0mm. Compared with PRN on SLOPER4D, we notably reduce the average MPJPE by 20.7mm.

Guilin University of Electronic Technology Chongqing College of Finance and Economics, Guilin University of Electronic Technology, Guangdong Provincial People's Hospital, Guilin University of Electronic Technology, Guilin University of Electronic Technology, Guilin University of Electronic Technology, Guilin University of Electronic Technology, Guilin University of Electronic Technology Guangdong Provincial People's Hospital Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guilin University of Electronic Technology

Abstract: Cancer is a leading cause of death worldwide due to its aggressive nature and complex variability. Accurate prognosis is therefore challenging but essential for guiding personalized treatment and followup. Previous research often relied on single data sources, missing the opportunity to combine various types of patient information for more comprehensive survival predictions. To address these challenges, we propose a two-stage fusion method named Cross-Attention and Multimodal Low-Rank Interaction Fusion Framework (CA-MLIF). In the first stage, we propose a CA mechanism for real-time feature updates and cross-modal mutual learning to capture rich semantic information. In the second stage, we design a novel multimodal low-rank interaction fusion method for survival prediction. Specifically, we present modal attention mechanism (MAM) for feature filtration, low-rank multimodal fusion (LMF) for model complexity reduction, and optimal weight concatenation (OWC) for maximizing feature integration. Extensive experiments on two public datasets TCGA-GBMLGG and TCGA-KIRC, as well as a multi-center in-house lung adenocarcinoma (LUAD) dataset validate the effectiveness of CA-MLIF, which demonstrate that our method outperforms existing approaches in survival prediction under both pathology-gene fusion and CT-pathology fusion scenarios.

School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences

Abstract: With the advancement of largescale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens, to achieve efficient computation for large multimodal models. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10\% of the original quantity.

Abstract: Enabling models to recognize vast openworld categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabularies during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as initialization of alignment classifiers to tackle the vast-vocabulary object recognition failure problem. On V3Det, this simple method greatly enhances the performance among one-stage, two-stage, and DETR-based detectors with only additional projection layers in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP respectively in the supervised setting of V3Det. For the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, which is of 2.6 and 4.3 gain over the previous methods.

Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications Beijing Key Laboratory of Network System and Network Culture Key Laboratory of Intereactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications Beijing Key Laboratory of Network System and Network Culture Key Laboratory of Intereactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications

Abstract: Recent Anomaly Detection (AD) methods have achieved great success with InDistribution (ID) data. However, real-world data often exhibits distribution shift, causing huge performance decay on traditional AD methods. From this perspective, few previous work has explored AD with distribution shift, and the distribution-invariant normality learning has been proposed based on the Reverse Distillation (RD) framework. However, we observe the misalignment issue between the teacher and the student network that causes detection failure, thereby propose FiCo, Filter or Compensate, to address the distribution shift issue in AD. FiCo firstly compensates the distribution-specific information to reduce the misalignment between the teacher and student network via the Distribution-Specific Compensation (DiSCo) module, and secondly filters all abnormal information to capture distribution-invariant normality with the Distribution-Invariant Filter (DiIFi) module. Extensive experiments on three different AD benchmarks demonstrate the effectiveness of FiCo, which outperforms all existing state-of-the-art (SOTA) methods, and even achieves better results on the ID scenario compared with RD-based methods.

Abstract: Selfsupervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.

Abstract: Hairstyle transfer is a challenging task in the image editing field that modifies the hairstyle of a given face image while preserving its other appearance and background features. The existing hairstyle transfer approaches heavily rely on StyleGAN, which is pretrained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations of head poses or focal lengths. To address this issue, we propose a one-stage hairstyle transfer diffusion model, HairFusion, that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the input of the model, where the original hair information is thoroughly eliminated. Next, we introduce a hair align cross-attention (Align-CA) to accurately align the reference hairstyle with the face image while considering the difference in their head poses. To enhance the preservation of the face image’s original features, we leverage adaptive hair blending during the inference, where the output’s hair regions are estimated by the cross-attention map in Align-CA and blended with non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to the existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features.

Center for Advanced Systems Understanding (CASUS), Görlitz, Germany Helmholtz-Zentrum Dresden-Rossendorf e. V. (HZDR), Dresden, Germany School of Computation, Information and Technology, Technical University of Munich, Germany, Department of Computing, Imperial College London, London, United Kingdom, UCL Centre for Kidney and Bladder Health, Division of Medicine, University College London, Royal Free Hospital Campus, London, United Kingdom, Department of Computing, Imperial College London, London, United Kingdom, Institute of Computer Science, University of Wrocław, Wrocław, Poland Center for Advanced Systems Understanding (CASUS), Görlitz, Germany Helmholtz-Zentrum Dresden-Rossendorf e. V. (HZDR), Dresden, Germany

Abstract: Phase imaging is gaining importance due to its applications in fields like biomedical imaging and material characterization. In biomedical applications, it can provide quantitative information missing in labelfree microscopy modalities. One of the most prominent methods in phase quantification is the Transport-of-Intensity Equation (TIE). TIE often requires multiple acquisitions at different defocus distances, which is not always feasible in a clinical setting due to hardware constraints. To address this issue, we propose the use of chromatic aberrations to induce the required through-focus images with a single exposure, effectively generating a through-focus stack. Since the defocus distance induced by the aberrations is small, conventional TIE solvers are insufficient to address the resulting artifacts. We propose Zero-Mean Diffusion, a modified version of diffusion models designed for quantitative image prediction, and train it with synthetic data to ensure robust phase retrieval. Our contributions offer an alternative TIE approach that leverages chromatic aberrations, achieving accurate single-exposure phase measurement with white light and thus improving the efficiency of phase imaging. Additionally, we present a new class of diffusion models that are well-suited for quantitative data and have a sound theoretical basis. To validate our approach, we employ a widespread brightfield microscope equipped with a commercially available color camera. We apply our model to clinical microscopy of patients' urine, obtaining accurate phase measurements.

Abstract: Zeroshot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos for the training of the localization model without any manual annotations. Existing approaches typically produce pseudo queries as simple words, which overlook the complexity of queries in real-world scenarios. Considering the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a potential solution. However, directly integrating LLMs into existing approaches introduces several issues, including insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs to enhance zero-shot NLVL performance. To address these issues, we propose BTDP, an innovative framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation that identifies both visual boundaries and semantic boundaries to generate the atomic segments and activity descriptions, tackling the issue of insensitivity. 2) Context Aggregation that employs the LLMs with a self-evaluation process to aggregate and summarize global video information for optimized pseudo moment-query pairs, tackling the issue of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method.

Abstract: Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentricview images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected under the real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products.

School of Electronic and Computer Engineering, Peking University, China, School of Electronic and Computer Engineering, Peking University, China Peng Cheng Laboratory, China, School of Electronic and Computer Engineering, Peking University, China Peng Cheng Laboratory, China, School of Electronic and Computer Engineering, Peking University, China, School of Electronic and Computer Engineering, Peking University, China, School of Electronic and Computer Engineering, Peking University, China Peng Cheng Laboratory, China, School of Electronic and Computer Engineering, Peking University, China Peng Cheng Laboratory, China

Abstract: Compared to framebased methods, computational neuromorphic imaging using event cameras offers significant advantages, such as minimal motion blur, enhanced temporal resolution, and high dynamic range. The multi-view consistency of Neural Radiance Fields combined with the unique benefits of event cameras, has spurred recent research into reconstructing NeRF from data captured by moving event cameras. While showing impressive performance, existing methods rely on ideal conditions with the availability of uniform and high-quality event sequences and accurate camera poses, and mainly focus on the object level reconstruction, thus limiting their practical applications. In this work, we propose AE-NeRF to address the challenges of learning event-based NeRF from non-ideal conditions, including non-uniform event sequences, noisy poses, and various scales of scenes. Our method exploits the density of event streams and jointly learn a pose correction module with an event-based NeRF (e-NeRF) framework for robust 3D reconstruction from inaccurate camera poses. To generalize to larger scenes, we propose hierarchical event distillation with a proposal e-NeRF network and a vanilla e-NeRF network to resample and refine the reconstruction process. We further propose an event reconstruction loss and a temporal loss to improve the view consistency of the reconstructed scene. We established a comprehensive benchmark that includes large-scale scenes to simulate practical non-ideal conditions, incorporating both synthetic and challenging real-world event datasets. The experimental results show that our method achieves a new state-of-the-art in event-based 3D reconstruction.

Abstract: Imagelevel weakly supervised semantic segmentation (WSSS) reduces the dependence on high-quality data annotation, which plays a crucial role in computational pathology. Benefit from the ability to localize the objects with only binary labels, Class Activation Map (CAM) is a widely used method to initial pseudo masks. However, due to the low contrast among different tissues in histopathological images, most existing CAM-based methods perform poorly in gland segmentation. We retrospect this process and find that class consistency and semantic consistency can guide the network to effectively distinguish confusing pixels and generate fine-grained pseudo masks. Specifically, for class consistency, we propose Consistency Correlation Attention (CCA) to encourage the network to focus on the contribution of class features to semantic dependencies. For semantic consistency, we propose Multi-scale Pyramid Fusion Pooling (MPFP) to aggregate coarse-to-fine global semantic information from CAMs at multiple spatial resolutions, thus identifying class localization. Additionally, we introduce a Purified Labels Filtration (PLF) strategy during the segmentation phase to mitigate the noisy supervision signal and improve the segmentation quality of the model. Extensive experiments show that the our method achieves new state-of-the-art results on three publicly available gland datasets. Furthermore, our method demonstrates impressive domain adaptation capability, achieving satisfactory results with only a small portion of samples when faced with unseen domain data.

Abstract: Visual text generation, which aims to generate photorealistic images with coherent and well-formed scene text being rendered, has attracted widespread attention. Although recent works have achieved promising performance, the limited flexibility and controllability hinder their practical applications. We observe that different from natural objects, visual text in real scenes often has an arbitrarily shaped structure with different granularities (i.e., character, word, or line). In this paper, we consider the modality gap between image and text, and propose a new separation and composition pipeline for flexible and controllable visual text generation from only text prompts. At the core of our framework is a novel Hierarchical and Directional Layout representation, i.e., HDLayout, which can model the sequential and multi-granularity nature of the visual text. Under this formulation, we are able to generate arbitrarily shaped visual text automatically. Extensive experiments demonstrate that our method outperforms several strong baselines in a variety of scenarios both qualitatively and quantitatively, yielding state-of-the-art performances on arbitrarily shaped visual text generation.

Abstract: Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mambabased vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba—a specialized model designed for EBVR task. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba’s robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.

School of Computer Science and Engineering, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, China Peng Cheng Laboratory, Shenzhen, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

Abstract: The generation of a virtual digital avatar is a crucial research topic in the field of computer vision. Many existing works utilize Neural Radiance Fields (NeRF) to address this issue and have achieved impressive results. However, previous works assume the images of the training person are available and fixed while the appearances and poses of a subject could constantly change and increase in realworld scenarios. How to update the human avatar but also maintain the ability to render the old appearance of the person is a practical challenge. One trivial solution is to combine the existing virtual avatar models based on NeRF with continual learning methods. However, there are some critical issues in this approach: learning new appearances and poses can cause the model to forget past information, which in turn leads to a degradation in the rendering quality of past appearances, especially color bleeding issues, and incorrect human body poses. In this work, we propose a maintainable avatar (MaintaAvatar) based on neural radiance fields by continual learning, which resolves the issues by utilizing a Global-Local Joint Storage Module and a Pose Distillation Module. Overall, our model requires only limited data collection to quickly fine-tune the model while avoiding catastrophic forgetting, thus achieving a maintainable virtual avatar. The experimental results validate the effectiveness of our MaintaAvatar model.

Abstract: Millimeterwave radar plays a vital role in 3D object detection for autonomous driving due to its all-weather and all-lighting-condition capabilities for perception. However, radar point clouds suffer from pronounced sparsity and unavoidable angle estimation errors. To address these limitations, incorporating a camera may partially help mitigate the shortcomings. Nevertheless, the direct fusion of radar and camera data can lead to negative or even opposite effects due to the lack of depth information in images and low-quality image features under adverse lighting conditions. Hence, in this paper, we present the radar-camera fusion network with Hybrid Generation and Synchronization (HGSFusion), designed to better fuse radar potentials and image features for 3D object detection. Specifically, we propose the Radar Hybrid Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA) estimation errors in radar signal processing. This module generates denser radar points through different Probability Density Functions (PDFs) with the assistance of semantic information. Meanwhile, we introduce the Dual Sync Module (DSM), comprising spatial sync and modality sync, to enhance image features with radar positional information and facilitate the fusion of distinct characteristics in different modalities. Extensive experiments demonstrate the effectiveness of our approach, outperforming the state-of-the-art methods in the VoD and TJ4DRadSet datasets by 6.53% and 2.03% in RoI AP and BEV AP, respectively.

Abstract: Immunohistochemistry (IHC) examination is essential for characterizing tumor subtypes, providing prognostic information, and developing personalized treatment plans. However, IHC staining preparation is more complex and expensive compared to Hematoxylin and Eosin (H&E) staining, limiting its widespread clinical application. Transforming H&E images into IHC images presents a promising solution. In this paper, we propose OTStainNet, a novel virtual IHC staining method. OT-StainNet employs a pre-trained diffusion model with richer prior knowledge as the generator and fine-tunes it with LoRA adapters through adversarial training. Given that adjacent images of the same tissue stained with H&E and IHC are not precisely aligned at the pixel level, existing methods struggle to fully utilize the supervisory information from weakly paired IHC images. To address this issue, we propose an optimal transport-driven semantic matching (OTSM) mechanism, establishing accurate semantic correspondences between H&E-IHC image pairs. By leveraging the real IHC features obtained through the OTSM mechanism, we design a semantic consistency constraint (SCC) to ensure that the correlations among virtual IHC features remain consistent with those among real IHC features, thereby preserving valuable correlation information during stain transfer. We validate OT-StainNet using four types of IHC staining across two datasets. Extensive experiments demonstrate the effectiveness of our method compared to state-of-the-art approaches.

School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China, Tencent PCG, Tencent PCG, Shandong University, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China The Shenzhen Future Network of Intelligence Institute, CUHK-Shenzhen, 518172, P.R. China The Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK-Shenzhen, 518172, P.R. China, Harbin Institute of Technology, Tencent PCG, Tencent PCG, Tencent PCG

Abstract: Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries, significantly impacting downstream tasks like video browsing and editing. Unlike traditional taskspecific models, Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. Consequently, exploring the application of video LLMs for VTG tasks has become a burgeoning research area. However, despite considerable advancements in video content understanding, video LLMs often struggle to accurately pinpoint timestamps within videos, limiting their effectiveness in VTG tasks. To address this, we introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities. Our approach includes: (1) effectively integrating timestamp knowledge into visual tokens; (2) incorporating absolute-time tokens to manage timestamp knowledge without concept shifts; and (3) introducing a lightweight, high-performance, slot-based token compression technique designed to accommodate the demands of a large number of frames to be sampled for VTG tasks. Additionally, we present VTG-IT-120K, a collection of publicly available VTG datasets that we have re-annotated to improve upon low-quality annotations. Our comprehensive experiments demonstrate the superior performance of VTG-LLM in comparison to other video LLM methods across a variety of VTG tasks.

Abstract: 3D Human Pose Estimation (HPE) is a oneto-many problem by nature, making it challenging to estimate an accurate 3D pose from a single 2D pose. Some prior works have attempted to tackle this problem by using a conditional generative network. They generate 3D poses from a given 2D pose with noises from a standard Gaussian distribution, while the depth distribution is dependent on each posture and more complex than the standard Gaussian distribution. This may lead to inaccurate distribution learning. In this paper, we propose a probabilistic framework called ProPose to address this issue. ProPose employs Pose Instance-Level Gaussian Distribution (PILGD) derived from 3D pose-based self-representation learning to obtain reliable distribution which is able to address pose-dependent depth distribution. To access this PILGD, we utilize normalizing flow, which learns a mapping function between the PILGD and a 2D Pose-Adaptive Gaussian Distribution (PAGD). This converts the problem of directly estimating 3D poses from 2D poses to a mapping problem between PILGD and PAGD using a normalizing flow. Extensive experiments show the advantages of utilizing the PILGD and PAGD. ProPose achieves comparable performances to previous state-of-the-art probabilistic methods in a multi-hypothesis setting. Notably, ProPose in a single-hypothesis setting demonstrates comparable generalization ability to existing state-of-the-art deterministic methods.

Abstract: There are two crucial aspects of reliable autonomous driving systems: the reasoning behind decisionmaking and the precision of environmental perception. This paper introduces DME-Driver, a new autonomous driving system that enhances performance and robustness by fully leveraging the two crucial aspects. This system comprises two main models. The first, the Decision Maker, is responsible for providing logical driving instructions. The second, the Executor, receives these instructions and generates precise control signals for the vehicles. To ensure explainable and reliable driving decisions, we build the Decision-Maker based on a large vision language model. This model follows the logic employed by experienced human drivers and simulates making decisions in a safe and reasonable manner. On the other hand, the generation of accurate control signals relies on precise and detailed environmental perception, where 3D scene perception models excel. Therefore, a planning-oriented perception model is employed as the Executor. It translates the logical decisions made by the Decision-Maker into accurate control signals for the self-driving cars. To effectively train the proposed system, a new dataset named Human-driver Behavior and Decision-making (HBD) dataset has been collected. This dataset encompasses a diverse range of human driver behaviors and their underlying motivations. By leveraging this dataset, our system achieves high-precision planning accuracy through a logical thinking process.

Abstract: Painterly image harmonization aims at seamlessly blending disparate visual elements within a single image. However, previous approaches often struggle due to limitations in training data or reliance on additional prompts, leading to inharmonious and contentdisrupted output. To surmount these hurdles, we design a Training-and-prompt-Free General Painterly Harmonization method (TF-GPH). TF-GPH incorporates a novel “Similarity Disentangle Mask”, which disentangles the foreground content and background image by redirecting their attention to corresponding reference images, enhancing the attention mechanism for multi-image inputs. Additionally, we propose a “Similarity Reweighting” mechanism to balance harmonization between stylization and content preservation. This mechanism minimizes content disruption by prioritizing the content-similar features within the given background style reference. Finally, we address the deficiencies in existing benchmarks by proposing novel range-based evaluation metrics and a new benchmark to better reflect real-world applications. Extensive experiments demonstrate the efficacy of our method across benchmarks.

Abstract: Deep hashing models have achieved great success in retrieval tasks due to their powerful representation and strong information compression capabilities. However, they inherit the vulnerability of deep neural networks to adversarial perturbations. Attackers can severely impact the retrieval capability of hashing models by adding subtle, carefully crafted adversarial perturbations to benign images, transforming them into adversarial images. Most existing adversarial attacks target image classification models, with few focusing on retrieval models. We propose HUANG, the first targeted adversarial attack algorithm to leverage a diffusion model for hashing retrieval in blackbox scenarios. In our approach, adversarial denoising uses adversarial perturbations and residual image to guide the shift from benign to adversarial distribution. Extensive experiments demonstrate the superiority of HUANG across different datasets, achieving state-of-the-art performance in black-box targeted attacks. Additionally, the dynamic interplay between denoising and adding adversarial perturbations in adversarial denoising endows HUANG with exceptional robustness and transferability.

Abstract: FineGrained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose an effective approach to narrow the gap between the two domains. It mainly facilitates unified mutual information sharing both intra- and inter-samples, rather than treating them as a single feature alignment problem between modalities. Specifically, our approach includes: (i) Employing dual weight-sharing networks to optimize alignment within the sketch and image domain, which also effectively mitigates model learning saturation issues. (ii) Introducing an objective optimization function based on contrastive loss to enhance the model's ability to align features in both intra- and inter-samples. (iii) Presenting a self-supervised Multi-Scale Token Recycling (MSTR) Module featured by recycling discarded patch tokens in multi-scale features, further enhancing representation capability and retrieval performance. Our framework achieves excellent results on CNN- and ViT-based backbones. Extensive experiments demonstrate its superiority over existing methods. We also introduce Cloths-V1, the first professional fashion sketch-image dataset, utilized to validate our method and will be beneficial for other applications.

Abstract: Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify object position, size, and composition, etc., while preserving the consistency of objects and background without changing their texture and attributes. Current inferencetime methods often rely on DDIM inversion, which inherently compromises efficiency and the achievable consistency of edited images. Recent methods also utilize energy guidance which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for achieving consistent object editing via Pixel Manipulation and generation, where we directly create a duplicate copy of the source object at target location in the pixel space, and introduce an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location, while ensuring image consistency by anchoring the edited image to be generated to the pixel-manipulated image as well as by introducing various consistency-preserving optimization techniques during inference. Experimental evaluations based on benchmark datasets as well as extensive visual comparisons show that in as few as 16 inference steps, PixelMan outperforms a range of state-of-the-art training-based and training-free methods (usually requiring 50 steps) on multiple consistent object editing tasks.

Abstract: Current methods commonly utilize threebranch structures of inversion, reconstruction, and editing, to tackle consistent image editing task. However, these methods lack control over the generation position of the edited object and have issues with background preservation. To overcome these limitations, we propose a tuning-free method with only two branches: inversion and editing. This approach allows users to simultaneously edit the object's action and control the generation position of the edited object. Additionally, it achieves improved background preservation. Specifically, we transfer the edited object information to the target area and repair or preserve the background of other areas during the inversion process at a specific time step. In the editing stage, we use the image features in self-attention to query the key and value of the corresponding time step in the inversion to achieve consistent image editing. Impressive image editing results and quantitative evaluation demonstrate the effectiveness of our method.

Abstract: In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels via a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales using a dual branch structure. This design combines the BEV representation, which offers a large receptive field, with the voxel representation, known for its higher spatial resolution, thereby improving both performance and computational efficiency. The PQD employs two types of prototypebased queries to expedite the Transformer decoding process. Scene-Adaptive Prototypes are generated from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are updated during training using an Exponential Moving Average of the Scene-Adaptive Prototypes. Using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which introduces noise into the prototype generation process and trains the model to denoise during the training phase. This approach enhances the robustness of ProtoOcc against degraded prototype feature quality. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. For the single-frame method, it reaches 39.56% mIoU with 12.83 FPS on an NVIDIA RTX 3090.

Abstract: While 3D head reconstruction is widely used for modeling, existing neural reconstruction approaches rely on highresolution multi-view images, posing notable privacy issues. Individuals are particularly sensitive to facial features, and facial image leakage can lead to many malicious activities, such as unauthorized tracking and deepfake. In contrast, geometric data is less susceptible to misuse due to its complex processing requirements, and absence of facial texture features. In this paper, we propose a novel two-stage 3D facial reconstruction method aimed at avoiding exposure to sensitive facial information while preserving detailed geometric accuracy. Our approach first uses non-sensitive rear-head images for initial geometry and then refines this geometry using processed privacy-removed gradient images. Extensive experiments show that the resulting geometry is comparable to methods using full images, while the process is resistant to DeepFake applications and facial recognition (FR) systems, thereby proving its effectiveness in privacy protection.

Abstract: Medical image segmentation often faces the dual challenges of limited annotations and domain shifts, further complicated by degraded images in practical scenarios. Traditional methods tend to underperform when these issues occur simultaneously, as they are typically designed for specific tasks. To address this, we propose a unified framework that effectively handles limited annotations and domain shifts while also managing both clean and degraded images during inference. Overcoming these challenges requires focusing on three critical aspects: First, the model must be robust to various noise conditions. Second, it should excel at capturing domaininvariant features. Third, it should effectively utilize unlabeled data. We propose three major components in our approach to tackle these challenges. First, the Wavelet-based Cross-Component Exchange (WCCE) swaps high-frequency wavelet components between labeled and unlabeled images to enhance robustness. Second, we employ a diffusion VNet architecture with a reweighting mechanism to capture domain-invariant features. Finally, we utilize Cross-Decoder Pseudo (CDP) training to effectively leverage unlabeled data. Evaluations on three publicly available medical datasets and across four types of degraded image scenarios demonstrate that our method outperforms state-of-the-art (SOTA) techniques, consistently delivering superior performance across varying image qualities. Our approach not only addresses annotation scarcity and domain shift but also effectively manages noisy and blurred conditions, setting a new benchmark in medical image segmentation.

Research Institute of Trustworthy Adaptive Systems, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Research Institute of Trustworthy Adaptive Systems, Southern University of Science and Technology, Research Institute of Trustworthy Adaptive Systems, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Research Institute of Trustworthy Adaptive Systems, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Beijing Information Science & Technology University, Institute of High Performance Computing, Agency for Science, Technology and Research, Research Institute of Trustworthy Adaptive Systems, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology

Abstract: Decoupling domainvariant information (DVI) from domain-invariant information (DII) serves as a prominent strategy for mitigating domain shifts in the practical implementation of deep learning algorithms. However, in medical settings, concerns surrounding data collection and privacy often restrict access to both training and test data, hindering the empirical decoupling of information by existing methods. To tackle this issue, we propose an Adaptive Information Filter-driven Source-free Domain Adaptation (AIF-SFDA) algorithm, which leverages a frequency-based learnable information filter to autonomously decouple DVI and DII. Information Bottleneck (IB) and Self-supervision (SS) are incorporated to optimize the learnable frequency filter. The IB governs the information flow within the filter to diminish redundant DVI, while SS preserves DII in alignment with the specific task and image modality. Thus, the adaptive information filter can overcome domain shifts relying solely on target data. A series of experiments covering various medical image modalities and segmentation tasks were conducted to demonstrate the benefits of AIF-SFDA through comparisons with leading algorithms and ablation studies.

Abstract: If unaligned multimodal medical images can be simultaneously aligned and fused using a singlestage approach within a unified processing framework, it will not only achieve mutual promotion of dual tasks but also help reduce the complexity of the model. However, the design of this model faces the challenge of incompatible requirements for feature fusion and alignment. To address this challenge, this paper proposes an unaligned medical image fusion method called Bidirectional Stepwise Feature Alignment and Fusion (BSFA-F) strategy. To reduce the negative impact of modality differences on cross-modal feature matching, we incorporate the Modal Discrepancy-Free Feature Representation (MDF-FR) method into BSFA-F. MDF-FR utilizes a Modality Feature Representation Head (MFRH) to integrate the global information of the input image. By injecting the information contained in MFRH of the current image into other modality images, it effectively reduces the impact of modality differences on feature alignment while preserving the complementary information carried by different images. In terms of feature alignment, BSFA-F employs a bidirectional stepwise alignment deformation field prediction strategy based on the path independence of vector displacement between two points. This strategy solves the problem of large spans and inaccurate deformation field prediction in single-step alignment. Finally, Multi-Modal Feature Fusion block achieves the fusion of aligned features. The experimental results across multiple datasets demonstrate the effectiveness of our method.

Abstract: Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and timeconsuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pairs sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes the preservation of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap from images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID.

Abstract: As digital media manipulation becomes increasingly sophisticated, accurately detecting and localizing image forgeries with minimal supervision has become a critical challenge. Existing weakly supervised image forgery detection (WIFD) methods often rely on convolutional neural networks (CNNs) and limited exploration of internal relationships, leading to poor detection and localization performance with only image-level labels. To address these limitations, we introduce a novel Multi-View and Multi-Level Relation Learning Network (M²RL-Net) for W-IFD. M²RL-Net effectively identifies forged images using only image-level annotations by exploring relationships between different views and hierarchical levels within images. Specifically, M²RL-Net achieves patch-level self-consistency learning (PSL) and feature-level contrastive learning (FCL) across different views, facilitating more generalized self-supervised learning of forgery features. In detail, PSL employs self-supervised learning to distinguish consistent and inconsistent regions within images, enhancing its ability to accurately locate tampered areas. FCL utilizes feature-level self-view and multi-view contrastive learning to differentiate between genuine and tampered image features, thereby improving the recognition of authentic and manipulated content across different views. Extensive experiments on various datasets demonstrate that M²RL-Net outperforms existing weakly-supervised methods in both detection and localization accuracy. This research sets a new benchmark for weakly-supervised image forgery detection and lays a robust foundation for future studies in this field.

Abstract: Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance by integrating unique information from different modalities into one fused image. Exiting methods pay more attention to conducting fusion with undisturbed data, while overlooking the impact of deliberate interference on the effectiveness of fusion results. To investigate the robustness of fusion models, in this paper, we propose a novel adversarial attack resilient network, called A2RNet. Specifically, we develop an adversarial paradigm with an antiattack loss function to implement adversarial attacks and training. It is constructed based on the intrinsic nature of IVIF and provide a robust foundation for future research advancements. We adopt a Unet as the pipeline with a transformer-based defensive refinement module (DRM) under this paradigm, which guarantees fused image quality in a robust coarse-to-fine manner. Compared to previous works, our method mitigates the adverse effects of adversarial perturbations, consistently maintaining high-fidelity fusion results. Furthermore, the performance of downstream tasks can also be well maintained under adversarial attacks.

School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China, Pengcheng Laboratory, Shenzhen, China Sun Yat-Sen University, Guangzhou, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China, Department of Automation and BNRist, Tsinghua University, Beijing, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable cognitive capabilities in various crossmodal tasks.However, existing MLLMs struggle with tasks that require physical digital cognition, such as accurately reading an electric meter or pressure gauge. This limitation significantly reduces their effectiveness in practical applications like industrial monitoring and home energy management, where digital sensors are not feasible. For humans, physical digits are artificially defined quantities presented on specific carriers, which require training to recognize. As existing MLLMs are only pre-trained in the manner of object recognition, they fail to comprehend the relationship between digital carriers and their reading. To this end, referring to human behavior, we propose a novel DigitalLLaVA method to explicitly inject digital cognitive abilities into MLLMs in a two-step manner. In the first step, to improve the MLLM's understanding of physical digit carriers, we propose a digit carrier mapping method. This step utilizes object-level text-image pairs to enhance the model's comprehension of objects containing physical digits. For the second step, unlike previous methods that rely on sequential digital prediction or digit regression, we propose a 32 bit floating point simulation approach that treats digit prediction as a whole. Using digit-level text-image pairs, we train three float heads to predict 32-bit floating-point numbers using 0/1 binary classification. This step significantly reduces the search space, making the prediction process more robust and straightforward. Being simple but effective, our method can identify very precise metrics (i.e., accurate to ±0.001) and provide floating-point results, showing its applicability in digital carrier domains.

Abstract: Contrastive LanguageImage Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism, that introduces an Image-Guided-Text (IGT) component to rectify varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately ×100 less time cost.

Abstract: The aim of textbased person retrieval is to identify pedestrians using natural language descriptions within a large-scale image gallery. Traditional methods rely heavily on manually annotated image-text pairs, which are resource-intensive to obtain. With the emergence of Large Vision-Language Models (LVLMs), the advanced capabilities of contemporary models in image understanding have led to the generation of highly accurate captions. Therefore, this paper explores the potential of employing Large Vision-Language Models for unsupervised text-based pedestrian image retrieval and proposes a Multi-grained Uncertainty Modeling and Alignment framework (MUMA). Initially, multiple Large Vision-Language Models are employed to generate diverse and hierarchically structured pedestrian descriptions across different styles and granularities. However, the generated captions inevitably introduce noise. To address this issue, an uncertainty-guided sample filtration module is proposed to estimate and filter out unreliable image-text pairs. Additionally, to simulate the diversity of styles and granularities in captions, a multi-grained uncertainty modeling approach is applied to model the distributions of captions, with each caption represented as a multivariate Gaussian distribution. Finally, a multi-level consistency distillation loss is employed to integrate and align the multi-grained captions, aiming to transfer knowledge across different granularities. Experimental evaluations conducted on three widely-used datasets demonstrate the significant advancements achieved by our approach.

Abstract: Despite the significant impact of visual events on human cognition, understanding events in videos remains a challenging task for AI due to their complex structures, semantic hierarchies, and dynamic evolution. To address this, we propose the task of video event understanding that extracts event scripts and makes predictions with these scripts from videos. To support this task, we introduce VidEvent, a largescale dataset containing over 23,000 well-labeled events, featuring detailed event structures, broad hierarchies, and logical relations extracted from movie recap videos. The dataset was created through a meticulous annotation process, ensuring high-quality and reliable event data. We also provide comprehensive baseline models offering detailed descriptions of their architecture and performance metrics. These models serve as benchmarks for future research, facilitating comparisons and improvements. Our analysis of VidEvent and the baseline models highlights the dataset's potential to advance video event understanding and encourages the exploration of innovative algorithms and models.

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China Shenzhen Audencia Financial Technology Institute, Shenzhen University, Shenzhen, China, Department of Intelligent Manufacturing, CATL, Ningde, China, School of Computing and Artificial Intelligence, Fuyao University of Science and Technology, Fuzhou, China, School of Software, Northwestern Polytechnical University, Xi’an, China, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen, China

Abstract: 3D anomaly detection has recently become a significant focus in computer vision. Several advanced methods have achieved satisfying anomaly detection performance. However, they typically concentrate on the external structure of 3D samples and struggle to leverage the internal information embedded within samples. Inspired by the basic intuition of why not look inside for more, we observed this prototype is straightforward and effective. As a result, we introduce a newly designed mode named Internal Spatial Modality Perception (ISMP) to explore the feature representation from internal views fully. Specifically, our proposed ISMP consists of a critical perception module, Spatial Insight Engine (SIE), which abstracts complex internal information of point clouds into essential global features. Besides, to better align structural information with point data, we propose an enhanced key point feature extraction method for amplifying spatial structure feature representation. Simultaneously, a novel feature filtering module is incorporated to reduce noise and redundant features for further precise spatial structure aligning. Extensive experiments validate the efficiency of our proposed method, achieving objectlevel and pixel-level AUROC improvements of 4.2% and 13.1%, respectively, on the Real3D-AD benchmarks. Note that the strong generalization ability of SIE has been theoretically proven and verified in both classification and segmentation tasks. Our code will be released upon acceptance.

Abstract: Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plugand-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for VTW, we initially analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.

Abstract: Acquiring pairwise noisyclean training data is challenging. Consequently, some self-supervised denoising methods utilize noisy image pairs as both input and target for network training. However, a major issue with these methods is the gap between the clean images of the input and target. In this paper, we achieve high-quality image denoising by reducing or even eliminating this gap. Our method requires no training data or prior knowledge of the noise distribution. It consists of two lightweight networks that can be trained using only a single noisy test image. Specifically, we propose a random mask-based downsampler that generates multiple pairs of downsampled noisy images, which are similar but distinct. These image pairs serve as the input for the first network, with the mean image of each pair used as the target. This initially reduces the gap between the clean images of the input and target. Particularly, in our method, the clean counterpart of the first network's target (i.e., the mean image) can be obtained. We then train a second network using the mean image as input and its clean counterpart as the target. This effectively eliminates the gap and achieves better denoising results. Extensive experiments demonstrate that our method outperforms in both denoising performance and efficiency.

Abstract: Multilabel few-shot image classification is a crucial and challenging task due to limited annotated data and elusive category specificity. However, research on this topic is still in the rudimentary stage and few methods are available. Existing methods either leverage data augmentation to alleviate data scarcity or utilize label features as auxiliary knowledge to eliminate the negative effect caused by irrelevant categories, but they ignore the utilization of image region features for data augmentation, and overlook to learn appropriate text feature to better match the image features of specific categories. Moreover, these methods only focus on one side and do not effectively tackle the above two issues simultaneously. In this paper, we introduce a novel prototype-based multi-label few-shot learning framework that seamlessly integrates pairwise feature augmentation and flexible prompt learning. Specifically, by pairwise feature augmentation, we leverage the region features of images in the support set to generate more image features and construct image prototypes, thus alleviating the issue of data scarcity. By flexible prompt learning, we adaptively acquire class-specific prompts to build text prototypes that highly match the image features of specific classes, thereby mitigating the impact of irrelevant classes. Finally, with adaptive learnable parameters, we merge image and text prototypes to obtain the final prototypes, achieving a more powerful classifier for multi-label few-shot image classification. Extensive experimental results demonstrate that our proposed method can push the performance to a higher level.

Abstract: Reconstructing 3D models from sensor data is a valuable and promising direction for developing testing and validation environments in applications like autonomous driving. However, existing methods for 3D modeling often rely on extensive multiview data or controlled conditions, making them difficult and expensive to scale. Furthermore, these methods, particularly those based on neural radiance fields, typically produce implicit models that can be challenging to manipulate and suffer from slow rendering speeds. In this paper, we introduce ProtoCar, a novel approach that overcomes these limitations by learning 3D vehicle prototypes from single-view images with diverse and unconstrained visual conditions. ProtoCar uses real-world driving data from LiDAR and image sensors, and employs 3D Gaussian splatting techniques to represent explicit geometric and texture. Extensive experiments demonstrate that ProtoCar generates high-quality 3D models and adapts well to various vehicle types and challenging visual scenarios, offering a scalable and effective solution for 3D modeling in environments with limited and variable visual information.

Abstract: The domain gap resulting from mismatches in acquisition details like protocol and scanner between training and test data hinders the deployment of the trained model in clinical practice. To address this issue, Continual testtime adaptation (CTTA) has been proposed to adapt the source model to continually changing unlabeled domains without accessing the source data. Existing methods learn an image-level visual prompt for target domains and inject the trainable prompt into the input space. However, they either combine the input with a prompt of equal scale or determine the prompt injection position through complex strategies such as uncertainty estimation or Fourier Transform. These approaches substantially increase the number of trainable parameters and computational burden, especially in high-dimensional medical imaging data. To overcome these challenges, we propose the Efficient Deformable Convolutional Prompt (EDCP), which leverages the inductive bias of convolution to reduce trainable parameters compared to standard prompts. We further enhance convolution by making it deformable, addressing fine-grained domain shifts at the pixel level through an offset branch. To improve training efficiency and balance parameters between the convolution and offset branches, we decompose the offset transformation into two parts, storing one in an offset bank that also serves as a domain indicator. This bank accelerates training by skipping test images similar to those already stored. Prompt updates are guided by layer-wise alignment of source-target statistics without unfreezing batch normalization layers. Extensive experiments demonstrate the superiority of our method in 2D and 3D medical image segmentation tasks.

National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center of Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center of Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Institute of Automation, Chinese Academy of Science Wuhan Al Research, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center of Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center of Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

Abstract: Deciphering visual content from fMRI sheds light on the human vision system, but data scarcity and noise limit brain decoding model performance. Traditional approaches rely on subjectspecific models, which are sensitive to training sample size. In this paper, we address data scarcity by proposing shallow subject-specific adapters to map cross-subject fMRI data into unified representations. A shared deep decoding model then decodes these features into the target feature space. We use both visual and textual supervision for multi-modal brain decoding and integrate high-level perception decoding with pixel-wise reconstruction guided by high-level perceptions. Our extensive experiments reveal several interesting insights: 1) Training with cross-subject fMRI benefits both high-level and low-level decoding models; 2) Merging high-level and low-level information improves reconstruction performance at both levels; 3) Transfer learning is effective for new subjects with limited training data by training new adapters; 4) Decoders trained on visually-elicited brain activity can generalize to decode imagery-induced activity, though with reduced performance.

Abstract: Recently, deep neural networks (DNNs) have emerged as the leading approach for lowlight image enhancement (LLIE). However, training these models generally requires large-scale paired datasets, which are challenging to obtain due to the labor-intensive and time-consuming nature of real-world data collection. To alleviate this issue, synthetic data are often combined with real-captured data for training. However, most existing low-light image synthesis methods are simply performed in the sRGB domain using Gamma correction or manual adjustments via Lightroom, which fail to incorporate the physical imaging prior through the image signal processing (ISP) pipeline and thus result in limited dataset size and degradation space. Consequently, LLIE methods trained on such data often exhibit some drawbacks in the results, such as inaccurate white balance and abnormal enhancement artifacts, which limit their practicality and generalizability. In this paper, we propose a practical low-light image synthesis pipeline capable of generating unlimited paired training data. Our pipeline starts with a reverse ISP model that converts sRGB images back to the unprocessed RAW domain, where we then simulate low-light degradation, noise degradation, and white balance adjustments. Finally, the degraded RAW images are processed through a forward ISP model to produce low-light sRGB images. The pipeline further employs multiple tone mapping curves and color correction matrices (CCMs) to expand the degradation space. Hence, trained with our proposed synthetic data, existing state-of-the-art (SOTA) LLIE deep models are expected to improve their performance. Extensive experiments across various datasets demonstrate that our synthetic data can indeed effectively enhance existing LLIE deep models, improving both their practicality and generalizability.

Abstract: As a longrange prior, motion consensus essentially forces the overall spatial transformation between a pair of images to be smooth and consistent, which is naturally well-suited for two-view correspondence learning. However, such precious property remains under-explored by most existing studies due to the modeling challenges posed by the sparsity and uneven distributions of putative correspondences. In this paper, we propose DeMo, a novel and cutting-edge network for outlier rejection, which possesses the capacity to fully capture global motion consensus clues by way of consensus interpolation over the entire high-dimensional motion field generated by putative correspondences. Specifically, through incorporating regularization techniques into a Reproducing Kernel Hilbert Space (RKHS), a concise interpolation formula can be derived for the high-dimensional motion field, which inherently allows a closed-form solution. Subsequently, learnable deep kernels are collaboratively used to flexibly and efficiently capture the relationships between global inputs, thus maintaining the entire motion field consensus. In addition, to remedy the cubic computational overhead of explicit interpolation, a scene-adaptive sampling strategy is introduced, which implicitly selects the more scene-representative motions, reducing the computational complexity of motion consensus interpolation to be approximately linear while maintaining the accuracy. Moreover, to deal with underlying depth discontinuities caused by complicated scene variations, a local consensus complementation block is designed, which maintains local bilateral consensus across both feature and spatial channels. Without bells and whistles, DeMo achieves superior performance in various geometric tasks, including relative pose estimation, homography estimation, and visual localization.

Abstract: UNet is a widely used model for medical image segmentation, renowned for its strong feature extraction capabilities and U-shaped design, which incorporates skip connections to preserve critical information. However, its decoders exhibit information-specific preferences for the supplementary content provided by skip connections, instead of adhering to a strict one-to-one correspondence, which limits its flexibility across diverse tasks. To address this limitation, we propose the Task-Adaptive Mixture of Skip Connections (TA-MoSC) module, inspired by the Mixture of Experts (MoE) framework. TA-MoSC innovatively reinterprets skip connections as a task allocation problem, employing a routing mechanism to adaptively select expert combinations at different decoding stages. By introducing MoE, our approach enhances the sparsity of the model, and lightweight convolutional experts are shared across all skip connection stages, with a Balanced Expert Utilization (BEU) strategy ensuring that all experts are effectively trained, maintaining training balance and preserving computational efficiency. Our approach introduces minimal additional parameters to the original U-Net but significantly enhances its performance and stability. Experiments on GlaS, MoNuSeg, Synapse, and ISIC16 datasets demonstrate state-of-the-art accuracy and better generalization across diverse tasks. Moreover, while this work focuses on medical image segmentation, the proposed method can be seamlessly extended to other segmentation tasks, offering a flexible and efficient solution for diverse applications.

Abstract: Data augmentation is expected to bring about unseen features of training set, enhancing the model’s ability to generalize in situations where data is limited. Generative image models trained on large webcrawled datasets such as LAION are known to produce images with stereotypes and imperceptible bias when used to augment training data, owing to dataset misalignment and the generator’s ignorance of the downstream model. We improve downstream task awareness in generated images by proposing a task-aware fine-tuning strategy that actively detects failures of downstream task in the target model to fine-tune the generation process between epochs. The dynamic fine-tuning strategy is achieved by (1) inspecting misalignment between generated data and original data via VLM captioners and (2) adjusts both prompts and diffusion model so that the strategy dynamically guides the generator by focusing on the detected bias of VLM. This is done via re-captioning the overfitted data as well as finetuning the diffusion trajectory in a contrastive manner. To co-operate with the VLM captioner, the contrastive fine-tuning process dynamically adjusts different parts of the diffusion trajectory based on detected misalignment, thus shifting the the generated distribution away from making the downstream model overfit. Our experiments on few-shot class incremental learning show that our instruction-guided finetuning strategy consistently assists the downstream model with higher classification accuracy compared to generative data augmentation baselines such as Stable Diffusion and GPT-4o, and state-of-the-art non-generative strategies.

Abstract: Recently, a number of effective methods have been proposed to tackle the challenging task of FewShot Fine-Grained Image Classification (FS-FGIC). However, how to fully leverage the backbone network to discover and extract detailed features to generate more discriminative class prototypes, as well as how to accurately model the similarity relationship between query samples and the class prototypes, are still issues to be further considered. Therefore, we propose a novel progreSsively featUre refInement and conTinuous rElationship moDeling method, SUITED for short, to address these two issues existing in the State-of-the-Art FS-FGIC methods. Specifically, we design the Progressive Feature Refinement Module (PFRM) to fully exploit the backbone network's progressive feature extraction capabilities, forming multi-scale feature representations to further enhance discriminative features. Then, the Continuous Relationship Modeling Module (CRMM) is proposed to capture the dependencies between query samples and the corresponding class prototypes, achieving precise optimization of the distances among corresponding sample points in the feature space. We conducted extensive experiments on five fine-grained benchmark datasets, and the experimental results demonstrate that the proposed method is comprehensively ahead of the existing State-of-the-Art methods.

Abstract: Visionlanguage retrieval (VLR) has attracted significant attention in both academia and industry, which involves using text (or images) as queries to retrieve corresponding images (or text). However, existing methods often neglect the rich visual semantics knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.

Abstract: Object detection, particularly openvocabulary object detection, plays a crucial role in Earth sciences, such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. Thus, this paper aims to advance the development of open-vocabulary object detection in remote sensing community. To achieve this, we first reformulate the task as Locate Anything on Earth (LAE) with the goal of detecting any novel concepts on Earth. We then developed the LAE-Label Engine which collects, auto-annotates, and unifies up to 10 remote sensing datasets creating the LAE-1M — the first large-scale remote sensing object detection dataset with broad category coverage. Using the LAE-1M, we further propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task, featuring Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT) modules. DVC dynamically constructs vocabulary for each training batch, while VisGT maps visual features to semantic space, enhancing text features. We comprehensively conduct experiments on established remote sensing benchmark DIOR, DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.

College of Computer Science and Software Engineering, Shenzhen University, College of Computer Science and Software Engineering, Shenzhen University Guangdong Provincial Key Laboratory of Intelligent Information Processing, Guangdong Provincial Key Laboratory of Intelligent Information Processing National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Guangdong Provincial Key Laboratory of Intelligent Information Processing National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University

Abstract: Learning with Noisy Labels (LNL) aims to improve the model generalization when facing data with noisy labels, and existing methods generally assume that noisy labels come from known classes, called closedset noise. However, in real-world scenarios, noisy labels from similar unknown classes, i.e., open-set noise, may occur during the training and inference stage. Such open-world noisy labels may significantly impact the performance of LNL methods. In this study, we propose a novel dual-space joint learning method to robustly handle the open-world noise. To mitigate model overfitting on closed-set and open-set noises, a dual representation space is constructed by two networks. One is a projection network that learns shared representations in the prototype space, while the other is a One-Vs-All (OVA) network that makes predictions using unique semantic representations in the class-independent space. Then, bi-level contrastive learning and consistency regularization are introduced in two spaces to enhance the detection capability for data with unknown classes. To benefit from the memorization effects across different types of samples, class-independent margin criteria are designed for sample identification, which selects clean samples, weights closed-set noise, and filters open-set noise effectively. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods and achieves an average accuracy improvement of 4.55\% and an AUROC improvement of 6.17\% on CIFAR80N.

School of Artificial Intelligence, Wuhan University School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, Shanghai Artificial Intelligence Laboratory, School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, Shanghai Artificial Intelligence Laboratory State Key Lab. of LIESMARS, Wuhan University, Sun Yat-Sen University, SenseTime Research, SenseTime Research, School of Artificial Intelligence, Wuhan University School of Computer Science, Wuhan University State Key Lab. of LIESMARS, Wuhan University Institue for Math & AI, Wuhan University, Shanghai Artificial Intelligence Laboratory SenseTime Research

Abstract: This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a largescale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering.

Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Google

Abstract: Understanding of bimanual handobject interaction plays an important role in robotics and virtual reality. However, due to significant occlusions between hands and object as well as the high degree-of-freedom motions, it is challenging to collect and annotate a high-quality, large-scale dataset, which prevents further improvement of bimanual hand-object interaction-related baselines. In this work, we propose a new 3D Gaussian Splatting based data augmentation framework for bimanual hand-object interaction, which is capable of augmenting existing dataset to large-scale photorealistic data with various hand-object pose and viewpoints. First, we use mesh-based 3DGS to model objects and hands, and to deal with the rendering blur problem due to multi-resolution input images used, we design a super-resolution module. Second, we extend the single hand grasping pose optimization module for the bimanual hand object to generate various poses of bimanual hand-object interaction, which can significantly expand the pose distribution of the dataset. Third, we conduct an analysis for the impact of different aspects of the proposed data augmentation on the understanding of the bimanual hand-object interaction. We perform our data augmentation on two benchmarks, H2O and Arctic, and verify that our method can improve the performance of the baselines.

School of Aeronautics and Astronautics, Sichuan University Megvii Technology, School of Aeronautics and Astronautics, Sichuan University Megvii Technology, School of Aeronautics and Astronautics, Sichuan University Key Laboratory of Advanced Spatial Mechanism and Intelligent Spacecraft, Sichuan University, School of Aeronautics and Astronautics, Sichuan University Key Laboratory of Advanced Spatial Mechanism and Intelligent Spacecraft, Sichuan University, University of Electronic Science and Technology of China Megvii Technology

Abstract: RAWto-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually.

Abstract: Unsupervised camouflaged object detection (UCOD) poses significant challenges, primarily attributed to the absence of human labels. Existing UCOD methodologies, leveraging attention mechanisms, often struggle to achieve precise localization of camouflaged objects. To overcome this limitation, we introduce a groundbreaking fully unsupervised algorithm for attentionguided camouflaged object localization, shift, and inference, termed the self-distilled attention localization and shift network (SdalsNet). In this study, we formulate an attention localization methodology aimed at accurately identifying the central coordinate of the camouflaged object. Furthermore, we propose four distinct loss functions tailored to refine the precision of attentional positioning. These loss functions effectively constrain the distances between three types of class tokens, facilitating seamless attentional shifting across the input sample. Additionally, we design a sophisticated prediction inference technique to reconstruct the binary output of an attention map, thereby providing a comprehensive understanding of the detected camouflaged objects. Experimental results on four challenging COD benchmark datasets corroborate the effectiveness of our proposed approach, demonstrating notable superiority over state-of-the-art methods.

Abstract: The introduction of visionlanguage models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.

Abstract: Point cloud data labeling is considered a timeconsuming and expensive task in autonomous driving, whereas annotation-free learning training can avoid it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D Annotation-Free framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages: In the first stage, we innovatively integrate high-quality textual and image features of 2D open-vocabulary models and propose the Tri-Modal contrastive Pre-training (TMP). In the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. Besides, we introduce the Approximate Flat Interaction (AFI) to address the noise during alignment and label confusion. To validate the superiority of AFOV, extensive experiments are conducted on multiple related datasets. We achieved a record-breaking 47.73% mIoU on the annotation-free 3D segmentation task in nuScenes, surpassing the previous best model by 3.13% mIoU. Meanwhile, the performance of fine-tuning with 1% data on nuScenes and SemanticKITTI reached a remarkable 51.75% mIoU and 48.14% mIoU, outperforming all previous pre-trained models.

Abstract: Producing large images using small diffusion models is gaining increasing popularity, as the cost of training large models could be prohibitive. A common approach involves jointly generating a series of overlapped image patches and obtaining large images by merging adjacent patches. However, results from existing methods often exhibit obvious artifacts, e.g., seams and inconsistent objects and styles. To address the issues, we proposed Guided Fusion (GF), which mitigates the negative impact from distant image regions by applying a weighted average to the overlapping regions. Moreover, we proposed VarianceCorrected Fusion (VCF), which corrects data variance at post-averaging, generating more accurate fusion for the Denoising Diffusion Probabilistic Model. Furthermore, we proposed a one-shot Style Alignment (SA), which generates a coherent style for large images by adjusting the initial input noise without adding extra computational burden. Extensive experiments demonstrated that the proposed fusion methods improved the quality of the generated image significantly. As a plug-and-play module, the proposed method can be widely applied to enhance other fusion-based methods for large image generation.

Abstract: Composed Image Retrieval (CIR) aims to retrieve target images from candidate set using a hybridmodality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies for addressing the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages the large vision-language model (LVLM) as the powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLM for accurately understanding and responding the user intent. Furthermore, we design a novel hybrid intent instruction module to provide explicit intent guidance at two levels: (1) The task prompt clarifies the task requirement and assists the model in discerning user intent at the task level. (2) The instance-specific soft prompt, which is adaptively selected from the learnable prompt pool, enables the model to better comprehend the user intent at the instance level compared to a universal prompt for all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks with acceptable inference efficiency. We believe this study provides fundamental insights into CIR-related fields.

Beijing Advanced Innovation Center on Biomedical Engineering, School of Engineering Medicine, Beihang University Image Processing Center, School of Astronautics, Beihang University, Beijing Advanced Innovation Center on Biomedical Engineering, School of Engineering Medicine, Beihang University Image Processing Center, School of Astronautics, Beihang University Tianmushan Laboratory, School of Software, Hefei University of Technology, Department of Pathology, the First Affiliated Hospital of USTC Intelligent Pathology Institute, Division of Life Sciences and Medicine, University of Science and Technology of China, Department of Pathology, the First Affiliated Hospital of USTC Intelligent Pathology Institute, Division of Life Sciences and Medicine, University of Science and Technology of China, Beijing Advanced Innovation Center on Biomedical Engineering, School of Engineering Medicine, Beihang University

Abstract: Gigapixel image analysis, particularly for whole slide images (WSIs), often relies on multiple instance learning (MIL). Under the paradigm of MIL, patch image representations are extracted and then fixed during the training of the MIL classifiers for efficiency consideration. However, the invariance of representations makes it difficult to perform data augmentation for WSIlevel model training, which significantly limits the performance of the downstream WSI analysis. The current data augmentation methods for gigapixel images either introduce additional computational costs or result in a loss of semantic information, which is hard to meet the requirements for efficiency and stability needed for WSI model training. In this paper, we propose a Promptable Representation Distribution Learning framework (PRDL) for both patch-level representation learning and WSI-level data augmentation. Meanwhile, we explore the use of prompts to guide data augmentation in feature space, which achieves promptable data augmentation for training robust WSI-level models. The experimental results have demonstrated that the proposed method stably outperforms state-of-the-art methods.

School of Intelligence Science and Technology, Nanjing University, Suzhou, China, Jilin University, Changchun, China, China Mobile Research Institute, Beijing, China, China Mobile Research Institute, Beijing, China, Beijing Shuzhimei Technology Co., Ltd, Beijing, China, State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China School of Intelligence Science and Technology, Nanjing University, Suzhou, China, State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China School of Intelligence Science and Technology, Nanjing University, Suzhou, China, School of New Media and Communication, Tianjin University, Tianjin, China, State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China School of Intelligence Science and Technology, Nanjing University, Suzhou, China

Abstract: Recent advancements in imageconditioned image generation have demonstrated substantial progress. However, foreground-conditioned image generation remains underexplored, encountering challenges such as compromised object integrity, foreground-background inconsistencies, limited diversity, and reduced control flexibility. These challenges arise from current end-to-end inpainting models, which suffer from inaccurate training masks, limited foreground semantic understanding, data distribution biases, and inherent interference between visual and textual prompts. To overcome these limitations, we present Anywhere, a multi-agent framework that departs from the traditional end-to-end approach. In this framework, each agent is specialized in a distinct aspect, such as foreground understanding, diversity enhancement, object integrity protection, and textual prompt consistency. Our framework is further enhanced with the ability to incorporate optional user textual inputs, perform automated quality assessments, and initiate re-generation as needed. Comprehensive experiments demonstrate that this modular design effectively overcomes the limitations of existing end-to-end models, resulting in higher fidelity, quality, diversity and controllability in foreground-conditioned image generation. Additionally, the Anywhere framework is extensible, allowing it to benefit from future advancements in each individual agent.

Abstract: Evaluation metric of visual captioning is important yet not thoroughly explored. Traditional metrics like BLEU, METEOR, CIDEr, and ROUGE often miss semantic depth, while trained metrics such as CLIPScore, PAC-S, and Polos are limited in zero-shot scenarios. Advanced Language Model-based metrics also struggle with aligning to nuanced human preferences. To address these issues, we introduce G-VEval, a novel metric inspired by G-Eval and powered by the new GPT-4o. G-VEval uses chain-of-thought reasoning in large multimodal models and supports three modes: reference-free, reference-only, and combined, accommodating both video and image inputs. We also propose MSVD-Eval, a new dataset for video captioning evaluation, to establish a more transparent and consistent framework for both human experts and evaluation metrics. It is designed to address the lack of clear criteria in existing datasets by introducing distinct dimensions of Accuracy, Completeness, Conciseness, and Relevance (ACCR). Extensive results show that G-VEval outperforms existing methods in correlation with human annotations, as measured by Kendall tau-b and Kendall tau-c. This provides a flexible solution for diverse captioning tasks and suggests a straightforward yet effective approach for large language models to understand video content, paving the way for advancements in automated captioning.

Abstract: Satisfactory progress has been achieved recently in universal segmentation of CT images. Following the success of visionlanguage methods, there is a growing trend towards utilizing text prompts and contrastive learning to develop universal segmentation models. However, there exists a significant imbalance in information density between 3D images and text prompts. Moreover, the standard fully connected layer segmentation approach faces significant challenges with handling multiple classes and exhibits poor generalizability. To address these challenges, we propose VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. Initially, we align voxels and language into a shared representation space and classify voxels based on cosine similarity. Subsequently, we develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes. Furthermore, a Complexity-Aware Sampling method is proposed to focus on region hard to segment, achieved by generating pseudo heatmaps from a trainable Gaussian mixture distribution. Our results indicate the proposed VOILA is capable to achieve improved performance with reduced parameters and computational cost during training. Furthermore, it demonstrates significant generalizability across diverse datasets without additional fine-tuning.

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Alibaba Group, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Alibaba Group, Alibaba Group, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation,Chinese Academy of Sciences

Abstract: In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel postprocessing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics.

Abstract: Fewshot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variant of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods.

Abstract: Camerabased Bird's Eye View (BEV) perception models receive increasing attention for their crucial role in autonomous driving, a domain where concerns about the robustness and reliability of deep learning have been raised. While only a few works have investigated the effects of randomly generated semantic perturbations, aka natural corruptions, on the multi-view BEV detection task, we develop a black-box robustness evaluation framework that adversarially optimises three common semantic perturbations: geometric transformation, colour shifting, and motion blur, to deceive BEV models, serving as the first approach in this emerging field. To address the challenge posed by optimising the semantic perturbation, we design a smoothed, distance-based surrogate function to replace the mAP metric and introduce SimpleDIRECT, a deterministic optimisation algorithm that utilises observed slopes to guide the optimisation process. By comparing with randomised perturbation and two optimisation baselines, we demonstrate the effectiveness of the proposed framework. Additionally, we provide a benchmark on the semantic robustness of ten recent BEV models. The results reveal that PolarFormer, which emphasises geometric information from multi-view images, exhibits the highest robustness, whereas BEVDet is fully compromised, with its precision reduced to zero.

Abstract: Discrete latent representation techniques, such as Vector Quantization (VQ) and Sparse Coding (SC), have demonstrated superior image reconstruction and generation quality compared to continuous representation methods in Variational Autoencoders (VAEs). However, existing approaches often treat the latent representations of an image independently in their discrete representation space, neglecting both the inherent structural information within each representation and the correlations among them. This oversight leads to coarse representations and suboptimal generated results. In this paper, we address these limitations by introducing correlations among and within the latent representations of individual images in the latent discrete space of VAEs using sparse coding. We impose twodimensional structural information through adaptive thresholding, enhancing local structure in image representations while suppressing noise via parsimonious representation with a learned dictionary. Empirical studies on three real benchmark datasets, including a clinical Ultrasound dataset, BSDS500, and mini-Imagenet, demonstrate that our proposed model preserves fine-grained details in image reconstruction and significantly outperforms baseline models of SC-VAE and VQ-VAE across objective and subjective image quality metrics. Particularly noteworthy are the substantial performance improvements observed on the ultrasound dataset, where structure information is crucial. Specifically, we observe significant performance improvements of 7.68 % and 17.03 % in SSIM, 3.25 dB and 6.58 dB in PSNR, 0.15 and 0.24 in LPIPS, 45.38 and 84.05 in FID over SC-VAE and VQ-VAE, respectively, indicating the superiority of our method in terms of image reconstruction quality and fidelity.

Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, China, Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, China, Anhui Provincial Key Laboratory of Security Artificial Intelligence, School of Artificial Intelligence, Anhui University, China, Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, China, Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, China

Abstract: Alignmentfree RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, complex salient objects, and so on. To support the exploration for further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions in unaligned image pairs. Extensive experiments conducted on two unaligned three weakly aligned three aligned datasets demonstrate the effectiveness of our method.

Abstract: Neural Radiance Fields (NeRF) have demonstrated prominent performance in novel view synthesis tasks. However, their input heavily relies on image acquisition under normal light conditions, making it challenging to learn accurate scene contents in lowlight environments where images typically exhibit significant noise and severe color distortion. To address these challenges, we propose a novel approach, Bright-NeRF, which learns enhanced and high-quality radiance fields from multi-view low-light RAW images in an unsupervised manner. Our method simultaneously achieves color restoration, denoising, and enhanced novel view synthesis. Specifically, we leverage a physically-inspired model of the sensor's response to illumination and introduce a chromatic adaptation loss to constrain the learning of response, enabling consistent color perception of objects regardless of lighting conditions. We further utilize the RAW data's properties to expose the scene's intensity automatically. Additionally, we have collected a multi-view low-light RAW image dataset of real-world scenes to advance research in this field. Experimental results demonstrate that our proposed method significantly outperforms existing 2D and 3D approaches. Our code and dataset will be made publicly available.

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China, School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China Yunnan Key Laboratory of Software Engineering, Yunan, China, School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China, School of Artificial Intelligence, Anhui University, Hefei, China, School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China, School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract: Dynamic quantization has attracted rising attention in image superresolution (SR) as it expands the potential of heavy SR models onto mobile devices while preserving competitive performance. Most current methods explore layer-to-bit configuration upon varying local regions, adaptively allocating the bit to each layer and patch. Despite the benefits, they still fall short in the tradeoff of SR accuracy and quantization efficiency. Apart from this, adapting the quantization level for each layer individually can disturb the original inter-layer relationships, thus diminishing the representation capability of quantized models. In this work, we propose Granular-DQ, which takes advantage of multi-granularity clues and local patch statistics, achieving a distinctive patch-wise and layer-invariant dynamic quantization paradigm. Specifically, Granular-DQ initiates by developing a granularity-bit controller to apprehend the coarse-to-fine granular representations of local patches, matching their proportional contribution to the entire image to determine the proper bit-width allocation. On this premise, we investigate the interrelationships between bit-width and information density within high-bit patches, establishing a soft gate that enables further fine-grained dynamic bit adaption. Extensive experiments validate the superiority of Granular-DQ in the trade-off between efficiency and accuracy over recent state-of-the-art methods on various SR models.

Abstract: Despite the advanced longsequence modeling of Mamba, which has expanded its applications in image restoration, there remains a lack of exploration combining its strengths with the specific characteristics of JPEG image restoration, where high-frequency components are lost after the Discrete Cosine Transform (DCT). To address this, we introduce DCTMamba, a new framework designed to apply Mamba more effectively to JPEG image restoration. Specifically, our method integrates the Discrete Cosine Transform (DCT) into the Mamba to establish the sequential scanning from lower to higher frequencies, enabling the network to initially reconstruct coarse structures and progressively refine the image with more intricate details. Furthermore, recognizing the variable frequency distributions that arise from DCT transformations across different image sizes, we have developed Scale-Adaptive Normalization to manage these variations adeptly. Comprehensive experiments confirm that DCTMamba significantly outperforms existing solutions, achieving high fidelity in both coarse structures and fine details.CTMamba significantly outperforms existing solutions, achieving high fidelity in both coarse structures and fine details.

Abstract: 3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly encounter a shortage: a limited amount and diversity of text3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG methods for enriching their training data. Besides, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.

Abstract: Humanobject contact (HOT) is designed to accurately identify the areas where humans and objects come into contact. Current methods frequently fail to account for scenarios where objects are frequently blocking the view, resulting in inaccurate identification of contact areas. To tackle this problem, we suggest using a perspective interaction HOT detector called PIHOT, which utilizes a depth map generation model to offer depth information of humans and objects related to the camera, thereby preventing false interaction detection. Furthermore, we use mask dilatation and object restoration techniques to restore the texture details in covered areas, improve the boundaries between objects, and enhance the perception of humans interacting with objects. Moreover, a spatial awareness perception is intended to concentrate on the characteristic features close to the points of contact. The experimental results show that the PIHOT algorithm achieves state-of-the-art performance on three benchmark datasets for HOT detection tasks. Compared to the most recent DHOT, our method enjoys an average improvement of 13%, 27.5%, 16%, and 18.5% on SC-Acc., C-Acc., mIoU, and wIoU metrics, respectively.

College of Computer Science and Technology, Zhejiang Normal University, Zhejiang, 311231, China Research Institute of Hangzhou Artificial Intelligence, Zhejiang Normal University, Hangzhou, Zhejiang, 311231, China, College of Computer Science and Technology, Zhejiang Normal University, Zhejiang, 311231, China Research Institute of Hangzhou Artificial Intelligence, Zhejiang Normal University, Hangzhou, Zhejiang, 311231, China, College of Computer Science and Technology, Zhejiang Normal University, Zhejiang, 311231, China Research Institute of Hangzhou Artificial Intelligence, Zhejiang Normal University, Hangzhou, Zhejiang, 311231, China, College of Computer Science and Technology, Zhejiang Normal University, Zhejiang, 311231, China Research Institute of Hangzhou Artificial Intelligence, Zhejiang Normal University, Hangzhou, Zhejiang, 311231, China Beijing Geekplus Technology Co, Ltd, Beijing, 100101, China, Beijing Geekplus Technology Co, Ltd, Beijing, 100101, China

Abstract: Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for realtime object detectors. Additionally, transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attentive mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose that the ODMamba backbone introduce a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike the other Transformer-base and SSM-base method, ODMamba is simple to train without pretraining. 2) For real-time requirement, we designed the macro structure of ODMamba, determined the optimal stage ratio and scaling size. 3) We design the RG Block that employs a multi-branch structure to model the channel dimensions, which addresses the possible limitations of SSM in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures localized image dependencies more accurately and significantly. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms.

Abstract: Enhancing images captured under lowlight conditions has been a topic of research for several years. Nonetheless, existing image restoration techniques mainly concentrate on reconstructing images from RGB data, often neglecting the possibility of utilizing additional modalities. With the progress in handheld technology, capturing thermal images with mobile devices has become straightforward. Investigating the integration of thermal data into image restoration presents a valuable research opportunity. Therefore, in this paper, we propose a multimodal low-light image enhancement task based on thermal information and establish a dataset named TLIE (Thermal-aware Low-light Image Enhancement), consisting of 1,113 samples. Each sample in our dataset includes a low-light image, a normal-light image, and the corresponding thermal map. Additionally, based on TLIE dataset, we develop a multimodal approach that simultaneously processes input images and thermal map data to produce the predicted normal-light images. We compare our method with previous unimodal and multimodal state-of-the-art LIE methods, and the experimental results and detailed ablation studies prove the effectiveness of our method.

Abstract: Salient object detection (SOD) methods for 2D images have great significance in the field of humancomputer interaction (HCI). However, as a common data format in HCI, the SOD research in the form of 3D point cloud data remains limited. Previous works commonly treat this task as point cloud segmentation, which perceives all points in the scene for prediction. However, these methods neglect that SOD is designed to simulate human visual perception where human can only see the surfaces rather than occluded point clouds. Thereby, these methods may fail when meet such situations. This paper aims to solve this problem by approximately simulating the perception paradigm of humans towards 3D scenes. Thus, we propose a framework based on the 3D visual point cloud backbone and its multi-view projection named MSV-PCT. Specifically, instead of relying solely on general point cloud learning frameworks, we additionally introduce multi-sparse-view learning branches to supplement the SOD perception. Furthermore, we propose a novel point cloud edge detection loss function to effectively address artifacts, enabling the accurate segmentation of the edges of salient objects from the background. Finally, to evaluate the generalization of point cloud SOD methods, we introduce a new approach to generate simulated PC-SOD datasets from RGBD-SOD data. Experiments on the simulated datasets show that MSV-PCT achieves better accuracy and robustness.

Abstract: In fetal magnetic resonance imaging (MRI), sliceto-volume reconstruction (SVR) involves the computational creation of a 3D volume from multiple stacks of 2D slices. This process is challenging due to slice misalignment and image noise. Current state-of-the-art (SOTA) SVR methods typically employ coarse-to-fine techniques that iteratively refine slice-to-volume motion correction and 3D volume reconstruction. However, both processes are inherently inefficient, making these methods time-consuming and prone to errors. This often results in less robust and accurate outcomes, primarily due to insufficient modeling of the spatial relationships between slices. Typically, 2D fetal MRI slices are acquired using the interleave sequence, which first acquires the odd slices and then the even slices in one stack. To this end, we propose a novel Mamba-based framework called SVRMamba, which integrates slice-to-volume reconstruction with slice sequence-guided state space modeling. Specifically, our approach reformulates Mamba’s unidirectional scanning into a slice sequence-guided odd-even directional scanning method and marks the slice positions using sequence embedding tokens. This enables the network to learn the slice relationships and spatial sequences, enhancing fetal MRI SVR motion correction performance. We further integrate a convolutional neural network (CNN)-based interpolation network that generates a noise-suppressed 3D reconstruction by leveraging the predicted motion for each slice. This framework notably enhances 3D fetal brain SVR, delivering substantial improvements in both reconstruction speed and overall performance. Extensive experiments conducted on various benchmark and clinical datasets demonstrate that SVRMamba significantly outperforms existing SOTA methods, delivering comparable results with a remarkable sixtyfold increase in reconstruction speed.

Department of Computer Science and Engineering, The Chinese University of Hong Kong Huawei Hong Kong Research Center, Huawei Hong Kong Research Center, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Huawei 2012 Laboratories, Huawei 2012 Laboratories, Artificial Intelligence Institute, Shanghai Jiao Tong University, School of Computer Science and Technology, Huazhong University of Science and Technology, School of Computer Science and Technology, Xidian University, Guangzhou Institute of Technology, Xidian University ICTT and ISN Laboratory, Xidian University, Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract: Controversial contents largely inundate the Internet, infringing various cultural norms and child protection standards. Traditional Image Content Moderation (ICM) models fall short in producing precise moderation decisions for diverse standards, while recent multimodal large language models (MLLMs), when adopted to general rulebased ICM, often produce classification and explanation results that are inconsistent with human moderators. Aiming at flexible, explainable, and accurate ICM, we design a novel rule-based dataset generation pipeline, decomposing concise human-defined rules and leveraging well-designed multi-stage prompts to enrich short explicit image annotations. Our ICM-Instruct dataset includes detailed moderation explanation and moderation Q-A pairs. Built upon it, we create our ICM-Assistant model in the framework of rule-based ICM, making it readily applicable in real practice. Our ICM-Assistant model demonstrates exceptional performance and flexibility. Specifically, it significantly outperforms existing approaches on various sources, improving both the moderation classification (36.8% on average) and moderation explanation quality (26.6% on average) consistently over existing MLLMs. Caution: Content includes offensive language or images.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Chin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Chin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Chin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Chin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Chin

Abstract: Despite the efficiency of prompt learning in transferring visionlanguage models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet incurring notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66% to 76.96% on 11 benchmarks, surpassing CoCoOp which introduced instance-wise feature for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance while keep inference efficient.

Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) Xi'an Jiaotong University, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Peking University, Sun Yat-sen University, Tsinghua University, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University Carleton University

Abstract: Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize highfidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, the initial step involves generating the corresponding lip point cloud from the audio signal and capturing its topological structure. The design of the dynamic difference encoder aims to capture the subtle nuances inherent in dynamic lip movements more effectively. Furthermore, we integrate the audio-point enhancement module, which not only ensures the synchronization of the audio signal with the corresponding lip point cloud within the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.

Abstract: While Signed Distance Fields (SDF) are wellestablished for modeling watertight surfaces, Unsigned Distance Fields (UDF) broaden the scope to include open surfaces and models with complex inner structures. Despite their flexibility, UDFs encounter significant challenges in high-fidelity 3D reconstruction, such as non-differentiability at the zero level set, difficulty in achieving the exact zero value, numerous local minima, vanishing gradients, and oscillating gradient directions near the zero level set. To address these challenges, we propose Details Enhanced UDF (DEUDF) learning that integrates normal alignment and the SIREN network for capturing fine geometric details, adaptively weighted Eikonal constraints to address vanishing gradients near the target surface, unconditioned MLP-based UDF representation to relax non-negativity constraints, and DCUDF for extracting the local minimal average distance surface. These strategies collectively stabilize the learning process from unoriented point clouds and enhance the accuracy of UDFs. Our computational results demonstrate that DEUDF outperforms existing UDF learning methods in both accuracy and the quality of reconstructed surfaces.

State Key Laboratory of Integrated Services Networks, Xidian University, State Key Laboratory of Integrated Services Networks, Xidian University, Department of Radiology, Guangdong Provincial People’s Hospital, Southern Medical University, Department of Radiology, Xiangyang No. 1 People’s Hospital, Hubei University of Medicine, Department of Radiology, Xiangyang No. 1 People’s Hospital, Hubei University of Medicine, Department of Radiology, Guangdong Provincial People’s Hospital, Southern Medical University, State Key Laboratory of Integrated Services Networks, Xidian University, Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications

Abstract: Motion artifacts present in magnetic resonance imaging (MRI) can seriously interfere with clinical diagnosis. Removing motion artifacts is a straightforward solution and has been extensively studied. However, paired data are still heavily relied on in recent works and the perturbations in kspace (frequency domain) are not well considered, which limits their applications in the clinical field. To address these issues, we propose a novel unsupervised purification method which leverages pixel-frequency information of noisy MRI images to guide a pre-trained diffusion model to recover clean MRI images. Specifically, considering that motion artifacts are mainly concentrated in high-frequency components in k-space, we utilize the low-frequency components as the guide to ensure correct tissue textures. Additionally, given that high-frequency and pixel information are helpful for recovering shape and detail textures, we design alternate complementary masks to simultaneously destroy the artifact structure and exploit useful information. Quantitative experiments are performed on datasets from different tissues and show that our method achieves superior performance on several metrics. Qualitative evaluations with radiologists also show that our method provides better clinical feedback.

Abstract: The application of Contrastive LanguageImage Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, specifically by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing contrast loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in vision space with the help of text prototypes, to capture high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts regions embedding with corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.

Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School，Peking University, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School，Peking University, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School，Peking University, School of Computer Science and Engineering, Central South University, School of Computer Science and Engineering, Central South University

Abstract: Recently, numerous robust image hashing schemes have been developed for content identification. However, many of these schemes face the challenges of maintaining discrimination while simultaneously resisting largescale attacks. In this paper, we propose a robust image hashing scheme based on Contrastive Masked Autoencoder with weak-strong augmentation Alignment (CMAA). Leveraging contrastive learning, CMAA is designed to learn features that are robust to large-scale and hybrid attacks while maintaining the discrimination of those features. Specifically, it utilizes distribution divergence to align weak attack augmented features with strong attack augmented features, namely weak-strong augmentation alignment, to enhance the robustness to strong attacks. In addition, a masked vision transformer is incorporated to further enhance content identification performance. CMAA also includes a parameter-free quantization layer to mitigate the loss induced by binarization. Experimental results demonstrate that our method exhibits remarkable robustness against various attacks, including challenging ones such as rotation and hybrid attacks, and delivers excellent identification performance with a F1 score close to 1.0. Our code and supplementary materials are available on Github.

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, School of Information Management, Nanjing University, Nanjing, China, Skywork AI, Singapore Nanyang Technological University, Singapore, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Software Engineering, The University of Western Australia, Australia, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, Skywork AI, Singapore Nanyang Technological University, Singapore

Abstract: Domain Generalization (DG) has been recently explored to improve the generalizability of point cloud classification (PCC) models toward unseen domains. However, they often suffer from limited receptive fields or quadratic complexity due to the use of convolution neural networks or vision Transformers. In this paper, we present the first work that studies the generalizability of state space models (SSMs) in DG PCC and find that directly applying SSMs into DG PCC will encounter several challenges: the inherent topology of the point cloud tends to be disrupted and leads to noise accumulation during the serialization stage. Besides, the lack of designs in domainagnostic feature learning and data scanning will introduce unanticipated domain-specific information into the 3D sequence data. To this end, we propose a novel framework, PointDGMamba, that excels in strong generalizability toward unseen domains and has the advantages of global receptive fields and efficient linear complexity. PointDGMamba consists of three innovative components: Masked Sequence Denoising (MSD), Sequence-wise Cross-domain Feature Aggregation (SCFA), and Dual-level Domain Scanning (DDS). In particular, MSD selectively masks out the noised point tokens of the point cloud sequences, SCFA introduces cross-domain but same-class point cloud features to encourage the model to learn how to extract more generalized features. DDS includes intra-domain scanning and cross-domain scanning to facilitate information exchange between features. In addition, we propose a new and more challenging benchmark PointDG-3to1 for multi-domain generalization. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of PointDGMamba.

State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University AI2 Robotics, The Chinese University of Hong Kong, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, The Chinese University of Hong Kong, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, AI2 Robotics, State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Abstract: Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and image understanding. While these models are powerful, they have not yet been developed to comprehend the more challenging 3D geometric and physical scenes, especially when it comes to the sparse outdoor LiDAR data. In this paper, we introduce LiDARLLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D LiDAR-text pairing data, we introduce a three-stage training strategy and generate relevant datasets, progressively aligning the 3D modality with the language embedding of LLM. Furthermore, we design a Position-Aware Transformer (PAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's spatial orientation comprehension of visual features. Our experiments demonstrate that LiDAR-LLM effectively comprehends a wide range of instructions related to 3D scenes, achieving a 40.9 BLEU-1 score on the 3D captioning dataset, a Grounded Captioning accuracy of 63.1%, and a BEV mIoU of 14.3%.

Abstract: As climate change reshapes global weather patterns, the increasing frequency and intensity of extreme rainfall events have amplified the safety imperatives for autonomous driving systems. During such events, rainfall can escalate from heavy to violent, as defined by the World Meteorological Organization, severely impairing images with diverse and significant degradations. Many existing semantic segmentation models perform well under light to heavy rain, but there is a notable absence of datasets addressing violent rain conditions for these models to validate and learn from. In this paper, we introduce the Extreme RainFall (ERF) dataset for semantic segmentation in both image and video tasks under violent rain conditions. Our dataset comprises 14,757 unlabeled frames and 100 labeled frames, all captured during four different violent rainfall periods. We use our dataset to evaluate the robustness of various methods against violent rainfall, focusing on four approaches: 1) imagebased foundation models, 2) image-based domain generalization methods, 3) image-based domain adaptation methods, and 4) video-based methods. The results reveal that none of the existing models tested is capable of withstanding the extreme challenges posed by violent rainfall conditions. By analyzing the results, we offer insights and suggestions for developing more robust models under extreme rainfall events.

Nanyang Technological University, Singapore, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Nanyang Technological University, Singapore, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China, Nanyang Technological University, Singapore, Beihang University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

Abstract: While Neural Radiance Fields (NeRFs) have advanced the frontiers of novel view synthesis (NVS) using LiDAR data, they still struggle in dynamic scenes. Due to the low frequency and sparsity characteristics of LiDAR point clouds, it is challenging to spontaneously learn a dynamic and consistent scene representation from posed scans. In this paper, we propose STGCNeRF, a novel LiDAR NeRF method that combines spatial-temporal geometry consistency to enhance the reconstruction of dynamic scenes. First, we propose a temporal geometry consistency regularization to enhance the regression of time-varying scene geometries from low-frequency LiDAR sequences. By estimating the pointwise correspondences between synthetic (or real) and real frames at different times, we convert them into various forms of temporal supervision. This alleviates the inconsistency caused by moving objects in dynamic scenes. Second, to improve the reconstruction of sparse LiDAR data, we propose spatial geometric consistency constraints. By computing multiple neighborhood feature descriptors incorporating geometric and contextual information, we capture structural geometry information from sparse LiDAR data. This helps encourage consistent direction, smoothness, and detail of the local surface. Extensive experiments on the KITTI-360 and nuScenes datasets demonstrate that STGC-NeRF outperforms state-of-the-art methods in both geometry and intensity accuracy for dynamic LiDAR scene reconstruction.

Abstract: NoReference Image Quality Assessment (NR-IQA), responsible for assessing the quality of a single input image without using any reference, plays a critical role in evaluating and optimizing computer vision systems, e.g., low-light enhancement. Recent research indicates that NR-IQA models are susceptible to adversarial attacks, which can significantly alter predicted scores with visually imperceptible perturbations. Despite revealing vulnerabilities, these attack methods have limitations, including high computational demands, untargeted manipulation, limited practical utility in white-box scenarios, and reduced effectiveness in black-box scenarios. To address these challenges, we shift our focus to another significant threat and present a novel poisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker to manipulate the IQA model's output to any desired target value by simply adjusting a scaling coefficient alpha for the trigger. We propose to inject the trigger in the discrete cosine transform (DCT) domain to improve the local invariance of the trigger for countering trigger diminishment in NR-IQA models due to widely adopted data augmentations. Furthermore, the universal adversarial perturbations (UAP) in the DCT space are designed as the trigger, to increase IQA model susceptibility to manipulation and improve attack effectiveness. In addition to the heuristic method for poison-label BAIQA (P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on alpha sampling and image data refinement, driven by theoretical insights we reveal. Extensive experiments on diverse datasets and various NR-IQA models demonstrate the effectiveness of our attacks.

Abstract: By adaptively controlling the density and generating more Gaussians in regions with highfrequency information, 3D Gaussian Splatting (3DGS) can better represent scene details. From the signal processing perspective, representing details usually needs more Gaussians with relatively smaller scales. However, 3DGS currently lacks an explicit constraint linking the density and scale of 3D Gaussians across the domain, leading to 3DGS using improper-scale Gaussians to express frequency information, resulting in the loss of accuracy. In this paper, we propose to establish a direct relation between density and scale through the reparameterization of the scaling parameters and ensure the consistency between them via explicit constraints (i.e., density responds well to changes in frequency). Furthermore, we develop a frequency-aware density control strategy, consisting of densification and deletion, to improve representation quality with fewer Gaussians. A dynamic threshold encourages densification in high-frequency regions, while a scale-based filter deletes Gaussians with improper scale. Experimental results on various datasets demonstrate that our method outperforms existing state-of-the-art methods quantitatively and qualitatively.

Abstract: Infrared small target detection (IRSTD) focuses on identifying small targets in infrared images. Despite advancements with deep learning, challenges persist due to the IR longrange imaging mechanism, where targets are small, dim, and easily lost in noise and background clutter. Current deep learning methods struggle to suppress noise and background interference while preserving fine details, leading to missed detections and false alarms. To address these issues, we propose IRMamba, an encoder-decoder architecture featuring Pixel Difference Mamba (PDMamba) and a Layer Restoration Module (LRM). Specifically, PDMamba integrates the intensity and directional information of pixel differences between scanning positions and their central neighborhoods into the state equation of the state space model (SSM). This enhances target detail representation and suppresses background interference by capturing local 2D dependencies from a global perspective. In addition, LRM incorporates the double-depth image prior into the iterative convergence algorithm, and utilizes the inter-layer interrelationships to gradually reverse the separation of the target layer, achieving noise suppression and refined reconstruction of the image mask. Experiments conducted on multiple public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, demonstrate the significant advantages of IRMamba over SOTA methods.

Abstract: Singleframe Infrared Small Target (SIRST) detection has made significant advancements, but it still faces challenges due to limited labeled data and the foreground-background class imbalance. To address these issues, we introduce a novel Semi-Supervised SIRST Detection (S^3D) pipeline in this paper. First, drawing inspiration from thermodynamics, we propose augmenting infrared images using both chromatically and spatially uneven perturbations. This dual-stream perturbation enhances the diversity and balance of infrared samples, contributing to the robustness of detection models. Additionally, we develop a confidence-adaptive matching method to maintain weighted consistency among perturbed unlabeled samples. Second, to tackle class imbalance in labeled data, we compel the model to generate discriminative predictions for challenging, misclassified examples while down-weighting well-classified examples. We achieve this by modifying the standard cross-entropy loss to squeeze the detector and truncating the loss on well-classified examples. Our innovative Truncated Squeeze (TS) loss focuses on learning discriminative representations for difficult cases and prevents over-optimization for simpler ones. To assess the effectiveness of the perturbation techniques and loss functions, we apply them to various SIRST detectors and conduct comprehensive experiments on two benchmark datasets. Notably, our proposed methods consistently and significantly improve accuracy. Remarkably, our approach achieves over 98% performance of the state-of-the-art fully-supervised method using only 1/8 of the labeled samples.

Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, School of Cyber Science and Engineering, Nanjing University of Science and Technology, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, VCIP & TMCC & DISSec, College of Computer Science, Nankai University, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China

Abstract: Video textbased visual question answering (TextVQA) is a practical task that aims to answer questions by jointly reasoning textual and visual information in a given video. Inspired by the development of TextVQA in image domain, existing Video TextVQA approaches leverage a language model (e.g. T5) to process text-rich multiple frames and generate answers auto-regressively. Nevertheless, the spatio-temporal relationships among visual entities (including scene text and objects) will be disrupted and models are susceptible to interference from unrelated information, resulting in irrational reasoning and inaccurate answering. To tackle these challenges, we propose the TEA (stands for "Track the Answer'') method that better extends the generative TextVQA framework from image to video. TEA recovers the spatio-temporal relationships in a complementary way and incorporates OCR-aware clues to enhance the quality of reasoning questions. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. TEA outperforms existing TextVQA methods, video-language pretraining methods and video large language models by great margins. The code will be publicly released.

Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Google, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences

Abstract: Generating highquality whole-body human object interaction motion sequences is becoming increasingly important in various fields such as animation, VR/AR, and robotics. The main challenge of this task lies in determining the level of involvement of each hand given the complex shapes of objects in different sizes and their different motion trajectories, while ensuring strong grasping realism and guaranteeing the coordination of movement in all body parts. Contrasting with existing work, which either generates human interaction motion sequences without detailed hand grasping poses or only models a static grasping pose, we propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences within a single diffusion model. To guide our network in perceiving the object's spatial position and learning more natural grasping poses, we introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible results.

Abstract: Given the critical role of birds in ecosystems, FineGrained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an ``Activation-Selection-Aggregation'' paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.

Abstract: Baglabel-based multi-instance learning (MIL) has demonstrated significant performance in whole slide image (WSI) analysis, particularly in pseudo-label-based learning schemes. However, due to inaccurate feature representation and interference, existing MIL methods often yield unreliable pseudo-labels, which spawn undesired predictions. To address these issues, we propose an Online Pseudo-Supervision and Dynamic Mutual Learning (OODML) framework that enhances pseudo-label generation and feature representation while exploring their mutual learning to improve bag-level prediction. Specifically, we design an Adaptive Memory Bank (AMB) to collect the most informative components of the current WSI. We also introduce a Self-Progressive Feature Fusion (SPFF) module that integrates label-related historical information from the AMB with current semantic variations, thereby enhancing the representation of pseudo-bag tokens. Furthermore, we propose a Decision Revision Pseudo-Label (DRPL) generation scheme to explore intrinsic connections between pseudo-bag representations and bag-label predictions, resulting in more reliable pseudo-label generation. To alleviate redundant and ambiguous representations, the class-wise prior of pseudo-label prediction is borrowed to facilitate label-related feature learning and to update the AMB, forming a mutual refinement between feature representation and pseudo-label generation. Additionally, a Dynamic Decision-Making (DDM) module is developed to harmonize explicit and implicit representations of bag information for more robust decision-making. Extensive experiments on four datasets demonstrate that our OODML surpasses the state-of-the-art by 3.3% and 6.9% on the CAMELYON16 and TCGA Lung datasets.

Abstract: Image change captioning (ICC) poses great challenges stemming from describing subtle differences between two similar images in natural language, significantly increasing the complexity of feature extraction and crossmodal learning compared to the image captioning task. Existing ICC methods often suffer from two key challenges: 1) Massive irrelevant information of uni-image features leads to suboptimal visual difference representations; 2) Imprecise inter-modality correspondence degrades the quality of generated captions. This paper proposes a Difference-aware Contrastive Diffusion Model with Adversarial Perturbations (DECIDER) for ICC due to the excellent performance of diffusion models in image/text generation. Technically, difference-aware cross-modal learning is developed to suppress irrelevant information and learn compact yet robust visual difference representations. This is achieved by optimizing a novel objective mathematically derived from the information bottleneck principle that excels in filtering redundant features and highlighting differences. Furthermore, we propose to dynamically generate ``hard'' positive and negative samples via adversarial perturbations, which are involved in contrastive diffusion training with a tighter variational bound. This design encourages our DECIDER to excavate and construct complex correspondences between visual differences and captions, thereby improving generalization performance. Extensive experiments on four datasets demonstrate that DECIDER significantly exceeds state-of-the-art performance.

Abstract: Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domainspecific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address the challenge, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future works, we meticulously crafted the VIoT-Tool dataset, including the training dataset and the benchmark involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to ReAct instruction tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications.

Abstract: Testtime adaptation (TTA) deals with domain shifts during inference by training models based on only unlabeled test samples. Test samples may include noisy samples, which degrade domain adaptation. Existing methods rely on the model's output prediction to detect and filter noisy samples, and further search for flat regions during optimization, which makes the optimization more robust on noisy samples. However, there are two issues: (1) the output prediction tends to be inaccurate due to domain shifts, weakening noisy-sample detection; (2) current approaches for searching flat regions focus on optimization to enhance the worst case, which ignores achieving flatness by avoiding the quick changing of losses. To address these challenges, we propose a model pruning-based test-time adaptation model for noisy data streams, named MoTTA, which leverages a new proposed filtering, output difference under pruning (ODP)-based filtering, and a flatness-aware entropy minimization (FlatEM). Specifically, to reduce the impact of inaccurate output predictions, ODP-based filtering measures the output difference of a sample before and after model pruning, which works even under inaccurate output. To improve the search for flat loss surfaces, FlatEM integrates zeroth-order flatness and first-order flatness (minimize the maximal gradient normalization with a weight perturbation constrained in a small Euclidean ball) on entropy minimization. To solve these hard maximum problems, we leverage Taylor expansion to obtain approximated results for optimization. FlatEM also adopts a parameter regularization to mitigate incorrect updates from noisy samples. The experiments show our advantages in dealing with noisy data streams at TTA comparable to existing baselines.

Abstract: Mesh watermark embeds secret messages in 3D meshes and decodes the message from watermarked meshes for ownership verification. Current watermarking methods directly hide secret messages in vertex and face sets of meshes. However, mesh is a discrete representation that uses vertex and face sets to describe a continuous signal, which can be discretized in other discrete representations with different vertex and face sets. This raises the question of whether the watermark can still be verified on the different discrete representations of the watermarked mesh. We conduct this research in an attackthen-defense manner by proposing a novel function space mesh watermark removal attack FuncEvade and then mitigating it through function space mesh watermarking FuncMark. In detail, FuncEvade generates a different discrete representation of a watermarked mesh by extracting it from the signed distance function of the watermarked mesh. We observe that the generated mesh can evade ALL previous watermarking methods. FuncMark mitigates FuncEvade by watermarking signed distance function through message-guided deformation. Such deformation can survive isosurfacing and thus be inherited by the extracted meshes for further watermark decoding. Extensive experiments demonstrate that FuncEvade achieves 100% evasion rate among all previous watermarking methods while achieving only 0.3% evasion rate on FuncMark. Besides, our FuncMark performs similarly on other metrics compared to state-of-the-art mesh watermarking methods.

Abstract: Computing an optimal classification tree that provably maximizes training performance within a given size limit, is NPhard, and in practice, most state-of-the-art methods do not scale beyond computing optimal trees of depth three. Therefore, most methods rely on a coarse binarization of continuous features to maintain scalability. We propose a novel algorithm that optimizes trees directly on the continuous feature data using dynamic programming with branch-and-bound. We develop new pruning techniques that eliminate many sub-optimal splits in the search when similar to previously computed splits and we provide an efficient subroutine for computing optimal depth-two trees. Our experiments demonstrate that these techniques improve runtime by one or more orders of magnitude over state-of-the-art optimal methods and improve test accuracy by 5% over greedy heuristics.

Abstract: Column Generation (CG) is an effective and iterative algorithm to solve largescale linear programs (LP). During each CG iteration, new columns are added to improve the solution of the LP. Typically, CG greedily selects one column with the most negative reduced cost, which can be improved by adding more columns at once. However, selecting all columns with negative reduced costs would lead to the addition of redundant columns that do not improve the objective value. Therefore, selecting the appropriate columns to add is still an open problem and previous machine-learning-based approaches for CG only add a constant quantity of columns per iteration due to the state-space explosion problem. To address this, we propose Fast Family Column Generation (FFCG) — a novel reinforcement-learning-based CG that selects a variable number of columns as needed in an iteration. Specifically, we formulate the column selection problem in CG as an MDP and design a reward metric that balances both the convergence speed and the number of redundant columns. In our experiments, FFCG converges faster on the common benchmarks and reduces the number of CG iterations by 77.1% for Cutting Stock Problem (CSP) and 84.8% for Vehicle Routing Problem with Time Windows (VRPTW), and a 71.4% reduction in computing time for CSP and 84.0% for VRPTW on average compared to several state-of-the-art baselines.

Abstract: For many realworld problems, users are often interested not only in finding a single solution but in obtaining a sufficiently diverse collection of solutions. In this work, we consider the Diverse SAT problem, aiming to find a set of diverse satisfying assignments for a given propositional formula. We propose a novel and effective local search algorithm, DiverSAT, to solve the problem. To cope with diversity, we introduce three heuristics and a perturbation strategy based on some relevant information. We conduct extensive experiments on a large number of public benchmarks, collected from semiformal hardware verification, logistics planning, and other domains. The results show that DiverSAT outperforms the existing algorithms on most of these benchmarks.

Abstract: Efficiently capturing consistent and complementary semantic features in context is crucial for Multimodal Emotion Recognition in Conversations (MERC). However, limited by the oversmoothing or low-pass filtering characteristics of spatial graph neural networks, are insufficient to accurately capture the long-distance consistency low-frequency information and complementarity high-frequency information of the utterances. To this end, this paper revisits the task of MERC from the perspective of the graph spectrum and proposes a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships and designs efficient Fourier graph operators (FGO) to extract long-distance high-frequency and low-frequency information, respectively. FGO can be stacked in multiple layers, which can effectively alleviate the over-smoothing problem. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration with high and low-frequency signals, thereby improving the ability of high and low-frequency information to reflect genuine emotions. Finally, GS-MCC inputs the coordinated high and low-frequency information into the MLP network and softmax function for emotion prediction. Extensive experiments have proven the superiority of the GS-MCC architecture proposed in this paper on two benchmark data sets.

Abstract: In recent years, knowledge graphs have been integrated into recommender systems as itemside auxiliary information, enhancing recommendation accuracy. However, constructing and integrating structural user-side knowledge remains a significant challenge due to the improper granularity and inherent scarcity of user-side features. Recent advancements in Large Language Models (LLMs) offer the potential to bridge this gap by leveraging their human behavior understanding and extensive real-world knowledge. Nevertheless, integrating LLM-generated information into recommender systems presents challenges, including the risk of noisy information and the need for additional knowledge transfer. In this paper, we propose an LLM-based user-side knowledge inference method alongside a carefully designed recommendation framework to address these challenges. Our approach employs LLMs to infer user interests based on historical behaviors, integrating this user-side information with item-side and collaborative data to construct a hybrid structure: the Collaborative Interest Knowledge Graph (CIKG). Furthermore, we propose a CIKG-based recommendation framework that includes a user interest reconstruction module and a cross-domain contrastive learning module to mitigate potential noise and facilitate knowledge transfer. We conduct extensive experiments on three real-world datasets to validate the effectiveness of our method. Our approach achieves state-of-the-art performance compared to competitive baselines, particularly for users with sparse interactions.

Abstract: Spatialtemporal forecasting and imputation are important for real-world intelligent systems. Most existing methods are tailored for individual forecasting or imputation tasks but are not designed for both. Additionally, they are less effective for zero-shot and few-shot learning. While pre-trained language model (PLM) have exhibited strong pattern recognition and reasoning abilities across various tasks, including few-shot and zero-shot learning, their applications in spatial-temporal data understanding has been constrained by insufficient modeling of complex correlations such as the temporal correlations, spatial connectivity, non-pairwise and high-order spatial-temporal correlations within data. In this paper, we propose STD-PLM for understanding both spatial and temporal properties of Spatial-Temporal Data with PLM, which is capable of implementing both spatial-temporal forecasting and imputation tasks. STD-PLM understands spatial-temporal correlations via explicitly designed spatial and temporal tokenizers. Topology-aware node embeddings are designed for PLM to comprehend and exploit the topology structure of data in inductive manner. Furthermore, to mitigate the efficiency issues introduced by the PLM, we design a sandglass attention module(SGA) combined with a specific constrained loss function, which significantly improves the model's efficiency while ensuring performance. Extensive experiments demonstrate that STD-PLM exhibits competitive performance and generalization capabilities across the forecasting and imputation tasks on various datasets. Moreover, STD-PLM achieves promising results on both few-shot and zero-shot tasks.

Abstract: Conversational Recommender Systems (CRS) aim to provide tailored recommendation responses via a chat interface, including both the user's preferred item and its accompanying explanation. However, due to its generative nature, CRS are prone to responding with factually incorrect explanations (i.e., hallucinations). To solve this problem, we propose incorporating a passage retrieval module into CRS with the objective of enhancing the factuality and informativeness of system responses. Specifically, we outline essential directions for employing a passage retrieval module in CRS to address the following critical issues: (1) the risk of passage retrieval not aligning with the user preference; (2) the absence of supervision for training a passage retrieval module. As a solution, we introduce ESPRESSO, a novel passage retrieval approach for CRS, to effectively tackle the above issues with two core ideas: adaptive item selection and relevancebased groupwise learning. Our extensive experiments show that ESPRESSO effectively resolves issues, achieving up to 36% higher Hit@3 accuracy than the best of 8 competing methods. Additionally, we verify that leveraging passages retrieved by ESPRESSO significantly improves the response quality of CRS.

Abstract: Graph Neural Networks (GNNs) have become essential tools for graph representation learning in various domains, such as social media and healthcare. However, they often suffer from fairness issues due to inherent biases in node attributes and graph structure, leading to unfair predictions. To address these challenges, we propose a novel GNN framework, DABGNN, that Disentangles, Amplifies, and deBiases attribute, structure, and potential biases in the GNN mechanism. DAB-GNN employs a disentanglement and amplification module that isolates and amplifies each type of bias through specialized disentanglers, followed by a debiasing module that minimizes the distance between subgroup distributions to ensure fairness. Extensive experiments on five datasets demonstrate that DAB-GNN significantly outperforms ten state-of-the-art competitors in terms of achieving an optimal balance between accuracy and fairness.

Abstract: Large language models (LLMs), endowed with exceptional reasoning capabilities, are adept at discerning profound user interests from historical behaviors, thereby presenting a promising avenue for the advancement of recommendation systems. However, a notable discrepancy persists between the sparse collaborative semantics typically found in recommendation systems and the dense token representations within LLMs. In our study, we propose a novel framework that harmoniously merges traditional recommendation models with the prowess of LLMs. We initiate this integration by transforming ItemIDs into sequences that align semantically with the LLMs' space, through the proposed Alignment Tokenization module. Additionally, we design a series of specialized supervised learning tasks aimed at aligning collaborative signals with the subtleties of natural language semantics. To ensure practical applicability, we optimize online inference by precaching the top-K results for each user, reducing latency and improving efficiency. Extensive experimental evidence indicates that our model markedly improves recall metrics and displays remarkable scalability of recommendation systems.

Department of Computer Science and Engineering, The Chinese University of Hong Kong IDEA Research, International Digital Economy Academy, Artificial Intelligence Thrust, Hong Kong University of Science and Technology (Guangzhou) IDEA Research, International Digital Economy Academy, IDEA Research, International Digital Economy Academy, Department of Engineering, University of Cambridge, IDEA Research, International Digital Economy Academy, IDEA Research, International Digital Economy Academy, Independent Researcher, Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract: Inductive knowledge graph completion (KGC) aims to predict missing triples with unseen entities. Recent works focus on modeling reasoning paths between the head and tail entity as direct supporting evidence. However, these methods depend heavily on the existence and quality of reasoning paths, which limits their general applicability in different scenarios. In addition, we observe that latent type constraints and neighboring facts inherent in KGs are also vital in inferring missing triples. To effectively utilize all useful information in KGs, we introduce CATS, a novel contextaware inductive KGC solution. With sufficient guidance from proper prompts and supervised fine-tuning, CATS activates the strong semantic understanding and reasoning capabilities of large language models to assess the existence of query triples, which consist of two modules. First, the type-aware reasoning module evaluates whether the candidate entity matches the latent entity type as required by the query relation. Then, the subgraph reasoning module selects relevant reasoning paths and neighboring facts, and evaluates their correlation to the query triple. Experiment results on three widely used datasets demonstrate that CATS significantly outperforms state-of-the-art methods in 16 out of 18 transductive, inductive, and few-shot settings with an average absolute MRR improvement of 7.2%.

Abstract: Graphbased fraud detection is crucial in identifying illegal activities in social networks, finance, and other sectors. Despite recent progress in this area, most of current researches typically require a large amount of annotated data to demonstrate its benefits. In practice, obtaining sufficient high-quality annotated data is challenging, limiting the effectiveness of model training. Therefore, leveraging extremely limited label information is crucial to enhance model performance. We propose a context-aware graph neural network (CGNN) to address this. CGNN performs category semantic decomposition on the contextual neighbor features of the center node to enrich the category semantics. In the neighbor message aggregation stage, the denoising attention mechanism enables the center node to adaptively aggregate heterophilic and homophilic information from neighbors. Particularly for unlabeled data, feature augmentation within the category subspace and consistency regularization driven by entropy minimization ensure that such data can further enhance model performance under explicit semantic guidance. We demonstrate on four real-world datasets that CGNN significantly outperforms other baseline methods with extremely limited labels.

Abstract: The existing federated learning (FL) methods for spatiotemporal forecasting fail to capture the inherent spatio-temporal heterogeneity, which calls for personalized FL (PFL) methods to model the spatio-temporally variant representations. While contrastive learning is promising in tackling spatio-temporal heterogeneity, the existing methods are noneffective in distinguishing positive and negative pairs and can hardly apply to PFL paradigm. To tackle this limitation, we propose a novel PFL method, named Federated dUal sEmantic aLignment-based contraStive learning (FUELS), which can adaptively align positive and negative pairs based on semantic similarity, thereby injecting precise spatio-temporal heterogeneity into the latent representation space by auxiliary contrastive tasks. From temporal perspective, a hard negative filtering module is introduced to dynamically align heterogeneous temporal representations for the supplemented intra-client contrastive task. From spatial perspective, we design lightweight-but-efficient prototypes as client-level semantic representations, based on which the server evaluates spatial similarity and yields client-customized global prototypes for the supplemented inter-client contrastive task. Extensive experiments demonstrate that FUELS outperforms state-of-the-art methods, with impressive communication cost reduction.

CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, University of Amsterdam, Amsterdam, The Netherlands, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China

Abstract: Neural ranking models (NRMs) have been shown to be highly effective in terms of retrieval performance. Unfortunately, they have also displayed a higher degree of sensitivity to attacks than previous generation models. To help expose and address this lack of robustness, we introduce a novel ranking attack framework named Attackin-the-Chain, which tracks interactions between large language models (LLMs) and NRMs based on chain-of-thought (CoT) prompting to generate adversarial examples under black-box settings. Our approach starts by identifying anchor documents with higher ranking positions than the target document as nodes in the reasoning chain. We then dynamically assign the number of perturbation words to each node and prompt LLMs to execute attacks. Finally, we verify the attack performance of all nodes at each reasoning step and proceed to generate the next reasoning step. Empirical results on two web search benchmarks show the effectiveness of our method.

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, Big Data Institute, Central South University, Changsha, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, Faculty of Information Science and Engineering, Ocean University of China, Qingdao, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China

Abstract: Multimodal Recommendation Systems (MRSs) boost traditional useritem interaction-based methods by incorporating multimodal information. However, existing methods ignore the inherent noise brought by (1) noisy semantic priors in multimodal content, and (2) noisy user interactions in history records, therefore diminishing model performance. To fill this gap, we propose to denoise MRSs by jointly EValuating structure Effectiveness and mitigating Noisy links (EVEN). Firstly, for semantic prior noise in multimodal content, EVEN builds item homogeneous consistency and denoises it by evaluating behavior-driven confidence. Secondly, for noise in user interactions, EVEN updates user feedback by denoising observed interactions following implicit contribution evaluation of high-order representations. Thirdly, EVEN performs cross-modal alignment through self-guided structure learning, reinforcing task-specific inter-modal dependency modeling and cross-modal fusion. Through extensive experiments on three widely-used datasets, EVEN achieves an average improvement of 8.95% and 5.90% in recommendation accuracy compared with LGMRec and FREEDOM, respectively, without extending the total training time.

Abstract: Graph Neural Networks (GNNs) have been shown vulnerable to graph adversarial attacks. Current robust graph representation learning methods mainly defend against graph structure attack, and improves performance of GNNs. However node feature in graph can been easily attacked in reality. The joint defense on graph structure and feature dual attacks remains challenging yet less studied. To fulfill this gap, we propose Adversarial Contrastive Graph Masked AutoEncoder (ACGMAE) to defend against graph structure and feature dual attacks. ACGMAE employs adversarial feature masking for reconstructing node feature to mitigate the influence of feature attack. ACGMAE employs contrastive learning on kNN graph and attacked graph, considers neighbor nodes as positive samples, and further calculates their probabilities being true positive to mitigate the effect of adversarial edges. Extensive experiments on node classification and clustering demonstrate the effectiveness of the proposed ACGMAE especially under graph structure and feature dual attacks.

Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China., Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.

Abstract: Learningbased compression shows competitive compression ratios for genomics data. It often includes three types of compressors: static, adaptive and semi-adaptive. However, these existing compressors suffer from inferior compression ratios or throughput, and adaptive compressors also faces model cold-start problems. To address these issues, we propose DeepGeCo, a novel genomics data lossless adaptive compression framework with (s,k)-mer encoding and deep neural networks, involving three compression modes (MINI for static, PLUS for adaptive, ULTRA for semi-adaptive) for flexible requirements of compression ratios or throughput. In DeepGeCo, (1) we develop BiGRU and Transformer as the backbone to build Warm-Start and Supporter models in terms of cold-start problems. (2) We introduce (s,k)-mer encoding to pre-process genomics data before feeding it into the DNN model for improve model throughput, and we propose a new metric - Ranking of Throughput and Compression Ratio (RTCR) for effective encoding parameters selection. (3) We design a threshold controller and a probabilistic mixer within the backbone to balance compression ratios and model throughput. Experiments on 10 real-world datasets show that DeepGeCo's three compression modes improve up to a 22.949X average throughput and up to a 31.095% average compression ratio improvement while occupying low CPU or GPU memory.

Department of Computer Science, City University of Hong Kong, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Advance Computing and Storage Lab, Huawei Technologies, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Advance Computing and Storage Lab, Huawei Technologies, Advance Computing and Storage Lab, Huawei Technologies, Department of Electronic Engineering, Tsinghua University, Department of Computer Science, City University of Hong Kong

Abstract: Retrosynthesis prediction focuses on identifying reactants capable of synthesizing a target product. Typically, the retrosynthesis prediction involves two phases: Reaction Center Identification and Reactant Generation. However, we argue that most existing methods suffer from two limitations in the two phases: (i) Existing models do not adequately capture the ``face'' information in molecular graphs for the reaction center identification. (ii) Current approaches for the reactant generation predominantly use sequence generation in a 2D space, which lacks versatility in generating reasonable distributions for completed reactive groups and overlooks molecules' inherent 3D properties. To overcome the above limitations, we propose GDiffRetro. For the reaction center identification, GDiffRetro uniquely integrates the original graph with its corresponding dual graph to represent molecular structures, which helps guide the model to focus more on the faces in the graph. For the reactant generation, GDiffRetro employs a conditional diffusion model in 3D to further transform the obtained synthon into a complete reactant. Our experimental findings reveal that GDiffRetro outperforms stateof-the-art semi-template models across various evaluative metrics.

Abstract: Recent studies have attempted to refine the Transformer architecture to demonstrate its effectiveness in LongTerm Time Series Forecasting (LTSF) tasks. Despite surpassing many linear forecasting models with ever-improving performance, we remain skeptical of Transformers as a solution for LTSF. We attribute the effectiveness of these models largely to the adopted Patch mechanism, which enhances sequence locality to an extent yet fails to fully address the loss of temporal information inherent to the permutation-invariant self-attention mechanism. Further investigation suggests that simple linear layers augmented with the Patch mechanism may outperform complex Transformer-based LTSF models. Moreover, diverging from models that use channel independence, our research underscores the importance of cross-variable interactions in enhancing the performance of multivariate time series forecasting. The interaction information between variables is highly valuable but has been misapplied in past studies, leading to suboptimal cross-variable models. Based on these insights, we propose a novel and simple Patch-based MLP (PatchMLP) for LTSF tasks. Specifically, we employ simple moving averages to extract smooth components and noise-containing residuals from time series data, engaging in semantic information interchange through channel mixing and specializing in random noise with channel independence processing. The PatchMLP model consistently achieves state-of-the-art results on several real-world datasets. We hope this surprising finding will spur new research directions in the LTSF field and pave the way for more efficient and concise solutions.

Abstract: Despite the superior performance of Large language models on many NLP tasks, they still face significant limitations in memorizing extensive world knowledge. Recent studies have demonstrated that leveraging the RetrievalAugmented Generation (RAG) framework, combined with Knowledge Graphs that encapsulate extensive factual data in a structured format, robustly enhances the reasoning capabilities of LLMs. However, deploying such systems in real-world scenarios presents challenges: the continuous evolution of non-stationary environments may lead to performance degradation and user satisfaction requires a careful balance of performance and responsiveness. To address these challenges, we introduce a Multi-objective Multi-Armed Bandit enhanced RAG framework, supported by multiple retrieval methods with diverse capabilities under rich and evolving retrieval contexts in practice. Within this framework, each retrieval method is treated as a distinct "arm''. The system utilizes real-time user feedback to adapt to dynamic environments, by selecting the appropriate retrieval method based on input queries and the historical multi-objective performance of each arm. Extensive experiments conducted on two benchmark KGQA datasets demonstrate that our method significantly outperforms baseline methods in non-stationary settings while achieving state-of-the-art performance in station environments.

Abstract: Learningbased point cloud compression methods have made significant progress in terms of performance. However, these methods still encounter challenges including high complexity, limited compression modes, and a lack of support for variable rate, which restrict the practical application of these methods. In order to promote the development of practical point cloud compression, we propose an efficient unified point cloud geometry compression framework, dubbed as UniPCGC. It is a lightweight framework that supports lossy compression, lossless compression, variable rate and variable complexity. First, we introduce the Uneven 8-Stage Lossless Coder (UELC) in the lossless mode, which allocates more computational complexity to groups with higher coding difficulty, and merges groups with lower coding difficulty. Second, Variable Rate and Complexity Module (VRCM) is achieved in the lossy mode through joint adoption of a rate modulation module and dynamic sparse convolution. Finally, through the dynamic combination of UELC and VRCM, we achieve lossy compression, lossless compression, variable rate and complexity within a unified framework. Compared to the previous state-of-the-art method, our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression, while also supporting variable rate and variable complexity.

Abstract: With the rise of social media and LocationBased Social Networks (LBSN), check-in data across platforms has become crucial for User Identity Linkage (UIL). These data not only reveal users' spatio-temporal information but also provide insights into their behavior patterns and interests. However, cross-platform identity linkage faces challenges like poor data quality, high sparsity, and noise interference, which hinder existing methods from extracting cross-platform user information. To address these issues, we propose a Correlation-Attention Masked Transformer for User Identity Link age Network (MT-Link), a transformer-based framework to enhance model performance by learning spatio-temporal co-occurrence patterns of cross-platform users. Our model effectively captures spatio-temporal co-occurrence in cross-platform user check-in sequences. It employs a correlation attention mechanism to detect the spatio-temporal co-occurrence between user check-in sequences. Guided by attention weight maps, the model focuses on co-occurrence points while filtering out noise, ultimately improving classification performance. Experimental results show that our model significantly outperforms state-of-the-art baselines by 12.92%-17.76% and 5.80%-8.38% improvements in terms of Macro-F1 and Area Under Curve (AUC).

Abstract: Graph unlearning, which aims to eliminate the influence of specific nodes, edges, or attributes from a trained Graph Neural Network (GNN), is essential in applications where privacy, bias, or data obsolescence is a concern. However, existing graph unlearning techniques often necessitate additional training on the remaining data, leading to significant computational costs, particularly with largescale graphs. To address these challenges, we propose a two-stage training-free approach, Erase then Rectify (ETR), designed for efficient and scalable graph unlearning while preserving the model utility. Specifically, we first build a theoretical foundation showing that masking parameters critical for unlearned samples enables effective unlearning. Building on this insight, the Erase stage strategically edits model parameters to eliminate the impact of unlearned samples and their propagated influence on intercorrelated nodes. To further ensure the GNN's utility, the Rectify stage devises a gradient approximation method to estimate the model's gradient on the remaining dataset, which is then used to enhance model performance. Overall, ETR achieves graph unlearning without additional training or full training data access, significantly reducing computational overhead and preserving data privacy. Extensive experiments on seven public datasets demonstrate the consistent superiority of ETR in model utility, unlearning efficiency, and unlearning effectiveness, establishing it as a promising solution for real-world graph unlearning challenges.

Abstract: The vast, complex, and dynamic nature of social message data has posed challenges to social event detection (SED). Despite considerable effort, these challenges persist, often resulting in inadequately expressive message representations (ineffective) and prolonged learning durations (inefficient). In response to the challenges, this work introduces an unsupervised framework, HyperSED (Hyperbolic SED). Specifically, the proposed framework first models social messages into semanticbased message anchors, and then leverages the structure of the anchor graph and the expressiveness of the hyperbolic space to acquire structure- and geometry-aware anchor representations. Finally, HyperSED builds the partitioning tree of the anchor message graph by incorporating differentiable structural information as the reflection of the detected events. Extensive experiments on public datasets demonstrate HyperSED's competitive performance, along with a substantial improvement in efficiency compared to the current state-of-the-art unsupervised paradigm. Statistically, HyperSED boosts incremental SED by an average of 2%, 2%, and 25% in NMI, AMI, and ARI, respectively; enhancing efficiency by up to 37.41 times and at least 12.10 times, illustrating the advancement of the proposed framework.

Abstract: Graph Neural Networks (GNNs) demonstrate superior performance in various graph learning tasks, yet their wider realworld application is hindered by the computational overhead when applied to large-scale graphs. To address the issue, the Graph Lottery Hypothesis (GLT) has been proposed, advocating the identification of subgraphs and subnetworks, i.e., winning tickets, without compromising performance. The effectiveness of current GLT methods largely stems from the use of iterative magnitude pruning (IMP), which offers greater stability and better performance than one-shot pruning. However, identifying GLTs is highly computationally expensive, due to the iterative pruning and retraining required by IMP. In this paper, we reevaluate the correlation between one-shot pruning and IMP: while one-shot tickets are suboptimal compared to IMP, they offer a fast track to tickets with a stronger performance. We introduce a one-shot pruning and denoising framework to validate the efficacy of the fast track. Compared to current IMP-based GLT methods, our framework achieves a double-win situation of graph lottery tickets with higher sparsity and faster speeds. Through extensive experiments across 4 backbones and 6 datasets, our method demonstrates a 1.32%-45.62% improvement in weight sparsity and a 7.49%-22.71% increase in graph sparsity, along with a 1.7-44× speedup over IMP-based methods and 95.3%-98.6% MAC savings.

Abstract: Dynamic point cloud compression (DPCC) is crucial in applications like autonomous driving and AR/VR. Current compression methods face challenges with complexity management and rate control. This paper introduces a novel dynamic coding framework that supports variable bitrate and computational complexities. Our approach includes a slimmable framework with multiple coding routes, allowing for efficient RateDistortion-Complexity Optimization (RDCO) within a single model. To address data sparsity in inter-frame prediction, we propose the coarse-to-fine motion estimation and compensation module that deconstructs geometric information while expanding the perceptive field. Additionally, we propose a precise rate control module that content-adaptively navigates point cloud frames through various coding routes to meet target bitrates. The experimental results demonstrate that our approach reduces the average BD-Rate by 5.81% and improves the BD-PSNR by 0.42 dB compared to the state-of-the-art method, while keeping the average bitrate error at 0.40%. Moreover, the average coding time is reduced by up to 44.6% compared to D-DPCC, underscoring its efficiency in real-time and bitrate-constrained DPCC scenarios.

Abstract: With the burst of big data, 2D3D cross-modal retrieval has received increasing attention, which aims to retrieve relevant data from one modality given the query from the other modality. In this paper, we study an underexplored yet practical problem of semi-supervised 2D-3D cross-modal retrieval, which could suffer from serious label scarcity in real-world applications. Moreover, the huge heterogeneous gap could deteriorate the process of learning from unlabeled data. In this work, we propose a novel approach named Decoupled Discriminative Learning with Bigraph-aware Alignment (DREAM) for semi-supervised 2D-3D cross-modal retrieval. The core of our DREAM is to decouple the label prediction and reliability measurement processes to reduce overconfident samples in discriminative learning. In particular, we enhance a label prediction module with label propagation from labeled samples and additionally introduce a reliability measurement module to learn the scores of predicted labels. To reduce class-related bias, we compare reliability scores with class-specific adaptive thresholds to identify samples for additional learning. In addition, negative labels are estimated for unselected samples, which guides soft semantic learning to make the best use of all the information. To further minimize the heterogeneous gap, we build a bigraph graph that connects cross-modal similar examples and then conduct learning to cluster with most edges kept for alignment. Extensive experiments on several benchmark datasets validate the superiority of the proposed DREAM.

School of Artificial Intelligence, Jilin University, Changchun Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University, Changchun, School of Computer Science and Engineering, Nanyang Technological University, Singapore, School of Artificial Intelligence, Jilin University, Changchun Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University, Changchun, School of Artificial Intelligence, Jilin University, Changchun Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University, Changchun, School of Artificial Intelligence, Jilin University, Changchun Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University, Changchun

Abstract: Graph neural networks (GNNs) have shown significant success in learning graph representations. However, recent studies reveal that GNNs often fail to outperform simple MLPs on heterophilous graph tasks, where connected nodes may differ in features or labels, challenging the homophily assumption. Existing methods addressing this issue often overlook the importance of information granularity and rarely consider implicit relationships between distant nodes. To overcome these limitations, we propose the Granular and Implicit Graph Network (GRAIN), a novel GNN model specifically designed for heterophilous graphs. GRAIN enhances node embeddings by aggregating multiview information at various granularity levels and incorporating implicit data from distant, non-neighboring nodes. This approach effectively integrates local and global information, resulting in smoother, more accurate node representations. We also introduce an adaptive graph information aggregator that efficiently combines multi-granularity and implicit data, significantly improving node representation quality, as shown by experiments on 13 datasets covering varying homophily and heterophily. GRAIN consistently outperforms 12 state-of-the-art models, excelling on both homophilous and heterophilous graphs.

Abstract: Graph Neural Networks (GNNs) have exhibited remarkable capabilities for dealing with graphstructured data. However, recent studies have revealed their fragility to adversarial attacks, where imperceptible perturbations to the graph structure can easily mislead predictions. To enhance adversarial robustness, some methods attempt to learn robust representation through improving GNN architectures. Subsequently, another approach suggests that these GNNs might taint feature information and have poor classifier performance, leading to the introduction of Graph Contrastive Learning (GCL) methods to build a refining-classifying pipeline. However, existing methods focus on global-local contrastive strategies, which fails to address the robustness issues inherent in the contexts of adversarial robustness. To address these challenges, we propose a novel paradigm named GRANCE to enhance the robustness of learned representations by shifting the focus to local neighborhoods. Specifically, a dual neighborhood contrastive learning strategy is designed to extract local topological and semantic information. Paired with a neighbor estimator, the strategy can learn robust representations that are resilient to adversarial edges. Additionally, we also provide an improved GNN as classifier. Theoretical analyses provide a stricter lower bound of mutual information, ensuring the convergence of GRANCE. Extensive experiments validate the effectiveness of GRANCE compared to state-of-the-art baselines against various adversarial attacks.

Abstract: The design of multiitem, multi-bidder auctions involves a delicate balancing act of economic objectives, bidder incentives, and real-world complexities. Efficient auctions, that is, auctions that allocate items to maximize total bidder value, are practically desirable since they promote the most economically beneficial use of resources. Arguably the biggest drawback of efficient auctions, however, is their potential to generate very low revenue. In this work, we show how the auction designer can artificially inject competition into the auction to boost revenue while striving to maintain efficiency. First, we invent a new auction family that enables the auction designer to specify competition in a precise, expressive, and interpretable way. We then introduce a new model of bidder behavior and individual rationality to understand how bidders act when prices are too competitive. Next, under our bidder behavior model, we use our new competitive auction class to derive the globally revenue-optimal efficient auction under two different knowledge models for the auction designer: knowledge of full bidder value distributions and knowledge of bidder value quantiles. Finally, we study a third knowledge model for the auction designer: knowledge of historical bidder valuation data. In this setting we present sample and computationally efficient learning algorithms that find high-revenue probably-efficient competitive auctions from bidder data. Our learning algorithms are instance adaptive and can be run in parallel across bidders, unlike most prior approaches to data-driven auction design.

Abstract: In Hotelling's model of spatial competition, a unit mass of voters is distributed in the interval [0,1] (with their location corresponding to their political persuasion), and each of m candidates selects as a strategy their distinct position in this interval. Each voter votes for the nearest candidate, and candidates choose their strategy to maximize their votes. It is known that if there are more than two candidates, equilibria may not exist in this model. It was unknown, however, how close to an equilibrium one could get. Our work studies approximate equilibria in this model, where a strategy profile is an (additive) ϵequilibria if no candidate can increase their votes by ϵ, and provides tight or nearly-tight bounds on the approximation ϵ achievable. We show that for 3 candidates, for any distribution of the voters, ϵ ≥ 1/12. Thus, somewhat surprisingly, for any distribution of the voters and any strategy profile of the candidates, at least 1/12th of the total votes is always left ``on the table.'' Extending this, we show that in the worst case, there exist voter distributions for which ϵ ≥ 1/6, and this is tight: one can always compute a 1/6-approximate equilibria in polynomial time. We then study the general case of m candidates, and show that as m grows large, we get closer to an exact equilibrium: one can always obtain a 1/(m+1)-approximate equilibria in polynomial time. We show this bound is asymptotically tight, by giving voter distributions for which ϵ ≥ 1/(m+3).

Abstract: Equilibrium problems in Bayesian auction games can be described as systems of differential equations. Depending on the model assumptions, these equations might be such that we do not have a rigorous mathematical solution theory. The lack of analytical or numerical techniques with guaranteed convergence for the equilibrium problem has plagued the field and limited equilibrium analysis to rather simple auction models such as singleobject auctions. Recent advances in equilibrium learning led to algorithms that find equilibrium under a wide variety of model assumptions. Monotonicity and the Minty condition are the known sufficient conditions for learning algorithms to converge to an equilibrium in games. Not much is known about convergence of learning algorithms beyond these conditions. We analyze first- and second-price auctions where simple learning algorithms consistently converge to an equilibrium. The analysis is challenging, because these properties need to be shown in infinite dimensions. Interestingly, we show that neither monotonicity nor pseudo- or quasi-monotonicity holds for the respective variational inequalities (VIs). The second-price auction's equilibrium is a Minty-type solution, but the first-price auction is not. However, the analysis via infinite-dimensional VIs allows us to get ex-post guarantees for gradient-based algorithms. We show that the Bayes--Nash equilibrium is the unique solution to the VI within the class of uniformly increasing bid functions, which ensures that gradient-based algorithms attain the equilibrium in case of convergence, as also observed in numerical experiments.

Abstract: In twoplayer cooperative games, agents can play together effectively when they have accurate assumptions about how their teammate will behave, but may perform poorly when these assumptions are inaccurate. In language games, failure may be due to disagreement in the understanding of either the semantics or pragmatics of an utterance. We model coarse uncertainty in semantics using a prior distribution of language models and uncertainty in pragmatics using the cognitive hierarchy, combining the two aspects into a single prior distribution over possible partner types. Fine-grained uncertainty in semantics is modeled using noise that is added to the embeddings of words in the language. To handle all forms of uncertainty we construct agents that learn the behavior of their partner using Bayesian inference and use this information to maximize the expected value of a heuristic function. We test this approach by constructing Bayesian agents for the game of Codenames, and show that they perform better in experiments where semantics is uncertain.

Abstract: Imagine we want to split a group of agents into teams in the most efficient way, considering that each agent has their own preferences about their teammates. This scenario is modeled by the extensively studied Coalition Formation problem. Here, we study a version of this problem where each team must additionally be of bounded size. We conduct a systematic algorithmic study, providing several intractability results as well as multiple exact algorithms that scale well as the input grows (FPT), which could prove useful in practice. Our main contribution is an algorithm that deals efficiently with treelike structures (bounded treewidth) for ``small'' teams. We complement this result by proving that our algorithm is asymptotically optimal. Particularly, there can be no algorithm that vastly outperforms the one we present, under reasonable theoretical assumptions, even when considering star-like structures (bounded vertex cover number).

Abstract: We provide mechanisms and new metric distortion bounds for lineup elections. In such elections, a set of n voters, k candidates, and ell positions are all located in a metric space. The goal is to choose a set of candidates and assign them to different positions, so as to minimize the total cost of the voters. The cost of each voter consists of the distances from itself to the chosen candidates (measuring how much the voter likes the chosen candidates, or how similar it is to them), as well as the distances from the candidates to the positions they are assigned to (measuring the fitness of the candidates for their positions). Our mechanisms, however, do not know the exact distances, and instead produce good outcomes while only using a smaller amount of information, resulting in small distortion. We consider several different types of information: ordinal voter preferences, ordinal position preferences, and knowing the exact locations of candidates and positions, but not those of voters. In each of these cases, we provide constant distortion bounds, thus showing that only a small amount of information is enough to form outcomes close to optimum in line-up elections.

Abstract: Vehicleto-infrastructure (V2I) cooperative perception systems can enhance the sensing abilities of autonomous vehicles. Existing V2I solutions often consider LiDARs devices instead of cameras, the most prevalent sensors with low cost and wide installation. In addition, a major challenge that has been underexplored is the time asynchrony between image frames from different sources. This asynchrony arises because of clock differences, varying times involved in data processing and transmission, causing uncertain delays that complicate data alignment and potentially reduce perception accuracy. We propose BEVSync, a camera-based V2I cooperative perception system that adaptively aligns frames from the ego-vehicle and infrastructure by compensating for motion deviations. Specifically, we develop an extractor-compensator model to extract and predict perceptual features using historical frames, thereby smoothing out the data misalignment. Experiments on the real-world dataset DAIR-V2X show that our approach surpasses existing methods in terms of performance and robustness.

Abstract: Imageguided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.

Abstract: Expressing negative conditions is a crucial feature of query languages for knowledge bases (KBs). Answering such queries over ontological KBs, however, is a very challenging task that becomes undecidable even for lightweight Description Logic (DL) ontologies. Such negative results hold even for Conjunctive Queries (CQs) equipped with basic forms of negative conditions such as the socalled safe negation or inequality atoms. One ontology language that is seemingly unaffected by these results is (the DL counterpart of) RDFS even if equipped with disjointness axioms. Answering CQs with inequalities over such ontologies is known to be Pi^p_2-complete, if the number of inequality atoms is unbounded, and NP-complete if we limit this number to one. Notably, these results leave open the cases of CQs with a fixed number greater than two of inequality atoms. Additionally, such a thorough analysis is missing for CQs with safe negation. In this paper, we embark in a refined analysis of the combined complexity of answering CQs with inequality atoms and safe negation over RDFS ontologies augmented with disjointness axioms. Firstly, we provide a unified Pi^p_2 query answering algorithm for the general problem. Secondly, we confirm the generally held conjecture according to which answering CQs with two inequality atoms over such ontologies is already Pi^p_2-hard. This result closes an important gap in the current literature and has an impact on the widely influential problem of query containment. Lastly, for CQs with safe negation, we prove a behavior similar to that of CQs with inequality atoms. Specifically, we show that answering CQs with at most one negated atom can be done in NP, while allowing at most two negated atoms is sufficient to obtain Pi^p_2-hardness.

Abstract: Equational reasoning is one of the most intuitive and widely used types of symbolic reasoning. In this setting, the goal is to determine whether a given ground equation t=t' follows as a consequence of a set of equational axioms E using the process of replacing equals with equals. An equation t=t' is variablepreserving if each variable occurring in t occurs in t' and vice versa. Such equations capture essential algebraic properties, such as commutativity, associativity and distributivity, which arise in a wide variety of contexts. In this work, we show that for each fixed set E of variable-preserving equations, the set of ground equations derivable from E in depth at most d is soundly over-approximable in fixed-parameter tractable time. More specifically, we devise an algorithm that takes as input a set E of variable-preserving equations and a target ground equation t=t', and always halts with a YES or NO answer. 1) If equation t=t' can be derived from E in depth at most d, the algorithm always halts with a YES. 2) If equation t=t' does not belong to the equational closure of E, then the algorithm always halts with a NO. In other words, the set of YES instances contains the set of ground equations that can be deduced from E in depth at most d, and possibly other equations that require derivations of higher depth. However, this set contains no ground equation that is not in the equational closure of E. For this reason, the algorithm is sound. Our algorithm works in time f(d) * |t| * |t'|, where |t| and |t'| are the number of symbols in t and t' respectively, d is the depth parameter, and f(d) is a function whose growth depends only on d and on parameters associated with the equations in E.

Abstract: Conditional independence is a crucial concept supporting adequate modelling and efficient reasoning in probabilistics. In knowledge representation, the idea of conditional independence has also been introduced for specific formalisms, such as propositional logic and belief revision. In this paper, the notion of conditional independence is studied in the algebraic framework of approximation fixpoint theory. This gives a languageindependent account of conditional independence that can be straightforwardly applied to any logic with fixpoint semantics. It is shown how this notion allows to reduce global reasoning to parallel instances of local reasoning, leading to fixed-parameter tractability results. Furthermore, relations to existing notions of conditional independence are discussed and the framework is applied to normal logic programming.

Abstract: Coordination and joint ability are important topics in representation and reasoning about multiagent systems. The modal logic JAADL proposed by Liu et al. extends ATL with joint abilities, which enables reasoning about whether a coalition of agents can coordinate and achieve a goal without communication. However, like ATL, strategic abilities in JAADL are defined in terms of combinatorial strategies, which are functions from histories or states to actions. On the other hand, there has been research on reasoning about natural strategic abilities, where a natural strategy is formalized as a sequence of condition-action pairs, making it more human-friendly than combinatorial strategy. In this work, we propose SJAADL, a variation of JAADL where strategic abilities are defined in terms of structured strategies represented with LDL (linear dynamic logic) formulas, with bounded complexity. We use nondeterministic strategies since they are more expressive, natural and succinct than determinstic ones. We present syntax and semantics of SJAADL. We show that model checking SJAADL can be done in time quasi-polynomial with the model size, exponential with the formula size, and with the complexity bound of structured strategies, exponential in the memoryless case and double exponential in the memoryful case. Finally, we introduce the problem of synthesizing norms to achieve joint abilities, and give two algorithms for it.

Abstract: Knowledge compilation is a method of transforming knowledge into a compressed and tractable form for permitting more efficient operations. For Boolean functions, numerous representations have been proposed that enhance succinctness and tractability. In this paper, we introduce a new representation named structured Decomposable AndSum Circuit (st-DASC), which employs AND and SUM nodes with signed edges, in place of the standard AND and OR nodes with unsigned edges. Notably, incorporating negative signs permits polytime logical negation. By following a knowledge compilation map, we show that st-DASCs are more succinct than Sentential Decision Diagrams (SDDs) while maintaining support for every operation on the knowledge compilation map that SDD supports. Furthermore, st-DASCs are even more succinct than structured d-DNNFs (st-d-DNNFs), which are more succinct than SDDs although they support fewer operations than SDDs. Accordingly, st-DASCs break the traditional trade-off between succinctness and tractability over SDDs and st-d-DNNFs.

Abstract: A knowledge compilation map analyzes tractable operations in Boolean function representations and compares their succinctness. This enables the selection of appropriate representations for different applications. In the knowledge compilation map, all representation classes are subsets of the negation normal form (NNF). However, Boolean functions may be better expressed by a representation that is different from that of the NNF subsets. In this study, we treat tensor trains as Boolean function representations and analyze their succinctness and tractability. Our study is the first to evaluate the expressiveness of a tensor decomposition method using criteria from knowledge compilation literature. Our main results demonstrate that tensor trains are more succinct than ordered binary decision diagrams (OBDDs) and support the same polytime operations as OBDDs. Our study broadens their application by providing a theoretical link between tensor decomposition and existing NNF subsets.

Abstract: Capitalizing on the intuitive premise that shape characteristics are more robust to perturbations, we bridge adversarial graph learning with the emerging tools from computational topology, namely, persistent homology representations of graphs. We introduce the concept of witness complex to adversarial analysis on graphs, which allows us to focus only on the salient shape characteristics of graphs, yielded by the subset of the most essential nodes (i.e., landmarks), with minimal loss of topological information on the whole graph. The remaining nodes are then used as witnesses, governing which higherorder graph substructures are incorporated into the learning process. Armed with the witness mechanism, we design Witness Graph Topological Layer (WGTL), which systematically integrates both local and global topological graph feature representations, the impact of which is, in turn, automatically controlled by the robust regularized topological loss. Given the attacker's budget, we derive the important stability guarantees of both local and global topology encodings and the associated robust topological loss. We illustrate the versatility and efficiency of WGTL by its integration with five GNNs and three existing non-topological defense mechanisms. Our extensive experiments demonstrate that WGTL boosts the robustness of GNNs across a range of perturbations and against a range of adversarial attacks.

Abstract: We provide improved upper and lower bounds for the MinSum-Radii (MSR) and Min-Sum-Diameters (MSD) clustering problems with a bounded number of clusters k. In particular, we propose an exact MSD algorithm with running-time n^O(k). We also provide (1 + Ɛ) approximation algorithms for both MSR and MSD with running-times of O(kn) + (1/Ɛ)^O(dk) in metrics spaces of doubling dimension d. Our algorithms extend to k-center, improving upon previous results, and to α-MSR, where radii are raised to the α power for α > 1. For α-MSD we prove an exponential time ETH-based lower bound for α > log 3. All algorithms can also be modified to handle outliers. Moreover, we can extend the results to variants that observe fairness constraints, as well as to the general framework of mergeable clustering, which includes many other popular clustering variants. We complement these upper bounds with ETH-based lower bounds for these problems, in particular proving that n^O(k) time is tight for MSR and α-MSR even in doubling spaces, and that 2^o(k) bounds are impossible for MSD.

Abstract: We prove that using global observables to train the matrix product state ansatz results in the vanishing of all partial derivatives, also known as barren plateaus, while using local observables avoids this. This ansatz is widely used in quantum machine learning for learning weakly entangled state approximations. Additionally, we empirically demonstrate that in many cases, the objective function is an inner product of almost sparse operators, highlighting the potential for classically simulating such a learning problem with few quantum resources. All our results are experimentally validated across various scenarios.

Abstract: Is it better to perform tennis training in a pristine indoor environment or a noisy outdoor one? To model this problem, here we investigate whether shifts in the transition probabilities between the training and testing environments in reinforcement learning problems can lead to better performance under certain conditions. We generate new Markov Decision Processes (MDPs) starting from a given MDP, by adding quantifiable, parametric noise into the transition function. We refer to this process as Noise Injection and the resulting environments as δenvironments. This process allows us to create variations of the same environment with quantitative control over noise serving as a metric of distance between environments. Conventional wisdom suggests that training and testing on the same MDP should yield the best results. In stark contrast, we observe that agents can perform better when trained on the noise-free environment and tested on the noisy δ-environments, compared to training and testing on the same δ-environments. We confirm that this finding extends beyond noise variations: it is possible to showcase the same phenomenon in ATARI game variations including varying Ghost behavior in PacMan, and Paddle behavior in Pong. We demonstrate this intriguing behavior in 60 different variations of ATARI games, including PacMan, Pong, and Breakout. We refer to this phenomenon as the Indoor-Training Effect. Code to reproduce our experiments and to implement Noise Injection.

Abstract: In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training highperformance models. However, the communication bottleneck, especially for high-dimensional data, is a challenge. Several techniques have been developed to overcome this problem. These include communication compression and implementation of local steps, which work particularly well when there is similarity of local data samples. In this paper, we study the synergy of these approaches for efficient distributed optimization. We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity, leveraging variance reduction and error feedback frameworks. In terms of communication time our theory gives ?(1+[M^(-¼)+?^(-½)]√(? /?)) complexity for unbiased compressors and ?(1+β^(¼)√(? /?)) for biased ones, where M is the number of computational nodes, β is the compression power, ? is the similarity measure and ? is the parameter of strong convexity of the objective. Our theoretical results are of record and confirmed by experiments on different average losses and datasets.

Abstract: Current methods for time series forecasting struggle in the online scenario, since it is difficult to preserve longterm dependency while adapting short-term changes when data are arriving sequentially. Although some recent methods solve this problem by controlling the updates of latent states, they cannot disentangle the long/short-term states, leading to the inability to effectively adapt to nonstationary. To tackle this challenge, we propose a general framework to disentangle long/short-term states for online time series forecasting. Our idea is inspired by the observations where short-term changes can be led by unknown interventions like abrupt policies in the stock market. Based on this insight, we formalize a data generation process with unknown interventions on short-term states. Under mild assumptions, we further leverage the independence of short-term states led by unknown interventions to establish the identification theory to achieve the disentanglement of long/short-term states. Built on this theory, we develop a Long Short-Term Disentanglement model (LSTD) to extract the long/short-term states with long/short term encoders, respectively. Furthermore, the LSTD model incorporates a smooth constraint to preserve the long-term dependencies and an interrupted dependency constraint to enforce the forgetting of short-term dependencies, together boosting the disentanglement of long/short-term states. Experimental results on several benchmark datasets show that our LSTD model outperforms existing methods for online time series forecasting, validating its efficacy in real-world applications.

Abstract: In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pretrained diffusion models. The method reaches state-of-the-art performances in terms of FID and CLIP-Score for few steps image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is also exposed across several tasks such as *text-to-image*, *inpainting*, *face-swapping*, *super-resolution* and using different backbones such as UNet-based denoisers (SD1.5, SDXL), DiT (Pixart) and MMDiT (SD3), as well as adapters. In all cases, the method allowed to reduce drastically the number of sampling steps while maintaining very high-quality image generation.

Abstract: Incomplete multiview clustering has become one of the important research problems due to the extensive missing multi-view data in the real world. Although the existing methods have made great progress, there are still some problems: 1) most methods cannot effectively mine the information hidden in the missing data; 2) most methods typically divide representation learning and clustering into two separate stages, but this may affect the clustering performance as the clustering results directly depend on the learned representation. To address these problems, we propose a novel incomplete multi-view clustering method with hierarchical information transfer. Firstly, we design the view-specific Graph Convolutional Networks (GCN) to obtain the representation encoding the graph structure, which is then fused into the consensus representation. Secondly, considering that one layer of GCN transfers one-order neighbor node information, the global graph propagation with the consensus representation is proposed to handle the missing data and learn deep representation. Finally, we design a weight-sharing pseudo-classifier with contrastive learning to obtain an end-to-end framework that combines view-specific representation learning, global graph propagation with hierarchical information transfer, and contrastive clustering for joint optimization. Extensive experiments conducted on several commonly-used datasets demonstrate the effectiveness and superiority of our method in comparison with other state-of-the-art approaches.

College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China, School of Artificial Intelligence, Jilin University, Jilin, China, University of Pittsburgh, College of Control Science and Engineering, China University of Petroleum (East China), Qingdao, China, College of Informatics, Huazhong Agricultural University, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China

Abstract: In recent years, there have been a growing number of works studying the generalization properties of stochastic gradient descent (SGD) from the perspective of algorithmic stability. However, few of them devote to simultaneously studying the generalization and optimization for the nonconvex setting, especially pairwise SGD with heavy-tailed gradient noise. This paper considers the impact of the heavy-tailed gradient noise obeying sub-Weibull distribution on the stability-based learning guarantees for non-convex pairwise SGD by investigating its generalization and optimization jointly. Specifically, based on two novel pairwise uniform model stability tools, we firstly bound the generalization error of pairwise SGD in the general non-convex setting after bridging the quantitative relationships between stability and generalization error. Then, we further consider the practical heavy-tailed sub-Weibull gradient noise condition to establish a refined generalization bound without the bounded gradient condition. Finally, sharper error bounds for generalization and optimization are built by introducing the gradient dominance condition. Comparing these results reveals that sub-Weibull gradient noise brings some positive dependencies on the heavy-tailed strength for generalization and optimization. Furthermore, we extend our analysis to the corresponding pairwise minibatch SGD and derive the first stability-based near-optimal generalization and optimization bounds which are consistent with many empirical observations.

Abstract: The advent of Spatial Transcriptomics (ST) has revolutionized understanding of tissue architecture by creating highresolution maps of gene expression patterns. However, the low capture rate of ST leads to significant sparsity. The aim of imputation is to recover biological signals by imputing the dropouts in ST data to approximate the true expression values. In this paper, we introduce a Spatial Gene Expression Imputation Diffusion model to facilitate ST data imputation, and our model is referred to as SpotDiff. Specifically, we incorporate a spot-gene prompt learning module to capture the association between spots and genes. Further, SpotDiff integrates single-cell RNA sequencing data to impute gene expression at each spot. The proposed approach is able to reduce the uncertainty in the imputation process, since the aggregation of multiple single-cell measurements yield a stable representation of the corresponding spot expression profile. Extensive experiments have been performed to demonstrate that SpotDiff outperforms existing imputation methods across multiple benchmarks in terms of yielding more accurate and biologically relevant gene expression profiles, particularly in highly sparse scenarios.

Abstract: Datafree quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Although several DFQ methods have been proposed for vision transformer (ViT) architectures, they fail to achieve efficacy in low-bit settings. Examining the existing methods, we observe that their synthetic data produce misaligned attention maps, while those of the real samples are highly aligned. From this observation, we find that aligning attention maps of synthetic data helps improve the overall performance of quantized ViTs. Motivated by this finding, we devise MimiQ, a novel DFQ method designed for ViTs that enhances inter-head attention similarity. First, we generate synthetic data by aligning head-wise attention outputs from each spatial query patch. Then, we align the attention maps of the quantized network to those of the full-precision teacher by applying head-wise structural attention distillation. The experimental results show that the proposed method significantly outperforms baselines, setting a new state-of-the-art for ViT-DFQ.

Abstract: Developments in deep neural nets have trended towards increasingly larger overparameterized architectures, resulting in lengthy training sessions with ever more elusive training dynamics. Thus, ensuring these models learn accurate generalizable representations of data efficiently is challenging. Previous works have developed specialized techniques from datapruning, architecture selection, pseudo-label generation, bias identification, or label refurbishment to improve downstream training. Problematically, most methods require prohibitively expensive iterative model training. In this paper, we demonstrate that we can exploit the recent neural tangent kernel (NTK) theory for understanding and improving model training behavior before ever training a model. First, we show a powerful signal derived from the NTK theory can be computed remarkably fast. We then leverage this signal for the design of a unified suite of surprisingly effective tools for the four important tasks of architecture selection, pseudo-label verification, bias identification, and label refurbishment, all requiring zero model training.

School of Computer Science and Engineering, Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China, Big Data Institute, Central South University, Changsha, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Computer Science and Engineering, Central South University, Changsha, China, School of Computer Science and Engineering, Central South University, Changsha, China, School of Computer Science and Engineering, Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China

Abstract: Federated Adversarial Learning (FAL) is a robust framework for resisting adversarial attacks on federated learning. Although some FAL studies have developed efficient algorithms, they primarily focus on convergence performance and overlook generalization. Generalization is crucial for evaluating algorithm performance on unseen data. However, generalization analysis is more challenging due to nonsmooth adversarial loss functions. A common approach to addressing this issue is to leverage smoothness approximation. In this paper, we develop algorithm stability measures to evaluate the generalization performance of two popular FAL algorithms: Vanilla FAL (VFAL) and Slack FAL (SFAL), using three different smooth approximation methods: 1) Surrogate Smoothness Approximation (SSA), (2) Randomized Smoothness Approximation (RSA), and (3) Over-Parameterized Smoothness Approximation (OPSA). Based on our in-depth analysis, we answer how to properly set the smoothness approximation method to mitigate generalization error in FAL. Moreover, we identify RSA as the most effective generalization error reduction method. In highly data-heterogeneous scenarios, we also recommend employing SFAL to mitigate the deterioration of generalization performance caused by heterogeneity. Based on our theoretical results, we provide insights to help develop more efficient FAL algorithms, such as designing new metrics and dynamic aggregation rules to mitigate heterogeneity.

Abstract: Multiview clustering (MVC) aims to integrate complementary information from multiple views to enhance clustering performance. Late Fusion Multi-View Clustering (LFMVC) has shown promise by synthesizing diverse clustering results into a unified consensus. However, current LFMVC methods struggle with noisy and redundant partitions and often fail to capture high-order correlations across views. To address these limitations, we present a novel theoretical framework for analyzing the generalization error bounds of multiple kernel k-means, leveraging local Rademacher complexity and principal eigenvalue proportions. Our analysis establishes a convergence rate of O(1/n), significantly improving upon the existing rate in the order of O(sqrt(k/n)). Building on this insight, we propose a low-pass graph filtering strategy within a multiple linear K-means framework to mitigate noise and redundancy, further refining the principal eigenvalue proportion and enhancing clustering accuracy. Experimental results on benchmark datasets confirm that our approach outperforms state-of-the-art methods in clustering performance and robustness.

Abstract: Prompt instruction tuning is a popular approach to better adjust pretrained LLMs for specific downstream tasks. How to extend this approach to simultaneously handle multiple tasks and data distributions is an interesting question. We propose Mixture of Prompts (MoPs) with smart gating functionality. Our proposed system identifies relevant skills embedded in different groups of prompts and dynamically weighs experts (i.e., collection of prompts) based on the target task. Experiments show that MoPs are resilient to model compression, data source, and task composition, making them highly versatile and applicable in various contexts. In practice, MoPs can simultaneously mitigate prompt training ``interference'' in multitask, multi-source scenarios (e.g., task and data heterogeneity across sources) and possible implications from model approximations. Empirically, MoPs show particular effectiveness in compressed model scenarios, while maintaining favorable performance in uncompressed settings: MoPs can reduce final perplexity from 9% up to 70% in non-i.i.d. distributed cases and from 3% up to 30% in centralized cases, compared to baselines.

Abstract: The identification of the financial characteristics of industry sectors has a large importance in accounting audit, allowing auditors to prioritize the most important area during audit. Existing company classification standards such as the Standard Industry Classification (SIC) code allow to map a company to a category based on its activity and products. In this paper, we explore the potential of machine learning algorithms and language models to analyze the relationship between those categories and companies' financial statements. We propose a supervised company classification methodology and analyze several types of representations for financial statements. Existing works address this task using solely numerical information in financial records. Our findings show that beyond numbers, textual information occurring in financial records can be leveraged by language models to match the performance of dedicated decision treebased classifiers, while providing better explainability and more generic accounting representations. We think this work can serve as a preliminary work towards semi-automatic auditing.

Abstract: Distributionally robust optimization tackles outof-sample issues like overfitting and distribution shifts by adopting an adversarial approach over a range of possible data distributions, known as the ambiguity set. To balance conservatism and accuracy, these sets must include realistic probability distributions by leveraging information from the nominal distribution. Assuming that nominal distributions arise from a structural causal model with a directed acyclic graph G and structural equations, previous methods such as adapted and G-causal optimal transport have only utilized causal graph information in designing ambiguity sets. In this work, we propose incorporating structural equations, which include causal graph information, to enhance ambiguity sets, resulting in more realistic distributions. We introduce structural causal optimal transport and its associated ambiguity set, demonstrating their advantages and connections to previous methods. A key benefit of our approach is a relaxed version, where a regularization term replaces the complex causal constraints, enabling an efficient algorithm via difference-of-convex programming to solve structural causal optimal transport. We also show that when structural information is absent and must be estimated, our approach remains effective and provides finite sample guarantees. Lastly, we address the radius of ambiguity sets, illustrating how our method overcomes the curse of dimensionality in optimal transport problems, achieving faster shrinkage with dimension-free order.

Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, China, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, China, Cyberspace Institute of Advanced Technology, Guangzhou University, China, Cooperative Medianet Innovation Center, Shanghai Jiaotong University, China, Research Center for Data Hub and Security, Zhejiang Lab, China, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science, Zhejiang University, China

Abstract: Personalized federated learning (PFL) on graphs is an emerging field focusing on the collaborative development of architectures across multiple clients, each with distinct graph data distributions while adhering to strict privacy standards. This area often requires extensive expert intervention in model design, which is a significant limitation. Recent advancements have aimed to automate the search for graph neural network architectures, incorporating large language models (LLMs) for their advanced reasoning and selfreflection capabilities. However, two technical challenges persist. First, although LLMs are effective in natural language processing, their ability to meet the complex demands of graph neural architecture search (GNAS) is still being explored. Second, while LLMs can guide the architecture search process, they do not directly solve the issue of client drift due to heterogeneous data distributions. To address these challenges, we introduce a novel method, Personalized Federated Graph Neural Architecture Search (PFGNAS). This approach employs a task-specific prompt to identify and integrate optimal GNN architectures continuously. To counteract client drift, PFGNAS utilizes a weight-sharing strategy of supernet, which optimizes the local architectures while ensuring client-specific personalization. Extensive evaluations show that PFGNAS significantly outperforms traditional PFL methods, highlighting the advantages of integrating LLMs into personalized federated learning environments.

Abstract: Building on the success of textto-image diffusion models (DPMs), image editing is an important application to enable human interaction with AI-generated content. Among various editing methods, editing within the prompt space gains more attention due to its capacity and simplicity of controlling semantics. However, since diffusion models are commonly pretrained on descriptive text captions, direct editing of words in text prompts usually leads to completely different generated images, violating the requirements for image editing. On the other hand, existing editing methods usually consider introducing spatial masks to preserve the identity of unedited regions, which are usually ignored by DPMs and therefore lead to inharmonic editing results. Targeting these two challenges, in this work, we propose to disentangle the comprehensive image-prompt interaction into several item-prompt interactions, with each item linked to a special learned prompt. The resulting framework, named D-Edit, is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. Versatile image editing can then be applied to specific items by manipulating the corresponding prompts. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal, covering most types of editing applications, all within a single unified framework. Notably, D-Edit is the first framework that can (1) achieve item editing through mask editing and (2) combine image and text-based editing. We demonstrate the quality and versatility of the editing results for a diverse collection of images through both qualitative and quantitative evaluations.

Abstract: Graph representation learning methods are highly effective in handling complex nonEuclidean data by capturing intricate relationships and features within graph structures. However, traditional methods face challenges when dealing with heterogeneous graphs that contain various types of nodes and edges due to the diverse sources and complex nature of the data. Existing heterogeneous graph neural networks (HGNNs) have shown promising results but require prior knowledge of node and edge types and unified node feature formats, which limits their applicability. Recent advancements in graph representation learning using large language models (LLMs) offer new solutions by integrating LLMs' data processing capabilities, enabling the alignment of various graph representations. Nevertheless, these methods often overlook heterogeneous graph data and require extensive preprocessing. To address these limitations, we propose an LLM-enhanced Heterogeneous Graph Neural Network (LHGNN). LHGNN leverages the strengths of both LLM and GNN, allowing for the processing of graph data with any format and type of nodes and edges without the need for type information or special preprocessing. LHGNN employs LLM to automatically summarize and classify different data formats and types, aligns node features, and uses a specialized GNN for targeted learning, thus obtaining effective graph representations for downstream tasks. Theoretical analysis and experimental validation have demonstrated the effectiveness of our method.

Abstract: Understanding relations arising out of interactions among entities can be very difficult, and predicting them is even more challenging. This problem has many applications in various fields, such as financial networks and ecommerce. These relations can involve much more complexities than just involving more than two entities. One such scenario is evolving recursive relations between multiple entities, and so far, this is still an open problem. This work addresses the problem of forecasting higher-order interaction events that can be multi-relational and recursive. We pose the problem in the framework of representation learning of temporal hypergraphs that can capture complex relationships involving multiple entities. The proposed model, \textit{Relational Recursive Hyperedge Temporal Point Process} (RRHyperTPP) uses an encoder that learns a dynamic node representation based on the historical interaction patterns and then a hyperedge link prediction-based decoder to model the occurrence of interaction events. These learned representations are then used for downstream tasks involving forecasting the type and time of interactions. The main challenge in learning from hyperedge events is that the number of possible hyperedges grows exponentially with the number of nodes in the network. This will make the computation of negative log-likelihood of the temporal point process expensive, as the calculation of survival function requires a summation over all possible hyperedges. In our work, we develop a noise contrastive estimation method to learn the parameters of our model, and we have experimentally shown that our models perform better than previous state-of-the-art methods for interaction forecasting.

Abstract: A common problem in MessagePassing Neural Networks is oversquashing -- the limited ability to facilitate effective information flow between distant nodes. Oversquashing is attributed to the exponential decay in information transmission as node distances increase. This paper introduces a novel perspective to address oversquashing, leveraging dynamical systems properties of global and local non-dissipativity, that enable the maintenance of a constant information flow rate. We present SWAN, a uniquely parameterized GNN model with antisymmetry both in space and weight domains, as a means to obtain non-dissipativity. Our theoretical analysis asserts that by implementing these properties, SWAN offers an enhanced ability to transmit information over extended distances. Empirical evaluations on synthetic and real-world benchmarks that emphasize long-range interactions validate the theoretical understanding of SWAN, and its ability to mitigate oversquashing.

Abstract: In the geospatial domain, universal representation models are significantly less prevalent than their extensive use in natural language processing and computer vision. This discrepancy arises primarily from the high costs associated with the input of existing representation models, which often require street views and mobility data. To address this, we develop a novel, trainingfree method that leverages large language models (LLMs) and auxiliary map data from OpenStreetMap to derive geolocation representations (LLMGeovec). LLMGeovec can represent the geographic semantics of city, country, and global scales, which acts as a generic enhancer for spatio-temporal learning. Specifically, by direct feature concatenation, we introduce a simple yet effective paradigm for enhancing multiple spatio-temporal tasks including geographic prediction (GP), long-term time series forecasting (LTSF), and graph-based spatio-temporal forecasting (GSTF). LLMGeovec can seamlessly integrate into a wide spectrum of spatio-temporal learning models, providing immediate enhancements. Experimental results demonstrate that LLMGeovec achieves global coverage and significantly boosts the performance of leading GP, LTSF, and GSTF models.

Abstract: This study aims to build a pretrained Graph Neural Network (GNN) model on molecules without human annotations or prior knowledge. Although various attempts have been proposed to overcome limitations in acquiring labeled molecules, the previous pre-training methods still rely on semantic subgraphs, i.e., functional groups. Only focusing on the functional groups could overlook the graph-level distinctions. The key challenge to build a pre-trained GNN on molecules is how to (1) generate well-distinguished graph-level representations and (2) automatically discover the functional groups without prior knowledge. To solve it, we propose a novel Subgraph-conditioned Graph Information Bottleneck, named S-CGIB, for pre-training GNNs to recognize core subgraphs (graph cores) and significant subgraphs. The main idea is that the graph cores contain compressed and sufficient information that could generate well-distinguished graph-level representations and reconstruct the input graph conditioned on significant subgraphs across molecules under the S-CGIB principle. To discover significant subgraphs without prior knowledge about functional groups, we propose generating a set of functional group candidates, i.e., ego networks, and using an attention-based interaction between the graph core and the candidates. Despite being identified from self-supervised learning, our learned subgraphs match the real-world functional groups. Extensive experiments on molecule datasets across various domains demonstrate the superiority of S-CGIB.

Abstract: Zerothorder (ZO) optimization as the gradient-free method has become a powerful tool when the first-order gradient is unavailable or expensive to obtain, especially in decentralized learning scenarios where data and computational resources are distributed across multiple clients. There have been many efforts to analyze the optimization convergence rate of zeroth-order decentralized stochastic gradient descent (ZO-DSGD) algorithms. However, the generalization of these methods has not been well studied. In this paper, we provide a generalization analysis of ZO-DSGD with changing topology, where the clients run zeroth-order SGD with local data and communicate with each other according to time-varying topology. We systematically analyze the generalization error in convex, strongly convex, and non-convex cases. The obtained results in the convex and strongly convex cases with zeroth-order oracles recover the results of SGD. Moreover, the generalization bounds derived in non-convex cases align with that of DSGD. To capture the influence of communication topology on the generalization performance, we analyze local generalization bounds concerning local models held at different clients. The obtained results reflect the influence of the number of clients, local sample size, and topology on the generalization error. To the best of our knowledge, this is the first work that provides a generalization analysis of zeroth-order decentralized stochastic gradient descent methods and recovers the results of SGD.

Abstract: Posterior drift refers to changes in the relationship between responses and covariates while the distributions of the covariates remain unchanged. In this work, we explore functional linear regression under posterior drift with transfer learning. Specifically, we investigate when and how auxiliary data can be leveraged to improve the estimation accuracy of the slope function in the target model when posterior drift occurs. We employ the approximated least square method together with a lasso penalty to construct an estimator that transfers beneficial knowledge from source data. Theoretical analysis indicates that our method avoids negative transfer under posterior drift, even when the contrast between slope functions is quite large. Specifically, the estimator is shown to perform at least as well as the classical estimator using only target data, and it enhances the learning of the target model when the source and target models are sufficiently similar. Furthermore, to address scenarios where covariate distributions may change, we propose an adaptive algorithm using aggregation techniques. This algorithm is robust against noninformative source samples and effectively prevents negative transfer. Simulation and real data examples are provided to demonstrate the effectiveness of the proposed algorithm.

Abstract: Medical image segmentation provides detailed understanding and aids in diagnosis, treatment planning, and monitoring of diseases. Due to the high cost of acquiring labeled data in the field of medical image analysis, semisupervised segmentation methods have garnered increasing attention. Benefiting from their simplicity and effectiveness, consistency regularization-based methods have emerged as a significant research focus by utilizing perturbations. However, existing methods typically consider perturbation strategies from only a single perspective: either instance perturbation or model perturbation, thus ignoring the potential benefit of effectively combining both. In response, we propose a unified perturbation framework named GapMatch, which bridges instance and model perturbations to broaden the perturbation space and employs dual perturbation to impose consistency regularization on the model. Specifically, GapMatch involves using instance perturbation to update the decision boundary and model perturbation to further optimize it. These two steps mutually reinforce each other in an iterative manner, effectively pushing the decision boundary towards low-density regions while maximizing the class margin. Extensive experimental results on two popular medical image benchmarks demonstrate the effectiveness and generality of the proposed method.

Abstract: We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, fewshot learning and more. These problems are often formalized as Bi-Level Optimizations (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two fold: i) Our stochastic formulation takes into account uncertainty, which makes the method robust to suboptimal inner optimization or non-unique multiple inner minima due to overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyper-parameters in the case of Vision Transformers.

Abstract: Label corruption, where training samples are mislabeled due to nonexpert annotation or adversarial attacks, significantly degrades model performance. Acquiring large, perfectly labeled datasets is costly, and retraining models from scratch is computationally expensive. To address this, we introduce Scaled Activation Projection (SAP), a novel SVD (Singular Value Decomposition)-based corrective machine unlearning algorithm. SAP mitigates label noise by identifying a small subset of trusted samples using cross-entropy loss and projecting model weights onto a clean activation space estimated using SVD on these trusted samples. This process suppresses the noise introduced in activations due to the mislabeled samples. In our experiments, we demonstrate SAP’s effectiveness on synthetic noise with different settings and real-world label noise. SAP applied to the CIFAR dataset with 25% synthetic corruption show upto 6% generalization improvements. Additionally, SAP can improve the generalization over noise robust training approaches on CIFAR dataset by ∼ 3.2% on average. Further, we observe generalization improvements of 2.31% for a Vision Transformer model trained on naturally corrupted Clothing1M.

Abstract: Using offline observational data for policy evaluation and learning allows decisionmakers to evaluate and learn a policy that connects characteristics and interventions. Most existing literature has focused on either discrete treatment spaces or assumed no difference in the distributions between the policy-learning and policy-deployed environments. These restrict applications in many real-world scenarios where distribution shifts are present with continuous treatment. To overcome these challenges, this paper focuses on developing a distributionally robust policy under a continuous treatment setting. The proposed distributionally robust estimators are established using the Inverse Probability Weighting (IPW) method extended from the discrete one for policy evaluation and learning under continuous treatments. Specifically, we introduce a kernel function into the proposed IPW estimator to mitigate the exclusion of observations that can occur in the standard IPW method to continuous treatments. We then provide finite-sample analysis that guarantees the convergence of the proposed distributionally robust policy evaluation and learning estimators. The comprehensive experiments further verify the effectiveness of our approach when distribution shifts are present.

School of Electronic Engineering, Xidian University, Xi’an 710071, China Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xi’an 710071, China, School of Electronic Engineering, Xidian University, Xi’an 710071, China Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xi’an 710071, China, School of Artificial Intelligence, Xidian University, Xi’an 710071, China, School of Computer Science and Technology, Xidian University, Xi’an 710071, China Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xi’an 710071, China, School of Electronic Engineering, Xidian University, Xi’an 710071, China Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xi’an 710071, China, School of Electronic Engineering, Xidian University, Xi’an 710071, China Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xi’an 710071, China

Abstract: When the current physical adversarial patches cannot deceive thermal infrared detectors, the existing techniques implement adversarial attacks from scratch, such as digital patch generation, material production, and physical deployment. Besides, it is difficult to finely regulate infrared radiation. To address these issues, this paper designs an adversarial thermal display (AdvDisplay ) by assembling thermoelectric coolers (TECs) as an array. Specifically, to reduce the gap between patches in the physical and digital worlds and decrease the power of AdvDisplay device, heat transfer loss and electric power loss are designed to guide the patch optimization. In addition, a precise temperature control scheme for AdvDisplay is proposed based on proportionalintegral-derivative (PID) control. Due to the accurate temperature regulation and the reusability of AdvDisplay , our method is able to improve the attack success rate and the efficiency of physical deployments. Extensive experimental results indicate that the proposed method possesses superior adversarial effectiveness compared to other methods and demonstrates strong robustness in physical attacks.

Abstract: Recent advancements in Large VisionLanguage Models (LVLMs) highlight their ability to integrate and process multi-modal information. However, hallucinations—where generated content is inconsistent with input vision and instructions—remain a challenge. In this paper, we analyze LVLMs' layer-wise decoding and identify that hallucinations can arise during the reasoning and factual information injection process. Additionally, as the number of generated tokens increases, the forgetting of the original prompt may also lead to hallucinations.To address this, we propose a training-free decoding method called Mixture of Layer Experts (MoLE). MoLE leverages a heuristic gating mechanism to dynamically select multiple layers of LVLMs as expert layers: the Final Expert, the Second Opinion expert, and the Prompt Retention Expert. By the cooperation of each expert, MoLE enhances the robustness and faithfulness of the generation process. Our extensive experiments demonstrate that MoLE significantly reduces hallucinations, outperforming the current state-of-the-art decoding techniques across three mainstream LVLMs and two established hallucination benchmarks. Moreover, our method reveals the potential of LVLMs to independently produce more reliable and accurate outputs.

Abstract: Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Due to their nonuniform intervals between successive observations and varying sampling rates among series, the channel-independent (CI) strategy, which has been demonstrated more desirable for complete multivariate time series forecasting in recent studies, has failed. This failure can be further attributed to the sampling sparsity, which provides insufficient information for effective CI learning, thereby reducing its capacity. When we resort to the channel-dependent (CD) strategy, even higher capacity cannot mitigate the potential loss of diversity in learning similar embedding patterns across different channels. We find that existing work considers CI and CD strategies to be mutually exclusive, primarily because they apply these strategies to the global channel. However, we hold the view that channel strategies do not necessarily have to be used globally. Instead, by appropriately applying them locally and globally, we can create an opportunity to take full advantage of both strategies. This leads us to introduce the Channel Harmony ISMTS Transformer (TimeCHEAT), which utilizes the CD strategy locally and the CI strategy globally. Specifically, we segment the ISMTS into sub-series level patches. Locally, the CD strategy aggregates information within each patch for time embedding learning, maximizing the use of relevant observations while reducing long-range irrelevant interference. Here, we enhance generality by transforming embedding learning into an edge weight prediction task using bipartite graphs, eliminating the need for special prior knowledge. Globally, the CI strategy is applied across patches, allowing the Transformer to learn individualized attention patterns for each channel. Experimental results indicate our proposed TimeCHEAT demonstrates competitive state-of-the-art performance across three mainstream tasks including classification, forecasting and interpolation.

Abstract: Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarsegrained granularity. Current structured pruning methods for LLMs typically depend on a singular granularity for assessing weight importance, resulting in notable performance degradation in downstream tasks. Intriguingly, our empirical investigations reveal that utilizing unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, \emph{i.e.}, individual weights, yields significantly varied sparse LLM structures when juxtaposed to structured pruning. This suggests that evaluating both holistic and individual assessments for weight importance are essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.

Abstract: Existing local modelagnostic explanation techniques are ineffective for machine learning models that consider inputs of variable lengths, as they do not consider temporal information embedded in these models. To address this limitation, we propose ReX, a general framework for incorporating temporal information in these techniques. Our key insight is that these techniques typically learn a model surrogate by sampling model inputs and outputs, and we can incorporate temporal information in a uniform way by only changing the sampling process and the surrogate features. We instantiate our approach on three popular explanation techniques: Anchors, LIME, and Kernel SHAP. To evaluate the effectiveness of ReX, we apply our approach to six models in three different tasks. Our evaluation results demonstrate that our approach 1) significantly improves the fidelity of explanations, making model-agnostic techniques outperform a state-of-the-art model-specific technique on its target model, and 2) helps end users better understand the models' behaviors.

Abstract: Graph Neural Networks (GNNs) have recently achieved significant success in several graphrelated tasks. However, traditional GNNs and their variants are constantly limited by the implicit homophily, assuming neighboring nodes belong to the same class. This results in weak performance on heterophilic graphs where most nodes are linked to neighbors of different classes. Despite the numerous attempts to adequately deal with heterophily, most methods still use the uniform propagation aggregation mechanism. In this paper, we argue that identifying neighbors with different class labels and exploiting them individually is crucial for heterophilic GNNs. We then propose a simple and efficient novel co-training approach, EG-GCN, which uses group aggregation to handle homophilic and heterophilic neighbors separately. In EG-GCN, we first use an edge discriminator to classify edges and split the neighborhood of every node into two parts. We then apply group graph convolution to the divided neighborhoods to obtain node representations. During training, we continuously optimize the edge discriminator to improve neighborhood partition and use the node classification results to identify highly confident unlabeled nodes to expand the edge training set. This co-training strategy enables both components to enhance each other mutually. Extensive experiments demonstrate that EG-GCN significantly outperforms the state-of-the-art approaches.

Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China, Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China, Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China Guangdong Provincial Key Laboratory of Intelligent Information Processing, China

Abstract: Diabetic retinopathy (DR), with its large patient population, has become a formidable threat to human visual health. In the clinical diagnosis of DR, multiview fundus images are considered to be more suitable for DR diagnosis because of the wide coverage of the field of view. Therefore, different from most of the previous single-view DR grading methods, we design a dynamic selection-driven multi-view DR grading method to fit clinical scenarios better. Since lesion information plays a key role in DR diagnosis, previous methods usually boost the model performance by enhancing the lesion feature. However, during the actual diagnosis, ophthalmologists not only focus on the crucial parts, but also exclude irrelevant features to ensure the accuracy of judgment. To this end, we introduce the idea of dynamic selection and design a series of selection mechanisms from fine granularity to coarse granularity. In this work, we first introduce an Ophthalmic Image Reader (OIR) agent to provide the model with pixel-level prompts of suspected lesion areas. Moreover, a Multi-View Token Selection Module (MVTSM) is designed to prune redundant feature tokens and realize dynamic selection of key information. In the final decision stage, we dynamically fuse multi-view features through the novel Multi-View Mixture of Experts Module (MVMoEM), to enhance key views and reduce the impact of conflicting views. Extensive experiments on a large multi-view fundus image dataset with 34,452 images demonstrate that our method performs favorably against state-of-the-art models.

College of Computer Science, Beijing University of Technology School of Computer Science and Technology, Beijing Jiaotong University Idealism Beijing Technology Co., Ltd., College of Computer Science, Beijing University of Technology, School of Computer Science and Technology, Beijing Jiaotong University Department of Automation, Tsinghua University, School of Computer Science and Technology, Beijing Jiaotong University Key Laboratory of Big Data \& Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education

Abstract: Multilabel Learning with Partial Labels (ML-PL) learns from training data, where each sample is annotated with part of positive labels while leaving the rest of positive labels unannotated. Existing methods mainly focus on extending multi-label losses to estimate unannotated labels, further inducing a missing-robust network. However, training with single network could lead to confirmation bias (i.e., the model tends to confirm its mistakes). To tackle this issue, we propose a novel learning paradigm termed Co-Label Selection (CLS), where two networks feed forward all data and cooperate in a co-training manner for critical label selection. Different from traditional co-training based methods that networks select confident samples for each other, we start from a new perspective that two networks are encouraged to remove false-negative labels while keep training samples reserved. Meanwhile, considering the extreme positive-negative label imbalance in ML-PL that leads the model to focus on negative labels, we enforce the model to concentrate on positive labels by abandoning non-informative negative labels to alleviate such issue. By shifting the cooperation strategy from "Sample Selection'' to "Label Selection'', CLS avoids directly dropping samples and reserves training data in most extent, thus enhancing the utilization of supervised signals and the generalization of the learning model. Empirical results performed on various multi-label datasets demonstrate that our CLS is significantly superior to other state-of-the-art methods.

Abstract: Realworld optimization problems often involve complex objective functions with costly evaluations. While Bayesian optimization (BO) with Gaussian processes is effective for these challenges, it suffers in high-dimensional spaces due to performance degradation from limited function evaluations. To overcome this, simplification techniques like dimensionality reduction have been employed, yet they often rely on assumptions about the problem characteristics, potentially underperforming when these assumptions do not hold. Trust-region-based methods, which avoid such assumptions, focus on local search but risk stagnation in local optima. In this study, we propose a novel acquisition function, regional expected improvement (REI), designed to enhance trust-region-based BO in medium to high-dimensional settings. REI identifies regions likely to contain the global optimum, improving performance without relying on specific problem characteristics. We provide a theoretical proof that REI effectively identifies optimal trust regions and empirically demonstrate that incorporating REI into trust-region-based BO outperforms conventional BO and other high-dimensional BO methods in medium to high-dimensional real-world problems.

Abstract: Expensive multiobjective optimization problems (EMOPs) are common in real-world scenarios where evaluating objective functions is costly and involves extensive computations or physical experiments. Current Pareto set learning methods for such problems often rely on surrogate models like Gaussian processes to approximate the objective functions. These surrogate models can become fragmented, resulting in numerous small uncertain regions between explored solutions. When using acquisition functions such as the Lower Confidence Bound (LCB), these uncertain regions can turn into pseudo-local optima, complicating the search for globally optimal solutions. To address these challenges, we propose a novel approach called SVH-PSL, which integrates Stein Variational Gradient Descent (SVGD) with Hypernetworks for efficient Pareto set learning. Our method addresses the issues of fragmented surrogate models and pseudo-local optima by collectively moving particles in a manner that smooths out the solution space. The particles interact with each other through a kernel function, which helps maintain diversity and encourages the exploration of underexplored regions. This kernel-based interaction prevents particles from clustering around pseudo-local optima and promotes convergence towards globally optimal solutions. Our approach aims to establish robust relationships between trade-off reference vectors and their corresponding true Pareto solutions, overcoming the limitations of existing methods. Through extensive experiments across both synthetic and real-world MOO benchmarks, we demonstrate that SVH-PSL significantly improves the quality of the learned Pareto set, offering a promising solution for expensive multi-objective optimization problems.

Abstract: Realworld time series analysis faces significant challenges when dealing with irregular and incomplete data. While Neural Differential Equation (NDE) based methods have shown promise, they struggle with limited expressiveness, scalability issues, and stability concerns. Conversely, Neural Flows offer stability but falter with irregular data. We introduce 'DualDynamics', a novel framework that synergistically combines NDE-based method and Neural Flow-based method. This approach enhances expressive power while balancing computational demands, addressing critical limitations of existing techniques. We demonstrate DualDynamics' effectiveness across diverse tasks: classification of robustness to dataset shift, irregularly-sampled series analysis, interpolation of missing data, and forecasting with partial observations. Our results show consistent outperformance over state-of-the-art methods, indicating DualDynamics' potential to advance irregular time series analysis significantly.

Abstract: Neural architecture search (NAS) enables finding the bestperforming architecture from a search space automatically. Most NAS methods exploit an over-parameterized network (i.e., a supernet) containing all possible architectures (i.e., subnets) in the search space. However, the subnets that share the same set of parameters are likely to have different characteristics, interfering with each other during training. To address this, few-shot NAS methods have been proposed that divide the space into a few subspaces and employ a separate supernet for each subspace to limit the extent of weight sharing. They achieve state-of-the-art performance, but the computational cost increases accordingly. We introduce in this paper a novel few-shot NAS method that exploits the number of nonlinear functions to split the search space. To be specific, our method divides the space such that each subspace consists of subnets with the same number of nonlinear functions. Our splitting criterion is efficient, since it does not require comparing gradients of a supernet to split the space. In addition, we have found that dividing the space allows us to reduce the channel dimensions required for each supernet, which enables training multiple supernets in an efficient manner. We also introduce a supernet-balanced sampling (SBS) technique, sampling several subnets at each training step, to train different supernets evenly within a limited number of training steps. Extensive experiments on standard NAS benchmarks demonstrate the effectiveness of our approach.

Abstract: Data imbalance across clients in federated learning often leads to different local feature space partitions, harming the global model's generalization ability. Existing methods either employ knowledge distillation to guide consistent local training or performs procedures to calibrate local models before aggregation. However, they overlook the illposed model aggregation caused by imbalanced representation learning. To address this issue, this paper presents a cross-silo feature space alignment method (FedFSA), which learns a unified feature space for clients to bridge inconsistency. Specifically, FedFSA consists of two modules, where the in-silo prototypical space learning (ISPSL) module uses predefined text embeddings to regularize representation learning, which can improve the distinguishability of representations on imbalanced data. Subsequently, it introduces a variance transfer approach to construct the prototypical space, which aids in calibrating minority classes feature distribution and provides necessary information for the cross-silo feature space alignment (CSFSA) module. Moreover, the CSFSA module utilizes augmented features learned from the ISPSL module to learn a generalized mapping and align these features from different sources into a common space, which mitigates the negative impact caused by imbalanced factors. Experimental results from three datasets verified that FedFSA improves the consistency between diverse spaces on imbalanced data, which results in superior performance compared to existing methods.

Abstract: Finetuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pretrained models from unverified sources, highlighting the potential risks involved.

Abstract: While the area under the ROC curve is perhaps the most common measure that is used to rank relative performance of different binary classifiers, longstanding field folklore has noted that it can be a measure that illcaptures the benefits of different classifiers when either the actual class values or misclassification costs are highly unbalanced between the two classes. We introduce a new ROC surface, and the VOROS, a volume over this ROC surface, as a natural way to capture these costs, by lifting the ROC curve to 3D. Compared to previous attempts to generalize the ROC curve, our formulation provides also a simple and intuitive way to model the scenario when only ranges, rather than exact values, are known for possible class imbalance and misclassification costs.

Abstract: As a strategy for sustainability of deep learning, reusing an existing model by retraining it rather than training a new model from scratch is critical. In this paper, we propose REpresentation Shift QUantifying Estimator (RESQUE), a predictive quantifier to estimate the retraining cost of a model to distributional shifts or change of tasks. It provides a single concise index for an estimate of resources required for retraining the model. Through extensive experiments, we show that RESQUE has a strong correlation with various retraining measures. Our results validate that RESQUE is an effective indicator in terms of epochs, gradient norms, changes of parameter magnitude, energy, and carbon emissions. The measures align well with RESQUE for new tasks, multiple noise types, and varying noise intensities. As a result, RESQUE enables users to make informed decisions for retraining to different tasks/distribution shifts and determine the most costeffective and sustainable option, allowing for the reuse of a model with a much smaller footprint in the environment.

Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen ACSLab, Huawei Technologies Co., Ltd., Shenzhen, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen ACSLab, Huawei Technologies Co., Ltd., Shenzhen, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen ACSLab, Huawei Technologies Co., Ltd., Shenzhen, ACSLab, Huawei Technologies Co., Ltd., Shenzhen School of Mathematical Sciences, Peking University, Beijing, ACSLab, Huawei Technologies Co., Ltd., Shenzhen, Department of Computer Science, City University of Hong Kong, Hong Kong, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen Pengcheng Laboratory, Shenzhen, ACSLab, Huawei Technologies Co., Ltd., Shenzhen

Abstract: Known as low energy consumption networks, spiking neural networks (SNNs) have gained a lot of attention within the past decades. While SNNs are increasing competitive with artificial neural networks (ANNs) for vision tasks, they are rarely used for long sequence tasks, despite their intrinsic temporal dynamics. In this work, we develop spiking state space models (SpikingSSMs) for long sequence learning by leveraging on the sequence learning abilities of state space models (SSMs). Inspired by dendritic neuron structure, we hierarchically integrate neuronal dynamics with the original SSM block, meanwhile realizing sparse synaptic computation. Furthermore, to solve the conflict of eventdriven neuronal dynamics with parallel computing, we propose a light-weight surrogate dynamic network which accurately predicts the after-reset membrane potential and compatible to learnable thresholds, enabling orders of acceleration in training speed compared with conventional iterative methods. On the long range arena benchmark task, SpikingSSM achieves competitive performance to state-of-the-art SSMs meanwhile realizing on average 90% of network sparsity. On language modeling, our network significantly surpasses existing spiking large language models (spikingLLMs) on the WikiText-103 dataset with only a third of the model size, demonstrating its potential as backbone architecture for low computation cost LLMs.

Abstract: Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and the incapacity to repaint foreground object areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuningfree method to empower pre-trained diffusion models for stable and effective object removal. Firstly, in light of the observation that the self-attention maps influence the structure and shape details of the generated images, we propose Attention Activation and Suppression (ASS), which re-engineers the self-attention mechanism within the pre-trained diffusion models based on the given mask, thereby prioritizing the background over the foreground object during the reverse generation process. Moreover, we introduce Self-Attention Redirection Guidance (SARG), which utilizes the self-attention redirected by ASS to guide the generation process, effectively removing foreground objects within the mask while simultaneously generating content that is both plausible and coherent. Experiments demonstrate the stability and effectiveness of Attentive Eraser in object removal across a variety of pre-trained diffusion models, outperforming even training-based methods. Furthermore, Attentive Eraser can be implemented in various diffusion model architectures and checkpoints, enabling excellent scalability.

Abstract: Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed firstorder, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Our results hold for both convex and non-convex objectives. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical, even when training deep neural networks.

Abstract: Automated machine learning (AutoML) is a collection of techniques designed to automate the machine learning development process. While traditional AutoML approaches have been successfully applied in several critical steps of model development (e.g. hyperparameter optimization), there lacks a AutoML system that automates the entire endto-end model production workflow for computer vision. To fill this blank, we propose a novel request-to-model task, which involves understanding the user's natural language request and execute the entire workflow to output production-ready models. This empowers non-expert individuals to easily build task-specific models via a user-friendly language interface. To facilitate development and evaluation, we develop a new experimental platform called AutoMMLab and a new benchmark called LAMP for studying key components in the end-to-end request-to-model pipeline. Hyperparameter optimization (HPO) is one of the most important components for AutoML. Traditional approaches mostly rely on trial-and-error, leading to inefficient parameter search. To solve this problem, we propose a novel LLM-based HPO algorithm, called HPO-LLaMA. Equipped with extensive knowledge and experience in model hyperparameter tuning, HPO-LLaMA achieves significant improvement of HPO efficiency.

Abstract: Federated learning (FL), as a privacypreserving collaborative machine learning paradigm, has attracted significant interest from industry and academia. To allow each data owner (FL client) to train a heterogeneous and personalized local model based on its local data distribution, system resources and requirements on model structure, the field of model-heterogeneous personalized federated learning (MHPFL) has emerged. Existing MHPFL approaches either rely on the availability of a public dataset with special characteristics to facilitate knowledge transfer, incur high computational and communication costs, or face potential model leakage risks. To address these limitations, we propose a model-heterogeneous personalized Federated learning approach based on generalized proxy feature Extractor Sharing (pFedES) for supervised image classification tasks. (1) We devise a shared small proxy homogeneous feature extractor before each client's heterogeneous local model. (2) Clients train them via the proposed iterative learning to enable the exchange of global generalized knowledge and local personalized knowledge. (3) The small proxy local homogeneous extractors produced after local training are uploaded to the server for aggregation to facilitate knowledge fusion across clients. We theoretically prove pFedES converges with a non-convex convergence rate O(1/T). Experiments on 3 benchmark datasets against 9 baselines demonstrate that pFedES performs state-of-the-art model accuracy while maintaining efficient communication and computation.

College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University, College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University, College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, School of Artificial Intelligence and Data Science, University of Science and Technology of China, Alibaba Cloud Computing, Alibaba Cloud Computing, Alibaba Cloud Computing, College of Pharmaceutical Sciences, Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence

Abstract: Antibodies defend our health by binding to antigens with high specificity and potentiality, primarily relying on the ComplementarityDetermining Region (CDR). Yet, current experimental methods of discovering new antibody CDRs are heavily time-consuming. Computational design could alleviate this burden; especially, protein language models have proven quite beneficial in many recent studies. However, most existing models solely focus on antibody potentiality and struggle to encapsulate the diverse range of plausible CDR candidates, limiting their effectiveness in real-world scenarios as binding is only one factor in the multitude of drug-forming criteria. In this paper, we introduce PG-AbD, a framework uniting Generative Flow Networks (GFlowNets) and pretrained Protein Language Models (PLMs) to successfully generate highly potent, diverse and novel antibody candidates. We innovatively construct a Products of Experts (PoE) composed by the global-distribution-modeling PLM and the local-distribution-modeling Potts Model to serve as the reward function of GFlowNet. The joint training paradigm is introduced, where PoE is trained by contrastive divergence with the negative samples generated by GFlowNet, and then guides GFlowNet to sample diverse antibody candidates. We evaluate PG-AbD on extensive antibody design benchmarks. It significantly outperforms existing methods in diversity (13.5% on RabDab, 31.1% on SabDab) while maintaining optimal potential and novelty. Generated antibodies are also found to form stable, regular 3D structures with their corresponding antigens, demonstrating the great potential of PG-AbD to accelerate real-world antibody discovery.

Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, China Telecom Cloud Computing Research Institute, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, College of Computer Science and Electronic Engineering, Hunan University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications

Abstract: Drugtarget interaction prediction (DTI) is essential in various applications including drug discovery and clinical application. There are two perspectives of input data widely used in DTI prediction: Intrinsic data represents how drugs or targets are constructed, and extrinsic data represents how drugs or targets are related to other biological entities. However, any of the two perspectives of input data can be scarce for some drugs or targets, especially for those unpopular or newly discovered. Furthermore, ground-truth labels for specific interaction types can also be scarce. Therefore, we propose the first method to tackle DTI prediction under input data and/or label scarcity. To make our model functional when only one perspective of input data is available, we design two separate experts to process intrinsic and extrinsic data respectively and fuse them adaptively according to different samples. Furthermore, to make the two perspectives complement each other and remedy label scarcity, two experts synergize with each other in a mutually supervised way to exploit the enormous unlabeled data. Extensive experiments on 3 real-world datasets under different extents of input data scarcity and/or label scarcity demonstrate our model outperforms states of the art significantly and steadily, with a maximum improvement of 53.53%. We also test our model without any data scarcity and it still outperforms current methods.

Abstract: Textual Graphs (TGs) present a graphbased representation of textual data and find wide applications in real-world scenarios, such as citation networks, knowledge graphs, and social networks. While the traditional "pre-train, fine-tune" framework effectively addresses tasks requiring abundant labeled data, it falls short in scenarios with limited resource or zero-shot learning capabilities, particularly in low-resource textual graph node classification. Additionally, prevalent approaches that convert text nodes into shallow or manually engineered features fail to capture the rich semantic nuances within the text. The conventional methods often neglect the fusion of semantic and topological information, resulting in suboptimal model learning. To overcome these challenges, we proposed a novel method of low-resource textual graph node classification based on large language models, i.e., Textual graph learning with semantic and topological awareness (TGLsta), which comprehensively explores the semantic information, near neighborhood information, and the topology information in textual graphs, where these components are the most important information source contained in textual graphs. Graph prompt tuning for both zero- and few-shot textual graph node classification is further introduced.

Abstract: Vertical federated learning (VFL) trains model when the features of data samples are scattered over multiple clients. To improve efficiency, a promising approach is to find a coreset of the data samples and use it as a smaller training set. However, existing methods produce a large coreset when there are many clients and have long running time. To address these problems, we propose HaCore for efficient coreset construction in VFL setting. HaCore first employs locality sensitive hashing (LSH) to map features to bit signatures locally on the clients, and then merges the local signatures for kmedoids clustering. Data samples that correspond to the medoids are added to the coreset. The core idea is that the distance of original data samples can be approximated by the Hamming distance between their LSH-based bit signatures. To accelerate k-medoids, we utilize an inverted index to search the nearest medoid and a bit-counting method to quickly compute the aggregate distance from many signatures to a medoid. We evaluate HaCore on 5 datasets and compare with state-of-the-art coreset construction methods for VFL. The results show that HaCore accelerates the best-performing baseline by over 45x and matches the accuracy of training with all samples.

Abstract: Time Series Forecasting (TSF) aims at predicting future values for a time series data and plays a crucial role in many realworld applications, e.g., finance, disease spread, or weather predictions. However, it is also a very challenging task due to complex temporal dependencies in the data, especially for long-term forecasting. In this paper, we introduce WaveletMixer, an iterative multi-levels, multi-resolutions and multi-phases approach to effectively capture long-term dependencies of multivariate time series in both global and local perspectives for improving forecasting performance. WaveletMixer fundamentally differs from existing works in the following key aspects. First, it exploits multi-levels properties of Wavelet transformation to create multiple forecasting models for different frequency domains at various levels of resolutions. Second, the relationships among different frequency domains are exploited to iteratively adjust all prediction models at all levels simultaneously in both local and global perspectives to reduce prediction errors and biases, thus significantly improving the final accuracy. Third, while WaveletMixer is a general framework that can be used to boost the performance of any deep-learning architecture (e.g., MLP, LSTM or Transformer), we additionally introduce TS-Learner, an MLP-based model to further enhance the performance in long-term forecasting. Extensive experiments have been conducted on nine real-world datasets to demonstrate the outstanding performance of WaveletMixer compared to SOTA methods and to reveal its important characteristics.

Abstract: Federated Transfer Learning (FTL) is a popular approach to solve the problem of heterogeneous feature space and label distribution. Among the mainstream strategies for FTL, parameter decoupling, which balance the impact of a single global model and multiple personalized models under data heterogeneity, has attracted the attention of many researchers. However, few attacks have been proposed to evaluate the privacy risk of FTL. We find that the finetuned structures and the gradient update mechanisms of parameter decoupling would be more likely to leak personalized information for the server to infer private labels. Based on our findings, we propose the label inference attack that combines meta classifier with contrastive learning in FTL. Our experiments show that the proposed attack has ability to extract local personalized information from the differences before and after fine-tuning to improve the accuracy of the attack in the absence of a downstream model. Our research can reveal potential privacy risks in FTL and motivate more research on private and secure FTL.

Abstract: The accuracy of deep neural networks is significantly influenced by the effectiveness of minibatch construction during training. In single-label scenarios, such as binary and multi-class classification tasks, it has been demonstrated that batch selection algorithms preferring samples with higher uncertainty achieve better performance than difficulty-based methods. Although there are two batch selection methods tailored for multi-label data, none of them leverage important uncertainty information. Adapting the concept of uncertainty to multi-label data is not a trivial task, since there are two issues that should be tackled. First, traditional variance or entropy-based uncertainty measures ignore fluctuations of predictions within sliding windows and the importance of the current model state. Second, existing multi-label methods do not explicitly exploit the label correlations, particularly the uncertainty-based label correlations that evolve during the training process. In this paper, we propose an uncertainty-based multi-label batch selection algorithm. It assesses uncertainty for each label by considering differences between successive predictions and the confidence of current outputs, and further leverages dynamic uncertainty-based label correlations to emphasize instances whose uncertainty is synergistically expressed across multiple labels. Empirical studies demonstrate the effectiveness of our method in improving the performance and accelerating the convergence of various multi-label deep learning models.

State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China University of Chinese Academy of Sciences, Beijing, China, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China Institute of AI for Industries, CAS, China

Abstract: As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) with full utilization of everevolving hardware architectures (e.g. GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge resides in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code. In this paper, we propose a novel prompt mechanism called QiMeng-GEMM which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for the optimization combinations for GEMM. The key of QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Based on this, a searching strategy for optimal combinations of meta-prompts is used to iteratively generate high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions unequivocally demonstrate QiMeng-GEMM’s superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113×. Even when compared to human experts, our method can reach 115% of cuBLAS on NVIDIA GPUs and 211% of OpenBLAS on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240×.

Abstract: Prototypebased federated learning has emerged as a promising approach that shares lightweight prototypes to transfer knowledge among clients with data heterogeneity in a model-agnostic manner. However, existing methods often collect prototypes directly from local models, which inevitably introduce inconsistencies into representation learning due to the biased data distributions and differing model architectures among clients. In this paper, we identify that both statistical and model heterogeneity create a vicious cycle of representation inconsistency, classifier divergence, and skewed prototype alignment, which negatively impacts the performance of clients. To break the vicious cycle, we propose a novel framework named Federated Learning via Semantic Anchors (FedSA) to decouple the generation of prototypes from local representation learning. We introduce a novel perspective that uses simple yet effective semantic anchors serving as prototypes to guide local models in learning consistent representations. By incorporating semantic anchors, we further propose anchor-based regularization with margin-enhanced contrastive learning and anchor-based classifier calibration to correct feature extractors and calibrate classifiers across clients, achieving intra-class compactness and inter-class separability of prototypes while ensuring consistent decision boundaries. We then update the semantic anchors with these consistent and discriminative prototypes, which iteratively encourage clients to collaboratively learn a unified data representation with robust generalization. Extensive experiments under both statistical and model heterogeneity settings show that FedSA significantly outperforms existing prototype-based FL methods on various classification tasks.

Abstract: Consider the scenario where multiple agents have to move in an optimal way through a network, each one towards their ending position, and while avoiding collisions. By optimal, we mean as fast as possible, which is evaluated by a measure known as the makespan of the proposed solution. This is the setting studied in the Multiagent Path Finding problem. In this work we additionally provide the agents with a way to communicate with each other. Due to size constraints, it is reasonable to assume that the range of the communication of each agent will be limited. What should be the trajectories of the agents to, additionally, maintain a backbone of communication? In this work we study this Multiagent Path Finding with Communication Constraint problem under the parameterized complexity framework. Our main contribution is three exact algorithms that are efficient when considering particular structures for the input network. We provide such algorithms for the case when the communication range and the number of agents (the makespan resp.) is provided in the input and the network has a tree topology, or bounded maximum degree (has a treelike topology, i.e., bounded treewidth resp.). We complement these results by showing that it is highly unlikely to construct efficient algorithms when considering the number of agents as part of the input, even if the makespan is 3 and the communication range is 1.

Abstract: Learning in zerosum games studies a situation where multiple agents competitively learn their strategy. In such multi-agent learning, we often see that the strategies cycle around their optimum, i.e., Nash equilibrium. When a game periodically varies (called a ``periodic'' game), however, the Nash equilibrium moves generically. How learning dynamics behave in such periodic games is of interest but still unclear. Interestingly, we discover that the behavior is highly dependent on the relationship between the two speeds at which the game changes and at which players learn. We observe that when these two speeds synchronize, the learning dynamics diverge, and their time-average does not converge. Otherwise, the learning dynamics draw complicated cycles, but their time-average converges. Under some assumptions introduced for the dynamical systems analysis, we prove that this behavior occurs. Furthermore, our experiments observe this behavior even if these assumptions are removed. This study discovers a novel phenomenon, i.e., synchronization, and gains insight widely applicable to learning in periodic games.

Abstract: Collaborative Perception (CP) has shown a promising technique for autonomous driving, where multiple connected and autonomous vehicles (CAVs) share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, ego CAV needs to receive messages from the collaborators, which makes it easy to be attacked by malicious agents. For example, a malicious agent can send harmful information to the ego CAV to mislead it. To address this critical issue, we propose a novel method, **CPGuard**, a tailored defense mechanism for CP that can be deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is that CP will lead to a consensus rather than a conflict against the ego CAV's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method that can effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we design a collaborative consistency loss (CCLoss) to calculate the discrepancy between the ego CAV and the collaborators, which is used as a verification criterion for consensus. Finally, we conduct extensive experiments in collaborative bird's eye view (BEV) tasks and the results demonstrate the effectiveness of our CP-Guard.

Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence

Abstract: Communication has been widely employed to enhance multiagent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle the above delays, this paper proposes a novel framework, Communication Delay-Tolerant Multi-Agent Collaboration (CoDe). At first, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.

Abstract: Although multiagent reinforcement learning (MARL) has shown its success across diverse domains, extending its application to large-scale real-world systems still faces significant challenges. Primarily, the high complexity of real-world environments exacerbates the credit assignment problem, substantially reducing training efficiency. Moreover, the variability of agent populations in large-scale scenarios necessitates scalable decision-making mechanisms. To address these challenges, we propose a novel framework: Sequential rollout with Sequential value estimation (SrSv). This framework aims to capture agent interdependence and provide a scalable solution for cooperative MARL. Specifically, SrSv leverages the autoregressive property of the Transformer model to handle varying populations through sequential action rollout. Furthermore, to capture the interdependence of policy distributions and value functions among multiple agents, we introduce an innovative sequential value estimation methodology and integrates the value approximation into an attention-based sequential model. We evaluate SrSv on three benchmarks: Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, and DubinsCars. Experimental results demonstrate that SrSv significantly outperforms baseline methods in terms of training efficiency without compromising convergence performance. Moreover, when implemented in a large-scale DubinsCar system with 1,024 agents, our framework surpasses existing benchmarks, highlighting the excellent scalability of SrSv.

Abstract: In multiagent systems, when we account for the possibility of delays during execution, online planning becomes more complicated, as both execution and planning should be able to handle delays when agents are moving. Lifelong Multi-Agent Path Finding (LMAPF) is the problem of (re)planning the collision-free moves of agents to their goals in a shared space, while agents continuously receive new goals. PIE (Planning and Improving while Executing) is a recent approach to LMAPF which concurrently replans later parts of agents' trajectories while execution occurs. However, the execution is assumed to be perfect. Existing approaches either use policy-based methods to quickly coordinate agents every timestep with instant delay feedback, or deploy an execution policy to adjust a solution for delays on the fly. These approaches may introduce large amounts of unnecessary delays to agents due to their planner guarantees or simple delay-handling policies. In this paper, we extend PIE to define a framework for solving the lifelong MAPF problem with execution delays. We instantiate our framework with different execution and replanning strategies, and experimentally evaluate them. Overall, we find that this framework can substantially improve the throughput by up to a factor 3 for lifelong MAPF, compared to approaches that handle delays with simple execution policies.

Beijing University of Posts and Telecommunications, China, Beijing University of Posts and Telecommunications, China, Hangzhou Dianzi University, China, Singapore Management University, Singapore, Beijing University of Posts and Telecommunications, China, Institute of Computing Technology, Chinese Academy of Sciences, China, Xi'an Jiaotong University, China, National University of Singapore, Singapore, Beijing University of Posts and Telecommunications, China, Beijing University of Posts and Telecommunications, China

Abstract: Large Language Models (LLMs) have impressive capabilities in text understanding and zeroshot reasoning. However, delays in knowledge updates may cause them to reason incorrectly or produce harmful results. Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs by structurally organizing and connecting a wide range of entities and relations. Existing KG-based LLM reasoning methods only inject KGs' knowledge into prompts in a textual form, ignoring its structural information. Moreover, they mostly rely on close-source models or open-source models with large parameters, which poses challenges to high resource consumption. To address this, we propose a novel Lightweight and efficient Prompt learning-ReasOning Framework for KGQA (LightPROF), which leverages the full potential of LLMs to tackle complex reasoning tasks in a parameter-efficient manner. Specifically, LightPROF follows a “Retrieve-Embed-Reason” process, first accurately, and stably retrieving the corresponding reasoning graph from the KG through retrieval module. Next, through a Transformer-based Knowledge Adapter, it finely extracts and integrates factual and structural information from the KG, then maps this information to the LLM’s token embedding space, creating an LLM-friendly prompt to be used by the LLM for the final reasoning. Additionally, LightPROF only requires training Knowledge Adapter and can be compatible with any open-source LLM. Extensive experiments on two public KGQA benchmarks demonstrate that LightPROF achieves superior performance with small-scale LLMs. Furthermore, LightPROF shows significant advantages in terms of input token count and reasoning time.

Abstract: LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short and concrete correct answers that are easy to check faithfulness. Hallucination detections for text generation with openended answers are more hard. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long texts without external resources conduct consistency comparison among multiple sampled outputs. To handle long texts, researchers split long texts into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for text generations, which aligns facts and considers the dependencies between contextual facts in consistency comparison. Particularly, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual triples (facts), we construct contextual triples into a graph and enhance triples’ interactions via message passing and aggregating via RGCN. To avoid the omission of knowledge triples in long texts, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and excels all baselines.

Abstract: Causal language models acquire vast amount of knowledge from general text corpus during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledgedense and small-sized corpora. The deficiency can come from long-distance dependencies which are hard to capture by language models, and overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining, by enhancing elusive but important clues in text discovered by the language model themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observed a significant boost in both small and large models' performance in fact memorization. This shows that the behavior contrast between more and less-performant language models contains important clues for knowledge learning, and it can be "amplified" for a straight-forward improvement in knowledge learning efficiency.

State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' practice efficiency. However, the scarcity of offline practice response data (e.g., answer correctness) and potential biases in human online practice create a significant gap between offline metrics and the actual online performance of personalized learning services. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLMpowered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners.

Abstract: In the rapidly developing field of automatic text generation and understanding, the quality of input data has been shown to be a key factor affecting the efficiency and accuracy of large language model (LLM) output. With the advent of advanced tools such as ChatGPT, input refinement work has mainly focused on prompt engineering. However, existing methods are often too dependent on specific contexts and are easily affected by individual expert experience and potential biases, limiting their wide applicability in diverse realworld applications. To address this problem, this study develops an Reinforced Token-Level Input Refinement, called RTLIR. We choose to optimize the input data at the fine-grained level of tokens, cleverly preserving the original text structure. Operationally, each state is defined by the token set of the current text, and each action is a binary decision process to decide whether to retain a specific token information. The agent automatically calculates and determines the selection probability of each token based on the current state, thereby optimizing the entire decision process. Through continuous exploration and learning, the agent can autonomously learn to identify the key inputs that have the greatest impact on the generation results and achieve refinement of the input data. In addition, RTLIR is a plug-and-play, LLM-agnostic module that can be used for a wide range of tasks and models. Experimental results show that RTLIR improves the performance of LLM in various input scenarios and tasks, with an average accuracy increase of 6%.

School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, Donghua University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Key Laboratory of Multidimensional Information Processing

Abstract: Large language models (LLMs) demonstrate impressive performance on downstream tasks through incontext learning(ICL). However, there is a significant gap between their performance in Named Entity Recognition (NER) and in fine-tuning methods. We believe this discrepancy is due to inconsistencies in labeling definitions in NER. In addition, recent research indicates that LLMs do not learn the specific input-label mappings from the demonstrations. Therefore, we argue that using examples to implicitly capture the mapping between inputs and labels in in-context learning is not suitable for NER. Instead, it requires explicitly informing the model of the range of entities contained in the labels, such as annotation guidelines. In this paper, we propose GuideNER, which uses LLMs to summarize concise annotation guidelines as contextual information in ICL. We have conducted experiments on widely used NER datasets, and the experimental results indicate that our method can consistently and significantly outperform state-of-the-art methods, while using shorter prompts. Especially on the GENIA dataset, our model outperforms the previous state-of-the-art model by 12.63 F1 scores.

Abstract: Large language models (LLMs) have demonstrated significant progress in multilingual language understanding and generation. However, due to the imbalance in training data, their capabilities in nonEnglish languages are limited. Recent studies revealed the English-pivot multilingual mechanism of LLMs, where LLMs implicitly convert non-English queries into English ones at the bottom layers and adopt English for thinking at the middle layers. However, due to the absence of explicit supervision for cross-lingual alignment in the intermediate layers of LLMs, the internal representations during these stages may become inaccurate. In this work, we introduce a deep supervision fine-tuning method (DFT) that incorporates additional supervision in the internal layers of the model to guide its workflow. Specifically, we introduce two training objectives on different layers of LLMs: one at the bottom layers to constrain the conversion of the target language into English, and another at the middle layers to constrain reasoning in English. To effectively achieve the guiding purpose, we designed two types of supervision signals: logits and feature, which represent a stricter constraint and a relatively more relaxed guidance. Our method guides the model to not only consider the final generated result when processing non-English inputs but also ensure the accuracy of internal representations. We conducted extensive experiments on typical English-centric large models, LLaMA-2 and Gemma-2, and the results on multiple multilingual datasets show that our method significantly outperforms traditional fine-tuning methods.

Institute of Artificial Intelligence, Xiamen University, Xiamen, China School of Informatics, Xiamen University, Xiamen, China, School of Film, Xiamen University, Xiamen, China, Institute of Artificial Intelligence, Xiamen University, Xiamen, China, Institute of Artificial Intelligence, Xiamen University, Xiamen, China School of Informatics, Xiamen University, Xiamen, China, School of Journalism and Communication, Xiamen University, Xiamen, China, School of Film, Xiamen University, Xiamen, China School of Informatics, Xiamen University, Xiamen, China Institute of Artificial Intelligence, Xiamen University, Xiamen, China Xiamen Key Laboratory of Intelligent Storage and Computing, School of Informatics, Xiamen University, Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China, School of Film, Xiamen University, Xiamen, China School of Informatics, Xiamen University, Xiamen, China Institute of Artificial Intelligence, Xiamen University, Xiamen, China Xiamen Key Laboratory of Intelligent Storage and Computing, School of Informatics, Xiamen University

Abstract: Aspect Sentiment Quad Prediction (ASQP) is the most complex subtask of Aspectbased Sentiment Analysis (ABSA), aiming to predict all sentiment quadruples within the given sentence. Due to the complexity of sentence syntaxes and the diversity of sentiment expressions, generative methods gradually become the mainstream approach in ASQP. However, existing generative models are constrained in the effectiveness of demonstrations. Semantically similar demonstrations help in judging sentiment categories and polarities but may confuse the model in recognizing aspect and opinion terms, which are more related to sentence syntaxes. To this end, we first develop Syn2Vec, a method for calculating syntactic vectors to support the retrieval of syntactically similar demonstrations. Then, we propose Syntactic and Semantic Similarity Retrieval Prompting (SimRP) to construct effective prompts by retrieving the most related demonstrations that are syntactically and semantically similar. With these related demonstrations, pre-trained generative models, especially Large Language Models (LLMs), can fully release their potential to recognize sentiment quadruples. Extensive experiments in Supervised Fine-Tuning (SFT) and In-context Learning (ICL) paradigms demonstrate the effectiveness of SimRP. Furthermore, we find that LLMs' capabilities in ASQP are severely underestimated by biased data annotations and the exact matching metric. We propose a novel constituent subtree-based fuzzy metric for more accurate and rational quadruple recognition.

School of Computer Science and Engineering, Southeast University, China Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, School of Computer Science and Engineering, Southeast University, China Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, School of Computer Science and Engineering, Southeast University, China State Key Laboratory for Novel Software Technology, Nanjing University, China, School of Computer Science and Engineering, Southeast University, China Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, Monash University, The University of Manchester, Alibaba Group, School of Computer Science and Engineering, Southeast University, China Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, School of Computer Science and Engineering, Southeast University, China Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications, Law and Innovation Lab, Law School, Southeast University, China

Abstract: Table Understanding (TU) has achieved promising advancements, but it faces the challenges of the scarcity of manually labeled tables and the presence of complex table structures. To address these challenges, we propose HeGTa, a heterogeneous graph (HG)enhanced large language model (LLM) designed for few-shot TU tasks. This framework aligns structural table semantics with the LLM's parametric knowledge through soft prompts and instruction tuning. It also addresses complex tables with a multi-task pre-training scheme, incorporating three novel multi-granularity self-supervised HG pre-text tasks. We empirically demonstrate the effectiveness of HeGTa, showing that it outperforms the SOTA for few-shot complex TU on several benchmarks.

Abstract: Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Tokenbased removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development.

Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Mathematical and Computer Sciences, Heriot-Watt University, University of Trento, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University

Abstract: Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MIDELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.

School of Information, Renmin University of China, Beijing, China Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China, School of Artificial Intelligence, Beijing Normal University, Beijing, China Engineering Research Center of Intelligent Technology and Educational Application, MOE, China, School of Information, Renmin University of China, Beijing, China Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China, School of Information, Renmin University of China, Beijing, China Engineering Research Center of Database and Business Intelligence, MOE, China

Abstract: Catastrophic forgetting is a key challenge in incremental named entity recognition (INER). Existing methods often address this issue through distillationbased approaches, which involve transferring previously learned knowledge from the old model to the new one. However, these methods may not fully equip the new model with an adequate understanding of the characteristics about old entity types, leading to confusion when classifying tokens associated with these entity types. To address this challenge, we propose a novel method called Prototypical Replay with Old-class Focusing Knowledge Distillation (POF) for INER. Our approach focuses on preserving the main characteristics of each previous entity type by storing compact prototypes and replaying them with appropriate frequency. This replay strategy makes the new model review the knowledge of old entity types while minimizing storage needs. Additionally, we introduce an old-class focusing knowledge distillation (OFKD) loss, which distills features only in old-class regions to maintain the quality of old-class prototypes and prevent ineffective prototypical replay while preserving sufficient plasticity for learning new entity types. We conducted experiments on three benchmark datasets (i.e., Few-NERD, I2B2 and OntoNotes5), and the results demonstrate that our method outperforms all previous state-of-the-art methods.

Abstract: Steplevel reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, as they serve as critical guidelines, ensuring that each step in the reasoning process is aligned with desired outcomes. Recently, AlphaZero-like methods, where Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study delves into the counterintuitive aspects of SRMs, particularly focusing on MCTS-based approaches. Our findings reveal that the removal of natural language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that SRMs are adept at assessing the complex logical coherence present in mathematical language while having difficulty in natural language. These insights provide a nuanced understanding of the core elements that drive effective step-level reward modeling in mathematical reasoning. By shedding light on these mechanisms, this study offers valuable guidance for developing more efficient and streamlined SRMs, which can be achieved by focusing on the crucial parts of mathematical reasoning.

Abstract: Postprocessing networks (PPNs) are components that modify the outputs of arbitrary modules in task-oriented dialogue systems and are optimized using reinforcement learning (RL) to improve the overall task completion capability of the system. However, previous PPN-based approaches have been limited to handling only a subset of modules within a system, which poses a significant limitation in improving the system performance. In this study, we propose a joint optimization method for post-processing the outputs of all modules using universal post-processing networks (UniPPNs), which are language-model-based networks that can modify the outputs of arbitrary modules in a system as a sequence-transformation task. Moreover, our RL algorithm, which employs a module-level Markov decision process, enables fine-grained value and advantage estimation for each module, thereby stabilizing joint learning for post-processing the outputs of all modules. Through both simulation-based and human evaluation experiments using the MultiWOZ dataset, we demonstrated that UniPPN outperforms conventional PPNs in the task completion capability of task-oriented dialogue systems.

Abstract: Model knowledge editing has become a widely researched topic because it enables efficient and rapid injection of new knowledge into language models or the correction of erroneous or outdated knowledge. Existing model knowledge editing methods typically categorized into singleinstance sequential editing and massive one-time editing. However, in practical applications, the batched and iterative editing manner better aligns with model updating patterns. In this work, we explored the performance of parameter-update-based models in a new batched iterative editing benchmark. Our findings show that with an increase in the number of editing iterations, the accumulation of updated parameters leads to a greater change in the distribution of model parameters, making it more challenging to maintain editing performance and model stability. To address this degradation issue, we propose two methods: the Wasserstein distance constraint and update parameter sparsification, where the Wasserstein distance constraint optimizes the transition of parameter distribution before and after the editing, and update parameter sparsification significantly reduces the number of update parameters, thereby alleviating the issue of instability in the parameter distribution caused by the accumulation of update parameters through iterations. Our methods can be generally applied to different parameter-update-based knowledge editing models. Experiments on the zsRE and CounterFact datasets demonstrate that our methods can improve editing performance and enhance the later-stage stability of batched iterative editing across different models.

Abstract: Given news articles about an entity, such as a public figure or organization, timeline summarization (TLS) involves generating a timeline that summarizes the key events about the entity. However, the TLS task is too underspecified, since what is of interest to each reader may vary, and hence there is not a single ideal or optimal timeline. In this paper, we introduce a novel task, called Constrained Timeline Summarization (CTLS), where a timeline is generated in which all events in the timeline meet some constraint. An example of a constrained timeline concerns the legal battles of Tiger Woods, where only events related to his legal problems are selected to appear in the timeline. We collected a new humanverified dataset of constrained timelines involving 47 entities and 5 constraints per entity. We propose an approach that employs a large language model (LLM) to summarize news articles according to a specified constraint and cluster them to identify key events to include in a constrained timeline. In addition, we propose a novel self-reflection method during summary generation, demonstrating that this approach successfully leads to improved performance.

Abstract: We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from a source language content to a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the context and style from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current session and use these estimations to generate a prompt that guides a Large Language Model (LLM) to generate highquality translations. Our method is both language and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.

Abstract: Deep learning approaches for multimodal aspectlevel sentiment classification (MALSC) often require extensive data, which is costly and time-consuming to obtain. To mitigate this, current methods typically fine-tune small-scale pretrained models like BERT and BART with few-shot examples. While these models have shown success, Large Vision-Language Models (LVLMs) offer significant advantages due to their greater capacity and ability to understand nuanced language in both zero-shot and few-shot settings. However, there is limited work on fine-tuning LVLMs for MALSC. A major challenge lies in selecting few-shot examples that effectively capture the underlying patterns in data for these LVLMs. To bridge this research gap, we propose an acquisition function designed to select challenging samples for the few-shot learning of LVLMs for MALSC. We compare our approach, Verification and ZERO-shot feedback acquisition (VERO), with diverse acquisition functions for few-shot learning in MALSC. Our experiments show that VERO outperforms prior methods, achieving an F1 score improvement of up to 6.07% on MALSC benchmark datasets.

School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China NiuTrans Research, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China NiuTrans Research, Shenyang, China

Abstract: Large visionlanguage models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques face the difficulty arising from the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue the line of research. We present a Robust Visual Reward Model (RoVRM) which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through a three-phase progressive training and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.

School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Lab of Artificial Intelligence for Education, East China Normal University, Shanghai, China Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China, Shanghai Changning Mental Health Center, Shanghai, China

Abstract: Large Language Models (LLMs) are known to hallucinate facts and make nonfactual statements which can undermine trust in their output. The essence of hallucination lies in the absence of metacognition in LLMs, namely the understanding of their own cognitive processes. However, there has been limited research on quantitatively measuring metacognition within LLMs. Drawing inspiration from cognitive psychology theories, we first quantify the metacognitive ability of LLMs as their ability to evaluate the correctness of responses through confidence. Subsequently, we introduce a general framework called DMC designed to decouple metacognitive ability and cognitive ability. This framework tackles the challenge of noisy quantification caused by the coupling of metacognition and cognition in current research, such as calibration-based metrics. Specifically, the DMC framework comprises two key steps. Initially, the framework tasks the LLM with failure prediction, aiming to evaluate the model's performance in predicting failures, a performance jointly determined by both cognitive and metacognitive abilities of the LLM. Following this, the framework disentangles metacognitive ability and cognitive ability based on the failure prediction performance, providing a quantification of the LLM's metacognitive ability independent of cognitive influences. Experiments conducted on eight datasets across five domains reveal that (1) Our proposed DMC framework effectively separates the metacognition and cognition of LLMs; (2) Various confidence elicitation methods impact the quantification of metacognitve ability differently; (3) Stronger metacognitive ability are exhibited by LLMs with better overall performance; (4) Enhancing metacognition holds promise for alleviating hallucination issues.

School of Artificial Intelligence, Beijing Normal University, Beijing, Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai, School of Artificial Intelligence, Beijing Normal University, Beijing, Elmleaf Ltd., Shanghai, Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai BNU-UIC Institute of Artificial Intelligence and Future Networks, Beijing Normal University (Zhuhai), Guangdong Key Lab of AI and Multi-Modal Data Processing, BNU-HKBU United International College, Zhuhai, Guang Dong, PR China.

Abstract: The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated questionanswering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA by in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question-answering.

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) are prone to generate responses that contradict verifiable facts, i.e., unfaithful hallucination content. Existing efforts generally focus on optimizing model parameters or editing semantic representations, which compromise the internal factual knowledge of target LLMs. In addition, hallucinations typically exhibit multifaceted patterns in downstream tasks, limiting the model's holistic performance across tasks. In this paper, we propose a Comparatordriven Decoding-Time (CDT) framework to alleviate the response hallucination. Firstly, we construct hallucinatory and truthful comparators with multi-task fine-tuning samples. In this case, we present an instruction prototype-guided mixture of experts strategy to enhance the ability of the corresponding comparators to capture different hallucination or truthfulness patterns in distinct task instructions. CDT constrains next-token predictions to factuality-robust distributions by contrasting the logit differences between the target LLMs and these comparators. Systematic experiments on multiple downstream tasks show that our framework can significantly improve the model performance and response factuality.

Abstract: The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AIgenerated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.

The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, The State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Abstract: Large language models (LLMs) have significantly advanced the performance of various natural language processing tasks, including textto-SQL. Current LLM-based text-to-SQL schemes mainly focus on improving the understanding of natural language questions (NLQs) or refining the quality of generated SQLs. While these strategies are effective, they often address specific, nuanced aspects. In contrast, humans approach text-to-SQL with a holistic view, applying transitional logical reasoning across multiple steps to arrive at the final answer. We believe LLMs can leverage human cognitive processes to achieve greater accuracy in text-to-SQL. In this paper, we present COGSQL, a framework featuring a suite of tailored models and strategies aimed at replicating human cognitive processes for enhanced LLM-based text-to-SQL. COGSQL consists of three key modules: (1) SQL preparation: we employ a coarse-to-fine schema linking and syntax keyword prediction, akin to how human recall and align key concepts for better understanding. (2) SQL generation: we introduce a concept-enhanced chain-of-thought prompting, enhancing NLQ interpretation and SQL composition of LLMs, similar to humans drafting SQL query. (3) SQL correction: we develop NLQ consistency and result consistency techniques to correct various errors, mirroring how humans evaluate and refine reasoning. We conduct extensive experiments using diverse benchmarks and LLMs. The results and analysis verify the effectiveness and generalizability of COGSQL.

Abstract: Serving LLMs requires substantial memory due to the storage requirements of KeyValue (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime.

Abstract: Word Sense Disambiguation (WSD) aims to determine the meaning of target words according to the given context. The recognition of highfrequency senses has reached expectations, and the current research focus is mainly on low-frequency senses, namely Long-tail Senses (LTSs). One of the challenges in long-tail WSD is to obtain clear and distinguishable definition representations based on limited word sense definitions. Researchers try to mine word sense definition information from data from different sources to enhance the representations. Inspired by quantum theory, this paper provides a constraint mechanism for representations under non-homogeneous data to leverage the geometric relationship in its Hilbert space to constrain the value range of parameters, thereby alleviating the dependence on big data and improving the accuracy of representations. We theoretically analyze the feasibility of the constraint mechanism, and verify the WSD system based on this mechanism on the standard evaluation framework, constructed LTS datasets and cross-lingual datasets. Experimental results demonstrate the effectiveness of the scheme and achieve competitive performance.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, School of Computing, National University of Singapore University of Science and Technology of China, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, University of Chinese Academy of Sciences Institute of automation, Chinese academy of science, Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, University of California, Irvine, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University

Abstract: Some of the latest released Code Large Language Models (Code LLMs) have been trained on repositorylevel code data, enabling them to perceive repository structures and utilize cross-file code information. This capability allows us to directly concatenate the content of repository code files in prompts to achieve repository-level code completion. However, in real development scenarios, directly concatenating all code repository files in a prompt can easily exceed the context window of Code LLMs, leading to a significant decline in completion performance. Additionally, overly long prompts can increase completion latency, negatively impacting the user experience. In this study, we conducted extensive experiments, including completion error analysis, topology dependency analysis, and cross-file content analysis, to investigate the factors affecting repository-level code completion. Based on the conclusions drawn from these preliminary experiments, we proposed a strategy called **Hierarchical Context Pruning (HCP)** to construct high-quality completion prompts. We applied the **HCP** to six Code LLMs and evaluated them on the CrossCodeEval dataset. The experimental results showed that, compared to previous methods, the prompts constructed using our **HCP** strategy achieved higher completion accuracy on five out of six Code LLMs. Additionally, the **HCP** managed to keep the prompt length around 8k tokens (whereas the full repository code is approximately 50k tokens), significantly improving completion throughput. Our code and data will be publicly available.

School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China Center on Frontiers of Computing Studies, Peking University, Beijing, China Peking University Information Technology Institute (Tianjin Binhai), School of Computer Science, Peking University, Beijing, China Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China Nanhu Laboratory, Jiaxing, China, Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China National Engineering Research Center For Software Engineering, Peking University, Beijing, China Peking University Information Technology Institute (Tianjin Binhai)

Abstract: By integrating external knowledge, RetrievalAugmented Generation (RAG) has become an effective strategy for mitigating the hallucination problems that large language models (LLMs) encounter when dealing with knowledge-intensive tasks. However, in the process of integrating external non-parametric supporting evidence with internal parametric knowledge, inevitable knowledge conflicts may arise, leading to confusion in the model's responses. To enhance the knowledge selection of LLMs in various contexts, some research has focused on refining their behavior patterns through instruction-tuning. Nonetheless, due to the absence of explicit negative signals and comparative objectives, models fine-tuned in this manner may still exhibit undesirable behaviors such as contextual ignorance and contextual overinclusion. To this end, we propose a Knowledge-aware Preference Optimization strategy, dubbed KnowPO, aimed at achieving adaptive knowledge selection based on contextual relevance in real retrieval scenarios. Concretely, we proposed a general paradigm for constructing knowledge conflict datasets, which comprehensively cover various error types and learn how to avoid these negative signals through preference optimization methods. Simultaneously, we proposed a rewriting strategy and data ratio optimization strategy to address preference imbalances. Experimental results show that KnowPO outperforms previous methods for handling knowledge conflicts by over 37%, while also exhibiting robust generalization across various out-of-distribution datasets.

Abstract: Humans can perceive speakers’ characteristics (e.g., identity, gender, personality and emotion) by their appearance, which are generally aligned to their voice style. Recently, visiondriven Text-to-speech ( TTS ) scholars grounded their investigations on real-person faces, thereby restricting effective speech synthesis from applying to vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates the extraneous information (e.g., background, clothing, and hair color, etc.), resulting in synthesized speech closely aligned with a character’s persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS ( EM2TTS), which is diligently curated and annotated to facilitate research in this domain. The experimental results demonstrate our proposed FaceSpeak can generate portrait-aligned voice with satisfactory naturalness and quality.

Abstract: Chain of Thought (CoT) prompting can encourage language models to engage in multistep logical reasoning. The quality of the provided demonstrations significantly influences the success of downstream inference tasks. Current unsupervised CoT methods primarily select examples based on the semantics of the questions, which can introduce noise and lack interpretability. In this paper, we propose leveraging reasoning patterns to enhance CoT prompting effectiveness. Reasoning patterns represent the process by which language models arrive at their final results. By utilizing prior knowledge and prompt-based methods from large models, we first construct task-specific pattern sets. We then select diverse demonstrations based on different reasoning patterns. This approach not only mitigates the impact of noise but also provides explicit interpretability to help us understand the mechanisms of CoT. Extensive experiments demonstrate that our method is more robust and consistently leads to improvements across various reasoning tasks.

Abstract: Large Language Models (LLMs), when combined with agent mechanisms, show great promise in applications requiring robust planning ability, such as financial analysis and medical diagnostics. However, the increasingly complex reasoning structures designed to enhance the planning ability of language agents often exceed the processing and comprehension capabilities of LLMs, thereby limiting their effectiveness. To address these challenges, we introduce the Encoder of Thoughts (EoT), a novel reasoning structure modeling method based on graph neural networks. EoT processes the reasoning structures of planning methods through a plugand-play structural encoder and aligns these structural information with the input space of LLMs, enabling seamless integration with existing language agents. Experiments on multi-step reasoning and plan generation demonstrate that EoT significantly improves the performance of language agents. Moreover, EoT demonstrated stable results when combined with different LLMs and planning algorithms, further underscoring its potential for broader application.

Abstract: In artificial intelligence (AI), many legal conflicts have arisen, especially concerning privacy and copyright associated with training data. When an AI model's training data incurs privacy concerns, it becomes imperative to develop a new model devoid of influences from such contentious data. However, retraining from scratch is often not viable due to the extensive data requirements and heavy computational costs. Machine unlearning presents a promising solution by enabling the selective erasure of specific knowledge from models. Despite its potential, many existing approaches in machine unlearning are based on scenarios that are either impractical or could lead to unintended degradation of model performance. We utilize the concept of weight prediction to approximate the lesslearned weights based on observations about further training. By repetition of 1) finetuning on specific data and 2) weight prediction, our work gradually eliminates knowledge about the specific data. We verify its ability to eliminate side effects caused by problematic data and show its effectiveness across various architectures, datasets, and tasks.

Abstract: Logistics and transportation networks require a large amount of resources to realise necessary connections between locations and minimizing these resources is a vital aspect of planning research. Since such networks have dynamic connections that are only available at specific times, intricate models are needed to portray them accurately. In this paper, we study the problem of minimizing the number of resources needed to realise a dynamic network, using the temporal graphs model. In a temporal graph, edges appear at specific points in time. Given a temporal graph and a natural number k, we ask whether we can cover every temporal edge exactly once using at most k temporal journeys; in a temporal journey consecutive edges have to adhere to the order of time. We conduct a thorough investigation of the complexity of the problem with respect to four dimensions: (a) whether the type of the temporal journey is a walk, a trail, or a path; (b) whether the chronological order of edges in the journey is strict or nonstrict; (c) whether the temporal graph is directed or undirected; (d) whether the start and end points of each journey are given. We almost completely resolve the complexity of these problems and provide dichotomies for each of them with respect to k.

Abstract: This study investigates scheduling strategies for the stochastic resourceconstrained project scheduling problem with maximal time lags (SRCPSP/max). Recent advances in Constraint Programming (CP) and Temporal Networks have re-invoked interest in evaluating the advantages and drawbacks of various proactive and reactive scheduling methods. First, we present a new, CP-based fully proactive method. Second, we show how a reactive approach can be constructed using an online rescheduling procedure. A third contribution is based on partial order schedules and uses Simple Temporal Networks with Uncertainty (STNUs). Our statistical analysis shows that the STNU-based algorithm performs best in terms of solution quality, while also showing good relative offline and online computation time

Abstract: Integrating metric time into Task And Motion Planning (TAMP) is challenging, especially with simultaneous object motion. Existing work focuses on classical and numeric TAMP, not considering deadlines, motions overlapping in time, and other temporal constraints. In this paper, we fill this gap by formalizing Temporal Task and Motion Planning (TTAMP) for multiobject navigation. We propose a novel interleaved planning technique for this problem, which leverages incremental Satisfiability Modulo Theory to ensure efficient reasoning on deadlines and action duration coupled with a motion planner supporting simultaneous object motion. Geometric data on encountered obstacles prunes unreachable symbolic regions, while temporal bounds limit the geometric search space. For multiple moving objects, our algorithm contextualizes the conflicts learned from the motion planner on overlapping actions so that entire classes of temporal plans are pruned from the search space of the task planner, ensuring the eventual termination of the interplay. We provide a comprehensive benchmark suite and demonstrate the effectiveness of our solver in leveraging these scenarios.

Abstract: Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and nonparametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm.

Abstract: Somatic Contiguous Hypermutations (CHM) are a popular variation operator used in artificial immune systems for optimisation tasks. Theoretical studies have shown that CHM operators can lead to considerable speedups in the expected optimisation time compared to the traditional standard bit mutation (SBM) operators used in evolutionary computation for both single-objective and multi-objective problems where it is advantageous to mutate large contiguous areas of the genotype representing the candidate solutions. These speed-ups can make the difference between polynomial and exponential runtimes, but come at the expense of the CHM operator being considerably slower than the SBM operator in easy hillclimbing phases of the optimisation process, when small areas of the genotype have to be mutated for progress to be made. In this paper we present a Fast CHM operator that is asymptotically just as fast as traditional SBM for hillclimbing yet maintains the efficacy of the standard CHM operator when large jumps in the search space are required to make progress efficiently. We demonstrate such efficacy on all applications were CHM has been previously studied in the literature.

Abstract: Multiobjective Optimization Problems (MOPs) where objectives have different levels of importance in decision-making, known as Mixed Pareto-Lexicographic MOPs with Priority Levels (PL-MPL-MOPs), are increasingly prevalent in real-world applications. General-purpose Multi-Objective Evolutionary Algorithms (MOEAs) that treat all objectives equally not only increase the workload of decision-making but also suffer from computational inefficiencies due to the necessity of generating many additional solutions. Conversely, strictly adhering to Priority Levels (PLs) during optimization can easily result in premature convergence within some PLs. To address this issue, we suggest an effective Balanced Adaptive Subspace Collaboration (BASC) method in this paper. Specifically, this method decomposes the search space into sub-fronts based on PLs and utilizes a sampling mechanism that operates exclusively within subspaces formed by sub-fronts at the same PL to generate new solutions, thereby emphasizing the exploitation of individual PLs. Furthermore, a set of parameters is employed to control the strictness of adherence to each PL, with these parameters adaptively adjusted to balance exploration across different PLs. The two mechanisms are then collaboratively integrated into MOEAs. Comprehensive experimental studies on benchmark problems and a set of complex job-shop scheduling problems in semiconductor manufacturing demonstrate the competitiveness of the proposed method over existing methods.

Abstract: Batch Bayesian Optimisation (BBO) has emerged as a potent approach for optimising expensive blackbox functions. Central to BBO is the issue of selecting a number of solutions at the same time through a batch method, in the hope for them to represent good, yet different, trade-offs between exploitation and exploration. To address this issue, one of the recent advancements has leveraged multi-objective optimisation to simultaneously consider several acquisition functions (e.g., PI, EI, and LCB), allowing them to complement each other. However, acquisition functions may behave similarly (since they all aim for a good balance between exploitation and exploration), restricting the search on different promising areas. In this paper, we attempt to address the above issue. We directly treat exploitation (reflected by quality, i.e., the posterior mean) and exploration (reflected by uncertainty) as two objectives. When selecting trade-off solutions between the two objectives, we consider a dynamically updated Pareto front where the uncertainty changes once a solution is selected, thereby allowing exploration on different promising areas. Through an extensive experiment study, we show the effectiveness of the proposed method in comparison with state-of-the-arts in the area.

Abstract: The maximin optimisation problem, inspired by Von Neumann’s work (von Neumann 1928) and widely applied in adversarial optimisation, has become a key research area in machine learning. Gradient Descent Ascent (GDA) is a common method for solving these problems but requires the payoff function to be differentiable, making it unsuitable for discrete or binary functions that often occur in game-theoretical scenarios. Co-evolutionary algorithms (CoEAs), which are derivative-free, offer an alternative to these problems. However, the theoretical understanding of CoEAs is still limited. This paper provides the first rigorous runtime analysis of CoEAs with pairwise dominance on binary two-player zero-sum games (or maximin problems), specifically focusing on the DIAGONAL game. The mathematical analysis rigorously shows that the PDCoEA can efficiently find the optimum in polynomial runtime with high probability under low mutation rates and large population sizes. Empirical evidence also identifies an error threshold where higher mutation rates lead to inefficiency. In contrast, single-pair-individual algorithms, i.e., RLS-PD and (1+1)-CoEAs, fail to find the optimum in polynomial time. These findings highlight the usefulness of pairwise dominance, low mutation rates, and large populations in maintaining a “co-evolutionary arms race”.

Abstract: Decision trees, owing to their interpretability, are attractive as control policies for (dynamical) systems. Unfortunately, constructing, or synthesising, such policies is a challenging task. Previous approaches do so by imitating a neuralnetwork policy, approximating a tabular policy obtained via formal synthesis, employing reinforcement learning, or modelling the problem as a mixed-integer linear program. However, these works may require access to a hard-to-obtain accurate policy or a formal model of the environment (within reach of formal synthesis), and may not provide guarantees on the quality or size of the final tree policy. In contrast, we present an approach to synthesise optimal decision-tree policies given a deterministic black-box environment and specification, a discretisation of the tree predicates, and an initial set of states, where optimality is defined with respect to the number of steps to achieve the goal. Our approach is a specialised search algorithm which systematically explores the (exponentially large) space of decision trees under the given discretisation. The key component is a novel trace-based pruning mechanism that significantly reduces the search space. Our approach represents a conceptually novel way of synthesising small decision-tree policies with optimality guarantees even for black-box environments with black-box specifications.

Abstract: Diffusion models are vulnerable to backdoor attacks, where malicious attackers inject backdoors by poisoning certain training samples during the training stage. This poses a significant threat to realworld applications in the Model-as-a-Service (MaaS) scenario, where users query diffusion models through APIs or directly download them from the internet. To mitigate the threat of backdoor attacks under MaaS, black-box input-level backdoor detection has drawn recent interest, where defenders aim to build a firewall that filters out backdoor samples in the inference stage, with access only to input queries and the generated results from diffusion models. Despite some preliminary explorations on the traditional classification tasks, these methods cannot be directly applied to the generative tasks due to two major challenges: (1) more diverse failures and (2) a multi-modality attack surface. In this paper, we propose a black-box input-level backdoor detection framework on diffusion models, called UFID. Our defense is motivated by an insightful causal analysis: Backdoor attacks serve as the confounder, introducing a spurious path from input to target images, which remains consistent even when we perturb the input samples with Gaussian noise. We further validate the intuition with theoretical analysis. Extensive experiments across different datasets on both conditional and unconditional diffusion models show that our method achieves superb performance on detection effectiveness and run-time efficiency.

Abstract: Ensuring that AI systems make strategic decisions aligned with the specified preferences in adversarial sequential interactions is a critical challenge for developing trustworthy AI systems, especially when the environment is stochastic and players' incomplete preferences leave some outcomes unranked. We study the problem of synthesizing preferencesatisfying strategies in two-player stochastic games on graphs where players have opposite (possibly incomplete) preferences over a set of temporal goals. We represent these goals using linear temporal logic over finite traces (LTLf), which enables modeling the nuances of human preferences where temporal goals need not be mutually exclusive and comparison between some goals may be unspecified. We introduce a solution concept of non-dominated almost-sure winning, which guarantees to achieve a most preferred outcome aligned with specified preferences while maintaining robustness against the adversarial behaviors of the opponent. Our results show that strategy profiles based on this concept are Nash equilibria in the game where players are risk-averse, thus providing a practical framework for evaluating and ensuring stable, preference-aligned outcomes in the game. Using a drone delivery example, we demonstrate that our contributions offer valuable insights not only for synthesizing rational behavior under incomplete preferences but also for designing games that motivate the desired behavior from the players in adversarial conditions.

Abstract: Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, it remains vulnerable to jailbreak attacks, where adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and further guides us in enhancing their security. Unfortunately, existing techniques mainly rely on handcrafted templates or generatedbased optimization, posing challenges in scalability, efficiency and universality. To address these issues, we present JailPO, a novel black-box jailbreak framework to examine LLM alignment. For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. Furthermore, we introduce a preference optimization-based attack method to enhance the jailbreak effectiveness, thereby improving efficiency. To analyze model vulnerabilities, we provide three flexible jailbreak patterns. Extensive experiments demonstrate that JailPO not only automates the attack process while maintaining effectiveness but also exhibits superior performance in efficiency, universality, and robustness against defenses compared to baselines. Additionally, our analysis of the three JailPO patterns reveals that attacks based on complex templates exhibit higher attack strength, whereas covert question transformations elicit riskier responses and are more likely to bypass defense mechanisms.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a commonly used alignment method for Large Language Models (LLMs). This method relies on a reward model trained on a preference dataset to provide scalar rewards. However, the humanannotated preference data is often sparse, noisy, and costly to obtain, necessitating more efficient utilization. This paper proposes a new metric for better preference data utilization from both theoretical and empirical perspectives. Starting with the Bradley-Terry model, we compute the Mean Square Error (MSE) between the expected loss and empirical loss of the reward model. Our findings reveal that data with higher and more consistent difference result in lower MSE. We therefore propose the Preference Difference (PD), the reward difference between two samples, as a filter for preference data. Experimental results on three open-source models show that reward models trained by filtered data with PD achieve higher calibrated accuracy, as well as better RLHF alignment performance. The conclusion remains consistent when we extend the experiments and theoretical derivations to implicit reward alignment algorithms, such as Direct Preference Optimization (DPO).

Abstract: The advent of large language models (LLMs) has sparked significant interest in using natural language for preference learning. However, existing methods often suffer from high computational burdens, taxing human supervision, and lack of interpretability. To address these issues, we introduce MAPLE, a framework for large language modelguided Bayesian active preference learning. MAPLE leverages LLMs to model the distribution over preference functions, conditioning it on both natural language feedback and conventional preference learning feedback, such as pairwise trajectory rankings. MAPLE employs active learning to systematically reduce uncertainty in this distribution and incorporates a language-conditioned active query selection mechanism to identify informative and easy-to-answer queries, thus reducing the burden on humans. We evaluate MAPLE's sample efficiency and preference inference quality across two benchmarks, including a real-world vehicle route planning benchmark using OpenStreetMap data. Our results demonstrate that MAPLE accelerates the learning process and effectively improves humans' ability to answer queries.

Abstract: Recent work has proposed automated redteaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.

Abstract: As AI agents are increasingly adopted to collaborate on complex objectives, ensuring the security of autonomous multiagent systems becomes crucial. We develop simulations of agents collaborating on shared objectives to study these security risks and security trade-offs. We focus on scenarios where an attacker compromises one agent, using it to steer the entire system toward misaligned outcomes by corrupting other agents. In this context, we observe infectious malicious prompts - the multi-hop spreading of malicious instructions. To mitigate this risk, we evaluated several strategies: two "vaccination" approaches that insert false memories of safely handling malicious input into the agents' memory stream, and two versions of a generic safety instruction strategy. While these defenses reduce the spread and fulfillment of malicious instructions in our experiments, they tend to decrease collaboration capability in the agent network. Our findings illustrate potential trade-off between security and collaborative efficiency in multi-agent systems, providing insights for designing more secure yet effective AI collaborations.

Abstract: Forests are vital to ecosystems, supporting biodiversity and essential services, but are rapidly changing due to land use and climate change. Understanding and mitigating negative effects requires parsing data on forests at global scale from a broad array of sensory modalities, and using them in diverse forest monitoring applications. Such diversity in data and applications can be effectively addressed through the development of a large, pretrained foundation model that serves as a versatile base for various downstream tasks. However, remote sensing modalities, which are an excellent fit for several forest management tasks, are particularly challenging considering the variation in environmental conditions, object scales, image acquisition modes and spatio-temporal resolutions, etc. With that in mind, we present the first unified Forest Monitoring Benchmark (FoMo-Bench), carefully constructed to evaluate foundation models with such flexibility. FoMo-Bench consists of 15 diverse datasets encompassing satellite, aerial, and inventory data, covering a variety of geographical regions, and including multispectral, red-green-blue, synthetic aperture radar and LiDAR data with various temporal, spatial and spectral resolutions. FoMo-Bench includes multiple types of forest-monitoring tasks, spanning classification, segmentation, and object detection. To enhance task and geographic diversity in FoMo-Bench, we introduce TalloS, a global dataset combining satellite imagery with ground-based annotations for tree species classification across 1,000+ categories and hierarchical taxonomic levels. Finally, we propose FoMo-Net, a pre-training framework to develop foundation models with the capacity to process any combination of commonly used modalities and spectral bands in remote sensing. This work aims to inspire research collaborations between machine learning and forest biology researchers in exploring scalable multi-modal and multi-task models for forest monitoring and beyond. All code, data and appendices are published in the repository and on ArXiv.

Abstract: Emergency response services are vital for enhancing public safety by safeguarding the environment, property, and human lives. As frontline members of these services, 91-1 dispatchers have a direct impact on response times and the overall effectiveness of emergency operations. However, traditional dispatcher training methods, which rely on role-playing by experienced personnel, are labor-intensive, time-consuming, and often neglect the specific needs of underserved communities. To address these challenges, we introduce Sim911, the first training simulation for 9-1-1 dispatchers powered by Large Language Models (LLMs). Sim911 enhances training through three key technical innovations: (1) knowledge construction, which utilizes archived 9-1-1 call data to generate simulations that closely mirror real-world scenarios; (2) context-aware controlled generation, which employs dynamic prompts and vector bases to ensure that LLM behavior aligns with training objectives; and (3) validation with looped correction, which filters out low-quality responses and refines the system performance. Experimental results show Sim911's superior performance in effectiveness and equity. Beyond its technical advancements, Sim911 delivers significant social impacts. Successfully deployed in the Metro X of Emergency Communications (MXDEC)(PS: To ensure a double-blind review, we refer to the city as 'City X,' a mid-sized U.S. city with a population of over 700,000. Its Metro Department of Emergency Communications (MXDEC) employs around 80 dispatchers and call-takers. For the rest of the paper, we refer to MXDEC as 'DEC.') Sim911 has been integrated into multiple training sessions, saving time for dispatchers. By supporting a diverse range of incident types and caller tags, Sim911 provides more realistic and inclusive training experiences. In a conducted user study, 90.00 percent of participants found Sim911 to be as effective or even superior to traditional human-led training, making it a valuable tool for emergency communications centers nationwide, particularly those facing staffing challenges.

Abstract: Mechanical Ventilation (MV) is a critical lifesupport intervention in intensive care units (ICUs). However, optimal ventilator settings are challenging to determine because of the complexity of balancing patient-specific physiological needs with the risks of adverse outcomes that impact morbidity, mortality, and healthcare costs. This study introduces ConformalDQN, a novel distribution-free conformal deep Q-learning approach for optimizing mechanical ventilation in intensive care units. By integrating conformal prediction with deep reinforcement learning, our method provides reliable uncertainty quantification, addressing the challenges of Q-value overestimation and out-of-distribution actions in offline settings. We trained and evaluated our model using ICU patient records from the MIMIC-IV database. ConformalDQN extends the Double DQN architecture with a conformal predictor and employs a composite loss function that balances Q-learning with well-calibrated probability estimation. This enables uncertainty-aware action selection, allowing the model to avoid potentially harmful actions in unfamiliar states and handle distribution shifts by being more conservative in out-of-distribution scenarios. Evaluation against baseline models, including physician policies, policy constraint methods, and behavior cloning, demonstrates that ConformalDQN consistently makes recommendations within clinically safe and relevant ranges, outperforming other methods by increasing the 90-day survival rate. Notably, our approach provides an interpretable measure of confidence in its decisions, which is crucial for clinical adoption and potential human-in-the-loop implementations.

Abstract: The soaring drug overdose crisis in the United States has claimed more than half a million lives in the past decade and remains a major public health threat. The ability to predict drug overdose deaths at the county level can help local communities develop action plans in response to emerging changes. Applying offthe-shelf machine learning algorithms for prediction can be challenging due to the heterogeneous risk profiles of the counties and suppressed data in common publicly available data sources. To fill these gaps, we develop a cluster-aware supervised learning (CASL) framework to enhance the prediction of county-level drug overdose deaths. This CASL model simultaneously clusters counties into groups based on geographical and socioeconomic characteristics and minimizes the loss function that accounts for suppressed values and cluster-specific regularization. Our computational study uses real-world data from 2010 to 2021, focusing on the ten states most severely impacted by the drug overdose crisis. The results demonstrate that our proposed CASL framework significantly outperforms state-of-the-art methods by achieving a superior balance in prediction accuracy for both unsuppressed and suppressed observations. The proposed model also identifies different clusters of counties, capturing heterogeneous patterns of overdose mortality among counties of diverse characteristics.

Abstract: Addressing global sustainability challenges as outlined by the United Nations (UN) Sustainable Development Goals (SDGs) often requires navigating many potentially conflicting societal objectives simultaneously. For instance, increasing hydropower production enhances renewable energy supply but may adversely impact people and nature. Understanding these tradeoffs is crucial, and the Pareto frontier - the set of solutions that cannot be improved with respect to one objective without negatively affecting another - is a valuable framework. Strategic hydropower planning concerns finding energy portfolios that achieve decarbonization targets, while balancing energy production with socioeconomic and environmental impacts. Previous work has considered exact and approximate algorithms for Pareto optimization for tree-structured networks, such as rivers, for hydropower planning. However, such approaches do not account for bounding constraints, such as realistic energy production targets, critical in real-world applications. Herein, we propose a novel approach for constraint-aware Pareto optimization for tree-structured networks, incorporating objective bounds to ensure more realistic and robust solution outcomes. We apply our constraint-aware Pareto approach to the strategic planning of hydropower expansion, considering energy bounds to adhere to the UN's net zero by 2050 decarbonization targets, in the Magdalena River basin, home to more than 80% of Colombia’s population. Our analysis demonstrates how lower and upper bounds can significantly modify the unconstrained Pareto frontier, revealing that feasible Pareto solutions can be dominated by infeasible solutions, and thus may be ignored by constraint-agnostic solvers. Our results highlight the importance of considering real-world constraints in multi-objective problems such as optimizing hydropower expansion to meet both energy and sustainability goals.

Abstract: Federated Learning (FL) in healthcare ensures patient privacy by allowing hospitals to collaboratively train machine learning models while keeping sensitive medical data secure and localized. Most existing research in FL has concentrated on unimodal scenarios, where all healthcare institutes share the same type of data. However, in realworld healthcare situations, some clients may have access to multiple types of data pertaining to the same disease. Multimodal Federated Learning (MMFL) utilizes multiple modalities to build a more powerful FL model than its unimodal counterpart. However, the impact of missing modality in different clients, called modality incongruity, has been greatly overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. We particularly inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes of addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms towards incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN) pre-trained in a multimodal client for modality translation in unimodal clients and investigate its potential towards mitigating the missing modality problem. Thirdly, we introduce several client-level and server-level regularization techniques including Modality-aware knowledge Distillation (MAD) and Leave-one-out teacher (LOOT) towards mitigating modality incongruity effects. Experiments are conducted with Chest X-Ray and radiology reports under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I.

Abstract: BrainComputer Interfaces (BCIs) help people with severe speech and motor disabilities communicate and interact with their environment using neural activity. This work focuses on the Rapid Serial Visual Presentation (RSVP) paradigm of BCIs using noninvasive electroencephalography (EEG). The RSVP typing task is a recursive task with multiple sequences, where users see only a subset of symbols in each sequence. Extensive research has been conducted to improve classification in the RSVP typing task, achieving fast classification. However, these methods struggle to achieve high accuracy and do not consider the typing mechanism in the learning procedure. They apply binary target and non-target classification without including recursive training. To improve performance in the classification of symbols while controlling the classification speed, we incorporate the typing setup into training by proposing a Partially Observable Markov Decision Process (POMDP) approach. To the best of our knowledge, this is the first work to formulate the RSVP typing task as a POMDP for recursive classification. Experiments show that the proposed approach, MarkovType, results in a more accurate typing system compared to competitors. Additionally, our experiments demonstrate that while there is a trade-off between accuracy and speed, MarkovType achieves the optimal balance between these factors compared to other methods.

Abstract: Improving global school connectivity is critical for ensuring inclusive and equitable quality education. To reliably estimate the cost of connecting schools, governments and connectivity providers require complete and accurate school location data – a resource that is often scarce in many lowand middle-income countries. To address this challenge, we propose a cost-effective, scalable approach to locating schools in high-resolution satellite images using weakly supervised deep learning techniques. Our best models, which combine vision transformers and convolutional neural networks, achieve AUPRC values above 0.96 across 10 pilot African countries. Leveraging explainable AI techniques, our approach can approximate the precise geographical coordinates of the school locations using only low-cost, classification-level annotations. To demonstrate the scalability of our method, we generate nationwide maps of school location predictions in African countries and present a detailed analysis of our results, using Senegal as our case study. Finally, we demonstrate the immediate usability of our work by introducing an interactive web mapping tool to streamline human-in-the-loop model validation efforts by government partners. This work successfully showcases the real-world utility of deep learning and satellite images for planning regional infrastructure and accelerating universal school connectivity.

Abstract: Probabilistic inference is a fundamental challenge in machine learning, spanning tasks from approximate Bayesian inference to generative AI. In this talk, I will present theoreticallyguaranteed scalable and efficient probabilistic inference with applications in Bayesian deep learning and generative modeling. First, I will introduce a new compute paradigm for probabilistic inference that leverages modern accelerators, specifically low-precision and sparsity, to significantly speed up inference while preserving accuracy. Next, I will present a new framework for efficient inference in discrete domains, utilizing gradient information—a largely overlooked feature of discrete distributions—to enable more informed and directional exploration. Finally, I will showcase experimental results demonstrating the effectiveness of these methods across various ML tasks, including Bayesian neural networks, energy-based models, and large language models.

Abstract: In the domain of merchantoriented risk control decisions within e-commerce, balancing the effectiveness of risk management with merchant satisfaction remains a critical challenge. Strict risk control strategies, while effectively mitigating risks, often lead to increased merchant dissatisfaction. Conversely, loose policies could enhance the merchant experience but raise the likelihood of incidents, potentially incurring substantial financial losses. Additionally, determining personalized risk control strategies for different merchants to achieve optimal overall risk management effectiveness is crucial. Given the high uncertainty in the outcomes of different risk control decisions, manual strategy allocation and real-time adjustments are commonly implemented in practice, leading to significant human and resource costs. In this work, we present a novel automated risk control decision framework that utilizes unbiased data-driven decision-making and dynamic optimization to automate the allocation and adjustment of risk control strategies. Our proposed solution adapts to various online business requirements, demonstrating exceptional risk management performance and significantly reducing overall costs. This approach has been extensively deployed and validated in Alibaba's risk control operations, achieving large-scale automated risk control decisions.

Abstract: Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twicedaily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, a mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trials setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.

Abstract: Within the Engineering, Procurement, and Construction (EPC) industry, engineers manually create documents based on engineering drawings, which can be timeconsuming and prone to human error. For example, the expansion of typical assemblies of instrument items (Instrument Typicals) in Piping and Instrumentation Diagrams (P&IDs) is a labor-intensive task. Each Instrument Typical assembly is depicted in the P&IDs via a simplified representation showing only a subset of the utilized instruments. The expansion activity involves recording all utilized instruments to create an instrument item list document based on the P&IDs for a particular EPC project. Fortunately, Artificial Intelligence (AI) could help to automate this process. In this paper, we propose the first method for automating the process of Instrument Typical expansion in P&IDs. The method utilizes computer vision techniques and domain knowledge rules to extract information about the Instrument Typicals from a project's P&IDs and legend sheets. Subsequently, the extracted information is used to automatically generate the listing of all utilized instruments. The effectiveness of our method is evaluated on P&IDs from large industrial EPC projects, resulting in precision rates exceeding 98% and recall rates surpassing 99%. These results demonstrate the suitability of our method for industrial deployment. The successful application of our method has the potential to reduce engineering costs and increase the efficiency of EPC projects. Furthermore, the method could be adapted for additional applications in the EPC industry, which highlights the method's industrial value.

Abstract: In August of 2024, 495 hackers generated evaluations in an openended bug bounty targeting the Open Language Model (OLMo) from The Allen Institute for AI. A vendor panel staffed by representatives of OLMo's safety program adjudicated changes to OLMo's documentation and awarded cash bounties to participants who successfully demonstrated a need for public disclosure clarifying the intent, capacities, and hazards of model deployment. This paper presents a collection of lessons learned, illustrative of flaw reporting best practices intended to reduce the likelihood of incidents and produce safer large language models (LLMs). These include best practices for safety reporting processes, their artifacts, and safety program staffing.

Abstract: Concerns about the risks and harms posed by artificial intelligence (AI) have resulted in significant study into algorithmic transparency, giving rise to a subfield known as Explainable AI (XAI). Unfortunately, despite a decade of development in XAI, an existential challenge remains: progress in research has not been fully translated into the actual implementation of algorithmic transparency by organizations. In this work, we test an approach for addressing the challenge by creating transparency advocates, or motivated individuals within organizations who drive a ground-up cultural shift towards improved algorithmic transparency. Over several years, we created an open-source educational workshop on algorithmic transparency and advocacy. We delivered the workshop to professionals across two separate domains to improve their algorithmic transparency literacy and willingness to advocate for change. In the weeks following the workshop, participants applied what they learned, such as speaking up for algorithmic transparency at an organization-wide AI strategy meeting. We also make two broader observations: first, advocacy is not a monolith and can be broken down into different levels. Second, individuals' willingness for advocacy is affected by their professional field. For example, news and media professionals may be more likely to advocate for algorithmic transparency than those working at technology start-ups.

Abstract: Assessing students' responses, especially natural language responses, is a major challenge in education. In general, in education contexts, automatically evaluating what learners do or say is important as it enables personalized instruction, e.g., based on what the learner knows tailored tasks and feedback are given to the learner. Recently, deep learning techniques led to stateof-the-art methods in NLP such as transformer-based methods which resulted in significant performance improvements for many NLP tasks such as text classification and question answering. However, there is not much work exploring such methods for assessing students' free answers, particularly in the context of code comprehension, which brings additional challenges as the student explanations include code references as well. This paper explores the potential of applying automated assessments methods using transformers to code comprehension. We fine-tuned pre-trained transformer models, including BERT, RoBERTa, CodeBERT, and SciBERT, to see how well they can automatically judge students' responses to code comprehension tasks. Our results demonstrate that these models can significantly enhance the accuracy and reliability of automated assessments, offering insights into how the latest NLP techniques can be leveraged in computer science education to support personalized learning experiences.

Abstract: Initial discussion of AI literacy assessment has focused on competency frameworks and learning standards rather than materials for classroom use. Responsible AI for Computational Action (RAICA), a constructionist AI curriculum for middle and high school students, includes assessment materials to support teachers with the evaluation of student AI literacy competencies in their classrooms. These materials include exit tickets used as formative assessments at the end of each lesson and both teacher and studentfacing rubrics. After beta-testing a module of the curriculum with nine teachers and 282 students, we reviewed teacher usage data and feedback as well as student responses. The review process surfaced a number of improvements to the materials to better align them with classroom teaching practice. These included clarifying language and adding visual scaffolds. We present the assessment materials and iterative design process used to bridge the gap between the theoretical AI literacy competencies and their practical implementation in classrooms.

Abstract: Pretrained transformerbased Language Models (LMs) are well-known for their ability to achieve significant improvement on NLP tasks, but their black-box nature, which leads to a lack of interpretability, has been a major concern. My dissertation focuses on developing intrinsically interpretable models when using LMs as encoders while maintaining their superior performance via prototypical networks. I initiated my research by investigating enhancements in performance for interpretable models of sarcasm detection. My proposed approach focuses on capturing sentiment incongruity to enhance accuracy while offering instance-based explanations for the classification decisions. Later, we develop a novel white-box multi-head graph attention-based prototypical framework designed to explain the decisions of text classification models without sacrificing the accuracy of the original black-box LMs. In addition, I am working on extending the attention-based prototypical framework with contrastive learning to redesign an interpretable graph neural network for document classification, aiming to enhance both the interpretability and performance of the model in document classification.

Abstract: Synthesizing electronic health records (EHR) is essential for addressing data scarcity, bias, and fairness in healthcare models. EHR data are inherently multimodal and sequential, encompassing structured codes, clinical notes, medical images, and irregular time intervals. Traditional generative models like GANs and VAEs struggle to capture these complexities, while diffusionbased models offer improvements but remain limited to task-specific applications. To address these challenges, two diffusion-based models, MedDiffusion and EHRPD, have been developed. MedDiffusion enhances health risk prediction by generating synthetic patient data and capturing visit-level relationships, while EHRPD generates sequential, multimodal EHR data, incorporating temporal interval estimation to improve diversity and fidelity. Future work aims to overcome limitations in multimodal data generation by developing a generalized model capable of handling diverse modalities simultaneously, expanding the applicability of EHR data generation across healthcare tasks.

Abstract: Unit testing is essential for ensuring software quality, but it is often timeconsuming and prone to developer oversight. With the rise of large language models (LLMs) in code generation, there is an increasing need for reliable and automated test generation systems. This work presents QAagent, a multi-agent system designed to generate unit tests using natural language pseudocode. QAagent leverages LLMs to create a detailed natural language plan of a function's implementation and then generates a comprehensive suite of test cases covering both base and edge scenarios. Experiments conducted on two widely-used benchmarks, HumanEval and MBPP, show that QAagent consistently outperforms existing frameworks in terms of code coverage, although its accuracy varies across datasets, demonstrating the potential for utilizing natural language pseudocode to to enhance automated test generation in LLM-driven coding environments.

Abstract: Ocean exploration places high demands on autonomous underwater vehicles, especially when there's observation delay. We propose age of information optimized Markov decision process (AoIMDP) to enhance underwater tasks by modeling observation delay as signal delay and including it in the state space. AoI-MDP also introduces wait time in the action space and integrates AoI with reward functions, optimizing information freshness and decision-making using reinforcement learning. Simulations show AoI-MDP outperforms the standard MDP, demonstrating superior performance, feasibility, and generalization in underwater tasks. To accelerate relevant research, we have made the codes available as open-source at https://github.com/Xiboxtg/AoI-MDP.

Abstract: Accurate forecasting of medication usage and ICD9/10 code streams is critical for optimizing medical logistics, especially during periods of high demand, such as pandemics, disease outbreaks, wartime, or natural disasters. In this study, we develop a novel and robust forecasting framework using unsupervised learning techniques and Natural Language Processing (NLP) methods to build vector representations of daily ICD-9/10 codes and medication daily usage from Electronic Health Record (EHR) data. Multiple forecasting models, including Linear Drift Model, Vector Autoregression (VAR), Temporal Fusion Transformer (TFT), and Autoregressive Long Short-Term Memory (AR-LSTM) are trained, tested and evaluated. Finally multiple TFT and AR-LSTM models with different lookback horizon are trained and ensembled together to achieve better forecasting accuracy in near further (10 days). The AI framework is validated using MIMIC-IV ER and MIMIC-III datasets, resulting in the average forecasting error 5.2% at 5-th day and 18.1% at the 10-th day. The results demonstrate the ensemble model’s superior performance on near-future medication usage forecasting and ICD code progression, offering valuable insights for healthcare logistics and decision making. The framework also provides the mechanism to detect the model drift and finetune the model if necessary, which offers a robust tool for managing healthcare logistics under extreme and fluctuating conditions.

Abstract: We propose a framework that uses renormalization group (RG) theory from statistical physics to analyze and optimize the hierarchical feature learning process in deep neural networks. Here, the layerwise transformations in deep networks can be viewed as analogous to RG transformations, with each layer implementing a coarse-graining operation that extracts increasingly abstract features. We propose an approach to enforce scale invariance in neural networks, introduce scale-aware activation functions, and derive RG flow equations for network parameters. We show that our approach leads to fixed points corresponding to scale-invariant feature representations. Finally, we propose an RG-guided training procedure that converges to these fixed points while minimizing the loss function.

Abstract: Wearable devices are transforming healthcare by providing continuous, realtime physiological data for monitoring and analysis. However, data often suffer from noise and significant missing values due to operational constraints and user compliance. Traditional approaches address these issues through data imputation during pre-processing, introducing biases and inaccuracies. We propose a novel method enabling Recurrent Neural Networks (RNNs) to inherently handle missing data without imputation. By implementing teacher-forcing during Backpropagation Through Time (BPTT) when data are available and switching to autonomous mode otherwise, our approach leverages RNNs' dynamics to model physiological signals accurately. We demonstrate our method's effectiveness using the Lorenz 63 system as a surrogate dataset, achieving robust reconstructions with 80% missing data.

Abstract: Artificial neural networks (ANNs) struggle with continual learning, sacrificing performance on previously learned tasks to acquire new task knowledge. Here we propose a new approach allowing to mitigate catastrophic forgetting during continuous task learning. Typically a new task is trained until it reaches maximal performance, causing complete catastrophic forgetting of the previous tasks. In our new approach, termed Optimal Stopping (OS), network training on each new task continues only while the mean validation accuracy across all the tasks (current and previous) increases. The stopping criterion creates an explicit balance: lower performance on new tasks is accepted in exchange for preserving knowledge of previous tasks, resulting in higher overall network performance. The overall performance is further improved when OS is combined with Sleep Replay Consolidation (SRC), wherein the network converts to a Spiking Neural Network (SNN) and undergoes unsupervised learning modulated by Hebbian plasticity. During the SRC, the network spontaneously replays activation patterns from previous tasks, helping to maintain and restore prior task performance. This combined approach offers a promising avenue for enhancing the robustness and longevity of learned representations in continual learning models, achieving over twice the mean accuracy of baseline continuous learning while maintaining stable performance across tasks.

Abstract: Ocean exploration requires effective collaboration between the unmanned surface vehicle (USV) and autonomous underwater vehicles (AUVs). We propose UACOF, a USVAUV collaboration framework that enhances multi-AUV performance under extreme sea conditions. The framework includes high-precision multi-AUV location via USV path planning with Fisher information matrix optimization and reinforcement learning training for cooperative tasks. Experimental results show UACOF's superior feasibility, performance, coordination and robustness in extreme conditions.

Abstract: The lack of personalization in early education can often leave students with weak foundational skills, causing said students to be behind in their studies. Personalized learning, the idea of tailoring a unique lesson plan to a student, has been shown to improve the understanding of content learned. Robots utilizing personalization techniques in educational settings, coined social robots, have been able to form a connection with students, thereby keeping them engaged while learning. This proposal seeks to study the effects of AIdriven social robotic tutors coupled with personalized learning on early childhood education. The study will consist of five groups of K-4 students: two groups learning while utilizing both a social robot and a tablet (one group with personalized learning and the other without), the two groups interacting with only the tablet (with and without personalization), and the last group learning utilizing both a non-personalized learning tablet and a non-social robot. This study aims to determine whether the combination of robotic interaction and personalized learning leads to better outcomes than solely tablet-based or non-personalized methods. This study will focus on teaching mathematics to the participants. Pre and post-tests will measure learning progress, and the influence of robot interaction on student engagement will also be evaluated. It is expected that the students with social robotic tutors and personalized learning tablets will show the greatest knowledge retention, outperforming all other categories. These findings could have significant implications for the integration of AI and robotics in early education, potentially revolutionizing how personalized learning is implemented therefore improving educational outcomes for young learners.

Abstract: Healthcare diagnostics, especially in underserved communities, faces critical gaps in accessibility and accuracy. African Americans experience significant disparities in mental health care, often receiving delayed or inadequate treatment. This research proposes a diagnostic copilot, an AIpowered assistant designed to work alongside healthcare professionals. Using Knowledge-Infused Learning (KIL) and multi-turn conversations, the system integrates clinical knowledge and patient input to deliver actionable, explainable diagnoses in real-time. By engaging with both patients and clinicians, the copilot aims to reduce disparities, enhance trust, and improve diagnostic accuracy in mental health care.

Abstract: Textto-Video (T2V) models, despite recent advancements, struggle with factual accuracy, especially for knowledge-dense content. We introduce FACT-V (Factual Accuracy in Content Translation to Video), a system integrating multi-source knowledge retrieval into T2V pipelines. FACT-V offers two key benefits: i) improved factual accuracy of generated videos through dynamically retrieved information, and ii) increased interpretability by providing users with the augmented prompt information. A preliminary evaluation demonstrates the potential of knowledge-augmented approaches in improving the accuracy and reliability of T2V systems, particularly for entity-specific or time-sensitive prompts.

Abstract: Our goal is to develop an AI Partner that can provide support for group problem solving and social dynamics. In multiparty working group environments, multimodal analytics is crucial for identifying non-verbal interactions of group members. In conjunction with their verbal participation, this creates an holistic understanding of collaboration and engagement that provides necessary context for the AI Partner. In this demo, we illustrate our present capabilities at detecting and tracking nonverbal behavior in student task-oriented interactions in the classroom, and the implications for tracking common ground and engagement.

Abstract: The potential of Large Language Models (LLMs) in education is not trivial, but concerns about academic misconduct, misinformation, and overreliance limit their adoption. To address these issues, we introduce MerryQuery, an AIpowered educational assistant using Retrieval-Augmented Generation (RAG), to provide contextually relevant, course-specific responses. MerryQuery features guided dialogues and source citation to ensure trust and improve student learning. Additionally, it enables instructors to monitor student interactions, customize response granularity, and input multimodal materials without compromising data fidelity. By meeting both student and instructor needs, MerryQuery offers a responsible way to integrate LLMs into educational settings.

Abstract: We present TRACEcs, a novel hybrid system that combines symbolic reasoning with large language models (LLMs) to address contrastive queries in scheduling problems. TRACE-cs leverages SAT solving techniques to encode scheduling constraints and generate explanations for user queries, while utilizing an LLM to process the user queries into logical clauses as well as refine the explanations generated by the symbolic solver to natural language sentences. By integrating these components, our approach demonstrates the potential of combining symbolic methods with LLMs to create explainable AI agents with correctness guarantees.

Abstract: The Internet of Things (IoT) is widely used in many applications such as smart city, transportation, healthcare, and environment monitoring. A key task of IoT maintenance is to analyze the abnormal sensor records and generate incident report. Traditionally, domain experts engage in such labor intensive tasks. Recent advances in Large Language Model (LLM) have sparked interests in developing AI based systems to automate these labor intensive processes. However, two critical problems hinder the effective application of LLM in IoTs: (1) LLM lacks background knowledge of deployed IoTs; and (2) the incidents are complex events involving many sensors and components. LLM needs to understand the sensor relationships for accurate diagnosis. In this study, we propose a Retrieval Augmented language model based Incident Diagnosing and Reporting system (RAIDR) for IoT applications. RAIDR retrieves related system documents based on the incident features and leverages LLM to analyze anomalies, identify root causes, and automatically generate incident reports. The automated incident reporting process streamlines end users’ decision making for system maintenance and troubleshooting.

Abstract: Chinese characters are a unique blend of language and art, featuring diverse artistic styles. Mastering these styles requires extensive practice and limits public participation. To encourage broader participation, we developed a realtime, interactive tool that supports multiple Chinese character art styles. This tool uses a diffusion model and several LoRA models to capture the diversity of Chinese character art. It generates personalized, visually striking Chinese character artworks in real-time by utilizing handwritten input, allowing users to adjust various stylistic parameters.

Abstract: Federated Learning (FL) can be affected by data and device heterogeneities, caused by clients' different local data distributions and latencies in uploading model updates (i.e., staleness). Traditional schemes consider these heterogeneities as two separate and independent aspects, but this assumption is unrealistic in practical FL scenarios where these heterogeneities are intertwined. In these cases, traditional FL schemes are ineffective, and a better approach is to convert a stale model update into a unstale one. In this paper, we present a new FL framework that ensures the accuracy and computational efficiency of this conversion, hence effectively tackling the intertwined heterogeneities that may cause unlimited staleness in model updates. Our basic idea is to estimate the distributions of clients' local training data from their uploaded stale model updates, and use these estimations to compute unstale client model updates. In this way, our approach does not require any auxiliary dataset nor the clients' local models to be fully trained, and does not incur any additional computation or communication overhead at client devices. We compared our approach with the existing FL strategies on mainstream datasets and models, and showed that our approach can improve the trained model accuracy by up to 25% and reduce the number of required training epochs by up to 35%.

Abstract: In the domain of multivariate time series analysis, the concept of channel independence has been increasingly adopted, demonstrating excellent performance due to its ability to eliminate noise and the influence of irrelevant variables. However, such a concept often simplifies the complex interactions among channels, potentially leading to information loss. To address this challenge, we propose a strategy of channel independence followed by mixing. Based on this strategy, we introduce CSformer, a novel framework featuring a twostage multiheaded self-attention mechanism. This mechanism is designed to extract and integrate both channel-specific and sequence-specific information. Distinctively, CSformer employs parameter sharing to enhance the cooperative effects between these two types of information. Moreover, our framework effectively incorporates sequence and channel adapters, significantly improving the model's ability to identify important information across various dimensions. Extensive experiments on several real-world datasets demonstrate that CSformer achieves state-of-the-art results in terms of overall performance.

Abstract: VisionLanguage Models (VLMs), such as CLIP, have already seen widespread applications. Researchers actively engage in further fine-tuning VLMs in safety-critical domains. In these domains, prediction rationality is crucial: the prediction should be correct and based on valid evidence. Yet, for VLMs, the impact of fine-tuning on prediction rationality is seldomly investigated. To study this problem, we proposed two new metrics called Prediction Trustworthiness and Inference Reliability. We conducted extensive experiments on various settings and observed some interesting phenomena. On the one hand, we found that the well-adopted fine-tuning methods led to more correct predictions based on invalid evidence. This potentially undermines the trustworthiness of correct predictions from fine-tuned VLMs. On the other hand, having identified valid evidence of target objects, fine-tuned VLMs were more likely to make correct predictions. Moreover, the findings are also consistent under distributional shifts and across various experimental settings. We hope our research offer fresh insights to VLM fine-tuning.

Abstract: Personalized federated learning (PFL) has recently gained significant attention for its capability to address the poor convergence performance on highly heterogeneous data and the lack of personalized solutions of traditional federated learning (FL). Existing mainstream approaches either perform personalized aggregation based on a specific model architecture to leverage global knowledge or achieve personalization by exploiting client similarities. However, the former overlooks the discrepancies in client data distributions by indiscriminately aggregating all clients, while the latter lacks finegrained collaboration of classifiers relevant to local tasks. In view of this challenge, we propose a Personalized Federated learning method for Enhancing Collaboration among Similar Classifiers (PFedCS), which aims at improving the client’s accuracy on local tasks. Concretely, it is achieved by leveraging awareness of the client classifier similarities to address the above problems. By iteratively measuring the distance of the classifier parameters between clients and clustering with each client as a cluster center, the central server adaptively identifies the collaborating clients with similar data distributions. In addition, a distance-constrained aggregation method is designed to generate customized collaborative classifiers to guide local training. As a result, extensive experimental evaluations conducted on three datasets demonstrate that our method achieves state-of-the-art performance.

College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China

Abstract: In longterm series forecasting (LTSF), it is imperative for models to adeptly discern and distill from historical time series data to forecast future states. Although Transformer-based models excel at capturing long-term dependencies in LTSF, their practical use is limited by issues like computational inefficiency, noise sensitivity, and overfitting on smaller datasets. Therefore, we introduce a novel time series lightweight interactive Mamba with an adaptive Fourier filter model (Affirm). Specifically, (i) we propose an adaptive Fourier filter block. This neural operator employs Fourier analysis to refine feature representation, reduces noise with learnable adaptive thresholds, and captures inter-frequency interactions using global and local semantic adaptive Fourier filters via element-wise multiplication. (ii) A dual interactive Mamba block is introduced to facilitate efficient intra-modal interactions at different granularities, capturing more detailed local features and broad global contextual information, providing a more comprehensive representation for LTSF. Extensive experiments on multiple benchmarks demonstrate that Affirm consistently outperforms existing SOTA methods, offering a superior balance of accuracy and efficiency, making it ideal for various challenging scenarios with noise levels and data sizes.

Abstract: Realworld graph data environments intrinsically exist noise (e.g., link and structure errors) that inevitably disturb the effectiveness of graph representation and downstream learning tasks. For homogeneous graphs, the latest works use original node features to synthesize a similarity graph that can correct the structure of the noised graph. This idea is based on the homogeneity assumption, which states that similar nodes in the homogeneous graph tend to have direct links in the original graph. However, similar nodes in heterogeneous graphs usually do not have direct links, which can not be used to correct the original noise graph. This causes a significant challenge in noised heterogeneous graph learning. To this end, this paper proposes a novel synthesized similarity-based graph neural network compatible with noised heterogeneous graph learning. First, we calculate the original feature similarities of all nodes to synthesize a similarity-based high-order graph. Second, we propose a similarity-aware encoder to embed original and synthesized graphs with shared parameters. Then, instead of graph-to-graph supervising, we synchronously supervise the original and synthesized graph embeddings to predict the same labels. Meanwhile, a target-based graph extracted from the synthesized graph contrasts the structure of the metapath-based graph extracted from the original graph to learn the mutual information. Extensive experiments in numerous real-world datasets show the proposed method achieves state-of-the-art records in the noised heterogeneous graph learning tasks. In highlights, +5~6\% improvements are observed in several noised datasets compared with previous SOTA methods.

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), China University of Petroleum (Beijing) at Karamay, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), JD AI Research, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), China University of Petroleum (Beijing) at Karamay

Abstract: Sentiment analysis is rapidly advancing by utilizing various data modalities (e.g., text, video, and audio). However, most existing techniques only learn the atomiclevel features that reflect strong correlations, while ignoring more complex compositions in multimodal data. Moreover, they also neglected the incongruity in semantic distribution among modalities. In light of this, we introduce a novel Hierarchical Correlation Modeling Network (HCMNet), which enhances the multimodal sentiment analysis by exploring both the atomic-level correlations based on dynamic attention reasoning and the composition-level correlations through topological graph reasoning. In addition, we also alleviate the impact of distributional inconsistencies between modalities from both atomic-level and composition-level perspectives. Specifically, we first design an atomic-level contrastive loss that constrains the semantic distribution across modalities to mitigate the atomic-level inconsistency. Then, we design a graph optimal transport module that integrates transport flows with different graphs to constrain the composition-level semantic distribution, thus reducing the inconsistency of compositional nodes. Experiments on three public benchmark datasets have demonstrated the superiority of the proposed model over the state-of-the-art methods.

School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences, IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences, IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences, IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences, IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences, IPHAC Lab, Institute of Microelectronics of the Chinese Academy of Sciences School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences

Abstract: SingleDomain Generalized Object Detection (S-DGOD) aims to train on a single source domain for robust performance across a variety of unseen target domains by taking advantage of an object detector. Existing S-DGOD approaches often rely on data augmentation strategies, including a composition of visual transformations, to enhance the detector's generalization ability. However, the absence of real-world prior knowledge hinders data augmentation from contributing to the diversity of training data distributions. To address this issue, we propose PhysAug, a novel physical model-based non-ideal imaging condition data augmentation method, to enhance the adaptability of the S-DGOD tasks. Drawing upon the principles of atmospheric optics, we develop a universal perturbation model that serves as the foundation for our proposed PhysAug. Given that visual perturbations typically arise from the interaction of light with atmospheric particles, the image frequency spectrum is harnessed to simulate real-world variations during training. This approach fosters the detector to learn domain-invariant representations, thereby enhancing its ability to generalize across various settings. Without altering the network architecture or loss function, our approach significantly outperforms the state-of-the-art across various S-DGOD datasets. In particular, it achieves a substantial improvement of 7.3% and 7.2% over the baseline on DWD and Cityscape-C, highlighting its enhanced generalizability in real-world settings.

Abstract: Graph contrastive learning (GCL) has been widely used as an effective selfsupervised learning method for graph representation learning. However, how to apply adequate and stable graph augmentation to generating proper views for contrastive learning remains an essential problem. Dropping edges is a primary augmentation in GCL while adding edges is not a common method due to its unstable performance. To our best knowledge, there is no theoretical analysis to study why dropping edges usually outperforms adding edges. To answer this question, we introduce a new metric, namely Error Passing Rate (EPR), to quantify how a graph fits the network. Inspired by the theoretical conclusions and the idea of positive-incentive noise, we propose a novel GCL algorithm, Error-PAssing-based Graph Contrastive Learning (EPAGCL), which uses both edge adding and edge dropping as its augmentations. To be specific, we generate views by adding and dropping edges based on the weights derived from EPR. Extensive experiments on various real-world datasets are conducted to validate the correctness of our theoretical analysis and the effectiveness of our proposed algorithm.

Abstract: Finetuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU, and by slower matrix multiplications on the CPU. In this paper, we present an offloading framework, LSP-Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned sparse projectors. Our data-driven approach involves learning efficient sparse compressors that minimize communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 6.7 billion parameter model on an NVIDIA RTX 4090 GPU with 24GB memory. Compared to state-of-the-art offloading frameworks, our approach reduces end-to-end fine-tuning time by 33.1%-62.5% when converging to the same accuracy.

Abstract: Verifying safety of neural network control systems that use images as input is a difficult problem because, from a given system state, there is no known way to mathematically model what images are possible in the realworld. We build upon recent work that considers a surrogate verification approach, training a conditional generative adversarial network (cGAN) as an image generator in place of the real world. This setup enables set-based formal analysis of the closed-loop system, providing analysis beyond simulation and testing. While existing work is effective on small examples, excessive overapproximation both within a single control period (one-step error) and across multiple periods (multi-step error) limits its scalability. We propose approaches to overcome these errors. First, we address one-step error by composing the system's dynamics along with the cGAN and neural network controller, without losing the dependencies between input states and the control outputs as in the monotonic analysis of the system dynamics. Second, we reduce multi-step error by repeating the single-step composition, essentially unrolling multiple steps of the control loop into a large neural network. We then leverage existing network verification algorithms to compute accurate reachable sets for multiple steps, avoiding the accumulation of abstraction error at each step.We demonstrate the effectiveness of our approach in terms of both accuracy and scalability using two case studies. On the aircraft taxiing system, the converged reachable set is 175% larger using the prior baseline method compared with our proposed approach. On the emergency braking system, with 24x the number of image output variables from the cGAN, the baseline method fails to prove any states are safe, whereas our improvements enable set-based safety analysis.

Abstract: Pretraining molecular representations from large unlabeled data is essential for molecular property prediction due to the high cost of obtaining groundtruth labels. While there exist various 2D graph-based molecular pretraining approaches, these methods struggle to show statistically significant gains in predictive performance. Recent work have thus instead proposed 3D conformer-based pretraining under the task of denoising, leading to promising results. During downstream finetuning, however, models trained with 3D conformers require accurate atom-coordinates of previously unseen molecules, which are computationally expensive to acquire at scale. In this paper, we propose a simple solution of denoise-and-distill (D&D), a self-supervised molecular representation learning method that pretrains a 2D graph encoder by distilling representations from a 3D denoiser. With denoising followed by cross-modal knowledge distillation, our approach enjoys use of knowledge obtained from denoising as well as painless application to downstream tasks with no access to 3D conformers. Experiments on real-world molecular property prediction datasets show that the graph encoder trained via D&D can infer 3D information based on the 2D graph and shows superior performance and label-efficiency against previous methods.

Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam, Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam, Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam, Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam

Abstract: The rapid spread of information in the digital age highlights the critical need for effective factchecking tools, particularly for languages with limited resources, such as Vietnamese. In response to this challenge, we introduce ViFactCheck, the first publicly available benchmark dataset designed specifically for Vietnamese fact-checking across multiple online news domains. This dataset contains 7,232 human-annotated pairs of claim-evidence combinations sourced from reputable Vietnamese online news, covering 12 diverse topics. It has been subjected to a meticulous annotation process to ensure high quality and reliability, achieving a Fleiss Kappa inter-annotator agreement score of 0.83. Our evaluation leverages state-of-the-art pre-trained and large language models, employing fine-tuning and prompting techniques to assess performance. Notably, the Gemma model demonstrated superior effectiveness, with an impressive macro F1 score of 89.90%, thereby establishing a new standard for fact-checking benchmarks. This result highlights the robust capabilities of Gemma in accurately identifying and verifying facts in Vietnamese. To further promote advances in fact-checking technology and improve the reliability of digital media, we have made the ViFactCheck dataset, model checkpoints, fact-checking pipelines, and source code freely available on GitHub. This initiative aims to inspire further research and enhance the accuracy of information in low-resource languages.

School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing 100876, China Key Laboratory of Trustworthy Distributed Computing and Service, BUPT, Ministry of Education, Beijing 100876, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing 100876, China Key Laboratory of Trustworthy Distributed Computing and Service, BUPT, Ministry of Education, Beijing 100876, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing 100876, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing 100876, China Key Laboratory of Trustworthy Distributed Computing and Service, BUPT, Ministry of Education, Beijing 100876, China, State Key Laboratory of Media Convergence and Communication, CUC, Beijing 100024, China State Key Laboratory of Intelligent Game, Yangtze River Delta Research Institute of NPU, Taicang 215400, China, School of Economics and Management, BUPT, Beijing 100876, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing 100876, China

Abstract: Adding imperceptible watermarks to artwork images, such as paintings and photographs, can effectively safeguard the copyright of these images without compromising their usability. However, existing blind watermarking techniques encounter two major challenges in addressing this task: imperceptibility and robustness, particularly when subjected to various noise attacks. In this paper, we propose a blind watermarking method for artwork image copyright protection, IWRN, which can ensure both the Imperceptibility of the Watermark and Robustness against Noise attacks. For imperceptibility, we design a Learnable Wavelet Network (LWN) to adaptively embed the watermark into the highfrequency region where the watermark has better invisibility. For robustness, we establish a Deform-Attention based Invertible Neural Network (DA-INN) with a decoding optimization, which offers the advantage of computational reversion, and combines the deform-attention mechanism and decoding optimization to enhance the model's resistance against noises. Additionally, we design a Joint Contrast Learning (JCL) mechanism to improve imperceptibility and robustness simultaneously. Experiments show that our IWRN outperforms other state-of-the-art blind watermarking methods, achieves an average performance of 41.55 PSNR and 99.57% accuracy on the Coco2017, Wikiart, and Div2k datasets when facing 12 kinds of noise attacks.

Abstract: Datafree model stealing involves replicating the functionality of a target model into a substitute model without accessing the target model's structure, parameters, or training data. Instead, the adversary can only access the target model's predictions for generated samples. Once the substitute model closely approximates the behavior of the target model, attackers can exploit its white-box characteristics for subsequent malicious activities, such as adversarial attacks. Existing methods within cooperative game frameworks often produce samples with high confidence for the prediction of the substitute model, which makes it difficult for the substitute model to replicate the behavior of the target model. This paper presents a new data-free model stealing approach called Query Efficient Data Generation (QEDG). We introduce two distinct loss functions to ensure the generation of sufficient samples that closely and uniformly align with the target model's decision boundary across multiple classes. Building on the limitation of current methods, which typically yield only one piece of supervised information per query, we propose the query-free sample augmentation that enables the acquisition of additional supervised information without increasing the number of queries. Motivated by theoretical analysis, we adopt the consistency rate metric, which more accurately evaluates the similarity between the substitute and target models. We conducted extensive experiments to verify the effectiveness of our proposed method, which achieved better performance with fewer queries compared to the state-of-the-art methods on the real MLaaS scenario and five datasets.

Abstract: Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address these challenges, we introduce a novel informationtheoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several orthogonal individual image features. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL to existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.

Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Center for Project-Based Learning (PBL) D-ITET, ETH Zürich, Switzerland, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE

Abstract: Despite significant advances in deepfake detection, handling varying image quality, especially due to different compressions on online social networks (OSNs), remains challenging. Current methods succeed by leveraging correlations between paired images, whether raw or compressed. However, in openworld scenarios, paired data is scarce, with compressed images readily available but corresponding raw versions difficult to obtain. This imbalance, where unpaired data vastly outnumbers paired data, often leads to reduced detection performance, as existing methods struggle without corresponding raw images. To overcome this issue, we propose a novel approach named the open-world deepfake detection network (ODDN), which comprises two core modules: open-world data aggregation (ODA) and compression-discard gradient correction (CGC). ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses for paired and unpaired data, respectively. CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in OSN. This technique optimizes the training gradient to ensure the model remains insensitive to compression variations. Extensive experiments conducted on 17 popular deepfake datasets demonstrate the superiority of the ODDN over SOTA baselines.

UbiCIS, Faculty of Computing, Harbin Institute of Technology National Key Laboratory of Smart Farm Technologies and Systems, International Research Institute for Artificial Intelligence, Harbin Institute of Technology (Shenzhen), Department of Computer Science, University at Albany - SUNY, UbiCIS, Faculty of Computing, Harbin Institute of Technology National Key Laboratory of Smart Farm Technologies and Systems, UbiCIS, Faculty of Computing, Harbin Institute of Technology National Key Laboratory of Smart Farm Technologies and Systems International Research Institute for Artificial Intelligence, Harbin Institute of Technology (Shenzhen)

Abstract: Deep neural network (DNN) models are increasingly popular in edge video analytic applications. However, the computeintensive nature of DNN models pose challenges for energyefficient inference on resourceconstrained edge devices. Most existing solutions focus on optimizing DNN inference latency and accuracy, often overlooking energy efficiency. They also fail to account for the varying complexity of video frames, leading to sub-optimal performance in edge video analytics. In this paper, we propose an EnergyEfficient Early-Exit (E4) framework that enhances DNN inference efficiency for edge video analytics by integrating a novel early-exit mechanism with dynamic voltage and frequency scaling (DVFS) governors. It employs an attentionbased cascade module to analyze video frame diversity and automatically determine optimal DNN exit points. Additionally, E4 features a just-in-time (JIT) profiler that uses coordinate descent search to co-optimize CPU and GPU clock frequencies for each layer before the DNN exit points. Extensive evaluations demonstrate that E4 outperforms current state-of-the-art methods, achieving up to 2.8× speedup and 26% average energy saving while maintaining high accuracy.

School of Computer Science and Technology, Dalian University of Technology Faculty of Information Technology, University of Jyväskylä, School of Computer Science and Technology, Xi'an Jiaotong University State Key Lab of Brain-Machine Intelligence, Zhejiang University, School of Computer Science and Technology, Dalian University of Technology, Faculty of Information Technology, University of Jyväskylä, School of Control Science and Engineering, Dalian University of Technology, School of Biomedical Engineering, Dalian University of Technology

Abstract: In recent years, with the advancements in brain science, spiking neural networks (SNNs) have garnered significant attention. SNNs can generate spikes that mimic the function of neurons transmission in humans brain, thereby significantly reducing computational costs by the eventdriven nature during training. While deep SNNs have shown impressive performance on classification tasks, they still face challenges in more complex tasks such as object detection. In this paper, we propose SpikingYOLOX, extending the structure of the original YOLOX by introducing signed spiking neurons and fast Fourier convolution (FFC). The designed ternary signed spiking neurons could generate three kinds of spikes to obtain more robust features in the deep layer of the backbone. Meanwhile, we integrate FFC with SNN modules to enhance object detection performance, because its global receptive field is beneficial to the object detection task. Extensive experiments demonstrate that the proposed SpikingYOLOX achieves state-of-the-art performance among other SNN-based object detection methods.

Abstract: Existing Theory of Mind (ToM) benchmarks diverge from realworld scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.

Abstract: Most existing just noticeable difference (JND) methods primarily integrate specific masking effects in a single domain. However, these singledomain JND methods struggle with the structural discrepancies in multi-source content images, limiting their effectiveness in visual redundancy estimation. To address this issue, we propose a dual domain encoder that combines spatial and frequency features to comprehensively capture visual patterns. Our design includes spatial pattern balance and frequency detail correction modules to balance global and local patterns and correct low- and high-frequency distributions. Additionally, we develop a dual domain decoder to effectively extract multi-scale pattern redundancies and integrate them with detail redundancies in the frequency domain. Experiments demonstrate the effectiveness and robustness of our proposed method in handling structural discrepancies in multi-source content images.

Abstract: Large language models and other highly capable AI systems ease the burdens of deciding what to say or do, but this very ease can undermine the effectiveness of our actions in social contexts. We explain this apparent tension by introducing the integrative theoretical concept of "mental proof," which occurs when observable actions are used to certify unobservable mental facts. From hiring to dating, mental proofs enable people to credibly communicate values, intentions, states of knowledge, and other private features of their minds to one another in lowtrust environments where honesty cannot be easily enforced. Drawing on results from economics, theoretical biology, and computer science, we describe the core theoretical mechanisms that enable people to effect mental proofs. An analysis of these mechanisms clarifies when and how artificial intelligence can make low-trust cooperation harder despite making thinking easier.

Abstract: Visual emotion recognition (VER), which aims at understanding humans' emotional reactions toward different visual stimuli, has attracted increasing attention. Given the subjective and ambiguous characteristics of emotion, annotating a reliable largescale dataset is hard. For reducing reliance on data labeling, domain adaptation offers an alternative solution by adapting models trained on labeled source data to unlabeled target data. Conventional domain adaptation methods require access to source data. However, due to privacy concerns, source emotional data may be inaccessible. To address this issue, we propose an unexplored task: source-free domain adaptation (SFDA) for VER, which does not have access to source data during the adaptation process. To achieve this, we propose a novel framework termed Bridge then Begin Anew (BBA), which consists of two steps: domain-bridged model generation (DMG) and target-related model adaptation (TMA). First, the DMG bridges cross-domain gaps by generating an intermediate model, avoiding direct alignment between two VER datasets with significant differences. Then, the TMA begins training the target model anew to fit the target structure, avoiding the influence of source-specific knowledge. Extensive experiments are conducted on six SFDA settings for VER. The results demonstrate the effectiveness of BBA, which achieves remarkable performance gains compared with state-of-the-art SFDA methods and outperforms representative unsupervised domain adaptation approaches.

Abstract: The Iterative Closest Point (ICP) algorithm suffers from sensitivity to outliers and tendency to local optima in point cloud fine registration. In this paper, we introduce a global and robust ICP framework called GranularBall Iterative Closest Point with MultiKernel Correntropy (GRICP). This approach transforms the point cloud into a granular ball cloud and employs MultiKernel Correntropy (MKC) as the loss function, which is designed to smooth out the effects of noise points and provide global information for registration. Specifically, we propose a coarse-grained representation of the point cloud using the granular ball model, which adaptively captures the coarse-grained features of the data and converts the point cloud into a multi-granularity ball cloud. The normal points within each granular ball help mitigate the influence of noise points. To ensure that ICP finds the globally optimal transformation, MKC is introduced to measure the distribution of registration errors, thereby offering global insights for ICP to achieve the optimal solution. The transformations based on MKC and the granular ball cloud are then derived. Extensive experiments on both simulated and real-world datasets demonstrate that GRICP delivers superior registration performance, particularly in scenarios involving large rotation offsets, partial overlaps, and Gaussian noise.

Abstract: 3D scene understanding is an important task, and there has been a recent surge of research interest in aligning 3D representations of point clouds with text to empower embodied AI. However, due to the lack of comprehensive 3D benchmarks, the capabilities of 3D models in realworld scenes, particularly those that are challenging with subtly distinguished objects, remain insufficiently investigated. To facilitate a more thorough evaluation of 3D models' capabilities, we propose a scheme, ObjVariantEnsemble, to systematically introduce more scenes with specified object classes, colors, shapes, quantities, and spatial relationships to meet model evaluation needs. More importantly, we intentionally construct scenes with similar objects to a certain degree and design an LLM-VLM-cooperated annotator to capture key distinctions as annotations. The resultant benchmark can better challenge 3D models, reveal their shortcomings in understanding, and potentially aid in the further development of 3D models.

Abstract: Recent largescale pre-trained diffusion models have demonstrated a powerful generative ability to produce high-quality videos from detailed text descriptions. However, exerting control over the motion of objects in videos generated by any video diffusion model remains a challenging problem. In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable arbitrary single-object-trajectory control for the text-to-video diffusion model. To this end, an initial noise prior module is designed to provide a position-based prior to improve the stability of the appearance of the moving object and the accuracy of position. In addition, based on the attention map of the U-Net, spatial constraints are directly applied to the denoising process of diffusion models, which further ensures the positional consistency of moving objects during the inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process. Extensive experiments demonstrate our proposed method can control the motion trajectories of arbitrary objects while preserving the original ability to generate high-quality videos.

Deep Space Exploration Laboratory, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Deep Space Exploration Laboratory, University of Science and Technology of China, Deep Space Exploration Laboratory, University of Science and Technology of China, Deep Space Exploration Laboratory, University of Science and Technology of China, Deep Space Exploration Laboratory, University of Science and Technology of China, Deep Space Exploration Laboratory, University of Science and Technology of China, Deep Space Exploration Laboratory, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: Mitochondria segmentation from electron microscopy (EM) images plays a crucial role in biological and medical research. However, models trained on source domains often suffer from performance degradation when applied to target domains due to domain shift. Unsupervised domain adaptation (UDA) methods have been proposed to address this issue, but they often overlook the reliability of pseudolabels and the effectiveness of supervision signals. In this paper, we propose R4MITO, a novel UDA framework for robust mitochondria segmentation. First, we introduce Reliable Prototype Pseudo-labels to mitigate the inconsistency of class-level features between across domains by leveraging source prototypes to model target prototypes. Second, we devise Correlation-wise Consistency Regularization to exploit inter-pixel correlations, aligning agent-level correlations under various perturbations. Third, we propose Rank-aware Relationship Consistency Regularization to fully utilize the rich information encoded in inter-agent relationships by imposing rank-aware constraints on agent-ranking probability distributions. Extensive experiments on multiple EM datasets demonstrate the superiority of our R4MITO over existing state-of-the-art UDA methods for mitochondria segmentation.

Abstract: We present EvHDRNeRF to recover a High Dynamic Range (HDR) radiance field from event streams and a set of Low Dynamic Range (LDR) views with single exposures. Using the EvHDR-NeRF, we can generate both novel HDR views and novel LDR views under different exposures. The key to our method is to model the new relationship between events streams and LDR images, which considers both the Camera Response Function (CRF) and exposure time. Based on this relationship, we categorize events into inter-frame events and intra-exposure. The former is utilized for building HDR radiance field and the latter is used to deblur potentially blurred images. Compared to existing methods, this method can effectively reconstruct the HDR radiance field even when the input images are degraded. Experimental results demonstrate that our method achieves state-of-the-art HDR reconstruction, providing a more adaptable and accurate solution for complex imaging applications.

Abstract: Although pretrained large vision foundation models (VFM) yield superior results on various downstream tasks, full fine-tuning is often impractical due to its high computational cost and storage requirements. Recent advancements in parameter-efficient fine-tuning (PEFT) of VFM for image classification show significant promise. However, the application of PEFT techniques to dense prediction tasks remains largely unexplored. Our analysis of existing methods reveals that the underlying premise of utilizing low-rank parameter matrices, despite their efficacy in specific applications, may not be adequately suitable for dense prediction tasks. To this end, we propose a novel PEFT learning approach tailored for dense prediction tasks, namely VFM-Adapter. Specifically, the VFM-Adapter introduces a hybrid operation mapping technique that seamlessly integrates local information with global modeling to the adapter module. It capitalizes on the distinct inductive biases inherent in different operations. Additionally, we dynamically generate parameters for the VFM-Adapter, enabling flexibility of feature extraction given specific inputs. To validate the efficacy of VFM-Adapter, we conduct extensive experiments across object detection, semantic segmentation, and instance segmentation tasks. Results on multiple benchmarks consistently demonstrate the superiority of our method over previous approaches. Notably, with only three percent of the trainable parameters of the SAM-Base backbone, our approach achieves competitive or even superior performance compared to full fine-tuning. The code will be available.

Abstract: Unsupervised point cloud shape correspondence aims to establish pointwise correspondences between point clouds without annotated data. Ensuring efficiency and accuracy is crucial for practically implementing point cloud shape correspondence. Although the current methods have achieved desirable performance, the nature of encoding at dense points limits their application in actual scenarios. Moreover, independently computing per-point correspondences results in numerous multiple-to-one erroneous correspondences. To address these issues, we present an Adaptive siamese Masked autoencoder with Global Optimization (AMIGO), comprising a siamese masked autoencoder and a global optimization module. In the siamese masked autoencoder, we downsample the input point cloud and employ adaptive siamese mask operations to boost the coding capabilities of the encoder, thereby mitigating the information loss caused by downsampling. In the global optimization module, optimal transport is only utilized to generate pseudo-labels during the training phase, facilitating the efficient global planning of the correspondence results. Extensive experiments on four standard human and animal benchmarks demonstrate that AMIGO surpasses existing methods with remarkable margins, achieving new state-of-the-art results.

Abstract: Multimodal 3D object detection based on deep neural networks has indeed made significant progress. However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage. However, leveraging multistage cross-modal features is crucial for detecting objects of various scales. Therefore, these methods often struggle to integrate features across different scales and modalities effectively, thereby restricting the accuracy of detection. Additionally, the time-consuming Query-Key-Value-based (QKV-based) cross-attention operations often utilized in existing methods aid in reasoning the location and existence of objects by capturing non-local contexts. However, this approach tends to increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM is designed to reduce the inter-modal gap between features from images and point clouds by incorporating 3D coordinate information into 2D image features. Additionally, LFM captures cross-modal non-local contexts in the latent space without utilizing the QKV-based attention operations, thus mitigating computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that our SSLFusion outperforms state-of-the-art methods. Our approach obtains an absolute gain of 2.15% in 3D AP, compared with the state-of-art method GraphAlign on the moderate level of the KITTI test set.

College of Electronics and Information Engineering, Tongji University, School of Computer Science and Technology, Donghua University, College of Electronics and Information Engineering, Tongji University, College of Electronics and Information Engineering, Tongji University, School of Computer Science and Technology, Donghua University, College of Electronics and Information Engineering, Tongji University Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University

Abstract: Inferring the 3D structure of a scene from a single image is an illposed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. Additionally, we also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that still exists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on diverse public datasets.

Abstract: Existing works in singleimage human reconstruction suffer from weak generalizability due to insufficient training data or 3D inconsistencies for a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model to generate high-quality novel views from a single reference image. As its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To maintain consistency while generating denser views for improved 3D human reconstruction, we introduce hybrid multi-view attention to facilitate efficient and thorough information interchange across views. Besides, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped issues arising from inaccurate SMPL-X estimation, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-views. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.

Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, School of Software, Tsinghua University, Beijing, China

Abstract: Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multiview reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observed that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstruction surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.

School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University MOE KLINNS Lab, Xi’an Jiaotong University

Abstract: Charts are widely used for data visualization across various fields, including education, research, and business. Chart Question Answering (CQA) is an emerging task focused on the automatic interpretation and reasoning of data presented in charts. However, chart images are inherently difficult to interpret, and chartrelated questions often involve complex logical and numerical reasoning, which hinders the performance of existing models. This paper introduces VProChart, a novel framework designed to address these challenges in CQA by integrating a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach. VPAgent aligns and models chart elements based on principles of human visual perception, enhancing the understanding of chart context. The Programmatic Solution Reasoning approach leverages large language models (LLMs) to transform natural language reasoning questions into structured solution programs, facilitating precise numerical and logical reasoning. Extensive experiments on benchmark datasets such as ChartQA and PlotQA demonstrate that VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.

Abstract: Transformers have significantly advanced the field of 3D human pose estimation (HPE). However, existing transformerbased methods primarily use self-attention mechanisms for spatio-temporal modeling, leading to a quadratic complexity, unidirectional modeling of spatio-temporal relationships, and insufficient learning of spatial-temporal correlations. Recently, the Mamba architecture, utilizing the state space model (SSM), has exhibited superior long-range modeling capabilities in a variety of vision tasks with linear complexity. In this paper, we propose PoseMamba, a novel purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal SSM block that comprehensively models human joint relations within individual frames as well as temporal correlations across frames. Within this bidirectional global-local spatio-temporal SSM block, we introduce a reordering strategy to enhance the local modeling capability of the SSM. This strategy provides a more logical geometric scanning order and integrates it with the global SSM, resulting in a combined global-local spatial scan. We have quantitatively and qualitatively evaluated our approach using two benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments demonstrate that PoseMamba achieves state-of-the-art performance on both datasets while maintaining a smaller model size and reducing computational costs.

Abstract: Rendering dynamic scenes from monocular videos is a crucial yet challenging task. The recent deformable Gaussian Splatting has emerged as a robust solution to represent realworld dynamic scenes. However, it often leads to heavily redundant Gaussians, attempting to fit every training view at various time steps, leading to slower rendering speeds. Additionally, the attributes of Gaussians in static areas are time-invariant, making it unnecessary to model every Gaussian, which can cause jittering in static regions. In practice, the primary bottleneck in rendering speed for dynamic scenes is the number of Gaussians. In response, we introduce Efficient Dynamic Gaussian Splatting (EDGS), which represents dynamic scenes via sparse time-variant attribute modeling. Our approach formulates dynamic scenes using a sparse anchor-grid representation, with the motion flow of dense Gaussians calculated via a classical kernel representation. Furthermore, we propose an unsupervised strategy to efficiently filter out anchors corresponding to static areas. Only anchors associated with deformable objects are input into MLPs to query time-variant attributes. Experiments on two real-world datasets demonstrate that our EDGS significantly improves the rendering speed with superior rendering quality compared to previous state-of-the-art methods.

Abstract: Weaklysupervised temporal action localization (WTAL) aims to identify and localize action instances in untrimmed videos using only video-level labels. Existing methods typically rely on original features from frozen pre-trained encoders designed for trimmed action classification (TAC) tasks, which inevitably introduces task discrepancy. Additionally, these methods often overlook the importance of considering action consistency from multiple perspectives, specifically the consistency in action processes and action semantics, both of which are crucial for the model's understanding of actions. To address these issues, we propose a novel WTAL method based on similar modality enhancement and action consistency learning (SEAL). First, we construct global descriptors for each action category, and use the pseudo-labels generated based on these descriptors to guide the model in learning more consistent representations, thereby mitigating task discrepancy. Second, we design two types of losses to achieve action consistency learning: process consistency loss, which penalizes candidate proposals that deviate from the action center to ensure the completeness of the action process, and semantic consistency loss, which employs local descriptors to help proposals of the same action category (especially those with apparent semantic confusion) learn similar feature distributions. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods.

State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences

Abstract: In radarcamera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each interest region and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and random initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results.

Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Department of Computer Science and Engineering, ECUST, China, Department of Computer Science and Engineering, ECUST, China, Zhejiang University, Hangzhou, China, RIKEN AIP, The University of Tokyo, Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Zhejiang University, Hangzhou, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Gongli Hospital of Shanghai Pudong New Area, Gongli Hospital of Shanghai Pudong New Area

Abstract: Malignant brain tumors have become an aggressive and dangerous disease that leads to death worldwide. Multimodal MRI data is crucial for accurate brain tumor segmentation, but missing modalities common in clinical practice can severely degrade the segmentation performance. While incomplete multi-modal learning methods attempt to address this, learning robust and discriminative features from arbitrary missing modalities remains challenging. To address this challenge, we propose a novel Semantic-guided Masked Mutual Learning (SMML) approach to distill robust and discriminative knowledge across diverse missing modality scenarios. Specifically, we propose a novel dual-branch masked mutual learning scheme guided by Hierarchical Consistency Constraints (HCC) to ensure multi-level consistency, thereby enhancing mutual learning in incomplete multi-modal scenarios. The HCC framework comprises a pixel-level constraint that selects and exchanges reliable knowledge to guide the mutual learning process. Additionally, it includes a feature-level constraint that uncovers robust inter-sample and inter-class relational knowledge within the latent feature space. To further enhance multi-modal learning from missing modality data, we integrate a refinement network into each student branch. This network leverages semantic priors from the Segment Anything Model (SAM) to provide supplementary information, effectively complementing the masked mutual learning strategy in capturing auxiliary discriminative knowledge. Extensive experiments on three challenging brain tumor segmentation datasets demonstrate that our method significantly improves performance over state-of-the-art methods in diverse missing modality settings.

Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology, Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology, Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wangxuan Institute of Computer Technology, Peking University, Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology, Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology, Huazhong University of Science and Technology National Key Laboratory of Multispectral Information Intelligent Processing Technology

Abstract: Visioncentric autonomous driving systems require diverse data for robust training and evaluation, which can be augmented by manipulating object positions and appearances within existing scene captures. While recent advancements in diffusion models have shown promise in video editing, their application to object manipulation in driving scenarios remains challenging due to imprecise positional control and difficulties in preserving high-fidelity object appearances. To address these challenges in position and appearance control, we introduce DriveEditor, the first diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box while preserving depth information and hierarchically injects it into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module preserves consistent attributes with a single reference image by employing a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor's exceptional fidelity and controllability in generating diverse driving scene edits, as well as its remarkable ability to facilitate downstream tasks.

Abstract: The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zeroshot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via input perturbations, our method can reprogram a pre-trained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. First, learnable visual perturbations are used to refine feature extraction for deepfake detection. Then, we exploit information of face embedding to create sample-level adaptative text prompts, improving the performance. Extensive experiments on several popular benchmark datasets demonstrate that (1) the cross dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to Wild-Deepfake); (2) the superior performances are achieved with fewer trainable parameters, making it a promising approach for real-world applications.

Abstract: Recent multiframe lifting methods have dominated the 3D human pose estimation. However, previous methods ignore the intricate dependence within the 2D pose sequence and learn single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation therefore helping us learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invocates and integrates the pose proxy with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the above mapping between the pose sequence and pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms the previous state-of-the-art methods.

Abstract: FewShot Video Object Segmentation (FSVOS) aims to achieve accurate segmentation of video sequences supported by limited annotated images. In this work, we analyze the deficiencies inherent in the use of object prototypes and pixel features as references in previous methods. Then we shed light on that part features, with the ability to adapt to appearance variations and resist noise, are advantageous as representative reference features for aligning support images and query videos. Therefore, we propose a Part Agent Learning Network (PALN) to leverage part features from two aspects. First, we elaborately employ Optimal Transport algorithm with equal partition constraint to make part agents capable of dividing support objects into diverse parts in an adaptive manner. Second, we design a dedicated cache mechanism to learn temporal part agents as lightweight historic target representation to exploit temporal consistency. With the aid of these learned part agents, our PALN can effectively achieve support-query alignment and temporal alignment for accurate segmentation of query videos. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art FSVOS methods.

Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, School of Electrical and Electronic Engineering, Shanghai Insittute of Techonlogy, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University Engineering Research Center of AI & Robotics, Ministry of Education, Fudan University

Abstract: Dynamic Facial Expression Recognition (DFER) is crucial for affective computing but often overlooks the impact of scene context. We have identified a significant issue in current DFER tasks: human annotators typically integrate emotions from various angles, including environmental cues and body language, whereas existing DFER methods tend to consider the scene as noise that needs to be filtered out, focusing solely on facial information. We refer to this as the Rigid Cognitive Problem. The Rigid Cognitive Problem can lead to discrepancies between the cognition of annotators and models in some samples. To align more closely with the human cognitive paradigm of emotions, we propose an Overall Understanding of the Scene DFER method (OUS). OUS effectively integrates scene and facial features, combining scenespecific emotional knowledge for DFER. Extensive experiments on the two largest datasets in the DFER field, DFEW and FERV39k, demonstrate that OUS significantly outperforms existing methods. By analyzing the Rigid Cognitive Problem, OUS successfully understands the complex relationship between scene context and emotional expression, closely aligning with human emotional understanding in real-world scenarios.

Abstract: In this paper we study the task of a singleview image-guided point cloud completion. Existing methods have got promising results by fusing the information of image into point cloud explicitly or implicitly. However, given that the image has global shape information and the partial point cloud has rich local details, We believe that both modalities need to be given equal attention when performing modality fusion. To this end, we propose a novel dual-channel modality fusion network for image-guided point cloud completion(named DMF-Net), in a coarse-to-fine manner. In the first stage, DMF-Net takes a partial point cloud and corresponding image as input to recover a coarse point cloud. In the second stage, the coarse point cloud will be upsampled twice with shape-aware upsampling transformer to get the dense and complete point cloud. Extensive quantitative and qualitative experimental results show that DMF-Net outperforms the state-of-the-art unimodal and multimodal point cloud completion works on ShapeNet-ViPC dataset.

Abstract: Eventbased cameras feature high temporal resolution, wide dynamic range, and low power consumption, which are ideal for high-speed and low-light object detection. Spiking neural networks (SNNs) are promising for event-based object recognition and detection due to their spiking nature but lack efficient training methods, leading to gradient vanishing and high computational complexity, especially in deep SNNs. Additionally, existing SNN frameworks often fail to effectively handle multi-scale spatiotemporal features, leading to increased data redundancy and reduced accuracy. To address these issues, we propose CREST, a novel conjointly trained spike-driven framework to exploit spatiotemporal dynamics in event-based object detection. We introduce the conjoint learning rule to accelerate SNN learning and alleviate gradient vanishing. It also supports dual operation modes for efficient and flexible implementation on different hardware types. Additionally, CREST features a fully spike driven framework with a multi-scale spatiotemporal event integrator (MESTOR) and a spatiotemporal-IoU (ST-IoU) loss. Our approach achieves superior object recognition & detection performance and energy efficiency compared with state of-the-art SNN algorithms on three datasets, providing an efficient solution for event-based object detection algorithms suitable for SNN hardware implementation.

Abstract: The enhanced representational power and broad applicability of deep learning models have attracted significant interest from the research community in recent years. However, these models often struggle to perform effectively under domain shift conditions, where the training data (the source domain) is related to but exhibits different distributions from the testing data (the target domain). To address this challenge, previous studies have attempted to reduce the domain gap between source and target data by incorporating a few labeled target samples during training—a technique known as semisupervised domain adaptation (SSDA). While this strategy has demonstrated notable improvements in classification performance, the network architectures used in these approaches primarily focus on exploiting the features of individual images, leaving room for improvement in capturing rich representations. In this study, we introduce a Hierarchical Graph of Nodes designed to simultaneously present representations at both feature and category levels. At the feature level, we introduce a local graph to identify the most relevant patches within an image, facilitating adaptability to defined main object representations. At the category level, we employ a global graph to aggregate the features from samples within the same category, thereby enriching overall representations. Extensive experiments on widely used SSDA benchmark datasets, including Office-Home, DomainNet, and VisDA2017, demonstrate that both quantitative and qualitative results substantiate the effectiveness of HiGDA, establishing it as a new state-of-the-art method.

Abstract: The accurate segmentation of guidewires in interventional cardiac fluoroscopy videos is crucial for computeraided navigation tasks. Although deep learning methods have demonstrated high accuracy and robustness in wire segmentation, they require substantial annotated datasets for generalizability, underscoring the need for extensive labeled data to enhance model performance. To address this challenge, we propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos, augmenting the training data for wire segmentation networks. SF-VD leverages videos with limited annotations by independently modeling scene distribution and motion distribution. It first samples the scene distribution by generating 2D fluoroscopy images with wires positioned according to a specified input mask, and then samples the motion distribution by progressively generating subsequent frames, ensuring frame-to-frame coherence through a frame-consistency strategy. A segmentation-guided mechanism further refines the process by adjusting wire contrast, ensuring a diverse range of visibility in the synthesized image. Evaluation on a fluoroscopy dataset confirms the superior quality of the generated videos and shows significant improvements in guidewire segmentation.

Faculty of Electrical Engineering and Computer Science, Ningbo University, China, Faculty of Electrical Engineering and Computer Science, Ningbo University, China, Faculty of Electrical Engineering and Computer Science, Ningbo University, China Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University, China, Faculty of Electrical Engineering and Computer Science, Ningbo University, China, Faculty of Electrical Engineering and Computer Science, Ningbo University, China Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University, China, Department of Electrical and Electronic Engineering, The University of Hong Kong

Abstract: Video anomaly detection plays a significant role in intelligent surveillance systems. To enhance model's anomaly recognition ability, previous works have typically involved RGB, optical flow, and text features. Recently, dynamic vision sensors (DVS) have emerged as a promising technology, which capture visual information as discrete events with a very high dynamic range and temporal resolution. It reduces data redundancy and enhances the capture capacity of moving objects compared to conventional camera. To introduce this rich dynamic information into the surveillance field, we created the first DVS video anomaly detection benchmark, namely UCFCrime-DVS. To fully utilize this new data modality, a multi-scale spiking fusion network (MSF) is designed based on spiking neural networks (SNNs). This work explores the potential application of dynamic information from event data in video anomaly detection. Our experiments demonstrate the effectiveness of our framework on UCF-Crime-DVS and its superior performance compared to other models, establishing a new baseline for SNN-based weakly supervised video anomaly detection.

Abstract: Person reidentification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.

Abstract: Diffusion Models have emerged as powerful generative models for highquality image synthesis, with many subsequent image editing techniques based on them. However, the ease of text-based image editing introduces significant risks, such as malicious editing for scams or intellectual property infringement. Previous works have attempted to safeguard images from diffusion-based editing by adding imperceptible perturbations. These methods are costly and specifically target prevalent Latent Diffusion Models (LDMs), while Pixel-domain Diffusion Models (PDMs) remain largely unexplored and robust against such attacks. Our work addresses this gap by proposing a novel attack framework, AtkPDM. AtkPDM is mainly composed of a feature representation attacking loss that exploits vulnerabilities in denoising UNets and a latent optimization strategy to enhance the naturalness of adversarial images. Extensive experiments demonstrate the effectiveness of our approach in attacking dominant PDM-based editing methods (e.g., SDEdit) while maintaining reasonable fidelity and robustness against common defense methods. Additionally, our framework is extensible to LDMs, achieving comparable performance to existing approaches.

Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE, School of Information Science and Engineering, Yanshan University Hebei Key Laboratory of Information Transmission and Signal Processing, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE Pengcheng Laboratory, Shenzhen, China, Institute of Information Science, Beijing Jiaotong University Visual Intellgence +X International Cooperation Joint Laboratory of MOE Pengcheng Laboratory, Shenzhen, China

Abstract: This work focuses on AIGC detection to develop universal detectors capable of identifying various types of forgery images. Recent studies have found large pretrained models, such as CLIP, are effective for generalizable deepfake detection along with linear classifiers. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective on deepfake detection through a linear classifier; and 2) exploring the detection potential of CLIP. In this study, we delve into the underlying mechanisms of CLIP's detection capabilities by decoding its detection features into text and performing word frequency analysis. Our finding indicates that CLIP detects deepfakes by recognizing similar concepts. Building on this insight, we introduce Category Common Prompt CLIP, called C2P-CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder, thereby enhancing detection performance. Our method achieves a 12.4% improvement in detection accuracy compared to the original CLIP.

Abstract: Unsupervised visibleinfrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to the absence of annotations. Existing approaches aim to learn modality-invariant representations in an unsupervised setting. However, these methods often encounter label noise within and across modalities due to suboptimal clustering results and considerable modality discrepancies, which impedes effective training. To address these challenges, we propose a straightforward yet effective solution for USL-VI-ReID by mitigating universal label noise using neighbor information. Specifically, we introduce the Neighbor-guided Universal Label Calibration (N-ULC) module, which replaces explicit hard pseudo labels in both homogeneous and heterogeneous spaces with soft labels derived from neighboring samples to reduce label noise. Additionally, we present the Neighbor-guided Dynamic Weighting (N-DW) module to enhance training stability by minimizing the influence of unreliable samples. Extensive experiments on the RegDB and SYSU-MM01 datasets demonstrate that our method outperforms existing USL-VI-ReID approaches, despite its simplicity.

Abstract: We propose DrivingForward, a feedforward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of the vehicle further complicates the acquisition of camera extrinsics. To tackle these challenges and achieve real-time reconstruction, we jointly train a pose network, a depth network, and a Gaussian network to predict the Gaussian primitives that represent the driving scenes. The pose network and depth network determine the position of the Gaussian primitives in a self-supervised manner, without using depth ground truth and camera extrinsics during training. The Gaussian network independently predicts primitive parameters from each input image, including covariance, opacity, and spherical harmonics coefficients. At the inference stage, our model can achieve feed-forward reconstruction from flexible multi-frame surround-view input. Experiments on the nuScenes dataset show that our model outperforms existing state-of-the-art feed-forward and scene-optimized reconstruction methods in terms of reconstruction.

Abstract: The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Contextaware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches.

Abstract: The scarcity data of medical field brings the collaborative training in medical visionlanguage pre-training (VLP) cross different clients. Therefore, the collaborative training in medical VLP faces two challenges: First, the medical data requires privacy, thus can not directly shared across different clients. Second, medical data distribution across institutes is typically heterogeneous, hindering local model alignment and representation capabilities. To simultaneously overcome these two challenges, we propose the framework called personalized model selector with fused multimodal information (PMS-FM). The contribution of PMS-FM is two-fold: 1) PMS-FM uses embeddings to represent information in different formats, allowing for the fusion of multimodal data. 2) PMS-FM adapts to personalized data distributions by training multiple models. A model selector then identifies and selects the best-performing model for each individual client. Extensive experiments with multiple real-world medical datasets demonstrate the superb performance of PMS-FM over existing federated learning methods on different zero-shot classification tasks.

Abstract: Although diffusion models can generate highquality human images, their applications are limited by the instability in generating hands with correct structures. In this paper, we introduce RHanDS, a conditional diffusion-based framework designed to refine malformed hands by utilizing decoupled structure and style guidance. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand, while the malformed hand itself provides style guidance for preserving the style of the hand. To alleviate the mutual interference between style and structure guidance, we introduce a two-stage training strategy and build a series of multi-style hand datasets. In the first stage, we use paired hand images for training to ensure stylistic consistency in hand refining. In the second stage, various hand images generated based on human meshes are used for training, enabling the model to gain control over the hand structure. Experimental results demonstrate that RHanDS can effectively refine hand structure while preserving consistency in hand style.

Graduate School of Health Sciences, Hokkaido University, Sapporo, Japan, Institute of Integrated Research, Institute of Science Tokyo, Yokohama, Japan, Processor Research Team, RIKEN Center for Computational Science, Kobe, Japan, Research Center For Integrated Quantum Electronics, Hokkaido University, Sapporo, Japan, Institute of Integrated Research, Institute of Science Tokyo, Yokohama, Japan, Research Center For Integrated Quantum Electronics, Hokkaido University, Sapporo, Japan, Institute of Integrated Research, Institute of Science Tokyo, Yokohama, Japan, Faculty of Health Sciences, Hokkaido University, Sapporo, Japan

Abstract: Conventional radiography is the widely used imaging technology in diagnosing, monitoring, and prognosticating musculoskeletal (MSK) diseases because of its easy availability, versatility, and costeffectiveness. Bone overlaps are prevalent in conventional radiographs, and can impede the accurate assessment of bone characteristics by radiologists or algorithms, posing significant challenges to conventional clinical diagnosis and computer-aided diagnosis. This work initiated the study of a challenging scenario - bone layer separation in conventional radiographs, in which separate overlapped bone regions enable the independent assessment of the bone characteristics of each bone layer and lay the groundwork for MSK disease diagnosis and its automation. This work proposed a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality bone layer images with reasonable bone characteristics and texture. This framework introduced a reconstructor based on conventional radiography imaging principles, which achieved efficient reconstruction and mitigates the recurrent calculations and training instability issues caused by soft tissue in the overlapped regions. Additionally, pre-training with synthetic images was implemented to enhance the stability of both the training process and the results. The generated images passed the visual Turing test, and improved performance in downstream tasks. This work affirms the feasibility of extracting bone layer images from conventional radiographs, which holds promise for leveraging layer separation technology to facilitate more comprehensive analytical research in MSK diagnosis, monitoring, and prognosis.

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen International Research Institute for Artificial Intelligence, Harbin Institute of Technology, Shenzhen, International Research Institute for Artificial Intelligence, Harbin Institute of Technology, Shenzhen School of Computer Science and Technology, University of Chinese Academy of Sciences Chongqing Research Institute of HIT National Key Laboratory of Smart Farm Technologies and Systems, School of Computer Science and Technology, University of the Chinese Academy of Sciences, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen International Research Institute for Artificial Intelligence, Harbin Institute of Technology, Shenzhen, Chongqing Research Institute of HIT, School of Computer Science and Technology, University of the Chinese Academy of Sciences, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

Abstract: Openvocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories, respectively.

Abstract: Crossview geo-localization aims at determining the geographic location of a query image by matching the reference images. The matching pairs can be captured from diverse perspectives, such as those from satellites and drones. Most existing methods are supervised that require input of location-labeled images or matched and unmatched image pairs for training, resulting in high labor costs. Moreover, current unsupervised methods perform instances matching directly between different perspectives with dramatic discrepancies, resulting in poor performance. To address these issues, this paper proposes a novel matching and alignment framework from coarse instance-cluster level to fine intermediate instance level for unsupervised cross-view geo-localization. We first introduces cluster-based contrastive learning, assigning pseudo-labels to the instances and generate clusters within each view. Then we design a cross-view location alignment module that fully exploits the feature relationships between instances and clusters for intra- and inter-views. Finally, we design an intermediate state transition module that facilitates further alignment between views by constructing intermediate states and bringing both views closer to the intermediate domain simultaneously. Extensive experiments demonstrate that our method surpasses state-of-the-art unsupervised cross-view geo-localization methods and even achieves comparable performance to state-of-the-art supervised methods.

National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security Hubei Key Laboratory of Distributed System Security School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology, Huazhong University of Science and Technology, School of Cyber Science and Engineering, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security Hubei Key Laboratory of Distributed System Security School of Cyber Science and Engineering, Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Hubei Engineering Research Center on Big Data Security Hubei Key Laboratory of Distributed System Security School of Cyber Science and Engineering, Huazhong University of Science and Technology, School of Software Engineering, Huazhong University of Science and Technology

Abstract: As deep neural networks (DNNs) are widely applied in the physical world, many researches are focusing on physicalworld adversarial examples (PAEs), which introduce perturbations to inputs and cause the model's incorrect outputs. However, existing PAEs face two challenges: unsatisfactory attack performance (i.e., poor transferability and insufficient robustness to environment conditions), and difficulty in balancing attack effectiveness with stealthiness, where better attack effectiveness often makes PAEs more perceptible. In this paper, we explore a novel perturbation-based method to overcome the challenges. For the first challenge, we introduce a strategy Deceptive RF injection based on robust features (RFs) that are predictive, robust to perturbations, and consistent across different models. Specifically, it improves the transferability and robustness of PAEs by covering RFs of other classes onto the predictive features in clean images. For the second challenge, we introduce another strategy Adversarial Semantic Pattern Minimization, which removes most perturbations and retains only essential adversarial patterns in AEs. Based on the two strategies, we design our method Robust Feature Coverage Attack (RFCoA), comprising Robust Feature Disentanglement and Adversarial Feature Fusion. In the first stage, we extract target class RFs in feature space. In the second stage, we use attention-based feature fusion to overlay these RFs onto predictive features of clean images and remove unnecessary perturbations. Experiments show our method's superior transferability, robustness, and stealthiness compared to existing state-of-the-art methods. Additionally, our method's effectiveness can extend to Large Vision-Language Models (LVLMs), indicating its potential applicability to more complex tasks.

Abstract: Zeroshot Referring Image Segmentation (RIS) identifies the instance mask that best aligns with a specified referring expression without training and fine-tuning, significantly reducing the labor-intensive annotation process. Despite achieving commendable results, previous CLIP-based models have a critical drawback: the models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. This is because they generate all possible masks on an image and evaluate each masked region for similarity to the given expression, often resulting in decreased sensitivity to direct positional clues in text inputs. Moreover, most methods have weak abilities to manage relationships between primary words and their contexts, causing confusion and reduced accuracy in identifying the correct target region. To address these challenges, we propose IteRPrimE (Iterative Grad-CAM Refinement and Primary word Emphasis), which leverages a saliency heatmap through Grad-CAM from a Vision-Language Pre-trained (VLP) model for image-text matching. An iterative Grad-CAM refinement strategy is introduced to progressively enhance the model's focus on the target region and overcome positional insensitivity, creating a self-correcting effect. Additionally, we design the Primary Word Emphasis module to help the model handle complex semantic relations, enhancing its ability to attend to the intended object. Extensive experiments conducted on the RefCOCO/+/g, and PhraseCut benchmarks demonstrate that IteRPrimE outperforms previous SOTA zero-shot methods, particularly excelling in out-of-domain scenarios.

School of Software Engineering, Xi’an Jiaotong University National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

Abstract: Recent advances in textto-image diffusion models have shown an outstanding ability in zero-shot style transfer. However, existing methods often struggle to balance preserving the semantic content of the input image and faithfully transferring the target style in line with the edit prompt. Especially when applied to complex traffic scenes with diverse objects, layouts, and stylistic variations, current diffusion models tend to exhibit Style Neglection, i.e., failing to generate the required style in the prompt. To address this issue, we propose Style Nursing, which directs the model to focus on style subject tokens in the text prompt and excites their corresponding visual activations. Moreover, we introduce Spatial and Semantic Guidance to guide the preservation of content after editing, which utilizes spatial features from the DDIM sampling process together with attention maps from the semantic reconstruction. To evaluate the performance of zero-shot style transfer methods in traffic scenes, we present STREET-6K, a new benchmark dataset comprising 6000 images showcasing diverse traffic scenes and style transfer variations, accompanied by comprehensive annotations and evaluation metrics. Our approach beats state-of-the-art image translation methods in comprehensive quantitative metrics and human evaluations on traffic scene image synthesis while seamlessly generalizing to various other types of images without training or fine-tuning. Further experiments on detection and segmentation tasks show that fine-tuning perception models on our synthesized images improves Recall and mean Intersection over Union (mIoU) by over 10% and 3% respectively in rarely-seen traffic scenes.

Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Beijing University of Posts and Telecommunications, Beijing, China, School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, China, School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, China, Artificial Intelligence Institute of China Electronics Technology Group Corporation, Beijing, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China

Abstract: Visual object tracking is essentially crucial for unmanned aerial vehicles (UAVs). Despite the substantial progress, most of the existing UAV trackers are designed for wellconditioned daytime data, while for the scenarios in challenging weather condition, e.g. foggy or nighttime environment, the tremendous domain gap leads to significant performance degradation. To address this issue, in this paper, we propose a novel robust UAV tracker termed LVPTrack, which conducts high quality label-aligned visual prompt tuning to adapt to various challenging weather conditions. Specifically, we first synthesize the sequential foggy and nighttime video frames to assist the model training. A domain adaptive teacher-student network is utilized to distill the hierarchical visual semantic of the target objects in cross-domain scenarios. Then we propose a target-aware pseudo-label voting (PLV) strategy to alleviate the target-level misalignment in the dual domains. Furthermore, we propose a dynamic aggregated prompt (DAP) module to facilitate the appearance variation adaptation of the target object in challenging scenarios. Extensive experiments demonstrate that our tracker achieves superior performance over existing state-of-the-art UAV trackers.

Abstract: Textto-image diffusion models have achieved remarkable success in generating photorealistic images. However, the inclusion of sensitive information during pre-training poses significant risks. Machine Unlearning (MU) offers a promising solution to eliminate sensitive concepts from these models. Despite its potential, existing MU methods face two main challenges: 1) limited generalization, where concept erasure is effective only within the unlearned set, failing to prevent sensitive concept generation from out-of-set prompts; and 2) utility degradation, where removing target concepts significantly impacts the model's overall performance. To address these issues, we propose a novel concept domain correction framework named \textbf{DoCo} (\textbf{Do}main \textbf{Co}rrection). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. Additionally, we introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts. Extensive experiments across various instances, styles, and offensive concepts demonstrate the effectiveness of our method in unlearning targeted concepts with minimal impact on related concepts, outperforming previous approaches even for out-of-distribution prompts.

Abstract: Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces Parallel Optimal Position Search (POPoS), a highprecision encoding-decoding framework designed to address the limitations of traditional FLD methods. POPoS employs three key contributions: (1) Pseudo-range multilateration is utilized to correct heatmap errors, improving landmark localization accuracy. By integrating multiple anchor points, it reduces the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To enhance the pseudo-range accuracy of selected anchor points, a new loss function, named multilateration anchor loss, is proposed. This loss function enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, boosting computational efficiency and reducing processing time. Extensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution heatmaps scenarios with minimal computational overhead. These advantages make POPoS as a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios.

Abstract: This paper introduces a lightweight Semanticguided Mutually Reinforcing network (SMR-Net) for the tasks of cross-modal image fusion and salient object detection (SOD). The core concept of SMR-Net is to leverage semantics for directing the mutual reinforcing between image fusion and SOD. Specifically, a Progressive Cross-modal Interaction (PCI) image fusion subnetwork is designed to exploit local interactions via convolution operations and extend to global interactions utilizing spatial and channel attention mechanisms. Subsequently, a cross-modal Bit-Plane Slicing-based SOD subnetwork (BPS) is developed by incorporating the fused image as a third modality. This component employs bit-plane slicing and the deformable convolution technique to effectively extract irregular semantic information embedded in fusion features. The refined semantic information then guides the feature extraction process of the source modalities in a reweighted fashion. By cascading these two subnetworks, BPS leverages final semantic results to direct PCI towards focusing more on semantic information. Ultimately, through this semantic-guided mutual enhancement process, SMR-Net excels in both producing high-quality fused images and achieving effective salient object detection. Our extensive experiments on image fusion and SOD tasks convincingly demonstrate the superiority of our network over existing state-of-the-art alternatives without introducing noticeable computational costs. Compared to nearest competitors, our method demonstrates a stronger generalization ability with 26% fewer parameters.

Abstract: AudioVisual Semantic Segmentation (AVSS) has gained significant attention in the multi-modal domain, aiming to segment video objects that produce specific sounds in the corresponding audio. Despite notable progress, existing methods still struggle to handle new classes not included in the original training set. To this end, we introduce Few-Shot Incremental Learning (FSIL) to the AVSS task, which seeks to seamlessly integrate new classes with limited incremental samples while preserving the knowledge of old classes. Two challenges arise in this new setting: (1) To reduce labeling costs, old classes within the incremental samples are treated as background, similar to silent objects. Training the model directly with background annotations may worsen the loss of distinctive knowledge about old classes, such as their outlines and sounds. (2) Most existing models adopt early cross-modal fusion with a single-tower design, incorporating more characteristics into class representations, which impedes knowledge transfer between classes based on similarity. To address these issues, we propose a Few-shot Incremental learning framework via class-centric foregrouNd aggreGation and dual-tower knowlEdge tRansfer (FINGER) for the AVSS task, which comprises two targeted modules: (1) The class-centric foreground aggregation gathers class-specific features for each foreground class while disregarding background features. The background class is excluded during training and inferred from the foreground predictions. (2) The dual-tower knowledge transfer postpones cross-modal fusion to separately conduct knowledge transfer for each modality. Extensive experiments validate the effectiveness of the FINGER model, significantly surpassing state-of-the-art methods.

Abstract: In utilizing deep learning techniques for medical image segmentation, two types of imbalance issues are observed: interclass imbalance between majority and minority classes and intra-class imbalance between easy and hard samples. However, existing loss functions typically confuse these issues, leading to enhancements that cater to only one aspect. Moreover, loss functions optimized for specific tasks often exhibit limited generalizability. To address these issues, we propose Inter-class and Intra-class Balance loss, as well as a unified loss termed Balance loss. The Inter-class Balance loss controls the extent of hard sample mining for majority class samples by considering the frequency of minority classes present in each input image. This approach requires no manual adjustment weights and adapts automatically to different datasets. The Intra-class Balance loss enhances the network's ability to learn from hard samples by performing mining on hard samples within each class. We evaluate our loss functions on five segmentation tasks with varying degrees of class imbalance. The experimental results show that our proposed Balance loss enhances segmentation performance compared with the current loss functions and exhibits superior robustness.

Abstract: Lifelong person reidentification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during the new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning mechanism that learns to transform arbitrary distribution data into the current data style at each learning step. To enhance the style transfer capacity, an Adaptive Kernel Prediction network is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses old distribution based on the new data via the old AKPNet model, achieving effective knowledge accumulation. Experimental results show our DASK outperforms the existing methods by 3.6%-6.8% and 4.5%-6.5% on seen and unseen domains, respectively.

Abstract: LiDARbased semantic scene understanding is an important module in the modern autonomous driving perception stack. However, identifying outlier points in a LiDAR point cloud is challenging as LiDAR point clouds lack semantically-rich information. While former SOTA methods adopt heuristic architectures, we revisit this problem from the perspective of Selective Classification, which introduces a selective function into the standard closed-set classification setup. Our solution is built upon the basic idea of abstaining from choosing any inlier categories but learns a point-wise abstaining penalty with a margin-based loss. Apart from learning paradigms, synthesizing outliers to approximate unlimited real outliers is also critical, so we propose a strong synthesis pipeline that generates outliers originated from various factors: object categories, sampling patterns and sizes. We demonstrate that learning different abstaining penalties, apart from point-wise penalty, for different types of (synthesized) outliers can further improve the performance. We benchmark our method on SemanticKITTI and nuScenes and achieve SOTA results.

Abstract: Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting language information, such as vocabulary choice, of the target speaker has not been explored in previous works. Additionally, existing datasets for speaker adaptation have limited vocabulary sizes and pose variations, which restrict the validation of previous speakeradaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. Furthermore, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in the wild, sentence-level lip reading for the first time in English. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, we show that the proposed method achieves larger improvements compared to the previous works.

Abstract: The image compression model has long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper innovatively introduces supervision obtained from multimodal pretraining models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to get rid of the reliance between compression models with downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for the tasks relying on the information of different granularity. Furthermore, for supporting both human and machine visions with only a unifying bitstream, we incorporate a conditional decoding strategy that takes as conditions human or machine preferences, enabling the bitstream to be decoded into different versions for corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models and tasks. The extensive experiments have shown that the proposed UG-ICM is capable of achieving remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.

Abstract: Representing and synthesizing novel views in realworld dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach backdrop-driven neural radiance fields offers high-quality view synthesis and a 3D solution to detach the background from the entire dynamic scene, which is called DetRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate DetRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Code will be available soon.

Abstract: Textured meshes significantly enhance the realism and detail of objects by mapping intricate texture details onto the geometric structure of 3D models. This advancement is valuable across various applications, including entertainment, education, and industry. While traditional mesh saliency studies focus on nontextured meshes, our work explores the complexities introduced by detailed texture patterns. We present a new dataset for textured mesh saliency, created through an innovative eye-tracking experiment in a six degrees of freedom (6-DOF) VR environment. This dataset addresses the limitations of previous studies by providing comprehensive eye-tracking data from multiple viewpoints, thereby advancing our understanding of human visual behavior and supporting more accurate and effective 3D content creation. Our proposed model predicts saliency maps for textured mesh surfaces by treating each triangular face as an individual unit and assigning a saliency density value to reflect the importance of each local surface region. The model incorporates a texture alignment module and a geometric extraction module, combined with an aggregation module to integrate texture and geometry for precise saliency prediction. We believe this approach will enhance the visual fidelity of geometric processing while ensuring computational efficiency, essential for real-time rendering and high-detail applications such as VR and gaming.

Hefei Institute of Physical Science, Chinese Academy of Sciences, China University of Science and Technology of China, Hefei, China Astribot, Shenzhen, China, Zhejiang University of Technology, Zhejiang, China, China Agricultural University. Beijing, China, School of Mathematical Sciences, Peking University. Beijing, China, Astribot, Shenzhen, China, Hefei Institute of Physical Science, Chinese Academy of Sciences, China, Hefei Institute of Physical Science, Chinese Academy of Sciences, China, Hefei University of Technology, Hefei, China

Abstract: Human life is filled with articulated objects. Previous works for estimating the pose of categorylevel articulated objects rely on costly 3D point clouds or RGB-D images. In this paper, our goal is to estimate category-level articulation poses from a single RGB image, where we propose R2-Art, a novel category-level Articulation pose estimation framework from a single RGB image and a cascade Render strategy. Given an RGB image as input, R2-Art estimates per-part 6D pose for the articulation. Specifically, we design parallel regression branches tailored to generate camera-to-root translation and rotation. Using the predicted joint states, we perform PC prior transformation and deformation with a joint-centric modeling approach. For further refinement, a cascade render strategy is proposed for projecting the 3D deformed prior onto the 2D mask. Extensive experiments are provided to validate our R2-Art on various datasets ranging from synthetic datasets to real-world scenarios, demonstrating the superior performance and robustness of the R2-Art. We believe that this work has the potential to be applied in many fields including robotics, embodied intelligence, and augmented reality.

Abstract: In the field of Moving Infrared Small Target Detection (MIRSTD), current methods typically use sequential modeling with two individual modules for spatial and temporal processing. However, such a modeling strategy lacks clear guidance on the motion and displacement difference between moving targets and background noise, thereby limiting the feature discriminability and resulting in errorprone target localization. This paper addresses this issue from clip and frame levels and proposes a novel architecture MOCID for MIRSTD. For clip-level feature fusion, we design a spatio-temporal backbone consisting of several proposed Fourier-inspired Spatio-temporal Attention (FISTA) layers. Each FISTA layer sequentially processes the features from spatial and temporal views to capture clip-level temporal motion context, where Fourier Transformation and Inverse Fourier Transformation are employed for each view. This context is then embedded into dynamic convolutional kernels for subsequent spatial feature extraction, thereby enabling clear motion difference guidance and generating comprehensive features. For frame-level feature fusion, we design a Displacement-aware Mamba Module (DAM) to capture detailed frame-to-frame displacement information. DAM utilizes an innovative Temporal Interpolation and Displacement-aware Scan technique to perform spatio-temporal difference-aware displacement modeling, introducing elaborate temporal indicators into feature extraction. Combining the above improvements, our model captures comprehensive motion and displacement contexts, significantly improving the detection of the small target. Extensive experiments demonstrate that MOCID achieves state-of-the-art detection accuracy on popular IRDST and DAUB datasets. Furthermore, MOCID offers a superior balance between throughput and performance compared to other methods. The code for this work will be made publicly available.

Abstract: Ground penetrating radar (GPR) based localization has gained significant recognition in robotics due to its ability to detect stable subsurface features, offering advantages in environments where traditional sensors like cameras and LiDAR may struggle. However, existing methods are primarily focused on smallscale place recognition (PR), leaving the challenges of PR in large-scale maps unaddressed. These challenges include the inherent sparsity of underground features and the variability in underground dielectric constants, which complicate robust localization. In this work, we investigate the geometric relationship between GPR echo sequences and underground scenes, leveraging the robustness of directional features to inform our network design. We introduce learnable Gabor filters for the precise extraction of directional responses, coupled with a direction-aware attention mechanism for effective geometric encoding. To further enhance performance, we incorporate a shift-invariant unit and a multi-scale aggregation strategy to better accommodate variations in dielectric constants. Experiments conducted on public datasets demonstrate that our proposed EDENet not only surpasses existing solutions in terms of PR performance but also offers advantages in model size and computational efficiency.

Abstract: View transformation robustness (VTR) is critical for deeplearning-based multi-view 3D object reconstruction models, which indicates the methods' stability under inputs with various view transformations. However, existing research seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models' VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the reconstruction errors' spatial distribution of the 3D predictions and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models' robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods.

Abstract: Inferring 3D structures from sparse, unposed observations is challenging due to its unconstrained nature. Recent methods propose to predict implicit representations directly from unposed inputs in a datadriven manner, achieving promising results. However, these methods do not utilize geometric priors and cannot hallucinate the appearance of unseen regions, thus making it challenging to reconstruct fine geometric and textural details. To tackle this challenge, our key idea is to reformulate this ill-posed problem as conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction. With complete observations, the poses of the input views can be easily recovered and further used to optimize the reconstructed object. To this end, we propose a novel pipeline, Pragmatist. First, we generate a complete observation of the object via a multiview conditional diffusion model. Then, we use a feed-forward large reconstruction model to obtain the reconstructed mesh. To further improve the reconstruction quality, we recover the poses of input views by inverting the obtained 3D representations and further optimize the texture using detailed input views. Unlike previous approaches, our pipeline improves reconstruction by efficiently leveraging unposed inputs and generative priors, circumventing the direct resolution of highly ill-posed problems. Extensive experiments show that our approach achieves promising performance in several benchmarks.

Abstract: Deep neural networks (DNNs) have achieved remarkable success in widespread applications. Meanwhile, its vulnerability towards carefully crafted adversarial attacks captures special attention. Not only adversarial perturbations in digital space will fool the target DNNsbased detectors making a wrong decision, but also actually printed patches can be camouflaged to defeat detectors in physical space. In particular, multi-view physical adversarial attacks pose a more serious threat to practical scenarios. The existing attacks are still challenged in three aspects, i.e., high-cost data augmentation, attack performance gap between digital and physical space, and low attack transferability across DNNs. To overcome the challenges, we introduce PhyCamo, a robust physical camouflage framework based on contrastive learning that distinguishes itself from prior research in various critical ways: (1) data augmentation - it utilizes the diffusion model for data augmentation to efficiently simulate sophisticated physical dynamics in real-world; (2) robustness - it leverages contrastive learning to optimize physical camouflage against encoders with the state-of-the-art (SOTA) attack performance; (3) transferability - it mitigates the model-specific noise in the optimization by adopting diverse input methods, thereby amplifying the transferability between models. Extensive experiments are carried out on a car dataset, a tank dataset, and a pedestrian dataset, comparing with 6 classic multi-view physical adversarial attacks in both digital and physical spaces. The results demonstrate PhyCamo’s superior performance. For instance, it generates more effective physical camouflage (with higher attack success rate~×1.26 and reduce the model's average precision by 55%). PhyCamo can also help to improve the robustness of detectors through adversarial training, which contributes to the application of deep neural networks in the field of security sensitivity.

Department of Intelligent Data Science, College of Computer Science，National University of Defense Technology, China, University of Science and Technology of China, University of Sydney, Department of Intelligent Data Science, College of Computer Science，National University of Defense Technology, China, Department of Intelligent Data Science, College of Computer Science，National University of Defense Technology, China, HPCL, College of Computer Science，National University of Defense Technology, China

Abstract: Largescale text-to-image diffusion models, (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as simple as the famous ones, e.g., just use a name? In this paper, we explore the existence of a ``Name Space'', where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by text embedding of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in promising the generated image with good identity consistency. Note that like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, making the original generation capability of text-to-image models well-preserved. Moreover, by simply plugging such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models.

Abstract: Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic regionlevel urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in the urban environments in several aspects. Even the best performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations.

Abstract: Noise is an inevitable aspect of point cloud acquisition, necessitating filtering as a fundamental task within the realm of 3D vision. Existing learningbased filtering methods have shown promising capabilities on commonly used datasets. Nonetheless, the effectiveness of these methods is constrained when dealing with a substantial quantity of point clouds. This limitation primarily stems from their limited denoising capabilities for dense and large-scale point clouds and their inclination to generate noisy outliers after denoising. To deal with this challenge, we introduce 3DMambaIPF, for the first time, exploiting Selective State Space Models (SSMs) architecture to handle highly-dense and large-scale point clouds, capitalizing on its strengths in selective input processing and large context modeling capabilities. Additionally, we present a robust and fast differentiable rendering loss to constrain the noisy points around the surface. In contrast to previous methodologies, this differentiable rendering loss enhances the visual realism of denoised geometric structures and aligns point cloud boundaries more closely with those observed in real-world objects. Extensive evaluations on commonly used datasets (typically with up to 50K points) demonstrate that 3DMambaIPF achieves state-of-the-art results. Moreover, we showcase the superior scalability and efficiency of 3DMambaIPF on highly dense and large-scale point clouds with up to 500K points compared to off-the-shelf methods.

Abstract: Imagetext matching is a crucial task that bridges visual and linguistic modalities. Recent research typically formulates it into the problem of maximizing the margin with the truly hardest negatives to enhance the learning efficiency and avoid the poor local optima. We argue that such formulation can lead to a serious limitation, i.e., under this formulation, conventional trainers would confine their horizon within the hardest negative examples, while other negative examples offer a range of semantic differences not present in the hardest negatives. In this paper, we propose an efficient negative distribution guided training framework for image-text matching to unlock the substantial promotion space left by the above limitation. Rather than simply incorporating additional negative examples into the training objective, which could diminish both the leading role of the hardest negatives in training and the effect of a large margin learning in producing a robust matching model, our central idea is to supply the objective with distributional information on the entire set of negative examples. To be precise, we first construct the sample similarity matrix based on several pretrained models to extract the distributional information of the entire negative sample dataset. Then we encode it into a margin regularization module to smooth the similarities differences of all negatives. This enhancement facilitates the capture of fine-grained semantic differences and guides the main learning process by maximizing the margin with hard negative examples. Furthermore, we propose a hardest negative rectification module to address the instability in hardest negative selection based on predicted similarity and to correct erroneous hardest negatives. We evaluate our method in combination with several state-of-the-art image-text matching methods, and our quantitative and qualitative experiments demonstrate its significant generalizability and effectiveness.

Abstract: Audiodriven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

Abstract: Lagrangian decomposition (LD) is a relaxation method that provides a dual bound for constrained optimization problems by decomposing them into more manageable subproblems. This bound can be used in branch-and-bound algorithms to prune the search space effectively.In brief, a vector of Lagrangian multipliers is associated with each sub-problem, and an iterative procedure (e.g., a sub-gradient optimization) adjusts these multipliers to find the tightest bound. Initially applied to integer programming, Lagrangian decomposition also had success in constraint programming due to its versatility and the fact that global constraints provide natural sub-problems. However, the non-linear and combinatorial nature of sub-problems in constraint programming makes it computationally intensive to optimize the Lagrangian multipliers with sub-gradient methods at each node of the tree search. This currently limits the practicality of LD as a general bounding mechanism for constraint programming. To address this challenge, we propose a self-supervised learning approach that leverages neural networks to generate multipliers directly, yielding tight bounds. This approach significantly reduces the number of sub-gradient optimization steps required, enhancing the pruning efficiency and reducing the execution time of constraint programming solvers. This contribution is one of the few that leverage learning to enhance bounding mechanisms on the dual side, a critical element in the design of combinatorial solvers. This work presents a generic method for learning valid dual bounds in constraint programming. We validate our approach on two challenging combinatorial problems: The multi-dimensional knapsack problem and the shift scheduling problem. The results show that our approach can solve more instances than the standard application of LD to constraint programming, reduce execution time by more than half, and has promising generalization ability through fine-tuning.

Abstract: Among various temporal knowledge graph (TKG) extrapolation methods, rulebased approaches stand out for their explicit rules and transparent reasoning paths. However, the vast search space for rule extraction poses a challenge in identifying high-quality logic rules. To navigate this challenge, we explore the use of generation models to generate new rules, thereby enriching our rule base and enhancing our reasoning capabilities. In this paper, we introduce LLM-DR, an innovative rule-based method for TKG extrapolation, which harnesses diffusion models to generate rules that are consistent with the distribution of the source data, while also amalgamating the rich semantic insights of Large Language Models (LLMs). Specifically, our LLM-DR generates semantically relevant and high-quality rules, employing conditional diffusion models in a classifier-free guidance fashion and refining them with LLM-based constraints. To assess rule efficacy, we meticulously design a coarse-to-fine evaluation strategy that initiates with coarse-grained filtering to eliminate less plausible rules and proceeds with fine-grained scoring to quantify the reliability of the retained. Extensive experiments demonstrate the promising capacity of our LLM-DR.

Abstract: Concept Drift has been extensively studied within the context of Stream Learning. However, it is often assumed that the deployed model's predictions play no role in the concept drift the system experiences. Closer inspection reveals that this is not always the case. Automated trading might be prone to selffulfilling feedback loops. Likewise, malicious entities might adapt to evade detectors in the adversarial setting resulting in a self-negating feedback loop that requires the deployed models to constantly retrain. Such settings where a model may induce concept drift are called performative. In this work, we investigate this phenomenon. Our contributions are as follows: First, we define performative drift within a stream learning setting and distinguish it from other causes of drift. We introduce a novel type of drift detection task, aimed at identifying potential performative concept drift in data streams. We propose a first such performative drift detection approach, called CheckerBoard Performative Drift Detection (CB-PDD). We apply CB-PDD to both synthetic and semi-synthetic datasets that exhibit varying degrees of self-fulfilling feedback loops. Results are positive with CB-PDD showing high efficacy, low false detection rates, resilience to intrinsic drift, comparability to other drift detection techniques, and an ability to effectively detect performative drift in semi-synthetic datasets. Secondly, we highlight the role intrinsic (traditional) drift plays in obfuscating performative drift and discuss the implications of these findings as well as the limitations of CB-PDD.

School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China MIIT Key Laboratory of Data Intelligence and Management, Beihang University, Beijing, China School of Economics and Management, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China, MIIT Key Laboratory of Data Intelligence and Management, Beihang University, Beijing, China School of Economics and Management, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China Shenzhen Institute of Beihang University, Shenzhen, China, MIIT Key Laboratory of Data Intelligence and Management, Beihang University, Beijing, China School of Economics and Management, Beihang University, Beijing, China

Abstract: Effective urban traffic management is vital for sustainable city development, relying on intelligent systems with machine learning tasks such as traffic flow prediction and travel time estimation. Traditional approaches usually focus on static road network and trajectory representation learning, and overlook the dynamic nature of traffic states and trajectories, which is crucial for downstream tasks. To address this gap, we propose TRACK, a novel framework to bridge traffic state and trajectory data for dynamic road network and trajectory representation learning. TRACK leverages graph attention networks (GAT) to encode static and spatial road segment features, and introduces a transformerbased model for trajectory representation learning. By incorporating transition probabilities from trajectory data into GAT attention weights, TRACK captures dynamic spatial features of road segments. Meanwhile, TRACK designs a traffic transformer encoder to capture the spatial-temporal dynamics of road segments from traffic state data. To further enhance dynamic representations, TRACK proposes a co-attentional transformer encoder and a trajectory-traffic state matching task. Extensive experiments on real-life urban traffic datasets demonstrate the superiority of TRACK over state-of-the-art baselines. Case studies confirm TRACK’s ability to capture spatial-temporal dynamics effectively.

Abstract: Contemporary recommendation systems predominantly rely on ID embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance and poor generalizations. Leveraging the capability of large language models to comprehend and reason about textual content presents a promising avenue for advancing recommendation systems. To achieve this, we propose an Llmdriven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through experiments on the real large-scale industrial dataset and online A/B tests, we demonstrate the efficacy of our approach in industry application. We also achieve state-of-the-art performance on six Amazon Review datasets to verify the superiority of our method.

The State Key Laboratory of Internet of Things for Smart City, University of Macau, China, The State Key Laboratory of Internet of Things for Smart City, University of Macau, China, The State Key Laboratory of Internet of Things for Smart City, University of Macau, China, The State Key Laboratory of Internet of Things for Smart City, University of Macau, China, Kash Institute of Electronics and Information Industry, The State Key Laboratory of Internet of Things for Smart City, University of Macau, China

Abstract: The rapid spread of diverse information on online social platforms has prompted both academia and industry to realize the importance of predicting content popularity, which could benefit a wide range of applications, such as recommendation systems and strategic decisionmaking. Recent works mainly focused on extracting spatiotemporal patterns inherent in the information diffusion process within a given observation period so as to predict its popularity over a future period of time. However, these works often overlook the future popularity trend, as future popularity could either increase exponentially or stagnate, introducing uncertainties to the prediction performance. Additionally, how to transfer the preceding-term dynamics learned from the observed diffusion process into future-term trends remains an unexplored challenge. Against this background, we propose CasFT, which leverages observed information Cascades and dynamic cues extracted via neural ODEs as conditions to guide the generation of Future popularity-increasing Trends through a diffusion model. These generated trends are then combined with the spatiotemporal patterns in the observed information cascade to make the final popularity prediction. Extensive experiments conducted on three real-world datasets demonstrate that CasFT significantly improves the prediction accuracy compared to state-of-the-art approaches.

Abstract: We focus on the medication recommendation problem aiming to recommend accurate medications for a patient’s current visit. Most existing methods for this problem utilize the patient’s current health status, medications prescribed at her past visits, and an Electronic Health Records (EHR) graph which represents whether medications have been coprescribed. However, we point out their two limitations: (1) they have difficulty in utilizing only the medications which have been prescribed in health status similar to the patient’s current health status, regardless of whether they are prescribed at her past visits or at other patients’ visits; (2) for two medications that have ever been co-prescribed, their EHR graph does not consider the degree to which one medication is prescribed together when the other is prescribed. To address these two limitations, we propose a novel medication recommendation framework, named HI-DR (pronounced as ‘Hi Doctor’), composed of following two core ideas: (Idea 1) Health status-aware attentIon; (Idea 2) an electronic health recorDs gRaph+. Extensive experiments on real-world datasets demonstrate the significant superiority of HI-DR (up to 18.69% higher accuracy than the best competitor) and the effectiveness of two core ideas in HI-DR.

Abstract: Information popularity prediction, aiming to predict the growth of user participation in a trending topic diffusion, is a fundamental task in social networks. Existing methods often treat information diffusion as a single independent process, ignoring the ``public opinion field effect'' where multiple trending topics coexist and compete for user attention simultaneously. Inspired by Hawkes theory, we propose a novel Hawkesprocess-based learning model for information popularity prediction, which takes into account both the temporal correlation among users' propagation behaviors in several topics diffusion and public opinion field effect in social networks. We first propose an improved neural Hawkes process to capture comprehensive propagation law from multiple dimensions and then propose a novel public opinion field paradigm based on the improved Hawkes process and cascade structure. We design a novel learning framework incorporating the public opinion field paradigm to extract high-quality representations for information popularity prediction. Extensive experiments on four real-world datasets validate that our model significantly outperforms the state-of-the-art competitors.

Abstract: Sequential Recommender Systems (SRS), which model a user's interaction history to predict the next item of interest, are widely used in various applications. However, existing SRS often struggle with lowpopularity items, a challenge known as the long-tail problem. This issue leads to reduced serendipity for users and diminished profits for sellers, ultimately harming the overall system. Large Language Model (LLM) has the ability to capture semantic relationships between items, independent of their popularity, making them a promising solution to this problem. In this paper, we introduce LLMEmb, a novel method leveraging LLM to generate item embeddings that enhance SRS performance. To bridge the gap between general-purpose LLM and the recommendation domain, we propose a Supervised Contrastive Fine-Tuning (SCFT) approach. This approach includes attribute-level data augmentation and a tailored contrastive loss to make LLM more recommendation-friendly. Additionally, we emphasize the importance of integrating collaborative signals into LLM-generated embeddings, for which we propose Recommendation Adaptation Training (RAT). This further refines the embeddings for optimal use in SRS. The LLMEmb-derived embeddings can be seamlessly integrated with any SRS model, underscoring the practical value. Comprehensive experiments conducted on three real-world datasets demonstrate that LLMEmb significantly outperforms existing methods across multiple SRS models.

Abstract: Multitask learning (MTL) has emerged as a successful strategy in industrial-scale recommender systems, offering significant advantages such as capturing diverse users’ interests and accurately detecting different behaviors like “click" or “dwell time". However, negative transfer and the seesaw phenomenon pose challenges to MTL models due to the complex and often contradictory task correlations in real-world recommendations. To address the problem while making better use of personalized information, we propose a personalized Direct Routing Gradient framework (DRGrad), which consists of three key components: router, updater and personalized gate network. DRGrad judges the stakes between tasks in the training process, which can leverage all valid gradients for the respective task to reduce conflicts. We evaluate the efficiency of DRGrad on complex MTL using a real-world recommendation dataset with 15 billion samples. The results show that DRGrad’s superior performance over competing state-of-the-art MTL models, especially in terms of AUC (Area Under the Curve) metrics, indicating that it effectively manages task conflicts in multi-task learning environments without increasing model complexity, while also addressing the deficiencies in noise pro-cessing. Moreover, experiments on the public Census-income dataset and Synthetic dataset, have demonstrated the capability of DRGrad in judging and routing the stakes between tasks with varying degrees of correlation and personalization.

Abstract: Due to the remarkable reasoning ability, Large language models (LLMs) have demonstrated impressive performance in knowledge graph question answering (KGQA) tasks, which find answers to natural language questions over knowledge graphs (KGs). To alleviate the hallucinations and lack of knowledge issues of LLMs, existing methods often retrieve the questionrelated information from KGs to enrich the input context. However, most methods focus on retrieving the relevant information while ignoring the importance of different types of knowledge in reasoning, which degrades their performance. To this end, this paper reformulates the KGQA problem as a graphical model and proposes a three-stage framework named the Evidence Path Enhanced Reasoning Model (EPERM) for KGQA. In the first stage, EPERM uses the fine-tuned LLM to retrieve a subgraph related to the question from the original knowledge graph. In the second stage, EPERM filters out the evidence paths that faithfully support the reasoning of the questions, and score their importance in reasoning. Finally, EPERM uses the weighted evidence paths to reason the final answer. Since considering the importance of different structural information in KGs for reasoning, EPERM can improve the reasoning ability of LLMs in KGQA tasks. Extensive experiments on benchmark datasets demonstrate that EPERM achieves superior performances in KGQA tasks.

Abstract: Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixedlength window for training and prediction on a single dataset, and cannot adapt to different scenarios. The powered pre-trained large language model has introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.

College of Computer Science and Technology & Qingdao Institute of Software, China University of Petroleum(East China), China College of Science, China University of Petroleum(East China), China, College of Computer Science and Technology & Qingdao Institute of Software, China University of Petroleum(East China), China, College of Computer Science and Technology & Qingdao Institute of Software, China University of Petroleum(East China), China School of Software & Microelectronics, Peking University, China, College of Computer Science and Technology & Qingdao Institute of Software, China University of Petroleum(East China), China, College of Computer Science and Technology & Qingdao Institute of Software, China University of Petroleum(East China), China

Abstract: Recent years, multihop reasoning has been widely studied for knowledge graph (KG) reasoning due to its efficacy and interpretability. However, previous multi-hop reasoning approaches are subject to two primary shortcomings. First, agents struggle to learn effective and robust policies at the early phase due to sparse rewards. Second, these approaches often falter on specific datasets like sparse knowledge graphs, where agents are required to traverse lengthy reasoning paths. To address these problems, we propose a multi-hop reasoning model with dual agents based on hierarchical reinforcement learning (HRL), which is named FULORA. FULORA tackles the above reasoning challenges by eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks on the simplified knowledge graph to provide stage-wise hints for the low-level agent walking on the original knowledge graph. In this framework, the low-level agent optimizes a value function that balances two objectives: (1) maximizing return, and (2) integrating efficient guidance from the high-level agent. Experiments conducted on three real-word knowledge graph datasets demonstrate that FULORA outperforms RL-based baselines, especially in the case of long-distance reasoning.

Abstract: Textattributed graphs have recently garnered significant attention due to their wide range of applications in web domains. Existing methodologies employ word embedding models for acquiring text representations as node features, which are subsequently fed into Graph Neural Networks (GNNs) for training. Recently, the advent of Large Language Models (LLMs) has introduced their powerful capabilities in information retrieval and text generation, which can greatly enhance the text attributes of graph data. Furthermore, the acquisition and labeling of extensive datasets are both costly and time-consuming endeavors. Consequently, few-shot learning has emerged as a crucial problem in the context of graph learning tasks. In order to tackle this challenge, we propose a lightweight paradigm called LLM4NG, which adopts a plug-and-play approach to establish supervision signals by leveraging LLMs for node generation. Specifically, we utilize LLMs to extract semantic information from the labels and generate samples that belong to these categories as exemplars. Subsequently, we employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph. This approach harnesses LLMs for enhancing class-level information and seamlessly introduces labeled nodes and edges without modifying the raw dataset, thereby facilitating the node classification task in few-shot scenarios. Extensive experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios. For instance, in the 1-shot setting of the ogbn-arxiv dataset, LLM4NG achieves a 76% improvement over the baseline model.

Abstract: Representation learning of urban spatialtemporal data is fundamental and critical, serving a wide range of intelligent applications. Given that road networks and trajectories are inherently interrelated, their joint representation learning can significantly enhance the accuracy and utility of these applications. However, effectively learning joint representations for these two types of data remains challenging, particularly due to the complexities of interaction modeling and cross-scale optimization. To this end, we propose a unified framework, named UniTR, for joint representation learning of road networks and trajectories. Specifically, we first design a hierarchical propagation mechanism to model the complex many-to-many interactions between road networks and trajectories, thereby generating informative embeddings. Then, a triple-level contrastive optimization module is incorporated to systematically select valid positive and negative samples, further refining the embeddings. Experiments conducted on real-world datasets from two cities clearly demonstrate the effectiveness and superiority of UniTR.

Abstract: Intuitively, an ideal collaborative filtering (CF) model should learn from users' full rankings over all items to make optimal topK recommendations. Due to the absence of such full rankings in practice, most CF models rely on pairwise loss functions to approximate full rankings, resulting in an immense performance gap. In this paper, we provide a novel analysis using the multiple ordinal classification concept to reveal the inevitable gap between a pairwise approximation and the ideal case. However, bridging the gap in practice encounters two formidable challenges: (1) none of the real-world datasets contains full ranking information; (2) there does not exist a loss function that is capable of consuming ranking information. To overcome these challenges, we propose a pseudo-ranking paradigm (PRP) that addresses the lack of ranking information by introducing pseudo-rankings supervised by an original noise injection mechanism. Additionally, we put forward a new ranking loss function designed to handle ranking information effectively. To ensure our method's robustness against potential inaccuracies in pseudo-rankings, we equip the ranking loss function with a gradient-based confidence mechanism to detect and mitigate abnormal gradients. Extensive experiments on four real-world datasets demonstrate that PRP significantly outperforms state-of-the-art methods.

Abstract: In this article we address the multirobot task allocation problem, where robots must cooperatively assign themselves to accomplish a set of tasks in their environment. We consider the colony maintenance problem as an example, where a team of robots are tasked with continuously maintaining the energy supply of a central colony. We model this as a global game, where each robot measures the energy level of the colony, and the current number of assigned robots, to determine whether or not to forage for energy sources. The key to our approach is introducing a negative feedback term into the robots' utility, which also eliminates the trivial solution where foraging or not foraging are strictly dominant strategies. We compare our approach qualitatively to existing global games, where a positive positive feedback term admits threshold-based decision making, and encourages many robots to forage simultaneously. We show how positive feedback can lead to a cascading failure in the presence of a human who recruits robots, and we demonstrate the resilience of our approach in simulation.

Abstract: We consider committee election of k >= 3 (out of m >= k + 1) candidates, where the voters and the candidates are associated with locations on the real line. Each voter’s cardinal preferences over candidates correspond to her distance to the candidate locations, and each voter’s cardinal preferences over committees is defined as her distance to the nearest candidate elected in the committee. We consider a setting where the true distances and the locations are unknown. We can nevertheless have access to degraded information which consists of an order of candidates for each voter. We investigate the best possible distortion (a worstcase performance criterion) w.r.t. the social cost achieved by deterministic committee election rules based on ordinal preferences submitted by n voters and few additional distance queries. We show that for any k >= 3, the best possible distortion of any deterministic rule that uses at most k−3 distance queries cannot be bounded by any function of n, m and k. We present deterministic rules for k-committee election with distortion of O(n) with O(k) distance queries and O(1) with O(k log(n)) distance queries.

Abstract: The distribution biases and scarcity of samples in multisource data present significant challenges for few-shot learning (FSL) tasks based on brain-computer interface (BCI). Recent efforts have explored the application of diffusion mechanisms in FSL, typically utilizing labeled data to augment the support set. However, this approach has not effectively utilized unlabeled data nor addressed distribution biases. Inspired by the latest advancements in FSL, we propose the manhattan self-attention diffusion residual networks (MSADiff-Resnet) with dynamic bias rectiﬁcation. This model explicitly adds the manhattan self-attention diffusion layer to resnet, using attention mechanisms and manhattan distance-based decay function to control local diffusion intensity, and adjusts the global diffusion strength through the parameter. This diffusion mechanism bridges labeled and unlabeled data, addressing the limitations associated with sample availability. Additionally, we effectively tackle the distribution biases of multi-source data through inter-class bias rectiﬁcation and dynamic intra-class bias rectiﬁcation. Moreover, this study presents for the first time a universal deep learning framework specifically designed for BCI-based FSL tasks. Extensive experiments on multi-source BCI task datasets have validated the effectiveness of proposed method.

Abstract: Robot task planning is an important problem for autonomous robots in longhorizon challenging tasks. As large pre-trained models have demonstrated superior planning ability, recent research investigates utilizing large models to achieve autonomous planning for robots in diverse tasks. However, since the large models are pre-trained with Internet data and lack the knowledge of real task scenes, large models as planners may make unsafe decisions that hurt the robots and the surrounding environments. To solve this challenge, we propose a novel Safe Planner framework, which empowers safety awareness in large pre-trained models to accomplish safe and executable planning. In this framework, we develop a safety prediction module to guide the high-level large model planner, and this safety module trained in a simulator can be effectively transferred to real-world tasks. The proposed Safe Planner framework is evaluated on both simulated environments and real robots. The experiment results demonstrate that Safe Planner not only achieves state-of-the-art task success rates, but also substantially improves safety during task execution.

Abstract: Constructing online HighDefinition (HD) maps is crucial for the static environment perception of autonomous driving systems (ADS). Existing solutions typically attempt to detect vectorized HD map elements with unified models; however, these methods often overlook the distinct characteristics of different non-cubic map elements, making accurate distinction challenging. To address these issues, we introduce an expert-based online HD map method, termed MapExpert. MapExpert utilizes sparse experts, distributed by our routers, to describe various non-cubic map elements accurately. Additionally, we propose an auxiliary balance loss function to distribute the load evenly across experts. Furthermore, we theoretically analyze the limitations of prevalent bird's-eye view (BEV) feature temporal fusion methods and introduce an efficient temporal fusion module called Learnable Weighted Moving Descentage. This module effectively integrates relevant historical information into the final BEV features. Combined with an enhanced slice head branch, the proposed MapExpert achieves state-of-the-art performance and maintains good efficiency on both nuScenes and Argoverse2 datasets.

Abstract: Generalized planning is concerned with how to find a single plan to solve multiple similar planning instances. ions are widely used for solving generalized planning, and QNP (qualitative numeric planning) is a popular abstract model. Recently, Cui et al. showed that a plan solves a sound and complete abstraction of a generalized planning problem if and only if the refined plan solves the original problem. However, existing work on automatic abstraction for generalized planning can hardly guarantee soundness let alone completeness. In this paper, we propose an automatic sound and complete abstraction method for generalized planning with baggable types. We use a variant of QNP, called bounded QNP (BQNP), where integer variables are increased or decreased by only one. Since BQNP is undecidable, we propose and implement a sound but incomplete solver for BQNP. We present an automatic method to abstract a BQNP problem from a classical planning instance with baggable types. The basic idea for abstraction is to introduce a counter for each bag of indistinguishable tuples of objects. We define a class of domains called proper baggable domains, and show that for such domains, the BQNP problem got by our automatic method is a sound and complete abstraction for a generalized planning problem whose instances share the same bags with the given instance but the sizes of the bags might be different. Thus, the refined plan of a solution to the BQNP problem is a solution to the generalized planning problem. Finally, we implement our abstraction method and experiments on a number of domains demonstrate the promise of our approach.

Abstract: This paper shows that the semantics of programs with aggregates implemented by the solvers clingo and dlv can be characterized as extended FirstOrder formulas with intensional functions in the logic of Here-and-There. Furthermore, this characterization can be used to study the strong equivalence of programs with aggregates under either semantics. We also present a transformation that reduces the task of checking strong equivalence to reasoning in classical First-Order logic, which serves as a foundation for automating this procedure.

Abstract: ion is an important and useful concept in the field of artificial intelligence. To the best of our knowledge, there is no syntactic method to compute a sound and complete abstraction from a given lowlevel basic action theory and a refinement mapping. This paper aims to address this issue. To this end, we first present a variant of situation calculus, namely linear integer situation calculus, which serves as the formalization of high-level basic action theory. We then migrate Banihashemi, De Giacomo, and Lesperance’s abstraction framework to one from linear integer situation calculus to extended situation calculus. Furthermore, we identify a class of Golog programs, namely guarded actions, so as to restrict low-level Golog programs, and impose some restrictions on refinement mappings. Finally, we design a syntactic approach to computing a sound and complete abstraction from a low-level basic action theory and a restricted refinement mapping.

State Key Laboratory of Multimodal Artificial Intelligence Systems,Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems,Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China Beijing Wenge Technology Co., Ltd, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems,Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: Large language models (LLMs) have achieved significant progress in mathematical reasoning, especially in elementary math. However, they remain indisposed on tackling complex questions at highschool or college levels, which put forward a more advanced requirement of mastering relevant mathematical theorems. For we humans, whether selecting the appropriate theorems according to the provided question is a crucial factor affecting the quality of the ultimate solutions, yet which has been neglected by previous research in the field of LLM reasoning. In this paper, we propose a novel approach to enhance the LLM's capability of utilizing the mathematical theorems to specific problems, which we refer to as Theorem Rationale (TR). To this end, a new dataset encompassing problem-theorem-solution triples is deliberately established for transferring principles of TR. Furthermore, we develop an evolving strategy to boost hierarchical instructions oriented on the theorems to alleviate difficulty in acquiring the curated data and facilitate the digestion of theorem application from various perspectives. Evaluations on a wide range of public datasets exhibit that the model fine-tuned with our dataset achieves consistent improvements at varying mathematical levels compared to the backbone. And further ablation studies illustrate the effectiveness of our proposed evolutionary strategies on enhancing the model's capability of math problem-solving. Overall, extensive experiments reveal the potential of our proposed method which highlights the significance of aligning the problems with the concrete theorems for LLMs to alleviate hallucination and improve the models' mathematical reasoning capabilities.

Abstract: Tensor decomposition (TD) models are promising solutions for knowledge graph completion due to their simple structures but powerful representation capacities. The TD models typically adopt Tucker decomposition with a structured core tensor. Some models with a sparse core tensor, such as DistMult and ComplEx, are too simple and thus limit the interaction between embedding components, while other models with a dense core tensor are too complex and may lead to significant overfitting. To address these issues, we propose a new TD model called SPAC (Sparse Partitioning and Adaptive Core tensor pruning) model for knowledge graph completion. Specifically, SPAC captures coarse and finegrained semantic information using a hybrid core tensor, where auxiliary cores are used to model sparse interactions and main cores for dense interactions. Moreover, SPAC introduces a gating mechanism to control the output of intermediate variables, enhancing the interaction between different partition groups. Furthermore, SPAC employs an adaptive pruning approach to dynamically adjust the shape of the core tensor. Due to the elaborate model design, the proposed TD model enhances expressive capacity and reduces the number of parameters in the core tensor. Experiments are conducted on datasets FB15k-237, WN18RR, and YAGO3-10. The results demonstrate that SPAC outperforms state-of-the-art tensor decomposition models, including MEIM and Tucker models. A series of ablation studies show that the gating mechanism and adaptive pruning strategy in SPAC are crucial for the performance improvement.

Abstract: Medical time series are often irregular and face significant missingness, posing challenges for data analysis and clinical decisionmaking. Existing methods typically adopt a single modeling perspective, either treating series data as sequences or transforming them into image representations for further classification. In this paper, we propose a joint learning framework that incorporates both sequence and image representations. We also design three self-supervised learning strategies to facilitate the fusion of sequence and image representations, capturing a more generalizable joint representation. The results indicate that our approach outperforms seven other state-of-the-art models in three representative real-world clinical datasets. We further validate our approach by simulating two major types of real-world missingness through leave-sensors-out and leave-samples-out techniques. The results demonstrate that our approach is more robust and significantly surpasses other baselines in terms of classification performance.

Abstract: Unsupervised Graph Domain Adaptation (UGDA) seeks to bridge distribution shifts between domains by transferring knowledge from labeled source graphs to given unlabeled target graphs. Existing UGDA methods primarily focus on aligning features in the latent space learned by graph neural networks (GNNs) across domains, often overlooking structural shifts, resulting in limited effectiveness when addressing structurally complex transfer scenarios. Given the sensitivity of GNNs to local structural features, even slight discrepancies between source and target graphs could lead to significant shifts in node embeddings, thereby reducing the effectiveness of knowledge transfer. To address this issue, we introduce a novel approach for UGDA called TargetDomain Structural Smoothing (TDSS). TDSS is a simple and effective method designed to perform structural smoothing directly on the target graph, thereby mitigating structural distribution shifts and ensuring the consistency of node representations. Specifically, by integrating smoothing techniques with neighbor- hood sampling, TDSS maintains the structural coherence of the target graph while mitigating the risk of over-smoothing. Our theoretical analysis shows that TDSS effectively reduces target risk by improving model smoothness. Empirical results on three real-world datasets demonstrate that TDSS outperforms recent state-of-the-art baselines, achieving significant improvements across six transfer scenarios.

Institute of Computer Science, University of Tartu, Tartu, Estonia, Institute of Computer Science, University of Tartu, Tartu, Estonia, Institute of Computer Science, University of Tartu, Tartu, Estonia École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, Institute of Computer Science, University of Tartu, Tartu, Estonia, Institute of Computer Science, University of Tartu, Tartu, Estonia, Institute of Computer Science, University of Tartu, Tartu, Estonia, Institute of Computer Science, University of Tartu, Tartu, Estonia

Abstract: As machine learning models evolve, maintaining transparency demands more humancentric explainable AI techniques. Counterfactual explanations, with roots in human reasoning, identify the minimal input changes needed to obtain a given output and, hence, are crucial for supporting decision-making. Despite their importance, the evaluation of these explanations often lacks grounding in user studies and remains fragmented, with existing metrics not fully capturing human perspectives. To address this challenge, we developed a diverse set of 30 counterfactual scenarios and collected ratings across 8 evaluation metrics from 206 respondents. Subsequently, we fine-tuned different Large Language Models (LLMs) to predict average or individual human judgment across these metrics. Our methodology allowed LLMs to achieve an accuracy of up to 63% in zero-shot evaluations and 85% (over a 3-classes prediction) with fine-tuning across all metrics. The fine-tuned models predicting human ratings offer better comparability and scalability in evaluating different counterfactual explanation frameworks.

Abstract: Federated prototype learning is in the spotlight as global prototypes are effective in enhancing the learning of local representation spaces, facilitating the ability to generalize the global model. However, when encountering domainskewed data, conventional federated prototype learning is susceptible to two dilemmas: 1) Local prototypes obtained by averaging intra-class embedding carry domain-specific markers, the margins among aggregated global prototypes could be attenuated and detrimental to inter-class separation. 2) Local domain-skewed embedding may not exhibit a uniform distribution in Euclidean space, which is not conductive to the prototype-induced intra-class compactness. To address the two drawbacks, we go beyond conventional paradigm of federated prototype learning, and propose learnable semantic anchors with hyperspherical contrast (FedLSA) for domain-skewed data. Specifically, we eschew the pattern of yielding prototypes via averaging intra-class embedding and directly learn a set of semantic anchors aided by the global semantic-aware classifier. Meanwhile, the margins between anchors are augmented via pulling apart them, ensuring decent inter-class separation. To guarantee that local domain-skewed representations can be uniformly distributed, local data is projected into the hyperspherical space, and the intra-class compactness is achieved by optimizing the contrastive loss derived from the von Mises-Fisher distribution. Finally, extensive experimental results on three multi-domain datasets show the superiority of the proposed FedLSA compared to existing typical and state-of-the-state methods.

Abstract: The endto-end automated design of machine learning (ML) pipelines significantly reduces the workload for data scientists and democratizes ML for non-experts. Evolutionary algorithm (EA)-based automated ML (AutoML) systems, a prominent category of AutoML, often face inefficiencies due to the costly fitness evaluation of candidate ML pipelines. Although surrogate models have been employed to approximate the true performance of pipelines more quickly, a key challenge remains in effectively bridging the semantic gap between the heterogeneous features of datasets and pipelines. To address this issue, we propose ADELA, a novel accompanying surrogate-based optimization strategy that accelerates EA-based AutoML while retaining the performance of the resulting pipelines. ADELA operates in two phases: Offline, leveraging a high-quality curated pipeline corpus to meta-learn an accompanying surrogate model; and Online, selecting the accompanying pipeline and using the learned model to predict the performance of evaluation pipelines instead of executing them. The accompanying mechanism effectively mitigates the semantic gap between datasets and pipelines, enabling ADELA to reduce computation times by an average of 73.66% while retaining 98.78% of the final pipeline performance, as demonstrated in extensive experimental evaluations.

Abstract: Multivariate time series (MTS) anomaly detection is a critical task that involves identifying abnormal patterns or events in data that consist of multiple interrelated time series. In order to better model the complex interdependence between entities and the various inherent characteristics of each entity, the graph neural network (GNN) based methods are widely adopted by existing methods. In each layer of GNN, node features aggregate information from their neighboring nodes to update their information. In doing so, from shallow layer to deep layer in GNN, original individual node features continue to be weakened and more structural information, i.e., from shortdistance neighborhood to long-distance neighborhood, continues to be enhanced. However, research to date has largely ignored the understanding of how hierarchical graph information is represented and their characteristics that can benefit anomaly detection. Existing methods simply leverage the output from the last layer of GNN for anomaly estimation while neglecting the essential information contained in the intermediate GNN layers. To address such limitations, in this paper, we propose a Graph Mixture of Experts (Graph-MoE) network for multivariate time series anomaly detection, which incorporates the mixture of experts (MoE) module to adaptively represent and integrate hierarchical multi-layer graph information into entity representations. It is worth noting that our Graph-MoE can be integrated into any GNN-based MTS anomaly detection method in a plug-and-play manner. In addition, the memory-augmented routers are proposed in this paper to capture the correlation temporal information in terms of the global historical features of MTS to adaptively weigh the obtained entity representations to achieve successful anomaly estimation. Extensive experiments on five challenging datasets prove the superiority of our approach and each proposed module.

Abstract: In the era of big data, crossmodal retrieval is increasingly important in research and application. Given the latent complexity and non-intuitive nature of cross-modal relationships, leveraging external knowledge such as large models has become a popular approach to facilitate modality alignment. Existing methods typically address these challenges by fine-tuning model encoders or using a fixed number of prompts. However, these approaches struggle with the significant information asymmetry between image-text pairs and the high distribution diversity of image data. These limitations not only introduce noise during training but also reduce the accuracy and generalization capabilities in cross-modal retrieval tasks. To address the above issues, this paper proposes Adaptive Prompt-Based Semantic Embedding with Inspired Potential of Implicit Knowledge (APSE-IPIK). On one hand, we propose an inspired potential strategy to extract fine-grained and multi-perspective text descriptions from large-scale pre-trained multimodal models, which can be seen as implicit knowledge injection. These descriptions are integrated into the visual-semantic embedding through cross-modal semantic alignment with images, balancing the information asymmetry between modalities and reducing the embedding of inaccurate mapping relationships. On the other hand, we construct an instance-level query-based prompt pool strategy to adaptively extract the most relevant prompts, addressing alignment biases caused by intra-modal (especially image) data diversity and improving alignment accuracy. Extensive experiments are conducted on two widely used datasets, Flickr30k and MSCOCO, which show the effectiveness of the proposed method.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billionscale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named “SoLA”, which leverages Soft activation sparsity and Low-rAnk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10%.

Abstract: Multiview unsupervised feature selection (MUFS) has received considerable attention in recent years. Existing MUFS methods for processing unlabeled incomplete multi-view data, where some samples are missing in certain views, first impute the missing values and then perform feature selection on the completed dataset. However, treating imputation and feature selection as two separate processes overlooks their potential interactions. The graph-guided local structure gleaned from feature selection can aid in imputation, which in turn can enhance the feature selection performance. Additionally, most similarity graph-based MUFS methods suffer from high computational costs. To address these problems, we propose a novel MUFS method, termed Tensorial Incomplete Multi-view unsupErvised Feature Selection (TIME-FS). TIME-FS unifies missing value recovery, discriminative feature selection, and low-dimensional representation learning within a joint framework through matrix decomposition. Then, TIME-FS conducts CP decomposition on tensor data formed by the low-dimensional representations of different views to learn a consistent anchor graph across views and a view-preference weight matrix, both of which simultaneously guide missing view imputation and feature selection. Furthermore, an efficient algorithm with low time complexity and rapid convergence is proposed to solve TIME-FS. Extensive experimental results demonstrate the effectiveness and efficiency of TIME-FS over state-of-the-art methods.

Abstract: This study addresses the challenge of detecting anomalies in multivariate time series data. Considering a bag (e.g., multisensor data) consisting of two-dimensional spaces of time points and multivariate instances (e.g., individual sensors), we aim to detect anomalies at both the bag and instance level with a unified model. To circumvent the practical difficulties of labeling at the instance level in such spaces, we adopt a multiple instance learning (MIL)-based approach, which enables learning at both the bag- and instance- levels using only the bag-level labels. In this study, we introduce time-aware and instance-learnable MIL (simply, TAIL-MIL). We propose two specialized attention mechanisms designed to effectively capture the relationships between different types of instances. We innovatively integrate these attention mechanisms with conjunctive pooling applied to the two-dimensional structure at different levels (i.e., bag- and instance-level), enabling TAIL-MIL to effectively pinpoint both the timing and causative multivariate factors of anomalies. We provide theoretical evidence demonstrating TAIL-MIL's efficacy in detecting instances with two-dimensional structures. Furthermore, we empirically validate the superior performance of TAIL-MIL over the state-of-the-art MIL methods and multivariate time-series anomaly detection methods.

Abstract: Audio classification plays a crucial role within fields such as humanmachine interaction and intelligent robotics. However, high-performance audio classification systems typically demand significant computational and storage resources, posing substantial challenges when deploying to the resource-constrained edge devices with an urgent need for such capabilities. To achieve a new level of balance between model complexity and performance, we introduce a novel multi-view method for the separated time-frequency features extraction and utilization, which exists within the proposed Mini Mirror Multi-View Network (M3Net) in the form of the Mirror Attention mechanism. M3Net enables reversible spatial transformation of spectral features is capable of efficiently leverages robust local and global features in the time and frequency domains with low requirements for parameters. Experiments based on Mel-Spectrogram without data augmentation and pre-training indicate that M3Net can achieve classification accuracy over 97% on the UrbanSound8K and SpeechCommandsV2 datasets with only 0.03 million parameters. The contribution of each functional segment in M3Net is fully verified and explained in the ablation experiments.

Abstract: Variate tokenization, which independently embeds each variate as separate tokens, has achieved remarkable improvements in multivariate time series forecasting. However, employing selfattention with variate tokens incurs a quadratic computational cost with respect to the number of variates, thus limiting its training efficiency for large-scale applications. To address this issue, we propose VarDrop, a simple yet efficient strategy that reduces the token usage by omitting redundant variate tokens during training. VarDrop adaptively excludes redundant tokens within a given batch, thereby reducing the number of tokens used for dot-product attention while preserving essential information. Specifically, we introduce k-dominant frequency hashing (k-DFH), which utilizes the ranked dominant frequencies in the frequency domain as a hash value to efficiently group variate tokens exhibiting similar periodic behaviors. Then, only representative tokens in each group are sampled through stratified sampling. By performing sparse attention with these selected tokens, the computational cost of scaled dot-product attention is significantly alleviated. Experiments conducted on public benchmark datasets demonstrate that VarDrop outperforms existing efficient baselines.

Abstract: In this work, we address the challenge of identifying the optimal arm in a stochastic multiarmed bandit scenario with the minimum number of arm pulls, given a predefined error probability in a fixed confidence setting. Our focus is on examining the asymptotic behavior of sample complexity and the distribution of arm weights upon termination, as the error threshold is scaled to zero, under confidence-interval based algorithms. Specifically, we analyze the asymptotic sample complexity and termination weight fractions for the well-known LUCB algorithm, and introduce a new variant, the LUCB Greedy algorithm. We demonstrate that the upper bounds on the sample complexities for both algorithms are asymptotically within a constant factor of the established lower bounds.

Abstract: Vision transformers (ViTs) have demonstrated remarkable performance in a variety of vision tasks. Despite their promising capabilities, training a ViT requires a large amount of diverse data. Several studies empirically found that using rich data augmentations, such as Mixup, Cutmix, and random erasing, is critical to the successful training of ViTs. Now, the use of rich data augmentations has become a standard practice in the current state. However, we report a vulnerability to this practice: Certain data augmentations such as Mixup cause a variance shift in the positional embedding of ViT, which has been a hidden factor that degrades the performance of ViT during the test phase. We claim that achieving a stable effect from positional embedding requires a specific condition on the image, which is often broken for the current data augmentation methods. We provide a detailed analysis of this problem as well as the correct configuration for these data augmentations to remove the side effects of variance shift. Experiments showed that adopting our guidelines improves the performance of ViTs compared with the current configuration of data augmentations.

Abstract: The multiinstance multi-label (MIML) problem is a new supervised learning paradigm that has emerged to efficiently represent complex data. Therefore, various similarity-based algorithms have been proposed, but existing algorithms commonly measure similarity by considering only the structural relationships in the feature space without utilizing information from the label space. As these approaches do not adequately reflect the complex properties of MIML data, it is essential to improve the accuracy of MIML classification by utilizing information from both feature and label spaces. Thus, we propose a new algorithm, T-MDML: triplet-based multiple distance metric learning for MIML. T-MDML defines a distance metric by learning a global property shared by the entire label space and a label-specific property for each label. In addition, we simultaneously consider the structural characteristics of features and label space to extract label correlation and incorporate it into the optimization process. In experiments, we demonstrate the efficiency of our label correlation estimation method and verify its performance by applying it to MIMLkNN. We also demonstrate T-MDML’s relative superiority over existing MIML algorithms, as well as its scalability when applied to similarity-based MIML methods.

Abstract: Masked autoencoders (MAEs) have recently demonstrated effectiveness in tabular data imputation. However, due to the inherent heterogeneity of tabular data, the uniform random masking strategy commonly used in MAEs can disrupt the distribution of missingness, leading to suboptimal performance. To address this, we propose a proportional masking strategy for MAEs. Specifically, we first compute the statistics of missingness based on the observed proportions in the dataset, and then generate masks that align with these statistics, ensuring that the distribution of missingness is preserved after masking. Furthermore, we argue that simple MLPbased token mixing offers competitive or often superior performance compared to attention mechanisms while being more computationally efficient, especially in the tabular domain with the inherent heterogeneity. Experimental results validate the effectiveness of the proposed proportional masking strategy across various missing data patterns in tabular datasets.

Abstract: Bit Flip Attacks (BFAs) are a wellestablished class of adversarial attacks, originally developed for Convolutional Neural Networks within the computer vision domain. Most recently, these attacks have been extended to target Graph Neural Networks (GNNs), revealing significant vulnerabilities. This new development naturally raises questions about the best strategies to defend GNNs against BFAs, a challenge for which no solutions currently exist. Given the applications of GNNs in critical fields, any defense mechanism must not only maintain network performance, but also verifiably restore the network to its pre-attack state. Verifiably restoring the network to its pre-attack state also eliminates the need for costly evaluations on test data to ensure network quality. We offer first insights into the effectiveness of existing honeypot- and hashing-based defenses against BFAs adapted from the computer vision domain to GNNs, and characterize the shortcomings of these approaches. To overcome their limitations, we propose Crossfire, a hybrid approach that exploits weight sparsity and combines hashing and honeypots with bit-level correction of out-of-distribution weight elements to restore network integrity. Crossfire is retraining-free and does not require labeled data. Averaged over 2,160 experiments on six benchmark datasets, Crossfire offers a 21.8% higher probability than its competitors of reconstructing a GNN attacked by a BFA to its pre-attack state. These experiments cover up to 55 bit flips from various attacks. Moreover, it improves post repair prediction quality by 10.85%. Computational and storage overheads are negligible compared to the inherent complexity of even the simplest GNNs.

Abstract: Lowrank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to boost 2-bit LLM accuracy. Based on rank analysis revealing model-wise activation discrepancy loss's rank-insensitive nature, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.

Abstract: Data sharing is necessary for AI to be widely used, but sharing sensitive data with others with privacy is risky. To solve these problems, it is necessary to synthesize realistic tabular data. In many cases, tabular data contains a mixture of continuous, mixed, categorical columns. Moreover, columns of the same type may have multimodal distribution or be highly imbalanced. These issues make it challenging to synthesize tabular data. The synthesized tabular data should reflect the relational meaning between columns of tabular data, so modeling the probability distribution of tabular data is a nontrivial task. Traditional tabular data synthesizing models are based on GAN or diffusion models and are built using fully connected or convolutional layers. However, fully connected layers have the disadvantage of low inductive bias, and convolutional layers are not invariant to the column order of tabular data. Therefore, we assume that converting tabular data into graph-structured data and using a graph neural network would produce better synthetic data than using fully connected layers or convolutional layers. Our study aims to show that GANs constructed with graph neural networks can outperform existing GAN models using fully connected layers or convolutional layers. We propose CG-TGAN, a conditional GAN built using graph neural networks. To learn how to synthesize realistic data, the graph neural networks in the discriminator and generator learn graph-level tasks and node-level tasks together. The discriminator of CG-TGAN learns a graph-level task to distinguish between real and synthetic data and node-level tasks to predict the value of the target node. CG-TGAN’s generator learns a graph-level task to synthesize an overall graph similar to real data and node-level tasks to learn how to synthesize a fake graph with the proper relation between nodes. In this paper, we show that CG-TGAN outperforms GAN-based models and is comparable to diffusion-based models.

MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Abstract: Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing eventbased semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.

Abstract: Recent studies have revealed the vulnerability of graph neural networks (GNNs) to adversarial poisoning attacks on node classification tasks. Current defensive methods require substituting the original GNNs with defense models, regardless of the original's type. This approach, while targeting adversarial robustness, compromises the enhancements developed in prior research to boost GNNs' practical performance. Here we introduce Grimm, the first plugand-play defense model. With just a minimal interface requirement for extracting features from any layer of the protected GNNs, Grimm is thus enabled to seamlessly rectify perturbations. Specifically, we utilize the feature trajectories (FTs) generated by GNNs, as they evolve through epochs, to reflect the training status of the networks. We then theoretically prove that the FTs of victim nodes will inevitably exhibit discriminable anomalies. Consequently, inspired by the natural parallelism between the biological nervous and immune systems, we construct Grimm, a comprehensive artificial immune system for GNNs. Grimm not only detects abnormal FTs and rectifies adversarial edges during training but also operates efficiently in parallel, thereby mirroring the concurrent functionalities of its biological counterparts. We experimentally confirm that Grimm offers four empirically validated advantages: 1) Harmlessness, as it does not actively interfere with GNN training; 2) Parallelism, ensuring monitoring, detection, and rectification functions operate independently of the GNN training process; 3) Generalizability, demonstrating compatibility with mainstream GNNs such as GCN, GAT, and GraphSAGE; and 4) Transferability, as the detectors for abnormal FTs can be efficiently transferred across different systems for one-step rectification.

Abstract: Autonomous underwater vehicle (AUV) is crucial for marine applications such as ocean data collection, pollution monitoring, and navigation. However, their limited energy resources constrain their operational duration, posing a significant challenge for longterm operations. Due to the complex and unpredictable nature of the underwater environment, AUVs allocate energy to their sensing systems to sense the surrounding environment and avoid obstacles. Existing methods focus on reducing energy consumption on AUV computing and movement, neglecting sensing energy consumption and few attempts have been made to balance the AUV energy and sensing ability with a flexible sensing system. Along these lines, we consider both AUV energy consumption and flexible sensing abilities, and propose a deep reinforcement learning-based method to Reduce Energy Consumption by AUV Sensing system (RECS). Specifically, we build an AUV sensing system in a 2-dimension space, with controllable 8-direction sensing abilities to collect the environment information dynamically. Then we divide the underwater environment into several areas and assign weights on the edges of areas based on the AUV planned path. Additionally, we dynamically switch the sensors in different directions and radii to sense the edges of the area where the AUV is located. The Artificial Potential Field (APF) method is employed to re-plan the AUV path to avoid obstacles and reach the target point effectively. Experimental results demonstrate that compared to full sensors on, our method reduces energy consumption by 53.48% and is capable of generalizing to varying environments and varying sensing system radii.

Abstract: Obtaining highprecision aerodynamics in the automotive industry relies on large-scale simulations with computational fluid dynamics, which are generally time-consuming and computationally expensive. Recent advances in operator learning for partial differential equations offer promising improvements in terms of efficiency. However, capturing intricate physical correlations from extensive and varying geometries while balancing large-scale discretization and computational costs remains a significant challenge. To address these issues, we propose **AeroGTO**, an efficient graph-transformer operator designed specifically for learning large-scale aerodynamics in engineering applications. AeroGTO combines local feature extraction through message passing and global correlation capturing via projection-inspired attention, employing a frequency-enhanced graph neural network augmented with k-nearest neighbors to handle three-dimensional (3D) irregular geometries. Moreover, the transformer architecture adeptly manages multi-level dependencies with only linear complexity concerning the number of mesh points, enabling fast inference of the model. Given a car's 3D mesh, AeroGTO accurately predicts surface pressure and estimates drag. In comparisons with five advanced models, AeroGTO is extensively tested on two industry-standard benchmarks, Ahmed-Body and DrivAerNet, achieving a 7.36% improvement in surface pressure prediction and a 10.71% boost in drag coefficient estimation, with fewer FLOPs and only 1% of the parameters used by the prior leading method.

Abstract: Multivariate time series anomaly detection has numerous realworld applications and is being extensively studied. Modeling pairwise correlations between variables is crucial. Existing methods employ learnable graph structures and graph neural networks to explicitly model the spatial dependencies between variables. However, these methods are primarily based on prediction or reconstruction tasks, which can only learn similarity relationships between sequence embeddings and lack interpretability in how graph structures affect time series evolution. In this paper, we designed a framework that models spatial dependencies using interpretable causal relationships and detects anomalies through changes in causal patterns. Specifically, we propose a method to dynamically discover Granger causality using gradients in nonlinear deep predictors and employ a simple sparsification strategy to obtain a Granger causality graph, detecting anomalies from a causal perspective. Experiments on real-world datasets demonstrate that the proposed model achieves more accurate anomaly detection compared to baseline methods.

Abstract: Due to the capability of dynamic state space models (SSMs) in capturing longrange dependencies with linear-time computational complexity, Mamba has shown notable performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, resulting in promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. Moreover, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers: ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new family of vision backbone networks with sparsely cross-connected layers, achieving an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to its counterparts. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5% to 83.5%, while SparX-Swin-T achieves a 1.3% increase in top-1 accuracy compared to Swin-T. Extensive experimental results demonstrate that our new connection mechanism possesses both superior performance and generalization capabilities on various vision tasks.

Abstract: This paper targets on the regularization effect of momentumbased methods in regression settings and analyzes the popular diagonal linear networks to precisely characterize the implicit bias of continuous versions of heavy-ball (HB) and Nesterov's method of accelerated gradients (NAG). We show that, HB and NAG exhibit different implicit bias compared to GD for diagonal linear networks, which is different from the one for classic linear regression problem where momentum-based methods share the same implicit bias with GD. Specifically, the role of momentum in the implicit bias of GD is twofold: (a) HB and NAG induce extra initialization mitigation effects similar to SGD that are beneficial for generalization of sparse regression; (b) the implicit regularization effects of HB and NAG also depend on the initialization of gradients explicitly, which may not be benign for generalization. As a result, whether HB and NAG have better generalization properties than GD jointly depends on the aforementioned twofold effects determined by various parameters such as learning rate, momentum factor, and integral of gradients. Our findings highlight the potential beneficial role of momentum and can help understand its advantages in practice such as when it will lead to better generalization performance.

Abstract: Handling heterogeneous data in tabular datasets poses a significant challenge for deep learning models. While attentionbased architectures and self-supervised learning have achieved notable success, their application to tabular data remains less effective over linear and tree based models. Although several breakthroughs have been achieved by models which transform tables into uni-modal transformations like image, language and graph, these models often underperform in the presence of feature heterogeneity. To address this gap, we introduce TabGLM (Tabular Graph Language Model), a novel multi-modal architecture designed to model both structural and semantic information from a table. TabGLM transforms each row of a table into a fully connected graph and serialized text, which are then encoded using a graph neural network (GNN) and a text encoder, respectively. By aligning these representations through a joint, multi-modal, self-supervised learning objective, TabGLM leverages complementary information from both modalities, thereby enhancing feature learning. TabGLM's flexible graph-text pipeline efficiently processes heterogeneous datasets with significantly fewer parameters over existing Deep Learning approaches. Evaluations across 25 benchmark datasets demonstrate substantial performance gains, with TabGLM achieving an average AUC-ROC improvement of up to 5.56% over State-of-the-Art (SoTA) tabular learning methods.

Abstract: Masked autoencoders (MAE) have recently succeeded in selfsupervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, they ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we present to integrate Collaborative Masking and Targets for boosting Masked AutoEncoders, namely CMT-MAE. Specifically, CMT-MAE leverages a simple collaborative masking mechanism through linear aggregation across attentions from both teacher and student models. We further propose using the output features from those two models as the collaborative target of the decoder. Our simple and effective framework pre-trained on ImageNet-1K achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning results of the vanilla MAE from 83.6% to 85.7%.

Abstract: Deep learning models with largescale backbones have been increasingly adopted to tackle complex visual question answering (VQA) problems in real settings. While providing powerful learning capacities to handle the high-dimensional and multimodal VQA data, these models tend to suffer from the memorization effect leading to overconfident predictions. This can significantly limit their applicability in critical domains (e.g., medicine, cyber-security, and public safety), where confidently wrong predictions may lead to severe consequences. In this work, we propose to perform novel low-rank network factorization, resulting in much better-calibrated networks. These low-rank factorized networks are then aggregated into an ensemble guided by a generalized focal loss to further improve the overall performance and calibration. The overall framework, referred to as the Generalized focal Loss Ensemble of low-rank Networks (GLEN), is an important step toward developing well-calibrated VQA models. We theoretically demonstrate that the generalized focal loss provides a more balanced bias-variance trade-off, which guarantees to lower the confidence of the incorrect predictions, resulting in improved calibration. Extensive experimentation conducted on benchmark datasets and comparison on various VQA models shows that GLEN leads to much better calibration over both in-distribution and out-of-distribution data without sacrificing the VQA accuracy.

Abstract: The purpose of partial multilabel feature selection is to select the most representative feature subset, where the data comes from partial multi-label datasets that have label ambiguity issues. For label disambiguation, previous methods mainly focus on utilizing the information inside the labels and the relationship between the labels and features. However, the information existing in the feature space is rarely considered, especially in partial multi-label scenarios where the noises is considered to be concentrated in the label space while the feature information is correct. This paper proposes a method based on latent space alignment, which uses the information mined in feature space to disambiguate in latent space through the structural consistency between labels and features. In addition, previous methods overestimate the consistency of features and labels in the latent space after convergence. We comprehensively consider the similarity of latent space projections to feature space and label space, and propose new feature selection term. This method also significantly improves the positive label identification ability of the selected features. Comprehensive experiments demonstrate the superiority of the proposed method.

Abstract: Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video questionanswering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further validate the modality importance score with multiple ablation studies to evaluate the performance of MLLMs on permuted feature sets. Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets. Our proposed MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs' capabilities to understand and utilize synergistic relations across modalities.

Abstract: Personalized federated learning (PFL) studies effective model personalization to address the data heterogeneity issue among clients in traditional federated learning (FL). Existing PFL approaches mainly generate personalized models by relying solely on the clients' latest updated models while ignoring their previous updates, which may result in suboptimal personalized model learning. To bridge this gap, we propose a novel framework termed pFedSeq, designed for personalizing adapters to finetune a foundation model in FL. In pFedSeq, the server maintains and trains a sequential learner, which processes a sequence of past adapter updates from clients and generates calibrations for personalized adapters. To effectively capture the cross-client and cross-step relations hidden in previous updates and generate high-performing personalized adapters, pFedSeq adopts the powerful selective state space model (SSM) as the architecture of sequential learner. Through extensive experiments on four public benchmark datasets, we demonstrate the superiority of pFedSeq over state-of-the-art PFL methods.

Shenyang Institute of Automation, Chinese Academy of Sciences Liaoning Liaohe Laboratory Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province University of Chinese Academy of Sciences, Shenyang Institute of Automation, Chinese Academy of Sciences Liaoning Liaohe Laboratory Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province, Shenyang Institute of Automation, Chinese Academy of Sciences Liaoning Liaohe Laboratory Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province, Shenyang Institute of Automation, Chinese Academy of Sciences Liaoning Liaohe Laboratory Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province, Shenyang Institute of Automation, Chinese Academy of Sciences Liaoning Liaohe Laboratory Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province

Abstract: We propose a bearing health management framework leveraging large language models (BearLLM), a novel multimodal model that unifies multiple bearingrelated tasks by processing user prompts and vibration signals. Specifically, we introduce a prior knowledge-enhanced unified vibration signal representation to handle various working conditions across multiple datasets. This involves adaptively sampling the vibration signals based on the sampling rate of the sensor, incorporating the frequency domain to unify input dimensions, and using a fault-free reference signal as an auxiliary input. To extract features from vibration signals, we first train a fault classification network, then convert and align the extracted features into word embedding, and finally concatenate these with text embedding as input to an LLM. To evaluate the performance of the proposed method, we constructed the first large-scale multimodal bearing health management (MBHM) dataset, including paired vibration signals and textual descriptions. With our unified vibration signal representation, BearLLM using one set of pre-trained weights achieves state-of-the-art performance on nine publicly available fault diagnosis benchmarks, outperforming specific methods designed for individual datasets. We provide a dataset, our model, and code to inspire future research on building more capable industrial multimodal models.

Abstract: Pretrained Language Models (PLMs) have become the de facto starting point for finetuning on downstream tasks. However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging. To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the multilayer perceptron (MLP) blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, in this work, we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in the pre-trained models. We demonstrate the effectiveness of our approach by utilizing mainstream PEFT techniques, including QLoRA, LoRA, Adapter, and Prompt/Prefix Tuning, to facilitate efficient model adaptation across diverse downstream tasks. Experiments show that our proposed method, DEFT (Density-Efficient Fine-Tuning), can consistently reduce activation density by up to 44.94% on RoBERTa (Large) and by 53.19 (encoder density) and 90.60% (decoder density) on Flan-T5-XXL (11B) compared to PEFT, using GLUE and QA (SQuAD) benchmarks respectively while maintaining competitive performance on downstream tasks. We also introduce ADA-DEFT, an adaptive variant of our DEFT approach, which achieves significant memory and runtime savings during inference for large models. For instance, ADA-DEFT reduces runtime by 8.75% and memory usage by 16.78% in Flan-T5-XL and by 2.79% and 2.54%, respectively, in Flan-T5- XXL. Additionally, we showcase that DEFT works complementarily with quantized and pruned models.

Abstract: Probability calibration transforms raw output of a classification model into empirically interpretable probability. When the model is purposed to detect rare event and only a small expensive data source has clean labels, it becomes extraordinarily challenging to obtain accurate probability calibration. Utilizing an additional large cheap data source is very helpful, however, such data sources oftentimes suffer from biased labels. To this end, we introduce an approximate expectationmaximization (EM) algorithm to extract useful information from the large data sources. For a family of calibration methods based on the logistic likelihood, we derive closed-form updates and call the resulting iterative algorithm CalEM. We show that CalEM inherits convergence guarantees from the approximate EM algorithm. We test the proposed model in simulation and on the real marketing datasets, where it shows significant performance increases.

Abstract: The ion and Reasoning Corpus (ARC) poses a significant challenge to artificial intelligence, demanding broad generalization and fewshot learning capabilities that remain elusive for current deep learning methods, including large language models (LLMs). While LLMs excel in program synthesis, their direct application to ARC yields limited success. To address this, we introduce ConceptSearch, a novel function-search algorithm that leverages LLMs for program generation and employs a concept-based scoring method to guide the search efficiently. Unlike simplistic pixel-based metrics like Hamming distance, ConceptSearch evaluates programs on their ability to capture the underlying transformation concept reflected in the input-output examples. We explore three scoring functions: Hamming distance, a CNN-based scoring function, and an LLM-based natural language scoring function. Experimental results demonstrate the effectiveness of ConceptSearch, achieving a significant performance improvement over direct prompting with GPT-4. Moreover, our novel concept-based scoring exhibits up to 30\% greater efficiency compared to Hamming distance, measured in terms of the number of iterations required to reach the correct solution. These findings highlight the potential of LLM-driven program search when integrated with concept-based guidance for tackling challenging generalization problems like ARC.

Abstract: Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, have recently been recognized as potential alternatives to softmax attention thanks to their linear complexity and competitive performance. However, although their linearmemory advantage during training enables dealing with long sequences, it is still hard to handle extremely long sequences with very limited computational resources. In this paper, we propose Sequence Accumulation (SA) which leverages the common recurrence feature of linear sequence modeling methods to manage infinite context length even on a single GPU. Specifically, SA divides long input sequences into fixed-length sub-sequences and accumulates intermediate states sequentially, which reaches only constant-memory consumption. Additionally, we further propose Sequence Accumulation with Pipeline Parallelism (SAPP), to train large models with infinite context length, without incurring any additional synchronization costs in the sequence dimension. Extensive experiments with a wide range of context lengths are conducted to validate the effectiveness of SA and SAPP on both single and multiple GPUs. Results show that SA and SAPP enable the training of infinite context length on even very limited resources, and are well compatible with the out-of-the-box distributed training techniques.

Abstract: Dynamic graphs exhibit intertwined spatiotemporal evolutionary patterns, widely existing in the real world. Nevertheless, the structure incompleteness, noise, and redundancy result in poor robustness for Dynamic Graph Neural Networks (DGNNs). Dynamic Graph Structure Learning (DGSL) offers a promising way to optimize graph structures. However, aside from encountering unacceptable quadratic complexity, it overly relies on heuristic priors, making it hard to discover underlying predictive patterns. How to efficiently refine the dynamic structures, capture intrinsic dependencies, and learn robust representations, remains under-explored. In this work, we propose the novel DG-Mamba, a robust and efficient Dynamic Graph structure learning framework with the Selective State Space Models (Mamba). To accelerate the spatio-temporal structure learning, we propose a kernelized dynamic message-passing operator that reduces the quadratic time complexity to linear. To capture global intrinsic dynamics, we establish the dynamic graph as a self-contained system with State Space Model. By discretizing the system states with the cross-snapshot graph adjacency, we enable the long-distance dependencies capturing with the selective snapshot scan. To endow learned dynamic structures more expressive with informativeness, we propose the self-supervised Principle of Relevant Information for DGSL to regularize the most relevant yet least redundant information, enhancing global robustness. Extensive experiments demonstrate the superiority of the robustness and efficiency of our DG-Mamba compared with the state-of-the-art baselines against adversarial attacks.

Abstract: The community has recently developed various trainingtime defenses to counter neural backdoors introduced through data poisoning. In light of the observation that a model learns poisonous samples responsible for the backdoor easier than benign samples, these approaches either use a fixed threshold of the training loss for splitting or iteratively learn a reference model as an oracle for identifying benign samples. In particular, the latter has proven effective for anti-backdoor learning. Our method, HARVEY, leverages a similar yet crucially different technique: learning an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning one on benign data. Consequently, we can identify poisonous samples much more accurately than related work identifies benign samples. This crucial difference enables near-perfect backdoor removal as we demonstrate in our evaluation. HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to the very minimum at a negligible loss in natural accuracy.

Abstract: Hyperdimensional computing (HDC) is an approach from the cognitive science literature for solving information processing tasks using data represented as highdimensional random vectors. The technique has a rigorous mathematical backing, and is easy to implement in energy-efficient and highly parallel hardware like FPGAs and "processing-in-memory" architectures. The effectiveness of HDC in machine learning largely depends on how raw data is mapped to high-dimensional space. In this work, we propose NysHD, a new method for constructing this mapping that is based on the Nyström method from the literature on kernel approximation. Our approach provides a simple recipe to turn any user-defined positive-semidefinite similarity function into an equivalent mapping in HDC. There is a vast literature on the design of such functions for learning problems. Our approach provides a mechanism to import them into the HDC setting, expanding the types of problems that can be tackled using HDC. Empirical evaluation against existing HDC encoding methods shows that NysHD can achieve, on average, 11% and 17% better classification accuracy on graph and string datasets respectively.

Abstract: Federated graph learning (FGL) has emerged as a promising approach to enable collaborative training of graph models while preserving data privacy. However, current FGL methods overlook the outof-distribution (OOD) shifts that occur in real-world scenarios. The distribution shifts between training and testing datasets in each client impact the FGL performance. To address this issue, we propose federated graph OOD generalization framework FedGOG, which includes two modules, i.e., diffusion data exploration (DDE) and latent embedding decorrelation (LED). In DDE, all clients jointly train score models to accurately estimate the global graph data distribution and sufficiently explore sample space using score-based graph diffusion with conditional generation. In LED, each client models a global invariant GNN and a personalized spurious GNN. LED aims to decorrelate spuriousness from invariant relationships by minimizing the mutual information between two categories of latent embeddings from different GNN models. Extensive experiments on six benchmark datasets demonstrate the superiority of FedGOG.

Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Tongji University, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Shandong Branch of National Computer Network Emergency Response Technical Team/Coordination Center (CNCERT/SD), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences) Evay Info, University of Jinan

Abstract: With the proliferation of multimodal data, safe and efficient multi-modal hashing retrieval has become a pressing research challenge, particularly due to concerns over data privacy during centralized processing. To address this, we propose Prototype-based Federated Multi-modal Hashing (PFMH), an innovative framework that seamlessly integrates federated learning with multi-modal hashing techniques. PFMH achieves fine-grained fusion of heterogeneous multi-modal data, enhancing retrieval accuracy while ensuring data privacy through prototype-based communication, thereby reducing communication costs and mitigating risks of data leakage. Furthermore, using a prototype completion strategy, PFMH tackles class imbalance and statistical heterogeneity in multi-modal data, improving model generalization and performance across diverse data distributions. Extensive experiments demonstrate the efficiency and effectiveness of PFMH within the federated learning framework, enabling distributed training for secure and precise multi-modal retrieval in real-world scenarios.

Abstract: MultiAgent Path Finding (MAPF) is a critical component of logistics and warehouse management, which focuses on planning collision-free paths for a team of robots in a known environment. Recent work introduced a novel MAPF approach, LNS2, which proposed to repair a quickly-obtainable set of infeasible paths via iterative re-planning, by relying on a fast, yet lower-quality, priority-based planner. At the same time, there has been a recent push for Multi-Agent Reinforcement Learning (MARL) based MAPF algorithms, which let agents learn decentralized policies that exhibit improved cooperation over such priority planning, although inevitably remaining slower. In this paper, we introduce a new MAPF algorithm, LNS2+RL, which combines the distinct yet complementary characteristics of LNS2 and MARL to effectively balance their individual limitations and get the best from both worlds. During early iterations, LNS2+RL relies on MARL for low-level re-planning, which we show eliminates collisions much more than a priority-based planner. There, our MARL-based planner allows agents to reason about past and future/predicted information to gradually learn cooperative decision-making through a finely designed curriculum learning. At later stages of planning, LNS2+RL adaptively switches to priority-based planning to quickly resolve the remaining collisions, naturally trading-off solution quality and computational efficiency. Our comprehensive experiments on challenging tasks across various team sizes, world sizes, and map structures consistently demonstrate the superior performance of LNS2+RL compared to many MAPF algorithms, including LNS2, LaCAM, and EECBS. In maps with complex structures, the advantages of LNS2+RL are particularly pronounced, with LNS2+RL achieving a success rate of over 50% in nearly half of the tested tasks, while that of LaCAM and EECBS falls to 0%.

Abstract: Multiagent systems must learn to communicate and understand interactions between agents to achieve cooperative goals in partially observed tasks. However, existing approaches lack a dynamic directed communication mechanism and rely on global states, thus diminishing the role of communication in centralized training. Thus, we propose the Transformer-based graph coarsening network (TGCNet), a novel multi-agent reinforcement learning (MARL) algorithm. TGCNet learns the topological structure of a dynamic directed graph to represent the communication policy and integrates graph coarsening networks to approximate the representation of global state during training. It also utilizes the Transformer decoder for feature extraction during execution. Experiments on multiple cooperative MARL benchmarks demonstrate state-of-the-art performance compared to popular MARL algorithms. Further ablation studies validate the effectiveness of our dynamic directed graph communication mechanism and graph coarsening networks.

School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, 100876, China Key Laboratory of Trustworthy Distributed Computing and Service, BUPT, Ministry of Education, Beijing, 100876, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, 100876, China Key Laboratory of Trustworthy Distributed Computing and Service, BUPT, Ministry of Education, Beijing, 100876, China, State Key Laboratory of Media Convergence and Communication, CUC, Beijing, 100024, China State Key Laboratory of Intelligent Game, Yangtze River Delta Research Institute of NPU, Taicang 215400, China, School of Economics and Management, BUPT, Beijing, 100876, China, Beijing Technology and Business University, Beijing, 100048, China, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, 100876, China Xiangjiang Laboratory, Changsha, 410205, China, North China University of Technology, Beijing, 100144, China

Abstract: Personality detection aims to deduce a user’s personality from their published posts. The goal of this task is to map posts to specific personality types. Existing methods encode post information to obtain user vectors, which are then mapped to personality labels. However, existing methods face two main issues: first, only using small models makes it hard to accurately extract semantic features from multiple long documents. Second, the relationship between user vectors and personality labels is not fully considered. To address the issue of poor user representation, we utilize the text embedding capabilities of LLM. To solve the problem of insufficient consideration of the relationship between user vectors and personality labels, we leverage the text generation capabilities of LLM. Therefore, we propose the LLMEnhanced Text Mapping Model (ETM) for Personality Detection. The model applies LLM’s text embedding capability to enhance user vector representations. Additionally, it uses LLM’s text generation capability to create multi-perspective interpretations of the labels, which are then used within a contrastive learning framework to strengthen the mapping of these vectors to personality labels. Experimental results show that our model achieves state-of-the-art performance on benchmark datasets.

Abstract: In this paper we show that corpuslevel aggregation hinders considerably the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much stronger with human judgements and make them behave considerably more similar to neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differs considerably owing to the classical average of ratio versus ratio of averages Mathematical problem. Moreover, as we also show, such difference affects considerably the statistical robustness of corpus-level aggregation. Considering that neural metrics currently only cover a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.

School of Computer Science and Engineering, Central South University, China Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities, Soochow University, China, Research Center for SCIR, Harbin Institute of Technology, Harbin, China, Research Center for SCIR, Harbin Institute of Technology, Harbin, China, National University of Singapore, Singapore, Research Center for SCIR, Harbin Institute of Technology, Harbin, China, Research Center for SCIR, Harbin Institute of Technology, Harbin, China, School of Computer Science and Engineering, Central South University, China, School of Computer Science and Engineering, Central South University, China Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities, Soochow University, China

Abstract: Large VisionLanguage Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

Abstract: Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually utilize an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. It was recently shown that endto-end modeling is effective in the fields of SVS and Text to Speech (TTS). In this work, we thus present a fully end-to-end SVS method together with a chunkwise streaming inference to address the latency issue for practical usages. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method achieves synthesized audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.

Abstract: Large Language Models (LLMs) have gained significant attention for their exceptional performance across various domains. Despite their advancements, concerns persist regarding their implicit bias, which often leads to negative social impacts. Therefore, it is essential to identify the implicit bias in LLMs and investigate the potential threat posed by it. Our study focused on a specific type of implicit bias, termed the ''YesNo'' implicit bias, which refers to LLMs' inherent tendency to favor ''Yes'' or ''No'' responses to a single instruction. By comparing the probability of LLMs generating a series of ''Yes'' versus ''No'' responses, we observed different inherent response tendencies exhibited by LLMs when faced with different instructions. To further investigate the impact of such bias, we developed an attack method called Implicit Bias In-Context Manipulation, attempting to manipulate LLMs' behavior. Specifically, we explored whether the ''Yes'' implicit bias could manipulate ''No'' responses into ''Yes'' in LLMs' responses to malicious instructions, leading to harmful outputs. Our findings revealed that the ''Yes'' implicit bias brings a significant security threat, comparable to that of carefully designed attack methods. Moreover, we offered a comprehensive analysis from multiple perspectives to deepen the understanding of this security threat, emphasizing the need for ongoing improvement in LLMs' security.

Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China, Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China, Key Laboratory of Noise and Vibration Research, Institute of Acoustics Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China., Department of Automation, Tsinghua University, Beijing, China, Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China, Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China, Key Laboratory of Noise and Vibration Research, Institute of Acoustics Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China., Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China

Abstract: Although the complex spectrumbased speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we proposed a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we designed a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.

School of Computer Engineering and Science, Shanghai University, Shanghai, China, School of Computer Engineering and Science, Shanghai University, Shanghai, China, School of Computer Engineering and Science, Shanghai University, Shanghai, China, School of Computer Engineering and Science, Shanghai University, Shanghai, China, School of Computer Engineering and Science, Shanghai University, Shanghai, China, School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing, China

Abstract: Early detection of fake news is crucial to mitigate its negative impact. Current research in fake news detection often utilizes the difference between real and fake news regarding the support degree from reliable sources. However, it has overlooked their different semantic outlier degrees among unreliable source information during the same period. Since fake news often serves idea propaganda, unreliable sources usually publish a lot of information with the same propaganda idea during the same period, making it less likely to be a semantic outlier. To leverage this difference, we propose the ReliableUnreliable Source Reference (RUSR) Fake News Early Detection Method. RUSR introduces the publication background for detected news, which consists of related news with common main objects of description and slightly earlier publication from both reliable and unreliable sources. Furthermore, we develop a strongly preference-driven support degree evaluation model and a two-hop semantic outlier degree evaluation model, which respectively mitigate the interference of news with weak validation effectiveness and the tightness degree of semantic cluster. The designed redistribution module and expanding range relative time encoding are adopted by both models, respectively optimizing early checkpoint of training and expressing the relevance of news implied by their release time gap. Finally, we present a multi-model mutual benefit and collaboration framework that enables the multi-model mutual benefit of generalization in training and multi-perspective prediction of news authenticity in inference. Experiments on our newly constructed dataset demonstrate the superiority of RUSR.

Abstract: Knowledge base question answering (KBQA) refers to the system that produces answers to user queries by reasoning with a largescale structured knowledge base. Advanced works have achieved great success either by generating logical forms (LF) or directly generating answers. Although the former typically yields better performance, these generated LF could be inaccurate, e.g., non-executable. In this regard, large language models (LLMs) have shown exciting potential for accurate generation. However, it is challenging to fine-tune LLMs to generate LF. This is because the context retrieved for prediction typically leads to an excessive number of reasoning paths. In this context, LLMs can generate numerous LF corresponding to these reasoning paths, but a few LF can result in correct answers. Thus, fine-tuning LLMs to generate answer-relevant LF would conflict with the prior knowledge of the LLMs. In this work, we propose a novel learning framework, FM-KBQA, to fine-tune LLMs using multi-task learning for KBQA. Specifically, we propose to fine-tune LLMs using an additional objective: generating the index of reasoning paths that lead to correct answers. This will direct LLMs to pay attention to answer-relevant paths among numerous reasoning paths by completing a simple task where the selected reasoning paths can be supplementary for non-executable LF. Directly generating answers can make LLMs pay attention to the answer-relevant reasoning paths, but it is much more challenging than generating the index of reasoning paths. To verify FM-KBQA's effectiveness, we conduct experiments on mainstream benchmarks, such as WebQuestionsSP (WQSP) and ComplexWebQuestions (CWQ). Extensive evaluations across two public benchmark datasets underscore the superiority of FM-KBQA over current state-of-the-art methods.

Abstract: The Emotional Voice Conversion (EVC) aims to convert the discrete emotional state from the source emotion to the target for a given speech utterance while preserving linguistic content. In this paper, we propose regularizing emotion intensity in the diffusionbased EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels that often lead to inept style manipulations and degradations in quality. On the contrary, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction vector. Furthermore, the updated embeddings can be fused in the reverse diffusion process to generate the speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in the diffusion-based EVC framework, which is the first of its kind work. The effectiveness of the proposed method has been shown across state-of-the-art (SOTA) baselines in terms of subjective and objective evaluations for the English and Hindi languages.

The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China Shanghai Artificial Intelligence Laboratory, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: Tool learning enables Large Language Models (LLMs) to interact with the external environment by invoking tools, enriching the accuracy and capability scope of LLMs. However, previous works predominantly focus on improving the model's toolutilizing accuracy and the ability to generalize to new, unseen tools, excessively forcing LLMs to adjust specific tool-invoking pattern without considering the harm to the model's general performance. This deviates from the actual applications and original intention of integrating tools to enhance the model. To tackle this problem, we dissect the capability trade-offs by examining the hidden representation changes and the gradient-based importance score of the model's components. Based on the analysis result, we propose a Component Importance-based Tool-utilizing ability Injection method (CITI). According to the gradient-based importance score of different components, it alleviates the capability conflicts caused by the fine-tuning process by applying distinct training strategies to different components. CITI applies Mixture-Of-LoRA (MOLoRA) for important components. Meanwhile, it fine-tunes the parameters of a few components deemed less important in the backbone of the LLM, while keeping other parameters frozen. CITI can effectively enhance the model's tool-utilizing capability without excessively compromising its general performance. Experimental results demonstrate that our approach achieves outstanding performance across a range of evaluation metrics.

Abstract: Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we propose a method that bridges the natural language understanding capabilities of LLMs with the symbolic reasoning strengths of action languages. Our approach, termed LLM+AL, leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's proficiency in automated reasoning based on encoded knowledge. We compare LLM+AL against stateof-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that, although all methods exhibit errors, LLM+AL, with relatively minimal human corrections, consistently leads to correct answers, whereas standalone LLMs fail to improve even with human feedback. LLM+AL also contributes to automated generation of action languages.

Institute of Advanced Technology & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Advanced Technology & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Advanced Technology & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Institute of Advanced Technology & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center School of Computer Science and Artificial Intelligence, Hefei Normal University

Abstract: Large language models (LLMs) have demonstrated exceptional error detection capabilities and can correct sentences with high fluency in grammatical error correction (GEC) tasks. However, when correcting Chinese academic papers, LLMs face significant challenges of overcorrection. To delve deeper into this issue, we explore the underlying reasons. On one hand, each discipline has its unique vocabulary and expressions, and LLMs have insufficient and incomplete understanding of domain-specific sentences. On the other hand, the controllability of generative LLMs in GEC tasks is inherently poor, and the traditional sequence-to-sequence (Seq2Seq) correction structure exacerbates this issue. Considering the two aforementioned factors, we propose a new error correction framework for Chinese academic GEC tasks using LLMs, named ScholarGEC. To improve LLMs’ understanding of domain-specific knowledge, we construct appropriate disciplinary knowledge prefixes for sentences and use this domain-specific knowledge data to fine-tune the LLM. To enhance the controllability of LLMs, we replace the traditional Seq2Seq structure with a Detection-Correction separated structure. We also introduce a special token during the process to improve the model’s error detection stability. Additionally, we incorporate iterative self-reflection to enhance the stability of the generation, in the three parts of LLM generation. Extensive experiments demonstrate the effectiveness and robustness of our framework on a Chinese GEC dataset composed of academic papers, and further analysis reveals the capabilities of our framework in enhancing LLM performance in general GEC tasks.

Abstract: Large Language Models (LLMs) need to adapt to the continuous changes in data, tasks, and user preferences. Due to their massive size and the high costs associated with training, LLMs are not suitable for frequent retraining. However, updates are necessary to keep them in sync with rapidly evolving human knowledge. To address these challenges, this paper proposes the Compression Memory Training (CMT) method, an efficient and effective online adaptation framework for LLMs that features robust knowledge retention capabilities. Inspired by human memory mechanisms, CMT compresses and extracts information from new documents to be stored in a memory bank. When answering to queries related to these new documents, the model aggregates these document memories from the memory bank to better answer user questions. The parameters of the LLM itself do not change during training and inference, reducing the risk of catastrophic forgetting. To enhance the encoding, retrieval, and aggregation of memory, we further propose three new general and flexible techniques, including memoryaware objective, self-matching and top-k aggregation. Extensive experiments conducted on three continual learning datasets (i.e., StreamingQA, SQuAD and ArchivalQA) demonstrate that the proposed method improves model adaptability and robustness across multiple base LLMs (e.g., +4.07 EM & +4.19 F1 in StreamingQA with Llama-2-7b).

Abstract: Multilingual large languagevision models (LVLMs), which understand and generate both text and images across multiple languages, have achieved remarkable performance on English-centric multimodal generation tasks. However, their performance on non-English tasks has been underwhelming. One major challenge with multilingual LVLMs is the modality gap between visual inputs and multilingual textual inputs/outputs due to the lack of high-quality multilingual training data. In this paper, we propose LRM-LLaVA, a multilingual large language-vision model designed for low-resource languages to overcome the modality gap. It is composed of four components: a visual encoder, a multilingual large language model, a vision-text representation projector, and a cross-modal regularizer. Both the projector and regularizer aim at reducing the modality gap and improving multilingual performance. To train LRM-LLaVA, we employ a two-stage training strategy including pre-training and instruction fine-tuning. Meanwhile, we construct a multilingual visual question answering dataset based on English open-source datasets and adopt multiple task instructions. To evaluate the performance of LVLMs across various languages, we construct four multilingual benchmarks for 10 languages, based on English open-source benchmarks. Experimental results show that LRM-LLaVA achieves competitive performance compared to other multilingual LVLMs of similar parameters.

Abstract: The general capabilities of large language models (LLMs) make them the infrastructure for various AI applications, but updating their inner knowledge requires significant resources. Recent model editing is a promising technique for efficiently updating a small amount of knowledge of LLMs and has attracted much attention. In particular, local editing methods, which directly update model parameters, are proven suitable for updating small amounts of knowledge. Local editing methods update weights by computing least squares closedform solutions and identify edited knowledge by vector-level matching in inference, which achieve promising results. However, these methods still require a lot of time and resources to complete the computation. Moreover, vector-level matching lacks reliability, and such updates disrupt the original organization of the model's parameters. To address these issues, we propose a detachable and expandable Subject Word Embedding Altering (SWEA) framework, which finds the editing embeddings through token-level matching and adds them to the subject word embeddings in Transformer input. To get these editing embeddings, we propose optimizing then suppressing fusion method, which first optimizes learnable embedding vectors for the editing target and then suppresses the Knowledge Embedding Dimensions (KEDs) to obtain final editing embeddings. We thus propose SWEAOS method for editing factual knowledge in LLMs. We demonstrate the overall state-of-the-art (SOTA) performance of SWEAOS on the CounterFact and zsRE datasets. To further validate the reasoning ability of SWEAOS in editing knowledge, we evaluate it on the more complex RippleEdits benchmark. The results demonstrate that SWEAOS possesses SOTA reasoning ability.

Abstract: The capability of InContext Learning (ICL) is crucial for large language models to generalize across a wide range of tasks. By utilizing prompts, these models can accurately predict outcomes for previously unseen tasks without necessitating retraining. However, this generalization ability does not extend to the length of the inputs; the effectiveness of ICL likely diminishes with excessively long inputs, resulting in errors in the generated text. To investigate this issue, we propose a study using a dataset of In-Context functions to understand the operational mechanisms of Transformer models in ICL and length generalization. We generated data using regression and Boolean functions and employed meta-learning techniques to endow the model with ICL capabilities. Our experimental results indicate that position encodings can significantly mitigate length generalization issues, with the most effective encoding extending the maximum input length to over eight times that of the original training length. However, further analysis revealed that while position encoding enhances length generalization, it compromises the model's inherent capabilities, such as its ability to generalize across different data types. Overall, our research illustrates that position encodings have a pronounced positive effect on length generalization, though it necessitates a careful trade-off with data generalization performance.

Abstract: Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, hence suffering from poor readability. To improve readability, we propose a Contextualized Spokento-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, our methods achieve effective understanding of contexts and auxiliary information by LLMs. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.

Abstract: Large language models (LLMs) have shown excellent performance in natural language processing but struggle with mathematical reasoning. As the training mode gradually solidifies, researchers propose a datacentric concept of artificial intelligence, emphasizing the development of higher-quality data to empower LLMs. Existing studies construct synthetic data for mathematical reasoning by expanding public datasets, thereby performing supervised fine-tuning of LLMs. However, these methods mostly focus on quantity while neglecting quality. The challenging samples fail to receive adequate consideration during data synthesis process, resulting in high construction costs, low-quality density, and serious data homogenization. This paper proposes a multi-agent environment called Virtual ClassRoom (VCR), which leverages various agents driven by LLM to construct high-quality diversified synthetic data. Inspired by the "Cone of Experience" educational theory, VCR introduces three experience levels (direct, iconic, and symbolic) into data synthesis process by analogy with human learning. A user-friendly instruction set and role-playing system are carefully designed, enabling VCR to autonomously plan the scale of synthetic data. This system covers various educational scenarios, including lecture, discussion, problem design and problem-solving. The Adaboost idea embodied in the global iterative process further promotes steady performance improvement. Extensive experiments show that the synthetic data generated by VCR possess higher quality density and generalization capability, which can give LLMs superior mathematical reasoning performance with the same scale.

Abstract: As one of the key technologies leading to Artificial General Intelligence (AGI), Large Language Models (LLMs) have achieved remarkable accomplishments. Exploring the capabilities of LLMs is crucial for scientific research, and many studies propose new challenges from various aspects to explore the boundaries of capabilities in LLMs. This paper attempts to push the challenges of information understanding, synthesizing and reasoning to the extreme, in order to explore the boundaries of more advanced dimensional cognitive capabilities in LLMs. It is defined as the task of HighLevel Cognition (HLC), which involves obtaining high-level conclusions from low-level and fragmented foundational information. To evaluate HLC, we construct a dataset based on soccer matches. Experiments and analysis on this dataset show that current state-of-the-art LLMs lack the ability to effectively solve the task of HLC, because their performance is equivalent to random-level. However, by fine-tuning Llama3-8B-Instruct, there are improvements of 14.4%, 48.1%, and 19.4% over random-level in three types of evaluation tasks. This indicates that LLMs have great potential to solve the task of HLC.

Abstract: Large Language Models (LLMs) may suffer from hallucinations in realworld applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: *excessively long reasoning paths distracting from the answer generation*, and *false-positive relations hindering the path refinement*. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to perform answer trying after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG.

Abstract: Individuals living with disabilities often face challenges in their daily lives, from managing physical tasks to coping with emotional needs. It is imperative to provide them with personalized, courteous, and empathetic support that can address their unique needs. To bridge this gap, we propose an Empathetic Disability Support System (EDiSS), designed to offer personalized support tailored with correct politeness and empathetic strategies as per individual users’ OCEAN traits, gender, and age. To train EDiSS, first, a specialized personalized disability support dialogue dataset (PDCARE) is created encompassing a wide spectrum of disabilities, such as Spinal Cord Injuries, Neurological Disorders, Orthopedic Disabilities, etc, and support areas like Physical Therapy Exercises, Pain Management, Emotional Support, etc. EDiSS employs a reinforcement learningbased dialogue model with a novel reward function. It adapts its tone and content based on the user’s persona, gender, and age to provide respectful and empathetic assistance across various aspects of daily living. Our experiments and evaluation demonstrate the effectiveness of EDiSS in improving the quality of life of individuals with disabilities, marking a significant advancement in leveraging technology to provide much-needed support and assistance in their daily challenges.

Abstract: In Explainable AI (XAI), counterfactual explanations (CEs) are a wellstudied method to communicate feature relevance through contrastive reasoning of ``what if'' to explain AI models' predictions. However, they only focus on important (i.e., relevant) features and largely disregard less important (i.e., irrelevant) ones. Such irrelevant features can be crucial in many applications, especially when users need to ensure that an AI model's decisions are not affected or biased against specific attributes such as gender, race, religion, or political affiliation. To address this gap, the concept of alterfactual explanations (AEs) has been proposed. AEs explore an alternative reality of ``no matter what'', where irrelevant features are substituted with alternative features (e.g., ``republicans'' -> ``democrats'') within the same attribute (e.g., ``politics'') while maintaining a similar prediction output. This serves to validate whether the specified attributes influence AI model predictions. Despite the promise of AEs, there is a lack of computational approaches to systematically generate them, particularly in the text domain, where creating AEs for AI text classifiers presents unique challenges. This paper addresses this challenge by formulating AE generation as an optimization problem and introducing NoMatterXAI, a novel algorithm that generates AEs for text classification tasks. Our approach achieves high fidelity of up to 95% while preserving context similarity of over 90% across multiple models and datasets. A human study further validates the effectiveness of AEs in explaining AI text classifiers to end users.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in generating diverse and contextually rich text. However, concerns regarding copyright infringement arise as LLMs may inadvertently produce copyrighted material. In this paper, we first investigate the effectiveness of watermarking LLMs as a deterrent against the generation of copyrighted texts. Through theoretical analysis and empirical evaluation, we demonstrate that incorporating watermarks into LLMs significantly reduces the likelihood of generating copyrighted content, thereby addressing a critical concern in the deployment of LLMs. However, we also find that watermarking can have unintended consequences on Membership Inference Attacks (MIAs), which aim to discern whether a sample was part of the pretraining dataset and may be used to detect copyright violations. Surprisingly, we find that watermarking adversely affects the success rate of MIAs, complicating the task of detecting copyrighted text in the pretraining dataset. These results reveal the complex interplay between different regulatory measures, which may impact each other in unforeseen ways. Finally, we propose an adaptive technique to improve the success rate of a recent MIA under watermarking. Our findings underscore the importance of developing adaptive methods to study critical problems in LLMs with potential legal implications.

Abstract: Language steganography in social networks primarily focuses on embedding secret information into social media text efficiently to achieve covert communication. The misuse of such techniques could pose significant potential threats to public cyberspace, such as the spread of malicious code, commands, or viruses. Existing social text steganalysis techniques mainly focus on the analysis of individual social media texts. However, the information content in a single text is very limited, leading to poor detection performance in practical applications. To address this challenge, this paper proposes a social text steganalysis method that combines largescale language models with common-sense knowledge graphs (STLC-KG). This method first uses knowledge graphs to expand the knowledge contained in the text under investigation, enriching its linguistic expression, and then utilizes large-scale language models to extract the linguistic features of the social text. The results of tests conducted on three mainstream social media platforms demonstrate that the proposed method significantly improves the performance of social text steganalysis.

Abstract: Despite extensive training on diverse datasets and alignment with human values, large language models (LLMs) can still generate fallacious outputs. Additionally, the validity of LLM's outputs varies significantly depending on the content. It is crucial to ensure LLMs' logical consistency across different contexts. Drawing inspiration from cognitive psychology studies, we propose a Logic Control Framework (LCF) that disentangles LLMs' hidden representations into separate content and logic spaces. Within the logic space, we use logically valid and invalid samples to construct distinct regions through contrastive learning. By moving logic representations to logically valid regions and fusing them with unchanged content representations, we significantly reduce logical fallacies in LLM outputs while maintaining content coherence. We demonstrate the effectiveness of LCF through experiments on conclusion generation and fallacy identification tasks, showing a significant improvement in logical validity and a reduction in fallacious outputs.

Abstract: The rapidly developing Large Vision Language Models (LVLMs) still face the hallucination phenomena where the generated responses do not align with the given contexts, significantly restricting the usages of LVLMs. Most previous work detects and mitigates hallucination at the coarsegrained level or requires expensive annotation (e.g., labeling by human experts or proprietary models). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset by proprietary models, whereby we train a detection model which can perform sentence-level hallucination detection. Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for hallucination mitigation training. Furthermore, we propose differentiating the severity of hallucinations, and introducing a Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) which prioritizes the mitigation of critical hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments on hallucination detection and mitigation benchmarks demonstrate that our method sets a new state-of-the-art in hallucination detection on MHaluBench, surpassing GPT-4V and Gemini, and reduces the hallucination rate by 36.1% on AMBER and 76.3% on Object HalBench compared to the base model.

University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence City University of Hong Kong, University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence, Jarvis Research Center, Tencent YouTu Lab, Jarvis Research Center, Tencent YouTu Lab, Peking University, University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence, Jarvis Research Center, Tencent YouTu Lab, City University of Hong Kong, University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities, yet struggle with hallucination and outdated knowledge when tasked with complex knowledge reasoning, resulting in factually incorrect outputs. Previous studies have attempted to mitigate it by retrieving factual knowledge from largescale knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of answers. However, this kind of approach often introduces noise and irrelevant data, especially in situations with extensive context from multiple knowledge aspects. In this way, LLM attention can be potentially mislead from question and relevant information. In our study, we introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. The Amar framework comprises two key sub-components: 1) a self-alignment module that aligns commonalities among entities, relations, and subgraphs to enhance retrieved text, thereby reducing noise interference; 2) a relevance gating module that employs a soft gate to learn the relevance score between question and multi-aspect retrieved data, to determine which information should be used to enhance LLMs' output, or even filtered altogether. Our method has achieved state-of-the-art performance on two common datasets, WebQSP and CWQ, showing a 1.9% improvement in accuracy over its best competitor and a 6.6% improvement in logical form generation over a method that directly uses retrieved text as context prompts. These results demonstrate the effectiveness of Amar in improving the reasoning of LLMs.

Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALLE, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.

Abstract: Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise their alignment mechanisms. This vulnerability poses significant risks to realworld applications. Existing work faces challenges in both training efficiency and generalization capabilities (i.e., Reinforcement Learning from Human Feedback and Red-Teaming). Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs. Importantly, our approach eliminates the necessity for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against LLM (an average reduction of 34.17% ASR) while maintaining the usefulness of LLM in handling benign queries.

Abstract: Instruction FineTuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost LLM reasoning abilities during pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: How does coding data impact LLMs' reasoning capacities during IFT stage? To explore this, we thoroughly examine the impact of coding data across different coding data proportions, model families, sizes, and reasoning domains, from various perspectives. Specifically, we create three IFT datasets with increasing coding data proportions, fine-tune six LLM backbones across different families and scales on these datasets, evaluate the tuned models' performance across twelve tasks in three reasoning domains, and analyze the outcomes from three broad-to-granular perspectives: overall, domain-level, and task-specific. Our holistic analysis provides valuable insights into each perspective. First, coding data tuning enhances the overall reasoning capabilities of LLMs across different model families and scales. Moreover, while the impact of coding data varies by domain, it shows consistent trends within each domain across different model families and scales. Additionally, coding data generally provides comparable task-specific benefits across model families, with optimal proportions in IFT datasets being task-dependent.

Abstract: Current neural networks often employ multidomain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remains unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we proposed a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A through prevalent text-understanding datasets and demonstrate its superior performance, mainly when data are implicitly non-IID, and PLMs scale larger.

Abstract: With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to finetune large language models (LLMs). Previous research mainly focuses on selecting individual high-quality instructions. However, these works overlooked the joint interactions and dependencies between different categories of instructions, leading to suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, let alone optimize the instruction set with regard to them. To fill these gaps, in this paper, we: (1) systemically investigate interaction and dependency patterns between different categories of instructions, (2) manage to optimize the instruction set concerning the interaction patterns using a linear programming-based method, and optimize the learning schema of SFT using an instruction dependency taxonomy guided curriculum learning. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.

Abstract: Large Language Models (LLMs) are often Englishcentric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of high-resource languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts(MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR’s effectiveness in improving expanded languages and preserving original language proficiency with superior scalability.

Abstract: Recently, there have been significant advancements in music generation. However, existing models primarily focus on creating modern pop songs, making it challenging to produce ancient music with distinct rhythms and styles, such as ancient Chinese SongCi. In this paper, we introduce SongSong, the first music generation model capable of restoring Chinese SongCi to our knowledge. Our model first predicts the melody from the input SongCi, then separately generates the singing voice and accompaniment based on that melody, and finally combines all elements to create the final piece of music. Additionally, to address the lack of ancient music datasets, we create OpenSongSong, a comprehensive dataset of ancient Chinese SongCi music, featuring 29.9 hours of compositions by various renowned SongCi music masters. To assess SongSong's proficiency in performing SongCi, we randomly select 85 SongCi sentences that were not part of the training set for evaluation against SongSong and music generation platforms such as Suno and SkyMusic. The subjective and objective outcomes indicate that our proposed model achieves leading performance in generating highquality SongCi music.

Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China

Abstract: With the development of large language models (LLMs), numerous online applications based on these models have emerged. As system prompts significantly influence the performance of LLMs, many such applications conceal their system prompts and regard them as intellectual property. Consequently, numerous efforts have been made to steal these system prompts. However, for applications that do not publicly disclose their system prompts, previously stolen prompts have low confidence. This is because previous methods rely on confirmation from application developers, which is unrealistic since developers may be unwilling to acknowledge that their system prompts have been leaked. We observed a phenomenon: when an LLM performs repetitive tasks, it accurately repeats based on the context rather than relying on its internal model parameters. We validated this phenomenon by comparing the results of two different inputs—repetitive tasks and knowledgebased tasks—under conditions of normal execution, contaminated execution, and partially restored execution. By contaminating the input nouns and then partially restoring them using data from the normal execution's intermediate layers, we measured the accuracies of both task types across these three execution processes. Based on this phenomenon, we propose a high-confidence leakage method called RepeatLeakage. By specifying the range that the model needs to repeat and encouraging the model not to change the format, we manage to extract its system prompt and conversation contexts. We validated the repetition phenomenon on multiple open-source models and successfully designed prompts using RepeatLeakage to leak contents from the actual system prompts of GPT-Store and publicly available ChatGPT conversation contexts. Finally, we tested RepeatLeakage in real environments such as ChatGPT web, successfully leaking their system prompts and conversation contexts.

Abstract: Human values and their measurement are longstanding interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. The core idea is to dynamically parse unstructured texts into perceptions akin to static stimuli in traditional psychometrics, measure the value orientations they reveal, and aggregate the results. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.

Abstract: Parallelization of nonadmissible search algorithms such as GBFS poses a challenge because straightforward parallelization can result in search behavior which significantly deviates from sequential search. Previous work proposed PUHF, a parallel search algorithm which is constrained to only expand states that can be expanded by some tie-breaking strategy for GBFS. We show that despite this constraint, the number of states expanded by PUHF is not bounded by a constant multiple of the number of states expanded by sequential GBFS with the worst-case tie-breaking strategy. We propose and experimentally evaluate One Bench At a Time (OBAT), a parallel greedy search which guarantees that the number of states expanded is within a constant factor of the number of states expanded by sequential GBFS with some tie-breaking policy.

Abstract: A graph G is cclosed if every two vertices with at least c common neighbors are adjacent to each other. This definition is an abstraction of the triadic closure property exhibited by many real-world social networks, namely, friends of friends tend to be friends themselves. Social networks, however, are often temporal rather than static---the connections change over a period of time. And hence temporal graphs, rather than static graphs, are often better suited to model social networks. Motivated by this, we introduce a definition of temporal c-closed graphs, in which if two vertices u and v have at least c common neighbors during a short interval of time, then u and v are adjacent to each other around that time. Our pilot experiments show that several real-world temporal networks are c-closed for rather small values of c. We also study the computational problems of enumerating maximal cliques and other dense subgraphs in temporal c-closed graphs. A clique in a temporal graph is a subgraph that lasts for a certain period of time, during which every possible edge in the subgraph becomes active often enough; other dense subgraphs are defined similarly. We bound the number of such maximal dense subgraphs in a temporal c-closed graph that evolves slowly, and thus show that the corresponding enumeration problems admit efficient algorithms; by slow evolution, we mean that between consecutive time-steps, the local change in adjacencies remains small. Our work also adds to a growing body of literature on defining suitable structural parameters for temporal graphs that can be leveraged to design efficient algorithms.

Abstract: Recent research demonstrates the effectiveness of using pretrained language models for legal case retrieval. Most of the existing works focus on improving the representation ability for the contextualized embedding of the [CLS] token and calculate relevance using textual semantic similarity. However, in the legal domain, textual semantic similarity does not always imply that the cases are relevant enough. Instead, relevance in legal cases primarily depends on the similarity of key facts that impact the final judgment. Without proper treatments, the discriminative ability of learned representations could be limited since legal cases are lengthy and contain numerous non-key facts. To this end, we introduce DELTA, a discriminative model designed for legal case retrieval. The basic idea involves pinpointing key facts in legal cases and pulling the contextualized embedding of the [CLS] token closer to the key facts while pushing away from the non-key facts, which can warm up the case embedding space in an unsupervised manner. To be specific, this study brings the word alignment mechanism to the contextual masked auto-encoder. First, we leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability. Second, we employ the deep decoder to enable ``translation'' between different structures, with the goal of pinpointing key facts to enhance discriminative ability. Comprehensive experiments conducted on publicly available legal benchmarks show that our approach can outperform existing state-of-the-art methods in legal case retrieval. It provides a new perspective on the in-depth understanding and processing of legal case documents.

Abstract: We consider the challenge of blackbox optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as problem complexity grows. In particular, we illustrate DisCo-DSO's superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.

Abstract: Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to new line work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and nonbackdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which, when it comes to backdoor attacks, require at least 4% of the data to be poisoned to elicit harmful behavior, we exploit the vulnerabilities of DPO by simpler methods so we can poison the model with only as much as 0.5% of the data. We further the investigate efficacy of the existing defence methods and find that these poisoning attacks can evade the existing data anomaly detection methods.

Abstract: Recent years have witnessed an increase in the parameter size of frontier AI models by multiple orders of magnitude. This trend is driven by empirical observations, known as scaling laws, which show that model performance scales with model size, dataset size, and computational power. Motivated by this, researchers are training everlarger models in pursuit of unlocking new capabilities. However, the growing complexity of these models makes understanding their inner workings increasingly challenging. Interpretability is crucial not only in fields like medicine and biotechnology, where understanding the internals of these models could lead to new insights but also in super alignment, where it is the goal to ensure that AI is aligned and acts according to human values and interests. We present a generic, scalable first-of-its-kind method for automatically interpreting neural networks. In a proof-of-concept study we establish the viability of converting neural network activations - here for the first layer of a Convolutional Neural Network - into human-readable language. Additionally, we propose modifications to scale this method for understanding neural networks of any size. In anticipation of more capable large language models, this method could enable the monitoring of their internal mechanisms and decisions.

Abstract: Deep reinforcement learning (DRL) has gained significant attention in autonomous systems, yet its blackbox nature and lack of explainability hinder user trust in safety-critical domains such as autonomous driving. Existing experience replay approaches enhance sample efficiency but often fail to capture the internal causality of training data, leading to a convoluted training process that is difficult for humans to explain. In this work, we introduce Experience Replay with Causal Inference (ERCI), an explainable approach that integrates time series representation and causal inference to offer human-aligned explanations for DRL. Specifically, ERCI 1) introduces a novel multivariate time series representation to extract explainable Time Series Causal Factors (TSCF) from experimental data and 2) leverages internal causality in TSCFs with causal inference as a crucial standard for experience replay in DRL training. We evaluate ERCI using multiple baseline algorithms across diverse environments. Results show that ERCI provides human-aligned explanations and further improves sample efficiency through enhanced explainability. Notably, ERCI outperforms other state-of-the-art approaches by 15% in average performance, highlighting its effectiveness and generalizability.

Abstract: As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, VisionLanguage Models (VLMs), have shown exceptional performance in many real-world tasks. However, VLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on VLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that VLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different dataset and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for VLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA VLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contribute to the field by offering a unified and systematic evaluation framework and the first public-available benchmark for VLM jailbreak research. We also demonstrate several insightful findings that highlights directions for future studies.

Abstract: The emergence of the large language model (LLM) has shown its superiority in a wide range of disciplines, including language understanding and translation, relational logic reasoning, and even partial differential equations solving. The transformer is the pervasive backbone architecture for the foundation model construction. It is vital to research how to adjust the Transformer architecture to achieve an endto-end privacy guarantee in LLM fine-tuning. This paper investigates three potential information leaks during a federated fine-tuning procedure for LLM (FedLLM). Based on the potential information leakage, we insert two-stage randomness into FedLLM to provide an end-to-end privacy guarantee solution. The first stage is to train a gradient auto-encoder with a Gaussian random prior based on the statistical information of the gradients generated by local clients. The second stage is fine-tuning the overall LLM with a differential privacy guarantee by adopting appropriate Gaussian noises. We show our proposed method's efficiency and accuracy gains with several foundation models and two popular evaluation benchmarks. Furthermore, we present a comprehensive privacy analysis with Gaussian Differential Privacy (GDP) and Renyi Differential Privacy (RDP).

Abstract: Extracting finegrained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions. We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences. Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03, when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.

Abstract: Calls for transparency in AI systems are growing in number and urgency from diverse stakeholders ranging from regulators to researchers to users (with a comparative absence of companies developing AI). Notions of transparency for AI abound, each addressing distinct interests and concerns. In computer security, transparency is likewise regarded as a key concept. The security community has for decades pushed back against socalled security by obscurity - the idea that hiding how a system works protects it from attack - against significant pressure from industry and other stakeholders, e.g., (Bellovin and Bush 2002). And over those decades, in a community process that is imperfect and ongoing, security researchers and practitioners have gradually built up some norms and practices around how to balance transparency interests with possible negative side effects. This paper asks: What insights can the AI community take from the security community's experience with transparency? We identify three key themes in the security community's perspective on the \emph{benefits of transparency} and their approach to balancing transparency against countervailing interests. For each, we investigate parallels and insights relevant to transparency in AI. We then provide a case study discussion on how transparency has shaped the research subfield of anonymization. Finally, shifting our focus from similarities to differences, we highlight key transparency issues where modern AI systems present challenges different from other kinds of security-critical systems, raising interesting open questions for the security and AI communities alike.

Abstract: The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on highresource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Besides, the experiments show that there is a strong correlation between the LLM’s performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.

Abstract: Weed control is a critical challenge in modern agriculture, as weeds compete with crops for essential nutrient resources, significantly reducing crop yield and quality. Traditional weed control methods, including chemical and mechanical approaches, have reallife limitations such as associated environmental impact and efficiency. An emerging yet effective approach is laser weeding, which uses a laser beam as the stem cutter. Although there have been studies that use deep learning in weed recognition, its application in intelligent laser weeding still requires a comprehensive understanding. Thus, this study serves the first empirical study on weed recognition for laser weeding. To increase the efficiency of laser beam cut and avoid damaging the crops of interest, the laser beam shall be directly aimed at the weed root. Yet, weed stem detection remains an under-explored problem. We integrate the detection of crop and weed with the localization of weed stem into one end-to-end system. To train and validate the proposed system in a real-life scenario, we curate and construct a high-quality weed stem detection dataset with human annotations. The dataset consists of 7,161 high-resolution pictures collected in the field with annotations of 11,151 instances of weed. The dataset will be released upon acceptance. Experimental results show that, in contrast to seminal weed recognition systems, the proposed system can efficiently improve the weeding accuracy by 5.05% and reduce the energy cost by 32.3%.

Abstract: Existing video factchecking datasets often lack detailed evidence and explanations, compromising the reliability and interpretability of fact-checking methods. To address these gaps, we developed a novel dataset featuring comprehensive annotations for each news item, including veracity labels, the rationales behind these labels, and supporting evidence. This dataset significantly enhances models' ability to accurately identify and explain video content. We also present an explainable automatic framework 3MFact, utilizing Multi-role Multimodal Models for video Fact-checking. Our framework iteratively gathers and synthesizes online evidence to progressively determine the veracity label, generating three key outputs: veracity label, rationale, and supported evidence. We aim for this work to be a pioneering effort, providing robust support for the field of video fact-checking.

Massachusetts Institute of Technology, University of Michigan - Ann Arbor, University of Michigan - Ann Arbor, University of Michigan - Ann Arbor, National Yang Ming Chiao Tung University, Stevens Institute of Technology, National Yang Ming Chiao Tung University, Harvard University, University of the Philippines, University of the Philippines, Universidade Federal de São Paulo, Universidade Federal de São Paulo, University of the Philippines, Massachusetts Institute of Technology, Harvard University, Harvard University, Massachusetts Institute of Technology

Abstract: Current ophthalmology clinical workflows are plagued by overreferrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, We propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference time de-biasing method leveraging retrieval augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.

Abstract: Drought has become a critical global threat with significant societal impact. Existing drought monitoring solutions primarily focus on assessing drought severity using quantitative measurements, overlooking the diverse societal impact of drought from humancentric perspectives. Motivated by the collective intelligence on social media and the computational power of AI, this paper studies a novel problem of socially informed AI-driven drought estimation that aims to leverage social and news media information to jointly estimate drought severity and its societal impact. Two technical challenges exist: 1) How to model the implicit temporal dynamics of drought societal impact. 2) How to capture the social-physical interdependence between the physical drought condition and its societal impact. To address these challenges, we develop SIDE, a socially informed AI-driven drought estimation framework that explicitly quantifies the societal impact of drought and effectively models the social-physical interdependency for joint severity-impact estimation. Experiments on real-world datasets from California and Texas demonstrate SIDE's superior performance compared to state-of-the-art baselines in accurately estimating drought severity and its societal impact. SIDE offers valuable insights for developing human-centric drought mitigation strategies to foster sustainable and resilient communities.

Abstract: Learning highlevel representations for graphs is crucial for tasks like node classification, where graph pooling aggregates node features to provide a holistic view that enhances predictive performance. Despite numerous methods that have been proposed in this promising and rapidly developing research field, most efforts to generalize the pooling operation to graphs are primarily performance-driven, with fairness issues largely overlooked: i) the process of graph pooling could exacerbate disparities in distribution among various subgroups; ii) the resultant graph structure augmentation may inadvertently strengthen intra-group connectivity, leading to unintended inter-group isolation. To this end, this paper extends the initial effort on fair graph pooling to the development of fair graph neural networks, while also providing a unified framework to collectively address group and individual graph fairness. Our experimental evaluations on multiple datasets demonstrate that the proposed method not only outperforms state-of-the-art baselines in terms of fairness but also achieves comparable predictive performance.

Abstract: Recent years have witnessed tremendous successes of learning for sequential decisionmaking, and in particular, Reinforcement Learning (RL). Prominent application examples include playing Go and video games, robotics, autonomous driving, and recently large language models. Most such success stories naturally involve "multi-agents". Hence, there has been surging research interest in advancing Multi-Agent Learning in Dynamic Environments, particularly, multi-agent RL (MARL), to which my research has led and made significant contributions. My work has established both sample and computational complexities of learning in Stochastic Games, the most fundamental model of MARL, and advocated a unique Economics perspective of independent learning in Stochastic Games. My work has also initiated the recent studies of distributed and networked MARL, with applications in robust adversarial RL, offline RL, and Robotics. This paper will survey my notable contributions along this journey of developing the foundations of multi-agent learning in dynamic environments.

Abstract: Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They allow to model state uncertainty as a belief probability distribution. Approximate solvers based on Monte Carlo sampling show great success to relax the computational demand and perform online planning. However, scaling to complex realistic domains with many actions and long planning horizons is still a major challenge, and a key point to achieve good performance is guiding the actionselection process with domain-dependent policy heuristics which are tailored for the specific application domain. We propose to learn high-quality heuristics from POMDP traces of executions generated by any solver. We convert the belief-action pairs to a logical semantics, and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We evaluate thoroughly our methodology on two notoriously challenging POMDP problems, involving large action spaces and long planning horizons, namely, rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics within lower computational time. Moreover, they well generalize to more challenging scenarios not experienced in the training phase (e.g., increasing rocks and grid size in rocksample, incrementing the size of the map and the aggressivity of ghosts in pocman).

Abstract: Estimationof-distribution algorithms (EDAs) are optimization algorithms that learn a distribution from which good solutions can be sampled easily. A key parameter of most EDAs is the sample size (population size). Too small values lead to the undesired effect of genetic drift, while larger values slow down the process. Building on a quantitative analysis of how the population size leads to genetic drift, we design a smart-restart mechanism for EDAs. By stopping runs when the risk for genetic drift is high, it automatically runs the EDA in good parameter regimes. Via a mathematical runtime analysis, we prove a general performance guarantee for this smart-restart scheme. For many situations where the optimal parameter values are known, this shows that the restart scheme automatically finds these optimal values, leading to the asymptotically optimal performance. We also conduct an extensive experimental analysis. On four classic benchmarks, the smart-restart scheme leads to a performance close to the one obtainable with optimal parameter values. We also conduct experiments with PBIL (cross-entropy algorithm) on the max-cut problem and the bipartition problem. Again, the smart-restart mechanism finds much better values for the population size than those suggested in the literature, leading to a much better performance.

School of Engineering and Applied Sciences, Harvard University, USA Hasso Plattner Institute, University of Potsdam, Germany, School of Engineering and Applied Sciences, Harvard University, USA GE Healthcare, USA, School of Engineering and Applied Sciences, Harvard University, USA, School of Engineering and Applied Sciences, Harvard University, USA, Department of Obstetrics and Gynecology, Massachusetts General Hospital, Harvard Medical School, USA, Mbarara University of Science and Technology, Uganda, Department of Obstetrics and Gynecology, Massachusetts General Hospital, Harvard Medical School, USA, School of Engineering and Applied Sciences, Harvard University, USA

Abstract: Maternal mortality remains a significant global public health challenge. One promising approach to reducing maternal deaths occurring during facilitybased childbirth is through early warning systems, which require the consistent monitoring of mothers' vital signs after giving birth. Wireless vital sign monitoring devices offer a labor-efficient solution for continuous monitoring, but their scarcity raises the critical question of how to allocate them most effectively. We devise an allocation algorithm for this problem by modeling it as a variant of the popular Restless Multi-Armed Bandit (RMAB) paradigm. In doing so, we identify and address novel, previously unstudied constraints unique to this domain, which render previous approaches for RMABs unsuitable and significantly increase the complexity of the learning and planning problem. To overcome these challenges, we adopt the popular Proximal Policy Optimization (PPO) algorithm from reinforcement learning to learn an allocation policy by training a policy and value function network. We demonstrate in simulations that our approach outperforms the best heuristic baseline by up to a factor of 4.

Abstract: Children's mental health is crucial for their development, but it's often overlooked, leading to psychological issues. Many children struggle to express their thoughts and feelings effectively. To address this issue, we have proposed a novel approach to analyze children's drawings for psychological screening using artificial intelligence. Specifically, we're focusing on the `draw a person' (DAP) test, where a child's drawing is used to identify potential indicators of their mental and emotional state. Thus, we are introducing an AIpowered technique to automate the psychological screening process for children using the DAP test, which a human professional would traditionally conduct. The screening tool would suggest whether the child needs or doesn’t need further psychological referral. We have collected a dataset consisting of children's drawings and labeled them by experts as either `need' or `no need', indicating whether the child needs or does not need a referral. We have proposed two alternative approaches for the screening process. The first approach consists of extracting features from the drawings following expert guidelines and training a classification model using the features to classify the drawing as either `need' or `no need'. We also propose an out-of-the-box technique applying prompt engineering on state-of-the-art LLMs to automatically extract features from the images. The second approach involves training an image classification model using the drawings. Both approaches are challenged by the issue of class imbalance, as most of the drawings correspond to the `no-need' class. To address this challenge, we introduce Siamese++, a novel Siamese network for image classification, which uses feature embedding and an adaptive distance threshold for classification, instead of the nearest neighbor classification employed by traditional Siamese. Our proposed method achieves a high F1 score (up to 88%) even with a large class imbalance and without the need for any image augmentation. Thus, we have proposed an innovative interdisciplinary integration of AI with psychology and developed novel techniques to solve the real-world problem of psychological screening.

Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland Xtracta, New Zealand, School of Computer Science, University of Auckland, Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland, Sun Yat-sen University, Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland, Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland, School of Life and Environmental Sciences, The University of Sydney, School of Computer Science, University of Auckland, Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland, Strong AI Lab, NAOInstitute, Waipapa Taumata Rau - The University of Auckland

Abstract: Large language models (LLMs) have demonstrated strong capabilities in language understanding and generation, and their potential in educational contexts is increasingly being explored. One promising area is learnersourcing, where students engage in creating their own educational content, such as multiplechoice questions. A critical step in this process is generating effective explanations for the solutions to these questions, as such explanations aid in peer understanding and promote deeper conceptual learning. However, students often find it difficult to craft high-quality explanations due to limited understanding or gaps in their subject knowledge. To support this task, we introduce ``ILearner-LLM,'' a framework that uses iterative enhancement with LLMs to improve generated explanations. The framework combines an explanation generation model and an explanation evaluation model fine-tuned using student preferences for quality, where feedback from the evaluation model is fed back into the generation model to refine the output. Our experiments with LLaMA2-13B and GPT-4 using five large datasets from the PeerWise MCQ platform show that ILearner-LLM produces explanations of higher quality that closely align with those written by students. Our findings represent a promising approach for enriching the learnersourcing experience for students and for leveraging the capabilities of large language models for educational applications.

Abstract: In this paper we describe the development and evaluation of AITK, the Artificial Intelligence Toolkit. This opensource project contains both Python libraries and computational essays (Jupyter notebooks) that together are designed to allow a diverse audience with little or no background in AI to interact with a variety AI tools, exploring in more depth how they function, visualizing their outcomes, and gaining a better understanding of their ethical implications. These notebooks have been piloted at multiple institutions in a variety of humanities courses centered on the theme of responsible AI. In addition, we conducted usability testing of AITK. Our pilot studies and usability testing results indicate that AITK is easy to navigate and effective at helping diverse users gain a better understanding of AI and its ethical implications. Our goal, in this time of rapid innovations in AI, is for AITK to provide an accessible resource for faculty from any discipline looking to incorporate AI topics into their courses and for anyone eager to learn more about AI on their own.

Abstract: The AI Chef Trainer is an educational web app that introduces children to the role of data in machine learning (ML) through the engaging task of recipe recommendation. Initially, students tested the AI Chef's capabilities by selecting from a list of ingredients to see what the system recommended as possible recipes. After observing the recommendations, they contributed by adding their own recipes—each being a set of ingredients and a corresponding recipename—which were used to retrain the model and finally re-tested recipe suggestions. This cyclical process of testing, contributing, retraining, and post-training testing provided students with hands-on experience in how AI systems learn and adapt over time based on new data. We tested our software with middle school students. The results indicated that students recognized the importance of both data quantity and specificity in the training process. 45 of 52 students entered recipes, and 26 of the 52 tested their own recipes using the specific ingredients they entered. Students were introduced to the concept of confidence percentages via the AI recipe suggestions. Even as the primary focus was the role of data in machine learning, the AI Chef Trainer software also served as a window into students' cultural expression and personal preferences.

Abstract: Autonomous robots are becoming more versatile and widespread in our daily lives. From autonomous vehicles to companion robots for senior care, these humancentric systems must demonstrate a high degree of reliability in order to build trust and, ultimately, deliver social value. How safe is safe enough for robots to be wholeheartedly trusted by society? Is it sufficient if an autonomous vehicle can avoid hitting a fallen cyclist 99.9% of the time? What if this rate can only be achieved by the vehicle always stopping and waiting for the human to move out of the way? I argue that, for trustworthy deployment of robots in human-populated space, we need to complement standard statistical methods with clear-cut robust safety assurances under a vetted set of operation conditions. We need runtime learning to minimize the robot’s performance loss during safety-enforcing maneuvers by reducing its inherent uncertainty induced by its human peers, for example, their intent (does a human driver want to merge, cut behind, or stay in the lane?) or response (if the robot comes closer, how will the human react?). We need to close the loop between the robot’s learning and decision-making so that it can optimize efficiency by anticipating how its ongoing interaction with the human may affect the evolving uncertainty, and ultimately, its long-term performance.

Abstract: As modern network management grows increasingly complex, administrators are tasked with navigating vast volumes of log data, often resulting in inefficiencies, errors, and operational challenges. My doctoral research addresses these pressing issues by leveraging advanced AI techniques to minimize human intervention and pave the way for fully automated network operations. I propose a novel AIdriven framework that integrates Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) and a human-in-the-loop process to effectively automate key network management tasks, including log analysis, troubleshooting recommendations, and documentation generation. By enhancing the accuracy and efficiency of these tasks, this study aims to improve network reliability, reduce operational complexity, and contribute to the evolution of self-running networks.

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health Department of Biomedical Engineering, Johns Hopkins University, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health Department of Biomedical Engineering, Johns Hopkins University, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Center for Language and Speech Processing, Johns Hopkins University, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

Abstract: Recent advancements in singlecell sequencing technologies enable the measurement of multiple modalities in individual cells, offering insights into the transcriptome and regulome in various biological systems and human diseases in an unprecedented resolution. However, effectively using these ultra-high-dimensional and large-scale multiomic data to understand gene regulation remains challenging. Inspired by the success of adapting large language models into the genomics field, we develop scMBERT, a BERT framework-based pre-trained deep learning model using single-cell multiomic data. We showed that scMBERT increases model flexibility and performance in downstream tasks like cell type annotation and batch-effect correction, demonstrating the potential of leveraging multiomic data to improve single-cell genomic data analyses.

Abstract: Demonstration selection algorithms play a crucial role in optimizing Large Language Models' (LLMs) incontext learning performance. Despite numerous proposed algorithms, their comparative effectiveness remains understudied. We present a comprehensive evaluation of six state-of-the-art demonstration selection algorithms across five datasets, examining both their effectiveness and computational efficiency. Our findings reveal significant trade-offs: while some demonstration selection algorithms achieve superior accuracy, they incur substantial computational costs. We also discover that increasing demonstration examples doesn't consistently improve performance, and some sophisticated algorithms struggle to outperform random selection in certain scenarios. These insights provide valuable benchmarks for future algorithm development and practical implementation. Our code is available at https://github.com/Tizzzzy/Demonstration_Selection_Overview.

Abstract: Datadriven analysis has shown promising results in identifying subtle patterns in the behavior of individuals with Autism Spectrum Disorder (ASD) for diagnosis and intervention. However, most existing methods primarily focus on a single behavioral modality (e.g., eye movements) instead of capturing the intricate multimodal behavior of humans. We propose a multimodal approach that investigates the underlying connections between eye movements and hand motions through eye-to-hand prediction. To tackle the highly noisy and irregular behavioral data, we propose a novel approach that defines the prediction as a machine translation problem and leverages a sequence-to-sequence machine learning model for the prediction. An experimental study on a dataset collected from a VR system has demonstrated high prediction accuracy. The significant difference in the prediction accuracy between the autistic group and their typically developing (TD) peers serves as quantitative evidence to objectively understand the restricted and repetitive behaviors (RRBs) in autistic children. The source code can be accessed here: https://github.com/mathjams/AAAI_2024.

Abstract: This paper explores the application of computer vision technology as a proactive solution to prevent Road Traffic Accidents in Nigeria. By leveraging machine learning algorithms and realtime video analysis, computer vision can reduce incidents caused by human error. The research focuses on designing an autonomous but elaborate system that monitors traffic patterns, road irregularities and triggers automated interventions when risky conditions are detected. The aim is to suggest that computer vision can be pivotal in enhancing road safety and reducing traffic-related fatalities in Nigeria.

Abstract: We present SPASCA a conversational AI system that promotes psychological and cognitive well-being of persons living with dementia (PLWD). This system features an AI agent that provides social presence and support to PLWD through verbal communications, without physical presence of human caregivers. The system integrates (1) a novel dialogue model that generates dialogue items relevant to the user's experiences and lifestyle, (2) a digital avatar in the form of a talking head with the identity of a caregiver who is familiar to the demented user. We develop prototypes that adopt various interaction modalities and conversational styles and report the pros and cons of different system configurations through expert review. Our system shows the potential of conversational AI for personalized and affordable healthcare services.

Abstract: In machine learning applications over Big streaming Data, Neural Networks (NNs) are continuously and rapidly trained over voluminous data arriving at high speeds. As soon as a new version of the NN becomes available, it gets deployed for prediction purposes (e.g. classification). The realtime character of such applications greatly depends on the volume and velocity of the data streams, as well as the NN complexity. Training on large volume of ingested streams or using complex NNs, potentially increases accuracy, but may compromise the real-time character of those applications. In this work, we present SuBiTO, a framework that automatically and continuously learns the training time vs accuracy trade-offs as new data stream in and fine tunes: (i) the number, size and type of NN layers; (ii) the size of the ingested data via stream synopses specific parameters; and (iii) the number of training epochs. Finally, SuBiTO suggests optimal sets of such parameters and detects concept drifts, enabling the human operator adapt these parameters on-the-fly, at runtime.

Abstract: Every pharmaceutical product must be accompanied by a comprehensive label that delineates its indications, usage, dosages, and side effects, essential for safe medication practices. Traditionally, creating drug labels is laborintensive and dependent on manual quality checks. Recent advancements in Large Language Models (LLMs) offer a promising avenue to streamline this process. In this paper we introduce ClinicalRAG, an automated labeling quality control pipeline that integrates LLM with hierarchical Retrieval Augmented Generation that allows to cross-check every statement in the drug label document. ClinicalRAG enhances the reliability of automated drug labeling by systematically reducing hallucination risks, achieving an accuracy of 96.1% in internal validation. With user-friendly interface, our pipeline aims to support pharmaceutical company in drug approval and expedite patients' access to new treatments.

Abstract: An event sequence generated by a temporal point process is often associated with a hidden and structured event branching process that captures the triggering relations between its historical and current events. In this study, we design a new plugand-play module based on the Bregman ADMM (BADMM) algorithm, which infers event branches associated with event sequences in the maximum likelihood estimation framework of temporal point processes (TPPs). Specifically, we formulate the inference of event branches as an optimization problem of event transition matrix under sparse and low-rank constraints, which is embedded in existing TPP models or their learning paradigms. We can implement this optimization problem based on subspace clustering and sparse group-lasso, respectively, and solve it using the Bregman ADMM algorithm, whose unrolling leads to the proposed BADMM module. When learning a classic TPP (e.g., Hawkes process) by the expectation-maximization algorithm, the BADMM module helps derive structured responsibility matrices in the E-step. Similarly, the BADMM module helps derive low-rank and sparse attention maps for the neural TPPs with self-attention layers. The structured responsibility matrices and attention maps, which work as learned event transition matrices, indicate event branches, e.g., inferring isolated events and those key events triggering many subsequent events. Experiments on both synthetic and real-world data show that plugging our BADMM module into existing TPP models and learning paradigms can improve model performance and provide us with interpretable structured event branches.

Abstract: Graph Neural Networks (GNNs) have become the preferred tool to process graph data, with their efficacy being boosted through graph data augmentation techniques. Despite the evolution of augmentation methods, issues like graph property distortions and restricted structural changes persist. This leads to the question: Is it possible to develop more propertyconserving and structure-sensitive augmentation methods? Through a spectral lens, we investigate the interplay between graph properties, their augmentation, and their spectral behavior, and found that keeping the low-frequency eigenvalues unchanged can preserve the critical properties at a large scale when generating augmented graphs. These observations inform our introduction of the Dual-Prism (DP) augmentation method, comprising DP-Noise and DP-Mask, which adeptly retains essential graph properties while diversifying augmented graphs. Extensive experiments validate the efficiency of our approach, providing a new and promising direction for graph data augmentation.

Abstract: Recent advances in differentiable scorebased methods for Directed Acyclic Graph (DAG) structure learning have revolutionized the problem of combinatorial structure learning, transforming it into a continuous optimization task. Despite their remarkable success, these methods rely on a key assumption that all samples have the same level of difficulty and no data heterogeneity. When this assumption does not hold, causal discovery algorithms based on it inevitably return networks with many spurious edges. Despite existing research, the current method ignores the reality of outliers in the samples, introducing certain limitations that still result in erroneous edges. Inspired by the rapid decay of the Gaussian distribution as distance from the center increases, we propose an innovative adaptive sample reweighting framework based on asymmetric exponential modulation Gaussian, coined DAG-AEG. DAG-AEG boosts DAG structure learning by analyzing the distribution of sample losses and employing the proposed method for adaptive sample attention. Additionally, it can be adapted to heterogeneous data. We used various causal structure learning methods to test the performance of DAG-AEG on synthetic and real datasets. The experimental results demonstrate that the proposed framework significantly improves the performance across all methods, outperforming existing methods.

Abstract: Recently, the advent of generative AI technologies has made transformational impacts on our daily lives, yet its application in scientific applications remains in its early stages. Data scarcity is a major, wellknown barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is highly required instead of single-modal data generation, which is usually used in natural images (e.g., faces, scenery). Moreover, in real-world applications, the unbalance of available data in terms of modalities commonly exists; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveform is largely lacking. While the most recent efforts enable the powerful diffusion model to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present "UB-Diff", a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which can ensure pairwise data is obtained from a co-latent representation. Then, the co-latent representation will be used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.

Abstract: We investigate the samplememory-pass trade-offs for pure exploration in multi-pass streaming multi-armed bandits (MABs) with the *a priori* knowledge of the optimality gap ?_[2]. Here, and throughout, the optimality gap ?_[i] is defined as the mean reward gap between the best and the i-th best arms. A recent line of results have shown that if there is no known ?_[2], a pass complexity of ̃?(log(1/?_[2])) is necessary and sufficient to obtain the *worst-case optimal* O(n/?²_[2]) sample complexity with a single-arm memory. However, our understanding of multi-pass algorithms with known ?_[2] is still limited. Here, the key open problem is how many passes are required to achieve the complexity, i.e., O( ∑ᵢ₌₂ⁿ1/?²_[i] log{n}) arm pulls, with a sublinear memory size. In this work, we show that the ``right answer'' for the question is ̃?(log{n}) passes. We first present a lower bound, showing that any algorithm that finds the best arm with slightly sublinear memory -- a memory of o(n/polylog(n)) arms -- and O( ∑ᵢ₌₂ⁿ1/?²_[i] log n) arm pulls has to make ?(log n/loglog n) passes over the stream. We then show a nearly-matching algorithm that assuming the knowledge of ?_[2], finds the best arm with O( ∑ᵢ₌₂ⁿ1/?²_[i] log n) arm pulls and a *single arm* memory.

Abstract: Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a documentcentric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component.

Abstract: As a fundamental method in economics and finance, the factor model has been extensively utilized in quantitative investment. In recent years, there has been a paradigm shift from traditional linear models with expertdesigned factors to more flexible nonlinear machine learning-based models with data-driven factors, aiming to enhance the effectiveness of these factor models. However, due to the low signal-to-noise ratio in market data, mining effective factors in data-driven models remains challenging. In this work, we propose a hypergraph-based factor model with temporal residual contrastive learning (FactorGCL) that employs a hypergraph structure to better capture high-order nonlinear relationships among stock returns and factors. To mine hidden factors that supplement human-designed prior factors for predicting stock returns, we design a cascading residual hypergraph architecture, in which the hidden factors are extracted from the residual information after removing the influence of prior factors. Additionally, we propose a temporal residual contrastive learning method to guide the extraction of effective and comprehensive hidden factors by contrasting stock-specific residual information over different time periods. Our extensive experiments on real stock market data demonstrate that FactorGCL not only outperforms existing state-of-the-art methods but also mines effective hidden factors for predicting stock returns.

Abstract: A central problem in quantum mechanics involves solving the Electronic Schrödinger Equation for a molecule or material. The Variational Monte Carlo approach to this problem approximates a particular variational objective via sampling, and then optimizes this approximated objective over a chosen parameterized family of wavefunctions, known as the ansatz. Recently neural networks have been used as the ansatz, with accompanying success. However, sampling from such wavefunctions has required the use of a Markov Chain Monte Carlo approach, which is inherently inefficient. In this work, we propose a solution to this problem via an ansatz which is cheap to sample from, yet satisfies the requisite quantum mechanical properties. We prove that a normalizing flow using the following two essential ingredients satisfies our requirements: (a) a base distribution which is constructed from Determinantal Point Processes; (b) flow layers which are equivariant to a particular subgroup of the permutation group. We then show how to construct both continuous and discrete normalizing flows which satisfy the requisite equivariance. We further demonstrate the manner in which the nonsmooth nature (``cusps'') of the wavefunction may be captured, and how the framework may be generalized to provide induction across multiple molecules. The resulting theoretical framework entails an efficient approach to solving the Electronic Schrödinger Equation.

BNRist, Department of Computer Science and Technology, Tsinghua University, UCL Cancer Institute, University College London, Monash Data Futures Institute, Monash University Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Monash Data Futures Institute, Monash University Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Monash Data Futures Institute, Monash University Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, BNRist, Department of Computer Science and Technology, Tsinghua University, BNRist, Department of Computer Science and Technology, Tsinghua University, Monash Data Futures Institute, Monash University Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, BNRist, Department of Computer Science and Technology, Tsinghua University

Abstract: Accurately measuring proteinRNA binding affinity is crucial in many biological processes and drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features, unable to capture the binding mechanisms comprehensively. The recent emerging pre-trained language models trained on massive unsupervised sequences of protein and RNA have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying different-domain language models collaboratively for complex-level tasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that cross-biological modal language models can collaborate to improve binding affinity prediction. We propose a Co-Former to combine the cross-modal sequence and structure information and a bi-scope pre-training strategy for improving Co-Former's interaction understanding. Meanwhile, we build the largest protein-RNA binding affinity dataset PRA310 for performance evaluation. We also test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict the protein-RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.

Abstract: Drug response prediction (DRP) is a longstanding challenge in modern oncology that underpins personalized treatment. Early DRP methods, trained on labelrich cell line samples, suffer from performance degradation when applied to label-scarce patient samples due to the distribution shift. Recently, a few transfer learning efforts have addressed this issue by aligning cell line (source domain) and patient (target domain) data via unsupervised domain adaptation (UDA). However, these efforts often treat each drug's response prediction as an isolated task, requiring model retraining when the drug changes; and focus only on aligning data distributions as a whole, neglecting the category (e.g., different cancers or tissues) confusion problem. To address these limitations, we propose a knowledge-guided domain adaptation model to transfer the DRP from cell lines to patients, named TransDRP. Specifically, TransDRP operates in two phases: pre-training and adaptation. In the first phase, we pre-train a multi-label graph neural network using molecular knowledge, to simultaneously predict responses for various drugs and capture their interdependencies. In the second phase, we implement a global-local domain adversarial strategy with clinical knowledge, to encourage representation alignment within same cancer categories and separation among different cancer categories across domains. Extensive experiments demonstrate that TransDRP outperforms state-of-the-art UDA methods in both transfer efficiency and precision for the patient DRP.

Abstract: Edge computingbased video analytics faces data drift issues due to the occurrence of unseen objects or scenes in ever-changing environments. To maintain accuracy, continuous learning (CL) retrains stale models periodically with newly obtained data. However, it leads to unaffordable costs, as we must keep labeling drift data and retraining models. Regarding this concern, we first investigate video patterns across multiple cameras within an area and reveal significant data redundancies. We find that many of the same objects can be captured by multiple edge cameras or appear many times on the same edges. Our quantitative findings suggest that selecting a subset of high-quality data for CL is preferable over using a larger quantity. Yet, existing efforts for data acquisition have only focused on a single static dataset. These methods are not suitable for multi-edge video analytics scenarios, where videos are captured from multiple sources with non-iid data distribution. Hence, we propose a multi-edge collaborative active video acquisition (AVA) framework to collaboratively learn a reinforced video acquisition strategy to identify informative video frames from multiple edge nodes that best enhance model accuracy, avoiding redundancy across edges. Extensive experiments on three video datasets demonstrate that, our method achieves comparable performance to full-set video training while utilizing only 20% of the data in classification tasks. In object detection tasks, our methods can maintain productive accuracy with a reduction of nearly 70% in training video frames.

Abstract: Spatiotemporal forecasting (STF) is pivotal in urban computing, yet data scarcity in developing cities hampers robust model training. Addressing this, recent studies leverage transfer learning to migrate knowledge from datarich (source) to data-poor (target) cities. This strategy, while effective, faces challenges as pre-trained models risk absorbing noise and harmful information due to data distribution disparities, potentially undermining the accuracy of forecasts for target cities. To address this issue, we propose a one-stage STF framework named Target-Skewed Joint Training (TSJT). Central to TSJT is a novel Target-Skewed Backward training strategy that selectively refines gradients from source city data, preserving only the elements that positively impact the target city. To further enhance the quality of these gradients, we have designed a Node Prompting Module (NPM). TSJT is crafted for seamless integration with existing STF models, endowing them with the capability to efficiently tackle challenges stemming from data scarcity. Experimental results on several real-world datasets from multiple cities substantiate the efficacy of TSJT in the realm of cross-city transfer learning.

School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China

Abstract: Driver attention recognition in driving scenarios is a popular direction in traffic scene perception technology. It aims to understand human driver attention to focus on specific targets/objects in the driving scene. However, traffic scenes contain not only a large amount of visual information but also semantic information related to driving tasks. Existing methods lack attention to the actual semantic information present in driving scenes. Additionally, the traffic scene is a complex and dynamic process that requires constant attention to objects related to the current driving task. Existing models, influenced by their foundational frameworks, tend to have large parameter counts and complex structures. Therefore, this paper proposes a realtime saliency Mamba network based on the latest Mamba framework. As shown in Figure 1, our model uses very few parameters (0.08M, only 0.09~11.16% of other models), while maintaining SOTA performance or achieving over 98% of the SOTA model's performance.

Abstract: Underwater salient object detection (USOD) plays a pivotal role in various visionbased marine exploration tasks. However, existing USOD techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. The quality degradation of raw underwater images (caused by selective absorption and medium scattering) makes it challenging to perform instance detection directly. One conceivable approach involves initially removing visual disturbances through underwater image enhancement (UIE), followed by saliency detection. However, this two-stage approach neglects the potential positive impact of the restoration procedure on saliency detection due to it executes in a cascade. Based on this insight, we propose a generalized prior-involved diffusion model, called WaterDiffusion for collaborative underwater saliency detection and visual restoration. Specifically, we first propose a revised self-attention joint diffusion, which embeds dynamic saliency masks into the diffusive network as latent features. By extending the underwater degradation prior into the multi-scale decoder, we innovatively exploit optical transmission maps to aid in localizing underwater salient objects. Then, we further design a gate-guided binary indicator to select either normalized or raw channels for improving feature generalization. Finally, the Half-quadratic Splitting is introduced into the unfolding sampling to refine saliency masks iteratively. Comprehensive experiments demonstrate the superior performance of WaterDiffusion over state-of-the-art methods in both quantitative and qualitative evaluations.

Abstract: Model editing aims to correct outdated or erroneous knowledge in large models without costly retraining. Recent research discovered that the midlayer representation of the subject's final token in a prompt has a strong influence on factual predictions, and developed Large Language Model (LLM) editing techniques based on this observation. However, for Vision-LLMs (VLLMs), how visual representations impact the predictions from a decoder-only language model remains largely unexplored. To the best of our knowledge, model editing for VLLMs has not been extensively studied in the literature. In this work, we employ the contribution allocation and noise perturbation methods to measure the contributions of visual representations for token predictions. Our attribution analysis shows that visual representations in mid-to-later layers that are highly relevant to the prompt contribute significantly to predictions. Based on these insights, we propose *VisEdit*, a novel model editor for VLLMs that effectively corrects knowledge by editing intermediate visual representations in regions important to the edit prompt. We evaluated *VisEdit* using multiple VLLM backbones and public VLLM editing benchmark datasets. The results show the superiority of *VisEdit* over the strong baselines adapted from existing state-of-the-art editors for LLMs.

School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China

Abstract: Different from traditional object detection, pure vision is not enough to infrared small target detection, due to small target size and weak background contrast. For promoting detection performance, more target representations are needed. Currently, motion representations have been proved to be one of the most potential feature kinds for infrared small target detection. Existing methods have an obvious weakness, that besides vision features, they could only capture coarse motion representations from temporal domain. With vision features, fine motion representations could be more effective to enhance detection performance. To overcome this weakness, inspired by prevalent visionlanguage models, we propose the first vision-language framework with motion prior knowledge learning (MoPKL). Breaking through traditional pure-vision modality, it utilizes homogeneous language descriptions, formatted for moving targets, to directionally guide vision channel learning motion prior knowledge. With the facilitation of motion-vision alignment and motion-relation mining, the motion of infrared small targets is further refined by graph attention, to generate more fine motion representations. The extensive experiments on datasets ITSDT-15K and IRDST show that our framework is effective. It could often obviously outperform other methods.

Abstract: The perception system for autonomous driving generally requires to handle multiple diverse subtasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance than single task model. M3Net takes multimodal data as input and multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based decoder and Mamba-based decoder, demonstrating its flexibility to different architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.

Abstract: Unsupervised point cloud shape correspondence aims to establish dense correspondences between source and target point clouds. Existing methods universally follow a onestep paradigm to obtain shape correspondence directly, but it often fails in large-scale motions of humans and animals. To address this challenge, we propose a conditional Diffusion model with reliable pseudo-label guidance for unsupervised point cloud shape Correspondence (DiffCorr), including a transformer-based conditional diffusion model and a reliable pseudo-label generator. The proposed DiffCorr enjoys several merits. Firstly, the transformer-based conditional diffusion model implements a coarse-to-fine optimization for coarse correspondences. Secondly, we design a reliable pseudo-label generator to provide high-quality pseudo-labels for training. Extensive experiments on four human and animal datasets demonstrate that DiffCorr surpasses state-of-the-art methods and exhibits favorable generalization capabilities.

Abstract: Multimodal Large Language Models (MLLMs) have significantly improved performance across various imagelanguage applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.

Abstract: Whole Slide Image (WSI) classification has very significant applications in clinical pathology, e.g., tumor identification and cancer diagnosis. Currently, most research attention is focused on Multiple Instance Learning (MIL) using static datasets. One of the most obvious weaknesses of these methods is that they cannot efficiently preserve and utilize previously learned knowledge. With any new data arriving, classification models are required to be retrained on both previous and current new data. To overcome this shortcoming and break through traditional vision modality, this paper proposes the first Vision-Language-based framework with Queryable Prototype Multiple Instance Learning (QPMIL-VL) specially designed for incremental WSI classification. This framework mainly consists of two information processing branches: one is for generating bag-level features by prototype-guided aggregation of instance features, while the other is for enhancing class features through a combination of class ensemble, tunable vector and class similarity loss. The experiments on four public WSI datasets demonstrate that our QPMIL-VL framework is effective for incremental WSI classification and often significantly outperforms other compared methods, achieving state-of-the-art (SOTA) performance.

Abstract: Pringle maneuver (PM) in laparoscopic liver resection aims to reduce blood loss and provide a clear surgical view by intermittently blocking blood inflow of the liver, whereas prolonged PM may cause ischemic injury. To comprehensively monitor this surgical procedure and provide timely warnings of ineffective and prolonged blocking, we suggest two complementary AIassisted surgical monitoring tasks: workflow recognition and blocking effectiveness detection in liver resections. The former presents challenges in real-time capturing of short-term PM, while the latter involves the intraoperative discrimination of long-term liver ischemia states. To address these challenges, we meticulously collect a novel dataset, called PmLR50, consisting of 25,037 video frames covering various surgical phases from 50 laparoscopic liver resection procedures. Additionally, we develop an online baseline for PmLR50, termed PmNet. This model embraces Masked Temporal Encoding (MTE) and Compressed Sequence Modeling (CSM) for efficient short-term and long-term temporal information modeling, and embeds Contrastive Prototype Separation (CPS) to enhance action discrimination between similar intraoperative operations. Experimental results demonstrate that PmNet outperforms existing state-of-the-art surgical workflow recognition methods on the PmLR50 benchmark. Our research offers potential clinical applications for the laparoscopic liver surgery community.

Abstract: 3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and highquality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various types of data, such as depth maps, 3D bounding boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that relies solely on training images without extra annotations. EGSRAL enhances 3D GS's capability to model both dynamic objects and static backgrounds and introduces a novel adaptor for auto labeling, generating corresponding annotations based on existing annotations. We also propose a grouping strategy for vanilla 3D GS to address perspective issues in rendering large-scale, complex scenes. Our method achieves state-of-the-art performance on multiple datasets without any extra annotation. For example, the PSNR metric reaches 29.04 on the nuScenes dataset. Moreover, our automated labeling can significantly improve the performance of 2D/3D detection tasks.

Abstract: Spatialaware image editing focuses on modifying the position and size of elements within a given image. However, previous works still struggle with maintaining background harmony in the original editing areas, as well as preserving the initial identity of the edited elements, making it difficult to achieve complex multi-object editing in a single pass. In this paper, we aim to perform flexible spatial editing in a simple yet straightforward manner. We propose to inpaint the background first and develop a two-stage multi-layered latent diffusion framework to edit each element independently. Specifically, we design a key-masking self-attention scheme alongside artifact suppression to achieve background inpainting within the denoising process, leveraging the powerful generative capabilities of the Latent Diffusion Model, Stable Diffusion XL-1.0. The latent decomposition and fusion framework is capable of unifying various spatial-aware operations, including removal, resizing, relocation, flipping, addition, camera panning, zooming out, occlusion-aware editing, and cross-image editing. Experiments demonstrate the superior inpainting quality for object removal, along with enhanced versatility and higher precision in spatial-aware editing achieved by our method.

School of Artificial Intelligence, Anhui University, Hefei 230601, China Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei 230601, China, School of Computer Science and Technology, Anhui University, Hefei 230601, China, School of Computer Science and Technology, Anhui University, Hefei 230601, China, School of Computer Science and Technology, Anhui University, Hefei 230601, China, School of Artificial Intelligence, Anhui University, Hefei 230601, China Anhui Provincial Key Laboratory of Security Artificial Intelligence, Hefei 230601, China

Abstract: Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in humancentered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and the performance of these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework.

Abstract: Learning a perception and reasoning module for robotic assistants to plan steps to perform complex tasks based on natural language instructions often requires large freeform language annotations, especially for short high-level instructions. To reduce the cost of annotation, large language models (LLMs) are used as a planner with few data. However, when elaborating the steps, even the state-of-the-art planner that uses LLMs mostly relies on linguistic common sense, often neglecting the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few-shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. The proposed scheme allows us to use a few language pairs thanks to the visual cues and outperforms state-of-the-art approaches. Our code and the dataset are publicly available to facilitate further research.

Abstract: Adversarial patches pose a significant threat to computer vision models' integrity, decreasing the accuracy of various tasks, including object detection (OD). Most existing OD defenses exhibit a tradeoff between enhancing the model's adversarial robustness and maintaining its performance on benign images. We propose KDAT (knowledge distillation with adversarial tuning), a novel mechanism that enhances the robustness of an OD model without compromising its performance on benign images or its inference time. Our method combines the knowledge distillation (KD) technique with the adversarial tuning concept to teach the model to match the predictions of adversarial images with those of their corresponding benign ones. To match these predictions, we designed four unique loss components, allowing the student model to effectively distill the knowledge of different features from various parts of the teacher model. Our extensive evaluation on the COCO and INRIA datasets demonstrates KDAT's ability to improve the performance of Faster R-CNN and DETR on benign images by 2-4 mAP% and adversarial examples by 10-15 mAP%, outperforming other state-of-the-art (SOTA) defenses. Furthermore, our additional physical evaluation on the Superstore dataset demonstrates KDAT's SOTA adversarial robustness against printed patches (improvement of 22 mAP% compared to the undefended model).

Zhejiang Gongshang University Key Laboratory of Public Security Information Application Based on Big-Data Architecture, Ministry of Public Security Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Zhejiang Gongshang University, Zhejiang Gongshang University, Zhejiang Gongshang University, Zhejiang University, Zhejiang Gongshang University, Zhejiang Key Laboratory of Artificial Intelligence of Things (AIoT) Network and Data Security, Zhejiang Gongshang University Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Zhejiang Gongshang University Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology

Abstract: The paper targets the challenging task of Ship License Plate (SLP) recognition. Existing methods for SLP recognition are hampered by the scarcity of large and publicly available datasets, leading to evaluations on small and nonrepresentative datasets. To alleviate it, we have built a large dataset, called SLP34K, which consists of 34,385 images collected by an intelligent traffic surveillance system. The dataset is carefully manually annotated with text labels and attributes, and presents high data diversity by multiple installation locations and long capturing period of the cameras. Additionally, we propose a simple yet effective SLP recognition baseline method. The baseline is equipped with a strong visual encoder that benefits from initial pre-training via self-supervised learning, followed by further refinement through our devised semantic enhancement module. Extensive experiments on SLP34K verify the effectiveness of our proposed baseline. Moreover, while our baseline is designed for SLP recognition, it can also be used for common scene text recognition and achieve state-of-the-art performance on seven mainstream scene text recognition datasets.

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

Abstract: Chart Question Answering (CQA) requires models to perform chart perception and reasoning. Recent studies driven by Large Language Models (LLMs) have dominated CQA. These include employing more cognitively capable LLMs for indirectly reasoning over transformed charts, i.e., tables, and directly perceiving charts utilizing Multimodal Large Language Models (MLLMs) with a wider perceptual range. Yet, they often encounter bottlenecks due to the limitation of the receptive field of LLMs and the fragility of the complex reasoning of some MLLMs. To unite the strengths of LLMs and MLLMs to complement each other's limitations, we propose Synergy, a framework that unites the power of both models for CQA. Synergy first unites the chart with a table as the augmented perceptual signal. Next, it unites LLMs and MLLMs, scheduling the former to decompose a question into subquestions and the latter to answer these by perceiving the chart. Lastly, it operates LLMs to summarize the subquestionanswer pairs to refine the final answer. Extensive experimental results on popular CharQA and PlotQA benchmarks reveal that, with the power of union, Synergy outperforms strong competitors and achieves superior boosts over naive MLLMs by uniting them with a smaller LLM.

Abstract: Videos showcasing specific products are increasingly important for Ecommerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to such moments, or video decoration with SFX (VDSFX), is crucial for enhancing user engaging experience. Previous work adds SFX to videos by video-to-SFX matching at a holistic level, lacking the ability of adding SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment to SFX matching untouched. By contrast, we propose in this paper D&M, a unified method that accomplishes key moment detection and moment-to-SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset SFX-Moment from an E-commerce video creation platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines.

College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Department of Mechanical Engineering, Stevens Institute of Technology, College of Computer Science, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education

Abstract: Textvideo retrieval is a foundation task in multi-modal research which aims to align texts and videos in the embedding space. The key challenge is to learn the similarity between videos and texts. A conventional approach involves directly aligning video-text pairs using cosine similarity. However, due to the disparity in the information conveyed by videos and texts, i.e., a single video can be described from multiple perspectives, the retrieval accuracy is suboptimal. An alternative approach employs cross-modal interaction to enable videos to dynamically acquire distinct features from various texts, thus facilitating similarity calculations. Nevertheless, this solution incurs a computational complexity of O(n^2) during retrieval. To this end, this paper proposes a novel method called Bidirectional Hierarchical Sliding Semantic Probe (BiHSSP), which calculates dynamic similarity between videos and texts with O(n) complexity during retrieval. We introduce a hierarchical semantic probe module that learns semantic probes at different scales for both video and text features. Semantic probe involves a sliding calculation of the cross-correlation between semantic probes at different scales and embeddings from another modality, allowing for dynamic similarity computation between video and text descriptions from various perspectives. Specifically, for text descriptions from different angles, we calculate the similarity at different locations within the video features and vice versa. This approach preserves the complete information of the video while addressing the issue of unequal information between video and text without requiring cross-modal interaction. Additionally, our method can function as a plug-and-play module across various methods, thereby enhancing the corresponding performance. Experimental results demonstrate that our BiHSSP significantly outperforms the baseline.

Abstract: Large VisionLanguage Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset.

Abstract: Posters serve an essential function in marketing and advertising by improving visual communication and brand visibility, thus significantly contributing to industrial design. With the latest developments in controllable T2I diffusion models, research interest has surged in text rendering within synthesized images. Although text rendering accuracy has seen advancements, automatic poster generation remains a relatively untapped area. This paper presents an automatic poster generation framework featuring text rendering capabilities through the use of LLMs. Our framework employs a triplecross attention mechanism based on alignment learning to achieve precise text placement within detailed contextual backgrounds. Moreover, it supports adjustable fonts, varying image resolutions, and poster rendering with textual prompts in both English and Chinese. Additionally, we present a comprehensive bilingual image-text dataset, GlyphDraw-3M, comprising 3 million image-text pairs, each with OCR annotations and resolutions exceeding 1024. Our method utilizes the SDXL architecture, and extensive experiments confirm its ability to generate posters with intricate and context-rich backgrounds.

Abstract: Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of textto-image models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task in this study to assess text-to-image models by tasking the T2I model with generating an image according to the reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This evaluation process is simplified as comparisons between the generated image and the reference image are straightforward. Two regeneration datasets spanning content-diverse and style-diverse evaluation dataset are introduced to evaluate the leading diffusion models currently available. Additionally, we present ImageRepainter framework to enhance the quality of generated images by improving content comprehension via MLLM guided iterative generation and revision. Our comprehensive experiments have showcased the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLM, we have demonstrated that a robust T2M can produce images more closely resembling the reference image.

Abstract: Hallucinations in Multimodal Large Language Models (MLLMs) where generated responses fail to accurately reflect the given image pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel trainingfree contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.

Abstract: Sourcefree unsupervised domain adaptation aims to eliminate domain shifts when data from the source domain and annotation from the target domain are not available. The multi-object detection tasks in medical image analysis are constrained by patient privacy and extremely huge annotation consumption. Hence, Source-free UDA is considered a more practical approach for eliminating the domain gap. However, relevant research that explores this topic is a dearth. In this paper, we design an Anatomy-aware Alignment Teacher-Student learning method using topological consistency based on a mean-teacher framework for Source-free UDA in multiple medical object detection named AATS, including Unsupervised Structure Refinement (USR) and Graph-aware Morphology Alignment (GMA). To match the student and teacher at the low-level and visual features, we propose the USR via an unsupervised clustering algorithm to group organs in ultrasound images. Based on USR, we obtain a graph with organ relations on the teacher branch. While in the student branch, we acquire visual features to construct graphical space and optimize the model with graph propagation. Finally, to match the student and teacher, GMA is designed to align the teacher and student based on both topology and morphology information that is derived from prior medical knowledge. Four groups of adaptation experiments were conducted on available medical datasets, and the outcomes demonstrate that our approach not only achieves state-of-the-art performance but also provides substantial advantages over existing methods.

Abstract: Enhancing the performance of semantic segmentation models with multispectral images (RGB-IR) is crucial, particularly for low-light and adverse environments. While multi-modal fusion techniques aim to learn cross-modality features for generating fused images or engage in knowledge distillation, they often treat multi-modal and missing modality scenarios as separate challenges, which is not an optimal approach. To address this, a novel multi-modal fusion approach called Optically-Guided Pixel-level contrastive learning Network (OGP-Net) is proposed, which uses Distillation with Multi-View Contrastive (DMC) and Distillation for Uni-modal Re- tention (DUR) to maintain the correlation between modality-shared and modality-specific features. DMC aligns the uni-modal features by projecting the semantic information across modalities into a unified latent space, ensuring that the feature maps retain multi-modal representations. Pixel-level multi-view contrastive learning is introduced to enable modality-invariant representation learning. To retain modality-specific information, DUR is proposed, which distills detailed textures from RGB images into the optical branch of OGP-Net. Additionally, the Gated Spectral Unit (GSU) is integrated into the framework to eliminate the need for manual tuning and avoid forced feature alignment. Comprehensive experiments show that OGP-Net outperforms state-of-the-art models in multi-modal and missing modality scenarios across three public benchmarking datasets. It achieves quicker convergence and learns efficiently from limited training samples.

Abstract: Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottomup features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video Slient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to accurately decode the saliency maps for the given video. Extensive experiments showcase the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.

Abstract: Existing skeletonbased human action classification models rely on well-trimmed action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models involve un-trimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for any length of testing videos, simultaneously realizing action localization and classification. Yet, achieving such an improvement im-poses frame-wise annotated skeleton videos, which remains time-consuming in practice. This paper features a novel framework for skeleton-based action segmentation trained on short trimmed skeleton videos, but that can run on longer un-trimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a tem-poral skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions that compose a semantic space and can be sampled to generate multi-action stitched se-quences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts to improve action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling particular da-ta availability. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation to evaluate the effectiveness of the proposed method.

Abstract: While existing semisupervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges which introduces more noise when selecting positive labels from unlabeled data. 3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task.

Abstract: Incremental object detection (IOD) is a challenging task that requires detection models to continuously learn from newly arriving data. This work focuses on incremental learning for visionlanguage detectors (VLDs), an under explored domain. Existing research typically adopts a local alignment paradigm to avoid label conflicts, where different tasks are learned separately without interaction. However, we reveal that this practice fails to effectively preserve the semantic structure. Specifically, aligned relationships between objects and texts would collapse when handling novel categories, ultimately leading to catastrophic forgetting. Though knowledge distillation (KD) is a common approach for tackling this, traditional KD performs poorly when directly applied to VLDs, as for different phases, a natural knowledge gap exists in both encoding and decoding processes. To address above issues, we propose a novel method called Global alignment and Correspondence Distillation (GCD). Differently, we first integrate knowledge across phases within the same embedding space to construct global semantic structure. We then enable effective knowledge distillation in VLDs through a semantic correspondence mechanism, ensuring consistent proposal generation and decoding. On the top of that, we distill teacher model’s informative predictions and topological relationships to maintain stable local semantic structure. Extensive experiments on COCO 2017 demonstrate that our method significantly outperforms existing approaches, achieving new state-of-the-art in various IOD scenarios.

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China School of Computer Science and Engineering, Beihang University, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China School of Computer Science and Engineering, Beihang University, Beijing, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China School of Computer Science and Engineering, Beihang University, Beijing, China

Abstract: Large VisionLanguage Model (LVLM), leveraging Large Language Model (LLM) as the cognitive core, has recently become one of the most representative multimodal model paradigms. However, with the expansion of unimodal branches, \emph{i.e.} visual encoder and LLM, the storage and computational burdens intensify, posing challenges for deployment. Structured pruning has proved promising in compressing large models by trimming a large portion of insignificant network structures. Nevertheless, most of them are predominantly designed for LLMs, either relying on unitary importance metrics that fail to deal with modality-wise imbalances or adopting generic pruning and recovery paradigms that overlook the unique calibration status and capability requirements of large models, leading to substantial performance degradation. To address these issues, we propose a novel structured pruning approach for LVLMs, dubbed Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling (UKMP). Specifically, we design a Unified Knowledge Maintenance Importance (UKMI) metric, which simultaneously considers balancing the block-wise and modality-wise importance by adaptive normalization, optimizing the importance estimation by refining gradient-based criteria, and maintaining the knowledge capacity of LVLMs by using the angle distribution information entropy. Moreover, we develop a LoRA-based Progressive Distillation (LPD) method that recalls the pruned weights and performs progressive distillation for comprehensive recovery. Extensive experimental results across various vision-language tasks demonstrate the effectiveness of our approach, comparing to the state-of-the-art structured pruning methods.

South China University of Technology, China The Hong Kong Polytechnic University, Hong Kong SAR, China, South China University of Technology, China, South China University of Technology, China, South China University of Technology, China Guangdong Engineering Center for Large Model and GenAI Technology State Key Laboratory of Subtropical Building Science Ministry of Education Key Laboratory of Big Data and Intelligent Robot Guangdong Provincial Key Lab of Computational Intelligence and Cyberspace Information, South China University of Technology, China, The Hong Kong Polytechnic University, Hong Kong SAR, China, The Hong Kong Polytechnic University, Hong Kong SAR, China

Abstract: In clinical imaging, medical segmentation networks typically require continually adapting to new data from multiple sites over time, as aggregating all data for learning at once can be impractical due to storage limitations and privacy concerns. However, existing methods basically overlook domainspecific characteristics and fall short of adequately capturing domain-invariant knowledge during continual learning, leading to undesired catastrophic forgetting of previous sites and inferior generalization to new sites. To tackle this issue, this paper introduces FR2Seg, to sufficiently exploit both domain-specific and domain-invariant knowledge for efficient continual learning with the aid of low-frequency cues. For the former aspect, we propose a Fourier style replay module to synthesize pseudo images with old-site styles for data augmentation during new-site training, effectively preventing catastrophic forgetting without sacrificing data privacy. For the latter, we present a Fourier adaptive consistency regularization to identify and constrain the optimization of domain-invariant parameters with explicit awareness of knowledge transferability across sites, ensuring excellent generalizability to new sites. Experimental results on two public datasets confirm our method's superiority over existing state-of-the-art continual learning methods.

Abstract: The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGBD cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.

Abstract: Compared to fully supervised object detection, training with sparse annotations typically leads to a decline in performance due to insufficient feature diversity. Existing sparsely annotated object detection (SAOD) methods often rely on pseudolabeling strategies, but these pseudo-labels tend to introduce noise under extreme sparsity. To simultaneously avoid the impact of pseudo-label noise and enhance feature diversity, we propose a novel Adaptive Feature Generation (AdaptFG) model that generates features based on class names. This model integrates a pre-trained CLIP into a VAE-based feature generator, with its core innovation being an Adaptor that adaptively maps CLIP’s semantic embeddings to the object detector domain. Additionally, we introduce inter-class relationship reasoning in detector, which effectively mitigates misclassifications stemming from similar features. Extensive experimental results demonstrate that AdaptFG consistently outperforms state-of-the-art SAOD methods on the PASCAL VOC and MS COCO benchmarks.

Abstract: A multimodal fusion technique using LiDARcamera has been developed for precise 3D object detection in autonomous driving and provides acceptable detection performance in ideal conditions with clear weather. However, the existing multimodal methods are still vulnerable to adverse weather conditions, such as snow, rain, and fog. These factors increase the point cloud sparsity due to occlusion and attenuation of the laser signal. A point cloud becomes sparser with increased distance, posing a challenge for object detection. To address these problems, we propose a point reconstruction network using equirectangular projection for multimodal 3D object detection. This network consists of distance-constrained denoising to remove adverse weather noise and an object-centric ray generator to generate distant object points flexibly. We propose a domain adaptation method that injects feature perturbations to improve detection performance by reducing the domain gap between different datasets. Furthermore, we propose a multimodal weather noise matching method for realistic data synthesis-based training to align the adverse weather noise between synthetic point clouds and images. The experimental results on adverse weather datasets confirm that the proposed approach outperforms the existing methods.

Abstract: Datadriven deep learning models have enabled tremendous progress in change detection (CD) with the support of pixel-level annotations. However, collecting diverse data and manually annotating them is costly, laborious, and knowledge-intensive. Existing generative methods for CD data synthesis show competitive potential in addressing this issue but still face the following limitations: 1) difficulty in flexibly controlling change events, 2) dependence on additional data to train the data generators, 3) focus on specific change detection tasks. To this end, this paper focuses on the semantic CD (SCD) task and develops a multi-temporal SCD data generator ChangeDiff by exploring powerful diffusion models. ChangeDiff innovatively generates change data in two steps: first, it uses text prompts and a text-to-layout (T2L) model to create continuous layouts, and then it employs layout-to-image (L2I) to convert these layouts into images. Specifically, we propose multi-class distribution-guided text prompts (MCDG-TP), allowing for layouts to be generated flexibly through controllable classes and their corresponding ratios. Subsequently, to generalize the T2L model to the proposed MCDG-TP, a class distribution refinement loss is further designed as training supervision. Our generated data shows significant progress in temporal continuity, spatial diversity, and quality realism, empowering change detectors with accuracy and transferability.

Abstract: Current stateof-the-art (SOTA) methods in 3D Human Pose Estimation (HPE) are primarily based on Transformers. However, existing Transformer-based 3D HPE backbones often encounter a trade-off between accuracy and computational efficiency. To resolve the above dilemma, in this work, we leverage recent advances in state space models and utilize Mamba for high-quality and efficient long-range modeling. Nonetheless, Mamba still faces challenges in precisely exploiting local dependencies between joints. To address these issues, we propose a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic). This architecture introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba's outputs. By adaptively fusing representations from Mamba and GCN, Pose Magic demonstrates superior capability in learning the underlying 3D structure. To meet the requirements of real-time inference, we also provide a fully causal version. Extensive experiments show that Pose Magic achieves new SOTA results (0.9 mm drop) while saving 74.1% FLOPs. In addition, Pose Magic exhibits optimal motion consistency and the ability to generalize to unseen sequence lengths.

Abstract: We present a framework that achieves shadow removal by learning intrinsic image decomposition (IID) from unpaired shadow and shadowfree images. Although it is well-known that intrinsic images, \ie, illumination and reflectance, are highly beneficial to shadow removal, IID is rarely adopted by previous work due to its inherent ambiguity and the scarcity of training data. However, we find that by properly coupling shadow removal and IID into a joint learning framework, they can reinforce each other and enable promising results on both tasks, even with unpaired training data. Our framework is comprised of an IID network for separating the shadow input image into illumination and reflectance, and an illumination recovery network for predicting shadow-free illumination with which we are able to produce the shadow removal output by recombining with the estimated reflectance. We perform extensive experiments on various benchmark datasets to demonstrate the effectiveness of our method in shadow removal, and also showcase our advantage over previous IID methods in handling images with complex shadows.

Abstract: Model ensembling is a widely used technique that enhances performance in imagetext matching tasks by combining multiple models, each trained with different initializations. However, the inefficiencies associated with training several models and generating outputs from them constrain their practical applicability. In this paper, we argue that while the parameters of two randomly initialized models can differ significantly, their feature distributions can be similar at certain stages. By employing a proposed technique called cross-modal realignment, we demonstrate that features derived from differently initialized models maintain similarity at the feature extraction stage and can be effectively transformed by fine-tuning a small number of parameters. These findings provide an efficient way to achieve ensemble-like performance within a single model. Specifically, we propose a Feature Diversification Framework (FDF) that emulates the outputs of multiple model initializations to generate diverse features from a common shared feature. Firstly, we introduce feature conversion methods to transform shared features into a set of distinct features. Next, a realignment training strategy is presented to optimize negative pairs for realigning these transformed features, thereby enhancing their diversification to resemble the outputs of different models. Additionally, we propose a reweighting module that assigns weights to these features, enabling a weighted fusion approach for robust feature representation. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate the effectiveness and generalizability of our framework.

Abstract: While Implicit Neural Representations (INRs) have demonstrated significant success in image representation, they are often hindered by large training memory and slow decoding speed. Recently, Gaussian Splatting (GS) has emerged as a promising solution in 3D reconstruction due to its highquality novel view synthesis and rapid rendering capabilities, positioning it as a valuable tool for a broad spectrum of applications. In particular, a GSbased representation, 2DGS, has shown potential for image fitting. In our work, we present Large Images are Gaussians (LIG), which delves deeper into the application of 2DGS for image representations, addressing the challenge of fitting large images with 2DGS in the situation of numerous Gaussian points, through two distinct modifications: 1) we adopt a variant of representation and optimization strategy, facilitating the fitting of a large number of Gaussian points; 2) we propose a Level-of-Gaussian approach for reconstructing both coarse low-frequency initialization and fine high-frequency details. Consequently, we successfully represent large images as Gaussian points and achieve high-quality large image representation, demonstrating its efficacy across various types of large images.

Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,, Inceptio, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China,

Abstract: Recently, Visual Foundation Models (VFMs) have shown a remarkable generalization performance in 3D perception tasks. However, their effectiveness in largescale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), providing cross-modal supervision with the formidable interpretive capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise Corrector (ANC), updating and adjusting the noisy samples within this supervision iteratively during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample's sensitivity to noisy supervision, preventing potential underfitting issues associated with robust loss. Our proposed AdaCo can effectively mitigate the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.

School of Computer Science and Technology, Soochow University, Suzhou, China, School of Artificial Intelligence and Computer Science, Jiangnan University, School of Electrical Engineering and Computer Science, University of Queensland, School of Computer Science and Technology, Soochow University, Suzhou, China, School of Computer Science and Technology, Soochow University, Suzhou, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, School of Information, Renmin University of China, Beijing, China International College (Suzhou Research Institute), Renmin University of China, Suzhou, China

Abstract: Next POI recommendation aids users in predicting their destinations of interest and plays an increasingly vital role in locationbased social services. Recent works focus on analyzing both long-term and short-term interests in POI recommendation to gain a deeper understanding of user profiles. However, these methods for modeling long-term user’s sequences primarily rely on the Transformer model, which functions as a low-pass filter, often leading to the loss of high-frequency information. Additionally, long-term and short-term sequences are typically modeled independently, with short-term sequences often defined solely by the most recent check-ins, overlooking their interactions and dependencies. Therefore, we propose Enhancing Long-and Short-Term Representations for Next POI Recommendations via Frequency and Hierarchical Contrastive Learning (FHCRec). FHCRec captures both high-frequency and low-frequency information in long-term sequences to model richer long-term user’s preference representations. Moreover, it harnesses the characteristics of the short-term subsequences embedded within long-term sequences to enhance short-term preference characterization via local and global hierarchical contrastive learning, resulting in more personalized short-term preferences. The enhanced long-term and short-term preferences are integrated to improve model recommendation performance. Extensive experiments on three real-world datasets demonstrate the effectiveness of our method.

Abstract: Recommender systems used in online platforms can drive users to consume content continuously in an attempt to maximize satisfaction. Such engagement is invariably broken due to more pressing work, alternate pursuits, distractions or fatigue. Recommender systems need to ensure the continuity of experience when the user joins back. Sessionbased recommender systems typically create different sessions based on a fixed time interval (θ), often resulting in creation of a separate session when the user gets off the platform temporarily. When the user joins back, session-based recommender systems are likely to recommend content different than what they would have in case the earlier session had continued. This may cause dissatisfaction given that there is a difference in the predicted world model of the user, i.e. the expectation from the last session, and the observed one, i.e. the recommendations. To handle this problem, we propose the creation of content-driven sessions instead of time-driven sessions. In our setting, a session continues while a single item category dominates in the user-item interactions. A new session is created when a different item category begins to dominate. The proposed content-driven method also solves the long-standing problem of deciding the optimal value of time threshold (θ) for defining the time-based session. We report that the proposed method outperforms existing SOTA methodologies set by time-based sessions by a large margin in terms of recommendation performance on multiple datasets.

Abstract: Sequential recommendation aims to capture the temporal dependencies of items in a user's historical interactions and make recommendations based on this. Previous generative methods addressed the issue of data not directly reflecting user preference uncertainty by modeling the distribution of latent item representations. Diffusion model (DM)based methods have achieved significant success due to their high-quality generation and stable training. However, they lack satisfactory user sequence representations to guide the generation process, impacting recommendation performance. Moreover, these methods overlook the drawback of slow inference speed, severely limiting their practical value. To obtain effective generative guidance signals and accelerate the recommendation process, we propose DAE4Rec. In this approach, a Graph Auto-Encoder (GAE) is used to obtain interpretable item node representations, revealing global transitions of items that previous methods struggled to uncover. Then, we use it to construct a generative guidance signal with lower coupling and variance for the diffusion model. Additionally, by employing a non-Markov chain derived from the forward diffusion process, it is the first to implement a 'skip-step' reverse process in diffusion model-based methods. And a creatively designed compensator is used to bridge the performance gap caused by 'skip-step'. Extensive experiments on three real-world datasets demonstrate that DAE4Rec outperforms other state-of-the-art generative sequential recommenders.

State Key Laboratory of Blockchain and Data Security, Zhejiang University College of Computer Science, Zhejiang University, China, State Key Laboratory of Blockchain and Data Security, Zhejiang University College of Computer Science, Zhejiang University, China Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, College of Computer Science, Zhejiang University, China, College of Computer Science, Zhejiang University, China, College of Computer Science, Zhejiang University, China, State Key Laboratory of Blockchain and Data Security, Zhejiang University College of Computer Science, Zhejiang University, China, State Key Laboratory of Blockchain and Data Security, Zhejiang University College of Computer Science, Zhejiang University, China, State Key Laboratory of Blockchain and Data Security, Zhejiang University Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Abstract: Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant indepth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths --- both can be viewed as augmentations of traditional losses with Distributional Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations --- stemming from their use of different distribution distance metrics in DRO optimization, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi-divergence in DRO optimization. DrRL incorporates the advantageous structures of both SL and CCL, and can be demonstrated to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL on both recommendation accuracy and robustness.

Abstract: Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing. Multisensor synchronous measurement has proven to be more effective for WHAR than using a single sensor. However, existing WHAR methods use shared convolutional kernels for indiscriminate temporal feature extraction across each sensor variable, which fails to effectively capture spatio-temporal relationships of intra-sensor and inter-sensor variables. We propose the DecomposeWHAR model consisting of a decomposition phase and a fusion phase to better model the relationships between modality variables. The decomposition creates high-dimensional representations of each intra-sensor variable through the improved Depth Separable Convolution to capture local temporal features while preserving their unique characteristics. The fusion phase begins by capturing relationships between intra-sensor variables and fusing their features at both the channel and variable levels. Long-range temporal dependencies are modeled using the State Space Model (SSM), and later cross-sensor interactions are dynamically captured through a self-attention mechanism, highlighting inter-sensor spatial correlations. Our model demonstrates superior performance on three widely used WHAR datasets, significantly outperforming state-of-the-art models while maintaining acceptable computational efficiency.

Abstract: Approximation fixpoint theory (AFT) is a robust and popular mathematical framework that characterizes many nonmonotonic semantics, where the construction of stable fixpoints, called stable revision, play a central role. Nondeterministic AFT is a recent development that redefines AFT for a nondeterministic setting to capture disjunctive semantics. This theory departs from traditional AFT by introducing distinct definitions, thus raising the question of whether deterministic AFT can be adopted directly to define nondeterministic stable revision. This work proposes such an alternate theory and creates a new way to study disjunctive semantics in terms of normal (nondisjunctive) knowledge bases. To demonstrate the viability of our framework, we show how to capture stable and partial stable models for disjunctive logic programs. We then study the relationships between this alternative theory and the state-of-the-art nondeterministic AFT.

School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, South China Normal University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University Guangdong Key Laboratory of Big Data Analysis and Processing

Abstract: Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chainof-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.

Abstract: Formal representations of traffic scenarios can be used to generate test cases for the safety verification of autonomous driving. However, most existing methods are limited to highway or highly simplified intersection scenarios due to the intricacy and diversity of traffic scenarios. In response, we propose Traffic Scenario Logic (TSL), which is a spatialtemporal logic designed for modeling and reasoning of urban pedestrian-free traffic scenarios. TSL provides a formal representation of the urban road network that can be derived from OpenDRIVE, i.e., the de facto industry standard of high-definition maps for autonomous driving, enabling the representation of a broad range of traffic scenarios without discretization approximations. We implemented the reasoning of TSL using Telingo, i.e., a solver for temporal programs based on Answer Set Programming, and tested it on different urban road layouts. Demonstrations show the effectiveness of TSL in test scenario generation and its potential value in areas like decision-making and control verification of autonomous driving. The code for TSL reasoning has been open-sourced.

Institute for AI in Medicine (IKIM), University Medicine Essen, Institute for Neuroinformatics, Ruhr-University Bochum, Institute for AI in Medicine (IKIM), University Medicine Essen, Carenostics, Department of Computer and Data Sciences, Case Western Reserve University, Carenostics, Department of Data Science & AI, Monash University, Institute for AI in Medicine (IKIM), University Medicine Essen, Institute for Neuroinformatics, Ruhr-University Bochum, Department of Data Science & AI, Monash University, Carenostics

Abstract: In many critical applications, sensitive data is inherently distributed and cannot be centralized due to privacy concerns. A wide range of federated learning approaches have been proposed to train models locally at each client without sharing their sensitive data, typically by exchanging model parameters, or probabilistic predictions (soft labels) on a public dataset or a combination of both. However, these methods still disclose private information and restrict local models to those that can be trained using gradientbased methods. We propose a federated co-training (FEDCT) approach that improves privacy by sharing only definitive (hard) labels on a public unlabeled dataset. Clients use a consensus of these shared labels as pseudo-labels for local training. This federated co-training approach empirically enhances privacy without compromising model quality. In addition, it allows the use of local models that are not suitable for parameter aggregation in traditional federated learning, such as gradient-boosted decision trees, rule ensembles, and random forests. Furthermore, we observe that FEDCT performs effectively in federated fine-tuning of large language models, where its pseudo-labeling mechanism is particularly beneficial. Empirical evaluations and theoretical analyses suggest its applicability across a range of federated learning scenarios.

Abstract: Measuring the similarity of the internal representations of deep neural networks is an important and challenging problem. Model stitching has been proposed as a possible approach, where two halfnetworks are connected by mapping the output of the first half-network to the input of the second one. The representations are considered functionally similar if the resulting stitched network achieves good task-specific performance. The mapping is normally created by training an affine stitching layer on the task at hand while freezing the two half-networks, a method called task loss matching. Here, we argue that task loss matching may be very misleading as a similarity index. For example, it can indicate very high similarity between very distant layers, whose representations are known to have different functional properties. Moreover, it can indicate very distant layers to be more similar than architecturally corresponding layers. Even more surprisingly, when comparing layers within the same network, task loss matching often indicates that some layers are more similar to a layer than itself. We argue that the main reason behind these problems is that task loss matching tends to create out-of-distribution representations to improve task-specific performance. We demonstrate that direct matching (when the mapping minimizes the distance between the stitched representations) does not suffer from these problems. We compare task loss matching, direct matching, and well-known similarity indices such as CCA and CKA. We conclude that direct matching strikes a good balance between the structural and functional requirements for a good similarity index.

Abstract: Tsetlin Machines (TMs) have garnered increasing interest for their ability to learn concepts via propositional formulas and their proven efficiency across various application domains. Despite this, the convergence proof for the TMs, particularly for the AND operator (conjunction of literals), in the generalized case (inputs greater than two bits) remains an open problem. This paper aims to fill this gap by presenting a comprehensive convergence analysis of Tsetlin automatonbased Machine Learning algorithms. We introduce a novel framework, referred to as Probabilistic Concept Learning (PCL), which simplifies the TM structure while incorporating dedicated feedback mechanisms and dedicated inclusion/exclusion probabilities for literals. Given n features, PCL aims to learn a set of conjunction clauses Ci each associated with a distinct inclusion probability pi. Most importantly, we establish a theoretical proof confirming that, for any clause k, PCL converges to a conjunction of literals when pk is between 0.5 and 1. This result serves as a stepping stone for future research on the convergence properties of Tsetlin automaton-based learning algorithms. Our findings not only contribute to the theoretical understanding of Tsetlin automaton-based learning algorithms but also have implications for their practical application, potentially leading to more robust and interpretable machine learning models.

Abstract: Simulating fuel sloshing within aircraft tanks during flight is crucial for aircraft safety research. Traditional methods based on NavierStokes equations are computationally expensive. In this paper, we treat fluid motion as point cloud transformation and propose the first neural network method specifically designed for simulating fuel sloshing in aircraft. This model is also the first deep learning model capable of stably modeling fluid particle dynamics in such complex scenarios. Our triangle feature fusion design achieves an optimal balance among fluid dynamics modeling, momentum conservation constraints, and global stability control. Additionally, we constructed the Fueltank dataset, the first dataset for aircraft fuel surface sloshing. It comprises 320,000 frames across four typical tank types and covers a wide range of flight maneuvers, including multi-directional rotations. We conducted comprehensive experiments on both our dataset and the take-off scenario of the aircraft. Compared to existing neural network-based fluid simulation algorithms, we significantly enhanced accuracy while maintaining high computational speed. Compared to traditional SPH methods, our speed improved approximately 10 times. Furthermore, compared to traditional fluid simulation software such as Flow3D, our computation speed increased by more than 300 times.

Abstract: Handling imbalance in class distribution when building a classifier over tabular data has been a problem of longstanding interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, recently higher capacity deep generative models are providing greater promise. However, handling of imbalance in class distribution when building a deep generative model is also a challenging problem, that has not been studied as extensively as imbalanced classifier model training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where minority and majority distributions overlap. We show that just this pre-processing of the training set, significantly improves the quality of data generated spanning several state-of-the-art diffusion and GAN-based models. While training the classifier using synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance

Abstract: The problem of constrained Markov game has recently attracted interests in the study of multiagent reinforcement learning (MARL). The existing literature has focused on safe MARL problems where safety constraints are imposed for each agent individually. In this work, we consider Markov potential game (MPG) with a shared constraint, where the cost function with respect to the constraint depends on states and joint actions of all agents. We adopt a primal-dual framework to tackle the problem and establish the Slater condition to ensure the strong duality. Moreover, we propose our primal-dual learning algorithm for learning approximate Nash equilibrium in MPG with shared constraint. Thanks to the novel design of the dual update, we provide asymptotic convergence on the weighted output policy. Specifically, we prove that both the value function gap and the constraint violation of the output policy converge at the rate O(epsilon+1/sqrt(T)), where epsilon is the accuracy level of the primal update, and T is the number of iterations. We further show that the weighted output policy outperforms the existing uniformly chosen policy.

Abstract: Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) emerges as a promising approach to circumvent this. In RLHF, the agent typically learns a reward function by querying a human teacher using pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries necessary to learn an informative reward function since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, onpolicy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.

Abstract: Unsupervised domain adaptation (UDA) is a machine learning approach designed to minimize reliance on labeled data by aligning features between a labeled source domain and an unlabeled target domain, thereby reducing feature discrepancies, which is efficient for multivariate time series (MTS) prediction. However, most MTS UDA methods focus solely on aligning intraseries temporal features, overlooking the valuable information in inter-series dependencies. Research has highlighted that analyzing decomposed frequency dependencies in time series can reveal significant trends, noise patterns, and intricate temporal details. To address these unexplored frequency dependencies, we introduce the Frequency Graph Discovery Module (FGD), which uncovers and aligns shared frequency information and correlations across domains. Additionally, we propose a Frequency-Contextual Contrastive Learning (FCCL) framework to better capture and align frequency-contextual representations in multivariate time series, ensuring the extraction of label-invariant information for prediction. Furthermore, considering existing models overlooking the valuable and abundant information outside source and target dataset, we enhance the MTS UDA prediction model with a Language-guided Adversary Alignment (LAA) module, which leverages the advancement and capabilities of Large Language Models (LLMs) to get text-encoded labeled embeddings and align the classification features, thereby improving prediction accuracy. Our model achieves state-of-the-art results on three public multivariate time-series datasets for unsupervised domain adaptation, as demonstrated by empirical evidence.

Abstract: Efficient transfer learning methods such as adapterbased methods have shown great success in unimodal models and vision-language models. However, existing methods have two main challenges in fine-tuning multimodal models. Firstly, they are designed for vision-language tasks and fail to extend to situations where there are more than two modalities. Secondly, they exhibit limited exploitation of interactions between modalities and lack efficiency. To address these issues, in this paper, we propose the loW-rank sequence multimodal adapter (Wander). We first use the outer product to fuse the information from different modalities in an element-wise way effectively. For efficiency, we use CP decomposition to factorize tensors into rank-one components and achieve substantial parameter reduction. Furthermore, we implement a token-level low-rank decomposition to extract more fine-grained features and sequence relationships between modalities. With these designs, Wander enables token-level interactions between sequences of different modalities in a parameter-efficient way. We conduct extensive experiments on datasets with different numbers of modalities, where Wander outperforms state-of-the-art efficient transfer learning methods consistently. The results fully demonstrate the effectiveness, efficiency and universality of Wander.

Abstract: Federated Learning (FL) has emerged as a promising approach for privacypreserving model training across decentralized devices. However, it faces challenges such as statistical heterogeneity and susceptibility to adversarial attacks, which can impact model robustness and fairness. Personalized FL attempts to provide some relief by customizing models for individual clients. However, it falls short in addressing server-side aggregation vulnerabilities. We introduce a novel method called FedAA, which optimizes client contributions via Adaptive Aggregation to enhance model robustness against malicious clients and ensure fairness across participants in non-identically distributed settings. To achieve this goal, we propose an approach involving a Deep Deterministic Policy Gradient-based algorithm for continuous control of aggregation weights, an innovative client selection method based on model parameter distances, and a reward mechanism guided by validation set performance. Empirically, extensive experiments demonstrate that, in terms of robustness, FedAA outperforms the state-of-the-art methods, while maintaining comparable levels of fairness, offering a promising solution to build resilient and fair federated systems.

Abstract: Deep learning timeseries processing often relies on convolutional neural networks with overlapping windows. This overlap allows the network to produce an output faster than the window length. However, it introduces additional computations. This work explores the potential to optimize computational efficiency during inference by exploiting convolution's shift-invariance properties to skip the calculation of layer activations between successive overlapping windows. Although convolutions are shift-invariant, zero-padding and pooling operations, widely used in such networks, are not and complicate efficient streaming inference. We introduce StreamiNNC, a strategy to deploy Convolutional Neural Networks for online streaming inference. We explore the adverse effects of zero padding and pooling on the accuracy of streaming inference, deriving theoretical error upper bounds for pooling during streaming. We address these limitations by proposing signal padding and pooling alignment and provide guidelines for designing and deploying models for StreamiNNC. We validate our method in simulated data and on three real-world biomedical signal processing applications. StreamiNNC achieves a low deviation between streaming output and normal inference for all three networks (2.03 - 3.55% NRMSE). This work demonstrates that it is possible to linearly speed up the inference of streaming CNNs processing overlapping windows, negating the additional computation typically incurred by overlapping windows.

Abstract: AI becomes increasingly vital for telecom industry, as the burgeoning complexity of upcoming mobile communication networks places immense pressure on network operators. While there is a growing consensus that intelligent network selfdriving holds the key, it heavily relies on expert experience and knowledge extracted from network data. In an effort to facilitate convenient analytics and utilization of wireless big data, we introduce the concept of knowledge graphs into the field of mobile networks, giving rise to what we term as wireless data knowledge graphs (WDKGs). However, the heterogeneous and dynamic nature of communication networks renders manual WDKG construction both prohibitively costly and error-prone, presenting a fundamental challenge. In this context, we propose an unsupervised data-and-model driven graph structure learning (DMGSL) framework, aimed at automating WDKG refinement and updating. Tackling WDKG heterogeneity involves stratifying the network into homogeneous layers and refining it at a finer granularity. Furthermore, to capture WDKG dynamics effectively, we segment the network into static snapshots based on the coherence time and harness the power of recurrent neural networks to incorporate historical information. Extensive experiments conducted on the established WDKG demonstrate the superiority of the DMGSL over the baselines, particularly in terms of node classification accuracy.

Abstract: In this paper, we draw an analogy between processing natural languages and processing multivariate event streams from vehicles in order to predict when and what error pattern is most likely to occur in the future for a given car. Our approach leverages the temporal dynamics and contextual relationships of our event data from a fleet of cars. Event data is composed of discrete values of error codes as well as continuous values such as time and mileage. Modelled by two causal Transformers, we can anticipate vehicle failures and malfunctions before they happen. Thus, we introduce CarFormer, a Transformer model trained via a new selfsupervised learning strategy, and EPredictor, an autoregressive Transformer decoder model capable of predicting when and what error pattern will most likely occur after some error code apparition. Despite the challenges of high cardinality of event types, their unbalanced frequency of appearance and limited labelled data, our experimental results demonstrate the excellent predictive ability of our novel model. Specifically, with sequences of 160 error codes on average, our model is able with only half of the error codes to achieve 80% F1 score for predicting what error pattern will occur and achieves an average absolute error of 58.4 ± 13.2h when forecasting the time of occurrence, thus enabling confident predictive maintenance and enhancing vehicle safety.

Abstract: Shapley valuebased explanations are widely utilized to demystify predictions made by opaque models. Approaches to estimating Shapley values often approximate explanation games as inessential and estimate the Shapley value directly as feature attribution with a limited capacity to quantify feature interactions. This paper introduces a new approach for calculating Shapley values that relaxes the assumption of inessential games and is proven to provide additive feature attribution. The initial formulation of the proposed approach includes the estimation of game values in their Möbius representation with exponentially many parameters, but we put forward a polynomial-time algorithm designed to manage the game's numerous values and achieve an efficient linear-time computation of the Shapley value. Moreover, this formulation uniquely enables identifying only the significant high-order feature interactions amidst a potentially exponential set. Through experiments, we demonstrate the robust performance of our methodology in game estimation and in providing explanations for multiple black-box models.

Abstract: Multivariate Time Series Classification (MTSC) is crucial in extensive practical applications, such as environmental monitoring, medical EEG analysis, and action recognition. Realworld time series datasets typically exhibit complex dynamics. To capture this complexity, RNN-based, CNN-based, Transformer-based, and hybrid models have been proposed. Unfortunately, current deep learning-based methods often neglect the simultaneous construction of local features and global dependencies at different time scales, lacking sufficient feature extraction capabilities to achieve satisfactory classification accuracy. To address these challenges, we propose a novel Multiscale Periodic Time Series Network (MPTSNet), which integrates multiscale local patterns and global correlations to fully exploit the inherent information in time series. Recognizing the multi-periodicity and complex variable correlations in time series, we use the Fourier transform to extract primary periods, enabling us to decompose data into multiscale periodic segments. Leveraging the inherent strengths of CNN and attention mechanism, we introduce the PeriodicBlock, which adaptively captures local patterns and global dependencies while offering enhanced interpretability through attention integration across different periodic scales. The experiments on UEA benchmark datasets demonstrate that the proposed MPTSNet outperforms 21 existing advanced baselines in the MTSC tasks.

Abstract: Inspired by the human brain's ability to adapt to new tasks without erasing prior knowledge, we develop spiking neural networks (SNNs) with dynamic structures for Class Incremental Learning (CIL). Our analytical experiments reveal that limited datasets introduce biases in logits distributions among tasks. Fixed features from frozen pasttask extractors can cause overfitting and hinder the learning of new tasks. To address these challenges, we propose the ALADE-SNN framework, which includes adaptive logit alignment for balanced feature representation and OtoN suppression to manage weights mapping frozen old features to new classes during training, releasing them during fine-tuning. This approach dynamically adjusts the network architecture based on analytical observations, improving feature extraction and balancing performance between new and old tasks. Experiment results show that ALADE-SNN achieves an average incremental accuracy of 75.42 ± 0.74% on the CIFAR100-B0 dataset over 10 incremental steps. ALADE-SNN not only matches the performance of DNN-based methods but also surpasses state-of-the-art SNN-based continual learning algorithms. This advancement enhances continual learning in neuromorphic computing, offering a brain-inspired, energy-efficient solution for real-time data processing.

Abstract: The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherrypicking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets — what most studies report — 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.

Abstract: The recent surge in popularity of cloudnative applications using microservice architectures has led to a focus on accurate end-to-end latency prediction for proactive resource allocation. Existing models leverage Graph Transformers to Microservice Call Graphs or the Program Evaluation and Review Technique (PERT) graphs to capture complex temporal dependencies between microservices. However, these models incur a high computational cost during both training and inference phases. This paper introduces FastPERT, an efficient model for predicting end-to-end latency in microservice applications. FastPERT dissects an execution trace into several microservices tasks, using observations from prior execution traces of the application, akin to the PERT approach. Subsequently, a prediction model is constructed to estimate the completion time for each individual task. This information, coupled with the computational and structural inductive bias of the PERT graph, facilitates the efficient computation of the end-to-end latency of an execution trace. As a result, FastPERT can efficiently capture the complex temporal causality of different microservice tasks without relying on Graph Neural Networks, leading to more accurate and robust latency predictions across a variety of applications. An evaluation based on datasets generated from large-scale Alibaba microservice traces reveals that FastPERT significantly improves training and inference efficiency without compromising performance, demonstrating its potential as a superior solution for real-time end-to-end latency prediction in cloud-native microservice applications.

Abstract: To address the challenges of imbalanced multiclass datasets typically used for rare event detection in critical cyber-physical systems, we propose an optimal, efficient, and adaptable mixed integer programming (MIP) ensemble weighting scheme. Our approach leverages the diverse capabilities of the classifier ensemble on a granular per class basis, while optimizing the weights of classifier-class pairs using elastic net regularization for improved robustness and generalization. Additionally, it seamlessly and optimally selects a predefined number of classifiers from a given set. We evaluate and compare our MIP-based method against six well-established weighting schemes, using representative datasets and suitable metrics, under various ensemble sizes. The experimental results reveal that MIP outperforms all existing approaches, achieving an improvement in balanced accuracy ranging from 0.99% to 7.31%, with an overall average of 4.53% across all datasets and ensemble sizes. Furthermore, it attains an overall average increase of 4.63%, 4.60%, and 4.61% in macro-averaged precision, recall, and F1-score, respectively, while maintaining computational efficiency.

Abstract: Selective statespace models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations.

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China., Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China., Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China., Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) often use large image tokens to compensate the visual shortcoming of MLLMs, which not only exhibits obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens still remains a challenge. In this paper, we propose a novel and trainingfree approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune considers token pruning as a statistical problem of MLLM and its objective is to find out an optimal pruning scheme that can minimize the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding the expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that our FitPrune can not only reduce the computational complexity to a large extent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes.

Abstract: Encoding time series into tokens and using language models for processing has been shown to substantially augment the models' ability to generalize to unseen tasks. However, existing language models for time series forecasting encounter several obstacles, including aliasing distortion and prolonged inference times, primarily due to the limitations of quantization processes and the computational demands of large models. This paper introduces ApolloForecast, a novel framework that tackles these challenges with two key innovations: the Anti-Aliasing Quantization Module (AAQM) and the Race Decoding (RD) technique. AAQM adeptly encodes sequences into tokens while mitigating high-frequency noise in the original signals, thus enhancing both signal fidelity and overall quantization efficiency. RD employs a draft model to enable parallel processing and results integration, which markedly accelerates the inference speed for long-term predictions, particularly in large-scale models. Extensive experiments on various real-world datasets show that Apollo-Forecast outperforms state-of-the-art methods by 35.41% and 18.99% in WQL and MASE metrics, respectively, in zero-shot scenarios. Furthermore, our method achieves an acceleration of 1.9X-2.7X in inference speed over the baseline methods.

Abstract: Video speaking style recognition (VSSR) aims to classify different types of conversations in videos, contributing significantly to understanding human interactions. A significant challenge in VSSR is the inherent similarity among conversation videos, which makes it difficult to distinguish between different speaking styles. Existing VSSR methods commit to providing available multimodal information to enhance the differentiation of conversation videos. Nevertheless, treating each modality equally leads to a suboptimal result for these methods due to text is inherently more aligned with conversation understanding compared to nonverbal modalities. To address this issue, we propose a textguided nonverbal enhancement method, TNvE, which is composed of two core modules: 1) a text-guided nonverbal representation selection module employs cross-modal attention based on modality-invariant representations, picking out critical nonverbal information via textual guide; and 2) a modality-invariant and -specific representation decoupling module incorporates modality-specific representations and decouples them from modality-invariant representations, enabling a more comprehensive understanding of multimodal data. The former module encourages multimodal representations close to each other, while the latter module provides unique characteristics of each modality as a supplement. Extensive experiments are conducted on long-form video understanding datasets to demonstrate that TNvE is highly effective for VSSR, achieving a new state-of-the-art.

School of Artificial Intelligence, Jilin University, Changchun, Jilin, China, School of Artificial Intelligence, Jilin University, Changchun, Jilin, China, School of Artificial Intelligence, Jilin University, Changchun, Jilin, China, School of Artificial Intelligence, Jilin University, Changchun, Jilin, China International Center of Future Science, Jilin University, Changchun, Jilin, China Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, Changchun, Jilin, China

Abstract: Blackbox prompt tuning has become a prevalent parameter-efficient paradigm that leverages the capabilities of large language models (LLMs) for customized applications in specific downstream tasks. In practical scenarios, downstream tasks frequently involve data distributions that are heavily imbalanced. Such imbalances tend to impair performance, causing severe performance collapse in minority classes. Conducting effective imbalanced black-box prompt tuning to mitigate the adverse effects of imbalanced data distribution on prompt performance remains a significant challenge. In this paper, we propose black-box prompt tuning with first and zeroth order gradient (BPT-FZG) for handling the imbalanced data. Specifically, BPT-FZG introduces AUC maximization as the objective for prompt tuning and equivalently formulates it as a nonconvex-concave saddle point problem to avoid the construction of sample pairs from opposite classes. Indeed, BPT-FZG optimizes the latent representation of the continuous prompt in the low-dimensional subspace with AUC loss and leverages the first and zeroth order gradients alternately to update the parameters. Furthermore, we establish the theoretical convergence guarantee for BPT-FZG under common assumptions, showing that our method can find a stationary point of the objective function. Our experiments on RoBERTa-large, GPT2-XL, and Llama3 show that BPT-FZG achieves improvement on various imbalanced datasets, emphasizing the effectiveness of our methods.

Abstract: Contrastive learning is a prevalent technique in selfsupervised vision representation learning, typically generating positive pairs by applying two data augmentations to the same image. Designing effective data augmentation strategies is crucial for the success of contrastive learning. Inspired by the story of the blind men and the elephant, we introduce JointCrop and JointBlur. These methods generate more challenging positive pairs by leveraging the joint distribution of the two augmentation parameters, thereby enabling contrastive learning to acquire more effective feature representations. To the best of our knowledge, this is the first effort to explicitly incorporate the joint distribution of two data augmentation parameters into contrastive learning. As a plug-and-play framework without additional computational overhead, JointCrop and JointBlur enhance the performance of SimCLR, BYOL, MoCo v1, MoCo v2, MoCo v3, SimSiam, and Dino baselines with notable improvements.

Abstract: LLMbased autonomous agents often fail to execute complex web tasks that require dynamic interaction, largely due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, lacking the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies based on new observations, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks, continuously refining this plan through reflective analysis of new observations and previous subtask attempts, thereby focusing the search process and mitigating challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information by iteratively refining decisions based on new observations. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot advances autonomous agents, enabling more reliable decision-making in practical environments.

Abstract: Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a lowresource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.

Abstract: Large Language Models (LLMs) are susceptible to malicious influence by cyber attackers through intrusions such as adversarial, backdoor, and embedding inversion attacks. In response, the burgeoning field of LLM Security aims to study and defend against such threats. Thus far, the majority of works in this area have focused on monolingual English models; however, emerging research suggests that multilingual LLMs may be more vulnerable to various attacks than their monolingual counterparts. While previous work has investigated embedding inversion over a small subset of European languages, it is challenging to extrapolate these findings to languages from different linguistic families and with differing scripts. To this end, we explore the security of multilingual LLMs in the context of embedding inversion attacks and investigate crosslingual and cross-script inversion across 20 languages, spanning over 8 language families and 12 scripts. Our findings indicate that languages written in Arabic and Cyrillic scripts are particularly vulnerable to embedding inversion, as are languages within the Indo-Aryan language family. We further observe that inversion models tend to suffer from language confusion, sometimes significantly reducing the efficacy of an attack. Accordingly, we systematically explore this bottleneck for inversion models, uncovering predictable patterns attackers could leverage. Ultimately, this study aims to further the field's understanding of the outstanding security vulnerabilities facing multilingual LLMs and raise awareness for the languages most at risk of negative impact from these attacks.

Abstract: Reviewer Matchmaking (RM) is a pivotal process in academic publishing that aligns manuscripts with appropriate reviewers based on their expertise and prior publications. The demand for an automated RM system has escalated with the significant surge in submissions over the past decade. Stateof-the-art (SOTA) RM models are document-representation-based (DR-RM) and match the manuscript and reviewer's past publication using a similarity method defined on a high-dimensional vector space. However, they are far from accurate despite their large-scale usage. In this paper, we establish that conventional RM evaluation measures are unreliable and instead emphasize that standard correlation measures are adequate. For the first time, we compare the performance of six SOTA DR-RM models with those of fourteen SOTA Key-phrase Extraction-based RM (KPE-RM) models - an alternate unexplored approach. We observe that KPE-RM models show comparable results in many cases, with the new best model being PatternRank-RM - a KPE-RM model beating the best DR-RM model SPECTER2-RM (Pearson: 0.004+, Spearman: 0.006+, Kendall: 0.043+). We conclude that KPE-RM models must be contextualized to the RM task and cannot be used as plug-n-play.

Abstract: Knowledge editing aims to update outdated or incorrect knowledge in large language models (LLMs). However, current knowledge editing methods have limited scalability for lifelong editing. This study explores the fundamental reason why knowledge editing fails in lifelong editing. We begin with the closedform solution derived from linear associative memory, which underpins state-of-the-art knowledge editing methods. We extend the solution from single editing to lifelong editing, and through rigorous mathematical derivation, identify an interference term in the final solution, suggesting that editing knowledge may impact irrelevant knowledge. Further analysis of the interference term reveals a close relationship with superposition between knowledge representations. When knowledge superposition does not exist in language models, the interference term vanishes, allowing for lossless knowledge editing. Experiments across numerous language models reveal that knowledge superposition is universal, exhibiting high kurtosis, zero mean, and heavy-tailed distributions with clear scaling laws. Ultimately, by combining theory and experiments, we demonstrate that knowledge superposition is the fundamental reason for the failure of lifelong editing. Moreover, this is the first study to investigate knowledge editing from the perspective of superposition and provides a comprehensive observation of superposition across numerous real-world language models.

Abstract: Crossprompt automated essay scoring (AES) aims to train models using essays from different source prompts and test them on new target prompt essays. A core challenge of the task is to learn as much shared knowledge as possible between essays from different prompts in order to better represent new prompt essays. Previous studies primarily focus on learning this knowledge on a general, coarse-grained level, ignoring that the shared knowledge among prompts is highly detailed and contains a more comprehensive range of information that is not fully investigated. In this paper, we propose a novel multi-aspect knowledge finding and aligning optimization strategy to better acquire this detailed various shared knowledge. We also introduce LLM to extract explicit, interpretable knowledge from implicit, multi-aspect shared knowledge and use this knowledge to improve the representation and evaluation performance of new prompt essays. We conduct extensive experiments on public datasets. The results show that our approach outperforms current state-of-the-art models and is effective on cross-prompt AES.

Abstract: Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes the naive RAG system a step further by integrating graph information, such as knowledge graph (KGs), into largescale language models (LLMs) to mitigate hallucination. However, existing GRAG still encounter limitations: 1) simple paradigms usually fail with the complex problems due to the narrow and shallow correlations capture from KGs 2) methods of strong coupling with KGs tend to be high computation cost and time consuming if the graph is dense. In this paper, we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for enabling LLMs to think ``community by community" within KGs. To do this, FastToG employs community detection for deeper correlation capture and two stages community pruning - coarse and fine pruning for faster retrieval. Furthermore, we also develop two Community-to-Text methods to convert the graph structure of communities into textual form for better understanding by LLMs. Experimental results demonstrate the effectiveness of FastToG, showcasing higher accuracy, faster reasoning, and better explainability compared to the previous works.

Abstract: Document summarization has greatly benefited from advances in large language models (LLMs). In realworld situations, summaries often need to be generated from multiple documents with diverse sources and authors, lacking a clear information flow. Naively concatenating these documents and generating a summary can lead to poorly structured narratives and redundancy. Additionally, attributing each part of the generated summary to a specific source is crucial for reliability. In this study, we address multi-document summarization with attribution using our proposed solution ***MiDAS-PRo***, consisting of three stages: (i) Planning the hierarchical organization of source documents, (ii) Reasoning by generating relevant entities/topics, and (iii) Summary Generation. We treat the first two sub-problems as a code completion task for LLMs. By incorporating well-selected in-context learning examples through a graph attention network, LLMs effectively generate plans and reason topics for a document collection. Experiments on summarizing scientific articles from public datasets show that our approach outperforms state-of-the-art baselines in both automated and human evaluations.

Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, China, The Chinese University of Hong Kong, Hong Kong, China, The Chinese University of Hong Kong, Hong Kong, China, Harbin Institute of Technology, Shenzhen, China, Shenzhen Institutes of Advanced Technology Chinese Academy of Sciences, Shenzhen, China, Huawei Noah’s Ark Lab, Shenzhen, China, The Chinese University of Hong Kong, Hong Kong, China, Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, China Peng Cheng Laboratory, Shenzhen, China

Abstract: Stickers are widely used in online chatting, which can vividly express someone's intention, emotion, or attitude. Existing conversation research typically retrieves stickers based on a single session or the previous textual information, which can not adapt to the multimodal and multi-session nature of the real-world conversation. To this end, we introduce MultiChat, a new dataset for sticker retrieval facing the multi-modal and multi-session conversation, comprising 1,542 sessions, featuring 50,192 utterances and 2,182 stickers. Based on the created dataset, we propose a novel Intent-Guided Sticker Retrieval (IGSR) framework that retrieves stickers for multi-modal and multi-session conversation history drawing support from intent learning. Specifically, we introduce sticker attributes to better leverage the sticker information in multi-modal conversation, which are incorporated with utterances to construct a memory bank. Further, we extract relevant memories for the current conversation from the memory bank to identify the intent of the current conversation, and then retrieve a sticker to respond guided by the intent. Extensive experiments on our MultiChat dataset reveal the robustness and effectiveness of our IGSR approach in multi-session, multi-modal scenarios.

Abstract: Large Language Models (LLMs) have permeated various Natural Language Processing (NLP) tasks. For the summarization tasks, LLMs can generate wellstructured rationales, which consist of Essential Aspects (EA), Associated Sentences (AS) and Triple Entity Relations (TER). These rationales guide smaller models (≤1B) to produce better summaries. However, their high deployment costs (≥70B), such as substantial storage space and high computing requirements, limit their utilization in resource-constrained environments. Furthermore, effectively distilling these structured rationales from LLMs into Small Language Models (SLMs) models remains a challenge. To address this, we propose the LLM-based Structured Rationale-guided Multi-view Weak-gated Fusion framework (LSR-MWF). The framework initially employs LLMs to dig structural rationales from a document, considering multiple viewpoints such as EA, AS, and TER. Then, it develop a multi-step summary generation evaluation strategy to select high-quality structured rationales. Subsequently, it aligns with these rationales using additional modules organized in a hierarchical structure. Finally, the framework integrates the features output by these modules with original abstractive model through a weak-gated mechanism. Experimental results on two publicly available CNN/DailyMail and XSum datasets show that our method improves the performance of the abstractive model, outperforming baselines by 11.2％ and 5.8％, respectively. In addition, our method improves the interpretability of summary generation from the viewpoints of EA, AS and TER.

Abstract: Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multiobjective approach within a multi-agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA involves deploying multiple agents to perform causal interventions on bias-related contents of the input questions, breaking the shortcut connection between these contents and the corresponding answers. Unlike traditional debiasing techniques leading to performance degradation, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments conducted in two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% in the BBQ dataset. Additionally, it significantly enhances the multi-objective metric icat in the StereoSet dataset by up to 58.1%.

Theta Health Inc. Chen Frontier Lab for AI and Mental Health, Tianqiao and Chrissy Chen Institute, Shanghai, China, Theta Health Inc., Theta Health Inc., Theta Health Inc., Theta Health Inc., Chen Frontier Lab for AI and Mental Health, Tianqiao and Chrissy Chen Institute, Shanghai, China, Shanghai Mental Health Center Shanghai Jiao Tong University School of Medicine Shanghai Clinical Research Center for Mental Health Shanghai Key Laboratory of Psychotic Disorders, Theta Health Inc. Chen Frontier Lab for AI and Mental Health, Tianqiao and Chrissy Chen Institute, Shanghai, China

Abstract: The clinical diagnosis of most mental disorders primarily relies on the conversations between psychiatrist and patient. The creation of such diagnostic conversation datasets is promising to boost the AI mental healthcare community. However, directly collecting the conversations in real diagnosis scenarios is near impossible due to stringent privacy and ethical considerations. To address this issue, we seek to synthesize diagnostic conversation by exploiting anonymized patient cases that are easier to access. Specifically, we design a neurosymbolic multi-agent framework for synthesizing the diagnostic conversation of mental disorders with large language models. It takes patient case as input and is capable of generating multiple diverse conversations with one single patient case. The framework basically involves the interaction between a doctor agent and a patient agent, and generates conversations under symbolic control via a dynamic diagnosis tree. By applying the proposed framework, we develop the largest Chinese mental disorders diagnosis dataset MDD-5k. This dataset is built upon 1000 real, anonymized patient cases by cooperating with Shanghai Mental Health Center and comprises 5000 high-quality long conversations with diagnosis results and treatment opinions as labels. To the best of our knowledge, it's also the first labeled dataset for Chinese mental disorders diagnosis. Human evaluation demonstrates the proposed MDD-5k dataset successfully simulates human-like diagnostic process of mental disorders.

Abstract: As Large Language Models (LLMs) continue to evolve, more are being designed to handle longcontext inputs. Despite this advancement, most of them still face challenges in accurately handling long-context tasks, often showing the "lost in the middle" issue. We identify that insufficient retrieval capability is one of the important reasons for this issue. To tackle this challenge, we propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context. Specially, we incorporate an additional part named "paraphrasing the original text" when constructing the answer of training samples and then fine-tuning the model. Experimenting on LongBench and NaturalQuestions Multi-document-QA dataset with models of Llama and Qwen series, our method achieves an improvement of up to 8.48% and 4.48% in average scores, respectively, showing effectiveness in improving the model’s performance on long-context tasks.

Abstract: Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. This paper presents a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pre-trained language model can serve at the same time as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such a simple conceptual approach combined with pre-training and scaling can perform on par or better than the current best-specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.

Abstract: Link prediction in dynamic graphs (LPDG) has been widely applied to realworld applications such as website recommendation, traffic flow prediction, organizational studies, etc. These models are usually kept local and secure, with only the interactive interface restrictively available to the public. Thus, the problem of the black-box evasion attack on the LPDG model, where model interactions and data perturbations are restricted, seems to be essential and meaningful in practice. In this paper, we propose the first practicable black-box evasion attack method that achieves effective attacks against the target LPDG model, within a limited amount of interactions and perturbations. To perform effective attacks under limited perturbations, we develop a graph sequential embedding model to find the desired state embedding of the dynamic graph sequences, under a deep reinforcement learning framework. To overcome the scarcity of interactions, we design a multi-environment training pipeline and train our agent for multiple instances, by sharing an aggregate interaction buffer. Finally, we evaluate our attack against three advanced LPDG models on three real-world graph datasets of different scales and compare its performance with related methods under the interaction and perturbation constraints. Experimental results show that our attack is both effective and practicable.

Abstract: In the last few years, the artifact patterns in fake images synthesized by different generative models have been inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional target for a classifier. Specifically, we introduce a selfsupervised feature mapping process for natural trace extraction and develop a transfer learning based on soft contrastive loss to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising GANs and diffusion models, to evaluate the effectiveness in generalizing unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method gives 96.2% mAP significantly outperforms the baselines. Extensive experiments conducted on the widely recognized platform Midjourney reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment.

Abstract: Hardness of modeling a planning domain is a major obstacle for making automated planning techniques accessible. We developed a tool that helps modelers correct domains based on available information such as the known feasibility or infeasibility of certain plans. Designing model repair strategies that are capable of repairing flawed planning domains automatically has been explored in previous work to use positive plans (invalid in the given (flawed) domain but feasible in the ``true'' domain). In this work, we highlight the importance of and study counterexample negative plans (valid in the given (flawed) domain but infeasible in the ``true'' domain). Our approach automatically corrects a domain by finding an optimal repair set to the domain which turns all negative plans into non-solutions, in addition to making all positive plans solutions. Experiments indicate strong performance in the fast-downward benchmark suite with random errors. A handcrafted benchmark with domain flaws inspired by some practical applications also motivates the method's efficacy.

Abstract: Automatic Heuristic Design (AHD) is an active research area due to its utility in solving complex search and NPhard combinatorial optimization problems in the real world. The recent advancements in Large Language Models (LLMs) introduce new possibilities by coupling LLMs with evolutionary computation to automatically generate heuristics, known as LLM-based Evolutionary Program Search (LLM-EPS). While previous LLM-EPS studies obtained great performance on various tasks, there is still a gap in understanding the properties of heuristic search spaces and achieving a balance between exploration and exploitation, which is a critical factor in large heuristic search spaces. In this study, we address this gap by proposing two diversity measurement metrics and perform an analysis on previous LLM-EPS approaches, including FunSearch, EoH, and ReEvo. Results on black-box AHD problems reveal that while EoH demonstrates higher diversity than FunSearch and ReEvo, its objective score is unstable. Conversely, ReEvo's reflection mechanism yields good objective scores but fails to optimize diversity effectively. With this finding in mind, we introduce HSEvo, an adaptive LLM-EPS framework that maintains a balance between diversity and convergence with a harmony search algorithm. Through experimentation, we find that HSEvo achieved high diversity indices and good objective scores while remaining cost-effective. These results underscore the importance of balancing exploration and exploitation and understanding heuristic search spaces in designing frameworks in LLM-EPS.

Abstract: Though numerous solvers have been proposed for the MaxSAT problem, and the benchmark environment such as MaxSAT Evaluations provides a platform for the comparison of the stateof-the-art solvers, existing assessments were usually evaluated based on the quality, e.g., fitness, of the best-found solutions obtained within a given running time budget. However, concerning solely the final obtained solutions regarding specific time budgets may restrict us from comprehending the behavior of the solvers along the convergence process. This paper demonstrates that Empirical Cumulative Distribution Functions can be used to compare MaxSAT stochastic local search solvers' anytime performance across multiple problem instances and various time budgets. The assessment reveals distinctions in solvers' performance and displays that the (dis)advantages of solvers adjust along different running times. This work also exhibits that the quantitative and high variance assessment of anytime performance can guide machines, i.e., automatic configurators, to search for better parameter settings. Our experimental results show that the hyperparameter optimization tool, i.e., SMAC, can achieve better parameter settings of solvers when using the anytime performance as the cost function, compared to using the metrics based on the fitness of the best-found solutions.

Abstract: Knowledge tracing (KT) models students' knowledge states and predicts their future performance based on their historical interaction data. However, attention based KT models struggle to accurately capture diverse forgetting behaviors in evergrowing interaction sequences. First, existing models use uniform time decay matrices, conflating forgetting representations with problem relevance. Second, the fixed-length window prediction paradigm fails to model continuous forgetting processes in expanding sequences. To address these challenges, this paper introduces LefoKT, a unified architecture that enhances attention based KT models by incorporating proposed relative forgetting attention. LefoKT improves forgetting modeling through relative forgetting attention to decouple forgetting patterns from problem relevance. It also enhances attention based KT models' length extrapolation capability for capturing continuous forgetting processes in ever-growing interaction sequences. Extensive experimental results on three datasets validate the effectiveness of LefoKT.

Abstract: Coral reefs play a crucial role in marine ecosystems, offering a nutrientrich environment and safe shelter for numerous marine species. Automated coral image recognition aids in monitoring ocean health at a scale without experts' manual effort. Recently, large vision-language models like CLIP have greatly enhanced zero-shot and low-shot classification capabilities for various visual tasks. However, these models struggle with fine-grained coral-related tasks due to a lack of specific knowledge. To bridge this gap, we compile a fine-grained coral image dataset consisting of 16,659 images with taxonomy labels (from Kingdom to Species), accompanied by morphology-specific text descriptions for each species. Based on the dataset, we propose CORAL-Adapter, integrating two complementary kinds of coral-specific knowledge (biological taxonomy and coral morphology) with general knowledge learned by CLIP. CORAL-Adapter is a simple yet powerful extension of CLIP with only a few parameter updates and can be used as a plug-and-play module with various CLIP-based methods. We show improvements in accuracy across diverse coral recognition tasks, e.g., recognizing corals unseen during training that are prone to bleaching or originate from different oceans.

Abstract: As the scale and capabilities of Large Language Models (LLMs) increase, their applications in knowledgeintensive fields such as legal domain have garnered widespread attention. However, it remains doubtful whether these LLMs make judgments based on domain knowledge for reasoning. If LLMs base their judgments solely on specific words or patterns, rather than on the underlying logic of the language, the “LLM-as-judges” paradigm poses substantial risks in the real-world applications. To address this question, we propose a method of legal knowledge injection attacks for robustness testing, thereby inferring whether LLMs have learned legal knowledge and reasoning logic. In this paper, we propose J&H: an evaluation framework for detecting the robustness of LLMs under knowledge injection attacks in the legal domain. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. To further this aim, we have attacked each part of the reasoning logic underlying these tasks (major premise, minor premise, and conclusion generation). We have collected mistakes that legal experts might make in judicial decisions in the real world, such as typos, legal synonyms, inaccurate external legal statutes retrieval. However, in real legal practice, legal experts tend to overlook these mistakes and make judgments based on logic. However, when faced with these errors, LLMs are likely to be misled by typographical errors and may not utilize logic in their judgments. We conducted knowledge injection attacks on existing general and domain-specific LLMs. Current LLMs are not robust against the attacks employed in our experiments. In addition we propose and compare several methods to enhance the knowledge robustness of LLMs. All code can be found at the link.

Abstract: Understanding internal joint loading is critical for diagnosing gaitrelated diseases such as knee osteoarthritis; however, current methods of measuring joint risk factors are time-consuming, expensive, and restricted to lab settings. In this paper, we enable the large-scale, cost-effective biomechanical analysis of joint loading via three key contributions: the development and deployment of novel instrumented insoles, the creation of a large multimodal biomechanics dataset (VidSole), and a baseline deep learning pipeline to predict internal joint loading factors. Our novel instrumented insole measures the tri-axial forces and moments across five high-pressure points under the foot. VidSole consists of the forces and moments measured by these insoles along with corresponding RGB video from two viewpoints, 3D body motion capture, and force plate data for over 2,600 trials of 52 diverse participants performing four fundamental activities of daily living (sit-to-stand, stand-to-sit, walking, and running). We feed the insole data and kinematic parameters extractable from video (i.e., pose, knee angle) into a deep learning pipeline consisting of an ensemble Gated Recurrent Unit (GRU) activity classifier followed by activity-specific Long Short Term Memory (LSTM) regression networks to estimate knee adduction moment (KAM), a biomechanical risk factor for knee osteoarthritis. The successful classification of activities at an accuracy of 99.02 percent and KAM estimation with mean absolute error (MAE) less than 0.5 percent*body weight*height, the current threshold for accurately detecting knee osteoarthritis with KAM, illustrates the usefulness of our dataset for future research and clinical settings.

Abstract: Public health programs often provide interventions to encourage program adherence, and effectively allocating interventions is vital for producing the greatest overall health outcomes, especially in underserved communities where resources are limited. Such resource allocation problems are often modeled as restless multiarmed bandits (RMABs) with unknown underlying transition dynamics, hence requiring online reinforcement learning (RL). We present Bayesian Learning for Contextual RMABs (BCoR), an online RL approach for RMABs that novelly combines techniques in Bayesian modeling with Thompson sampling to flexibly model the complex RMAB settings present in public health program adherence problems, namely context and non-stationarity. BCoR's key strength is the ability to leverage shared information within and between arms to learn the unknown RMAB transition dynamics quickly in intervention-scarce settings with relatively short time horizons, which is common in public health applications. Empirically, BCoR achieves substantially higher finite-sample performance over a range of experimental settings, including a setting using real-world adherence data that was developed in collaboration with ARMMAN, an NGO in India which runs a large-scale maternal mHealth program, showcasing BCoR practical utility and potential for real-world deployment.

Abstract: Large language models (LLMs) offer a valuable technology for various applications in healthcare. However, their tendency to hallucinate and the existing reliance on proprietary systems pose challenges in environments concerning critical decisionmaking and strict data privacy regulations, such as healthcare, where the trust in such systems is paramount. Through combining the strengths and discounting the weaknesses of humans and AI, the field of Human-AI Collaboration (HAIC) presents one front for tackling these challenges and hence improving trust. This paper presents a novel HAIC \textit{guided deferral} system that can simultaneously parse medical reports for disorder classification, and defer uncertain predictions with intelligent guidance to humans. We develop methodology which builds efficient, effective and open-source LLMs for this purpose, for the real-world deployment in healthcare. We conduct a pilot study which showcases the effectiveness of our proposed system in practice. Additionally, we highlight drawbacks of standard calibration metrics in imbalanced data scenarios commonly found in healthcare, and suggest a simple yet effective solution: the Imbalanced Expected Calibration Error.

Abstract: Recent advances in eXplainable AI (XAI) for education have highlighted a critical challenge: ensuring that explanations for stateof-the-art models are understandable for non-technical users such as educators and students. In response, we introduce iLLuMinaTE, a zero-shot, chain-of-prompts LLM-XAI pipeline inspired by Miller (2019)'s cognitive model of explanation. iLLuMinaTE is designed to deliver theory-driven, actionable feedback to students in online courses. iLLuMinaTE navigates three main stages — causal connection, explanation selection, and explanation presentation — with variations drawing from eight social science theories (e.g. Abnormal Conditions, Pearl's Model of Explanation, Necessity and Robustness Selection, Contrastive Explanation). We extensively evaluate 21,915 natural language explanations of iLLuMinaTE extracted from three LLMs (GPT-4o, Gemma2-9B, Llama3-70B), with three different underlying XAI methods (LIME, Counterfactuals, MC-LIME), across students from three diverse online courses. Our evaluation involves analyses of explanation alignment to the social science theory, understandability of the explanation, and a real-world user preference study with 114 university students containing a novel actionability simulation. We find that students prefer iLLuMinaTE explanations over traditional explainers 89.52% of the time. Our work provides a robust, ready-to-use framework for effectively communicating hybrid XAI-driven insights in education, with significant generalization potential for other human-centric fields.

Abstract: The rapid and accurate direct multiframe interpolation method for Digital Subtraction Angiography (DSA) images is crucial for reducing radiation and providing real-time assistance to physicians for precise diagnostics and treatment. DSA images contain complex vascular structures and various motions. Applying natural scene Video Frame Interpolation (VFI) methods results in motion artifacts, structural dissipation, and blurriness. Recently, MoSt-DSA has specifically addressed these issues for the first time and achieved SOTA results. However, MoSt-DSA's focus on real-time performance leads to insufficient suppression of high-frequency noise and incomplete filtering of low-frequency noise in the generated images. To address these issues within the same computational time scale, we propose GaraMoSt. Specifically, we optimize the network pipeline with a parallel design and propose a module named MG-MSFE. MG-MSFE extracts frame-relative motion and structural features at various granularities in a fully convolutional parallel manner and supports independent, flexible adjustment of context-aware granularity at different scales, thus enhancing computational efficiency and accuracy. Extensive experiments demonstrate that GaraMoSt achieves the SOTA performance in accuracy, robustness, visual effects, and noise suppression, comprehensively surpassing MoSt-DSA and other natural scene VFI methods.

IBM Research - Israel, University of Washington, Michael G. Foster School of Business, Seattle, WA University of Washington, Paul G. Allen School of Computer Science & Engineering, Seattle, WA Laboratory for Innovation Science at Harvard (LISH), Cambridge, MA, University of Amsterdam, Amsterdam Business School, Business Analytics Section, MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA, University of Amsterdam, Amsterdam Business School, Business Analytics Section, University of Amsterdam, Amsterdam Business School, Business Analytics Section, University of Amsterdam, Amsterdam Business School, Business Analytics Section

Abstract: Many critical business and societal decisions in areas such as supply chain and healthcare involve numerous potential actions, complex constraints, and goals that can be modeled as objective functions. Mathematical optimization, a core area in Operations Research (OR), provides robust, mathematically grounded methodologies to address such decisions and has shown tremendous benefits in many applications. However, its application requires the creation of accurate and efficient optimization models, necessitating rare expertise and considerable time, creating a barrier to widespread adoption in decisionmaking. Thus, it is a long-standing goal to make these capabilities widely accessible. The advent of Large Language Models (LLMs) has made advanced Artificial Intelligence (AI) capabilities widely accessible through natural language. LLMs can accelerate expert work in creating formal models like computer programs, and emerging research indicates they can also speed up the development of optimization models by OR experts. We, therefore, propose integrating and advancing LLM and optimization modeling to empower organizational decision-makers to model and solve such complex problems without requiring deep expertise in optimization. In this work, we present our vision for democratizing optimization modeling for organizational decision-making by such a combination of LLMs and optimization modeling. We identify a set of fundamental requirements for the vision's implementation and describe the state of the art through a literature survey and some experimentation. We show that a) LLMs already provide substantial novel capabilities relevant to realizing this vision, but that b) major research challenges remain to be addressed. We also propose possible research directions to overcome these gaps. We would like this work to serve as a call to action to bring together the LLM and OR optimization modeling communities to pursue this vision, thereby enabling much more widespread improved decision-making and increasing by orders of magnitude the benefits AI and OR can bring to enterprises and society.

Abstract: AI has immense potential for positive social impact, including in domains ranging from conservation to health. However, it can be challenging to account for human collaborations and realworld uncertainties when deploying such systems, which can lead to critical errors. Therefore, my research focuses on developing new methods in multi-agent systems and machine learning, including methods for participatory design of AI, human-AI collaboration, and uncertainty quantification, to develop safe, impactful AI systems, particularly in the domains of water conservation and reproductive health.

Abstract: Developing competency in artificial intelligence is becoming increasingly crucial for computer science (CS) students at all levels of the CS curriculum. However, most previous research focuses on advanced CS courses, as traditional introductory courses provide limited opportunities to develop AI skills and knowledge. This paper introduces an introductory CS course where students learn computational thinking through computer vision, a subfield of AI, as an application context. The course aims to achieve computational thinking outcomes alongside critical thinking outcomes that expose students to AI approaches and their societal implications. Through experiential activities such as individual projects and reading discussions, our course seeks to balance technical learning and critical thinking goals. Our evaluation, based on pre-and post-course surveys, shows an improved sense of belonging, self-efficacy, and AI ethics awareness among students. The results suggest that an AI-focused context can enhance participation and employability, student-selected projects support self-efficacy, and ethically grounded AI instruction can be effective for interdisciplinary audiences. Students' discussions on reading assignments demonstrated deep engagement with the complex challenges in today's AI landscape. Finally, we share insights on scaling such courses for larger cohorts and improving the learning experience for introductory CS students.

Abstract: In this student paper, we report on our project to enhance road safety in South Carolina (SC) by analyzing traffic data provided by the Department of Transportation and evaluating the impact of a schoollevel student driver education program called Alive@25. We improve the understanding of road safety using these traffic and training data to understand collision patterns and areas for improvement and assess training coverage gaps. Our approach combines geospatial analysis, economic impact assessment, temporal trend analysis, and interactive visualizations while leveraging AI techniques to clean and analyze extensive datasets. Key findings revealed higher collision rates in urban counties and rising collision rates in mostly rural areas, where Alive@25 participation is declining. These insights led to recommendations for improving road infrastructure and expanding safety training programs. This research demonstrates the potential of AI-driven insights to inform timely, cost-effective interventions and promote multi-stakeholder engagement in addressing public safety challenges while teaching students data science and AI skills and civic engagement.

Abstract: Highaccuracy image segmentation models require abundant training annotated data which is costly for pixel-level annotations. Our work addresses a high-cost manual annotating process or the lack of detailed annotations via a generative approach. In particular, our approach (1) proposes the conditional instance-level synthesis to enrich the limited data to enhance the segmentation performance, and (2) employs the generative architectures to complete the segmentation task under few-shot learning concepts. The initial results on the Cityscapes benchmark emphasize our potential generative solution on the instance segmentation task given limited data.

Abstract: Molecular machine learning has broad applications across multiple domains such as drug development, environmental toxicology, and materials science. Various pretrained frameworks using self-supervised representation learning have emerged to tackle the difficulty of obtaining large molecular datasets useful for training high-performing molecular machine learning models. In this study, we explore a novel representation learning framework trained using both 2D and 3D molecular data. Specifically, a 3D invariant graph neural network to learn how to capture 3D atomic information and then pass these atomic representations into a regular 2D graph neural network which can leverage molecular topology. Results from experiments demonstrate the representations produced by our method using both 3D and 2D molecular information lead to strong performance in downstream tasks.

Abstract: In an era where societal narratives are increasingly shaped by algorithmic curation, investigating the political neutrality of LLMs is an important research question. This study presents a fresh perspective on quantifying the political neutrality of LLMs through the lens of abstractive text summarization of polarizing news articles. We consider five pressing issues in current US politics: abortion, gun control/rights, healthcare, immigration, and LGBTQ+ rights. Via a substantial corpus of 20,344 news articles, our study reveals a consistent trend towards proDemocratic biases in several well-known LLMs, with gun control and healthcare exhibiting the most pronounced biases (max polarization differences of -9.49% and -6.14%, respectively). Further analysis uncovers a strong convergence in the vocabulary of the LLM outputs for these divisive topics (55% overlap for Democrat-leaning representations, 52% for Republican). Being months away from a US election of consequence, we consider our findings important.

Abstract: When digitizing documents using conventional equipment, shadows often appear, posing significant challenges to the visual quality and readability of the digital copies. Given that the removal of document shadows typically involves complex image processing and computational tasks, which require substantial computational resources and time, the cost can become prohibitive, limiting the practicality and efficiency of shadow removal algorithms. This research aims to address the critical task of designing a model capable of achieving superior shadow removal effects. We propose a deep learning model for document shadow removal that harnesses Sobel text prior and ground truth masks as supervision. This prior knowledge encapsulates regular information regarding document structure and shadow formation, thereby enhancing its ability to utilize edge information for shadow removal. Additionally, the integration of prior knowledge and supervised learning can help the model learn more quickly, reducing the amount of information the model needs to process and improving its efficiency.

Abstract: Microplastics and microfibres are now widespread in aquatic ecosystems, as oceans and rivers. A serious portion of these microplastics come from urban wastewater treatment plants. Traditional methods for detecting and quantifying them are labourintensive and time-consuming. This paper introduces MicroFiberDetect, a novel application designed to enhance the detection and quantification of microfibres within sludge samples. Leveraging the power of deep learning, this innovative tool provides detection accuracy and insights into the size and colour of each identified fibre. Reducing time and manpower required for analysis while increasing accuracy and throughput. The application has been deployed as a desktop application that allows field experts to quantify and analyse microfibres in sludge samples.

Abstract: We propose a novel system, MathMistake Checker, designed to automate stepby-step mistake finding in mathematical problems with lengthy answers through a two-stage process. The system aims to simplify grading, increase efficiency, and enhance learning experiences from a pedagogical perspective. It integrates advanced technologies, including computer vision and the chain-of-thought capabilities of the latest large language models (LLMs). Our system supports open-ended grading without reference answers and promotes personalized learning by providing targeted feedback. We demonstrate its effectiveness across various types of math problems, such as calculation and word problems.

Abstract: We focus on the classification problem with a separable dataset, one of the most important and classical problems from machine learning. The standard approach to this task is logistic regression with gradient descent (LR+GD). Recent studies have observed that LR+GD can find a solution with arbitrarily large step sizes, defying conventional optimization theory. Our work investigates this phenomenon and makes three interconnected key observations about LR+GD with large step sizes. First, we find a remarkably simple explanation of why LR+GD with large step sizes solves the classification problem: LR+GD reduces to a batch version of the celebrated perceptron algorithm when the step size tends to infinity. Second, we observe that larger step sizes lead LR+GD to higher logistic losses when it tends to the perceptron algorithm, but larger step sizes also lead to faster convergence to a solution for the classification problem, meaning that logistic loss is an unreliable metric of the proximity to a solution. Surprisingly, high loss values can actually indicate faster convergence. Third, since the convergence rate in terms of loss function values of LR+GD is unreliable, we examine the iteration complexity required by LR+GD with large step sizes to solve the classification problem and prove that this complexity is suboptimal. To address this, we propose a new method, Normalized LR+GD – based on the connection between LR+GD and the perceptron algorithm – with much better theoretical guarantees.

Abstract: Continuous control tasks often involve highdimensional, dynamic, and non-linear environments. State-of-the-art performance in these tasks is achieved through complex closed-box policies that are effective, but suffer from an inherent opacity. Interpretable policies, while generally underperforming compared to their closed-box counterparts, advantageously facilitate transparent decision-making within automated systems. Hence, their usage is often essential for diagnosing and mitigating errors, supporting ethical and legal accountability, and fostering trust among stakeholders. In this paper, we propose SMoSE, a novel method to train sparsely activated interpretable controllers, based on a top-1 Mixture-of-Experts architecture. SMoSE combines a set of interpretable decision-makers, trained to be experts in different basic skills, and an interpretable router that assigns tasks among the experts. The training is carried out via state-of-the-art Reinforcement Learning algorithms, exploiting load-balancing techniques to ensure fair expert usage. We then distill decision trees from the weights of the router, significantly improving the ease of interpretation. We evaluate SMoSE on six benchmark environments from MuJoCo: our method outperforms recent interpretable baselines and narrows the gap with non-interpretable state-of-the-art algorithms.

Abstract: Tabular data are fundamental in common machine learning applications, ranging from finance to genomics and healthcare. This paper focuses on tabular regression tasks, a field where deep learning (DL) methods are not consistently superior to machine learning (ML) models due to the challenges posed by irregular target functions inherent in tabular data, causing sensitive label changes with minor variations from features. To address these issues, we propose a novel ArithmeticAware Pre-training and Adaptive-Regularized Fine-tuning framework (APAR), which enables the model to fit irregular target function in tabular data while reducing the negative impact of overfitting. In the pre-training phase, APAR introduces an arithmetic-aware pretext objective to capture intricate sample-wise relationships from the perspective of continuous labels. In the fine-tuning phase, a consistency-based adaptive regularization technique is proposed to self-learn appropriate data augmentation. Extensive experiments across 10 datasets demonstrated that APAR outperforms existing GBDT-, supervised NN-, and pretrain-finetune NN-based methods in RMSE (+9.43% ~ 20.37%), and empirically validated the effects of pre-training tasks, including the study of arithmetic operations.

Abstract: Probabilistic forecasting of irregularly sampled multivariate time series with missing values is crucial for decisionmaking in various domains, including health care, astronomy, and climate. State-of-the-art methods estimate only marginal distributions of observations in single channels and at single timepoints, assuming a Gaussian distribution for the data. In this work, we propose a novel model, ProFITi using conditional normalizing flows to learn multivariate conditional distribution: joint distribution of the future values of the time series conditioned on past observations and specific channels and timepoints, without assuming any fixed shape of the underlying distribution. As model components, we introduce a novel invertible triangular attention layer and an invertible non-linear activation function on and onto the whole real line. Through extensive experiments on 4 real-world datasets, ProFITi demonstrates significant improvement, achieving an average log-likelihood gain of 2.0 compared to the previous state-of-the-art method.

Abstract: Predicting roll call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embeddingbased methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study not only demonstrates PAA's effectiveness but also its potential in political science research.

Abstract: In product advertising applications, the automated inpainting of backgrounds utilizing AI techniques in product images has emerged as a significant task. However, the techniques still suffer from issues such as inappropriate background and inconsistent product in generated product images, and existing approaches for evaluating the quality of generated product images are mostly inconsistent with human feedback causing the evaluation for this task to depend on manual annotation. To relieve the issues above, this paper proposes Human Feedback and Product Consistency (HFPC), which can automatically assess the generated product images based on two modules. Firstly, to solve inappropriate backgrounds, human feedback on 44,000 automated inpainting product images is collected to train a reward model based on multimodal features extracted from BLIP and comparative learning. Secondly, to filter generated product images containing inconsistent products, a fine-tuned segmentation model is employed to segment the product of the original and generated product images and then compare the differences between the above two. Extensive experiments have demonstrated that HFPC can effectively evaluate the quality of generated product images and significantly reduce the expense of manual annotation. Moreover, HFPC achieves state-of-the-art (96.4% in precision) in comparison to other open-source visual-quality-assessment models.

Abstract: Reinforcement Learning has revolutionized decisionmaking processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source's location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively identifies and ceases undefined goals by approximating the agent's belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC's effectiveness and efficiency in complex, real-world scenarios.

Abstract: Recent advancements have underscored the exceptional analytical and situational understanding capabilities of Large Language Models (LLMs) in autonomous driving decisions. However, the inherent hallucination issues of LLMs pose significant safety concerns when utilized as standalone decisionmaking systems. To address these challenges, we propose the Hybrid-Driving framework, which leverages LLMs' situational comprehension and reasoning abilities alongside the specialized driving expertise embedded in knowledge graphs and driving rules, thereby enhancing the safety, robustness, and reliability of autonomous driving decisions. To articulate driving experiences clearly, we introduce the Scenario Evolution Knowledge Graph (SEKG), which integrates scenario prediction and action risk analysis in autonomous driving. By delineating observation areas and defining Time-to-Collision (TTC) levels, we effectively control the number of driving scenario nodes and ensure scenario diversity. Based on the scenario evolution relationships within the SEKG, we predict scenarios and assess associated action risks. Additionally, we implement a rule-filtering mechanism to eliminate unreasonable actions and employ prompt engineering to integrate scenario information, optional actions, and SEKG-based action risk analysis into the LLMs for decision-making. Extensive experiments demonstrate that our approach substantially improves decision success rates compared to using LLMs alone (≥37.5%), as well as surpasses the DiLu framework with LLMs and few-shot driving memory (≥7.5%), and other reinforcement learning methods (≥11%). These results validate the effectiveness of the Hybrid-Driving framework in enhancing LLM reliability for autonomous driving and advocate for its broader application of domain-specific knowledge across other fields.

Abstract: Proteinnucleic acid interactions play a fundamental and critical role in a wide range of life activities. Accurate identification of nucleic acid-binding residues helps to understand the intrinsic mechanisms of the interactions. However, the accuracy and interpretability of existing computational methods for recognizing nucleic acid-binding residues need to be further improved. Here, we propose a novel method called GeSite based the domain-adaptive protein language model and E(3)-equivariant graph neural network. Prediction results across multiple benchmark test sets demonstrate that GeSite is superior or comparable to state-of-the-art prediction methods. The MCC values of GeSite are 0.522 and 0.326 for the one DNA-binding residue test set and one RNA-binding resi-due test set, which are 0.57 and 38.14% higher than that of the second-best method, respectively. Detailed experi-mental results suggest that the advanced performance of GeSite lies in the well-designed nucleic acid-binding pro-tein adaptive language model. Additionally, interpretabil-ity analysis exposes the perception of the prediction mod-el on various remote and close functional domains, which is the source of its discernment ability.

Abstract: The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zeroshot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework Multi-Round Collaboration Detection (MRCD) is proposed to address these aforementioned limitations. The MRCD framework is capable of enjoying the merits from both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news detection.

Abstract: This paper presents a novel method for automatically recognizing people's apparent personality traits as perceived by others. In previous studies, apparent personality trait recognition from multimodal human behavior is often modeled to directly estimate personality trait scores, i.e., the ``Big Five'' scores. In the model training phase, groundtruth personality trait scores were often determined from personality test results scored by many other people using fine-grained questionnaires, however, rich information in the personality test results have not been leveraged for anything other than determining the ground-truth Big Five scores. The scores assigned to each questionnaire item are thought to include more meta-level differences in personality characteristics. Therefore, we propose joint modeling methods that can estimate not only the Big Five scores but also questionnaire item-level scores. This enables us to improve awareness of multimodal human behavior. In addition, we present a newly created self-introduction video dataset with 50-item Big Five questionnaire results since previous apparent personality trait recognition datasets do not provide such personality test results. Experiments using the created dataset demonstrate that our proposed joint modeling methods with a multimodal transformer backbone can improve to estimate Big Five scores and effectively estimate questionnaire item-level scores. We also verify that the estimation performance reached human evaluation performance.

School of Computer Science and Technology, Shanghai Institute of Artificial Intelligence for Education, Lab of Artificial Intelligence for Education, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China, School of Computer Science and Technology, Shanghai Institute of Artificial Intelligence for Education, Lab of Artificial Intelligence for Education, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China, School of Psychology and Cognitive Science, Institute of Brain and Education Innovation, Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, Shanghai Key Laboratory of Brain Functional Genomics, Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China Shanghai Center for Brain Science and Brain-Inspired Technology, Shanghai, China NYU-ECNU Institute of Brain and Cognitive Science, New York University Shanghai, Shanghai, China

Abstract: In recent years, predicting Big Five personality traits from multimodal data has received significant attention in artificial intelligence (AI). However, existing computational models often fail to achieve satisfactory performance. Psychological research has shown a strong correlation between pose and personality traits, yet previous research has largely ignored pose data in computational models. To address this gap, we develop a novel multimodal dataset that incorporates fullbody pose data. The dataset includes video recordings of 287 participants completing a virtual interview with 36 questions, along with self-reported Big Five personality scores as labels. To effectively utilize this multimodal data, we introduce the Psychology-Inspired Network (PINet), which consists of three key modules: Multimodal Feature Awareness (MFA), Multimodal Feature Interaction (MFI), and Psychology-Informed Modality Correlation Loss (PIMC Loss). The MFA module leverages the Vision Mamba Block to capture comprehensive visual features related to personality, while the MFI module efficiently fuses the multimodal features. The PIMC Loss, grounded in psychological theory, guides the model to emphasize different modalities for different personality dimensions. Experimental results show that the PINet outperforms several state-of-the-art baseline models. Furthermore, the three modules of PINet contribute almost equally to the model’s overall performance. Incorporating pose data significantly enhances the model’s performance, with the pose modality ranking mid-level in importance among the five modalities. These findings address the existing gap in personality-related datasets that lack full-body pose data and provide a new approach for improving the accuracy of personality prediction models, highlighting the importance of integrating psychological insights into AI frameworks.

Abstract: Multimodal AspectBased Sentiment Analysis (MABSA) plays a pivotal role in the advancement of sentiment analysis technology. Although current methods strive to integrate multimodal information to enhance the performance of sentiment analysis, they still face two critical challenges when dealing with multi-aspect and multi-sentiment data: i) the importance of aspect terms within multimodal data is often overlooked, and ii) models fail to accurately associate specific aspect terms with corresponding sentiment words in multi-aspect and multi-sentiment sentences. To tackle these problems, we propose a novel multimodal aspect-based sentiment analysis method that combines Aspect Enhancement and Text Simplification (AETS). Specifically, we develop an aspect enhancement module that boosts the ability of model to discern relevant aspect terms. Concurrently, we employ text simplification module to simplify and restructure multi-aspect and multi-sentiment texts, accurately capturing aspects and their corresponding sentiments while reducing irrelevant information. Leveraging this method, we perform three tasks including multimodal aspect term extraction, multimodal aspect sentiment classification, and joint multimodal aspect-based sentiment analysis. Experimental results indicate that our proposed AETS model achieved state-of-the-art performance on two benchmark datasets.

Abstract: With the explosive growth of deep learning applications and increasing privacy concerns, the right to be forgotten has become a critical requirement in various AI industries. For example, given a facial recognition system, some individuals may wish to remove their personal data that might have been used in the training phase. Unfortunately, deep neural networks sometimes unexpectedly leak personal identities, making this removal challenging. While recent machine unlearning algorithms aim to enable models to forget specific data, we identify an unintended utility drop—correlation collapse—in which the essential correlations between image features and true labels weaken during the forgetting process. To address this challenge, we propose DistributionLevel Feature Distancing (DLFD), a novel method that efficiently forgets instances while preserving task-relevant feature correlations. Our method synthesizes data samples by optimizing the feature distribution to be distinctly different from that of forget samples, achieving effective results within a single training epoch. Through extensive experiments on facial recognition datasets, we demonstrate that our approach significantly outperforms state-of-the-art machine unlearning methods in both forgetting performance and model utility preservation.

Abstract: Segment anything model (SAM) demonstrates strong generalization ability on natural image segmentation. However, its direct adaptation in medical image segmentation tasks shows significant performance drops. It also requires an excessive number of prompt points to obtain a reasonable accuracy. Although quite a few studies explore adapting SAM into medical image volumes, the efficiency of 2D adaptation methods is unsatisfactory and 3D adaptation methods are only capable of segmenting specific organs/tumors. In this work, we propose a comprehensive and scalable 3D SAM model for wholebody CT segmentation, named CT-SAM3D. Instead of adapting SAM, we propose a 3D promptable segmentation model using a (nearly) fully labeled CT dataset. To train CT-SAM3D effectively, ensuring the model's accurate responses to higher-dimensional spatial prompts is crucial, and 3D patch-wise training is required due to GPU memory constraints. Therefore, we propose two key technical developments: 1) a progressively and spatially aligned prompt encoding method to effectively encode click prompts in local 3D space; and 2) a cross-patch prompt scheme to capture more 3D spatial context, which is beneficial for reducing the editing workloads when interactively prompting on large organs. CT-SAM3D is trained using a curated dataset of 1204 CT scans containing 107 whole-body anatomies and extensively validated using five datasets, achieving significantly better results against all previous SAM-derived models.

Abstract: Medical report generation is crucial for clinical diagnosis and patient management, summarizing diagnoses and recommendations based on medical imaging. However, existing work often overlook the clinical pipeline involved in report writing, where physicians typically conduct an initial quick review followed by a detailed examination. Moreover, current alignment methods may lead to misaligned relationships. To address these issues, we propose DAMPER, a dualstage framework for medical report generation that mimics the clinical pipeline of report writing in two stages. In the first stage, a MeSH-Guided Coarse-Grained Alignment (MCG) stage that aligns chest X-ray (CXR) image features with medical subject headings (MeSH) features to generate a rough keyphrase representation of the overall impression. In the second stage, a Hypergraph-Enhanced Fine-Grained Alignment (HFG) stage that constructs hypergraphs for image patches and report annotations, modeling high-order relationships within each modality and performing hypergraph matching to capture semantic correlations between image regions and textual phrases. Finally,the coarse-grained visual features, generated MeSH representations, and visual hypergraph features are fed into a report decoder to produce the final medical report. Extensive experiments on public datasets demonstrate the effectiveness of DAMPER in generating comprehensive and accurate medical reports, outperforming state-of-the-art methods across various evaluation metrics.

Abstract: Security concerns related to Large Language Models (LLMs) have been extensively explored; however, the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain inadequately addressed. This paper investigates the security vulnerabilities of MedMLLMs, focusing on their deployment in clinical environments where the accuracy and relevance of questionand-answer interactions are crucial for addressing complex medical challenges. We introduce and redefine two attack types: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack), by integrating existing clinical data with atypical natural phenomena. Using the comprehensive 3MAD dataset that we developed, which spans a diverse range of medical imaging modalities and adverse medical scenarios, we performed an in-depth analysis and proposed the MCM optimization method. This approach significantly improves the attack success rate against MedMLLMs. Our evaluations, which include white-box attacks on LLaVA-Med and transfer (black-box) attacks on four other SOTA models, reveal that even MedMLLMs designed with advanced security mechanisms remain vulnerable to breaches. This study highlights the critical need for robust security measures to enhance the safety and reliability of open-source MedMLLMs, especially in light of the potential impact of jailbreak attacks and other malicious exploits in clinical applications. Warning: Medical jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.

Abstract: Partially Relevant Video Retrieval (PRVR) addresses the challenges of textto-video retrieval in real-world scenarios where untrimmed videos are prevalent. Traditional PRVR methods encode videos at two feature scales: (1) frame-level to capture fine details, and (2) clip-level to recognize broader content. However, these approaches align both scales with a single sentence representation, leading to suboptimal performance. In particular, we point out the level mismatch in aligning frame-level video features with a sentence representation, as the entire meaning of a sentence contains broader and more diverse content than what frame-level features can encode. This misalignment causes frame-level features to capture broader contexts and overlook local fine details. To tackle this issue, we propose a framework that represents a sentence as a set of multiple components, where each component aligns with frame-level semantics. Specifically, we introduce Semantic-Decomposed Matching (SDM) to adjust the granularity of the text description to match them with frame-level video features. In addition to the matching process, we develop the Adaptive Local Aggregator (ALA) to enhance video encoding in capturing finer local details, ensuring precise text-video alignment at the frame level. ALA adaptively integrates multi-scale local details within short temporal spans obtained by enforcing a strict temporal aggregation range. Finally, we reinforce detailed encoding at the frame level with newly designed objectives for both modalities. Extensive experiments integrating our framework with existing clip branches demonstrate its effectiveness and applicability, highlighting significant improvements in PRVR performance.

Abstract: Functional Magnetic Resonance Imaging (fMRI) data is a widely used kind of fourdimensional biomedical data, which requires effective compression. However, fMRI compressing poses unique challenges due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series by employing several methods, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and (iii) using proper initialization together with nonlinear fusion to describe the inter-region similarity. This scheme appropriately incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work in this paper paves the way for sharing massive fMRI data at low bandwidth and high fidelity.

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, College of Computer Science and Software Engineering, Shenzhen University, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Computer Science and Engineering, Hong Kong University of Science and Technology, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

Abstract: With the advancement of computer vision, numerous models have been proposed for screening of fundus diseases. However, the recognition of multiple fundus diseases is often hampered by the simultaneous presence of multiple disease types and the confluence of lesion types in fundus images. This paper addresses these challenges by conceptualizing them as multilevel feature fusion and self-supervised disease-indicative feature learning problems. We decode fundus images at various levels of granularity to delineate scenarios wherein multiple diseases and lesions co-occur. To effectively integrate these features, we introduce a hierarchical vision transformer (HVT) that adeptly captures both inter-level and intra-level dependencies. A novel forward-attention module is proposed to enhance the integration of lower-level semantic information into higher semantic layers, thereby enriching the representation of complex features. Additionally, we introduce a novel self-supervised mask-consistent feature learner (MCFL). Unlike traditional mask-autoencoders that reconstruct original images using encoder-decoder structures, MCFL utilizes a teacher-student framework to reconstruct mask-consistent feature maps. In this setup, exponential moving averaging is employed to derive classification-guided features, serving as labels for reconstruction rather than merely reconstructing the original images. This innovative approach facilitates the extraction of disease-indicative features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art models.

Abstract: Single Domain Generalization (SDG) is critical in medical imaging applications. Recently, Vision Foundation Models (VFMs) have spearheaded a trend in AI development due to their robust generalizability and versatility. This work aims to fully explore the generalization capabilities of VFMs alongside the domainspecific expertise of specialized models, thoroughly investigating the boundaries of their respective capabilities, thereby collaboratively addressing SDG challenges within medical imaging. We propose a framework for Collaborative reasoning between Specialized and Universal models for Single Domain Generalization (CollaSU-SDG) in medical imaging. Specifically, we first design a model-aware perturbation injection method from the perspective of single-source domain data, enabling differentiated and adaptive perturbation injection for two different scales of models. Then, a domain expansion adapter is designed for the VFM to adapt to the augmented single-source domain medical data. Lastly, we introduce an adaptive hierarchical transfer and dynamic dense prompting method that facilitate collaborative reasoning between the specialized and universal models, eliminating the need for explicit prompts. Through these designs, CollaSU-SDG fully leverages the strengths of both specialized and universal models, achieving robust out-of-distribution generalization capabilities on single-source domain data. Experimental results demonstrate that CollaSU-SDG significantly advances the state-of-the-art performance across a wide range of medical datasets. All the code will be publicly available.

Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Zhejiang University CSSC Intelligent Innovation Research Institute CSSC Systems Engineering Research Institute, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence

Abstract: Vehicle reidentification aims to match vehicles across non-overlapping camera views. Many existing methods extract features from one specific image, and these methods lack view-invariance when comparing vehicles of different orientations. As a result, discriminative parts obscured by viewpoint changes cannot contribute effectively to matching. This work presents a novel keypoint-based framework for vehicle Re-ID. We propose to explicitly model the intrinsic structural relationships between vehicle components via knowledge graph. By establishing connection between keypoints, our approach aims to leverage such prior to match vehicles even when some parts are not directly comparable due to orientation inconsistencies. Specifically, given query and gallery images, we first detect visible keypoints. Then, a transformer-based model infers features for non-overlapped keypoints by conditioning on visible correspondences defined in the knowledge graph. The final representation integrates visible and inferred features. Extensive experiments demonstrate our method outperforms state-of-the-arts on standard benchmarks under cross-view matching scenarios. To our knowledge, this is the first work introducing structural priors via keypoint knowledge graphs for view-invariant vehicle re-identification.

Abstract: Coronary Artery Disease (CAD) poses a significant threat to cardiovascular patients worldwide, underscoring the critical importance of automated CAD diagnostic technologies in clinical practice. Previous technologies for lesion assessment in Coronary CT Angiography (CCTA) images have been insufficient in terms of interpretability, resulting in solutions that lack clinical reliability in both network architecture and prediction outcomes, even when diagnoses are accurate. To address the limitation of interpretability, we introduce the Trusted LesionAssessment Network (TLA-Net), which provides a clinically reliable solution for multi-view CAD diagnosis: (1) The causality-informed evidence collection constructs a causal graph for the diagnostic process and implements causal interventions, preventing confounders' interference and enhancing the transparency of the network architecture. (2) The clinically-aligned uncertainty integration hierarchically combines Dirichlet distributions from various views based on clinical priors, offering confidence coefficients for prediction outcomes that align with physicians' image analysis procedures. Experimental results on a dataset of 2,618 lesions demonstrate that TLA-Net, supported by its interpretable methodological design, exhibits superior performance with outstanding generalization, domain adaptability, and robustness.

Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China, Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China Academy for Engineering and Technology, Fudan University, Shanghai 200433, China, Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China, Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China

Abstract: The accelerated MRI reconstruction process presents a challenging illposed inverse problem due to the extensive under-sampling in k-space. Recently, Vision Transformers (ViTs) have become the mainstream for this task, demonstrating substantial performance improvements. However, there are still three significant issues remain unaddressed: (1) ViTs struggle to capture high-frequency components of images, limiting their ability to detect local textures and edge information, thereby impeding MRI restoration; (2) Previous methods calculate multi-head self-attention (MSA) among both related and unrelated tokens in content, introducing noise and significantly increasing computational burden; (3) The naive feed-forward network in ViTs cannot model the multi-scale information that is important for image restoration. In this paper, we propose FPS-Former, a powerful ViT-based framework, to address these issues from the perspectives of frequency modulation, spatial purification, and scale diversification. Specifically, for issue (1), we introduce a frequency modulation attention module to enhance the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. For issue (2), we customize a spatial purification attention module to capture interactions among closely related tokens, thereby reducing redundant or irrelevant feature representations. For issue (3), we propose an efficient feed-forward network based on a hybrid-scale fusion strategy. Comprehensive experiments conducted on three public datasets show that our FPS-Former outperforms state-of-the-art methods while requiring lower computational costs.

School of Software Technology, Zhejiang University State Key Lab of CAD&CG, Zhejiang University, School of Software Technology, Zhejiang University, School of Software Technology, Zhejiang University State Key Lab of CAD&CG, Zhejiang University, School of Software Technology, Zhejiang University, Department of Computer Science and Information Technology, La Trobe University, Australia, School of Earth Sciences, Zhejiang University, School of Software Technology, Zhejiang University Innovation Center of Yangtze River Delta, Zhejiang University

Abstract: Remote sensing images (RSIs) are frequently characterized by multiscale inter-class objects and inconsistently distributed objects due to scene limitations, which would cause a significant data imbalance challenging the corresponding semantic segmentation. Recent methods have leveraged various deep learning techniques to capture high-quality representations for RSI semantic segmentation, but are hardly capable of addressing the afore-mentioned challenge given their limited explorations towards the mechanisms behind the representations. The recently discovered Neural Collapse (NC) phenomenon in computer vision models suggests the simplex equiangular tight frame (ETF) as the optimal representation structure, which has motivated us to observe that the optimal structure of last-layer representations is disrupted and inter-class representations for minor classes tend to become closer to each other beacuse of data imbalance. To address these issues, we propose Inter-class and Intra-class Neural Collapse Tuning (In2NeCT) to optimize the representations that satisfy the simplex ETF, which facilitates the discrimination of inter-class representations and the coherence of intra-class representations. Extensive experiments on three datasets demonstrate that our In2NeCT consistently leads to significant improvements in performance and outperforms the state-of-the-art methods.

Qingdao Institute of Software, China University of Petroleum (East China) College of Computer Science and Technology, China University of Petroleum (East China) Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao Institute of Software, China University of Petroleum (East China) College of Computer Science and Technology, China University of Petroleum (East China) Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao Institute of Software, China University of Petroleum (East China) College of Computer Science and Technology, China University of Petroleum (East China) Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao Institute of Software, China University of Petroleum (East China) College of Computer Science and Technology, China University of Petroleum (East China) Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software

Abstract: Existing Salient Object Ranking (SOR) aims to infer ranking of salient objects based on their saliency degree. However, it tends to only focus on salient objects while neglecting nonsalient ones. This coarse-grained ranking limits the performance of downstream tasks. For instance, in image retrieval tasks, focusing solely on the relationship between salient objects is insufficient for achieving fine-grained scene analysis, which may result in retrieved results that do not satisfy user requirements. High-quality retrieval requires fine-grained analysis, making it essential to rank non-salient objects. Based on this need, we propose a new task: Fine-grained Object Importance Ranking in 360 Scenes (FOIR-360), which focus on predicting the relative importance of "ALL objects'' at the instance-level. Our task takes into account all objects, allowing us to refine the original "coarse-grained'' to a "fine-grained'' level. Currently, the main challenge for this new task is the lack of supervised data for model training or even for model testing. Therefore, we propose a novel weakly supervised method to address the shortage of datasets. Furthermore, to the best of our knowledge, there is no existing suitable annotation protocol for this new task. The main reason is that annotating fine-grained rankings is extremely difficult, especially in panoramic scenes that contain numerous instances where even humans are unable to determine which one is more important than others. As the first attempt, we introduce a new annotation protocol designed to highlight the ranking of objects that are non-salient yet still important. Based on this protocol, we construct the first fine-grained 360Rank dataset. In summary, all these new task, weakly supervised method, annotation protocol, and dataset have the potential to drive advancements in the field.

Abstract: In our quest to decode the visual processing of the human brain, we aim to reconstruct dynamic visual experiences from brain activities, a task both challenging and intriguing. Although recent advances have made significant strides in reconstructing static images from noninvasive brain recordings, the translation of continuous brain activities into video formats has not been extensively explored. Our study introduces NeuralFlix, a simple but effective dual-phase framework designed to address the inherent challenges in decoding fMRI data, such as noise, spatial redundancy, and temporal lags. The framework employs spatial and temporal augmentation for contrastive learning of fMRI representations, and a diffusion model enhanced with dependent prior noise for generating videos. Tested on a publicly available fMRI dataset, NeuralFlix demonstrates promising results, significantly outperforming previous state-of-the-art models by margins of 20.97%, 31.00%, and 12.30%, respectively, in decoding the brain activities of three subjects individually, as measured by SSIM.

Abstract: Estimating optical flow in occluded regions is a crucial challenge in unsupervised settings. In this work, we introduce M2Flow, a novel framework for unsupervised optical flow estimation that integrates motion information from multiple frames to address occlusions. By modeling interframe motion information and employing Motion Information Propagation (MIP) module, M2Flow effectively propagates and integrates motion information across frames, while concurrently estimating bidirectional optical flows for multiple frames. In addition, to handle occlusions across multiple frames, we provide two augmentation modules specifically designed for our multi-frame model to further refine optical flow. The experiments on KITTI and Sintel datasets demonstrate that M2Flow outperforms other state-of-the-art unsupervised approaches, especially in solving occlusions.

School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065 China Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065 China Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001 China, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China

Abstract: Nighttime Semantic Segmentation (NSS) is essential to many cuttingedge vision applications. However, existing technologies overly rely on massive labeled data, whose annotation is time-consuming and laborious. In this paper, we pioneer a new task focusing on exploring the potential of training strategy and framework design with limited annotation to achieve high-performance NSS. Insufficient information at very low labeling budgets can easily lead to under-optimization or overfitting of the model. Our solution comprises two main components: i) a novel region-based active sampling strategy called Contextual-Aware Region Query (CARQ), which identifies highly informative target nighttime regions for labeling; and ii) an innovative Fragmentation Synergy Active Domain Adaptation framework (FS-ADA), which progressively broadcasts the limited annotation to the unlabeled regions, achieving high performance with a minimal annotation budget. Extensive experiments demonstrate that our method outperforms state-of-the-art UDA-NSS & ADA-SS methods across four day-to-nighttime benchmarks, and generalizes well to foggy, rainy, & snowy scenes. In particular only with 1% target nighttime data annotation, our method is on par with the mainstream fully-supervised methods on the BDD100K-Night val dataset.

School of Computer Science, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, School of Computer Science, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, School of Cyber Science and Engineering, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, School of Computer Science, Wuhan University, China School of Cyber Science and Engineering, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, School of Computer Science, Wuhan University, China School of Cyber Science and Engineering, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, School of Computer Science, Wuhan University, China School of Cyber Science and Engineering, Wuhan University, China National Engineering Research Center for Multimedia Software, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China

Abstract: Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is storylines, which are composed of complex interactions and longrange evolvement of core story topics including characters, actions and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, making them difficult to comprehensively assess VideoQA models' DVU capability of complex storylines. Additionally, the question quantity and video length of these dataset are limited by high labor costs of handcrafted dataset building method. In this paper, we devise a large language model based multi-agent collaboration framework, StoryMind, to automatically generate a new large-scale DVU dataset. The dataset, FriendsQA, derived from the renowned sitcom Friends with an average episode length of 1,358 seconds, contains 44.6K questions evenly distributed across 14 fine-grained topics. Finally, We conduct comprehensive experiments on 10 state-of-the-art VideoQA models using the FriendsQA dataset.

Abstract: Textvideo retrieval (TVR) has seen substantial advancements in recent years, fueled by the utilization of pre-trained models and large language models (LLMs). Despite these advancements, achieving accurate matching in TVR remains challenging due to inherent disparities between video and textual modalities and irregularities in data representation. In this paper, we propose Text-Video-ProxyNet (TV-ProxyNet), a novel framework designed to decompose the conventional 1-to-N relationship of TVR into N distinct 1-to-1 relationships. By replacing a single text query with a series of text proxies, TV-ProxyNet not only broadens the query scope but also achieves a more precise expansion. Each text proxy is crafted through a refined iterative process, controlled by mechanisms we term as the director and dash, which regulate the proxy's direction and distance relative to the original text query. This setup not only facilitates more precise semantic alignment but also effectively manages the disparities and noise inherent in multimodal data. Our experiments on three representative video-text retrieval benchmarks, MSRVTT, DiDeMo, and ActivityNet Captions, demonstrate the effectiveness of TV-ProxyNet. The results show an improvement of 2.0% to 3.3% in R@1 over the baseline. TV-ProxyNet achieved state-of-the-art performance on MSRVTT and ActivityNet Captions, and a 2.0% improvement on DiDeMo compared to existing methods, validating our approach's ability to enhance semantic mapping and reduce error propensity.

Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China

Abstract: The fair and objective assessment of performances and competitions is a common pursuit and challenge in human society. The application of computer vision technology offers hope for this purpose, but it still faces obstacles such as occlusion and motion blur. To address these hindrances, our DanceFix proposes a bidirectional spatialtemporal context optical flow correction (BOFC) method. This approach leverages the consistency and complementarity of motion information between two modalities: optical flow, which excels at pixel capture, and lightweight skeleton data. It enables the extraction of pixel-level motion changes and the correction of abnormal skeleton data. Furthermore, we propose a part-level dance dataset (Dancer Parts) and part-level motion feature extraction based on task decoupling (PETD). This aims to decouple complex whole-body parts tracking into fine-grained limb-level motion extraction, enhancing the confidence of temporal information and the accuracy of correction for abnormal data. Finally, we present the DNV dataset, which simulates fully neat group dance scenes and provides reliable labels and validation methods for the newly introduced group dance neatness assessment (GDNA). To the best of our knowledge, this is the first work to develop quantitative criteria for assessing limb and joint neatness in group dance. We conduct experiments on DNV and video-based public JHMDB datasets. Our method effectively corrects abnormal skeleton points, flexibly embeds, and improves the accuracy of existing pose estimation algorithms.

Abstract: Timeresolved imaging is an emerging sensing modality that has been shown to enable advanced applications, including remote sensing, fluorescence lifetime imaging, and even non-line-of-sight sensing. Single-photon avalanche diodes (SPADs) outperform relevant time-resolved imaging technologies thanks to their excellent photon sensitivity and superior temporal resolution on the order of tens of picoseconds. The capability of exceeding the sensing limits of conventional cameras for SPADs also draws attention to the photon-efficient imaging area. However, photon-efficient imaging under degraded conditions with low photon counts and low signal-to-background ratio (SBR) still remains an inevitable challenge. In this paper, we propose a spatio-temporal transformer network for photon-efficient imaging under low-flux scenarios. In particular, we introduce a view-interweaved attention mechanism (VIAM) to extract both spatial-view and temporal-view self-attention in each transformer block. We also design an adaptive-weighting scheme to dynamically adjust the weights between different views of self-attention in VIAM for different signal-to-background levels. We extensively validate and demonstrate the effectiveness of our approach on the simulated Middlebury dataset and a specially self-collected dataset with real-world-captured SPAD measurements and well-annotated ground truth depth maps.

Abstract: Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AIgenerated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce Yuan, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. Yuan uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention—a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, Yuan demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore Yuan's potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.

Abstract: Information diffusion prediction aims to predict the next infected user in the information diffusion, which is a critical task to understand how information spreads on social platforms. Existing methods mainly focus on the sequences or topology structure in euclidean space. However, they fail to sufficiently consider the hierarchical structure or powerlaw structure of the underlying topology of information cascade graphs and social networks, resulting in distortion of user features. To tackle above issue, we propose an innovative Constrained Temporal Hypergraphs and Graph Neural Networks (THGNets) framework that is tailored for information diffusion prediction. Specifically, we introduce hyperbolic temporal hypergraphs neural network to alleviate the distortion of user features by hyperbolic hierarchical learning in information cascades. Additionally, it also captures high-order dynamic interaction patterns between users and further integrates the time-consistency constraint mechanism to mitigate the instability and non-smoothness of user features in latent space. In parallel, we apply the hyperbolic graph neural network to investigate the hierarchical structure and user homogeneity on social networks, enhancing our understanding of social relationships. Moreover, hyperbolic gated recurrent units are employed to capture the potential dependency relationships between contextual users. Experiments conducted on four public datasets demonstrate that the proposed THGNets significantly outperform the existing methods, thereby validating the superiority and rationality of our approach.

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

Abstract: Graph neural network (GNN) based recommender systems have been widely used in diverse service platforms as they can more effectively capture users' interests. Nevertheless, recent investigations have revealed that the neighborhood aggregation and contrastive learning mechanisms render GNNbased recommender systems more vulnerable to fake profile injection attacks, i.e., shilling attacks. Despite numerous defenses against shilling attacks having emerged, these approaches still face certain challenges, such as the demand for prior knowledge and the difficulty in defending against multiple attacks. Therefore, this paper proposes a two-stage trustworthy GNN-based recommender systems training framework (Trust-GRS), which models the probability of data being fake in a zero-knowledge scenario and establishes a trustworthy neighborhood aggregation and contrastive learning. Through extensive experiments on multiple benchmark datasets against 12 state-of-the-art shilling attacks, we demonstrate that Trust-GRS substantially mitigates the influence of fake data in all attacks, up to 100%, while preserving the original recommendation performance. Benefiting from the absence of the requirement for prior knowledge, Trust-GRS holds significant application value for real-world recommendation platforms.

Abstract: Sharedaccount Sequential Recommendation (SSR) aims to provide personalized recommendations for accounts shared by multiple users with varying sequential preferences. Previous studies on SSR struggle to capture the fine-grained associations between interactions and different latent users within the shared account's hybrid sequences. Moreover, most existing SSR methods (e.g., RNN-based or GCN-based methods) have quadratic computational complexities, hindering the deployment of SSRs on resource-constrained devices. To this end, we propose a Lightweight Graph Capsule Convolutional Network with subspace alignment for shared-account sequential recommendation, named LightGC2N. Specifically, we devise a lightweight graph capsule convolutional network. It facilitates the fine-grained matching between interactions and latent users by attentively propagating messages on the capsule graphs. Besides, we present an efficient subspace alignment method. This method refines the sequence representations and then aligns them with the finely clustered preferences of latent users. The experimental results on four real-world datasets indicate that LightGC2N outperforms nine state-of-the-art methods in accuracy and efficiency.

Abstract: This paper considers the challenging problem of 3D Human Pose Estimation (HPE) from a sparse set of Inertial Measurement Units (IMUs). Existing efforts typically reconstruct a pose sequence by either directly tackling wholebody motions or focusing on distinctive spatio-temporal features of local body parts. Unfortunately, these methods ignore existing interdependent motor synergies amongst body parts, which may lead to pose estimation with ambiguous local parts. This observation motivates us to propose a hierarchical learning-based approach, HiPoser, which utilizes a hierarchical shared structure using Mamba blocks as the backbone to focus on the following estimation tasks, involving: 1) torso pose, 2) lower limbs pose, 3) upper limbs pose, and finally 4) global translation. These tasks selectively incorporate body motion states and are to be carried out sequentially in reconstructing part-based poses, which are amalgamated to estimate the final full-body pose with the global translation that satisfies inter-part consistencies. Our hierarchical structure allows HiPoser the flexibility in prioritizing different aspects of pose estimation, to emphasize more on detail or stability. Empirical evaluations over three benchmark datasets demonstrate the superiority of HiPoser over existing state-of-the-art models, suggesting that analyzing the synergistic movement of body parts is indeed important for advancing IMU-based 3D HPE.

Abstract: Robots can acquire complex manipulation skills by learning policies from expert demonstrations, which is often known as visionbased imitation learning. Generating policies based on diffusion and flow matching models has been shown to be effective, particularly in robotic manipulation tasks. However, recursion-based approaches are inference inefficient in working from noise distributions to policy distributions, posing a challenging trade-off between efficiency and quality. This motivates us to propose FlowPolicy, a novel framework for fast policy generation based on consistency flow matching and 3D vision. Our approach refines the flow dynamics by normalizing the self-consistency of the velocity field, enabling the model to derive task execution policies in a single inference step. Specifically, FlowPolicy conditions on the observed 3D point cloud, where consistency flow matching directly defines straight-line flows from different time states to the same action space, while simultaneously constraining their velocity values, that is, we approximate the trajectories from noise to robot actions by normalizing the self-consistency of the velocity field within the action space, thus improving the inference efficiency. We validate the effectiveness of FlowPolicy in Adroit and Metaworld, demonstrating a 7× increase in inference speed while maintaining competitive average success rates compared to state-of-the-art methods.

Abstract: Temporal knowledge graph (TKG) reasoning that infers future missing facts is an essential and challenging task. Predicting future events typically relies on closely related historical facts, yielding more accurate results for repetitive or periodic events. However, for future events with sparse historical interactions, the effectiveness of this method, which focuses on leveraging highfrequency historical information, diminishes. Recently, the capabilities of diffusion models in image generation have opened new opportunities for TKG reasoning. Therefore, we propose a graph node diffusion model with dual-domain periodic contrastive learning (DPCL-Diff). Graph node diffusion model (GNDiff) introduces noise into sparsely related events to simulate new events, generating high-quality data that better conforms to the actual distribution. This generative mechanism significantly enhances the model's ability to reason about new events. Additionally, the dual-domain periodic contrastive learning (DPCL) maps periodic and non-periodic event entities to Poincaré and Euclidean spaces, leveraging their characteristics to distinguish similar periodic events effectively. Experimental results on four public datasets demonstrate that DPCL-Diff significantly outperforms state-of-the-art TKG models in event prediction, demonstrating our approach's effectiveness. This study also investigates the combined effectiveness of GNDiff and DPCL in TKG tasks.

Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data and several recent works have claimed that such models effectively learn spectral embeddings and show relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known if trained contrastive models indeed correspond to kernel methods or PCA. In this work, we analyze the training dynamics of twolayer contrastive models, with non-linear activation, and answer when these models are close to PCA or kernel methods. It is well known in the supervised setting that neural networks are equivalent to neural tangent kernel (NTK) machines, and that the NTK of infinitely wide networks remains constant during training. We provide the first constancy results of NTK for contrastive losses, and present a nuanced picture: NTK of wide networks remains almost constant for cosine similarity based contrastive losses, but not for losses based on dot product similarity. We further study the training dynamics of contrastive models with orthogonality constraints on output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. Our deviation bounds suggest that representations learned by contrastive models are close to the principal components of a certain matrix computed from random features.

Abstract: Data augmentation methods, especially SoTA interpolationbased methods such as Fair Mixup, have been widely shown to increase model fairness. However, this fairness is evaluated on metrics that do not capture model uncertainty and on datasets with only one, relatively large, minority group. As a remedy, multicalibration has been introduced to measure fairness while accommodating uncertainty and accounting for multiple minority groups. However, existing methods of improving multicalibration involve reducing initial training data to create a holdout set for post-processing, which is not ideal when minority training data is already sparse. This paper uses multicalibration to more rigorously examine data augmentation for classification fairness. We stress-test four versions of Fair Mixup on two structured data classification problems with up to 81 marginalized groups, evaluating multicalibration violations and balanced accuracy. We find that on nearly every experiment, Fair Mixup worsens baseline performance and fairness, but the simple vanilla Mixup outperforms both Fair Mixup and the baseline, especially when calibrating on small groups. Combining vanilla Mixup with multicalibration post-processing, which enforces multicalibration through post-processing on a holdout set, further increases fairness.

School of Computer Science and Artificial Intelligence & Beijing Key Laboratory of Commercial Data Security Protection and Intelligent Governance, Beijing Technology and Business University, Beijing, China Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, Macau, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, Macau Centre for Artificial Intelligence and Robotics, Institute of Collaborative Innovation, University of Macau, Taipa, Macau, School of Computer Science and Artificial Intelligence & Beijing Key Laboratory of Commercial Data Security Protection and Intelligent Governance, Beijing Technology and Business University, Beijing, China

Abstract: Solutions to timevarying problems are crucial for research areas such as predicting changes in human body shape over time. While recurrent neural networks have made significant advancements in this field, their reliance on centralized processing has led to challenges such as model silos and data isolation. In response, distributed AI systems like federated learning have emerged to facilitate dynamic collaboration among models; however, they still depend on central coordinators, which pose risks to system security and efficiency. Moreover, traditional federated learning primarily supports homogeneous models and lacks effective strategies for the interaction of heterogeneous models. To address these limitations, we propose a novel method called Dynamic Collaboration of Heterogeneous Models (DCHM), based on Isomerism Learning, which leverages a consortium blockchain network to enhance model credibility and facilitate coordination among heterogeneous models. Additionally, we introduce a Distributed Hierarchical Aggregation (DHA) algorithm that enables permissioned nodes within each group to aggregate local model results and share them for standardized processing. After several iterative cycles, these nodes perform secondary integration of local results to produce global outcomes. Experimental results demonstrate that DCHM effectively analyzes the temporal variability of body shape changes with high efficiency.

Abstract: Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with realworld time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.

Abstract: With the popularity of 3D volumetric video applications, such as Autonomous Driving, Virtual Reality, and Mixed Reality, current developers have turned to deep learning for compressing volumetric video frames, i.e., point clouds for video upstreaming. The latest deep learningbased solutions offer higher efficiency, lower distortion, and better hardware support compared to traditional ones like MPEG and JPEG. However, privacy threats arise, especially reconstruction attacks targeting to recover the original input point cloud from the intermediate results. In this paper, we design VVRec, to the best of our knowledge, which is the first targeting DL-based Volumetric Video Reconstruction attack scheme. VVRec demonstrates the ability to reconstruct high-quality point clouds from intercepted transmission intermediate results using four well-trained neural network modules we design. Leveraging the latest latent diffusion models with Gamma distribution and a refinement algorithm, VVRec excels in reconstruction quality and color recovery and surpasses existing defenses. We evaluate VVRec using three volumetric video datasets. The results demonstrate that VVRec achieves 64.70dB reconstruction accuracy, with an impressive 46.39% reduction of distortion over baselines.

Abstract: We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider smallloss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.

Abstract: Gaussian Process Regression (GPR) is a powerful and elegant method for learning complex functions from noisy data with a wide range of applications, including in safetycritical domains. Such applications have two key features: (i) they require rigorous error quantification, and (ii) the noise is often bounded and non-Gaussian due to, e.g., physical constraints. While error bounds for applying GPR in the presence of non-Gaussian noise exist, they tend to be overly restrictive and conservative in practice. In this paper, we provide novel error bounds for GPR under bounded support noise. Specifically, by relying on concentration inequalities and assuming that the latent function has low complexity in the reproducing kernel Hilbert space (RKHS) corresponding to the GP kernel, we derive both probabilistic and deterministic bounds on the error of the GPR. We show that these errors are substantially tighter than existing state-of-the-art bounds and are particularly well-suited for GPR with neural network kernels, i.e., Deep Kernel Learning (DKL). Furthermore, motivated by applications in safety-critical domains, we illustrate how these bounds can be combined with stochastic barrier functions to successfully quantify the safety probability of an unknown dynamical system from finite data. We validate the efficacy of our approach through several benchmarks and comparisons against existing bounds. The results show that our bounds are consistently smaller, and that DKLs can produce error bounds tighter than sample noise, significantly improving the safety probability of control systems.

Abstract: The selfattention mechanism has been adopted in various popular message passing neural networks (MPNNs), enabling the model to adaptively control the amount of information that flows along the edges of the underlying graph. Such attention-based MPNNs (Att-GNNs) have also been used as a baseline for multiple studies on explainable AI (XAI) since attention has steadily been seen as natural model interpretations, while being a viewpoint that has already been popularized in other domains (e.g., natural language processing and computer vision). However, existing studies often use naive calculations to derive attribution scores from attention, undermining the potential of attention as interpretations for Att-GNNs. In our study, we aim to fill the gap between the widespread usage of Att-GNNs and their potential explainability via attention. To this end, we propose GAtt, edge attribution calculation method for self-attention MPNNs based on the computation tree, a rooted tree that reflects the computation process of the underlying model. Despite its simplicity, we empirically demonstrate the effectiveness of GAtt in three aspects of model explanation: faithfulness, explanation accuracy, and case studies by using both synthetic and real-world benchmark datasets. In all cases, the results demonstrate that GAtt greatly improves edge attribution scores, especially compared to the previous naive approach.

Abstract: With the popularity of federated learning, federated domain generalization (FedDG) has attracted more and more attentions. Existing works of federated learning indicate that the generalization performance of the global model can be improved when the global model is obtained by aggregating local models according to a suitable weights. However, the existing methods to calculate weights do not fully utilize the data influences on the global model update, which gives us an opportunity to improve the generalization performance of the global model further. In this paper, we propose the method DI (data influences), which utilizes the data influences on the global model update to calculate dynamical weights of local model in each round of training. Specifically, the first component data influences calculator (DIC) of DI calculates the local weights of local model from the influences of each data on the global model update and we introduce the influences function to complete the calculation process. The second component data influences adjuster (DIA) of DI calculates the global weights (which are used in the aggregation process of the global model) from local weights. Extensive experiments indicate that our method improves the generalization performance of models significantly. In particular, our method improves model accuracy on benchmark datasets PACS, OfficeHome, and Office31 by 1.79%, 1.61%, and 2.39% on average, respectively. Source code is publicly available at github.

Abstract: Lyricto-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder with fine-grained lyric and musical controls, which is able to generate full-song melodies matched with the given lyrics and user-specified musical attributes. Specifically, we first introduce REMI-Aligned, a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies, facilitating precise alignment modeling. Subsequently, sentence-level semantic lyric embeddings independently extracted from a sentence-wise Transformer encoder are combined with word-level part-of-speech embeddings and syllable-level tone embeddings as fine-grained controls to enhance the controllability of lyrics over melody generation. Then we introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE as coarse-grained, fine-grained and high-fidelity controls, respectively, to the generation process, thereby enabling user control over melody generation. Finally, an in-attention Transformer decoder technique is leveraged to exert fine-grained control over the full-song melody generation with the aforementioned lyric and musical conditions. Experimental results demonstrate that our proposed CSL-L2M outperforms the state-of-the-art models, generating melodies with higher quality, better controllability and enhanced structure.

Abstract: Although CoordinateMLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation.

Abstract: Malware authors often employ code obfuscations to make their malware harder to detect. Existing tools for generating obfuscated code often require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a nontrivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) potentially generate a new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially increases the flexibility of attackers to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark comprising MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code, register substitution, and control flow change. The MetamorphASM systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation was performed using established information-theoretic metrics and manual human review to ensure correctness and provide the foundation for researchers to study and develop remediations to this risk.

Abstract: This paper presents the Text Encoding Diffusion Model (TEncDM), a novel approach to diffusion modeling that operates in the space of pretrained language model encodings. In contrast to traditionally used embeddings, encodings integrate contextual information. In our approach, we also employ a transformer-based decoder, specifically designed to incorporate context in the token prediction process. We conduct a comprehensive examination of the influence of the encoder, decoder, noise scheduler, and self-conditioning on zero-shot generation. Furthermore, we compare TEncDM with previous approaches on three conditional text generation tasks: QQP, XSum, and Wiki-Auto. The results show that TEncDM exhibits superior performance compared to existing non-autoregressive diffusion models.

Abstract: Retrieval Augmented Generation (RAG) with Knowledge Graphs (KGs) is an effective way to enhance Large Language Models (LLMs). Due to the natural discrepancy between structured KGs and sequential LLMs, KGs must be linearized to text before being inputted into LLMs, leading to the problem of KG Alignment with LLMs (KGA). However, recent KG+RAG methods only consider KGA as a simple step without comprehensive and indepth explorations, leaving three essential problems unclear: (1) What are the factors and their effects in KGA? (2) How do LLMs understand KGs? (3) How to improve KG+RAG by KGA? To fill this gap, we conduct systematic explorations on KGA, where we first define the problem of KGA and subdivide it into the graph transformation phase (graph-to-graph) and the linearization phase (graph-to-text). In the graph transformation phase, we study graph features at the node, edge, and full graph levels from low to high granularity. In the linearization phase, we study factors on formats, orders, and templates from structural to token levels. We conduct substantial experiments on 15 typical LLMs and three common datasets. Our main findings include: (1) The centrality of the KG affects the final generation; formats have the greatest impact on KGA; orders are model-dependent, without an optimal order adapting for all models; the templates with special token separators are better. (2) LLMs understand KGs by a unique mechanism, different from processing natural sentences, and separators play an important role. (3) We achieved 7.3% average performance improvements on four common LLMs on the KGQA task by combining the optimal factors to enhance KGA.

Abstract: Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.

Abstract: The Dynamic Retrieval Augmented Generation (RAG) paradigm actively decides when and what to retrieve during the text generation process of Large Language Models (LLMs). However, current dynamic RAG methods fall short in both aspects: identifying the optimal moment to activate the retrieval module and crafting the appropriate query once retrieval is triggered. To overcome these limitations, we introduce an approach, namely, RaDIO, RealTime Hallucination Detection with Contextual Index Optimized query formulation for dynamic RAG. The approach is specifically designed to make decisions on when and what to retrieve based on the LLM’s real-time information needs during the text generation process. We evaluate RaDIO along with existing methods comprehensively over several knowledge-intensive generation datasets. Experimental results show that RaDIO achieves superior performance on all tasks, demonstrating the effectiveness of our work.

Abstract: Retrievalaugmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across 12 question-answering tasks demonstrate the superiority of our approach.

Abstract: Mainstream backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries. However, the fixed trigger setting (e.g., unusual words) may be easily detected by human detection, limiting the effectiveness and practicality in realworld scenarios. To enhance the stealthiness of backdoor activation, we present a new poisoning paradigm against LLMs triggered by specifying generation conditions, which are commonly adopted strategies by users during model inference. The poisoned model performs normally for output under normal/other generation conditions, while becomes harmful for output under target generation conditions. To achieve this objective, we introduce BrieFool, an efficient attack framework. It leverages the characteristics of generation conditions by efficient instruction sampling and poisoning data generation, thereby influencing the behavior of LLMs under target conditions. Our attack can be generally divided into two types with different targets: Safety unalignment attack and Ability degradation attack. Our extensive experiments demonstrate that BrieFool is effective across safety domains and ability domains, achieving higher success rates than baseline methods, with 94.3% on GPT-3.5-turbo.

Abstract: Understanding causal relations in dynamic systems is essential in epidemiology. While causal inference methods have been extensively studied, they often rely on fully specified causal graphs, which may not always be available in complex dynamic systems. Partially specified causal graphs, and in particular summary causal graphs (SCGs), provide a simplified representation of causal relations between time series when working spaciotemporal data, omitting temporal information and focusing on causal structures between clusters of of temporal variables. Unlike fully specified causal graphs, SCGs can contain cycles, which complicate their analysis and interpretation. In addition, their cluster-based nature introduces new challenges concerning the types of queries of interest: macro queries, which involve relationships between clusters represented as vertices in the graph, and micro queries, which pertain to relationships between variables that are not directly visible through the vertices of the graph. In this paper, we first clearly distinguish between macro conditional independencies and micro conditional independencies and between macro total effects and micro total effects. Then, we demonstrate the soundness and completeness of the d-separation to identify macro conditional independencies in SCGs. Furthermore, we establish that the do-calculus is sound and complete for identifying macro total effects in SCGs. Finally, we give a graphical characterization for the non-identifiability of macro total effects in SCGs.

Abstract: The last two years have seen a rapid growth in concerns around the safety of large language models (LLMs). Researchers and practitioners have met these concerns by creating an abundance of datasets for evaluating and improving LLM safety. However, much of this work has happened in parallel, and with very different goals in mind, ranging from the mitigation of nearterm risks around bias and toxic content generation to the assessment of longer-term catastrophic risk potential. This makes it difficult for researchers and practitioners to find the most relevant datasets for their use case, and to identify gaps in dataset coverage that future work may fill. To remedy these issues, we conduct a first systematic review of open datasets for evaluating and improving LLM safety. We review 144 datasets, which we identified through an iterative and community-driven process over the course of several months. We highlight patterns and trends, such as a trend towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English and naturalistic datasets. We also examine how LLM safety datasets are used in practice -- in LLM release publications and popular LLM benchmarks -- finding that current evaluation practices are highly idiosyncratic and make use of only a small fraction of available datasets. Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety, which we plan to update continuously as the field of LLM safety develops.

Abstract: This study evaluates the usage of OpenAI’s ChatGPT Large Language Model (LLM) as a tool for constructing multiple choice questions for assessing student academic performance through quizzes and exams. We randomly deploy questions constructed with and without use of the LLM tool and gauge the ability of the students to correctly answer, as well as their ability to correctly perceive the difference between humanauthored and LLM-authored questions. In determining whether the questions written with the aid of ChatGPT were consistent with the instructor’s questions and source text, we computed representative vectors of both the human and ChatGPT questions using SBERT and compared cosine similarity to the course textbook. A non-significant Mann- Whitney U test (z = 1.018, p = .309) suggests that students were unable to perceive whether questions were written with or without the aid of ChatGPT. However, student scores on LLM-authored questions were almost 9% lower (z = 2.702, p < .01). This result may indicate that either the AI questions were more difficult or that the students were more familiar with the instructor’s style of questions. Overall, the study suggests that while there is potential for using LLM tools to aid in the construction of assessments, care must be taken to ensure that the questions are fair, well-composed, and relevant to the course material.

Abstract: The increased screen time and isolation caused by the COVID19 pandemic have led to a significant surge in cases of online grooming, which is the use of strategies by predators to lure children into sexual exploitation. Previous efforts to detect grooming in industry and academia have involved accessing and monitoring private conversations through centrally-trained models or sending private conversations to a global server. In this work, we implement a privacy-preserving pipeline for the early detection of sexual predators. We leverage federated learning and differential privacy in order to create safer online spaces for children while respecting their privacy. We investigate various privacy-preserving implementations and discuss their benefits and shortcomings. Our extensive evaluation using real-world data proves that privacy and utility can coexist with only a slight reduction in utility.

Abstract: Crop field boundaries are foundational datasets for agricultural monitoring and assessments but are expensive to collect manually. Machine learning (ML) methods for automatically extracting field boundaries from remotely sensed images could help realize the demand for these datasets at a global scale. However, current ML methods for field instance segmentation lack sufficient geographic coverage, accuracy, and generalization capabilities. Further, research on improving ML methods is restricted by the lack of labeled datasets representing the diversity of global agricultural fields. We present Fields of The World (FTW)a novel ML benchmark dataset for agricultural field instance segmentation spanning 24 countries on four continents (Europe, Africa, Asia, and South America). FTW is an order of magnitude larger than previous datasets with 70,462 samples, each containing instance and semantic segmentation masks paired with multi-date, multi-spectral Sentinel-2 satellite images. We provide results from baseline models for the new FTW benchmark, show that models trained on FTW have better zero-shot and fine-tuning performance in held-out countries than models that aren't pre-trained with diverse datasets, and show positive qualitative zero-shot results of FTW models in a real-world scenario -- running on Sentinel-2 scenes over Ethiopia.

Abstract: Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearingimpaired people worldwide. Sign language translation models, relying on video inputs, involves with large parameter sizes, making it time-consuming and computationally intensive to be deployed. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.

Abstract: This talk examines the intersection of artificial intelligence and policymaking, focusing on legislative and regulatory frameworks in the United States. It explores the role of key federal agencies, existing technologyagnostic laws affecting AI, and gaps in regulatory oversight that require legislative intervention. Consumer protection laws are analyzed for their relevance to AI governance, particularly in financial services. The discussion also highlights the implications for AI research, emphasizing the importance of interdisciplinary collaboration between computer scientists and policymakers to ensure responsible AI development that aligns with democratic values and societal interests.

Abstract: While Large Language Models excel in language processing, Large Agent Models are designed to interact with the environment. This transition poses significant challenges in understanding lowerlevel visual details, and long-horizon reasoning for effective goal interpretation and decision-making. Despite the impressive performance of LLMs/VLMs on various benchmarks, these models perceive images as bags of words (semantic concepts). In detail, they use semantic understanding as a shortcut but lack the ability to recognize geometric structures or solve spatial problems such as mazes. To interact with the physical world, we focus on two dimensions: (1) From high-level semantic to low-level geometric understanding: We introduce a low-level visual description language that serves as geometric tokens, allowing the abstraction of multimodal low-level geometric structures. (2) From fast-thinking to slow-thinking: We propose to quantify long-horizon reasoning by incorporating Markov Decision Process (MDP) based decision-making. The key difference between language models and agent models lies in their decision-making capabilities. This fundamental difference necessitates a shift in how we approach the development of large agent models, focusing on both geometric understanding and long-term planning to create more capable embodied AI agents.

Abstract: Geologists seek to understand the relationship between volcanic unrest and eruptions by identifying subtle Volcanic Thermal Features (VTFs) in highresolution satellite imagery. This analysis requires the careful curation of large databases of relevant volcanic thermal information. However, volcanic unrest is characterized by highly subtle thermal anomalies. Manual identification on a global scale is highly labor- and time-intensive. We propose Hotspotter: an end-to-end system to automatically detect subtle volcanic thermal anomalies in satellite images and derive relevant thermal statistics. Previous solutions for automated VTF detection have limited data size and geographic diversity. To accommodate an unprecedentedly large and diverse volcanic dataset, we propose an automated pipeline combining unsupervised anomaly detection with supervised classification to filter anomalous regions. Hotspotter gives 90% anomaly detection accuracy and robust generalization to new volcanoes. Our automated approach can accelerate scientists' search for VTFs to help identify relevant thermal precursors and enable more precise forecasts of global volcanic eruptions.

Abstract: Simulating learner actions helps stresstest open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

Abstract: Given the massive transformation of all areas of society by AI, it is becoming essential to integrate AI literacy into the various school curricula from an early age. However, teaching the basic concepts of AI and Machine Learning (e.g. training a model; artificial neural networks) at the K12 level might seem too abstract, whereas teaching only how to use AI fails to really "open the black box". To overcome these difficulties, we have developed AlphAI, a software resource designed to make the understanding of AI algorithms accessible and attractive to the general public and children as young as 8 years old. This is achieved by making AI very concrete, first by manipulating the learning of educational robots that users train for different behaviors, such as circuit racing, using either supervised or reinforcement learning; second by visualizing in real time in a graphical interface the details of AI algorithms (neural networks, k-nearest neighbors, Q-learning, etc). In addition, the use of the software is not limited to beginners, since it allows to write one's own AI in Python to control the robots. In this paper, we present the basic principles of the software, its graphical interface, how to use it with various educational robots, and example activities with classes from Elementary school to University. AlphAI software and robotic kits are commercially available from Learning Robots.

Abstract: This paper examines the codesign process for a foundational AI microcredential course targeting K-12 teachers' knowledge, agency, and effectiveness in integrating AI into their classrooms. We collaborated with six K-12 teachers and instructional coaches to ensure the course's relevance and practicality. Using conjecture mapping and memoing, we systematically captured and analyzed insights from the collaborative process. These methods helped us pinpoint essential themes and requirements for effective professional development (PD) that meets the unique challenges and opportunities of teaching about and using AI in K-12 classrooms. Themes included concerns about in-class monitoring for unethical impacts of AI integration and the desire for empowerment in evaluating and selecting AI tools that they can best leverage to meet state and national standards. Educator requirements centered on the creation of quick, easily accessible, and asynchronous learning activities. In addition, educators requested just-in-time AI integration resources and learning opportunities that can be leveraged throughout the year, rather than being limited to PD sessions. This study contributes to AI education by providing a framework for designing teacher professional development programs that are responsive to the evolving educational landscape and the specific needs of K-12 teachers.

Abstract: Adoption of artificial intelligence (AI) is at an inflection point. With daily use of AI escalating due to widely available software tools, educators, researchers, and policymakers must adapt swiftly to changing educational needs. While think tanks and Big Tech companies often promote the notion that AI serves as a powerful tool for democratizing access to knowledge and opportunities, our work in rural communities underscores the disparity in access to AI education and related opportunities. In this paper, we report on our experience introducing foundational AI concepts to rural middle school students using an unplugged gamebased learning activity. By providing engaging learning experiences to rural populations, we hope to broaden interest in and understanding of AI technologies. To this end, we conducted a classroom study in which two middle school teachers implemented our unplugged AI learning activity with their students. Analyzing survey data from 60 of the participating students, we explore the impact of the activity on their interest in AI, their conceptual understanding, and examine potential gender differences. Additionally, we share insights from the teachers who participated in our professional development sessions in preparation for the classroom implementations.

Abstract: This paper presents an explainable AI (XAI) education tool designed for K12 classrooms, particularly for students aged 11-16. The tool was designed for interventions on the fundamental processes behind social media platforms, focusing on four AI- and data-driven core concepts: data collection, user profiling, engagement metrics, and recommendation algorithms. An Instagram-like interface and a monitoring tool for explaining the data-driven processes make these complex ideas accessible and engaging for young learners. The tool provides hands-on experiments and real-time visualizations, illustrating how user actions influence their personal experience on the platform as well as the experience of others. This approach seeks to enhance learners' data agency, AI literacy, and sensitivity to AI ethics. The paper includes a case example from 12 two-hour test sessions involving 209 children, using learning analytics to demonstrate how they navigated their social media feeds and the browsing patterns that emerged.

Abstract: My work aims to enable robots to better learn from human feedback in humanrobot interactions. The way in which people want to collaborate with a robot can vary person-to-person, interaction-to-interaction, or even within an interaction with a given person. Thus, robots need to be able to adapt their behavior during interactions. Robots typically learn from humans via explicit feedback, such as evaluative feedback, preferences, or demonstrations. We know that humans also provide additional information implicitly through non-verbal behavior that gives clues about their internal states during interactions. My work investigates how we can incorporate both kinds of feedback into robot learning paradigms.

Abstract: Artificial Intelligence (AI) continues to evolve rapidly, impacting numerous fields, including time series (TS) classification and human activity recognition (HAR). Despite the advancements in deep learning models, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), these models face several challenges, including the need for extensive labeled datasets, significant computational resources, and lack of interpretability. This research aims to address these limitations by developing an adaptive hierarchical deep neural network framework that integrates fuzzy logic principles and adaptive learning techniques for robust, computationally efficient, and interpretable realtime TS analysis. The reduction in the number of parameters and the efficient learning of hierarchical features mean that less training data is needed to achieve robust performance. The model's ability to generalize from hierarchical representations allows it to make effective use of smaller datasets, which is particularly advantageous in scenarios where data is limited or expensive to obtain.The proposed framework specifically targets HAR applications using data from wearable sensors.

Abstract: The exponential growth of unstructured medical data presents a unique opportunity and challenge for advancing healthcare. Traditional methods struggle to extract meaningful insights from this complex data due to its inherent noise, ambiguity, and heterogeneity. To address these limitations, we propose a novel hybrid approach that integrates Knowledge Graphs (KGs) with Large Language Models (LLMs) within a RetrievalAugmented Generation (RAG) framework. By leveraging the structured knowledge of KGs and the contextual understanding of LLMs, we aim to improve the precision of feature extraction and disease progression modeling. Our research focuses on refining the KG representation through advanced entity extraction and relation extraction techniques, ensuring that the KG accurately captures the semantic nuances and temporal dynamics of medical data. By integrating this enhanced KG with the RAG framework, we can derive more precise and informative insights for clinical decision-making.

Abstract: Integrating Large Language Models (LLMs) into software engineering unlocks new opportunities to automate manual processes but raises challenges around reliability, safety, and scalability. My research centers on this synergy, with two key objectives: first, harnessing LLMs to solve software engineering tasks traditionally dependent on laborintensive, domain-specific methods, and second, applying robust software engineering principles to improve LLM safety and performance. This dual focus creates a powerful feedback loop, where LLMs drive innovation while engineering rigor ensures these systems meet the high standards required for real-world applications.

Abstract: Braincomputer interfaces (BCIs) can provide a means of communication for individuals with severe neuromuscular diseases, the target end-users. While personalized BCI machine learning models are the current standard, models trained on data from other users could reduce BCI calibration time. We use a novel dataset with BCI users with and without amyotrophic lateral sclerosis (ALS) and a popular BCI deep learning model, EEGNet, to assess the impact of population domain data on transfer learning of a P300 speller task in the ALS cohort. Results show that training on source data from the non-ALS cohort was detrimental to transfer learning. In contrast, generic EEGNet models trained on source data from the ALS cohort performed comparably as user-specific models. Our findings highlight the need for more data from target end-users populations in publicly available BCI datasets.

Abstract: Adverse Drug Events (ADEs) are a major healthcare issue in the United States, contributing to millions of outpatient and emergency department visits and ranking as the fourth leading cause of death. While many ADEs are identified postmarket, improved detection methods are crucial for enhancing patient safety. This study explores the application of large language models (LLMs) to the n2c2 task for ADE detection, evaluating optimal prompting techniques without requiring ADE-specific training data. Results indicate that an entity-only extraction approach outperforms the inline method, offering higher precision, recall, and token efficiency. This study highlights the potential of LLMs for accurate ADE detection in clinical text, improving performance while maintaining model efficiency.

Abstract: We propose ERFSL, an efficient reward function searcher using large language models (LLMs) for customenvironment, multi-objective reinforcement learning (RL). ERFSL generates reward components based on explicit user requirements and rectifies them, and iteratively optimizes the weights of these components based on textual context. Applied to an underwater data collection RL task, ERFSL corrects reward codes with only one feedback iteration per requirement, and acquires diverse reward functions within the Pareto set. ERFSL also presents robust capability for deviated weights and small-size LLMs such as GPT-4o mini. The full-text prompts, examples of LLM-generated answers, and source code are available at https://360zmem.github.io/LLMRsearcher/ .

Abstract: In recent years, federated learning (FL) has emerged as a promising technique to enable decentralized training of models without the need for data centralization, addressing privacy concerns and reducing communication overhead. The challenge, however, lies in scaling federated systems to accommodate clients with different computational capabilities. The heterogeneity of clients in terms of data, model structures, and computational resources presents significant challenges. Addressing these challenges can lead to more robust and efficient FL systems, making it possible to leverage diverse data sources and computational environments. Here we propose a system where small language models run on heterogeneous cli-ents while a large, more powerful model at the server aggregates their contributions. This architecture leverages the strengths of both small, task-specific client models and a large server model to enhance generalization and efficiency. This is important because it addresses the growing need for scalable, privacy-preserving systems that can operate in diverse environments with varying resources. Through such a system we intend to contribute to the AI field by improving the efficiency of federated learning systems while enhancing their adaptability to real-world applications.

Abstract: Categorical Distributional Reinforcement Learning (CDRL) has demonstrated superior sample efficiency in learning complex tasks compared to conventional Reinforcement Learning (RL) approaches. However, the practical application of CDRL is encumbered by challenging projection steps, detailed parameter tuning, and domain knowledge. This paper addresses these challenges by introducing a pioneering Continuous Distributional ModelFree RL algorithm tailored for continuous action spaces. The proposed algorithm simplifies the implementation of distributional RL, adopting an actor-critic architecture wherein the critic outputs a continuous probability distribution. Additionally, we propose an ensemble of multiple critics fused through a Kalman fusion mechanism to mitigate overestimation bias. Through a series of experiments, we validate that our proposed method provides a sample-efficient solution for executing complex continuous-control tasks.

Abstract: We consider online convex optimization with timevarying constraints and conduct performance analysis using two stringent metrics: dynamic regret with respect to the online solution benchmark, and hard constraint violation that does not allow any compensated violation over time. We propose an efficient algorithm called Constrained Online Learning with Doubly-bounded Queue (COLDQ), which introduces a novel virtual queue that is both lower and upper bounded, allowing tight control of the constraint violation without the need for the Slater condition. We prove via a new Lyapunov drift analysis that COLDQ achieves O(T^(1+Vx)/2) dynamic regret and O(T^Vg) hard constraint violation, where Vx and Vg capture the dynamics of the loss and constraint functions. For the first time, the two bounds smoothly approach to the best-known O(T^1/2) regret and O(1) violation, as the dynamics of the losses and constraints diminish. For strongly convex loss functions, COLDQ matches the best-known O(logT) static regret while maintaining the O(T^Vg) hard constraint violation. We further introduce an expert-tracking variation of COLDQ, which achieves the same performance bounds without any prior knowledge of the system dynamics. Simulation results demonstrate that COLDQ outperforms the state-of-the-art approaches.

Abstract: Lowcost accelerometers play a crucial role in modern society due to their advantages of small size, ease of integration, wearability, and mass production, making them widely applicable in automotive systems, aerospace, and wearable technology. However, this widely used sensor suffers from severe accuracy and range limitations. To this end, we propose a honed-energy regularized and optimal supervised GAN (HEROS-GAN), which transforms low-cost sensor signals into high-cost equivalents, thereby overcoming the precision and range limitations of low-cost accelerometers. Due to the lack of frame-level paired low-cost and high-cost signals for training, we propose an Optimal Transport Supervision (OTS), which leverages optimal transport theory to explore potential consistency between unpaired data, thereby maximizing supervisory information. Moreover, we propose a Modulated Laplace Energy (MLE), which injects appropriate energy into the generator to encourage it to break range limitations, enhance local changes, and enrich signal details. Given the absence of a dedicated dataset, we specifically establish a Low-cost Accelerometer Signal Enhancement Dataset (LASED) containing tens of thousands of samples, which is the first dataset serving to improve the accuracy and range of accelerometers and is released in Github. Experimental results demonstrate that a GAN combined with either OTS or MLE alone can surpass the previous signal enhancement SOTA methods by an order of magnitude. Integrating both OTS and MLE, the HEROS-GAN achieves remarkable results, which doubles the accelerometer range while reducing signal noise by two orders of magnitude, establishing a benchmark in the accelerometer signal processing.

Abstract: Federated learning has become a promising solution for collaboration among medical institutions. However, data owned by each institution would be highly heterogeneous and the distribution is always nonindependent and identical distribution (non-IID), resulting in client drift and unsatisfactory performance. Despite existing federated learning methods attempting to solve the non-IID problems, they still show marginal advantages but rely on frequent communication which would incur high costs and privacy concerns. In this paper, we propose a novel federated learning method: Federated learning via Valuable Condensed Knowledge (FedVCK). We enhance the quality of condensed knowledge and select the most necessary knowledge guided by models, to tackle the non-IID problem within limited communication budgets effectively. Specifically, on the client side, we condense the knowledge of each client into a small dataset and further enhance the condensation procedure with latent distribution constraints, facilitating the effective capture of high-quality knowledge. During each round, we specifically target and condense knowledge that has not been assimilated by the current model, thereby preventing unnecessary repetition of homogeneous knowledge and minimizing the frequency of communications required. On the server side, we propose relational supervised contrastive learning to provide more supervision signals to aid the global model updating. Comprehensive experiments across various medical tasks show that FedVCK can outperform state-of-the-art methods, demonstrating that it's non-IID robust and communication-efficient.

Abstract: Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learningto-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and non-stationary distribution shifts in search systems at scale. BayesCNS achieves this by estimating prior distributions for user-item interactions, which are continuously updated with new user interactions gathered online. This online learning procedure is guided by a ranker model, enabling efficient exploration of relevant items using contextual information provided by the ranker. We successfully deployed BayesCNS in a large-scale search system and demonstrated its efficacy through comprehensive offline and online experiments. Notably, an online A/B experiment showed a 10.60% increase in new item interactions and a 1.05% improvement in overall success metrics over the existing production baseline.

School of Computer Science and Engineering, Central South University, Changsha Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, School of Computer Science and Engineering, Central South University, Changsha Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, School of Computer Science and Engineering, Central South University, Changsha Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi

Abstract: Identifying cancer genes is crucial for treatment and understanding pathogenesis. Recent methods typically leverage proteinprotein interaction (PPI) networks or gene functional association data from annotated gene sets. There may be some shared neighborhood structure information between these two types of gene association data. While this common information may contain more accurate gene association information, existing methods often overlook this potential. To address this gap, we introduce DISFusion, which integrates multi-omics cancer data, PPI networks, and gene functional associations to identify cancer genes. A key innovation of DISFusion is the cross-view decorrelation loss, which enhances the common information between PPI networks and gene functional associations, thereby improving prediction accuracy. Extensive experiments indicate that DISFusion outperforms state-of-the-art methods and exhibits greater generalization ability. Moreover, analysis of CPTAC pan-cancer proteomic data highlights significant associations between the 30 novel cancer genes predicted by DISFusion and multiple cancer types, underscoring its practical utility. These findings validate the effectiveness of enhancing common information and provide new insights into cancer gene identification.

Abstract: Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can prevent a significant amount of human error from the engineering process and result in less errors. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this work, we propose VerilogCoder, a system of multiple Artificial Intelligence (AI) agents for Verilog code generation, to autonomously write Verilog code and fix syntax and functional errors using collaborative Verilog tools (i.e., syntax checker, simulator, and waveform tracer). Firstly, we propose a task planner that utilizes a novel Task and Circuit Relation Graph retrieval method to construct a holistic plan based on module descriptions. To debug and fix functional errors, we develop a novel and efficient abstract syntax tree (AST)based waveform tracing tool, which is integrated within the autonomous Verilog completion flow. The proposed methodology successfully generates 94.2% syntactically and functionally correct Verilog code, surpassing the state-of-the-art methods by 33.9% on the VerilogEval-Human v2 benchmark.

Abstract: The scarcity of highquality large-scale labeled datasets poses a huge challenge for employing deep learning models in video deception detection. To address this issue, inspired by the psychological theory on the relation between deception and expressions, we propose a novel method called AFFAKT in this paper, which enhances the classification performance by transferring useful and correlated knowledge from a large facial expression dataset. Two key challenges in knowledge transfer arise: 1) how much knowledge of facial expression data should be transferred and 2) how to effectively leverage transferred knowledge for the deception classification model during inference. Specifically, the optimal relation mapping between facial expression classes and deception samples is firstly quantified using proposed H-OTKT module and then transfers knowledge from the facial expression dataset to deception samples. Moreover, a correlation prototype within another proposed module SRKB is well designed to retain the invariant correlations between facial expression classes and deception classes through momentum updating. During inference, the transferred knowledge is fine-tuned with the correlation prototype using a sample-specific re-weighting strategy. Experimental results on two deception detection datasets demonstrate the superior performance of our proposed method. The interpretability study reveals high associations between deception and negative affections, which coincides with the theory in psychology.

Abstract: The emergence of unveiling humanlike behaviors in Large Language Models (LLMs) has led to a closer connection between NLP and human psychology. However, research on the personalities exhibited by LLMs has largely been confined to limited investigations using individual psychological tests, primarily focusing on a small number of commercially licensed LLMs. This approach overlooks the extensive use and significant advancements observed in open-source LLMs. This work aims to address both the above limitations by conducting an in-depth investigation of a significant body of 12 LLM Agents based on the most representative Open models, through the two most well-known psychological assessment tests, namely Myers-Briggs Type Indicator (MBTI) and Big Five Inventory (BFI). Our approach involves evaluating the intrinsic personality traits of LLM agents and determining the extent to which these agents can mimic human personalities when conditioned by specific personalities and roles. Our findings unveil that (i) each LLM agent showcases distinct human personalities; (ii) personality-conditioned prompting produces varying effects on the agents, with only few successfully mirroring the imposed personality, while most of them being ``closed-minded'' (i.e., they retain their intrinsic traits); and (iii) combining role and personality conditioning can enhance the agents' ability to mimic human personalities. Our work represents a step up in understanding the dense relationship between NLP and human psychology through the lens of LLMs.

Abstract: As Large Language Models (LLMs) are used for increasingly complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate research and insights from philosophy, psychology, and education. Here we focus on understanding algorithms, and propose a hierarchy of levels of understanding. We validate the hierarchy using a study with human subjects (undergraduate and graduate students). Following this, we apply the hierarchy to large language models (generations of GPT), revealing interesting similarities and differences with humans. We expect that our rigorous criteria for algorithm understanding will help monitor and quantify AI's progress in such cognitive domains.

Abstract: Oriented object detection is crucial for complex scenes such as aerial images and industrial inspection, providing precise delineation by minimizing background interference. Recently, the weaklysupervised detector paradigm H2RBox has demonstrated promise in learning rotated bounding box (RBox) from the more readily available horizontal bounding box (HBox), alleviating the scarcity and high cost of RBox annotations. However, these H2RBox-based methods have primarily focused on the gap in orientation information between HBox- and RBox-supervised approaches, overlooking the gap in training sample selection. In response, we propose the Adaptive Fine-grained Sample Mining (AFSM) strategy, which improves the selection of fine-grained training samples in HBox-supervised methods. AFSM assigns the best-matching prediction RBox to each ground truth (GT) HBox and selects positive samples based on these paired boxes. Furthermore, to effectively filter the best-matching prediction RBox for AFSM, we introduce the Prediction Rbox Assignment (PRA) scheme, employing Kullback-Leibler Divergence (KLD) as a localization quality metric. Additionally, we introduce an improved self-supervised branch loss (Lss) to address the symmetry of weakly-supervised branch prediction boxes. Incorporating these core components (AFSM, PRA, and Lss), we develop an end-to-end network architecture (BGHR) to further bridge the gap between HBox- and RBox-supervised oriented object detection. Extensive experiments on DOTA-v1.0 and DIOR-R demonstrate that BGHR achieves state-of-the-art performance compared to HBox-supervised methods without additional overhead. Even when benchmarked against fully supervised FCOS, our method still exhibits a slight performance advantage.

Abstract: Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models' insufficient domainspecific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model's understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.

School of Computer Science and Information Engineering, Hefei University of Technology Institute of Information Engineering, Chinese Academy of Sciences, School of Computer Science and Engineering, Beihang University, School of Artificial Intelligence, Beihang University, School of Computer Science and Information Engineering, Hefei University of Technology, School of Artificial Intelligence, Beihang University, Meituan, Institute of Information Engineering, Chinese Academy of Sciences, School of Artificial Intelligence, Beihang University

Abstract: In this paper, we propose an AudioLanguage-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT’s temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods.

Karlsruhe Institute of Technology, Karlsruhe, Germany, Institute for AI in Medicine (IKIM), University Medicine Essen, Essen, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany, Institute for AI in Medicine (IKIM), University Medicine Essen, Essen, Germany, Karlsruhe Institute of Technology, Karlsruhe, Germany

Abstract: We present ConnectedComponent (CC)-Metrics, a novel semantic segmentation evaluation protocol, targeted to align existing semantic segmentation metrics to a multi-instance detection scenario in which each connected component matters. We motivate this setup in the common medical scenario of semantic metastases segmentation in a full-body PET/CT. We show how existing semantic segmentation metrics suffer from a bias towards larger connected components contradicting the clinical assessment of scans in which tumor size and clinical relevance are uncorrelated. To rebalance existing segmentation metrics, we propose to evaluate them on a per-component basis thus giving each tumor the same weight irrespective of its size. To match predictions to ground-truth segments, we employ a proximity-based matching criterion, evaluating common metrics locally at the component of interest. Using this approach, we break free of biases introduced by large metastasis for overlap-based metrics such as Dice or Surface Dice. CC-Metrics also improves distance-based metrics such as Hausdorff Distances which are uninformative for small changes that do not influence the maximum or 95th percentile, and avoids pitfalls introduced by directly combining counting-based metrics with overlap-based metrics as it is done in Panoptic Quality.

Abstract: Large visionlanguage models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and comprehensively evaluate how well the output of the state-of-the-art vision-language model correlates with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ, not only accuracy, but also three new human-correlated metrics for the first time in VQA, to investigate the impact of HUD. We also verify the effect of common calibration and human calibration (Baan et al. 2022) on the alignment of models and humans. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, to better align model confidence with human uncertainty. Our findings highlight that for VQA, the alignment between human responses and model predictions is understudied and is an important target for future studies.

State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract: Point cloud segmentation has a wide range of applications in autonomous driving, augmented reality and virtual reality. Multimodal fusion strategies have received increasing attention in point cloud segmentation recently. Despite the success, existing methods usually generate unnecessary information loss or redundancy. In this paper, we propose FEAST-Mamba, a novel FEAture and SpaTial aware Mamba network to tackle multi-modal point cloud segmentation. To exploit the complementarity between different modals, we propose a bidirectional orthogonal attention module, where features are first bidirectionally interacted with each other through cross-modal attention, and then orthogonal fusion is used to reduce feature redundancy. Furthermore, a reordering strategy is proposed for the Mamba architecture that takes into account both spatial and semantic information during cross-modal feature ordering. Experiments on indoor datasets, S3DIS and ScanNet, and outdoor datasets, nuScenes and SemanticKITTI, show that the proposed method achieves state-of-the-art performances.

Abstract: In the context of OmniDirectional Image (ODI) Super-Resolution (SR), the unique challenge arises from the non-uniform oversampling characteristics caused by EquiRectangular Projection (ERP). Considerable efforts in designing complex spherical convolutions or polyhedron reprojection offer significant performance improvements but at the expense of cumbersome processing procedures and slower inference speeds. Under these circumstances, this paper proposes a new ODI-SR model characterized by its capacity to perform Fast and Arbitrary-scale ODI-SR processes, denoted as FAOR. The key innovation lies in adapting the implicit image function from the planar image domain to the ERP image domain by incorporating spherical geometric priors at both the latent representation and image reconstruction stages, in a low-overhead manner. Specifically, at the latent representation stage, we adopt a pair of pixel-wise and semantic-wise sphere-to-planar distortion maps to perform affine transformations on the latent representation, thereby incorporating it with spherical properties. Moreover, during the image reconstruction stage, we introduce a geodesic-based resampling strategy, aligning the implicit image function with spherical geometrics without introducing additional parameters. As a result, the proposed FAOR outperforms the state-of-the-art ODI-SR models with a much faster inference speed. Extensive experimental results and ablation studies have demonstrated the effectiveness of our design.

Abstract: Solving jigsaw puzzles has been extensively studied. While most existing models focus on solving either smallscale puzzles or puzzles with no gap between fragments, solving large-scale puzzles with gaps presents distinctive challenges in both image understanding and combinatorial optimization. To tackle these challenges, we propose a framework of Evolutionary Reinforcement Learning with Multi-head Puzzle Perception (ERL-MPP) to derive a better set of swapping actions for solving the puzzles. Specifically, to tackle the challenges of perceiving the puzzle with gaps, a Multi-head Puzzle Perception Network (MPPN) with a shared encoder is designed, where multiple puzzlet heads comprehensively perceive the local assembly status, and a discriminator head provides a global assessment of the puzzle. To explore the large swapping action space efficiently, an Evolutionary Reinforcement Learning (EvoRL) agent is designed, where an actor recommends a set of suitable swapping actions from a large action space based on the perceived puzzle status, a critic updates the actor using the estimated rewards and the puzzle status, and an evaluator coupled with evolutionary strategies evolves the actions aligning with the historical assembly experience. The proposed ERL-MPP is comprehensively evaluated on the JPLEG-5 dataset with large gaps and the MIT dataset with large-scale puzzles. It significantly outperforms all state-of-the-art models on both datasets.

Abstract: With the availability of egocentric 3D handobject interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state-of-the-arts in (compositional) action recognition.

School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, The University of Sydney, School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Jiangsu University, Shenzhen Campus of Sun Yat-sen University, School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, College of Computing and Data Science, Nanyang Technological University

Abstract: Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in highresolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K & 8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine, a novel training-free framework for enhancing MLLM perception of HR images. Our method follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM's understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our method brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks).

Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Computer Science and Technology, Soochow University, Suzhou, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

Abstract: Pseudolabel learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of fully-supervised detection head, leading to significant performance leakage. Issues with noisy labels include:(1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potential useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model outperforms the previous state-of-the-art method in detection accuracy and inference speed greatly upon the THUMOS14 and ActivityNet v1.2 benchmarks.

Abstract: The rise in sophisticated image forgery techniques, driven by advancements in image editing and generation, has posed new security challenges. Traditional methods, designed for specific tampering artifacts, struggle with outof-distribution image forgery detection. In this paper, we propose a shift in paradigm, placing greater emphasis on the universal characteristics of authentic images, as opposed to solely focusing on specific forgery signals. We introduce an enhancement to the Masked Autoencoder (MAE), aptly termed the Forgery MAE (FMAE). This modification retains the inherent characteristics of natural images while integrating multi-source forgery information. Our implementation involves applying the lottery ticket hypothesis during pre-training to identify forgery-sensitive parameters, followed by their sparse fine-tuning to target the forgery detection and localization task. Concurrently, we develop a ``mixture of experts'' noise extractor to compile multi-source forgery data. Our FMAE effectively extracts forgery features and shows strong resilience against unseen forgeries. Extensive experiments across multiple datasets confirm our method's superior accuracy and generalization capability over existing techniques.

Abstract: Cascade ranking architecture, composed of matching, preranking, ranking and re-ranking stages, is usually adopted to balance the efficiency and effectiveness in real-world recommendation system (RS). As the middle stage of RS, pre-ranking aims to quickly filter out the low-quality items selected at the matching stage and then forwarding high-quality items to the ranking stage. Existing pre-ranking approaches mainly endure two problems 1) Sample Selection Bias (SSB) problem, which heavily limits the performance improvement of filtering out low-quality items owing to ignoring the data flow between stages; and 2) Ranking Consistency (RC) problem, which may cause the ranked lists of the ranking stage and previous pre-ranking stage to be inconsistent. As a result, the competitive items with high scores at the ranking stage may not be selected because of low scores at the pre-ranking stage. These both two problems may cause sub-optimal performances, but previous works usually only focus on the one of them. In this paper, we propose a novel Sample Debias and Ranking Consistency Joint Learning Framework (SDCL) to jointly alleviate SSB and RC problems. SDCL consists of two main modules including 1) Multi-Task Distillation Module (MTD), which enhances the ability of identifying high-quality items by distilling knowledge across all tasks simultaneously from the more complex ranking model which jointly trained with the pre-ranking model; and 2) Adaptive Negative Sample Learning Module (ANSL), which improves the performance of filtering out low-quality items by adaptively adjusting negative samples learning weights based on the current performance of model. SDCL seamlessly integrates two modules in an end-to-end multi-task learning framework. Evaluations on both real-world large-scale traffic logs and online A/B test demonstrate the efficacy and superiority of SDCL.

Abstract: Audiovisual navigation has received considerable attention in recent years. However, the majority of related investigations have focused on single sound-source scenarios. Studies in this field for multiple sound-source scenarios remain underexplored due to the limitations of two aspects. First, the existing audio-visual navigation dataset only has limited audio samples, making it difficult to simulate diverse multiple sound-source environments. Second, existing navigation frameworks are mainly designed for single sound-source scenarios, thus their performance is severely reduced in multiple sound-source scenarios. In this work, we make an attempt to fill in these two research gaps to some extent. First, we establish a large-scale BEnchmark Dataset for Audio-Vsual Navigation, namely BeDAViN. This dataset consists of 2,258 audio samples with a total duration of 10.8 hours, which is more than 33 times longer than the existing audio dataset employed in the audio-visual navigation task. Second, we propose a new Embodied Navigation framework for MUltiple Sound-Sources Scenarios called ENMuS3. There are mainly two essential components in ENMuS3, the sound event descriptor and the multi-scale scene memory transformer. The former component equips the agent with the ability to extract spatial and semantic features of the target sound-source among multiple sound-sources, while the latter provides the ability to track the target object effectively in noisy environments. Experimental results on our BeDAViN show that ENMuS3 strongly outperforms its counterparts with a significant improvement in success rates across diverse scenarios.

Abstract: We study reasoning about relative position, orientation and distance of moving objects in 2D space. We first construct a new hybrid calculus HOPA by augmenting qualitative distance and quantitative constraints into Oriented Point Relation Algebra (OPRA). Then we develop a framework for consistency checking and reasoning with HOPA using Answer Set Programming. This framework can also explain the source of inconsistency, infer new knowledge and generate a layout of objects and their orientation in the discrete space. The framework is capable of reasoning with (un)certain, heterogenous and presumed information. We evaluate efficiency and scalability of our method by computational experiments, and illustrate its applications with sample scenarios from robotic perception and marine navigation.

Abstract: Due to the communication bottleneck in distributed and decentralized federated learning applications, algorithms using compressed communication have attracted significant attention. The Error Feedback (EF) is a widelystudied compression framework for convergence with biased compressors such as top-k sparsification. Although various improvements have been obtained in recent years, the theoretical guarantee for EF-type framework is still limited. Previous works either 1) rely on strong assumptions such as bounded gradient/dissimilarity assumptions, thus can not deal with arbitrary data heterogeneity and also slow the convergence speed, or 2) can not enjoy linear speedup in the number of clients. In this work, we propose a new EFSkip framework which removes the strong assumptions to allow arbitrary data heterogeneity and enjoys linear speedup for significantly improving upon previous results. In particular, EFSkip achieves a substantially lower computational complexity compared to the previous EF21, i.e., EFSkip enjoys the linear speedup in the number of clients (reducing the result linearly using more clients). We also show that EFSkip enjoys linear speedup and achieves faster convergence for nonconvex problems satisfying Polyak-Lojasiewicz (PL) condition. We believe that the new EFSkip framework will have a large impact on the communication- and computation-efficient distributed and decentralized federated learning.

Abstract: Machine learning has grown in popularity to help assign resources and make decisions about users, which can result in discrimination. This includes hiring markets, where employers have increasingly been interested in using automated tools to help hire candidates. In response, there has been significant effort to understand and mitigate the sources of discrimination in these tools. However, previous work has largely assumed that discrimination, in any area of ML, is the result of some initial unequal distribution of resources across groups: One group is on average less qualified, there is less training data for one group, or the classifier is less accurate on one group, etc. However, recent work have suggested that there are other sources of discrimination, such as relational inequality, that are notably nondistributional. First, we show consensus in strategy choice is a non-distributional source of inequality at equilibrium in games: We provide subgame perfect equilibria in a simple sequential model of a hiring market with Rubinstein-style bargaining between firms and candidates that exhibits asymmetric wages resulting from differences in agents' threat strategies during bargaining. Second, we give an initial analysis of how agents could learn such strategies via convergence of an online learning algorithm to asymmetric equilibria. Ultimately, this work motivates the further study of endogenous, possibly non-distributional, mechanisms of inequality in ML.

Abstract: Today, deep neural networks are widely used since they can handle a variety of complex tasks. Their generality makes them very powerful tools in modern technology. However, deep neural networks are often overparameterized. The usage of these large models consumes a lot of computation resources. In this paper, we introduce a method called Till the Layers Collapse (TLC), which compresses deep neural networks through the lenses of batch normalization layers. By reducing the depth of these networks, our method decreases deep neural networks' computational requirements and overall latency. We validate our method on popular models such as SwinT, MobileNet-V2, and RoBERTa, across both image classification and natural language processing (NLP) tasks.

Department of Electrical and Computer Engineering, University of Houston, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Frontier Research Center, Peng Cheng Laboratory, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Department of Electrical and Computer Engineering, Stevens Institute of Technology, Department of Electrical and Computer Engineering, University of Houston, Department of Electrical and Computer Engineering, University of Houston

Abstract: As a popular distributed learning paradigm, federated learning (FL) over mobile devices fosters numerous applications, while their practical deployment is hindered by participating devices' computing and communication heterogeneity. Some pioneering research efforts proposed to extract subnetworks from the global model, and assign as large a subnetwork as possible to the device for local training based on its full computing capacity. Although such fixed size subnetwork assignment enables FL training over heterogeneous mobile devices, it is unaware of (i) the dynamic changes of devices' communication and computing conditions and (ii) FL training progress and its dynamic requirements of local training contributions, both of which may cause very long FL training delay. Motivated by those dynamics, in this paper, we develop a wireless and heterogeneity aware latency efficient FL (WHALEFL) approach to accelerate FL training through adaptive subnetwork scheduling. Instead of sticking to the fixed size subnetwork, WHALE-FL introduces a novel subnetwork selection utility function to capture device and FL training dynamics, and guides the mobile device to adaptively select the subnetwork size for local training based on (a) its computing and communication capacity, (b) its dynamic computing and/or communication conditions, and (c) FL training status and its corresponding requirements for local training contributions. Our evaluation shows that, compared with peer designs, WHALE-FL effectively accelerates FL training without sacrificing learning accuracy.

Abstract: We address the challenge of multiagent cooperation, where agents achieve a common goal by cooperating with decentralized agents under complex partial observations. Existing cooperative agent systems often struggle with efficiently processing continuously accumulating information, managing globally suboptimal planning due to lack of consideration of collaborators, and addressing false planning caused by environmental changes introduced by other collaborators. To overcome these challenges, we propose the RElevance, Proximity, and Validation-Enhanced Cooperative Language Agent (REVECA), a novel cognitive architecture powered by GPT-4o-mini. REVECA enables efficient memory management, optimal planning, and cost-effective prevention of false planning by leveraging Relevance Estimation, Adaptive Planning, and Trajectory-based Validation. Extensive experimental results demonstrate REVECA's superiority over existing methods across various benchmarks, while a user study reveals its potential for achieving trustworthy human-AI cooperation.

Abstract: A poster from a long input document can be considered as a onepage easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a very less studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity and alignment of text and images. Then, we use an LLM based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.

Abstract: Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus to generate a new token and keeps all generated tokens in the vocabulary, it unavoidably holds tokens that primarily act as components of a longer token and appear infrequently on their own. We term such tokens as Scaffold Tokens. Due to their infrequent occurrences in the text corpus, Scaffold Tokens pose a learning imbalance issue. To address that issue, we propose ScaffoldBPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE method. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness.

Abstract: Continual Semantic Parsing (CSP) aims to train parsers to convert natural language questions into SQL across tasks with limited annotated examples, adapting to dynamically updated databases in realworld scenarios. Previous studies mitigate this challenge by replaying historical data or employing parameter-efficient tuning (PET), but they often violate data privacy or rely on ideal continual learning settings. To address these issues, we propose a new Large Language Model (LLM)-Enhanced Continuous Semantic Parsing method, named LECSP, which alleviates forgetting while encouraging generalization, without requiring real data replay or ideal settings. Specifically, it first analyzes the commonalities and differences between tasks from the SQL syntax perspective to guide LLMs in reconstructing key memories and improving memory accuracy through calibration. Then, it uses a task-aware dual-teacher distillation framework to promote the accumulation and transfer of knowledge during sequential training. Experimental results on two CSP benchmarks show that our method significantly outperforms existing methods, even those utilizing data replay or ideal settings. Additionally, we achieve generalization performance beyond upper limits, better adapting to unseen tasks.

Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, University of Electronic Science and Technology of China, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences

Abstract: The storage and recall of factual associations in autoregressive transformer language models (LMs) have drawn a great deal of attention, inspiring knowledge editing by directly modifying the located model weights. Most editing works achieve knowledge editing under the guidance of existing interpretations of knowledge recall that mainly focus on subject knowledge. However, these interpretations are seriously flawed, neglecting relation information and leading to the *over-generalizing* problem for editing. In this work, we discover a novel relation-focused perspective to interpret the knowledge recall of transformer LMs during inference and apply it on single knowledge editing to avoid over-generalizing. Experimental results on the dataset supplemented with a new R-Specificity criterion demonstrate that our editing approach significantly alleviates over-generalizing while remaining competitive on other criteria, breaking the domination of subject-focused editing for future research.

Abstract: Automatic prompt optimization is an important approach to improving the performance of large language models (LLMs). Recent research demonstrates the potential of using LLMs as prompt optimizers, which can generate improved task prompts via iterative refinement. In this paper, we propose a novel perspective to investigate the design of LLMbased prompt optimizers, by drawing an analogy with gradient-based model optimizers. To connect these two approaches, we identify two pivotal factors in model parameter learning: update direction and update method. By systematically analyzing a rich set of improvement strategies on the two aspects, we further develop a capable Gradient-inspired LLM-based Prompt Optimizer called GPO. At each step, it first retrieves relevant prompts from the optimization trajectory as the update direction. Then, it utilizes the generation-based refinement strategy to perform the update, while controlling the edit distance through a cosine-based decay strategy. Extensive experiments demonstrate the effectiveness and efficiency of GPO. In particular, GPO brings an additional improvement of up to 56.8% on Big-Bench Hard and 62.6% on MMLU compared to baseline methods.

Abstract: Recent years have witnessed a profound evolution in the abilities of Large Language Model, which has significantly boosted the proliferation of roleplaying agents and platforms. Nonetheless, there is a conspicuous absence of systematic and comprehensive evaluations of role-playing abilities which are truly aligned with users' interaction scenarios in real-world. To address this gap, we have devised DMT-RoleBench, a benchmark designed to evaluate the role-playing abilities of large language models and agents based on dynamic multi-turn dialogues. Compared with existed role-playing benchmarks, DMT-RoleBench boasts several principal advantages: (1) It contains a more diverse role types and system prompts of different formats. (2) We propose an innovative evaluation paradigm to assess role-playing abilities based on dynamically generating multi-turn dialogues constrained by specific evaluation intents and topics, which is well aligned with users' interaction scenarios in real-world. (3) We define a three-tiered metric system and provide DMT-RM, which is a reward model aligned with human annotations, to annotate the dialogues. And we propose DMT-Score to calculate the final scores based on the annotated dialogues. Our experiments and analysis of leading models equipped with role-playing abilities have demonstrated the effectiveness of DMT-RoleBench.

Abstract: Farmers rely on infield observations to make well-informed crop management decisions to maximize profit and minimize adverse environmental impact. However, obtaining real-world crop state measurements is labor-intensive, time-consuming and expensive. In most cases, it is not feasible to gather crop state measurements before every decision moment. Moreover, in previous research pertaining to farm management optimization, these observations are often assumed to be readily available without any cost, which is unrealistic. Hence, enabling optimization without the need to have *temporally complete* crop state observations is important. An approach to that problem is to include measuring as part of decision making. As a solution, we apply reinforcement learning (RL) to recommend opportune moments to simultaneously measure crop features and apply nitrogen fertilizer. With realistic considerations, we design an RL environment with explicit crop feature measuring costs. While balancing costs, we find that an RL agent, trained with recurrent PPO, discovers adaptive measuring policies that follow critical crop development stages, with results aligned by what domain experts would consider a sensible approach. Our results highlight the importance of measuring when crop feature measurements are not readily available.

Abstract: Artificial General Intelligence is the idea that someday an hypothetical agent will arise from artificial intelligence (AI) progresses, and will surpass by far the brightest and most gifted human minds. This idea has been around since the early development of AI. Since then, scenarios on how such AI may behave towards humans have been the subject of many fictional and research works. This paper analyzes the current state of artificial intelligence progresses, and how the current AI race with the ever faster release of impressive new AI methods (that can deceive humans, outperform them at tasks we thought impossible to tackle by AI a mere decade ago, and that disrupt the job market) have raised concerns that Artificial General Intelligence (AGI) might be coming faster that we thought. In particular, we focus on 3 specific families of modern AIs to develop the idea that deep neural networks, which are the current backbone of nearly all artificial intelligence methods, are poor candidates for any AGI to arise due to their many limitations, and therefore that any threat coming from the recent AI race does not lie in AGI but in the limitations, uses, and lack of regulations of our current models and algorithms.

Defence Science & Technology Group (DSTG), Australia., Army Combat Capabilities Development Command (DEVCOM), USA., University of Canterbury, New Zealand., Cybermonic, USA., Defence Science & Technology Group (DSTG), Australia., Punch Cyber Analytics, USA., Army Combat Capabilities Development Command (DEVCOM), USA., Naval Research Laboratory (NRL), USA., Air Force Research Laboratory (AFRL), USA., Defence Science Technology Laboratory (Dstl), United Kingdom., Punch Cyber Analytics, USA., Naval Information Warfare Center Pacific, USA., Naval Information Warfare Center Pacific, USA., Defence Science Technology Laboratory (Dstl), United Kingdom., Cybermonic, USA., Cornell University, USA., Defence Research and Development Canada (DRDC), Canada., Cybermonic, USA., Defence Research and Development Canada (DRDC), Canada., Naval Information Warfare Center Pacific, USA., Defence Science & Technology Group (DSTG), Australia., Defence Science Technology Laboratory (Dstl), United Kingdom., Defence Science Technology Laboratory (Dstl), United Kingdom., Defence Science Technology Laboratory (Dstl), United Kingdom., Defence Science Technology Laboratory (Dstl), United Kingdom., National Security Agency (NSA), USA., Defence Research and Development Canada (DRDC), Canada., Defence Science Technology Laboratory (Dstl), United Kingdom., University of Canterbury, New Zealand., Cornell University, USA.

Abstract: As cyber threats become increasingly automated and sophisticated, novel solutions must be introduced to improve defence of enterprise networks. Deep Reinforcement Learning (DRL) has demonstrated potential in mitigating these advanced threats. Single DRL Agents have proven utility toward execution of autonomous cyber defence. Despite the success of employing single DRL Agents, this approach presents significant limitations, especially regarding scalability within large enterprise networks. An attractive alternative to the single agent approach is the use of MultiAgent Reinforcement Learning (MARL). However, developing MARL agents is costly with few options for examining MARL cyber defence techniques against adversarial agents. This paper presents a MARL network security environment, the fourth iteration of the Cyber Autonomy Gym for Experimentation (CAGE) challenges. This challenge was specifically designed to test the efficacy of MARL algorithms in an enterprise network. Our work aims to evaluate the potential of MARL as a robust and scalable solution for autonomous network defence.

Abstract: Health and longevity are topics of great interest, leading to an exploration of the Japanese concept of ikigai, known for its impact on a fulfilling, extended life. Ikigai levels are dynamic, changing with personal growth and life situations, but traditional assessment methods are timeconsuming, discouraging frequent tracking. In this paper, we propose Personalized Optimization and Wellbeing Enhancement Recommendation (POWER), which integrates an ikigai simulator to pre- dict ikigai levels from profile information and a hobby recommender that uses reinforcement learning to adapt recommendations based on continuous user feedback. Our methods, validated through both offline data and an online user study, effectively capture and enhance ikigai.

Abstract: We describe the development of a onecredit course to promote AI literacy at the University of Texas at Austin. In response to a call for the rapid deployment of class that would serve a broad audience in Fall of 2023, we designed a 14-week seminar-style course that incorporated an interdisciplinary group of speakers who lectured on topics ranging from the fundamentals of AI to societal concerns including disinformation and employment. University students, faculty, and staff, and even community members outside of the University were invited to enroll in this online offering: The Essentials of AI for Life and Society. We collected feedback from course participants through weekly reflections and a final survey. Satisfyingly, we found that attendees reported gains in their AI literacy. We sought critical feedback through quantitative and qualitative analysis, which uncovered challenges in designing a course for this general audience. We utilized the course feedback to design a three-credit version of the course that is being offered in Fall of 2024. The lessons we learned and our plans for this new iteration may serve as a guide to instructors designing AI courses for a broad audience.

Lee Kong Chian School of Medicine, Nanyang Technological University, Department of Endocrinology, Tan Tock Seng Hospital, Lee Kong Chian School of Medicine, Nanyang Technological University Department of Preventive and Population Medicine, Tan Tock Seng Hospital, Lee Kong Chian School of Medicine, Nanyang Technological University, Lee Kong Chian School of Medicine, Nanyang Technological University, Mohammed Bin Rashid University of Medicine and Health Sciences, National Healthcare Group, Lee Kong Chian School of Medicine, Nanyang Technological University, Lee Kong Chian School of Medicine, Nanyang Technological University, Lee Kong Chian School of Medicine, Nanyang Technological University College of Computing and Data Science, Nanyang Technological University

Abstract: Artificial Intelligence (AI) has rapidly transformed the medical field, necessitating significant changes in medical education to prepare healthcare professionals for future work requirements. However, the integration of AI into medical curricula has been slow and lacks standardization. In this paper, we present our work in developing a yearlong postgraduate-level AI in Medicine program offered by a medical school at a public university in Singapore. Our curriculum design follows Kern's six-step approach to medical curriculum development, organized into a four-session framework. These sessions involved collaboration with hospital and university administrators, educators, industry experts, and healthcare professionals. The program is structured around three core courses: Foundational Healthcare AI, Clinical Applications of Healthcare AI, and Governance and Ethics for Healthcare AI. Each course comprises multiple modules with associated projects, emphasizing hands-on learning. The program adopts a problem-based learning approach, supported by a blended learning environment to accommodate the schedules of working healthcare professionals. Evaluations by industry experts highlight the program's potential to address critical gaps in the healthcare sector. This study contributes to the integration of AI into medical training by providing a standardized approach that can be adapted globally.

Abstract: Human–Computer Interaction for AI Systems Design is an eightweek short online course aimed at professional students. It is part of an online course platform called Cambridge Advance Online, which is a joint effort between Cambridge University Press & Assessment and the University of Cambridge. This course launched in July 2023 amidst a massive increase in interest in AI and its applications, and quickly became one of the platform's highest-enrolling courses, attracting about 50 students per quarterly course run. To date, more than 200 students have completed the course, and more than 90 percent have rated their experience `good' or `excellent'. This paper reports on our experiences in designing and teaching this course.

Abstract: There is a growing consensus on the importance of AI ethics in K12 education, yet effective teaching remains a challenge. AI ethics requires an interdisciplinary understanding of computer science, philosophy, and the humanities, alongside epistemic insights into how AI systems acquire, process, and apply knowledge differently from humans. To address this challenge, this study presents the design, development, and implementation of three theory-informed activities aimed at fostering epistemic insight and ethical understanding of AI among upper primary school students (ages 10-12). Grounded in constructionism, our pedagogical design leverages hands-on experimentation with guided reflection to concretize complex AI concepts. Students examine rule-based, data-driven, and generative AI systems, employing mathematical reasoning to represent AI decision-making processes and reflect on ethical issues such as fairness, bias, and transparency. The interdisciplinary, constructionist approach encourages learners to discern how AI knowledge construction differs from human cognition, thereby enhancing their ethical reasoning. The findings show that students not only developed a foundational understanding of ethical principles but also gained epistemic insight into AI’s relationship with human knowledge and values. This article provides a practical, theory-informed framework and interdisciplinary teaching resources to advance K-12 AI ethics education and support educators in fostering AI literacy.

Abstract: Students with learning disabilities (LDs) face significant challenges in key academic areas such as reading comprehension, cognitive organization, selfexpression, mathematics, and handwriting. These difficulties increase their susceptibility to discrimination and mental health related issues. Although existing studies have primarily focused on AI’s diagnostic capabilities, there is limited research examining how Generative AI (GenAI) can be utilized to produce measurable learning outcomes and enhance learning experiences for students with LDs. Moreover, GenAI is increasingly gaining prominence in educational settings. Therefore, the relationship between GenAI tools, LDs, and instructional methods needs to be further examined. This research aims to develop a comprehensive theoretical framework for helping design and implement tools specifically tailored to the unique needs of students with LDs. A prototype based on this framework will be implemented in selected educational settings to assess its effectiveness in improving learning outcomes and providing targeted support to students with LDs. The prototype will provide mobile phone integration to ensure scalability and enhance educational accessibility. The expected findings will contribute to the promotion of more inclusive learning environments for students with LDs.

Abstract: To address the limitations of current Largescale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.

Abstract: Recent advances in deep learning have expanded the application of large language models (LLMs) across fields such as medicine, finance, and education. Understanding the mechanisms underlying these models is essential to mitigate issues like hallucinations and bias. This study provides deep learning practitioners with insights into how specific training data points and internal structures influence model behaviour. Using influence functions and mechanistic interpretability, we will analyze the impact of data on model predictions across various tasks. Preliminary findings indicate that semantic search techniques, such as FAISS, enable efficient identification of influential training points in GPT2 small. Future work will extend these methods to additional tasks and more complex models, with a focus on further elucidating LLM structures to improve interpretability.

Abstract: Crossmodal contrastive pre-training between natural language and other modalities, e.g., vision and audio, has demonstrated astonishing performance and effectiveness across a diverse variety of tasks and domains. In this paper, we investigate whether such natural language supervision can be used for wearable sensor based Human Activity Recognition (HAR), and discover that--surprisingly--it performs substantially worse than standard end-to-end training and self-supervision. We identify the primary causes for this as: sensor heterogeneity and the lack of rich, diverse text descriptions of activities. To mitigate their impact, we also develop strategies and assess their effectiveness through an extensive experimental evaluation. These strategies lead to significant increases in activity recognition, bringing performance closer to supervised and self-supervised training, while also enabling the recognition of unseen activities and cross modal retrieval of videos. Overall, our work paves the way for better sensor-language learning, ultimately leading to the development of foundational models for HAR using wearables.

College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China, College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China Department of Computer and Information Science, University of Macao, Macao SAR, College of Computer Science, Sichuan University, China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education of China

Abstract: Nonsemantic features or semantic-agnostic features, which are irrelevant to image context but sensitive to image manipulations, are recognized as evidential to Image Manipulation Localization (IML). Since manual labels are impossible, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize IML model's generalization ability in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: How to adaptively extract non-semantic features? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Then, spare and discrete interactions among image patches are sufficient for extracting non-semantic features. However, image semantics vary drastically on different patches, requiring dense and continuous interactions among image patches for learning semantic representations. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete manner. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features for images. Besides, compared with existing IML models, the sparse self-attention mechanism largely reduced the model size (max 80% in FLOPs), achieving stunning parameter efficiency and computation reduction. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.

Abstract: Deep learning has significantly enhanced survival prediction using whole slide images (WSIs) by adopting a twostage learning paradigm: WSI preparation and patient-level prediction. While existing research generally concentrates on developing advanced patient-level prediction modules, the critical importance of WSI preparation has been largely overlooked. In practice, WSI preparation is influenced by numerous factors, including tissue heterogeneity, sampling strategies, and technical considerations. These uncontrollable external factors incur variability in the number of WSIs among patients, introducing significant bias and resulting in inferior performance for patients with few WSIs. To address this challenge, we propose a novel approach named WSI-Diffusion. Unlike existing WSI generation models that produce augmented versions of input WSIs, our method generates entirely new WSIs in representation space to serve as complementary data. WSIDiffusion employs a two-stage hierarchical diffusion process. Two novel modules, WSI-level and patch-level Diffusers are designed to capture complex correlations between WSIs and patches. The generated WSIs are integrated as supplementary data, and a light patient-level prediction module is then trained for survival prediction. Experimental results across five datasets demonstrate the superiority of our proposal.

Abstract: World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose DriveOccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.

School of Information, Renmin University of China, School of Information, Renmin University of China, School of Information, Renmin University of China, Beijing Institute of Technology, Xiangjiang Laboratory, Central South University, Beijing Institute of Control Engineering, Western Xia Research Institute, Ningxia University,, School of Information, Renmin University of China, School of Information, Renmin University of China, Harvest Fund Management Co., Ltd., Shanghai Algorithm Innovation Research Institute, School of Computer, Qufu Normal University

Abstract: The significance of Temporal Knowledge Graphs (TKGs) in Artificial Intelligence (AI) lies in their capacity to incorporate timedimensional information, support complex reasoning and prediction, optimize decision-making processes, enhance the accuracy of recommendation systems, promote multimodal data integration, and strengthen knowledge management and updates. This provides a robust foundation for various AI applications. To effectively learn and apply both static and dynamic temporal patterns for reasoning, a range of embedding methods and large language models (LLMs) have been proposed in the literature. However, these methods often rely on a single underlying embedding space, whose geometric properties severely limit their ability to model intricate temporal patterns, such as hierarchical and ring structures. To address this limitation, this paper proposes embedding TKGs into projective geometric space and leverages LLMs technology to extract crucial temporal node information, thereby constructing the 5EL model. By embedding TKGs into projective geometric space and utilizing Möbius Group transformations, we effectively model various temporal patterns. Subsequently, LLMs technology is employed to process the trained TKGs. We adopt a parameter-efficient fine-tuning strategy to align LLMs with specific task requirements, thereby enhancing the model's ability to recognize structural information of key nodes in historical chains and enriching the representation of central entities. Experimental results on five advanced TKG datasets demonstrate that our proposed 5EL model significantly outperforms existing models.

Abstract: The early diagnosis of Parkinson’s disease (PD) is crucial for potential patients to receive timely treatment and prevent disease progression. Recent studies have shown that PD is closely linked to impairments in facial muscle control, resulting in characteristic “masked face” symptoms. This discovery offers a novel perspective for PD diagnosis by leveraging facial expression recognition and analysis techniques to capture and quantify these features, thereby distinguishing between PD patients and nonPD individuals based on their facial expressions. However, concerns about data privacy and legal restrictions have led to significant “data silos”, posing challenges to data sharing and limiting the accuracy and generalization of existing diagnostic models due to small, localized datasets. To address this issue, we propose an innovative adaptive federated learning approach that aims to jointly analyze facial expression data from multiple medical institutions while preserving data privacy. Our proposed approach comprehensively evaluates each client's contributions in terms of gradient, data, and learning efficiency, overcoming the non-IID issues caused by varying data sizes or heterogeneity across clients. To demonstrate the real-world impact of our approach, we collected a new facial expression dataset of PD patients in collaboration with a hospital. Extensive experiments validate the effectiveness of our proposed method for PD diagnosis and facial expression recognition, offering a promising avenue for rapid, non-invasive initial screening and advancing healthcare intelligence.

Abstract: Explainable AI is increasingly employing argumentation methods to facilitate interactive explanations between AI agents and human users. While existing approaches typically rely on predetermined human user models, there remains a critical gap in dynamically learning and updating these models during interactions. In this paper, we present a framework that enables AI agents to adapt their understanding of human users through argumentationbased dialogues. Our approach, called Persona, draws on prospect theory and integrates a probability weighting function with a Bayesian belief update mechanism that refines a probability distribution over possible human models based on exchanged arguments. Through empirical evaluations with human users in an applied argumentation setting, we demonstrate that Persona effectively captures evolving human beliefs, facilitates personalized interactions, and outperforms state-of-the-art methods.

Abstract: Large VisionLanguage Models (VLMs), possessing millions or billions of parameters, typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these foundation models on end-user devices, such as in medical clinics and hospitals, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies such as adapters and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy of the clients. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients resulting in sub-optimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.

Abstract: This paper presents a practical problem in dialogue systems: the capability to adapt to changing user intentions and resolve inconsistencies in conversation histories. It is crucial in scenarios like train ticket booking, where travel plans often change dynamically. Notwithstanding the advancements in NLP and large language models (LLMs), these systems struggle with realtime information updates during conversations. We introduce a specialized dataset to evaluate LLM-based chatbots on such conversational adaptability by asking a broad range of open-domain questions, focusing on scenarios where users modify their requests mid-conversation. Additionally, as LLMs are susceptible to generating superfluous sentences, we propose a novel, Chain-of-Thought-free evaluation framework to distill the user intention from their responses. Through extensive investigations on four LLMs, we observe that these contemporary LLMs are not well-aligned with the latest user intent in long-term conversations; they often fail to capture the nuances of natural conversations in a zero-shot setting. Interestingly, the results demonstrate that GPT-4, widely recognized as having the most advanced reasoning capabilities to date, is bested by GPT-3.5 in this task. This work aims to improve the practicality of LLM-based chatbots, bridging the gap between the current capabilities of dialogue systems and the fluidity of human interactions.

Abstract: Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semiautoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capabilities. The framework achieves a lossless speedup ratio ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results outstrip existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to merely two Transformer layers.

Abstract: Large Language Models (LLMs) have demonstrated significant potential across various applications, but their use as AI copilots in complex and specialized tasks is often hindered by AI hallucinations, where models generate outputs that seem plausible but are incorrect. To address this challenge, we develop AutoFEA, an intelligent system that integrates LLMs with Finite Element Analysis (FEA) to automate the generation of FEA input files. Our approach features a novel planning method and a graph convolutional network (GCN)Transformer Link Prediction retrieval model, which enhances the accuracy and reliability of the generated simulations. The AutoFEA system proceeds with key steps: dataset preparation, step-by-step planning, GCN-Transformer Link Prediction retrieval, LLM-driven code generation, and simulation using CalculiX. In this workflow, the GCN-Transformer model predicts and retrieves relevant example codes based on relationships between different steps in the FEA process, guiding the LLM in generating accurate simulation codes. We validate AutoFEA using a specialized dataset of 512 meticulously prepared FEA projects, which provides a robust foundation for training and evaluation. Our results demonstrate that AutoFEA significantly reduces AI hallucinations by grounding LLM outputs in physically accurate simulation data, thereby improving the success rate and accuracy of FEA simulations and paving the way for future advancements in AI-assisted engineering tasks.

Abstract: Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Lowcost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline, which can be used to discriminate AD patients from normal control (NC) individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched label-preserved (LSLP) data generation. Our study presents four novelties: (1) We harness the summarizing capabilities of LLMs to identify and distill key Cognitive-Linguistic (CL) information (atoms) from noisy speech transcripts, effectively filtering irrelevant information. (2) We leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts. (3) We exploit the compositional ability of LLMs to generate LSLP AD speech transcripts consisting of diverse linguistic patterns to overcome the speech data scarcity challenge and enhance the robustness of AD detection models. (4) We use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. The results have shown that DECT, an integrated, LLM-assisted, speech-based AD detection model demonstrates superior model performance with an 11% improvement in AD detection accuracy on the datasets from DementiaBank compared to the baselines.

Abstract: Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the postgeneration analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates the query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs - moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies that measures the increase in output diversity (+12.5 agreement spread) by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.

Abstract: Current multimodal sentiment analysis (MSA) and emotion recognition in conversations (ERC) methods based on pretrained language models exhibit two primary limitations: 1) Once trained for MSA and ERC tasks, these pre-trained language models lose their original generalized capabilities. 2) They demand considerable computational resources. As the size of pre-trained language models continues to grow, training larger multimodal sentiment analysis models using previous approaches could result in unnecessary computational cost. In response to this challenge, we propose Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (only introduces approximately 2.6M to 2.8M trainable parameters upon the 6/7B models), while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin.

The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences, The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences School of Artifcial Intelligence, University of Chinese Academy of Sciences

Abstract: LLM have achieved success in many fields but still troubled by problematic content in the training corpora. LLM unlearning aims at reducing their influence and avoid undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries and the unlearned knowledge resurfaces after the manually designed attack queries. As part of a redteam effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in 55.2% of the questions, even without revealing the unlearned model's parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearned process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where previously trained perturbation vectors are used to enhance unlearned model's robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over 53.5%, cause only less than a 11.6% reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.

Abstract: Autonomous vehicles rely on camerabased perception systems to comprehend their driving environment and make crucial decisions, thereby ensuring vehicles to steer safely. However, a significant threat known as Electromagnetic Signal Injection Attacks (ESIA) can distort the images captured by these cameras, leading to incorrect AI decisions and potentially compromising the safety of autonomous vehicles. Despite the serious implications of ESIA, there is limited understanding of its impacts on the robustness of AI models across various and complex driving scenarios. To address this gap, our research analyzes the performance of different models under ESIA, revealing their vulnerabilities to the attacks. Moreover, due to the challenges in obtaining real-world attack data, we develop a novel ESIA simulation method and generate a simulated attack dataset for different driving scenarios. Our research provides a comprehensive simulation and evaluation framework, aiming to enhance the development of more robust AI models and secure intelligent systems, ultimately contributing to the advancement of safer and more reliable technology across various fields.

Abstract: Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for nonEnglish languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.

Abstract: Ensuring street food safety in developing countries is crucial due to the high prevalence of foodborne illnesses. Traditional methods of food safety assessments face challenges such as resource constraints, logistical issues, and subjective biases influenced by surveyors personal lived experiences, particularly when interacting with local communities. For instance, a local food safety inspector may inadvertently overrate the quality of infrastructure due to prior familiarity or past purchases, thereby compromising objective assessment. This subjectivity highlights the necessity for technologies that reduce human biases and enhance the accuracy of survey data across various domains. This paper proposes a novel approach based on a combination of Computer Vision and a lightweight Visual Large Language Model (VLLM) to automate the detection and analysis of critical food safety infrastructure in street food vendor environments at a field experiment in Kolkata, India. The system utilises a threestage object extraction pipeline from the video to identify, extract and select unique representations of critical elements such as hand-washing stations, dishwashing areas, garbage bins, and water tanks. These four infrastructure items are crucial for maintaining safe food practices, irrespective of the specific methods employed by the vendors. A VLLM then analyses the extracted representations to assess compliance with food safety standards. Notably, over half of the pipeline can be processed using a user's smartphone, significantly reducing government server workload. By leveraging this decentralised approach, the proposed system decreases the analysis cost by many orders of magnitude compared to alternatives like ChatGPT or Claude 3.5. Additionally, processing data on local government servers provides better privacy and security than cloud platforms, addressing critical ethical considerations. This automated approach significantly improves efficiency, consistency, and scalability, providing a robust solution to enhance public health outcomes in developing regions.

Abstract: Health outcomes depend on complex environmental and sociodemographic factors whose effects change over location and time. Only recently has finegrained spatial and temporal data become available to study these effects, namely the MEDSAT dataset of English health, environmental, and sociodemographic information. Leveraging this new resource, we use a variety of variable importance techniques to robustly identify the most informative predictors across multiple health outcomes. We then develop an interpretable machine learning framework based on Generalized Additive Models (GAMs) and Multiscale Geographically Weighted Regression (MGWR) to analyze both local and global spatial dependencies of each variable on various health outcomes. Our findings identify NO2 as a global predictor for asthma, hypertension, and anxiety, alongside other outcome-specific predictors related to occupation, marriage, and vegetation. Regional analyses reveal local variations with air pollution and solar radiation, with notable shifts during COVID. This comprehensive approach provides actionable insights for addressing health disparities, and advocates for the integration of interpretable machine learning in public health.

Abstract: Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is timeconsuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT); a visual query-based video clip localization (VQ-VCL) method to assist sonographers by enabling them to capture a quick US sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mtIoU on the ultrasound datasets and by 5.35% mtIoU on the Ego4D dataset, using 96% fewer tokens. MCAT’s efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US based screening, diagnosis and allowing sonographers to examine more patients.

Abstract: This paper presents a vision for creating AI systems that are inclusive at every stage of development, from data collection to model design and evaluation. We address key limitations in the current AI pipeline and its WEIRD* representation, such as lack of data diversity, biases in model performance, and narrow evaluation metrics. We also focus on the need for diverse representation among the developers of these systems, as well as incentives that are not skewed toward certain groups. We highlight opportunities to develop AI systems that are for everyone (with diverse stakeholders in mind), with everyone (inclusive of diverse data and annotators), and by everyone (designed and developed by a globally diverse workforce). *WEIRD = an acronym coined by Joseph Henrich to highlight the coverage limitations of many psychological studies, referring to populations that are Western, Educated, Industrialized, Rich, and Democratic; while we do not fully adopt this term for AI, as its current scope does not perfectly align with the WEIRD dimensions, we believe that today's AI has a similarly "weird" coverage, particularly in terms of who is involved in its development and who benefits from it.

Abstract: In recent years, Large Language Model (LLM) agents, exemplified by models like ChatGPT, and PaLM, have showcased remarkable prowess in various tasks, owing to their vast number of parameters and emergent incontext learning capabilities. People expect the wide usage of LLM serving at edge hardware, personal devices, and organization/enterprise IT infrastructures to revolutionize global access to information, communication, automation, and creativity. However, due to the extreme large-scale LLM parameters (LLaMA 3.1 contains 405 billion of 2 or 4 bytes floating point numbers), the LLM serving is facing significant sustainability pressure due to its requirements on the latest high-embodied carbon hardware (e.g., GPUs, HBMs, memory, storage, and network hardware) and the high operational carbon emissions, leading to a significant and alarming increase in carbon emissions and a high barrier to their widespread deployments and practical applications in various scenarios. Companies, organizations, and institutes usually have the complete general-purpose IT infrastructure, which consists of a large amount of computing, memory, storage, and network hardware. Although these general-purpose IT infrastructures are far more than enough for existing application executions, deploying and executing the LLM for a broad spectrum of serving platforms can be challenging and difficult due to resource limitations. Purchasing the latest hardware including GPUs (e.g., Nvidia H100 or H200) will lead to considerable issues including 1) serious embodied carbon emissions during the new hardware production, 2) no explicitly lower operational carbon emissions with essential modeling and optimizations, 3) high economic and financial pressures, and 4) potentially tremendous existing hardware resource wasting. Therefore, it is a trend and becomes a must to explore how to use the existing hardware, especially outdated hardware, to collectively improve both environmental sustainability, efficiency, and reliability for LLM serving. A few pioneering examples include Microsoft’s Project Natick, Google’s TPU Pod Optimization, Alibaba’s Cloud Server Repurposing, and Facebook’s Network Hardware Reuse. In this talk, I will traverse my series of contributions with promising new directions, particularly emphasizing modularized LLM architecture (Part 1), in-storage sustainable computing (Part 2), and reliable serving against software and hardware attacks (Part 3).

Abstract: Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both timeconsuming and labor-intensive. This paper introduces an innovative Virtual AI Teacher system designed to autonomously analyze and correct student Errors (VATE). Leveraging advanced large language models (LLMs) like GPT-4, the system uses student drafts as a primary source for error analysis, which enhances understanding of the student's learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed in Squirrel AI's learning platform for elementary mathematics education, where it achieves 78.3% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system's potential to transform educational practices.

Abstract: In the rapidly evolving landscape of site reliability engineering (SRE), the demand for efficient and effective solutions to manage and resolve issues in site and cloud applications is paramount. This paper presents an innovative approach to action automation using large language models (LLMs) for script generation, assessment, and refinement. By leveraging the capabilities of LLMs, we aim to significantly reduce the human effort involved in writing and debugging scripts, thereby enhancing the productivity of SRE teams. Our experiments focus on Bash scripts, a commonly used tool in SRE, and involve the CodeSift dataset of 100 tasks and the InterCode dataset of 153 tasks. The results show that LLMs can automatically assess and refine scripts efficiently, reducing the need for script validation in an execution environment. Results demonstrate that the framework shows an overall improvement of 710% in script generation.

Abstract: Artificial Intelligence (AI) literacy is increasingly important across many fields, yet caregivers remain underrepresented in AIrelated fields due to a combination of systemic and individual barriers. To address this, the Caregivers and Machine Learning (C&ML) program developed and delivered an accessible AI education program to caregivers on parental leave. Two cohorts participated in this 6-week interprofessional program, featuring fundamental machine learning concepts, hands-on programming assignments, and a capstone project. This study examines the program's impact on participants, focusing on their motivations and barriers before, during, and after the program as outcomes after completion. Post-program surveys and semi-structured interviews highlight that caregivers often face barriers such as the rapid pace of AI, discrimination, and balancing caregiving responsibilities with learning new skills. The C&ML program's flexible structure and personalized support network were critical in enabling participants to fully engage in the program, leading to significant improvements in their knowledge of ML and increased confidence in applying these skills. After completing the program, 20\% of participants transitioned into AI-related roles or pursued further education. This research highlights the value of targeted, inclusive educational programs for underrepresented groups and provides practical recommendations for refining future AI training programs for caregivers.

Abstract: With the rise of Artificial Intelligence (AI) systems in society, our children have routine interactions with these technologies. It has become increasingly important for them to understand how these technologies are trained, what their limitations are and how they work. To introduce children to AI and Machine Learning (ML) concepts, recent efforts introduce tools that integrate ML concepts with physical computing and robotics. However, some of these tools cannot be easily integrated into building projects and the high price of robotics kits can be a limiting factor to many schools. We address these limitations by offering a lowcost hardware and software toolkit that we call the Smart Motor to introduce supervised machine learning to elementary school students. Our Smart Motor uses the nearest neighbor algorithm and utilizes visualizations to highlight the underlying decision-making of the model. We conducted a one week long study using Smart Motors with 9- to 12- year old students and measured their learning through observation, questioning and examining what they built. We found that students were able to integrate the Smart Motors into their building projects but some students struggled with understanding how the underlying model functioned. In this paper we discuss these findings and insights for future directions for the Smart Motor.

Abstract: In the fastgrowing field of K–12 AI education, there is an urgent need for accessible, hands-on tools that introduce AI concepts and workflows to novice learners. In recent years, a variety of AI education tools have been introduced, ranging from coding environments to physical kits and robots. To provide an alternative to existing AI education tools, this paper presents a low-cost robotics kit (<50€) designed to teach modern ML concepts through a no-code approach. The kit is grounded in maker pedagogy and designed for easy customizability to different materials commonly found in classrooms, like cardboard, wood, metal, and plastic builder kits without the need for specialized tools. For programming the robot’s actions, the kit features an all-in-one development studio that is compatible with most phone, laptop, and tablet platforms and can operate with or without an Internet connection, making it applicable to a wide range of educational contexts, including ICT4D.

Abstract: Editbased approaches for Grammatical Error Correction (GEC) have attracted volume attention due to their outstanding explanations of the correction process and rapid inference. Through exploring the characteristics of the generalized and specific knowledge learning for GEC, we discover that efficiently training GEC systems with satisfactory generalization capacity prefers more generalized knowledge rather than specific knowledge. Current gradient-based methods for training GEC systems, however, usually prioritize minimizing training loss over generalization loss. This paper proposes the strategy of Adjusting Learning Rate Based on Mermory Rate to optimize the edit-based GEC scorer (ALRMR-GEC). Specifically, we introduce the memory rate, a novel metric, to provide an explicit indicator for the model’s state of learning generalized and specific knowledge, which can effectively guide the GEC system to adjust the learning rate timely. Extensive experiments, conducted by optimizing the published edit scorer on the BEA2019 dataset, have shown our ALRMR-GEC significantly enhances the model generalization ability with stable and satisfactory performance nearly irrespective of the initial learning rate selection. Also, our method can accelerate the training over tenfold faster in certain cases. Finally, the experiments indicate the memory rate introduced in our ALRMR-GEC guides the GEC editscorer to learn more generalized knowledge.

Laboratory for Big Data and Decision, Nation University of Defense Technology,, School of Computer Science and Technology, Xi’an Jiaotong University, Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University. Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University., School of Computer Science and Technology, Xi’an Jiaotong University, Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University. Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University., School of Computer Science and Technology, Xi’an Jiaotong University, Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University. Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University., Laboratory for Big Data and Decision, Nation University of Defense Technology, Laboratory for Big Data and Decision, Nation University of Defense Technology, Laboratory for Big Data and Decision, Nation University of Defense Technology

Abstract: Social platforms, while facilitating access to information, have also become saturated with a plethora of fake news, resulting in negative consequences. Automatic multimodal fake news detection is a worthwhile pursuit. Existing multimodal fake news datasets only provide binary labels of real or fake. However, real news is alike, while each fake news is fake in its own way. These datasets fail to reflect the mixed nature of various types of multimodal fake news. To bridge the gap, we construct an attributing multigranularity multimodal fake news detection dataset AMG, revealing the inherent fake pattern. Furthermore, we propose a multi-granularity clue alignment model MGCA to achieve multimodal fake news detection and attribution. Experimental results demonstrate that AMG is a challenging dataset, and its attribution setting opens up new avenues for future research.

Abstract: Stochastic Block Models (SBMs) are a popular approach to modeling single realworld graphs. The key idea of SBMs is to partition the vertices of the graph into blocks with similar edge densities within, as well as between different blocks. However, what if we are given not one but multiple graphs that are unaligned and of different sizes? How can we find out if these graphs share blocks with similar connectivity structures? In this paper, we propose the shared stochastic block modeling (SSBM) problem, in which we model n graphs using SBMs that share parameters of s blocks. We show that fitting an SSBM is NP-hard, and consider two approaches to fit good models in practice. In the first, we directly maximize the likelihood of the shared model using a Markov chain Monte Carlo algorithm. In the second, we first fit an SBM for each graph and then select which blocks to share. We propose an integer linear program to find the optimal shared blocks and to scale to large numbers of blocks, we propose a fast greedy algorithm. Through extensive empirical evaluation on synthetic and real-world data, we show that our methods work well in practice.

Abstract: Graph neural networks (GNNs) have emerged as an effective tool for fraud detection, identifying fraudulent users, and uncovering malicious behaviors. However, attacks against GNNbased fraud detectors and their risks have rarely been studied, thereby leaving potential threats unaddressed. Recent findings suggest that frauds are increasingly organized as gangs or groups. In this work, we design attack scenarios where fraud gangs aim to make their fraud nodes misclassified as benign by camouflaging their illicit activities in collusion. Based on these scenarios, we study adversarial attacks against GNN-based fraud detectors by simulating attacks of fraud gangs in three real-world fraud cases: spam reviews, fake news, and medical insurance frauds. We define these attacks as multi-target graph injection attacks and propose MonTi, a transformer-based Multi-target one-Time graph injection attack model. MonTi simultaneously generates attributes and edges of all attack nodes with a transformer encoder, capturing interdependencies between attributes and edges more effectively than most existing graph injection attack methods that generate these elements sequentially. Additionally, MonTi adaptively allocates the degree budget for each attack node to explore diverse injection structures involving target, candidate, and attack nodes, unlike existing methods that fix the degree budget across all attack nodes. Experiments show that MonTi outperforms the state-of-the-art graph injection attack methods on five real-world graphs.

Abstract: Decision trees are a popular machine learning method, valued for their inherent explainability. In Explainable AI, decision trees serve as surrogate models for complex black box AI models or as approximations of parts of such models. A key challenge of this approach is assessing how accurately the extracted decision tree represents the original model and determining the extent to which it can be trusted as an approximation of its behaviour. In this work, we investigate the use of the Probably Approximately Correct (PAC) framework to provide a theoretical guarantee of fidelity for decision trees extracted from AI models. Leveraging the theoretical foundations of the PAC framework, we adapt a decision tree algorithm to ensure a PAC guarantee under specific conditions. We focus on binary classification and conduct experiments where we extract decision trees from BERTbased language models with PAC guarantees. Our results indicate occupational gender bias in these models, which confirm previous results in the literature. Additionally, the decision tree format enhances the visualization of which occupations are most impacted by social bias.

Abstract: RefusalAware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses of unknown questions in the training data to refusal responses such as ''I don't know", RAIT enhances the reliability of LLMs and reduces their hallucination. Generally, RAIT modifies training samples based on the correctness of the initial LLM's response. However, this crude approach can cause LLMs to excessively refuse answering questions they could have correctly answered, the problem we call over-refusal. In this paper, we explore two primary causes of over-refusal: Static conflict occurs when similar samples within the LLM’s feature space receive differing supervision signals (original vs. modified ''I don't know"). Dynamic conflict arises as the LLM's evolving knowledge during SFT enables it to answer previously unanswerable questions, but the now-answerable training samples still retain the original ''I don't know" supervision signals from the initial LLM state, leading to inconsistencies. These conflicts cause the trained LLM to misclassify known questions as unknown, resulting in over-refusal. To address this issue, we introduce Certainty Represented Knowledge Flow for Refusal-Aware Instructions Tuning (CRaFT). CRaFT centers on two main contributions: First, we additionally incorporate response certainty to selectively filter and modify data, reducing static conflicts. Second, we implement preliminary rehearsal training to characterize changes in the LLM's knowledge state, which helps mitigate dynamic conflicts during the fine-tuning process. We conducted extensive experiments on open-ended question answering and multiple-choice question task. Experiment results show that CRaFT can improve LLM's overall performance during the RAIT process.

Abstract: Recently it has been proven that simple GP systems can efficiently evolve a conjunction of n variables if they are equipped with the minimal required components. In this paper, we make a considerable step forward by analysing the behaviour and performance of a GP system for evolving a Boolean conjunction or disjunction of n variables using a complete function set that allows the expression of any Boolean function of up to n variables. First we rigorously prove that a GP system using the complete truth table to evaluate the program quality, and equipped with both the AND and OR operators and positive literals, evolves the exact target function in O(\ell n log^2 n) iterations in expectation, where\ell ≥ n is a limit on the size of any accepted tree. Additionally, we show that when a polynomial sample of possible inputs is used to evaluate the solution quality, conjunctions or disjunctions with any polynomially small generalisation error can be evolved with probability 1 − O(log^2(n)/n). The latter result also holds if GP uses AND, OR and positive and negated literals, thus has the power to express any Boolean function of n distinct variables. To prove our results we introduce a supermultiplicative drift theorem that gives significantly stronger runtime bounds when the expected progress is only slightly superlinear in the distance from the optimum.

Abstract: Graph Contrastive Learning (GCL) aims to selfsupervised learn low-dimensional graph representations, primarily through instance discrimination, which involves manually mining positive and negative pairs from graphs, increasing the similarity of positive pairs while decreasing negative pairs. Drawing from the success of Contrastive Learning (CL) in other domains, a consensus has been reached that the effectiveness of GCLs depends on a large number of negative pairs. As a result, despite the significant computational overhead, GCLs typically leverage as many negative node pairs as possible to improve model performance. However, given that nodes within a graph are interconnected, we argue that nodes cannot be treated as independent instances. Therefore, we challenge this consensus: Does employing more negative nodes lead to a more effective GCL model? To answer this, we explore the role of negative nodes in the commonly used InfoNCE loss for GCL and observe that: (1) Counterintuitively, a large number of negative nodes can actually hinder the model's ability to distinguish between nodes with different semantics. (2) A smaller number of high-quality and non-topologically coupled negative nodes are sufficient to enhance the discriminability of representations. Based on these findings, we propose a new method called GCL with Effective and Efficient Negative samples, E2Neg, which learns discriminative representations using only a very small set of representative negative samples. E2Neg significantly reduces computational overhead and speeds up model training. We demonstrate the effectiveness and efficiency of E2Neg across multiple datasets compared to other GCL methods.

Abstract: Alzheimer’s Disease (AD) affects over 55 million people globally, yet the key genetic contributors remain poorly understood. Leveraging recent advancements in genomic foundation models, we present the innovative ReverseGene-Finder technology, a ground-breaking neuron-to-gene-token backtracking approach in a neural network architecture to elucidate the novel causal genetic biomarkers driving AD onset. Reverse-Gene-Finder comprises three key innovations. Firstly, we exploit the observation that genes with the highest probability of causing AD, defined as the most causal genes (MCGs), must have the highest probability of activating those neurons with the highest probability of causing AD, defined as the most causal neurons (MCNs). Secondly, we utilize a gene token representation at the input layer to allow each gene (known or novel to AD) to be represented as a discrete and unique entity in the input space. Lastly, in contrast to the existing neural network architectures, which track neuron activations from the input layer to the output layer in a feed-forward manner, we develop an innovative backtracking method to track backwards from the MCNs to the input layer, identifying the Most Causal Tokens (MCTs) and the corresponding MCGs. Reverse-Gene-Finder is highly interpretable, generalizable, and adaptable, providing a promising avenue for application in other disease scenarios.

Abstract: According to an industry survey, many people miss opportunities to apply for government subsidy programs because they do not know how to apply. People also need to search manually and check whether these programs are suitable for them. To address this issue, our study develops a new generative recommender system with both users’ information and government subsidy documents. Within our recommender system framework, we modify the existing Residual Quantization Variational AutoEncoder (RQ-VAE) model to capture deep and abstract information from subsidy documents. Using semantic IDs generated for approximately 185,610 user click-stream histories and 240,000 documents, we train our recommender system to predict the semantic IDs of the next subsidy policy documents in which a user might be interested. In 2024, we successfully deploy our generative recommender system in Wello, a Korean Gov-Tech startup. In collaboration with the Korean government, our generative recommender system could save 7.8 million dollar, that might otherwise have gone unused due to a lack of applications. Also, Wello observed a 68% improvement in Click-Through Ratio (CTR), increasing from 41.4% in the third quarter of 2024 to 69.6% in the fourth quarter of 2024. We thus anticipate that our generative recommender system will have a significant impact on both individuals and the government.

Abstract: As artificial intelligence (AI) becomes increasingly central to various fields, there is a growing need to equip K12 students with AI literacy skills that extend beyond computer science. This paper explores the integration of a Project-Based Learning (PBL) AI toolkit into diverse subject areas, aimed at helping educators teach AI concepts more effectively. Through interviews and co-design sessions with K-12 teachers, we examined current AI literacy levels and how teachers adapt AI tools like the AI Art Lab, AI Music Studio, and AI Chatbot into their course designs. While teachers appreciated the potential of AI tools to foster creativity and critical thinking, they also expressed concerns about the accuracy, trustworthiness, and ethical implications of AI-generated content. Our findings reveal the challenges teachers face, including limited resources, varying student and instructor skill levels, and the need for scalable, adaptable AI tools. This research contributes insights that can inform the development of AI curricula tailored to diverse educational contexts.

Abstract: The definition of words is the fundamental and crucial linguistic concept. Any changes in word definition lead to changes in the theoretical system of the respective language. Traditionally, researchers in Natural Language Processing (NLP) for Vietnamese texts believe Vietnamese words are constructed from syllables. However, their works did not explicitly mention which linguistic theory they followed for this assumption. Although there are no theoretical guarantees, most NLP studies in Vietnamese accept this assumption. Consequently, word segmentation is recognized as one of the essential stages in NLP for Vietnamese texts. In this study, we address the role of word segmentation for Vietnamese texts from linguistic perspectives. Through our extensive experiments, we show that, based on linguistic theories, performing word segmentation is not appropriate for Vietnamese text understanding. Moreover, we present a novel method, Vietnamese Word TransFormer (ViWordFormer), for modeling Vietnamese word formation. Experimental results indicate that our method is appropriate for modeling Vietnamese word formation from both theoretical and experimental aspects and embark on a novel approach to Vietnamese word representation.

Abstract: Part of a university initiative supporting responsible AI for social empowerment and education, the projectbased RAICA (Responsible AI for Computational Action) curriculum supports middle/high school learners and novice AI literacy teachers use AI creatively for good. This paper offers a rare example of design-based implementation research (DBIR) in AI education across widely varied contexts, provides fine grain implementation data that contributes to a foundation for evaluating effectiveness and expanding access. We present a novel approach to analyzing fidelity of implementation data from RAICA’s computer vision module beta-test. Twelve educators working with ~282 students across nine pilot sites in four countries used a bespoke fidelity of implementation data collection tool (pre-made comment prompts in a Google Docs version of the teacher guide) to provide 236 qualitative responses about AI literacy and responsible design activities, plus 111 ordinal ratings of embedded teacher supports. Analyses revealed that while the curriculum was generally implemented as designed, educators frequently made modifications. Although most changes produced practical insights for improved curriculum design, others helped the design team anticipate and prevent changes that could obscure learning objectives and hinder outcomes. We discuss the pedagogical, design, and research implications of these findings for effective AI teaching/learning in diverse settings.

Abstract: While there is widespread interest in supporting young people to critically evaluate machine learningpowered systems, there is little research on how we can support them in inquiring about how these systems work and what their limitations and implications may be. Outside of K-12 education, an effective strategy in evaluating black-boxed systems is algorithm auditing—a method for understanding algorithmic systems’ opaque inner workings and external impacts from the outside in. In this paper, we review how expert researchers conduct algorithm audits and how end users engage in auditing practices to propose five steps that, when incorporated into learning activities, can support young people in auditing algorithms. We present a case study of a team of teenagers engaging with each step during an out-of-school workshop in which they audited peer-designed generative AI TikTok filters. We discuss the kind of scaffolds we provided to support youth in algorithm auditing and directions and challenges for integrating algorithm auditing into classroom activities. This paper contributes: (a) a conceptualization of five steps to scaffold algorithm auditing learning activities, and (b) examples of how youth engaged with each step during our pilot study.

School of Computer Science & Technology, Beijing Jiaotong University, Beijing, 100044, China Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, 100044, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, 100044, China Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, 100044, China, University of Science and Technology Beijing, Beijing, 100083, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, 100044, China MoE Key Lab of Big Data & Artificial Intelligence in Transportation, Beijing Jiaotong University, Beijing 100044, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, 100044, China MoE Key Lab of Big Data & Artificial Intelligence in Transportation, Beijing Jiaotong University, Beijing 100044, China, School of Computer Science & Technology, Beijing Jiaotong University, Beijing, 100044, China Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, 100044, China

Abstract: Knowledge Graph Embedding (KGE) methods have achieved great success in predicting missing links in knowledge graphs, a task also known as Knowledge Graph Completion (KGC). Under this task, the Reciprocal Rank (RR) of groundtruth items serve as a key indicator for evaluating the method’s performance. However, most existing studies have overlooked the inconsistency between the ranking metric, RR, and the optimization objective functions, resulting in sub-optimal KGC performance. To address this issue, we propose a KGC framework called KGCRR by designing a novel upper bound function named CRR. By introducing the parameter-pressure ρ to shift the sigmoid function, CRR achieves a better approximation to RR compared with existing objective functions. We theoretically proved that by adjusting ρ, CRR can achieve a more effective approximation to RR. By narrowing the discrepancy with RR and alleviating the gradient vanishing issue associated with the direct optimization of RR loss, CRR demonstrates an advantage in optimizing RR. CRR serves as a plug-and-play objective, capable of seamless integration into various KGE methods. Through extensive experiments conducted on FB15k-237 and WN18RR datasets, we have obtained promising results, with an average improvement of 19.06% in MRR, indicating that CRR significantly enhances the performance of existing methods.

Abstract: Masked autoencoders employ random masking to effectively reconstruct input images using selfsupervised techniques, which allows for efficient training on large datasets. However, the random masking strategy does not adequately tap into information encapsulated within high-dimensional hyperspectral satellite imagery that is used in several domains. We propose a novel masking strategy, HOGMAE, based on the Histogram of Oriented Gradients that incorporates rich information inherent within satellite images during the mask creation step. Our experiments, over a hyperspectral satellite dataset, demonstrate the effectiveness of our methodology.

Abstract: Recent Multimodal Large Language Models(MLLMs) often use a large number of visual tokens to compensate their visual shortcoming, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graphbased method towards training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes, and construct their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be front or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks. The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57% FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.

Abstract: The prominence of artificial intelligence and machine learning in everyday life has led to efforts to foster AI literacy for all K–12 students. In this paper, we review how Hour of Code activities engage with the five big ideas of AI, in particular with machine learning and societal impact. We found that a large majority of activities focus on perception and machine learning, with little attention paid to representation and other topics. A surprising finding was the increased attention paid to critical aspects of computing. However, we also observed a limited engagement with handson activities. In the discussion, we address how future introductory activities could be designed to offer a broader array of topics, including the development of tools to introduce novices to artificial intelligence and machine learning and the design of more unplugged and collaborative activities.

Abstract: Assessing social cognition in adolescents by understanding emotional perception is crucial, especially through tasks like the Reading the Mind in the Eyes Test (RMET). This ongoing research investigates the emotional perception skills of Nigerian high school girls through the Reading the Mind in the Eyes Test (RMET). Preliminary analysis looks at how age and class level (SS1 and SS2) affect RMET scores, with 20% of the data (n = 215) already gathered. ANOVA findings display a notable distinction among class levels (p = 0.024), while regression analysis suggests that age effectively predicts RMET scores (β = 1.06, p = 0.037), showcasing that older students achieve higher scores. These preliminary results indicate that age is a significant factor in how emotions are perceived, and more data is being collected and analyzed to gain deeper understanding. These findings can guide strategies to enhance social skills in education and improve AI models for emotion recognition in diverse and agesensitive contexts.